You may have heard that artificial neural networks are "universal function approximators," and that this is what lets them do so many different things, enabling diverse technologies from ChatGPT to self-driving cars. You may have a good intuition for what universal function approximation means as well: there exists some setting of a neural network’s parameters that fits arbitrarily closely to any curve. For example, there are neural networks that can map inputs x to outputs y like any of the continuous functions below:
In this post, we’ll provide a visual explanation for why neural networks are universal function approximators.
We’ll focus on the case where the neural network takes 1-dimensional input and returns 1-dimensional output, but a similar argument can be used for multiple dimensions.
Linear Splines
We’ll show that a neural network is a ‘linear spline’, which is just a sequence of line segments, connected end to end. Neural networks can be fit to any curve because these linear splines can be fit to any curve. The fit of a linear spline gets better the more line segments it has.
Here’s an easy way to fit a curve with line segments:
- Place a number of points on the curve you want to approximate, evenly spaced along the x-axis.
- Then connect each of the adjacent points with line segments.
That’s it! These line segments, taken together, form a linear spline: they trace out a new curve. Observe that as the number of points (and line segments between them) increases, the linear spline approximates the original curve better and better.
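If you want to play with this yourself, here’s a minimal sketch of the procedure in Python (the target curve and the number of points are just placeholders for illustration; numpy’s interp does the "connect adjacent points" step for us):

```python
import numpy as np

def linear_spline(f, x_min, x_max, n_points):
    """Approximate f with a linear spline through n_points evenly spaced samples."""
    knots_x = np.linspace(x_min, x_max, n_points)  # points evenly spaced along the x-axis
    knots_y = f(knots_x)                           # their heights on the curve
    # np.interp connects adjacent (x, y) points with straight line segments
    return lambda x: np.interp(x, knots_x, knots_y)

# More segments -> a better fit
f = np.sin
coarse = linear_spline(f, -np.pi, np.pi, 6)   # 5 segments
fine = linear_spline(f, -np.pi, np.pi, 51)    # 50 segments
xs = np.linspace(-np.pi, np.pi, 1000)
print(np.abs(coarse(xs) - f(xs)).max())  # larger worst-case error
print(np.abs(fine(xs) - f(xs)).max())    # smaller worst-case error
```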
A Neural Network with One Hidden Layer
Hopefully it’s clear how a linear spline can approximate any function. Now we’ll show that there is a neural network equivalent to any linear spline.
Let’s consider a neural network with 1 hidden layer of 3 neurons. It maps an input, x, to a ‘hidden state’ of 3 numbers, h = [h1, h2, h3], to a final output, y. It would typically be diagrammed something like this:
I’ve labeled each neuron (o) and connection (—) with its associated weight (w) or bias (b). These are tunable parameters that affect the computation the network performs. A neuron outputs a weighted sum of the inputs to its left, plus a constant bias. For example, the output neuron in the above network computes the following:
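y = w4*h1 + w5*h2 + w6*h3 + b4

(Here I’m writing the output neuron’s three incoming weights as w4, w5, w6 and its bias as b4, a labeling chosen to match the w5 and w6 that show up later in this post; the diagram may name them differently.)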
A dashed circle, like those in the hidden layer, denotes that the neuron applies a ‘relu’ (rectified linear unit) to its output. This means that if the neuron would output a value less than zero, it ‘turns off’ and outputs zero instead. For example, the h1 neuron in our network computes:
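h1 = relu(w1*x + b1) = max(0, w1*x + b1)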
Graphically, the relu keeps the part of the line with slope w and intercept b that lies above the x-axis, and flattens everything below it to zero.
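Putting these pieces together, here’s a minimal sketch of the whole 3-neuron network in Python (the parameter names follow the labeling above; this is just one way to write it down):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def network(x, w, b, v, b_out):
    """1 input -> 3 relu hidden neurons -> 1 output.

    w, b  : first-layer weights [w1, w2, w3] and biases [b1, b2, b3]
    v     : output weights [w4, w5, w6]
    b_out : the output neuron's bias (b4 above)
    """
    h = relu(np.asarray(w) * x + np.asarray(b))  # hidden state [h1, h2, h3]
    return float(np.dot(v, h) + b_out)           # weighted sum plus bias
```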
A Neural Network as a Linear Spline
Universal function approximation means there is some setting of weights and biases such that our network with 3 hidden neurons can match any linear spline with 3 line segments. To see how this works concretely, let’s match our neural network to the following 3-segment spline:
For our network to match this spline, it needs to obey the following equations:
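In other words, the network’s output needs to equal y = .5x + .5 on the first segment (-3 < x < -1), y = x + 1 on the second (-1 < x < 1), and y = -2x + 4 on the third (x > 1).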
Set Biases to Match the Start of Each Segment
It turns out there are infinitely many parameter settings of a neural network with 3 hidden neurons that will match this (or any) 3-segment spline, so here we’ll focus on one particular solution. We’ll start by setting the first-layer weights, [w1, w2, w3], to 1; even with this constraint our network is still flexible enough to match any 3-segment spline.
With this constraint, we can think of each hidden neuron’s bias as uniquely determining when that neuron "turns on." If wi = 1, then hi = relu(x + bi), which rises above 0 (turns on) once x passes -bi.
We want each hidden neuron to "turn on" at the start of each line segment in our spline, which will allow the network to start behaving like different functions at these points. The first, second, and third segments start at -3, -1, and 1 respectively, so we can set the hidden neuron biases to [b1, b2, b3] = [3, 1, -1].
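With wi = 1, this gives h1 = relu(x + 3), h2 = relu(x + 1), and h3 = relu(x - 1), which turn on at x = -3, -1, and 1 respectively, exactly where the three segments begin.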
Fit the First Segment
Here’s the critical part: observe that with these biases, when x < -1, both h2 and h3 are fixed at 0. Over this range the overall network obeys the equation:
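y = w4*h1 + b4 = w4*(x + 3) + b4

(using h1 = x + 3, which holds once h1 has turned on at x = -3)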
-3 < x < -1 corresponds to the first spline segment, fit by the function y = .5x + .5, so we know that our network fits the spline when:
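w4*(x + 3) + b4 = .5x + .5

Matching slopes gives w4 = .5, and matching intercepts (.5*3 + b4 = .5) gives b4 = -1.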
Another way to think about this: the function for h1 needs to be moved down by 1 and have its slope reduced by .5 to match the function for y on the first line segment.
Fitting Additional Segments
At this point there are 2 parameters left to fit, w5 and w6, which must make the network match the remaining two line segments. The procedure for finding these weights is the same for both (and for any number of additional segments).
Observe that at the start of the second segment, h2 "turns on," so our network equation changes: what was y = .5x + .5 on the first segment becomes:
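y = .5*(x + 3) - 1 + w5*(x + 1) = (.5 + w5)*x + (.5 + w5)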
The second line segment obeys the equation y = x + 1, so we want:
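(.5 + w5)*x + (.5 + w5) = x + 1

which holds when w5 = .5.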
The same logic applies to finding the last parameter w6, which must change the equation y = x + 1 to y = -2x + 4:
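x + 1 + w6*(x - 1) = -2x + 4

Matching slopes gives 1 + w6 = -2, so w6 = -3 (and the intercepts check out: 1 + 3 = 4).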
A simple way to think about this is to recognize that each additional neuron simply changes the slope of the line. In our spline the 3 line segments have slopes .5, 1, and -2. When each additional neuron turns on, its output weight is added to the overall slope: .5, then .5 + .5 = 1, then 1 + (-3) = -2.
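As a sanity check, here’s a small Python sketch that plugs the parameters derived above into the network and compares it to the target spline (the helper names are just for illustration):

```python
import numpy as np

def fitted_network(x):
    # Parameters derived above: w1..w3 = 1, [b1, b2, b3] = [3, 1, -1],
    # [w4, w5, w6] = [.5, .5, -3], output bias b4 = -1.
    h = np.maximum(0.0, x + np.array([3.0, 1.0, -1.0]))  # hidden state [h1, h2, h3]
    return float(np.dot([0.5, 0.5, -3.0], h) - 1.0)

def target_spline(x):
    if x <= -1:
        return 0.5 * x + 0.5
    if x <= 1:
        return x + 1.0
    return -2.0 * x + 4.0

for x in np.linspace(-3, 3, 7):
    print(x, fitted_network(x), target_spline(x))  # the two columns agree across [-3, 3]
```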
Graphically the progression looks like this:
Conclusion
Hopefully it’s clear from this example why neural networks can be fit to any continuous curve. In our example we had 3 hidden neurons, which allowed us to draw 3 line segments; with more hidden neurons we could add more segments, updating the slope of our line each time.