Weighted least-square regression is a method used to consider some data points more strongly when generating coefficients. When writing about non-linear least-square regression, I ran into some problems for ordinary least-square regression. One solution to this problem is to use a weighting term, and the problem function and solution will be explored in this article.
Let us start with the function:
In this form we can apply a technique discussed in an earlier article and use a modified linear regression equation:
So far there is nothing unusual about this equation. Let us try this on a clean set of data:
The graph shows the known data points as dots, and the solid regression line. This is what we would expect to see.
Now let us apply regression to a set of data with small amounts of noise on the signal.
What happened? The introduced noise barely effects the data points, but clearly the coefficients are not correct for the regression line. The deviation from the original data points gets worse and worse toward the right side.
This is one of the problems of finding the coefficients for the linearized version of the function. No account is taken of the non-linearity of the data. So what we end up with are coefficients that minimize the error for the linear form, which does not always translate well back to the non-linear form.
What can be done to fix this? One solution is to use a different curve fitting algorithm, such as Gauss-Newton. Such algorithms suffer from the fact they are iterative, unlike the closed form we have for linear regression.
An other option is to weight the data. For the above graph, we want the data toward the right side to be more strongly considered than the data on the left side when evaluating coefficients. We want this consideration because data toward the right is much more sensitive to error in the coefficients than data to the left. To accomplish this, we need some kind of weighting term. To explore how to weight data, let us start simple and use the mean average.
Consider an average of a sequence of values:
Here a is the average, and yi is our sequence of data. This is the standard representation of the average. As explained in an earlier article, the mean average is actually just polynomial regression of degree 0. So let's write out the average as polynomial regression in expanded matrix form:
This doesn't look like the mean average, but it can be shown to be so if we reduce. Here a is the average. A 1x1 matrix is the same as if it wasn't in a matrix, so we can remove the matrices.
Normally one doesn't see that averages have x values associated with y values, but they do—kind of. The reason they are not seen is because the x values are raised to the 0th power, which means every x value is just 1. And the summation of a sequence of ones is just the count of that sequence (if you add up n ones, you get n). So the series can be reduced:
Now solve for a by moving n to the opposite side:
Thus polynomial regression of degree 0 is the average. Now to add a weighting term. Consider what happens when both sides are multiplied by a constant:
Algebraically W does nothing—it simply cancels out. We can move W inside the matrices:
And move W inside the summations:
Thus far we have assume W is a constant, but that was just to maneuver it into place. Now that W is in where it should be we can stop making that assumption and allow W to become a weighting term. Let W = wi so that there is a sequence of weights that can be applied to every y value. Our equation becomes:
We can drop the matrices, and get rid of the x0 because that is just 1.
Solve for a:
This produces a weighted average with value yi weighted by wi. If you look up the weighted average, this is the equation you will find. This equation is equivalent to the mean average if the weighting term is constant, and we can quickly show this:
Here w represents the fact the weighting term is the same for all values of y. So a normal average can be thought of as a weighted average where the weighting term is constant.
We can apply the same weighting method to linear regression. First, the expanded representation of linear regression:
Now introduce a constant weighting term:
Move this term into the matrix:
And then into the summations:
Once there we no longer make the assumption that the weight value is constant, but is instead a sequence:
Reduce the powers:
And we now have the weighted form of linear regression. What was done here should be pretty straightforward—wi was just placed in all the summations. So let us now apply the weighting term to the linearized version of our problem regression function from the beginning of the article by doing the same procedure:
Again, if the weights are all the same the results will be identical to those of the non-weighted version. We now have a weighting term that can be applied to give unequal consideration to select values of the function. But what should the weights be to help get our curve to fit?
Here is where some thought about the shape of the function along with some trial and error can help. What we can do is place a higher weight on the values to the right side of the graph. This is because the data to the right side is more sensitive to error in the coefficients. Stronger weights for values to the right will have the effect of making data points on the right considered more strongly than those to the left.
Just consider the w sequence a function of i. The simplest weighting term is to use the index if we assume the sequence of x is always increasing (that is, x is sorted starting from the lowest value and moving to the highest).
Now let's try computing the regression curve with this weighting term:
There is a marked improvement, but the regression line still isn't getting the curve quite where it needs to be. Clearly the right side needs a stronger weight.
Let's try squaring the weight:
Which results in this graph:
And that fit is almost perfect. In fact, using this weighting term we can increase the noise and have the regression respond as expected:
Here the noise has been increased by two orders of magnitude, but the weighting term still produces a fairly good fit.
For this particular equation the higher the power, the better the fit. For example:
This weighting term works well for this particular scenario where both the coefficients are positive values. However, it may not work in other situations. The weighting values need to be evaluated for the particulars of the application for which regression is being applied.
The use of a weighting term has allowed curve fitting of a non-linear function to be solved with a pretty simple regression technique. Unlike methods such as Gauss-Newton, this technique is deterministic and fairly easy to compute. So this is an other algorithm available for filtering out noise in real-world scenarios.
The reason polynomial regression works is because it is able to create a system of linear equations. Despite the fact polynomials are not all linear, the method for solving the coefficients generates linear equations. Consider the following polynomial:
The coefficients for this equation (a, b, and c) can be determined if there are three known x and corresponding y values. For a moment let us forget about the regression part and focus on solving the system exactly. We have three unknown coefficients, and thus will require three different points to solve this equation. Let us solve the coefficients for the known points: (5,38); (10,123); and (15,258). First, create 3 equations from the known points by substituting in the known values of x and y:
Expand and arrange:
The three resulting equations are all linear, and therefore we can solve this system of equation using a matrix. Turn the system of equations into matrix form:
Solve using matrix inversion:
The resulting coefficients are a=1, b=2, c=3.
So despite the fact the starting equation is not linear, the known values are for the non-linear terms. We could rewrite the starting equation as follows:
Where all upper case letters are constants. In this case A = x2, B=x. When written this way, the equation is clearly linear. This is why we can find coefficients for a non-linear polynomial equation with polynomial regression. All the non-linear operations are preformed on the data before we solve for the coefficients.
Several non-linear equations can be linearized in this fashion. Take a simple example:
Since we are solving for a and b with known values of x and y, the values relating to x and y become constants. That means the natural log goes away, and the resulting equation is linear. We can demonstrate using two known points: (1, 3); and (e, 7). This will create the following system of equations:
Solve the logs:
Knowing the b = 3, we can quickly find a = 4 by substituting the known b into the second equation. Starting with the original equation one can see it will be linear fairly easily. Again, we can rewrite the starting equation so it looks linear:
Where A = ln x.
Sometimes, however, equations that don't look linear can be linearized. Take for instance:
This does not at all look like it is linear. However we can use some algebra to change things around. It doesn't matter what we do to x and y as they just turn into constants. We need to manipulate this equation doing whatever non-linear operations are required to get a and b in a linear form—or at least close. Here we want to get b and x out of the exponent. This can be accomplished by taking the natural log of both sides.
Split using log identities.
Pull out the exponent using log identities.
And the natural log of e is 1, which results in:
This still isn't completely linear because we need to solve for a. However it is close.
This is linear and we can solve it for some known point. After we solve the system of equations, we need to solve for a using the value of α. Let us demonstrate using the points: (0,5) and (ln(5),125).
Place into a matrix:
Now we have values for α and b. We just need to use the value for A to solve for a.
Take the exponential of each side:
Now fill in the known value of α:
So despite the equation being non-linear, it can be solved in a linear fashion.
This is useful because if we can do this for an ordinary system of equations, we can also do it for an overdetermined system like we do for polynomial regression. Let us take the an earlier example and solve an overdetermined case using least-squares.
First, the equation:
The square of the residual:
Substitute in the function f:
To minimize the square of the residual, take the derivative and set it equal to zero:
Set this to zero:
Factor out the constant 2:
The two goes away when we divide two into both sides. Now break up the summations:
Pull out the constants and reduce:
Rearrange a bit:
And place into matrix form:
We now have a least-square solution to our starting function. This isn't too different from our normal equation for linear regression. In fact, we can place any two coefficient equation that can be linearized into least-square regression. The input must have the form:
Where each of the terms is a function of their lower-case part. We we work out the least-square math, the result will be:
We can use this as a template to solve for an other function we can make linear:
We already saw that this can be turned into:
Now place calculate the variables:
And place this in our matrix:
Note that the matrix when solved will give us the natural log of a. That will have to be undone in order after the matrix has been solved by taking the exponential of ln( a ). Normally I would leave the equation in matrix form, but let us take this one step further so the this last step can be demonstrated.
Take out of matrix form:
Now to solve for a in the second equation, take the exponential of both sides.
Where exp( x ) = ex.
Here is an example plot of the algorithm being used on a set of 50 noisy data points.
So we have a method to linearize some two coefficient non-linear functions and preform least-square regression using them. I was looking for some equations that had more than two coefficients I could demonstrate but did not find any. That will have to wait for an other day.
This section will attempt to explain the math behind polynomial regression. It requires a solid understanding of algebra, basic linear algebra (using matrices to solve systems of equations), and some calculus (partial derivatives and summations). The algebra is worked out, but the computation of matrices and derivatives are not. Someone with access to a computer algebra system (like WolframAlpha) should therefore be able to follow the math knowing only algebra and treating the higher level math functions like a black box.
Polynomial regression is an overdetermined system of equations that uses least squares as a method of approximating an answer. To understand this let us first look at a system of equations that is not overdetermined. We can start by constructing a line function that intercepts two points: (0.1, 0.1) and (0.7, 0.8). First let us solve the problem algebraically. Remember the equation for a line:
There are two unknowns in this equation: m and b. With two unknowns we need two solutions. These we have from the two given data points. So let us create two equations from this data:
Now to solve for the system, we can start with the top equation and solve for one of the unknowns. Let us choose b:
With b known, let us substitute the solution into the bottom equation:
Simplify this a bit:
Now we have a single equation with one unknown, which we can solve for:
Now that m is known, we can choose either of our original equations and solve for b. Let us choose the top equation and solve for b.
So with m and b known, our original equation can be turned into the line function:
This is a system of equations, and this is a pretty textbook example of how to solve it algebraically. Rather than use the algebraic approach we can also solve this system using matrices and linear algebra. The reason we are using matrix math is to simplify higher order polynomials, and we will get to those latter. Let us first rearrange the system of equations and add in the redundant multiplier of 1 in front of the b coefficient.
Now we can place the system into matrix form:
And using linear algebra, we can solve.
Or solve the system using Cramer's Rule.
Regardless the results are the same, but using matrix representation will make things easier when it comes to applying the math to higher order polynomials.
Now onto the overdetermined system. Let us say we now have 3 data points rather than two, and these points are: (0.1,0.1), (0.35,0.45), and (0.6,0.8).
We end up with 3 equations and two unknowns. That is too much data. If these points are on the same line, we could simply ignore one of the equations. However, if they are not we need a way of calculating coefficients that take all our data points into consideration. This is where least squares come in.
To preform a least square approach we need to define a residual function. This is a function that specifies the amount of error between our true data, and that produced by our estimation function. As a very simple example, let us say we have the data point (10,8) and our line function is y = 2 x - 12. Fill this in for x = 10 and we get y = 2 (10) - 12 = 6. The residual is the difference between the observed value (8) and the estimated value (6), or 8 – 6 = 2. For any point (xi, yi) on the line, the residual would be:
Where i is the index into the set of known data points. We can generalize this for any known data points:
In fact, we can generalize it for any function:
Now consider that we have a set of data points. What does the residual tell us? Well, for each data point the residual denotes the amount of error between our estimation and the true data. How about the overall error? To get this, we could just sum up the error for all data points. However error can be positive or negative. So we could end up with a lot of error uniformly distributed between negative and positive values. For example, if our error was (6, -8, 2) then our total error sum would be 0. So this doesn't give a good indication of error. We could sum the absolute value of all the error. The absolute value is equivalent to:
Then our (6, -8, 2) absolute error sum would total 16, and that's more like what we want. Since we just want an indication of error, we can drop the square root as the square function is enough to produce the always positive value we are looking for. There is an other reason to use just the square that will become clear latter.
So now that we have a function that measures residual, what do we do with it? Well, if we are trying to produce a function that models a set of data, we want the residual to be as small as possible—we want to minimize it. This would produce a function that deviates from the observed data as little as possible.
To minimize, we need to be able to modify the function's coefficients as the coefficients are the variables we have to manipulate for finding the best fit. In our line function y = m x + b, the coefficients are m and b. There is an added benefit to squaring the residuals—the square of residual forms a parabola. To see why this is useful, consider a 1st degree polynomial with three known points (10, 8, 11). Let's make the function:
You will notice x isn't used, but this is legitimate. Use this to construct the residual function:
And expand this for our data:
If we multiply everything and collect like terms we end up with:
Now graph this function, but use c0 as the variable:
The graph is a parabola. A parabola always has one and only one lowest point—it's vertex. The lowest point is the point with the lowest amount of error. So if we find coefficients that yield the vertex we have the coefficients that produce the lowest error.
Finding the vertex of a parabola for a function that has a single coefficient is easy. Take the derivative of the function and find where that function is equal to zero. This works because any time the derivative of a function is equal to zero, that point is either a local maximum or minimum. A parabola only has one such point: the minimum.
Using our example start by taking the derivative:
And set this equal to zero:
Now solve for c0:
So the coefficient that best fits this data is c0 = 9 2/3. Our function is then:
We have just solved a 0 degree polynomial using least squares, and our coefficient is the average of our data set:
As stated, the mean average is polynomial regression with a 0 degree polynomial. The reason this example was worked out was because it is easy to visualize. A function that has more than one coefficient produces a multidimensional equation. For example, the line function has two coefficients, m and b. The residual square function will produce a paraboloid.
This is the graph of a simple paraboloid. There is still an overall minimum—the bottom of the bowl shape—but there are two variables involved in finding the minimum. This holds true for functions with three or more coefficients, but we cannot graph these to make a demonstration.
To find the minimum of these problems the derivative can still be used, but we need to use partial derivatives—one with respect to each coefficient. This results in an equation for each coefficient, and that makes sense. To solve a system of equations, we need an equation for each unknown term. And using this method this is exactly what we get.
For the line function, we need the partial derivative with respect to m and with respect to b. Let us run this setup for the residual for some overdetermined function. First the line function:
Now the residual square sum when there are n known points:
Let us work through this using the three known points mention earlier: (0.1,0.1), (0.35,0.45), and (0.6,0.8)
Clean this up:
To minimize we need to take the partial derivative of the function r with respect to the two coefficients, m and b.
Now set these partial derivatives equal to zero:
We now have two equations with two unknowns. With a bit of rearranging we can get this into matrix form. First, the 6 goes away by dividing it into the other side:
Next we'll move the constant to the right side:
Now place the equations into a matrix:
In this example all points have the same slope and intercept. Below is a graph of the 3 data points, and the regression line using the computed slope and intercept m and b respectively.
We could have just dropped one of the equations and arrived at the same result. However, this method will work if each of the points varies slightly from the slope and intercept, and it will compute values for m and b such that they have the smallest amount of residual.
While the steps above work for our one set of points, let us try creating a more generalized version of the regression equation for the line function. We will again start with the sum of the residual squared:
Take the partial derivative of the function r with respect to the two coefficients, m and b.
Set these partial derivatives equal to zero:
It looks a little messy, but we in fact have two equations with two unknowns. We can convert this into a matrix so it can be solved. Some rearranging is needed for this to happen. First, some expansion:
Break up the summations:
Move the constants out of the summations:
We can reduce the summation of the constant 1:
Move all parts dealing with y to the right side of the equation:
From here it is easy to place the equations into matrix form:
The 2 can be factored out and cancels on both sides.
What we are left with is precisely the matrix definition of linear regression. To test this, let us use the three data points from above: (0.1,0.1), (0.35,0.45), and (0.6,0.8). First we need the data calculated out:
Now fill this data into the matrix:
So both methods agree with one an other. We can repeat this process for a polynomial of any number of degrees. Let's try using a 2nd degree polynomial to get the results for quadratic regression.
First, the quadratic function:
Now the sum of squares for n known data points:
There are three partial derivatives as there are three coefficients.
Set the partial derivatives equal to zero.
Eliminate the 2, split the summations, and move y to the right:
Reduce and pull out constants:
And translate into matrix form:
We get the results for quadratic regression in matrix form. There is a pattern forming we can seen more clearly by generalizing the partial derivative:
Where m is the degree of the polynomial, and n is the number of known data points. This, if you follow the math, leads to a generalized matrix:
This is the matrix representation of polynomial regression for any degree polynomial. And that's it. An overdetermined system is solved by creating a residual function, summing the square of the residual which forms a parabola/paraboloid, and finding the coefficients by finding the minimum of the parabola/paraboloid.
Simply put polynomial regression is an attempt to create a polynomial function that approximates a set of data points. This is easier to demonstrate with a visual example.
In this graph we have 20 data points, shown as blue squares. They have been plotted on an X/Y graph. The orange line is our approximation. Such a scenario is not uncommon when data is read from a device with inaccuracies. With enough data points the inaccuracies can be marginalized. This is useful in physical experiments when modeling some phenomenon. There is often an equation and the coefficients must be determined by measurement. If the equation is a polynomial function, polynomial regression can be used.
Polynomial regression is one of several methods of curve fitting. With polynomial regression, the data is approximated using a polynomial function. A polynomial is a function that takes the form f( x ) = c0 + c1 x + c2 x2 %u22EF cn xn where n is the degree of the polynomial and c is a set of coefficients.
Most people have done polynomial regression but haven't called it by this name. A polynomial of degree 0 is just a constant because f( x ) = c0 x0 = c0. Likewise preforming polynomial regression with a degree of 0 on a set of data returns a single constant value. It is the same as the mean average of that data. This makes sense because the average is an approximation of all the data points.
Here we have a graph of 11 data points and the average is highlighted at 0 with a thick blue line. The average line mostly follows the path of the data points. Thus the mean average is a form of curve fitting and likely the most basic.
Linear regression is polynomial regression of degree 1, and generally takes the form y = m x + b where m is the slope, and b is the y-intercept. It could just as easily be written f( x ) = c0 + c1 x with c1 being the slope and c0 the y-intercept.
Here we can see the linear regression line running along the data points approximating the data. Mean average and linear regression are the most common forms of polynomial regression, but not the only.
Quadratic regression is a 2nd degree polynomial and not nearly as common. Now the regression becomes non-linear and the data is not restricted to straight lines.
Here we can see data with a quadratic regression trend line. So the idea is simple: find a line that best fits the data. More specifically, find the coefficients to a polynomial that best fits the data. To understand how this works, we need to explore the math used for computation.