Weighted least-squares regression is a method for considering some data points more strongly than others when computing coefficients. While writing about non-linear least-squares regression, I ran into a problem with ordinary least-squares regression. One solution to this problem is to use a weighting term, and this article explores both the problem function and that solution.
Let us start with the function:
We can make this function semilinear by taking the natural log. After some rearranging, the function becomes:
In this form we can apply a technique discussed in an earlier article and use a modified linear regression equation:
So far there is nothing unusual about this equation. Let us try this on a clean set of data:
The graph shows the known data points as dots, and the solid regression line. This is what we would expect to see.
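The fit can be sketched in Python. As an assumption for illustration only, take the function to be an exponential y = a·e^(b·x) with made-up coefficients (the article's actual function and data appear in its figures); ordinary linear regression on ln(y) then recovers the coefficients on clean data:

```python
import math

# Assumed model for illustration: y = a * exp(b * x) with made-up
# coefficients; the article's actual function is shown in its figures.
a_true, b_true = 2.0, 0.5
xs = [0.5 * i for i in range(1, 21)]
ys = [a_true * math.exp(b_true * x) for x in xs]

def log_linear_fit(xs, ys):
    """Fit y = a * exp(b*x) by ordinary linear regression on ln(y)."""
    n = len(xs)
    lys = [math.log(y) for y in ys]
    sx  = sum(xs)
    sy  = sum(lys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * ly for x, ly in zip(xs, lys))
    # Closed-form solution of the 2x2 normal equations.
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    ln_a = (sy - b * sx) / n
    return math.exp(ln_a), b

a_fit, b_fit = log_linear_fit(xs, ys)
print(a_fit, b_fit)  # recovers a = 2.0, b = 0.5 (up to rounding) on clean data
```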
Now let us apply regression to a set of data with small amounts of noise on the signal.
What happened? The introduced noise barely affects the data points, yet the coefficients for the regression line are clearly not correct. The deviation from the original data points grows worse and worse toward the right side.
This is one of the problems with finding the coefficients for the linearized version of the function: no account is taken of the non-linearity of the data. So what we end up with are coefficients that minimize the error for the linear form, which does not always translate well back to the non-linear form.
What can be done to fix this? One solution is to use a different curve-fitting algorithm, such as Gauss-Newton. Such algorithms suffer from the fact that they are iterative, unlike the closed form we have for linear regression.
Another option is to weight the data. For the above graph, we want the data toward the right side to be more strongly considered than the data on the left side when evaluating coefficients. We want this because data toward the right is much more sensitive to error in the coefficients than data to the left. To accomplish this, we need some kind of weighting term. To explore how to weight data, let us start simple and use the mean average.
Consider an average of a sequence of values:
Here a is the average, and yi is our sequence of data. This is the standard representation of the average. As explained in an earlier article, the mean average is actually just polynomial regression of degree 0. So let's write out the average as polynomial regression in expanded matrix form:
This doesn't look like the mean average, but it can be shown to be so if we reduce. A 1x1 matrix is the same as the value it contains, so we can remove the matrices.
Normally one doesn't see that averages have x values associated with y values, but they do—kind of. The reason they are not seen is because the x values are raised to the 0th power, which means every x value is just 1. And the summation of a sequence of ones is just the count of that sequence (if you add up n ones, you get n). So the series can be reduced:
Now solve for a by moving n to the opposite side:
Thus polynomial regression of degree 0 is the average. Now to add a weighting term. Consider what happens when both sides are multiplied by a constant:
Algebraically W does nothing—it simply cancels out. We can move W inside the matrices:
And move W inside the summations:
Thus far we have assumed W is a constant, but that was just to maneuver it into place. Now that W is where it should be, we can stop making that assumption and allow W to become a weighting term. Let W = wi so that there is a sequence of weights that can be applied to every y value. Our equation becomes:
We can drop the matrices and get rid of the x^0 terms, because x^0 is just 1.
Solve for a:
This produces a weighted average with value yi weighted by wi. If you look up the weighted average, this is the equation you will find. This equation is equivalent to the mean average if the weighting term is constant, and we can quickly show this:
Here w represents the fact the weighting term is the same for all values of y. So a normal average can be thought of as a weighted average where the weighting term is constant.
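As a quick sanity check, the weighted average is a one-liner. A sketch with made-up numbers, showing that a constant weight reduces it to the ordinary mean:

```python
def weighted_average(ys, ws):
    """Weighted average: sum(w_i * y_i) / sum(w_i)."""
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)

ys = [1.0, 2.0, 3.0, 4.0]
# A constant weight cancels out, reducing to the ordinary mean of 2.5.
print(weighted_average(ys, [5.0] * 4))              # 2.5
# Unequal weights pull the result toward the heavily weighted values.
print(weighted_average(ys, [1.0, 1.0, 1.0, 7.0]))   # 3.4
```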
We can apply the same weighting method to linear regression. First, the expanded representation of linear regression:
Now introduce a constant weighting term:
Move this term into the matrix:
And then into the summations:
Once there, we no longer assume the weight value is constant; instead, it is a sequence:
Reduce the powers:
And we now have the weighted form of linear regression. What was done here should be pretty straightforward—wi was just placed in all the summations. So let us now apply the weighting term to the linearized version of our problem regression function from the beginning of the article by doing the same procedure:
Again, if the weights are all the same the results will be identical to those of the non-weighted version. We now have a weighting term that can be applied to give unequal consideration to select values of the function. But what should the weights be to help get our curve to fit?
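Here is a sketch of the weighted form in Python (the function and variable names are my own). The only change from ordinary linear regression is the w_i placed inside every summation:

```python
def weighted_linear_fit(xs, ys, ws):
    """Weighted linear regression for y = c0 + c1*x.

    Same normal equations as ordinary linear regression, but with
    the weight w_i placed inside every summation.
    """
    s0  = sum(ws)                                 # sum of w_i * x_i^0
    s1  = sum(w * x for w, x in zip(ws, xs))      # sum of w_i * x_i
    s2  = sum(w * x * x for w, x in zip(ws, xs))  # sum of w_i * x_i^2
    sy  = sum(w * y for w, y in zip(ws, ys))      # sum of w_i * y_i
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    # Solve the 2x2 system by Cramer's rule.
    det = s0 * s2 - s1 * s1
    c0 = (sy * s2 - s1 * sxy) / det
    c1 = (s0 * sxy - s1 * sy) / det
    return c0, c1

# Exact data on the line y = 1 + 2x: any positive weights recover (1, 2).
print(weighted_linear_fit([0.0, 1.0, 2.0, 3.0],
                          [1.0, 3.0, 5.0, 7.0],
                          [1.0, 2.0, 3.0, 4.0]))
```

Note that when the weights are all equal, s0 reduces to n and the formulas collapse to the familiar unweighted closed form.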
Here is where some thought about the shape of the function along with some trial and error can help. What we can do is place a higher weight on the values to the right side of the graph. This is because the data to the right side is more sensitive to error in the coefficients. Stronger weights for values to the right will have the effect of making data points on the right considered more strongly than those to the left.
Just consider the w sequence to be a function of i. The simplest weighting term is the index itself, if we assume the sequence of x is always increasing (that is, x is sorted starting from the lowest value and moving to the highest).
Now let's try computing the regression curve with this weighting term:
There is a marked improvement, but the regression line still isn't getting the curve quite where it needs to be. Clearly the right side needs a stronger weight.
Let's try squaring the weight:
Which results in this graph:
And that fit is almost perfect. In fact, using this weighting term we can increase the noise and have the regression respond as expected:
Here the noise has been increased by two orders of magnitude, but the weighting term still produces a fairly good fit.
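The effect of the squared weight can be shown in a small, hand-checkable sketch. The data here is assumed for illustration (the article's function, coefficients, and noise level appear only in its figures): the true curve is y = e^x, and a single left-hand value is perturbed to stand in for the noise. Fitting through the linearized form, squared-index weights land noticeably closer to the true exponent than constant weights:

```python
import math

# Assumed true curve for illustration: y = e^x, i.e. a = 1, b = 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(x) for x in xs]
ys[0] *= math.exp(0.4)  # perturb a left-hand point; tiny in absolute terms

def fit(xs, ys, ws):
    """Weighted linear regression on ln(y), fitting y = a * exp(b*x)."""
    lys = [math.log(y) for y in ys]
    s0  = sum(ws)
    s1  = sum(w * x for w, x in zip(ws, xs))
    s2  = sum(w * x * x for w, x in zip(ws, xs))
    sy  = sum(w * ly for w, ly in zip(ws, lys))
    sxy = sum(w * x * ly for w, x, ly in zip(ws, xs, lys))
    det = s0 * s2 - s1 * s1
    return (sy * s2 - s1 * sxy) / det, (s0 * sxy - s1 * sy) / det  # ln(a), b

_, b_plain   = fit(xs, ys, [1.0] * 4)                         # constant weights
_, b_squared = fit(xs, ys, [(i + 1) ** 2 for i in range(4)])  # squared index
print(b_plain, b_squared)  # 0.88 vs ~0.955: squared weights sit closer to b = 1
```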
For this particular equation the higher the power, the better the fit. For example:
This weighting term works well for this particular scenario where both the coefficients are positive values. However, it may not work in other situations. The weighting values need to be evaluated for the particulars of the application for which regression is being applied.
The use of a weighting term has allowed curve fitting of a non-linear function to be solved with a pretty simple regression technique. Unlike methods such as Gauss-Newton, this technique is closed form and fairly easy to compute. So this is another algorithm available for filtering out noise in real-world scenarios.