
March 08, 2015

Mathematics of Weighted Polynomial Regression

Weighted polynomial regression is a method by which some terms are considered more strongly than others. There are a number of reasons this may be desired. If the regression coefficients are being used to extrapolate additional data, the prediction may be more accurate if the newest data is considered more strongly than older data. A set of data based on measurements might have a margin of error unique to each data point; the points with lower margins of error can be considered more strongly than those with higher error. A data point might also represent an average over a different amount of data, in which case the amount of data each point represents can be used as its weight.

We will begin with an example of a weighted average. A classic example has to do with considering an average of averages, where each average has a different number of data points. Consider two classes that have taken the same test. One class has 20 students, the other 30.

First class:

71 71 68 68 67 67 70 73 73 65 72 67 68 70 69 74 72 67 74 73                              

Second class:

94 85 86 89 91 94 86 92 89 93 91 89 89 88 94 92 86 91 85 92 91 92 89 85 91 94 94 91 91 86

The average for the first class is 70.0, and for the second class 90.0. The average of all the grades together is 82.0. However, the average of the two averages is 80.0. This is because the average of averages considers both sets to be equal. This can be corrected by considering the number of students.

Here, a weighting value has been added to each term and reproduces the same value as the average of all data points. Let us look at this in a general form:

To prove this, let's plug in the values:

w0 = 20
w1 = 30
y0 = 70
y1 = 90
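This arithmetic is easy to check. Below is a quick sketch in Python using the class sizes and the (rounded) class averages from above:

```python
# Weighted average of the two class averages, weighted by class size.
weights = [20, 30]   # w0, w1: number of students in each class
means = [70, 90]     # y0, y1: the (rounded) class averages

weighted = sum(w * y for w, y in zip(weights, means)) / sum(weights)
plain = sum(means) / len(means)

print(weighted)  # 82.0, matching the average of all grades together
print(plain)     # 80.0, the unweighted average of averages
```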

To see how we came to the general form, let us consider how to introduce a weighting term into the average. The mean average is actually just polynomial regression of degree 0. So let's write out the average as polynomial regression in expanded matrix form:

This doesn't look like the mean average, but it can be shown to be so if we reduce. Here a is the average. A 1x1 matrix is the same as the value it contains, so we can remove the matrices.

Normally one does not see that averages have x values associated with y values, but they do—kind of. The reason they are not seen is because the x values are raised to the 0th power, which means every x value is just 1. And the summation of a sequence of 1 is just the count of that sequence (if you add up n ones, you get n). So the series can be reduced:

Now solve for a by moving n to the opposite side:

And this is how one is used to seeing the average represented. Thus our matrix representation is equivalent. Let's return to the matrix form and introduce a weighting term, W. Consider what happens when both sides are multiplied by this constant:

Algebraically W does nothing—it simply cancels out. We can move W inside the matrices:

And move W inside the summations:

Thus far we have assumed W is a constant, but that was just to maneuver it into place. Now that W is where it should be, we can stop making that assumption and allow W to become a weighting term. Let W = wi so that there is a sequence of weights that can be applied to every y value. Our equation becomes:

We can drop the matrices and get rid of the x0 term, because x raised to the 0th power is just 1.

Solve for a:

This produces a weighted average with value yi weighted by wi. This matches our earlier representation of a weighted average. It is also equivalent to the mean average if the weighting term is constant, and we can quickly show this:

We can apply the same weighting method to linear regression. We will use linear regression because the equation is not nearly as large as polynomial regression. First, the expanded representation of linear regression:

Now introduce a constant weighting term:

Move this term into the matrix:

And then into the summations:

Once there, we no longer assume the weight value is constant; it is instead a sequence:

Reduce the powers:

And we now have the weighted form of linear regression. What was done here should be pretty straightforward: wi was just placed in all the summations. We can take what we did with linear regression and quickly apply it to polynomial regression. First start with the expanded form of polynomial regression:

Here, m is the number of coefficients for the polynomial, and n is the number of x/y data points. Now place wi in all the summations, just as before:

Clean this up a little to simplify:

This is the general form of weighted polynomial regression. Again, if the weighting term is constant for each index it will simply reduce out and one will be left with unweighted polynomial regression.
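As a concrete sketch of this general form (my own illustration, not code from the article), the following Python builds the weighted normal equations, where entry (j, k) of the matrix is the sum over i of wi * xi^(j+k) and entry j of the right-hand side is the sum of wi * xi^j * yi, then solves them with Gaussian elimination:

```python
def weighted_polyfit(x, y, w, degree):
    """Weighted polynomial regression via the normal equations.

    M[j][k] = sum_i w_i * x_i**(j+k)
    b[j]    = sum_i w_i * x_i**j * y_i
    """
    m = degree + 1
    M = [[sum(wi * xi ** (j + k) for xi, wi in zip(x, w)) for k in range(m)]
         for j in range(m)]
    b = [sum(wi * xi ** j * yi for xi, yi, wi in zip(x, y, w)) for j in range(m)]

    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            f = M[row][col] / M[col][col]
            for k in range(col, m):
                M[row][k] -= f * M[col][k]
            b[row] -= f * b[col]

    # Back substitution.
    c = [0.0] * m
    for row in range(m - 1, -1, -1):
        c[row] = (b[row] - sum(M[row][k] * c[k]
                               for k in range(row + 1, m))) / M[row][row]
    return c  # coefficients of c[0] + c[1]*x + c[2]*x^2 + ...

# With constant (unit) weights this reduces to ordinary polynomial regression:
x = [0.0, 1.0, 2.0, 3.0]
y = [3.0, 6.0, 11.0, 18.0]          # exactly x^2 + 2x + 3
print(weighted_polyfit(x, y, [1, 1, 1, 1], 2))  # approximately [3.0, 2.0, 1.0]
```

With a constant weight the wi terms cancel out of the solution, so unit weights reproduce the unweighted fit, just as the article notes.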

Just as with using polynomial regression, how weighting is actually used is not so easy to generalize. One must carefully consider what phenomenon is being modeled before adding a weighting term to the data. However, once this is understood, a weighting term can be very useful.

This article has been added to the polynomial regression site and is mostly a recap of a previous article.

July 09, 2014

Using Polynomial Regression for Interpolation

One of the reasons I started researching polynomial regression was its use for interpolation. The applications for this are limited as larger degree polynomials can easily oscillate wildly, but there are still times it can be useful. Today I found a use I want to share.

For the past several days our computer ππ has been logging the amount of light (in lux) that falls on our house's rooftop. To my dismay it often crashes, and I haven't figured out the cause, so gaps exist in my data. Today I got the most complete graph I've yet had, but there was a dropout just after 7:00 pm that lasted until 8:40 pm. The sun was setting during this time and all the day's useful data had pretty much been logged, but I was still left with a gap in my data.

To fill in the gap we could just do a linear approximation between the two points on either end of the data. When we do, our graph of the data right around the gap looks like this:

Here you can see the interpolation line in red does connect the two points closest to the gap, but doesn't follow the curve. The approximation is alright and would probably be functional for what I am trying to do. However, I know I can get a better curve, and this is where polynomial regression can be used. Using a polynomial function to approximate the existing data will give us a continuous function that will also fill in the missing data. Since a line won't do a good job, there is no point trying linear regression. So let us start with quadratic regression (a 3 coefficient polynomial).

This isn't a great representation. Now cubic regression (4 coefficient polynomial).

This looks much better. The curve follows along most of the existing data points and makes a nice transition. We could stop here and calculate all the missing data points using this polynomial, but will a higher degree polynomial with more coefficients make the curve fit even better? Let us try quartic regression (5 coefficients).

This curve now runs through almost all the existing data points and is what we are looking for. At this point, the curve fits the data better than the data itself as the true data has noise. This is the curve I used to interpolate the data.
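The same idea can be sketched with synthetic data, since the original lux readings are not reproduced here. This sketch assumes numpy is available and uses its polyfit/polyval helpers in place of my calculator; the falloff curve and the gap location are invented purely for illustration:

```python
import numpy as np

# Synthetic stand-in for the evening lux data: a smooth falloff with a dropout.
t = np.linspace(0.0, 4.0, 41)
lux = 1000.0 / (1.0 + np.exp(2.0 * (t - 2.0)))   # smooth decaying curve
known = (t < 1.85) | (t > 2.95)                  # points that were "logged"
gap = ~known                                     # the dropout to interpolate

# Fit polynomials of increasing degree to the surviving points and see
# how well each one reconstructs the missing stretch.
worst = {}
for degree in (1, 2, 3, 4):
    coeffs = np.polyfit(t[known], lux[known], degree)
    worst[degree] = float(np.max(np.abs(np.polyval(coeffs, t[gap]) - lux[gap])))
    print(f"degree {degree}: worst error in the gap = {worst[degree]:.1f}")
```

On this synthetic curve, as in the article, the higher-degree fits track the missing stretch far more closely than a straight line does.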

For the sake of inquiry we can continue to increase the degree of the polynomial. Octic regression (9 coefficients) produces this:

Here we start to see an oscillation being introduced into the data. While it fits, and may be what happened during the missing data period, it isn't any better than the curve produced with quartic regression.

Some strange things happen once more than 9 coefficients are used. At 10 coefficients we get this:

Simply from knowing the physical phenomenon taking place, this graph isn't possible. However, it does actually have a better coefficient of determination than the lower degree polynomial, so it does make sense to select which curve to use based on its visual attributes. The algorithm is designed simply to minimize the residual error with a given degree polynomial, and knows nothing of the constraints imposed by the data source.

This work was all done using my online polynomial regression calculator and only took a couple of minutes. A slightly modified version produced the graphs for this article as I removed the (meaningless) units and allowed larger degree polynomials.

January 29, 2014

Some Non-Linear Functions that can be Linearized for Least-Square Regression

The reason polynomial regression works is because it is able to create a system of linear equations. Despite the fact polynomials are not all linear, the method for solving the coefficients generates linear equations. Consider the following polynomial:

The coefficients for this equation (a, b, and c) can be determined if there are three known x and corresponding y values. For a moment let us forget about the regression part and focus on solving the system exactly. We have three unknown coefficients, and thus will require three different points to solve this equation. Let us solve the coefficients for the known points: (5,38); (10,123); and (15,258). First, create 3 equations from the known points by substituting in the known values of x and y:

Expand and arrange:

The three resulting equations are all linear, and therefore we can solve this system of equations using a matrix. Turn the system of equations into matrix form:

Solve using matrix inversion:


The resulting coefficients are a=1, b=2, c=3.
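We can verify these coefficients with a short sketch that solves the same 3x3 system exactly:

```python
# Solve the 3x3 linear system from the points (5,38), (10,123), (15,258)
# for y = a*x^2 + b*x + c, using fractions for exact arithmetic.
from fractions import Fraction

points = [(5, 38), (10, 123), (15, 258)]
# Augmented matrix rows: [x^2, x, 1 | y]
rows = [[Fraction(x * x), Fraction(x), Fraction(1), Fraction(y)]
        for x, y in points]

# Gauss-Jordan elimination.
for col in range(3):
    pivot = next(r for r in range(col, 3) if rows[r][col] != 0)
    rows[col], rows[pivot] = rows[pivot], rows[col]
    rows[col] = [v / rows[col][col] for v in rows[col]]
    for r in range(3):
        if r != col:
            factor = rows[r][col]
            rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]

a, b, c = (row[3] for row in rows)
print(a, b, c)  # 1 2 3
```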

So despite the fact the starting equation is not linear, the non-linear terms involve only known values. We could rewrite the starting equation as follows:

Where all upper-case letters are constants. In this case A = x², B = x. When written this way, the equation is clearly linear. This is why we can find coefficients for a non-linear polynomial equation with polynomial regression: all the non-linear operations are performed on the data before we solve for the coefficients.

Several non-linear equations can be linearized in this fashion. Take a simple example:

Since we are solving for a and b with known values of x and y, the values relating to x and y become constants. That means the natural log goes away, and the resulting equation is linear. We can demonstrate using two known points: (1, 3); and (e, 7). This will create the following system of equations:

Solve the logs:


Knowing that b = 3, we can quickly find a = 4 by substituting the known b into the second equation. Starting with the original equation, one can fairly easily see it will be linear. Again, we can rewrite the starting equation so it looks linear:

Where A = ln x.
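Because ln 1 = 0 and ln e = 1, the two equations solve almost immediately; a sketch:

```python
import math

# y = a*ln(x) + b with the known points (1, 3) and (e, 7).
b = 3.0          # at x = 1, ln(1) = 0, so y = b = 3
a = 7.0 - b      # at x = e, ln(e) = 1, so a + b = 7

# Check both points against the recovered function:
f = lambda x: a * math.log(x) + b
print(f(1.0), f(math.e))  # 3.0 and (approximately) 7.0
```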

Sometimes, however, equations that don't look linear can be linearized. Take for instance:

This does not at all look like it is linear. However we can use some algebra to change things around. It doesn't matter what we do to x and y as they just turn into constants. We need to manipulate this equation doing whatever non-linear operations are required to get a and b in a linear form—or at least close. Here we want to get b and x out of the exponent. This can be accomplished by taking the natural log of both sides.

Split using log identities.

Pull out the exponent using log identities.

And the natural log of e is 1, which results in:

This still isn't completely linear, because a sits inside the logarithm. However, it is close: substitute α = ln a, and the equation becomes linear in α and b.

This is linear and we can solve it for some known point. After we solve the system of equations, we need to solve for a using the value of α. Let us demonstrate using the points: (0,5) and (ln(5),125).

Place into a matrix:

And solve:

Now we have values for α and b. We just need to use the value of α to solve for a.

Take the exponential of each side:


Now fill in the known value of α:


So despite the equation being non-linear, it can be solved in a linear fashion.
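Working the example through numerically (a sketch, with α written as alpha):

```python
import math

# y = a * e^(b*x), linearized as ln(y) = alpha + b*x where alpha = ln(a).
# Known points: (0, 5) and (ln 5, 125).
x1, y1 = math.log(5), 125.0

alpha = math.log(5.0)             # from the first point: x = 0, so alpha = ln(5)
b = (math.log(y1) - alpha) / x1   # from the second point
a = math.exp(alpha)               # undo the substitution

print(a, b)  # approximately 5.0 and 2.0
```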

This is useful because if we can do this for an ordinary system of equations, we can also do it for an overdetermined system, like we do for polynomial regression. Let us take an earlier example and solve an overdetermined case using least squares.

First, the equation:

The square of the residual:

Substitute in the function f:

To minimize the square of the residual, take the derivative and set it equal to zero:

Set this to zero:

Factor out the constant 2:

The 2 goes away when we divide both sides by two. Now break up the summations:

Pull out the constants and reduce:

Rearrange a bit:

And place into matrix form:

We now have a least-square solution to our starting function. This isn't too different from our normal equation for linear regression. In fact, we can place any two coefficient equation that can be linearized into least-square regression. The input must have the form:

Where each upper-case term is a function of its lower-case part. When we work out the least-square math, the result will be:

We can use this as a template to solve any other function we can make linear:

We already saw that this can be turned into:

Now calculate the variables:

And place this in our matrix:

Note that the matrix, when solved, will give us the natural log of a. That will have to be undone after the matrix has been solved, by taking the exponential of ln( a ). Normally I would leave the equation in matrix form, but let us take this one step further so that this last step can be demonstrated.


Take out of matrix form:

Now to solve for a in the second equation, take the exponential of both sides.

Where exp( x ) = ex.
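The whole procedure, regressing ln y against x and then undoing the log, can be sketched as follows. The fit_exponential name and the sample data are my own for illustration:

```python
import math

def fit_exponential(xs, ys):
    """Least-squares fit of y = a * e^(b*x) by regressing ln(y) on x."""
    n = len(xs)
    ln_ys = [math.log(y) for y in ys]
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ln_ys)
    sxy = sum(x * ly for x, ly in zip(xs, ln_ys))

    # Normal equations for ln(y) = alpha + b*x, solved with Cramer's rule.
    det = n * sxx - sx * sx
    alpha = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return math.exp(alpha), b   # undo the log to recover a

# Noise-free data from y = 2 * e^(0.5 x) should be recovered (up to rounding):
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(a, b)  # approximately 2.0 and 0.5
```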

Here is an example plot of the algorithm being used on a set of 50 noisy data points.

So we have a method to linearize some two-coefficient non-linear functions and perform least-square regression using them. I was looking for some equations with more than two coefficients I could demonstrate, but did not find any. That will have to wait for another day.

January 18, 2014

The Math Behind Polynomial Regression

This section will attempt to explain the math behind polynomial regression. It requires a solid understanding of algebra, basic linear algebra (using matrices to solve systems of equations), and some calculus (partial derivatives and summations). The algebra is worked out, but the computation of matrices and derivatives are not. Someone with access to a computer algebra system (like WolframAlpha) should therefore be able to follow the math knowing only algebra and treating the higher level math functions like a black box.

Polynomial regression is an overdetermined system of equations that uses least squares as a method of approximating an answer. To understand this let us first look at a system of equations that is not overdetermined. We can start by constructing a line function that intercepts two points: (0.1, 0.1) and (0.7, 0.8). First let us solve the problem algebraically. Remember the equation for a line:

There are two unknowns in this equation: m and b. With two unknowns we need two solutions. These we have from the two given data points. So let us create two equations from this data:

Now to solve for the system, we can start with the top equation and solve for one of the unknowns. Let us choose b:

With b known, let us substitute the solution into the bottom equation:

Simplify this a bit:

Now we have a single equation with one unknown, which we can solve for:

Now that m is known, we can choose either of our original equations and solve for b. Let us choose the top equation and solve for b.

So with m and b known, our original equation can be turned into the line function:
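The algebra above can be checked in a few lines:

```python
# Line through (0.1, 0.1) and (0.7, 0.8): y = m*x + b.
x0, y0 = 0.1, 0.1
x1, y1 = 0.7, 0.8

m = (y1 - y0) / (x1 - x0)   # slope, 7/6
b = y0 - m * x0             # intercept from the first equation, -1/60

print(m, b)  # approximately 1.1667 and -0.0167
```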

This is a system of equations, and this is a pretty textbook example of how to solve it algebraically. Rather than use the algebraic approach we can also solve this system using matrices and linear algebra. The reason we are using matrix math is to simplify higher order polynomials, and we will get to those later. Let us first rearrange the system of equations and add in the redundant multiplier of 1 in front of the b coefficient.

Now we can place the system into matrix form:

And using linear algebra, we can solve.

Here I use matrix inversion to solve for the constant terms. You could set up the equation using an augmented matrix and use Gaussian elimination like this:

Or solve the system using Cramer's Rule.

Or easier still use WolframAlpha to compute the result.

Regardless, the results are the same, but using matrix representation will make things easier when it comes to applying the math to higher order polynomials.

Now onto the overdetermined system. Let us say we now have 3 data points rather than two, and these points are: (0.1,0.1), (0.35,0.45), and (0.6,0.8).

We end up with 3 equations and two unknowns. That is too much data. If these points are on the same line, we could simply ignore one of the equations. However, if they are not, we need a way of calculating coefficients that takes all our data points into consideration. This is where least squares comes in.

To perform a least-squares approach we need to define a residual function. This is a function that specifies the amount of error between our true data and that produced by our estimation function. As a very simple example, let us say we have the data point (10, 8) and our line function is y = 2x - 14. Fill this in for x = 10 and we get y = 2(10) - 14 = 6. The residual is the difference between the observed value (8) and the estimated value (6), or 8 - 6 = 2. For this line and any point (xi, yi), the residual would be:

Where i is the index into the set of known data points. We can generalize this for any known data points:

In fact, we can generalize it for any function:

Now consider that we have a set of data points. What does the residual tell us? Well, for each data point the residual denotes the amount of error between our estimation and the true data. How about the overall error? To get this, we could just sum up the error for all data points. However error can be positive or negative. So we could end up with a lot of error uniformly distributed between negative and positive values. For example, if our error was (6, -8, 2) then our total error sum would be 0. So this doesn't give a good indication of error. We could sum the absolute value of all the error. The absolute value is equivalent to:

Then our (6, -8, 2) absolute error sum would total 16, and that's more like what we want. Since we just want an indication of error, we can drop the square root, as the square function is enough to produce the always-positive value we are looking for. There is another reason to use just the square that will become clear later.

So now that we have a function that measures residual, what do we do with it? Well, if we are trying to produce a function that models a set of data, we want the residual to be as small as possible—we want to minimize it. This would produce a function that deviates from the observed data as little as possible.

To minimize, we need to be able to modify the function's coefficients, as the coefficients are the variables we have to manipulate to find the best fit. In our line function y = m x + b, the coefficients are m and b. There is an added benefit to squaring the residuals: the square of the residual forms a parabola. To see why this is useful, consider a 0th degree polynomial with three known values (10, 8, 11). Let's make the function:

You will notice x isn't used, but this is legitimate. Use this to construct the residual function:

And expand this for our data:

If we multiply everything and collect like terms we end up with:

Now graph this function, but use c0 as the variable:

The graph is a parabola. A parabola always has one and only one lowest point: its vertex. The lowest point is the point with the lowest amount of error. So if we find coefficients that yield the vertex, we have the coefficients that produce the lowest error.

Finding the vertex of a parabola for a function that has a single coefficient is easy. Take the derivative of the function and find where that function is equal to zero. This works because any time the derivative of a function is equal to zero, that point is either a local maximum or minimum. A parabola only has one such point: the minimum.

Using our example start by taking the derivative:


And set this equal to zero:

Now solve for c0:

So the coefficient that best fits this data is c0 = 9 2/3. Our function is then:

We have just solved a 0 degree polynomial using least squares, and our coefficient is the average of our data set:
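The worked example can be confirmed numerically. The residual-square function below is the same r(c0) = 3c0² - 58c0 + 285, and its minimum sits at the mean of the data:

```python
# Known values (10, 8, 11) fit with the 0th degree polynomial f(x) = c0.
data = [10, 8, 11]

def r(c0):
    """Sum of squared residuals for the constant model c0."""
    return sum((y - c0) ** 2 for y in data)

best = sum(data) / len(data)   # 29/3, i.e. 9.666..., the mean
print(best)

# The mean sits at the parabola's vertex: moving c0 either way adds error.
print(r(best) < r(best + 0.01) and r(best) < r(best - 0.01))  # True
```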

As stated, the mean average is polynomial regression with a 0 degree polynomial. The reason this example was worked out was because it is easy to visualize. A function that has more than one coefficient produces a multidimensional equation. For example, the line function has two coefficients, m and b. The residual square function will produce a paraboloid.

This is the graph of a simple paraboloid. There is still an overall minimum—the bottom of the bowl shape—but there are two variables involved in finding the minimum. This holds true for functions with three or more coefficients, but we cannot graph these to make a demonstration.

To find the minimum of these problems the derivative can still be used, but we need to use partial derivatives—one with respect to each coefficient. This results in an equation for each coefficient, and that makes sense. To solve a system of equations, we need an equation for each unknown term. And using this method this is exactly what we get.

For the line function, we need the partial derivative with respect to m and with respect to b. Let us run this setup for the residual for some overdetermined function. First the line function:

Now the residual square sum when there are n known points:

Let us work through this using the three known points mentioned earlier: (0.1,0.1), (0.35,0.45), and (0.6,0.8).

Clean this up:

To minimize we need to take the partial derivative of the function r with respect to the two coefficients, m and b.

(Solution for r with respect to b, and r with respect to m)

Now set these partial derivatives equal to zero:

We now have two equations with two unknowns. With a bit of rearranging we can get this into matrix form. First, the factor of 6 goes away when we divide both sides by it:

Next we'll move the constant to the right side:

Now place the equations into a matrix:



In this example all three points lie on the same line, so the fit is exact. Below is a graph of the 3 data points, and the regression line using the computed slope and intercept m and b respectively.

We could have just dropped one of the equations and arrived at the same result. However, this method will work if each of the points varies slightly from the slope and intercept, and it will compute values for m and b such that they have the smallest amount of residual.
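Here is the same computation as a sketch, building the 2x2 normal-equation system from the three points and solving it with Cramer's rule:

```python
# Linear regression via the normal equations for the three points above.
points = [(0.1, 0.1), (0.35, 0.45), (0.6, 0.8)]
n = len(points)
sx = sum(x for x, _ in points)          # 1.05
sxx = sum(x * x for x, _ in points)     # 0.4925
sy = sum(y for _, y in points)          # 1.35
sxy = sum(x * y for x, y in points)     # 0.6475

# [[sxx, sx], [sx, n]] * [m, b] = [sxy, sy], solved with Cramer's rule.
det = sxx * n - sx * sx
m = (sxy * n - sy * sx) / det
b = (sxx * sy - sx * sxy) / det

print(m, b)  # approximately 1.4 and -0.04; the points are collinear
```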

While the steps above work for our one set of points, let us try creating a more generalized version of the regression equation for the line function. We will again start with the sum of the residual squared:

Take the partial derivative of the function r with respect to the two coefficients, m and b.

(Solution for b, and solution for m)

Set these partial derivatives equal to zero:

It looks a little messy, but we in fact have two equations with two unknowns. We can convert this into a matrix so it can be solved. Some rearranging is needed for this to happen. First, some expansion:

Break up the summations:

Move the constants out of the summations:

We can reduce the summation of the constant 1:

Move all parts dealing with y to the right side of the equation:

From here it is easy to place the equations into matrix form:

The 2 can be factored out and cancels on both sides.

What we are left with is precisely the matrix definition of linear regression. To test this, let us use the three data points from above: (0.1,0.1), (0.35,0.45), and (0.6,0.8). First we need the data calculated out:

Now fill this data into the matrix:



So both methods agree with one another. We can repeat this process for a polynomial of any degree. Let's try using a 2nd degree polynomial to get the results for quadratic regression.

First, the quadratic function:

Now the sum of squares for n known data points:

There are three partial derivatives as there are three coefficients.

(Solution for c0, Solution for c1, and Solution for c2)

Set the partial derivatives equal to zero.

Eliminate the 2, split the summations, and move y to the right:

Reduce and pull out constants:

And translate into matrix form:

We get the results for quadratic regression in matrix form. There is a pattern forming, which we can see more clearly by generalizing the partial derivative:

Where m is the degree of the polynomial, and n is the number of known data points. This, if you follow the math, leads to a generalized matrix:

This is the matrix representation of polynomial regression for any degree polynomial. And that's it. An overdetermined system is solved by creating a residual function, summing the square of the residual which forms a parabola/paraboloid, and finding the coefficients by finding the minimum of the parabola/paraboloid.