# January 17, 2014

## What is Polynomial Regression?

### Polynomial Regression

Simply put, polynomial regression is an attempt to create a polynomial function that approximates a set of data points. This is easier to demonstrate with a visual example. In this graph we have 20 data points, shown as blue squares, plotted on an X/Y graph. The orange line is our approximation. Such a scenario is not uncommon when data is read from a device with inaccuracies. With enough data points the inaccuracies can be averaged out. This is useful in physical experiments when modeling some phenomenon: there is often a known equation whose coefficients must be determined by measurement. If the equation is a polynomial function, polynomial regression can be used.

Polynomial regression is one of several methods of curve fitting. With polynomial regression, the data is approximated using a polynomial function. A polynomial is a function that takes the form f(x) = c₀ + c₁x + c₂x² + ⋯ + cₙxⁿ, where n is the degree of the polynomial and c is a set of coefficients.

Most people have done polynomial regression but haven't called it by this name. A polynomial of degree 0 is just a constant because f(x) = c₀x⁰ = c₀. Likewise, performing polynomial regression with a degree of 0 on a set of data returns a single constant value. It is the same as the mean average of that data. This makes sense because the average is an approximation of all the data points. Here we have a graph of 11 data points, and the average is highlighted at 0 with a thick blue line. The average line mostly follows the path of the data points. Thus the mean average is a form of curve fitting, and likely the most basic.

Linear regression is polynomial regression of degree 1, and generally takes the form y = mx + b, where m is the slope and b is the y-intercept. It could just as easily be written f(x) = c₀ + c₁x, with c₁ being the slope and c₀ the y-intercept. Here we can see the linear regression line running along the data points, approximating the data. Mean average and linear regression are the most common forms of polynomial regression, but not the only ones.

Quadratic regression fits a 2nd degree polynomial and is not nearly as common. Now the regression becomes non-linear and the data is not restricted to straight lines. Here we can see data with a quadratic regression trend line. So the idea is simple: find a line that best fits the data. More specifically, find the coefficients of a polynomial that best fits the data. To understand how this works, we need to explore the math used for computation.
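The whole idea can be sketched in a few lines of code. Here is a minimal Python illustration (not the author's PHP class; names and structure are mine) that builds the normal-equation matrix of power sums and solves it with Gaussian elimination:

```python
def polynomial_regression(xs, ys, degree):
    """Least-squares polynomial fit: build the normal-equation
    matrix of power sums, then solve it with Gaussian elimination."""
    n = degree + 1
    # A[r][c] = sum of x^(r+c), b[r] = sum of y * x^r
    A = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Forward elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            k = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= k * A[col][c]
            b[r] -= k * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        coeffs[r] = (b[r] - sum(A[r][c] * coeffs[c]
                                for c in range(r + 1, n))) / A[r][r]
    return coeffs  # [c0, c1, ..., c_degree]

coeffs = polynomial_regression([0, 1, 2, 3], [1, 3, 5, 7], degree=1)  # fits c0 ≈ 1, c1 ≈ 2
```

Degree 0 reproduces the mean average and degree 1 reproduces ordinary linear regression, exactly as described above.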

# December 30, 2013

## Forced Coefficient Polynomial Regression

### Math, Polynomial Regression

The other day I received an e-mail from someone who was using my Polynomial Regression PHP class. They asked about something I had never thought about—how to force a Y-intercept of zero. It was something my library could not do, but something I wanted to understand. So in this article I will share what I learned, and how the problem was not just solved for the one case but for a whole set of cases.

I have written a fair amount in the past about linear regression, quadratic regression, and polynomial least-square regression, and have even made interactive demos. So I will assume in this writing that the reader is familiar with the basics of polynomial regression.

We will start by defining the original request. Consider this graph: here we have 10 points and a linear regression line. The coefficient of determination (R²) value is fairly low, meaning we don't have a great fit. This makes sense because there are not a lot of data points, and the data is quite noisy. The linear regression calculation has produced a Y-intercept of ≈0.15.

What if these samples were part of a physical measurement, and we know by the nature of this measurement the Y-intercept must be zero? If we know this to be the case the only variable in the linear regression equation is slope. How do we calculate linear regression slope with zero Y-intercept?

We could simply run the linear regression algorithm and ignore the Y-intercept. Here we have calculated the slope and intercept, but ignored the intercept. Notice the trend line is below most of the data points. It is clearly not accurate.

Let us review the equation for linear regression in matrix form:

    | n    Σx  | | b |   | Σy  |
    | Σx   Σx² | | m | = | Σxy |

And solved:

    b = (Σx²·Σy − Σx·Σxy) / (n·Σx² − (Σx)²)
    m = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)

Remember that the mean average can also be expressed in matrix form by removing a row and column:

    [ n ][ b ] = [ Σy ]

What this tells us is that the intercept term is based on the first row and column. So if we modify our linear regression matrix, we can make this row and column moot. To do this, we set the first element of the first row and column to one, and the rest of the first row and column to zero (more on why later). In addition, we force the b term to zero like this:

    | 1   0   | | b |   | 0   |
    | 0   Σx² | | m | = | Σxy |

Now we solve the equation and get:

    b = 0
    m = Σxy / Σx²

Looking at the matrix, you can tell the first row and column are not contributing to the equation. If we remove the b term entirely, we can solve the matrix for just the slope and get the same result. So let us now apply this to the graphed data from above. The linear regression line is now driven much more evenly through the data (one point at 0.8 is off the chart). The data set is actually a noisy signal with a slope of one. With a forced intercept the regression calculates a slope of 0.99, very close despite the low signal-to-noise ratio.
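Forcing the intercept to zero leaves only the slope to compute, and the slope has a simple closed form, m = Σxy / Σx², which falls out of the reduced matrix. A quick Python sketch (illustrative only, not the author's PHP class):

```python
def slope_through_origin(xs, ys):
    # Zero-intercept least squares: m = Σ(x·y) / Σ(x²)
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

m = slope_through_origin([1, 2, 3], [2, 4, 6])  # data on y = 2x, so m = 2
```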

Now, how about regression for higher-order polynomials? As I expected, the pattern works out the same: the first row and column can be removed to simplify the math, but we are not going to do that because there is more we can do with this equation. Here is an example of a 2nd order polynomial (a quadratic function) with the zero term forced, and the regression line looks as expected. This is the solution to the question in the e-mail I received. However, the pattern is too tempting to leave alone. Time for more questions.

If the first term can be forced to zero, can we force the other coefficients? Yes. We can force any term to zero if it is nulled out in the regression matrix. To do this, zero out the row and column of the coefficient to be forced, placing a one where the row and column intersect. Why will be explained a little later. Here is an example for a 3rd degree polynomial with the coefficient for the x² term forced to zero:

    | n     Σx    0   Σx³ | | c₀ |   | Σy   |
    | Σx    Σx²   0   Σx⁴ | | c₁ |   | Σxy  |
    | 0     0     1   0   | | c₂ | = | 0    |
    | Σx³   Σx⁴   0   Σx⁶ | | c₃ |   | Σx³y |

That is fine when zero is the desired value for the forced coefficient, but what about when we want a term forced to something other than zero? We can do that too, but we have to account for the fact that we know this term before the calculations start. This is easily shown with the linear regression calculation:

    | 1   0   | | c₀ |   | f         |
    | 0   Σx² | | c₁ | = | Σx(y − f) |

Here what we have done is force the offset coefficient to f. Because the offset is known, we want to remove its impact from the y data. So in the summation involving y we remove the known offset. To expand this to higher-order polynomials we just need to subtract off the effects of the known coefficient. We can use the 3rd order polynomial regression matrix from above to demonstrate this removal:

    | n     Σx    0   Σx³ | | c₀ |   | Σ(y − f·x²)   |
    | Σx    Σx²   0   Σx⁴ | | c₁ |   | Σx(y − f·x²)  |
    | 0     0     1   0   | | c₂ | = | f             |
    | Σx³   Σx⁴   0   Σx⁶ | | c₃ |   | Σx³(y − f·x²) |

The coefficient being forced is the one for the x² term. Thus each summation involving y has this value removed. When this matrix is solved, the calculated coefficient c₂ will equal f. The other coefficients are now independent of the forced coefficient. The matrix could be simplified by removing the nulled row and column; the results are the same, but this form demonstrates how the other coefficients are now independent.
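The whole forced-coefficient procedure can be sketched in Python (a minimal illustration under my own naming, not the author's PHP class): subtract the known terms from y, null the matching rows and columns, and solve:

```python
def fit_with_forced(xs, ys, degree, forced):
    """Least-squares polynomial fit with some coefficients forced.
    `forced` maps power -> fixed value, e.g. {0: 0.0} forces a zero intercept."""
    n = degree + 1
    # Remove the effect of the known coefficients from each y value
    adj = [y - sum(f * x ** p for p, f in forced.items())
           for x, y in zip(xs, ys)]
    # Normal-equation matrix of power sums
    A = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    b = [sum(ya * x ** r for x, ya in zip(xs, adj)) for r in range(n)]
    # Null the row and column of each forced coefficient, inserting the
    # identity pattern so the solve returns the forced value unchanged
    for p, f in forced.items():
        for i in range(n):
            A[p][i] = A[i][p] = 0.0
        A[p][p] = 1.0
        b[p] = f
    # Gaussian elimination with partial pivoting, then back substitution
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            k = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= k * A[col][c]
            b[r] -= k * b[col]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        coeffs[r] = (b[r] - sum(A[r][c] * coeffs[c]
                                for c in range(r + 1, n))) / A[r][r]
    return coeffs

# Data on y = 5 + 3x with the intercept forced to 5
coeffs = fit_with_forced([0, 1, 2, 3], [5, 8, 11, 14], 1, {0: 5.0})
```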

We can force more than one term using the same setup. For each coefficient being forced, null the row and column. Here we have forced the coefficients for x⁰ (the offset) and x² using the forcing values f₀ and f₂, which will end up as c₀ and c₂ respectively. Any term can be removed if y is modified as follows:

    ŷₙ = yₙ − Σᵢ fᵢ·xₙ^(pᵢ)

where f is the set of forced coefficients, and p is the set of powers of the forced coefficients. So if forcing coefficients 0 and 2:

    ŷₙ = yₙ − f₀ − f₂·xₙ²

Then any yₙ in the matrix can be replaced with ŷₙ, and the matrix simplifies accordingly. Another pattern is emerging that explains why we are setting the values in the rows and columns to either one or zero. Force all the coefficients for the above matrix, and the identity matrix appears. This naturally causes the coefficients (cₙ) to be set to the forcing values (fₙ). That is why nulling out a row uses this pattern of zeros and ones: we are simply inserting part of the identity matrix into the places corresponding to each coefficient being forced.

There are fewer math operations if we remove the rows/columns for coefficients we do not need to calculate, but then we need to keep track of which coefficients are actually calculated in the matrix. So in my updates to the Polynomial Regression class, I didn't remove the moot rows/columns. This will affect performance slightly, but the class is designed to do most of the lengthy calculations as the data is added. The Gaussian elimination used for solving the matrix should not take too long, even for the large matrices used on high-order polynomials.

The changes were fairly easy to implement. I made some updates to the polynomial regression class project page as part of the release. Examples now include graphics, which help to demonstrate how to use the new fetchers. I am rather pleased the library is getting used. A few months ago I found the class was being used in an open-source project for doing analysis on metric data from servers. To the best of my knowledge this is the first use of anything I have written in a project I wasn't involved with, but hopefully just one of many to come.

# February 07, 2014

## Weighted Least-square Regression

### Math, Polynomial Regression

Weighted least-square regression is a method used to consider some data points more strongly than others when generating coefficients. When writing about non-linear least-square regression, I ran into some problems with ordinary least-square regression. One solution to these problems is to use a weighting term, and both the problem function and the solution will be explored in this article. We can make this function semilinear by taking the natural log. After some rearranging, we can apply a technique discussed in an earlier article and use a modified linear regression equation. So far there is nothing unusual about this equation. Let us try this on a clean set of data: the graph shows the known data points as dots and the regression as a solid line. This is what we would expect to see.

Now let us apply regression to a set of data with small amounts of noise on the signal. What happened? The introduced noise barely affects the data points, but clearly the coefficients are not correct for the regression line. The deviation from the original data points gets worse and worse toward the right side.

This is one of the problems of finding the coefficients for the linearized version of the function. No account is taken of the non-linearity of the data. So what we end up with are coefficients that minimize the error for the linear form, which does not always translate well back to the non-linear form.

What can be done to fix this? One solution is to use a different curve fitting algorithm, such as Gauss-Newton. Such algorithms suffer from the fact they are iterative, unlike the closed form we have for linear regression.

Another option is to weight the data. For the above graph, we want the data toward the right side to be more strongly considered than the data on the left side when evaluating coefficients. We want this consideration because data toward the right is much more sensitive to error in the coefficients than data to the left. To accomplish this, we need some kind of weighting term. To explore how to weight data, let us start simple and use the mean average.

Consider an average of a sequence of values:

    a = (Σ yᵢ) / n

Here a is the average, and yᵢ is our sequence of data. This is the standard representation of the average. As explained in an earlier article, the mean average is actually just polynomial regression of degree 0. So let's write out the average as polynomial regression in expanded matrix form:

    [ Σ xᵢ⁰·xᵢ⁰ ][ a ] = [ Σ yᵢ·xᵢ⁰ ]

This doesn't look like the mean average, but it can be shown to be so if we reduce. A 1x1 matrix is the same as if it wasn't in a matrix, so we can remove the matrices. Normally one doesn't see that averages have x values associated with y values, but they do, kind of. The reason they are not seen is that the x values are raised to the 0th power, which means every x value is just 1. And the summation of a sequence of ones is just the count of that sequence (if you add up n ones, you get n). So the series can be reduced:

    n·a = Σ yᵢ

Now solve for a by moving n to the opposite side:

    a = (Σ yᵢ) / n

Thus polynomial regression of degree 0 is the average. Now to add a weighting term. Consider what happens when both sides are multiplied by a constant:

    W·n·a = W·Σ yᵢ

Algebraically W does nothing; it simply cancels out. We can move W inside the matrices, and then inside the summations:

    (Σ W·xᵢ⁰·xᵢ⁰)·a = Σ W·yᵢ·xᵢ⁰

Thus far we have assumed W is a constant, but that was just to maneuver it into place. Now that W is where it should be, we can stop making that assumption and allow W to become a weighting term. Let W = wᵢ so that there is a sequence of weights that can be applied to every y value. Our equation becomes:

    (Σ wᵢ·xᵢ⁰·xᵢ⁰)·a = Σ wᵢ·yᵢ·xᵢ⁰

We can drop the matrices and get rid of the x⁰ terms because they are just 1. Solve for a:

    a = (Σ wᵢ·yᵢ) / (Σ wᵢ)

This produces a weighted average with each value yᵢ weighted by wᵢ. If you look up the weighted average, this is the equation you will find. It is equivalent to the mean average when the weighting term is constant, and we can quickly show this:

    a = (Σ w·yᵢ) / (Σ w) = (w·Σ yᵢ) / (w·n) = (Σ yᵢ) / n

Here w represents the fact that the weighting term is the same for all values of y. So a normal average can be thought of as a weighted average where the weighting term is constant.
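The weighted average is easy to check numerically. A tiny Python sketch (names are mine):

```python
def weighted_average(ys, ws):
    # a = Σ(w_i · y_i) / Σ(w_i)
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)

ys = [2.0, 4.0, 6.0]
plain = weighted_average(ys, [1, 1, 1])        # constant weights -> plain mean, 4.0
heavier_last = weighted_average(ys, [1, 1, 10])  # last value dominates: 66/12 = 5.5
```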

We can apply the same weighting method to linear regression. Start with the expanded representation of linear regression, introduce a constant weighting term, move it into the matrix, and then into the summations. Once there, we no longer assume the weight value is constant and instead treat it as the sequence wᵢ. After reducing the powers, we have the weighted form of linear regression:

    | Σwᵢ      Σwᵢ·xᵢ  | | c₀ |   | Σwᵢ·yᵢ    |
    | Σwᵢ·xᵢ   Σwᵢ·xᵢ² | | c₁ | = | Σwᵢ·xᵢ·yᵢ |

What was done here should be pretty straightforward: wᵢ was just placed in all the summations. So let us now apply the weighting term to the linearized version of our problem regression function from the beginning of the article by doing the same procedure. Again, if the weights are all the same, the results will be identical to those of the non-weighted version. We now have a weighting term that can be applied to give unequal consideration to select values of the function. But what should the weights be to help get our curve to fit?
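A minimal Python sketch of weighted linear regression, with wᵢ placed in every summation (illustrative; names are mine):

```python
def weighted_linear_regression(xs, ys, ws):
    """Solve the 2x2 weighted normal equations for y = m*x + b:
        | Σw     Σw·x  | | b |   | Σw·y   |
        | Σw·x   Σw·x² | | m | = | Σw·x·y |
    """
    sw   = sum(ws)
    swx  = sum(w * x for w, x in zip(ws, xs))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swy  = sum(w * y for w, y in zip(ws, ys))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    det  = sw * swxx - swx * swx
    m = (sw * swxy - swx * swy) / det
    b = (swxx * swy - swx * swxy) / det
    return m, b

# Data exactly on y = 2x + 1: any weighting recovers the same line
m, b = weighted_linear_regression([0, 1, 2], [1, 3, 5], [1, 2, 3])
```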

Here is where some thought about the shape of the function along with some trial and error can help. What we can do is place a higher weight on the values to the right side of the graph. This is because the data to the right side is more sensitive to error in the coefficients. Stronger weights for values to the right will have the effect of making data points on the right considered more strongly than those to the left.

Just consider the w sequence to be a function of i. The simplest weighting term is the index itself, wᵢ = i, if we assume the sequence of x is always increasing (that is, x is sorted from the lowest value to the highest). Now let's try computing the regression curve with this weighting term: there is a marked improvement, but the regression line still isn't getting the curve quite where it needs to be. Clearly the right side needs a stronger weight.

Let's try squaring the weight, wᵢ = i², which results in this graph: and that fit is almost perfect. In fact, using this weighting term we can increase the noise and have the regression respond as expected. Here the noise has been increased by two orders of magnitude, but the weighting term still produces a fairly good fit.
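Assuming the problem function is an exponential of the form y = c₀·e^(c₁x) (an assumption on my part; the source shows the function only as a graphic, but this is the usual case that linearizes under a natural log), the linearize-then-weight procedure can be sketched in Python:

```python
import math

def weighted_linreg(xs, ys, ws):
    # Weighted normal equations for y = m*x + b
    sw   = sum(ws)
    swx  = sum(w * x for w, x in zip(ws, xs))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swy  = sum(w * y for w, y in zip(ws, ys))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    det  = sw * swxx - swx * swx
    return ((sw * swxy - swx * swy) / det,    # m
            (swxx * swy - swx * swxy) / det)  # b

# Clean samples of y = 2·e^(0.8x); ln(y) is linear in x
xs  = [0.5 * i for i in range(1, 11)]
lny = [math.log(2.0 * math.exp(0.8 * x)) for x in xs]
weights = [i * i for i in range(1, len(xs) + 1)]  # w_i = i², stressing the right side
m, b = weighted_linreg(xs, lny, weights)
c1, c0 = m, math.exp(b)  # back to the exponential's coefficients
```

On clean data any weighting recovers the exact coefficients; on noisy data the index-squared weights pull the fit toward the sensitive right-hand points, as described above.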

For this particular equation, the higher the power, the better the fit: raising the weight to a higher power produces an even closer fit. This weighting term works well for this particular scenario, where both coefficients are positive values. However, it may not work in other situations. The weighting values need to be evaluated for the particulars of the application to which regression is being applied.

The use of a weighting term has allowed curve fitting of a non-linear function to be solved with a pretty simple regression technique. Unlike methods such as Gauss-Newton, this technique is deterministic and fairly easy to compute. So this is another algorithm available for filtering out noise in real-world scenarios.

# June 01, 2009

## Least-square Regression

### Polynomial Regression

I was taking a look at hard drive prices over the 2 years for which I have data. The plots all include linear regression to help predict the trend of where future prices should be. However, the longer the time scale, the less useful the linear regression line is. Regression doesn't have to be linear. In fact, I already wrote about 2nd degree (quadratic) regression back in September. I had to learn a little matrix math to do this, and having a slightly better understanding now, I can expand the regression to the nth degree. In doing so, there's a chance of finding a higher-degree regression that is a better curve fit for longer time frames of hard drive prices.

Let's first revisit the equation of quadratic curve fitting. Here, x and y are the arrays of data. The value n is the number of values in the array and a, b, c are the coefficients.

We can expand this to an arbitrary degree like so:

    | Σx⁰    Σx¹    ⋯  Σxʲ   | | c₀ |   | Σx⁰y |
    | Σx¹    Σx²    ⋯  Σxʲ⁺¹ | | c₁ |   | Σx¹y |
    |  ⋮      ⋮     ⋱   ⋮    | | ⋮  | = |  ⋮   |
    | Σxʲ    Σxʲ⁺¹  ⋯  Σx²ʲ  | | cⱼ |   | Σxʲy |

Here, j is the polynomial degree desired and c₀ through cⱼ are the coefficients. We can use Cramer's rule to solve this matrix equation. In order to do this, we need a general determinant function. This function must take a matrix of arbitrary size and compute the determinant. This can be done with a recursive function, continually dividing the matrix down to 2x2 pieces. For example, 3x3 and 4x4 matrices can be solved like this: the above matrices are solved by a row and column subdivide. In both, the first term has a multiplied by the determinant of the sub-matrix excluding the row and column that a is in. The signs alternate between columns. This can be expressed more generally as:

    det(M) = Σⱼ (−1)^(1+j)·m₁ⱼ·det(M₁ⱼ)

where M₁ⱼ is the subdivided matrix with row 1 and column j removed. This turns out to be fairly easy to code. I found this example written in C.
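The recursive determinant is indeed short to code. Here is a Python sketch of the cofactor expansion along the first row (the author links a C example; this version is mine):

```python
def determinant(m):
    """Determinant by cofactor (Laplace) expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col in range(len(m)):
        # Minor: the sub-matrix with row 0 and this column removed
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        sign = -1 if col % 2 else 1  # signs alternate between columns
        total += sign * m[0][col] * determinant(minor)
    return total

d = determinant([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 10]])  # 1·(50−48) − 2·(40−42) + 3·(32−35) = −3
```

Note this runs in factorial time, so it is only practical for the small matrices a modest polynomial degree produces; Gaussian elimination scales far better.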

Now that we can solve a square matrix of arbitrary size, we have all the components needed to do the regression. I created this graph using a 6th degree regression: here in red we see the price per gigabyte of 1 terabyte hard drives from June of 2008 to June of 2009. In blue is the 6th degree regression curve. It looks to be a pretty good fit. Unfortunately, this fit is a lucky coincidence for this one case. In general I found no all-around polynomial regression curve that fit the trend data nicely for all time spans. And as for prediction, even with this curve, which looks like it fits well, the future prediction isn't good: the curve turns upward in about the middle of June. So something else will be needed if I am to make predictions about future prices. Despite the fact my least-square regression curve fitting algorithm failed to produce a good future prediction, it is a good algorithm to have around. I went ahead and created a PHP class for calculating this regression to an arbitrary degree, and started a project page for it.

Above we see a plot with regression plotted at 7 different degrees. The horizontal blue line is the 0th degree polynomial, the mean average. The diagonal line is 1st degree, or linear regression. The remaining curves are of increasing degree, up to the orange line, which is the 6th degree.

The higher the degree, the larger the numbers in the summations become. In a 6th degree polynomial, there is a sum of the values raised to the 12th power. Because of this, I had to implement the functions using arbitrary precision arithmetic. This slows an already slow process. The above plot takes about 20 seconds to calculate. Higher orders take much longer, each additional degree taking about twice as long as the previous. There are probably improvements that can be made to my algorithm, but for now, I have something that functions at a general level. I'll have to wait until I take linear algebra before I can dive into this problem more.
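As a side note, the precision problem can be avoided entirely with exact arithmetic. A quick Python sketch using exact rationals (Python-specific; the author's PHP class uses arbitrary precision functions instead):

```python
from fractions import Fraction

# The 12th-power sums needed for a 6th degree fit stay exact as rationals,
# with no floating-point rounding no matter how large they grow.
xs = [Fraction(i) for i in range(1, 6)]
s12 = sum(x ** 12 for x in xs)  # 1 + 4096 + 531441 + 16777216 + 244140625
```

The trade-off is the same one noted above: exact arithmetic is much slower than native floating point.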