The other day I received an email from someone who was using my Polynomial Regression PHP class. They asked about something I had never thought about—how to force a Y-intercept of zero. It was something my library could not do, but something I wanted to understand. So in this article I will share what I learned, and how the problem was solved not just for the one case but for a whole set of cases.
I have written a fair amount in the past about linear regression, quadratic regression, and polynomial least-squares regression, and have even made interactive demos. So I will assume in this writing that the reader is familiar with the basics of polynomial regression.
We will start by defining the original request. Consider this graph:
Here we have 10 points and a linear regression line. The coefficient of determination (R^{2}) value is fairly low, meaning we don't have a great fit. This makes sense because there are not a lot of data points, and the data is quite noisy. The linear regression calculation has produced a Y-intercept of ≈0.15.
What if these samples were part of a physical measurement, and we know by the nature of this measurement that the Y-intercept must be zero? If we know this to be the case, the only variable in the linear regression equation is slope. How do we calculate linear regression slope with a zero Y-intercept?
We could simply run the linear regression algorithm and ignore the Y-intercept.
Here we have calculated the slope and intercept, but ignored the intercept. Notice the trend line is below most of the data points. It is clearly not accurate.
Let us review the equation for linear regression in matrix form:
And solved:
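Written out in summation notation (a restatement in standard symbols; y = mx + b over n samples), the normal equations and their solution are:

```latex
\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}
\begin{bmatrix} b \\ m \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}
\qquad
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2},
\quad
b = \frac{\sum y_i - m \sum x_i}{n}
```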
Remember that the mean can also be expressed in matrix form by removing a row and column:
What this tells us is the intercept term is based on the first row and column. So if we modify our linear regression matrix, we can remove this row and column by making them moot. To do this, we set the first element of the first row and column to one, and the rest of the first row and column to zero (more on why later). In addition, we force the b term to zero like this:
Now we solve the equation and we get:
Looking at the matrix, you can tell the first row and column are not contributing to the equation. If we remove the b term, we can solve the matrix for just the slope. We will get the same result.
So let us now apply this to the graphed data from above:
The linear regression line is now driven much more evenly through the data (one point at 0.8 is off the chart). The data set is actually a noisy signal with a slope of one. When using a forced intercept the regression calculates a slope of 0.99—very close despite the low signal-to-noise ratio.
Now, how about regression for higher-order polynomials? As I expected, the pattern works out the same:
The first row and column can be removed to simplify the math, but we are not going to do this because there is more we can do with this equation.
Here is an example of a 2nd order polynomial (a quadratic function) with the offset term forced to zero, and the regression line looks as expected. This is the solution to the question in the email I received. However, the pattern is too tempting to leave alone. Time for more questions.
If the first term can be forced to zero, can we force the other coefficients? Yes. We can force any term to zero if it is nulled out in the regression matrix. To do this, zero out the row and column of the result to be forced to zero, placing a one where the row and column intersect. Why this works will be explained a little later. Here is an example for a 3rd degree polynomial with the coefficient for the x^{2} term forced to zero:
That is fine if zero is desired for the value of the forced coefficient, but what about when we want a term forced to something other than zero? We can do that too, but we have to account for the fact that we know this term before the calculations start. This can be easily shown with the linear regression calculation:
Here what we have done is force the offset coefficient to f. Because the offset is known, we want to remove its impact from the y data. So in the summations involving y we remove the known offset. To expand this to higher-order polynomials we just need to subtract off the effects of the known coefficient. We can use the 3rd order polynomial regression matrix from above to demonstrate this removal:
The coefficient being forced is the one for the x^{2} term. Thus each summation involving y has this value removed. When this matrix is solved, the calculated coefficient c_{2} will equal f. The other coefficients are now independent of the forced coefficient. The matrix could be simplified:
In this simplified form the results are the same, but it demonstrates how the other coefficients are now independent.
We can force more than one term using the same setup. For each coefficient being forced, null the row and column.
Here we have forced the coefficients for x^{0} (the offset) and x^{2} using the forcing values f_{0} and f_{2}, which will end up in c_{0} and c_{2} respectively. Any term can be removed if y is modified as follows:
Where f is the set of forced coefficients, and p is the set of powers of the forced coefficients. So if forcing coefficients 0 and 2:
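As a formula, the adjustment removes every forced term from each sample (the hat notation for the adjusted value is mine):

```latex
\hat{y}_n = y_n - \sum_{j} f_{p_j}\, x_n^{\,p_j}
\qquad\text{so for forced coefficients } 0 \text{ and } 2:\qquad
\hat{y}_n = y_n - f_0 - f_2 x_n^2
```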
Then any y_{n} in the matrix can be replaced with the adjusted value ŷ_{n} = y_{n} − f_{0} − f_{2}x_{n}^{2}. So the above matrix simplifies to:
Another pattern is emerging that explains why we are setting the values in the nulled rows and columns to either one or zero. Force all the coefficients for the above matrix and the result will be:
Forcing all the coefficients causes the identity matrix to appear. This will naturally cause the coefficients (c_{n}) to be set to the forcing values (f_{n}). This is why nulling a row out uses this pattern of zeros and ones—we are simply inserting part of the identity matrix into the places corresponding to each coefficient being forced.
There are fewer math operations if we remove the rows/columns for coefficients we do not need to calculate, but then we need to keep track of which coefficients are actually calculated in the matrix. So in my updates to the Polynomial Regression class, I didn't remove the moot rows/columns. This will affect performance slightly, but the class is designed to do most of the lengthy calculations as the data is added. The Gaussian elimination used for solving the matrix should not take too long, even for the large matrices used on high-order polynomials.
The changes were fairly easy to implement. I did some updates to the polynomial regression class project page as part of the release. Examples now include graphics which help to demonstrate how to use the new fetchers. I am rather pleased the library is getting used. A few months ago I found the class was being used in an open source project for doing analysis on metric data from servers. To the best of my knowledge this is the first use of anything I have written in a project I wasn't involved with, but hopefully just one of many to come.
Next in our series of multiplication of large values we are going to add template specialization and inline assembly to improve performance.
In C++, template specialization means redefining a template for specific types, with an implementation tailored to those types. This works out well for our multiplication function, which currently does four multiplications for each word it has to multiply.
The C standard library <stdlib.h> has a function called div. This function returns a structure that has both the quotient and the remainder. It makes sense to have this function because both are usually computed at once by the processor. The Intel x86 instruction set also computes the upper and lower word when performing multiplication. I wondered why there wasn't a multiplication function in the C library, but as I came to find out, Intel processors are somewhat unique in this regard. Many other processors, such as the ARM family, do not compute the upper word of a multiplication. However, this does allow us to implement an Intel-specific acceleration using template specialization and some inline assembly.
While I am a big fan of the GNU compiler, one thing I really don't like is the use of AT&T assembly for Intel processors. The order of operands is often reversed and it is much more cluttered than Intel's assembly syntax. However, it is the inline assembly offered, so we will work through our example using it.
Let us first examine the code for a 64-bit multiply.
Here one can see we have 4 instructions, 3 of which move data around. The multiplication is one instruction, and the results of the multiplication are stored in two words: the lower word in rax and the upper word in rdx. Intel's one-operand multiply only works against the accumulator register, and the results are always stored in the accumulator and data registers. The SIMD instructions allow more options, but we are not going to get into those in this article.
A bit on the template itself, since template specialization is one of those areas of C++ not often discussed. It is very simple: we have defined the template leaving the type blank on the template line, but have filled in the type on the function line. Basically we are saying that this is an implementation of multiplyWords< uint64_t >. All the types have been filled in so that this template could work as a standalone function. This makes it a great mechanism for adding type-specific acceleration.
So how much does this improve performance? Our default template must do four multiplications, several additions, and several bit operations. Unless the optimizer comes up with something pretty impressive, it will be hard to beat 4 instructions. When compiled there was about a 5.7% speed improvement without optimization turned on, but only a 1.2% improvement when it was turned on.
Since we are using GNU C-specific syntax, we may as well use a bit more. This page is a great resource for processor identification using compiler-specific macros. For our implementation we are using 64-bit specific instructions, thus we can check for the macro __x86_64__. If it is defined, we have 64-bit instructions available. In this way we can implement template specializations for a variety of processors and determine which optimizations can be used at compile time.
Only a 5.7% speed improvement though?! It is true. Perhaps one day I will implement an entire specialized template for multiplyArray. Doing so in pure assembly should greatly improve speeds because we can do much of the work in registers. However, I have no project that really needs such work and I am mostly done with my academic implementation.
This concludes my set of articles on large multiplications. I hope you have enjoyed them.
The other day I wrote about doing multiplication of two 64-bit numbers. What about multiplication of arbitrarily long numbers? Typically when this is the case, one uses an arbitrary-precision math library. There are several available and they are usually the best way to take care of the times when large numbers are needed. However, for academic reasons I thought I'd explore this topic.
Multiplying two large numbers means one first has to define how the numbers are stored. Naturally, arrays will be used. But how are the arrays arranged? They could be made of any bit width. And there is the question of endianness—do we store the most significant word first or last? What if we desire all of this to be variable?
To accomplish this we can use C++ templates. We need templates because we do not know the data types of the arrays ahead of time, but by using templates we can allow this to be determined at compile time.
Let's examine the code. The primary function is multiplyArray. This template function can accept any type that supports math and bit operations. Like all template functions, it must be fully defined in the header file so the compiler can generate the correct code when the function is used. There are three loops nested in one another. The first two loop through each word of their respective arrays. Inside the first two loops is where the multiplication of the two selected words takes place. Note there is a call to a multiply template function. This template function is almost identical to the one I wrote about yesterday, with the difference being it is now a template function that can handle whatever data type is specified. In the future I will write about why this is useful.
Once multiplied, the results must be accumulated into the running sum. This is done in the last loop. One item to pay attention to is the accumulator, and the test to see if the accumulator is less than the lower word. C provides no way to tell if adding two numbers has overflowed—there is no "add with carry". However, when an addition overflows, the result will be less than either of the values added together, and that is the check used. If that happens, the code bumps the upper word to account for the carry.
It should be noted that no matter what values are being multiplied, the upper word will never overflow from adding one. The upper word only starts with the overflow from the multiplication, which is always less than the largest value the upper word can hold. After that, the value added to it is either one or zero.
The design of this code supports either endianness. If the endianness of the words is set up the same as the machine's endianness, one can treat the arrays as unions—an array of 32-bit integers can also be viewed as an array of 8-bit bytes. The design is also completely portable and machine independent. Any C++98-compliant compiler will work.
For testing this unit I created some functions that turn the array into a string, and came up with some test vectors that I compared against values computed with the GP/PARI calculator. Maybe one day I will make a simple library that includes the string functions and test vectors. For now my test code is not available.
I have released this source code under the MIT license. I chose this license because I feel the work I am doing is so basic it doesn't deserve to be maintained as open source. At the same time it is complete enough to be used as a library. The MIT license lets people use the source by itself, or in a derived work, open or closed source, and keeps my name in the credits. Why is that important? While there is some pride in having my name attached to my work, it's also a matter of promotion. I am a contract software engineer, and if someone sees my work and wants to hire me to do more, they will know who to contact if my name is in the license. It is a long shot, but I myself would select a programmer for hire based on software I had seen they had written.
In C, multiplication overflow is ignored. If one multiplies two 32-bit numbers and the result is larger than 32 bits, the upper bits go into the bit bucket. As long as the result fits into 32 bits, nothing is lost. If the result will be larger than 32 bits, one must use 64-bit data types. However, as of 2011 the C standard has no integer data types larger than 64 bits. So if larger numbers are desired, one either has to use floating-point, or make their own multiply. At work this month I ran into a problem where I needed to multiply two 64-bit numbers that would overflow a 64-bit result. After the work day was complete, I wrote a simple function to do this, and we will examine how it works in this article.
When doing multiplication of binary numbers, the result will never be larger than twice the number of bits of the data types used. That is, two 32-bit numbers multiplied together will never have a result larger than 64 bits. The reason can be expressed mathematically.
Where n is the number of bits.
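The bound can be shown by squaring the largest n-bit value:

```latex
\left(2^n - 1\right)^2 = 2^{2n} - 2^{n+1} + 1 < 2^{2n}
```

So the product of two n-bit values always fits in 2n bits.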
So when multiplying two 64-bit values, it is known the result will always fit into 128 bits. We can construct this result using two 64-bit values—an upper word and a lower word.
Doing the multiplication, one has to think back to their grade school days and long multiplication. Let us first work a simple problem: 12 × 34.
      1 2
    × 3 4
    -----
      4 8
    3 6 0
    -----
    4 0 8
The steps are to first multiply the least-significant digit of 34 by the least-significant digit of 12 (4 and 2) and place the result in the right-most column. Then multiply the least-significant digit of 34 by the most-significant digit of 12 (4 and 1), placing that one column to the left. Next multiply the most-significant digit of 34 by the least-significant digit of 12 (3 and 2), again shifted one column to the left, and lastly the most-significant digit of 34 by the most-significant digit of 12 (3 and 1), shifted one more column to the left. Sum the partial results with carries and we have our end result.
These steps are precisely what is needed in order to multiply two 64-bit numbers. But rather than digits, each column will represent 32 bits. If you still think of each column as a digit, the following chart is an algebraic representation of the above. To change to 32 bits, just consider that rather than each column counting from 0 to 9 before carrying to the next column, it counts from 0 to 2^{32} – 1 before carrying to the next column.
                au     al
    ×           bu     bl
    ---------------------
             bl*au  bl*al
    bu*au    bu*al      0
Here au is the upper 32 bits of the first word (a), and al is the lower 32 bits. The only problem with looking at the setup this way is that it does not show the word sizes. So let us arrange things like this:
                        au      al
    ×                   bu      bl
    ------------------------------
                  [   bl*al    ]
          [   bl*au    ]
          [   bu*al    ]
  [   bu*au    ]
Now we can see that the results of the multiplications are twice the width of the two values being multiplied, and how that lines up with the other values in the final summation. Again, each column is 32 bits, and the four total columns mean we have a 128-bit result. We split the input into two 32-bit numbers so that each individual multiplication never has a result larger than 64 bits. This way there is never an overflow. The trade-off is that we now have to do 4 multiplications and several additions.
So here is the source code for doing the multiplication.
There you have it. The code is easily adapted to other integer sizes. By changing the types to uint32_t and adjusting the mask and shift values, this function could produce a 64-bit result. C already has built-in support for 64-bit types, but it could be done nonetheless.
The source code is here.
Here's a little math coding snippet. Say you need to create a string from an integer. How many digits will it have?
The primitive C integer types are 8, 16, 32, and 64 bits. The largest number that can be represented by this number of bits is 2^{n} – 1, where n is the number of bits. The subtraction of one is because the count starts at zero.
How does this help determine the number of digits the string will contain? To understand that, let us first examine how many digits various numbers have.
We use Hindu-Arabic numerals to represent numbers. Most people say Arabic numerals, but it was actually ancient Indian mathematicians who developed the system. Arab traders introduced the numbering system to Europe, and hence the name credits the number system to them. I think they both deserve credit, hence I refer to them as Hindu-Arabic numerals. These numerals are in base 10, as we have 10 digits, 0 to 9.
Logarithms will tell us how many characters are needed to represent any number in any number base. For example, 1000 has 4 digits. We can also use the log base 10 to calculate the same thing.
Where d is the number of digits. The log base 10 of 1000 happens to come out to a whole number. To get the number of digits for any number requires adding the floor function:
For example, 12345 has 5 digits. The log of 12345 is ≈4.091. The floor function is just the integer part of the number, so 4. Add one to that and we have our 5 digits.
How does this help with our original problem? Well, computers store numbers in powers of two. Most understand this as binary, but it is really powers of two. For example, the binary value 1101 = 2^{3} + 2^{2} + 2^{0} = 13. This binary value is 4 bits, and the largest value that can be represented using 4 bits is 2^{4} – 1 = 15. If we know how many digits are in 2^{4}, then we know the maximum number of digits any 4-bit value can have.
This can be used for any number of bits by substituting n for the exponent 4:
This can be simplified using logarithm identities:
And the log base 10 of 2 is just a constant, which happens to be ≈0.30103.
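Putting the pieces together, the maximum digit count for an n-bit value is:

```latex
d = \left\lfloor \log_{10}\!\left(2^n - 1\right) \right\rfloor + 1
  = \left\lfloor n \log_{10} 2 \right\rfloor + 1
  \approx \left\lfloor 0.30103\, n \right\rfloor + 1
```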
That is a fairly simple equation to calculate. Using 0.30103 as the constant produces accurate values up to around 300,000 bits—far more than is needed for any C primitive. Using 0.3 is good enough for all C primitives and works up to 255 bits.
To stick with integer types and avoid floating-point values, we can add in a multiply by 10.
That's great for C primitives, but not too useful—there are plenty of string functions and so few types to worry about. What about very large numbers with an arbitrary number of bits? For this we can use fixed-point math.
Here c is the constant representing log_{10}( 2 ), and f is the fixed-point shift. Any algebra student will tell you that the 2^{f} will cancel, which is true. The reason we use this is because the fixed shift and the constant will be combined to form a new constant: C = 2^{f}c = 2^{f} * log_{10}( 2 ). So the equation becomes:
Remember that a divide by any power of two is actually just a bit shift to the right, and that is very cheap to compute. Now we just need a value for C for as many bits as we would like to have accuracy.
    Shift (f)   Value (C)        Accuracy (bits)    Accuracy (bytes)
    ---------   --------------   ----------------   ----------------
       32       10,343,311,892    274,877,906,936     34,359,738,367
       30        2,585,827,973    274,877,906,936     34,359,738,367
       24           40,403,562        536,870,904         67,108,863
       16              157,826            262,136             32,767
        8                  617             65,528              8,191
So there it is. Even using an 8-bit shift will allow the calculation to be accurate for numbers over 8,000 bytes in length—more than 19,000 digits long. Using a 16-bit shift is good for 32,000-byte numbers—79,000 digits in length. And 30 bits is enough for 34 billion bytes—almost 83 billion digits, if you ever had the desire to print that many.