I've been working on a graphing project and I needed some data to plot on a chart. One of the items I wanted to demonstrate was linear regression, so I wanted a set of data that was centered around a line. The simplest way to get this was to run a line function (y = mx + b) and add a small random offset to b for each point. That gives a uniform scatter across whatever the range of the offset is.
f(x) = m x + b + s ( r - 0.5 )
The above chart shows an example of uniform scatter. It is easy to discern the boundaries of the scatter coefficient.
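A minimal Python sketch of the uniform scatter formula, f(x) = m x + b + s ( r - 0.5 ). The function name `uniform_scatter_points`, the seed, and the sample values are assumptions made for illustration, and r is assumed to be drawn uniformly from 0 to 1.

```python
import random


def uniform_scatter_points(m, b, s, xs, rng=random):
    """Return (x, y) pairs with y = m*x + b + s*(r - 0.5).

    r is drawn uniformly from [0, 1), so each point lies within
    s/2 of the line y = m*x + b.
    """
    return [(x, m * x + b + s * (rng.random() - 0.5)) for x in xs]


# Illustrative values only: slope 2, Y-intercept 1, scatter coefficient 4.
rng = random.Random(42)  # fixed seed so the run is repeatable
points = uniform_scatter_points(m=2.0, b=1.0, s=4.0, xs=range(10), rng=rng)
for x, y in points:
    print(x, round(y, 3))
```

Every point lands inside a band of total height s around the line, which is why the edges of the band are so easy to see on a chart.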
That's alright, but I wanted something a little better--more centered around the line itself. I first tried a sine-based function. The idea is to weight the random data in a non-linear fashion, and the sine function will do this. This just ended up giving a narrower distribution band, and the band was still clearly discernible... these are not the droids we're looking for.
This chart displays an example of sine-based scatter. It offers a slightly better distribution, but it is still easy to discern the boundaries of the scatter coefficient.
For my next idea, I thought of a function with a strong curve and an asymptote. I came up with an inverse square root, and this turned out to be exactly what I wanted.
Here is the function:
In this function, m is the slope, b is the Y-intercept, s is the scatter coefficient, and r is a random number such that 0 < r <= 1.
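Here is a rough Python sketch of one way an inverse square root scatter could be written. The exact form used here, a random sign times s * (1/sqrt(r) - 1), is an assumption and not necessarily the original function; the names `inverse_sqrt_offset` and `scattered_line` are also made up for illustration. The symbols m, b, s and r follow the definitions above.

```python
import math
import random


def inverse_sqrt_offset(s, rng=random):
    """Assumed form: random sign times s * (1/sqrt(r) - 1), with 0 < r <= 1."""
    r = 1.0 - rng.random()  # random() is in [0, 1), so r is in (0, 1]
    sign = 1.0 if rng.random() < 0.5 else -1.0
    # 1/sqrt(r) is 1 at r = 1 and grows without bound as r approaches 0,
    # so subtracting 1 puts the most likely offsets right on the line.
    return sign * s * (1.0 / math.sqrt(r) - 1.0)


def scattered_line(m, b, s, xs, rng=random):
    """Return (x, y) pairs scattered around y = m*x + b."""
    return [(x, m * x + b + inverse_sqrt_offset(s, rng)) for x in xs]


rng = random.Random(7)  # fixed seed so the run is repeatable
offsets = [inverse_sqrt_offset(1.0, rng) for _ in range(10000)]
within_s = sum(1 for d in offsets if abs(d) <= 1.0) / len(offsets)
print("fraction of offsets within s of the line:", within_s)
```

Under this assumed form, an offset stays within s of the line whenever r >= 1/4, so roughly three quarters of the points sit close to the line while a few stray much further out.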
This chart shows an example of inverse square root scatter. The majority of the points tend toward the line function, but in theory the points can actually be out to infinity. In practice, the range is much more limited. For one, floating point numbers are limited in size. Assuming a 32-bit floating point, the smallest positive normalized number is about 1.18E-38, which would result in an outer boundary of about 9.2E+18 (9 quintillion) times the scatter coefficient. The likelihood of this is 1 in 2,147,483,648 points. However, the random number generator is sure to produce numbers with a more limited range.
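The outer boundary can be checked with a few lines of Python. This sketch assumes the boundary comes from taking the inverse square root of the smallest positive normalized 32-bit float, which is 2 to the power of -126 (about 1.18E-38); the variable names are illustrative.

```python
import math

# Smallest positive normalized IEEE 754 single-precision value.
smallest_normal_f32 = 2.0 ** -126

# The inverse square root of 2**-126 is exactly 2**63.
bound = 1.0 / math.sqrt(smallest_normal_f32)

print(smallest_normal_f32)  # about 1.18E-38
print(bound)                # about 9.2E+18
```

Multiplying this bound by the scatter coefficient gives the furthest a point could theoretically land from the line under that assumption.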
What's nice about this function is that with a fairly large scatter coefficient, the graph looks quite messy, but the mean, slope and Y-intercept still end up being relatively close to those of the original line.
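To try this out, here is a small Python sketch that generates scattered points and fits a line back to them with ordinary least squares. The scatter form (a random sign times s * (1/sqrt(r) - 1)) is an assumption, as are the function names and the sample values, so treat the result as a rough experiment, not a reproduction of the charts.

```python
import math
import random


def fit_line(points):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def scattered_point(x, m, b, s, rng):
    """Assumed scatter form: random sign times s * (1/sqrt(r) - 1)."""
    r = 1.0 - rng.random()  # r is in (0, 1]
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return x, m * x + b + sign * s * (1.0 / math.sqrt(r) - 1.0)


rng = random.Random(2024)  # fixed seed so the run is repeatable
m, b, s = 2.0, 1.0, 25.0   # illustrative values with a large scatter coefficient
data = [scattered_point(x, m, b, s, rng) for x in range(1000)]

slope, intercept = fit_line(data)
print("fitted slope:", slope, "fitted Y-intercept:", intercept)
```

Because the assumed offsets are symmetric about the line, the fit tends to drift back toward m and b as more points are added, although an occasional extreme outlier can still pull it noticeably.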