Simple Linear Regression from scratch!

Rodrigo Dutcosky
8 min read · Jan 20, 2021

Hello there!

If you've read any of my posts so far, you probably know I'm developing a special series called Practical Implementation.

In this series, you've probably noticed how much detail I give to the pre-processing stage of each algorithm I put into practice. That's not because the theory behind these models isn't important. It is. A lot.

But at the end of the day, knowing when to use algorithm X or Y is enough to put your model in production. You just don't have to know every math formula going on behind the curtains; they were all previously coded by someone, somewhere.

Which basically means you only have to call the function, set the desired parameters, and you're good to go!

My purpose in starting this series (besides it being a pretty cool hobby in quarantine times) is all wrapped up in that thought. I wanted to show how possible it is to create machine learning models even if you don't come from an academic background.

I'm gonna steal a sentence I saw in another Medium post the other day. I really wanted to give credit, but I just can't find where I saw it.

You have to program better than a mathematician and know math better than a programmer.

It made total sense to me when I read it.

But I'm losing the subject of this post! My bad.

I've also received some cool feedback on my posts saying people like the way I write, because I make things easier to understand. To be honest, I feel this is because I entered the Analytics world from the other end. I reached Machine Learning classes waaaay after I knew how to play with data.

I wouldn't be able to fancy up my posts with heavy math even if I wanted to!

This time I propose something different. I'm going to focus more on the theoretical part of the algorithm. Also, no sklearn!

Why Simple Linear Regression?

I'm not gonna lie here and tell you I can build a Neural Network from scratch... far from it.

Simple Linear Regression is actually so simple that it should be the first model everyone starts this journey with. It takes nothing but linear algebra to understand what's going on when you run that good old sklearn function.

But don't underestimate it. If you have absolutely no idea how a model makes predictions, it's also the perfect way to flip that switch that will make you go: AAAAAH!

Sorry for the long introduction. Let's get to it.

Simple Linear Regression

Imagine you have only two numeric features in your dataset. Do you think you could predict the value of one based on the other?

This model is no different from the others when it comes to the workflow. Same old story: give historical data to your model, it learns from it, and you start making predictions.

What makes it so simple is the fact that you're only working with two features!

So basically, the model is trying to understand the behavior of one feature based on the other. Let's call these two the dependent and independent variables.

If you want to apply this model, the features need to be correlated.

Saying two variables are correlated is the same as saying they have a statistical relationship. For every pair of continuous features, there's a positive, negative, or non-existent correlation.

Quick example:

The higher someone's salary, the more they will spend on junk food.

Do you think these events are correlated with each other? We don't have to guess. We can let the numbers speak for themselves. The correlation between two features can range from -1 to 1.

If the dependent variable moves in the same direction as the independent one, they have a positive correlation (closer to 1).

If they move in opposite directions, they have a negative correlation (closer to -1).

Either positive or negative correlations are valid for applying Simple Linear Regression. The only thing they can't be is uncorrelated. Correlation values close to 0 mean the independent variable does a terrible job of explaining the dependent variable.

I'll show you a super easy way to check whether variables are correlated in Python in just a moment. In the meanwhile: we're talking about two numeric features, right? A good way to find that out visually is to make a scatter plot.

Every pair of data points will have its own spot in a two-dimensional space, with the two features as the values for the x and y axes. If salary and junk food expenses were not correlated, the scatter plot would just be a shapeless cloud of points.

On the other hand, if these two variables had a positive correlation, the points would line up along an upward trend.
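
To make this concrete, here's a minimal sketch of both checks, the number and the picture, using made-up salary and junk food values (the data below is invented purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented example data: monthly salary vs. junk food spending
salary = np.array([2000, 2500, 3000, 3500, 4000, 4500, 5000])
junk_food = np.array([80, 95, 110, 130, 138, 160, 175])

# np.corrcoef returns a 2x2 correlation matrix;
# the off-diagonal entry is the correlation between the two features
r = np.corrcoef(salary, junk_food)[0, 1]
print(f"correlation: {r:.2f}")  # close to 1 -> strong positive correlation

# Visual check: each (salary, junk_food) pair gets its own spot
plt.scatter(salary, junk_food)
plt.xlabel("salary")
plt.ylabel("junk food spending")
plt.show()
```

If you're working with a pandas DataFrame, df.corr() gives you the same numbers for every pair of columns at once.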

One last important thing to tell you about correlation…

Correlation does not imply causality.

Just don't ever claim causation from a correlation in front of a statistician (they hate it, apparently).

Time to start getting into a little bit of linear algebra!

How do the predictions happen?

Allow me to remind you of one of those boring, never-gonna-use-this-for-anything-in-my-life math formulas.

This is how we represent a straight line on a Cartesian plane:

y = a·x + b

You can see we were just talking about y and x. They're the two features we discussed and plotted on the scatter plot, so you're already pretty familiar with those.

Now what I want to focus on are the a and b in this formula. Actually, slope is what you call a, and the fancy name for b is intercept.

The intercept b is the value where the line crosses the y axis. The slope is responsible for changing the angle of the line relative to the x axis. Which means you can basically put the line anywhere you want if you have control of these two guys, right?
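
A quick worked example: with a = 2 and b = 1, the point x = 3 lands at y = 2·3 + 1 = 7. Double the slope to a = 4 and the same x = 3 now gives y = 13 (a steeper line), while changing b just slides the whole line up or down the y axis.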

And why the hell am I talking about moving lines around?

That's because Simple Linear Regression predictions come straight from a line equation!

(AHHHH! moment happens here, I guess..)

Ok, so we have the x and y values; they're the feature pairs we'll train the model with. What we need to do now is find the values that best suit the slope and intercept in our equation.

How? Keep up with me on this next topic.

Minimizing the Errors

Every machine learning model is trained by calculating its errors and trying to shrink them. Anywhere we decide to place our line on that plot will leave us with a certain amount of error.

Let me place a line somewhere on that plot to show how this error is calculated.

If I place my line like this, the sum of all the points' distances from the line is what we can call the error of the equation. (In practice, Simple Linear Regression sums the squared vertical distances, so points far from the line are penalized more.) These distances are represented by the red lines on the plot.

So, summing up all those distances gives us a number, right? What if I move the line a little to the right? Or a little to the left, maybe?

We know where the line lands is a result of the slope and intercept values. The algorithm finds where to put the line so that it ends up with the least amount of error.
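
In fact, for Simple Linear Regression nobody has to test every possible line: minimizing the sum of squared distances has a closed-form answer. With x̄ and ȳ as the means of the two features, the classic least-squares formulas are:

a = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

b = ȳ − a·x̄

These two formulas are exactly what we'll code up in a minute.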

Finding that perfect spot gives us everything we need to start making predictions!

Coefficient of Determination

To get some measure of how much of the variation in the dependent variable can be explained by the variation in the independent one, we calculate the Coefficient of Determination, also known as R-squared (R²).
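
One common way to compute it: R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals (those red-line distances, squared) and SS_tot is the total squared variation of y around its mean. A minimal sketch in plain numpy (the function name is mine, just for illustration):

```python
import numpy as np

def r_squared(y_true, y_pred):
    # Residual sum of squares: the variation the line fails to explain
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares: the overall variation of y around its mean
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot
```

An R² of 1 means the line explains all of the variation; an R² near 0 means it explains almost none.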

Below is a dataset of 30 records: the salary of a given professional and the years of experience they have in the market.

The linear regression formula can be coded from scratch in just a few lines of code.
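
Here's a minimal sketch of how that could look, assuming the data sits in a CSV called Salary_Data.csv with YearsExperience and Salary columns (the file and column names are my assumption, just for this sketch):

```python
import numpy as np
import pandas as pd

# Hypothetical file/column names for the 30-record dataset described above
data = pd.read_csv("Salary_Data.csv")
x = data["YearsExperience"].values
y = data["Salary"].values

# Closed-form least-squares estimates for slope (a) and intercept (b)
x_mean, y_mean = x.mean(), y.mean()
a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - a * x_mean

print(f"slope (a): {a:.2f}")
print(f"intercept (b): {b:.2f}")
```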

Coefficients a and b are found by minimizing the error of the function.

Both coefficients can then be plugged into the simple linear regression formula to predict the value of y from a known value of x.
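
Sticking with the sketch above, the prediction is nothing more than the line equation again:

```python
# Predict the salary for a professional with 20 years of experience
years_of_experience = 20
predicted_salary = a * years_of_experience + b
print(f"predicted salary: {predicted_salary:,.0f}")
```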

Our model predicts a salary of 214k for professionals with 20 years of experience, based on the 30 data points used to train it.

Here's the bad part...

This is way too simple. I'm planning to make a dedicated post discussing algorithm complexity and the pros/cons of each, but given the business problem of "predicting someone's salary," this code is for example purposes only. We know someone's salary depends on many more external variables than just years of experience in the professional market (and a few more than just 30 records would help too...).

What other features should be used as input for a predictive model to get higher accuracy on these predictions? And by the way, what does "high accuracy" even mean here?

Most of these questions don't have a straight answer. That's why, if I had to train a model to predict salaries, one of the features I wouldn't leave out would be: is_data_scientist_flg

Thanks!

Rodrigo Deboni Dutcosky
