Linear Regression : A Beginner’s Guide
Today we will talk about Linear Regression in a simple, intuitive way and learn its internal mechanism.
To better understand most of the concepts of Linear Regression, you should understand how Logistic Regression works. You can refer to my blog on Logistic Regression.
The first thing to know about Linear Regression is that, unlike Logistic Regression, it is actually a regression technique.
Henceforth, I will refer to Linear Regression as LiR and to Logistic Regression as LR.
Like LR, LiR also tries to find the best Hyper Plane, the one that reduces the error with which we can estimate the real values of the data.
Let us assume we have a dataset for predicting some real values, like the price of a car or the height of a person, given other parameters. The data points are spread across the spectrum along each dimension, and we need to find the line / plane / hyper plane which, when used for prediction, gives the lowest error among all.
Let’s imagine that our data points look like below when we try to plot them.
Now, our task is to create a Hyper Plane which covers most of the data and keeps the error minimum. By minimum error I mean that when the data points are projected onto the plane, it gives the lowest collective error.
We need to understand one thing: we can only predict values for points that lie within the span of the Hyper Plane and not outside it. Let me show you an example of that.
Now you will get a clear picture of what’s happening. Imagine the model determines ∏ to be the best Hyper Plane, the one that estimates the data with minimum error. Not all points, as you can see, lie on the hyper plane, so the perpendicular distance from the plane to a point is considered the error. From LR, we know that everything eventually boils down to an optimization problem, where we have to maximize or minimize some entity; here we minimize the error, i.e. the distance of each point from the plane.
How is the Hyper Plane determined?
Well, for that you can refer to my previous blog mentioned above. It uses something called an optimization equation to find the best hyper plane which fits our data.
Now, as the name suggests, we have to optimize something; in this case we have to minimize the error, as mentioned above.
How do we minimize the error?
By choosing the best hyper plane.
How do we choose the best hyper Plane?
Let’s understand this using simple geometry.
As you can see, I have drawn a green boundary around all the points in the plane; now, with normal intuition, we can say which hyper plane best fits our data.
If we look at ∏1, it covers some of the data, but a majority of our points are far away from the hyper plane, which results in a high value of loss.
For ∏2, the plane divides the data well; it is better than ∏1, but there is still scope for improvement.
For ∏3, this can be considered the best plane: it covers all the points and passes through a majority portion of the data.
Now, if we take the ellipse as an estimate of our data spread, which hyper plane passes through the majority of the ellipse? It is ∏3, and hence it is the best among the rest.
This is how our model finds the best hyper plane, in an intuitive way.
What loss function does the Model Optimize?
It’s called the Root Mean Squared Error
RMSE = √( ∑ (yᵢ − yₚ)² / n )
Where
yᵢ : the actual value for the iᵗʰ point
yₚ : the predicted value for the iᵗʰ point
n : the total number of points
Now, we know that
Error = yᵢ − yₚ
Now, why the square and square root?
This can be best explained with a simple example :
Imagine I have 2 data points, one being 0.5 and one being -0.5, and my model predicts 0 for both.
So the actual errors here are 0.5 − 0 = 0.5 and -0.5 − 0 = -0.5.
Hence, if we add the errors up, -0.5 + 0.5 makes 0, which would mean our model is perfect, but in reality this is not the case. Hence we square the term to get rid of the sign; dividing by n at the end gives the mean value, and then we take the square root to bring the squared terms back to the original scale.
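The toy example above can be checked in a few lines of Python (the two points and the zero predictions are the ones from the example; `rmse` is just a variable name introduced here):

```python
import math

# Actual values and the model's predictions from the example above.
y_actual = [0.5, -0.5]
y_pred = [0.0, 0.0]

# Raw signed errors cancel each other out.
raw_errors = [a - p for a, p in zip(y_actual, y_pred)]
print(sum(raw_errors))  # 0.0 -- looks "perfect", but it is not

# Squaring removes the sign, the mean averages over n points,
# and the square root brings the result back to the original scale.
rmse = math.sqrt(sum(e ** 2 for e in raw_errors) / len(raw_errors))
print(rmse)  # 0.5
```

Note how the plain sum hides the error completely, while the RMSE correctly reports 0.5.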
The key difference between Logistic Regression and Linear Regression is the Loss function, rest remains the same.
And that’s how, we get an intuitive explanation of how Linear Regression Works.
What does the Hyper Plane equation look like?
As mentioned in LR, the Equation for the Hyper Plane almost stays the same.
y = WᵀX + b + e
where “e” is the error term.
And we have to reduce the Error here.
Let’s consider our equation above :
We know from LR that WᵀX + b represents the hyper plane. Let’s consider that it passes through the origin, hence b = 0.
Therefore, WᵀX gives us the predicted y value. From our equation above, we can say that yᵢ = WᵀX + e (that is, Actual = Predicted + error).
Now, yᵢ − WᵀX = e
Hence this states that Actual − Predicted gives us the error, and that is what we need to reduce.
That is how our optimization equation works for different values of the hyper plane, and as the hyper plane is just a vector of coefficients, we can pretty much find it via our loss function.
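The rearrangement Actual = Predicted + error can be sketched numerically. Here W, X and the y values are toy numbers chosen only for illustration, with b = 0 as assumed above:

```python
import numpy as np

# Toy data: 3 points in 2 dimensions, with an assumed weight vector W and b = 0.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
W = np.array([0.5, 0.5])
y_actual = np.array([1.6, 1.4, 3.1])

y_pred = X @ W           # WᵀX for each point
e = y_actual - y_pred    # Actual - Predicted = error
print(e)
```

Adding `e` back to `y_pred` recovers `y_actual` exactly, which is the identity the derivation relies on.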
Code from Scratch :
Let’s Understand it with Code :
For this we will use the dataset which has 2 columns :
- Head Size (cm³)
- Brain Weights (grams)
Basically, given any head size, we will predict the brain weight of that particular person. We will only use 2 dimensions; this way it’s easy for us to understand how LiR works.
Now, the Hyper Plane will be represented with 2 coefficients, as we only have 2 dimensions. Hence, it’s a line in this case.
Let’s call the coefficients “w” and “b”
So our Line is represented using the 2 coefficients as follows :
y = w*x + b
How to calculate w and b ?
w = [ ∑ (x − µx)(y − µy) ] / ∑ (x − µx)²
b = µy − w*µx
where,
µx : mean of x
µy : mean of y
So the prediction for any point is calculated as y = w*x + b.
Implementation :
Reading the Dataset :
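A sketch of the loading step. The file name and column names here are assumptions (adjust them to match the actual dataset); a tiny sample CSV is written first so the snippet is self-contained and runnable:

```python
import pandas as pd

# Hypothetical file/column names standing in for the real dataset.
# A small sample is created inline so this snippet runs on its own.
sample = "Head Size(cm^3),Brain Weight(grams)\n3738,1297\n4261,1335\n3777,1282\n"
with open("headbrain_sample.csv", "w") as f:
    f.write(sample)

data = pd.read_csv("headbrain_sample.csv")
X = data["Head Size(cm^3)"].values      # head sizes
Y = data["Brain Weight(grams)"].values  # brain weights
print(data.shape)
```

In practice you would point `read_csv` at the real dataset file instead of the inline sample.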
Calculating the Mean:
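The means are a one-liner with NumPy. X and Y here are small toy arrays standing in for the dataset’s two columns:

```python
import numpy as np

# Toy stand-ins for the Head Size / Brain Weight columns.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

mean_x = np.mean(X)  # µx
mean_y = np.mean(Y)  # µy
print(mean_x, mean_y)  # 3.0 4.0
```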
Calculating the Coefficients:
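The coefficient formulas above translate directly into code. Again the arrays are toy values for illustration; swap in the real columns:

```python
import numpy as np

# Toy stand-ins for the two columns.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
mean_x, mean_y = np.mean(X), np.mean(Y)

# w = Σ (x - µx)(y - µy) / Σ (x - µx)²
numerator = np.sum((X - mean_x) * (Y - mean_y))
denominator = np.sum((X - mean_x) ** 2)
w = numerator / denominator

# b = µy - w * µx
b = mean_y - w * mean_x
print(w, b)  # w ≈ 0.6, b ≈ 2.2
```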
Plotting the data and the Learned Hyper Plane (Line), with the help of coefficients :
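A minimal plotting sketch with matplotlib, reusing the toy data and the coefficients computed above (the axis labels assume the head-size / brain-weight columns):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

# Same toy data and coefficients as above.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
w, b = 0.6, 2.2

# Points along the learned line y = w*x + b.
x_line = np.linspace(X.min(), X.max(), 100)
y_line = w * x_line + b

plt.scatter(X, Y, label="Data points")
plt.plot(x_line, y_line, color="red", label="Regression line (Hyper Plane)")
plt.xlabel("Head Size (cm³)")
plt.ylabel("Brain Weight (grams)")
plt.legend()
plt.savefig("linear_fit.png")
```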
Error :
Root Mean Square Error, calculated as mentioned above.
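The RMSE from the formula earlier, computed on the same toy data (a sketch, not the original notebook’s exact code):

```python
import numpy as np

# Same toy data and coefficients as above.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
w, b = 0.6, 2.2

Y_pred = w * X + b
rmse = np.sqrt(np.mean((Y - Y_pred) ** 2))
print(rmse)  # ≈ 0.693
```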
The code and the dataset can be found here. Thanks to this guy, it was trivial for me to explain the essence here.
Stay updated with all my blogs & updates on LinkedIn. Welcome to my network. Follow me on LinkedIn here: https://www.linkedin.com/in/bishalbose294/