Computing The Least-Squares Regression Line for Given Data from Scratch using Python

Curve fitting is a technique for finding the function that best represents a given set of data points. Least-squares regression is a curve-fitting method commonly used for over-determined systems (those with more equations than unknowns).

In this article, I will show how to find the best-fit line for given data points using the least-squares formulas. After visualizing the fitted line on the data points, I will compare the result with a dataset that a straight line cannot represent well.

Mathematical Formula

y = f(x) = αx + β (1)

In equation 1, α is the slope of the line and β is the y-intercept; x is the independent variable (the data) and y is the dependent result. The method takes its name from its objective function, which guides the optimization: we look for the line that minimizes the following sum of squared differences:

Objective|cost|loss function = ∑ᵢⁿ(yᵢ-pᵢ)²

pᵢ is the value approximated by the regressor (pᵢ = f(xᵢ) = αxᵢ + β). Setting the partial derivatives of the objective function with respect to α and β to zero gives the optimal slope and intercept. I am not going to prove it here, but the optimal line has to pass through the point (mean x, mean y), written (μˣ, μʸ) below. Knowing this, we can write the following equations for the slope (α) and the y-intercept (β):

α = ∑ᵢⁿ(xᵢ-μˣ)(yᵢ-μʸ)/∑ⁿᵢ(xᵢ-μˣ)² (2)

β = μʸ - αμˣ (3)
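For readers who want the missing step, equations 2 and 3 can be recovered from the objective function roughly as follows. This is a sketch of the standard derivation, not taken from the original notebook; S denotes the objective function, and μ_x, μ_y correspond to μˣ, μʸ above.

```latex
\begin{align*}
S(\alpha,\beta) &= \sum_{i=1}^{n} \left(y_i - \alpha x_i - \beta\right)^2 \\
\frac{\partial S}{\partial \beta} &= -2\sum_{i=1}^{n} \left(y_i - \alpha x_i - \beta\right) = 0
  \;\Longrightarrow\; \beta = \mu_y - \alpha\,\mu_x \\
\frac{\partial S}{\partial \alpha} &= -2\sum_{i=1}^{n} x_i\left(y_i - \alpha x_i - \beta\right) = 0 \\
\text{substituting } \beta:\quad
  0 &= \sum_{i=1}^{n} x_i\left[(y_i - \mu_y) - \alpha\,(x_i - \mu_x)\right] \\
\text{since } \sum_{i=1}^{n}\left[(y_i - \mu_y) - \alpha\,(x_i - \mu_x)\right] = 0:\quad
  0 &= \sum_{i=1}^{n} (x_i - \mu_x)\left[(y_i - \mu_y) - \alpha\,(x_i - \mu_x)\right] \\
\alpha &= \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{\sum_{i=1}^{n} (x_i - \mu_x)^2}
\end{align*}
```

The condition on β is exactly the statement that the line passes through the mean point.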

We now have all the formulas needed for the calculation, so let’s get our hands dirty with some coding.

Applying Solution in Python

A trivial dataset will be used for the sake of this article.

The following function implements equation 2. Python’s multiplication operator performs element-wise multiplication when used with NumPy arrays, which makes it easy to express mathematical formulas in a vectorized way.
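A minimal sketch of such a function, assuming NumPy arrays as input. The name `slope` and the sample data are placeholders of mine, not the original notebook's code or dataset.

```python
import numpy as np


def slope(x, y):
    """Least-squares slope from equation 2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    # `*` multiplies the arrays element by element; .sum() adds up the terms
    return (dx * dy).sum() / (dx * dx).sum()


# Hypothetical data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
alpha = slope(x, y)
print(alpha)
```

Because the sample points sit exactly on a line, the computed slope should match the slope used to generate them.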

After finding the slope, and knowing that the point (mean x, mean y) has to lie on the regression line, the y-intercept can be found easily as follows:
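A possible version of this step, again as a sketch: `intercept` is a name I chose, and the slope value passed in is assumed to have been computed already for the same placeholder data.

```python
import numpy as np


def intercept(x, y, alpha):
    """y-intercept from equation 3: beta = mean(y) - alpha * mean(x)."""
    return np.mean(y) - alpha * np.mean(x)


# Hypothetical data lying exactly on y = 2x + 1, so the slope is 2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
beta = intercept(x, y, alpha=2.0)
print(beta)
```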

All the information needed to define the line is now available. It can be written as a lambda expression:
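One way to write it, with placeholder values standing in for the slope and intercept computed earlier:

```python
import numpy as np

# Placeholder parameters; in practice these come from equations 2 and 3
alpha, beta = 2.0, 1.0

# f(x) = alpha * x + beta; works for a single number or a whole NumPy array
line = lambda x: alpha * x + beta

predictions = line(np.array([0.0, 1.0, 2.0]))
print(line(3.0), predictions)
```

A regular `def` would work just as well; the lambda simply keeps the one-line formula close to equation 1.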

We have calculated the parameters; now it is time to visualize the line on the data points. I used the NumPy module exposed through the pandas package to produce x values within a range; however, this usage is deprecated in recent pandas versions, so importing NumPy directly is the safer choice. The generated points are then passed to the line function to get the corresponding f(x) values. The following function plots the data points, the line, and the mean point:
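A sketch of such a plotting function, assuming matplotlib is available and importing NumPy directly. The function name `plot_regression`, its parameters, and the sample data are my own choices rather than the original notebook's.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_regression(x, y, line, n_points=100):
    """Scatter the data, draw the fitted line, and mark the mean point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Evenly spaced x values covering the range of the data
    xs = np.linspace(x.min(), x.max(), n_points)

    fig, ax = plt.subplots()
    ax.scatter(x, y, label="data points")
    ax.plot(xs, line(xs), color="red", label="regression line")
    ax.scatter([x.mean()], [y.mean()], color="green", marker="x", label="mean point")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    return fig, ax


# Hypothetical data and a line with placeholder parameters
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])
fig, ax = plot_regression(x, y, lambda v: 2.0 * v + 1.0)
# Call plt.show() to display the figure in an interactive session
```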

Plots & Comments

Plot 1: Least square regression line on linearly representable data points

As can be seen from Plot 1, the fitted line looks appropriate for the data points. If the data has a linear correlation, least-squares regression is a good option for finding the optimal line.

Plot 2: Least square regression line on non-linearly correlated data

Plot 2 shows the limitation of the linear least-squares solution. Because the model is restricted to a straight line αx + β, it cannot capture a curve such as αx² + βx + c. The least-squares line can still be computed for the second dataset, but because the relationship between x and f(x) is non-linear, that line represents the data poorly. For these cases there is the polynomial least-squares solution, which finds the coefficients of a polynomial of degree d. The polynomial solution is a topic for another article.
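The difference between the two cases can also be seen numerically by evaluating the objective function for each fitted line. The sketch below uses two made-up datasets (a straight line and a symmetric parabola), not the article's data files, and the helper names are hypothetical.

```python
import numpy as np


def fit_line(x, y):
    """Return (alpha, beta) from equations 2 and 3."""
    dx = x - x.mean()
    alpha = (dx * (y - y.mean())).sum() / (dx * dx).sum()
    beta = y.mean() - alpha * x.mean()
    return alpha, beta


def loss(x, y):
    """Sum of squared differences between y and the fitted line."""
    alpha, beta = fit_line(x, y)
    return ((y - (alpha * x + beta)) ** 2).sum()


x = np.linspace(-3.0, 3.0, 13)
loss_linear = loss(x, 2.0 * x + 1.0)  # points on a straight line
loss_curved = loss(x, x ** 2)         # points on a parabola
alpha_curved, _ = fit_line(x, x ** 2)
print(loss_linear, loss_curved, alpha_curved)
```

For the parabola, which is symmetric around x = 0, the best line is essentially flat (slope close to zero) and the loss stays large, while the loss for the linear data is close to zero.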

In conclusion, I tried to show the mathematical background of the linear least-squares solution together with a computational application in Python, one of the most popular programming languages. I hope it helps you understand it better.

The notebook file for the calculations and the data files can be found on my GitHub.



MSc. Student @ITU | Software Engineer & Machine Learning Engineer

Furkan Artunç
