Computing The Least-Squares Regression Line for Given Data from Scratch using Python
Curve fitting is a technique for finding the function that best represents given data points. Least-squares regression is a curve-fitting method commonly used for over-determined systems (when there are more equations than unknowns).
In this article, I will show how to find the best-fit line for given data points using the least-squares formulas. After visualizing the fitted line on the data points, I will compare the results with a dataset that cannot be well represented by a straight line.
The linear least-squares solution is formulated as follows:
y = f(x) = αx + β (1)
In equation 1, α represents the slope of the line and β represents the y-intercept; x is the input data and y is the dependent result. The key point, and where the method's name comes from, is the objective function that guides the optimization: we minimize the following sum of squared errors:
Objective (cost/loss) function = ∑ᵢⁿ(yᵢ − pᵢ)²
pᵢ represents the point approximated by the regressor (f(xᵢ) = αxᵢ + β). Setting the partial derivatives of the objective function with respect to α and β to zero yields the optimal parameters. I am not going to prove it here, but a known fact is that the optimal line must pass through the point (mean x, mean y). Knowing this, we can construct the following formulas to find the slope (α) and the y-intercept (β):
α = ∑ᵢⁿ(xᵢ − μˣ)(yᵢ − μʸ) / ∑ᵢⁿ(xᵢ − μˣ)² (2)
β = μʸ − αμˣ (3)
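As a quick numerical sanity check (my own sketch, not part of the article's code), equations 2 and 3 can be verified against NumPy's `polyfit`, which computes the same least-squares fit:

```python
import numpy as np

# Points lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Equation 2: slope
alpha = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# Equation 3: intercept
beta = y.mean() - alpha * x.mean()

# np.polyfit(x, y, 1) returns [slope, intercept] for a degree-1 fit
slope_ref, intercept_ref = np.polyfit(x, y, 1)
```

Both approaches recover the slope 2 and intercept 1 for this data.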
We now have all the mathematical formulas we need, so let's get our hands dirty with some code.
Applying Solution in Python
I will use pandas to load the data and matplotlib for basic plotting. I am skipping how to install and import these libraries since they are not the main topic of this article.
For the sake of this article, a trivial dataset will be used.
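The original embedded snippet is not reproduced here; a minimal sketch, assuming a two-column dataset with hypothetical column names `x` and `y` (in practice the data would be loaded from a CSV file with `pd.read_csv`):

```python
import pandas as pd

# Hypothetical in-memory stand-in for a trivial CSV dataset;
# real data would be loaded with e.g. pd.read_csv("data.csv")
data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.2],
})
x = data["x"].to_numpy()
y = data["y"].to_numpy()
```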
The following function implements equation 2. The multiplication operator performs element-wise multiplication when used with NumPy arrays, which makes it easy to express mathematical formulas in a vectorized way.
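The embedded code is not shown here; one possible implementation of equation 2, with a hypothetical function name `find_slope`, might look like this:

```python
import numpy as np

def find_slope(x, y):
    """Equation 2: sum of element-wise products over sum of squares.

    The * operator on NumPy arrays multiplies element-wise, so the
    formula translates almost literally into vectorized code.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
```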
After finding the slope, and knowing that the point (mean x, mean y) must lie on the regression line, the y-intercept can be found easily as follows:
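A sketch of the intercept calculation (the function name `find_intercept` is my own, not from the article's code):

```python
import numpy as np

def find_intercept(x, y, slope):
    """Equation 3: since the line passes through (mean x, mean y),
    the intercept follows directly from the slope."""
    return np.mean(y) - slope * np.mean(x)
```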
All the information needed to form the specific line is now available. The line can be written as a lambda expression:
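Putting the pieces together, the fitted line can be expressed as a lambda. This is a sketch under the assumption that the slope and intercept are computed with equations 2 and 3 as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Equations 2 and 3
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
intercept = y.mean() - slope * x.mean()

# The regression line as a lambda expression
line = lambda v: slope * v + intercept
```

The fitted line necessarily passes through the mean point, which makes a convenient sanity check.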
We have calculated the variables; now it is time to visualize the line on the data points. I used NumPy, which used to be reachable through the pandas namespace (pd.np), to produce x values within a range; however, that usage is deprecated in recent pandas versions, so NumPy should be imported directly. The generated points are then fed into the line function to obtain the corresponding f(x) values. The following function plots the data points, the line, and the mean point:
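A sketch of such a plotting function, using `numpy` directly instead of the deprecated `pd.np` alias (the function and file names are my own assumptions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

def plot_fit(x, y, line, filename="regression.png"):
    """Plot the data points, the fitted line, and the mean point."""
    xs = np.linspace(np.min(x), np.max(x), 100)  # np.linspace, not pd.np.linspace
    plt.figure()
    plt.scatter(x, y, label="data points")
    plt.plot(xs, line(xs), color="red", label="fitted line")
    plt.scatter([np.mean(x)], [np.mean(y)], color="green", label="mean point")
    plt.legend()
    plt.savefig(filename)
    plt.close()
```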
Plots & Comments
Two different datasets are used to show the capability and the limitation of the linear least-squares solution.
As can be seen in Plot 1, the approximated line fits the data points quite well. If the data has a linear correlation, least-squares regression is a good option for finding the optimal line.
Plot 2 shows the limitation of the linear least-squares solution. Because we set out to find a straight line of the form αx + β, a non-linear curve such as αx² + βx + γ cannot be recovered by the linear method. Due to the non-linear relationship between x and f(x) in the second dataset, no straight line fits well. For these cases there is polynomial least squares, which finds the coefficients of a polynomial of degree d. The polynomial solution is a topic for another article.
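To illustrate the limitation numerically (a sketch of my own, not from the article), a straight line fitted to quadratic data leaves a large residual, while `np.polyfit` with degree 2 recovers the relationship almost exactly:

```python
import numpy as np

x = np.arange(-3.0, 4.0)   # -3, -2, ..., 3
y = x ** 2                 # quadratic relationship, no linear trend

# Best straight line via equations 2 and 3
alpha = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
beta = y.mean() - alpha * x.mean()
linear_sse = ((y - (alpha * x + beta)) ** 2).sum()

# Degree-2 polynomial least squares fits this data exactly
coeffs = np.polyfit(x, y, 2)
quadratic_sse = ((y - np.polyval(coeffs, x)) ** 2).sum()
```

For this symmetric data the best straight line is flat (α = 0), so the linear residual stays large while the quadratic residual vanishes.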
In conclusion, I tried to show the mathematical background of the linear least-squares solution together with a computational application in Python, one of the most popular programming languages. I hope it helps you understand the method better.
The notebook file with the calculations and the data files can be found on my GitHub: https://github.com/artuncF/Linear-Least-Square-Regression .