Nearly 80% of people build linear regression models without checking the basic assumptions of linear regression.
Pause for a second and think: how many times have you built a linear regression model without checking these assumptions?
If you are not familiar with linear regression, it is a popular supervised machine learning algorithm that models the linear relationship between a dependent variable and one or more independent variables.
It is easy to understand and implement. However, writing a few lines of code is not enough to get a model that works as expected.
Before fitting a linear regression model, we have to check certain assumptions that linear regression makes about the data.
It is important to understand these assumptions to improve the regression model’s performance.
So in this article, we are going to discuss these assumptions in depth, along with ways to fix the violations when they occur. Once you understand the linear regression assumptions, you can substantially improve your regression models.
Linear Regression Algorithm
Before explaining the algorithm, let’s see what regression is.
Regression is a method used to determine the degree of relationship between a dependent variable (y) and one or more independent variables (x).
Linear regression determines the linear relationship between one or more independent variables and one target variable.
In machine learning, linear regression is a commonly used supervised algorithm for regression problems. It is easy to implement and understand.
Supervised means that the algorithm learns to make predictions from the labeled data fed to it.
Mathematically, linear regression can be represented as
y = mx + c
Here,
- y = dependent variable (Target variable)
- x = independent variable
- m = regression coefficient
- c = intercept of the line
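As a minimal sketch of how m and c can be estimated for a single feature, the snippet below uses the standard least-squares formulas. The function name fit_line and the hours/marks values are hypothetical and only serve as an illustration.

```python
def fit_line(x, y):
    """Return (m, c) that minimize the squared error of y = m*x + c."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    c = y_mean - m * x_mean
    return m, c


# Hypothetical data: hours studied vs. marks scored
hours = [1, 2, 3, 4, 5]
marks = [52, 54, 61, 63, 70]

m, c = fit_line(hours, marks)
print(f"m = {m:.2f}, c = {c:.2f}")
```

In practice you would usually rely on a library implementation, but the hand-written version makes it clear what the coefficient and the intercept represent.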
In linear regression, the target variable has continuous or real values.
For example, suppose we are predicting the price of houses based on certain features. Here, the house prices are the target (dependent) variable, and the features determining the price are the independent variables.
When the target variable can be determined using one independent variable, it is known as simple linear regression.
When the target depends on multiple variables, it is known as multiple linear regression.
I hope this gives you a high-level overview of the linear regression algorithm.
Most people don’t check the linear regression assumptions before building a linear regression model, but we need to check them.
Let me list the linear regression assumptions we need to check, and then we can discuss each of them in detail.
- Linear Relationship
- Normal Distribution of Residuals
- Multicollinearity
- Autocorrelation
- Homoscedasticity
Ideally you need to check these for Lasso regression and Ridge regression models too.
Linear Relationship
This is the first and most important assumption of linear regression. It states that the dependent and independent variables should be linearly related. It is also necessary to check for outliers because linear regression is sensitive to outliers.
Now the question is: how do we check whether the linearity assumption is met?
For determining this, we can use scatter plots. Scatter plots help you to visualize if there is a linear relationship between variables or not. Let me take an example to elaborate on it.
Suppose you have to check the relationship between the student’s marks and the number of hours they study.
From the above plot, we can see that devoting more hours does not necessarily increase marks, even though the relationship is still a linear one.
Let’s take another example where the linear relationship doesn’t hold.
In the given plot (Ozone vs. Radiation), we can see that there is no linear relationship between ozone and radiation.
It is important to check this assumption because if you fit a linear model to a non-linear relationship, the regression algorithm will fail to capture the trend.
The result is a poorly fitting model that makes erroneous predictions on unseen data.
Now comes the question: what can we do if the relationship between the features and the target is not linear?
What to do if the linear relationship assumption isn’t met
Let us discuss the options you can go with.
- You can apply nonlinear transformations to the independent and dependent variables.
- You can add another feature to the model. For example, if the plot of x vs. y has a parabolic shape, then adding x² as an additional feature may help.
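The effect of a squared feature can be sketched on made-up parabolic data. To keep the example dependency-free, it regresses y on x² alone rather than adding x² alongside x, which is a simplification of the option described above. The helper r_squared and the data are hypothetical.

```python
def r_squared(x, y):
    """R² of the least-squares line y = m*x + c."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    m = sxy / sxx
    c = y_mean - m * x_mean
    ss_res = sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical parabolic relationship: y = 2x² + 1
x = list(range(-5, 6))
y = [2 * xi ** 2 + 1 for xi in x]

# New feature derived from x
x_squared = [xi ** 2 for xi in x]

r2_raw = r_squared(x, y)                  # straight line on the raw feature
r2_transformed = r_squared(x_squared, y)  # straight line on the squared feature

print(f"R² using x:  {r2_raw:.3f}")
print(f"R² using x²: {r2_transformed:.3f}")
```

On this data the straight line through x explains none of the variation, while the same kind of line through x² fits perfectly, because the relationship is linear in the transformed feature.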
Normal Distribution of Residuals
The second assumption of linear regression is that all the residuals, or error terms, should be normally distributed. If the residuals are not normally distributed, the confidence intervals of the estimates may become too wide or too narrow.
Non-normally distributed residuals often indicate that there are some unusual data points that we have to examine closely to build a good model.
Ways to Check Normal Distribution
To check for a normal distribution, we can use two plots:
- Distribution Plots
- Q-Q Plots
Distribution Plot
We can use a distribution plot of the residuals to check whether they are normally distributed.
Here, the black line shows the normal distribution, and the blue line shows the distribution of the residuals.
We can see that there is a slight shift between the normal distribution and the residual distribution. If the residuals are not normally distributed, we can apply a non-linear transformation to the given features.
Q-Q Plot
A Q-Q plot, which stands for “quantile-quantile” plot, can also be used to check whether the residuals of a model follow a normal distribution.
If the residuals are normally distributed, the points on the plot will fall along a straight line. Deviations from the straight line indicate a departure from normality.
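The points of a Q-Q plot can be computed by hand, as in the sketch below. The residual values are made up, the helper name qq_points is hypothetical, and the plotting positions (i - 0.5) / n are just one common convention among several.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Pair each theoretical normal quantile with the matching sorted, standardized residual."""
    n = len(residuals)
    mu, sigma = mean(residuals), stdev(residuals)
    sample = sorted((r - mu) / sigma for r in residuals)
    # Theoretical quantiles of the standard normal at evenly spaced probabilities
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))


# Hypothetical residuals from a fitted model
residuals = [-1.8, -0.9, -0.4, -0.1, 0.0, 0.2, 0.5, 0.8, 1.7]

points = qq_points(residuals)
for theoretical_q, sample_q in points:
    print(f"{theoretical_q:6.2f}  {sample_q:6.2f}")
```

Plotting the first column against the second gives the Q-Q plot. The closer the two columns track each other, the closer the points lie to the straight line.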
Normality can also be checked with statistical tests such as the Kolmogorov-Smirnov test, the Jarque-Bera test, or the D’Agostino-Pearson test.
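As an illustration of one of these tests, here is a hand-written sketch of the Jarque-Bera statistic, which combines the skewness and kurtosis of the residuals. The residual values are hypothetical. Under normality the statistic approximately follows a chi-square distribution with 2 degrees of freedom (5% critical value of about 5.99), although that approximation is rough for a sample as small as this one.

```python
def jarque_bera(residuals):
    """Return (JB statistic, skewness, kurtosis) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments of the residuals
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    return jb, skewness, kurtosis


# Hypothetical residuals
residuals = [-2, -1, 0, 1, 2]

jb, skewness, kurtosis = jarque_bera(residuals)
print(f"JB = {jb:.4f}, skewness = {skewness:.2f}, kurtosis = {kurtosis:.2f}")
```

A large statistic suggests that the residuals are skewed or have heavier or lighter tails than a normal distribution.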
Multicollinearity
The next assumption of linear regression is that there should be little or no multicollinearity in the given dataset.
Multicollinearity occurs when the features, or independent variables, of a given dataset are highly correlated with each other.
In a model with correlated variables, it becomes difficult to determine which variable is actually contributing to the prediction of the target variable. In addition, the standard errors of the coefficients tend to increase in the presence of correlated variables.
Also, when independent variables are highly correlated, the estimated regression coefficient of a correlated variable depends on which other variables are included in the model.
If you drop one correlated variable from the model, the estimated regression coefficients of the others will change. This can lead to wrong conclusions and poor performance of our model.
How to Test Multicollinearity
We can test multicollinearity by using the following approaches.
- Correlation Matrix
- Tolerance
- Variance Inflation Factor
Let’s discuss the above in detail.
Correlation matrix
Correlation measures how strongly two variables change together. When calculating Pearson’s bivariate correlation matrix, the correlation coefficient between every pair of independent variables should be less than 1 in absolute value, and the further below 1 the better.
Let us check the correlation of the variables in our student_score dataset.
In this dataset, we have only one independent variable (hours) to determine our target variable (score). We can see that the hours devoted are highly correlated with the marks scored by the student. Note that this is a correlation between the feature and the target, which is desirable; multicollinearity only concerns correlations among the independent variables themselves.
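To illustrate a correlation matrix among several independent variables, here is a minimal sketch. The three feature columns are hypothetical and are not part of the student_score dataset.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    cov = sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))
    a_ss = sum((ai - a_mean) ** 2 for ai in a)
    b_ss = sum((bi - b_mean) ** 2 for bi in b)
    return cov / sqrt(a_ss * b_ss)


# Hypothetical independent variables
features = {
    "hours": [1, 2, 3, 4, 5],
    "practice_tests": [2, 1, 4, 3, 5],
    "sleep": [8, 7, 7, 6, 5],
}

names = list(features)
matrix = {
    row: {col: pearson(features[row], features[col]) for col in names}
    for row in names
}

for row in names:
    print(row.ljust(15), "  ".join(f"{matrix[row][col]:6.2f}" for col in names))
```

Off-diagonal entries close to 1 or -1, such as the one between hours and sleep here, flag pairs of features that carry nearly the same information.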
Tolerance
Tolerance helps us determine how strongly one independent variable is influenced by all the other independent variables.
Mathematically, it is defined as T = 1 - R², where R² is computed by regressing the independent variable of concern onto the remaining independent variables. If T is less than 0.01, i.e., T < 0.01, your data almost certainly has multicollinearity; values below 0.1 are commonly treated as a warning sign.
Variance Inflation Factor
The VIF approach takes each feature in turn and regresses it against the remaining features. It is calculated using the formula
VIF = 1 / (1 - R²)
- If the VIF value <= 4, it implies no multicollinearity
- If the VIF value >= 10, it implies significant multicollinearity
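Tolerance and VIF can be worked through for the simplest case of two features, where "the remaining features" is just the other one. The data and the helper name r_squared are hypothetical. With more features, the auxiliary regression would be a multiple regression, which this sketch does not cover.

```python
def r_squared(x, y):
    """R² from regressing y on a single predictor x with an intercept."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    m = sxy / sxx
    c = y_mean - m * x_mean
    ss_res = sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Two hypothetical independent variables
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

# Regress the feature of concern (x1) on the other feature (x2)
r2 = r_squared(x2, x1)
tolerance = 1 - r2
vif = 1 / (1 - r2)

print(f"R² = {r2:.2f}, tolerance = {tolerance:.2f}, VIF = {vif:.2f}")
```

Because VIF is the reciprocal of tolerance, the two measures always agree: a VIF of 10 corresponds to a tolerance of 0.1.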
Methods to handle Multicollinearity
- You can drop one of the features that are highly correlated in the given data.
- You can derive a new feature from the collinear features and then drop the original features that were used to make it.
Autocorrelation
One of the key assumptions of linear regression is that the given dataset should not be autocorrelated. Autocorrelation occurs when the residuals, or error terms, are not independent of each other.
In simple terms, it occurs when the value of f(x+1) is not independent of the value of f(x). This situation usually arises with data such as stock prices, where the price of a stock depends on its previous price.
How to Test if the Autocorrelation Assumption is met?
The easiest way to check whether this assumption is met is to look at a residual time series plot, which is a plot of residuals vs. time.
Usually, most of the residual autocorrelations should fall within the 95% confidence bands around zero, which are located at about ±2/√N, where N is the dataset’s size.
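The check against the approximate 95% bands can be sketched as follows. The residual series is made up and deliberately alternates in sign, and the function name autocorrelation is hypothetical.

```python
from math import sqrt


def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((e - mean) ** 2 for e in residuals)
    num = sum(
        (residuals[t] - mean) * (residuals[t - lag] - mean)
        for t in range(lag, n)
    )
    return num / denom


# Hypothetical residuals ordered by time
residuals = [1, -1, 1, -1, 1, -1]

# Approximate 95% band around zero
bound = 2 / sqrt(len(residuals))

for lag in (1, 2):
    r = autocorrelation(residuals, lag)
    status = "outside" if abs(r) > bound else "inside"
    print(f"lag {lag}: r = {r:+.3f} ({status} the band of ±{bound:.3f})")
```

An autocorrelation that falls outside the band at a low lag, as the lag-1 value does here, is a sign that neighbouring residuals are not independent.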
It can also be checked using the Durbin-Watson test.
The Durbin-Watson test statistic can be computed using the durbin_watson() function from statsmodels.stats.stattools.
Formula:
Output : 0.07975460122699386
- If the value of durbin_watson is close to 2, it implies no autocorrelation.
- If the value of durbin_watson lies between 0 and 2, it implies positive autocorrelation.
- If the value of durbin_watson lies between 2 and 4, it implies negative autocorrelation.
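To make the interpretation above concrete, the statistic can be computed by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The two residual series below are hypothetical, and this sketch is not the code that produced the output shown earlier.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    diff_sq = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    total_sq = sum(e ** 2 for e in residuals)
    return diff_sq / total_sq


# Hypothetical residuals that drift steadily upward (positive autocorrelation)
trending = [1, 2, 3, 4]
# Hypothetical residuals that flip sign every step (negative autocorrelation)
alternating = [1, -1, 1, -1]

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)

print(f"trending:    {dw_trending:.2f}")
print(f"alternating: {dw_alternating:.2f}")
```

The trending series gives a value near 0 and the alternating series a value between 2 and 4, which matches the ranges listed above.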
Methods to Handle Autocorrelation
- Include dummy variables in the data.
- Use estimated (feasible) generalized least squares.
- Include a linear trend term if the residuals show a consistently increasing or decreasing pattern.
Homoscedasticity
The fifth assumption of linear regression analysis is homoscedasticity. Homoscedasticity describes a situation in which the variance of the residuals (that is, the “noise” or error terms in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.
Simply put, residuals should have constant variance. If this condition is not met, it is known as heteroscedasticity.
Heteroscedasticity leads to an uneven scatter of the residuals or error terms. Non-constant variance often arises in the presence of outliers.
These values get too much weight and thereby disproportionately influence the model’s performance. The presence of heteroscedasticity in a regression analysis makes it difficult to trust the results of the analysis.
How to Test if Homoscedasticity Assumption is met?
The most basic approach to test for heteroscedasticity is by plotting fitted values against residual values.
The plot will show a funnel-shaped pattern if heteroscedasticity exists.
The presence of heteroscedasticity can also be checked using statistical tests. They are as follows:
The Breusch-Pagan Test:
It determines whether the variance of the residuals from a regression depends on the values of the independent variables. If it does, heteroscedasticity is present.
White Test:
White test determines if the variance of the residuals in a regression analysis model is fixed or constant.
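The idea behind the Breusch-Pagan test can be sketched for a single feature. This version is the studentized (Koenker) variant, in which the squared residuals are regressed on the feature and the statistic is N times the R² of that auxiliary regression. With one feature it is compared against a chi-square distribution with 1 degree of freedom. The data and the function name are hypothetical.

```python
from math import erfc, sqrt


def breusch_pagan_single(x, residuals):
    """Return (LM statistic, p-value) for one feature, Koenker variant."""
    n = len(x)
    squared = [e ** 2 for e in residuals]
    x_mean = sum(x) / n
    sq_mean = sum(squared) / n
    ss_tot = sum((s - sq_mean) ** 2 for s in squared)
    if ss_tot == 0:
        # Squared residuals are constant: no sign of heteroscedasticity
        return 0.0, 1.0
    # Auxiliary regression of squared residuals on x
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (s - sq_mean) for xi, s in zip(x, squared))
    m = sxy / sxx
    c = sq_mean - m * x_mean
    ss_res = sum((s - (m * xi + c)) ** 2 for xi, s in zip(x, squared))
    r2 = 1 - ss_res / ss_tot
    lm = max(n * r2, 0.0)
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = erfc(sqrt(lm / 2))
    return lm, p_value


# Hypothetical residuals whose spread grows with x
x = list(range(1, 11))
residuals = [xi if i % 2 == 0 else -xi for i, xi in enumerate(x)]

lm, p_value = breusch_pagan_single(x, residuals)
print(f"LM = {lm:.2f}, p-value = {p_value:.4f}")
```

A small p-value, as in this example, indicates that the residual variance changes with the feature, so the homoscedasticity assumption is doubtful.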
Methods to handle Heteroscedasticity
There are two ways to handle heteroscedasticity. Let’s understand both.
Transform the Dependent Variables
We can transform the dependent variable to reduce heteroscedasticity. The most commonly used transformation is taking the log of the dependent variable.
For instance, suppose we are using independent variables (input features) to predict the number of cosmetic shops in a city (target variable). We may instead use the input features to predict the log of the number of cosmetic shops in a city.
Using the log of the target variable helps to reduce the heteroscedasticity to some extent.
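A small made-up example shows why the log helps when the noise is multiplicative. The target below is the trend y = x scaled up or down by a fixed factor, so the errors around the trend grow with x on the raw scale but not on the log scale. The errors are measured against the known trend rather than a fitted model, which is a simplification.

```python
from math import log
from statistics import pstdev

x = list(range(1, 21))
# Hypothetical target: the trend y = x with multiplicative noise
y = [xi * (1.5 if i % 2 == 0 else 1 / 1.5) for i, xi in enumerate(x)]

# Errors around the trend on the raw scale and on the log scale
raw_errors = [yi - xi for xi, yi in zip(x, y)]
log_errors = [log(yi) - log(xi) for xi, yi in zip(x, y)]

half = len(x) // 2
raw_low, raw_high = pstdev(raw_errors[:half]), pstdev(raw_errors[half:])
log_low, log_high = pstdev(log_errors[:half]), pstdev(log_errors[half:])

print(f"raw scale: spread {raw_low:.2f} for small x, {raw_high:.2f} for large x")
print(f"log scale: spread {log_low:.2f} for small x, {log_high:.2f} for large x")
```

On the raw scale the spread is clearly larger for large x, while on the log scale it is the same in both halves.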
Use weighted regression
Another approach to dealing with heteroscedasticity is weighted regression. In this method, a weight is assigned to each data point based on the variance of its fitted value, typically the inverse of that variance, so that noisier observations have less influence on the fit.
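A minimal sketch of weighted least squares for a single feature is shown below. It assumes inverse-variance weights, and the per-point variances are hypothetical values that in practice would have to be estimated. The function name weighted_fit is also hypothetical.

```python
def weighted_fit(x, y, weights):
    """Return (m, c) minimizing the weighted squared error of y = m*x + c."""
    w_sum = sum(weights)
    # Weighted means of x and y
    x_mean = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    y_mean = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    sxy = sum(w * (xi - x_mean) * (yi - y_mean) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_mean) ** 2 for w, xi in zip(weights, x))
    m = sxy / sxx
    c = y_mean - m * x_mean
    return m, c


# Hypothetical data where the noise grows with x
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.5, 12.0]
variances = [0.1, 0.1, 0.2, 0.5, 1.0]  # assumed error variance of each point

# Noisier points get smaller weights
weights = [1 / v for v in variances]

m, c = weighted_fit(x, y, weights)
print(f"m = {m:.2f}, c = {c:.2f}")
```

Setting all weights to the same value reduces this to ordinary least squares, so the weights are the only thing that distinguishes the two fits.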
Conclusion
This is the end of this article. We discussed the assumptions of linear regression analysis, ways to check whether the assumptions are met, and what to do if they are violated.
It is necessary to check the assumptions of linear regression before relying on the model’s statistics. The model’s estimates and predictions will be far more reliable if these assumptions are met.
The classical linear regression model is one of the most efficient estimators when all the assumptions hold.
The best thing about this concept is that the efficiency increases as the sample size increases to infinity.
What next
After reading the article, please take any of the regression models you have developed in the past and check these linear regression assumptions.
For implementing and understanding linear regression, I would suggest reading this article to understand the concept in a more practical way.
Also, explore the remaining machine learning algorithms on our platform to enhance your knowledge.