Why are insights needed?
First things first. Why is linear regression important?
Linear regression is a fundamental technique, rooted in the time-tested theory of statistical learning and inference, and it underpins the regression-based algorithms used in modern data science pipelines. For the majority of data analytics work (other than problems dealing with high-dimensional data like images, audio, or natural language), such regression techniques are still among the most widely used tools.
They are simple to implement and, more importantly, simple to follow and explain. And explainability is becoming more and more important.
However, the success of a linear regression model also depends on some fundamental assumptions about the nature of the underlying data. It is hard to overstate how important it is to verify that these assumptions are "reasonably" satisfied. That check is the best safeguard for the quality of your linear regression model.
I explored these issues in a previous story, "How do you check the quality of your regression model in Python?"
For all of us using Python as the language of choice for data science, the go-to package for machine learning is Scikit-learn. Although the estimators of Scikit-learn are highly optimized and thoughtfully designed, they do not provide many statistical insights or checks for regression tasks. For example, they can give you the R² score and regression coefficients, and not much else.
But what if you wanted, from the same estimator, the following insights and plots?
- Residuals vs. predicting variables plots
- Fitted vs. residuals plot
- Histogram of the normalized residuals
- Q-Q plot of the normalized residuals
- Shapiro-Wilk normality test on the residuals
- Cook’s distance plot of the residuals
- Variance inflation factor (VIF) of the predicting features
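Each of these checks is critical for telling you whether the regression problem (i.e. the incoming data) was of good quality and whether the modeling assumptions have been satisfied. Essentially, the problem has to satisfy the standard assumptions of linear regression: linearity, independent errors, constant error variance, and normally distributed error terms.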
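To make one of the listed checks concrete, here is a minimal NumPy sketch of the variance inflation factor. It is not mlr's implementation: the helper name `vif` and the synthetic data are assumptions made for illustration. For each feature, it regresses that feature on all the others (plus an intercept) and reports 1 / (1 - R²) of that auxiliary regression.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress feature j on all other features plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
X_demo = rng.random((40, 3))
# Append a fourth column that is nearly a copy of the first one.
X_demo = np.column_stack([X_demo, X_demo[:, 0] + 0.01 * rng.standard_normal(40)])
vif_values = vif(X_demo)
print(np.round(vif_values, 2))
```

The first and last columns are almost collinear, so their VIF values come out large, while the two independent columns stay close to 1.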
In this article, we will explore a simple, lightweight Python package called mlr to demonstrate how to gain some insights into a regression problem with a minimal amount of code.
Install and basic fitting
Basic pip install.
pip install mlr
We can generate some random data for the demo.
import numpy as np

num_samples = 40
num_dim = 5
X = 10*np.random.random(size=(num_samples,num_dim))
coeff = np.array([2,-3.5,1.2,4.1,-2.5])
y = np.dot(coeff,X.T)+10*np.random.randn(num_samples)
The feature vector has 5 dimensions. Note the random noise added to the data. What does it look like? We plot the response variable w.r.t. each dimension of the feature vector.
At this point, we can create a model instance using the mlr library.
model = mlr()
What is it? We can probe 🙂
model
>> I am a Linear Regression model!
The next step is to ingest the data. Not fit, yet, but just to ingest.
model.ingest_data(X,y)
At this point, the data has been ingested but not fitted.
model.is_ingested
>> True
model.is_fitted
>> False
Immediately, the correlation matrix is available for visualization. Even before fitting a regression model, you have to check for multi-collinearity, don’t you?
model.corrplot()
All the correlation data and the full covariance matrix are also available. We won't print them here, just to save space.
model.corrcoef()
model.covar()
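For reference, the same kind of matrices can be computed directly with NumPy. This is a sketch on freshly generated data of the same shape as the demo, not a description of what mlr returns; in particular, whether mlr includes the response variable in its matrices is not something this example assumes.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = 10 * rng.random((40, 5))          # same shape as the demo data

# rowvar=False treats each column as one variable (feature).
corr = np.corrcoef(X_demo, rowvar=False)   # 5 x 5 correlation matrix
cov = np.cov(X_demo, rowvar=False)         # 5 x 5 covariance matrix

# Largest absolute correlation between two different features.
off_diag = np.abs(corr - np.eye(5))
print("max |corr| between distinct features:", round(float(off_diag.max()), 3))
```

A large off-diagonal value here would be an early warning sign of multi-collinearity.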
Then, just fit.
model.fit()
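If you are curious what a least-squares fit involves, the following is a minimal NumPy sketch. It is not mlr's internal code; it assumes an intercept term is fitted alongside the five slopes, and it regenerates the demo data with a seeded generator so the run is reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_dim = 40, 5
X = 10 * rng.random((num_samples, num_dim))
coeff = np.array([2, -3.5, 1.2, 4.1, -2.5])
y = X @ coeff + 10 * rng.standard_normal(num_samples)

# Design matrix with a leading column of ones for the intercept.
A = np.column_stack([np.ones(num_samples), X])

# Least-squares solution: beta[0] is the intercept, beta[1:] the slopes.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

print("intercept:", beta[0])
print("slopes:   ", beta[1:])
```

At the least-squares solution the residuals are orthogonal to every column of the design matrix, which is a quick sanity check on any fit.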
A bunch of stuff at your disposal
Once the model is fitted, a lot happens internally, and the model object is primed with a host of metrics and visualizations that you can use to gain insight into the problem.
Simple R² score (and all the associated metrics)
We can print the simple R² and adjusted R² values,
model.r_squared()
>> 0.8023008559130889
model.adj_r_squared()
>> 0.7732274523708961
Or, we can print all of them in one shot!
model.print_metrics()
>>
sse: 3888.1185
sst: 19666.8452
mse: 97.2030
r^2: 0.8023
adj_r^2: 0.7732
AIC: 308.5871
BIC: 318.7204
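The printed metrics are related to each other by standard formulas. The sketch below recomputes them from the printed sse and sst, with 40 samples and 5 features. The AIC/BIC convention used here (Gaussian log-likelihood, counting the intercept as a parameter) is an assumption; it agrees with the printout to the displayed precision, which suggests but does not confirm that mlr uses the same convention.

```python
import math

n, p = 40, 5            # samples and features in the demo data
sse = 3888.1185         # sum of squared errors, from the printout
sst = 19666.8452        # total sum of squares, from the printout

mse = sse / n
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Gaussian log-likelihood at the least-squares fit; k counts the
# intercept plus the p slopes. This convention is an assumption.
k = p + 1
log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(sse / n) + 1)
aic = -2 * log_lik + 2 * k
bic = -2 * log_lik + k * math.log(n)

print(f"mse={mse:.4f} r^2={r2:.4f} adj_r^2={adj_r2:.4f}")
print(f"AIC={aic:.4f} BIC={bic:.4f}")
```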
For the sake of brevity, I won’t go through all these metrics but these are the basics of linear regression problems that you should already be familiar with 🙂
Perform the F-test of overall significance
It returns the F-statistic and the p-value of the test. If the p-value is small, you can reject the null hypothesis that all the regression coefficients are zero. That means a small p-value (generally < 0.01) indicates that the overall regression is statistically significant.
model.ftest()
>> (27.59569772244756, 4.630496783262639e-11)
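The F-statistic itself follows from the sums of squares printed earlier by print_metrics(). The sketch below is a plain-Python check, assuming 40 samples and 5 features; it does not call mlr. The p-value is the upper-tail probability of an F distribution with p and n - p - 1 degrees of freedom, which this standard-library-only example does not compute.

```python
n, p = 40, 5
sse, sst = 3888.1185, 19666.8452   # values from the metrics printout

ssr = sst - sse                     # variation explained by the model
f_stat = (ssr / p) / (sse / (n - p - 1))
print(round(f_stat, 4))
```

The result agrees closely with the first value returned by model.ftest().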
t-test statistics, and standard errors of the coefficients
Standard errors and the corresponding t-tests give us the p-values for each regression coefficient, which tell us whether that particular coefficient is statistically significant or not (based on the given data).
Again, all of these, one function call away,
print("P-values:",model.pvalues())
print("t-test values:",model.tvalues())
print("Standard errors:",model.std_err())
We get,
P-values: [1.37491834e-01 8.39253557e-06 1.62863484e-05 1.64865547e-03
1.23943729e-06 1.97055499e-02]
t-test values: [-1.5210582 5.23964721 -5.017859 3.41906173 5.87822303 -2.44746809]
Standard errors: [7.35200748 0.65528743 0.58944149 0.58706879 0.55966142 0.7806859 ]
Confidence intervals for each feature?
One line of code,
model.conf_int()
>>
array([[-26.12390808, 3.75824557],
[ 2.10177067, 4.76517923],
[ -4.15562353, -1.75984506],
[ 0.81415711, 3.20029178],
[ 2.15244579, 4.42718347],
[ -3.49724847, -0.3241592 ]])
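The intervals, standard errors, and t-values shown above are tied together by simple arithmetic. As a sketch, the snippet below takes the second row of the interval array with its standard error and recovers the t-value and the multiplier used to build the interval. The variable names are illustrative, and the numbers are copied from the outputs above.

```python
# Values copied from the outputs above for the second coefficient.
std_err = 0.65528743
ci_low, ci_high = 2.10177067, 4.76517923

estimate = (ci_low + ci_high) / 2       # interval is centred on the estimate
t_value = estimate / std_err            # test statistic for H0: coefficient = 0
half_width = (ci_high - ci_low) / 2
critical = half_width / std_err         # implied multiplier of the std. error

print(f"estimate={estimate:.4f} t={t_value:.4f} multiplier={critical:.4f}")
```

The multiplier comes out near 2.03, which is the two-sided 95% critical value of a t distribution with 34 degrees of freedom (40 samples minus 6 estimated parameters). So these appear to be 95% intervals, although that is inferred from the numbers rather than stated by the library.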
Visual analysis of the residuals
Residual analysis is crucial to check the assumptions of a linear regression model. mlr helps you check those assumptions easily by providing straightforward visual analytics methods for the residuals.
Fitted vs. residuals plot
Check the assumptions of constant variance and uncorrelated errors (independence) with this plot.
model.fitted_vs_residual()
Fitted vs features plot
Check the assumption of linearity with this plot,
model.fitted_vs_features()
Histogram and Q-Q plot of standardized residuals
Check the normality assumption of the error terms using these plots,
model.histogram_resid()
model.qqplot_resid()
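To see what a Q-Q plot is built from, here is a minimal sketch that computes the plotted points by hand, without any plotting library. It is an illustration under assumptions, not mlr's code: the data are regenerated with a seeded generator, the residuals are scaled by the residual standard error (a simple form of standardization), and the plotting positions (i - 0.5) / n are one common choice among several.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n = 40
X = 10 * rng.random((n, 5))
y = X @ np.array([2, -3.5, 1.2, 4.1, -2.5]) + 10 * rng.standard_normal(n)

# Ordinary least-squares fit with an intercept, then residuals.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# Scale by the residual standard error (n - 6 degrees of freedom).
std_resid = resid / np.sqrt(resid @ resid / (n - A.shape[1]))

# Q-Q pairs: sorted residuals against standard normal quantiles
# at the plotting positions (i - 0.5) / n.
sample_q = np.sort(std_resid)
theory_q = np.array([NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)])

# A correlation close to 1 means the points lie near a straight line.
qq_corr = np.corrcoef(theory_q, sample_q)[0, 1]
print(round(float(qq_corr), 3))
```

When the error terms are close to normal, the pairs (theory_q, sample_q) fall near a straight line, which is exactly what the Q-Q plot lets you judge by eye.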
Feature selection
One of the good things about a simplified API is that we can use it to do complicated tasks with only a few lines of code. For example, we may want to use mlr to check how the regression metrics change as we start with only one explanatory variable and gradually add more. This is a common step in feature selection.
For this, assume that we have already put the data into a Pandas DataFrame and are using the fit_dataframe method.
for i in range(1,6):
    m = mlr() # Model instance
    # List of explanatory variables
    X = ['X'+str(j) for j in range(i)]
    # Fitting the dataframe by passing on the list
    m.fit_dataframe(X=X,y='y',dataframe=df)
    print("\nRegression model built with feature vector", X)
    print("Metrics are as follows...")
    print("-"*80)
    m.print_metrics()
Something like the following gets printed,
We can also plot the gradual decrease in the value of AIC and BIC as the variables are added.
What kind of insights can you get from the plot above? Perhaps, if you are resource-constrained for model training, you may stop with three features to build the model.
And explain to your customer why you chose only those three features.
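The same experiment can be sketched with NumPy alone, which makes the mechanics of the loop explicit. This is not mlr's implementation: the helper `fit_metrics` is hypothetical, the data are regenerated with a seeded generator, and the AIC uses the Gaussian log-likelihood convention as an assumption.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = 10 * rng.random((n, 5))
y = X @ np.array([2, -3.5, 1.2, 4.1, -2.5]) + 10 * rng.standard_normal(n)

def fit_metrics(X_sub, y):
    """Return (R^2, AIC) of an OLS fit with intercept (hypothetical helper)."""
    m = len(y)
    A = np.column_stack([np.ones(m), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((y - A @ beta) ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    # Gaussian log-likelihood convention for AIC (an assumption).
    log_lik = -0.5 * m * (math.log(2 * math.pi) + math.log(sse / m) + 1)
    return 1 - sse / sst, -2 * log_lik + 2 * A.shape[1]

results = []
for i in range(1, 6):
    r2, aic = fit_metrics(X[:, :i], y)   # first i features only
    results.append((i, r2, aic))
    print(f"{i} feature(s): R^2={r2:.4f}  AIC={aic:.2f}")
```

Because the models are nested, R² can never decrease as features are added. That is why penalized criteria such as AIC and BIC are the more useful guide for deciding where to stop.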
A full list of methods is available here.
Summary
It is not enough to just fit the data and predict. There are many statistical significance tests and residual checks that should be performed for a rigorous regression analysis. Having a single lightweight library can help achieve that with a minimal amount of coding. That is what we demonstrated in this article.
Read the full documentation of this library (mlr) here: Documentation.