Linear Regression in Python


What is Linear Regression (LR)?

  • Linear regression (LR) models the relationship between an explanatory/independent/predictor/regressor/exogenous variable (X) and a dependent/response/criterion/endogenous variable (Y)
  • For example, how blood pressure is influenced by a person’s age and weight; this relationship can be explained using linear regression
  • In LR, the Y variable should be continuous, whereas the X variable can be continuous or categorical. If both X and Y are continuous, the strength of the linear relationship can be estimated using the correlation coefficient (r) or the coefficient of determination (r-squared)
  • LR is useful if the relationships between the X and Y variables are linear
  • LR is helpful to predict the value of Y based on the value of the X variable

Types of Linear Regression (LR)

  • Univariate (simple) LR: the linear relationship between the Y and X variables is explained by a single X variable

\( Y = a + bX + \epsilon \)
Where, a = y-intercept, b = slope of the regression line and \( \epsilon \) = error term (residuals)

  • Multiple LR: the linear relationship between the Y and X variables is explained by multiple X variables

    \( Y = a + b_1X_1 + b_2X_2 + b_3X_3 + ... + b_nX_n + \epsilon \)
    Where, a = y-intercept, \( b_1, b_2, ..., b_n \) = regression coefficients (slopes) and \( \epsilon \) = error term (residuals)

  • The y-intercept (a) is a constant, and the slope (b) of the regression line is a regression coefficient.
  • See How to perform multiple linear regression for a full tutorial (a minimal sketch follows below)
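Before that tutorial, here is a minimal sketch of a multiple LR fit using plain NumPy least squares. The two-variable data below is made up purely for illustration:

>>> import numpy as np
# hypothetical data for illustration only (two X variables, one Y)
>>> X = np.array([[25, 4], [30, 6], [18, 3], [15, 2], [20, 5]], dtype=float)
>>> Y = np.array([670, 690, 635, 625, 640], dtype=float)
# add a column of ones so that lstsq also estimates the y-intercept (a)
>>> X_design = np.column_stack([np.ones(len(X)), X])
# solve for [a, b1, b2] that minimize the sum of squared residuals
>>> coef, res, rank, sv = np.linalg.lstsq(X_design, Y, rcond=None)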

Linear Regression (LR) Assumptions

  • Relationship between the X and Y variables should be linear
  • Errors (residuals) should be independent of each other
  • Errors (residuals) should be normally distributed with a mean of 0
  • Errors (residuals) should have equal variance (homoscedasticity)

Linear Regression (LR) Outputs

Correlation coefficient (r)

  • r describes a linear relationship between X and Y variables
  • r > 0 indicates a positive linear relationship between the X and Y variables: as one variable increases, the other also increases. r = 1 is a perfect positive linear relationship
  • Similarly, r < 0 indicates a negative linear relationship: as one variable increases, the other decreases. r = -1 is a perfect negative linear relationship
  • r = 0 indicates that there is no linear relationship between the X and Y variables

Coefficient of determination (r-squared)

  • r-squared is the square of the correlation coefficient (r), and is usually represented as a percentage
  • r-squared represents the proportion of the variation in the Y variable that is explained by the fitted regression line
  • r-squared can range from 0 to 1 (0 to 100%). r-squared = 1 (100%) indicates that the fitted regression line explains all the variability of the Y variable around its mean. A quick computation of both r and r-squared is sketched below this list
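As mentioned above, both quantities are easy to compute directly; a minimal NumPy sketch, using the first five rows of the dataset analyzed later in this article:

>>> import numpy as np
>>> x = np.array([25, 30, 18, 15, 20], dtype=float)
>>> y = np.array([670, 690, 635, 625, 640], dtype=float)
# correlation coefficient (r) from the off-diagonal of the 2x2 correlation matrix
>>> r = np.corrcoef(x, y)[0, 1]
# coefficient of determination (r-squared) is the square of r
>>> r_squared = r ** 2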

Residuals (regression error)

  • Residuals (regression error) represent the distances of the observed data points from the fitted regression line

\( residuals = actual \ Y \ (y_i) - predicted \ Y \ (\hat{y}_i) \)

Root Mean Square Error (RMSE)

  • RMSE represents the standard deviation of the residuals. It gives an estimate of the spread of the observed data points around the fitted regression line
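A minimal sketch of both quantities with NumPy, reusing the observed Y and the fitted values (yhat) that are computed later in this article:

>>> import numpy as np
# observed Y and predicted Y (first five fitted values from the LR model below)
>>> y = np.array([670, 690, 635, 625, 640], dtype=float)
>>> y_hat = np.array([662.88, 681.64, 636.62, 625.36, 644.12])
# residuals = actual Y - predicted Y
>>> residuals = y - y_hat
# RMSE is the square root of the mean squared residual
>>> rmse = np.sqrt(np.mean(residuals ** 2))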

Linear Regression (LR) in Python

Perform Linear Regression (LR)

# you can use interactive python interpreter, jupyter notebook, google colab, spyder or python code
# I am using interactive python interpreter (Python 3.8.2)
>>> from bioinfokit.analys import stat, get_data
# load dataset as pandas dataframe
# should not have missing values (NaN)
>>> df = get_data('slr').data
>>> df.head()
   X1    Y
0  25  670
1  30  690
2  18  635
3  15  625
4  20  640

# LR with one independent variable
>>> reg = stat()
>>> reg.lin_reg(df=df, x=["X1"], y=["Y"])

# output
Regression equation:

569.0916 + (3.7515*X1)

Regression Summary:


----------------------------------------  --------
Dependent variables                       ['Y']
Independent variables                     ['X1']
Coefficient of determination (r-squared)  0.918
Adjusted r-squared                        0.9141
Root Mean Square Error (RMSE)             6.6718
Mean of Y                                 639.3913
Residual standard error                   6.9808
No. of Observations                       23

Regression Coefficients:

Parameter      Estimate    Std Error    t-value    P-value Pr(>|t|)
-----------  ----------  -----------  ---------  ------------------
Intercept     569.092         4.8094   118.329           3.7802e-31
X1              3.75149       0.2446    15.3373          7.0019e-13


ANOVA Summary:

Source      Df    Sum Squares  Mean Squares    F         Pr(>F)
--------  ----  -------------  --------------  --------  ----------
Model        1       11462.1   11462.1214      235.2108  7.0080E-13
Error       21        1023.36  48.7313
Total       22       12485.5

# you can check predicted response (y_hat)
>>> reg.y_hat
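As a quick sanity check, the fitted values can be reproduced by applying the reported regression equation directly (a minimal sketch; small differences arise from rounding of the printed coefficients):

# apply the reported regression equation to X1
>>> y_hat_manual = 569.0916 + 3.7515 * df['X1']
# y_hat_manual should closely match reg.y_hat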

Interpretation

  • The regression line with equation [Y = 569.0916 + (3.7515*X1)] is helpful for predicting the value of the Y variable from a given value of the X1 variable. In general, regression is most useful for predicting Y at values within the observed range of X1. It can also be used to predict Y from X1 values outside that range, but such extrapolation may not be reliable
  • The slope (3.7515) represents the change in Y per unit change in the X1 variable, i.e. the value of Y increases by 3.7515 with each unit increase in X1
  • The intercept (569.0916) represents the value of Y at X1 = 0. Be cautious when interpreting the intercept, as sometimes X1 = 0 does not make sense (e.g., the speed of a car or the height of a person). In such cases, only values within the observed range of X1 should be used to interpret the model
  • The coefficient of determination (r-squared) is 0.918 (91.8%), which suggests that 91.8% of the variance in Y can be explained by X1 alone. The adjusted r-squared is useful when there are multiple X variables in the model (multiple linear regression)
  • The correlation coefficient (r) is 0.9581, which suggests a strong positive relationship between the X1 and Y variables: as X1 increases, Y also increases
  • From the regression coefficients, the P-value obtained for the X1 independent variable is significant (< 0.05), suggesting that X1 significantly influences the response variable Y (there is a significant relationship between X1 and Y)
  • From the ANOVA, the P-value is significant (< 0.05), which suggests that there is a significant relationship between X1 and Y, and that the X1 variable can reliably predict the Y variable. These results can be cross-checked with statsmodels, as sketched below
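The same fit can be cross-checked with statsmodels (a hedged sketch; its OLS summary should reproduce essentially the same coefficients, r-squared, and ANOVA F statistic):

>>> import statsmodels.api as sm
# add a constant column so that OLS also estimates the intercept
>>> X = sm.add_constant(df['X1'])
>>> ols_model = sm.OLS(df['Y'], X).fit()
>>> print(ols_model.summary())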

Linear Regression (LR) plot

Generate regression plot,

>>> import pandas as pd
>>> from bioinfokit import visuz
# get predicted Y and add to original dataframe
>>> df['yhat']=reg.y_hat
>>> df.head()
   X1    Y        yhat
0  25  670  662.878924
1  30  690  681.636398
2  18  635  636.618460
3  15  625  625.363976
4  20  640  644.121450

# create regression plot with defaults
>>> visuz.stat.regplot(df=df, x='X1', y='Y', yhat='yhat')
# plot will be saved in same dir (reg_plot.png)
# set parameter show=True if you want to view the image instead of saving it

Check Linear Regression (LR) Assumptions

Residuals vs fitted (y_hat) plot: This plot is used to check for linearity, equal variances, and outliers in the regression data

# get residuals and standardized residuals and add to original dataframe
>>> df['res']=reg.residuals
>>> df['std_res']=reg.std_residuals
>>> df.head()
   X1    Y        yhat       res   std_res
0  25  670  662.878924  7.121076  1.017297
1  30  690  681.636398  8.363602  1.194800
2  18  635  636.618460 -1.618460 -0.231209
3  15  625  625.363976 -0.363976 -0.051997
4  20  640  644.121450 -4.121450 -0.588779

# create fitted (y_hat) vs residuals plot
>>> visuz.stat.reg_resid_plot(df=df, yhat='yhat', resid='res', stdresid='std_res')
# plot will be saved in same dir (resid_plot.png and std_resid_plot.png)
# set parameter show=True if you want to view the image instead of saving it

From the plot,

  • As the data points are fairly evenly distributed around the line = 0 in the residuals plot, the assumption of equal variance of the residuals (homoscedasticity) is met; a formal test for this assumption is sketched below this list. Outliers would show up here as points lying far away from the line = 0
  • In the standardized residuals plot, the residuals lie within the -2 to +2 range, which suggests that the assumption of linearity is met
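As noted above, the visual check can be supplemented with a formal test. A hedged sketch of the Breusch-Pagan test for homoscedasticity with statsmodels, using the residuals (res) computed above:

>>> import statsmodels.api as sm
>>> from statsmodels.stats.diagnostic import het_breuschpagan
# the explanatory variables must include a constant term
>>> exog = sm.add_constant(df['X1'])
# returns (LM statistic, LM p-value, F statistic, F p-value)
>>> bp_test = het_breuschpagan(df['res'], exog)
# a p-value above 0.05 is consistent with equal variance (homoscedasticity)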

Quantile-quantile (QQ) plot: This plot is used to check the assumption that the residuals are normally distributed

# you can use interactive python interpreter, jupyter notebook, spyder or python code
# I am using interactive python interpreter (Python 3.8.2)
>>> import statsmodels.api as sm
>>> import matplotlib.pyplot as plt
# check dataframe for required variables (we will need std_res variable for QQ plot)
>>> df.head()
   X1    Y        yhat       res   std_res
0  25  670  662.878924  7.121076  1.017297
1  30  690  681.636398  8.363602  1.194800
2  18  635  636.618460 -1.618460 -0.231209
3  15  625  625.363976 -0.363976 -0.051997
4  20  640  644.121450 -4.121450 -0.588779

# create QQ plot
# line='45' option to plot the data against a 45-degree reference line
>>> sm.qqplot(df['std_res'], line='45')
>>> plt.xlabel("Theoretical Quantiles")
>>> plt.ylabel("Standardized Residuals")
>>> plt.show()

From the plot,

  • As the standardized residuals lie along the 45-degree line, this suggests that the residuals are normally distributed; a formal normality test is sketched below
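The visual impression can be backed by a formal normality test; a minimal sketch using SciPy’s Shapiro-Wilk test on the residuals:

>>> from scipy import stats
# Shapiro-Wilk test; the null hypothesis is that the data are normally distributed
>>> w, p_value = stats.shapiro(df['res'])
# a p-value above 0.05 is consistent with normally distributed residuals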

See also: How to perform multiple linear regression

Linear Regression with PyTorch

Let’s perform LR with PyTorch on the same dataset,

>>> from bioinfokit.analys import stat, get_data
>>> import torch as th
>>> df = get_data('slr').data
>>> df.head()
   X1    Y
0  25  670
1  30  690
2  18  635
3  15  625
4  20  640

# convert to PyTorch tensor 
# variable shape should be (samples, features)
>>> X1 = th.tensor(df[['X1']].values, dtype=th.float32)
>>> Y = th.tensor(df[['Y']].values, dtype=th.float32)

# LR model using PyTorch
>>> in_features = 1 # number of independent variable
>>> out_features = 1 # dim of predicted variable
>>> lr_model = th.nn.Linear(in_features, out_features)

# define loss function (regression error)
>>> mse_loss = th.nn.MSELoss()
# optimizer to minimize the loss function and find the optimal LR parameters (regression coefficients)
>>> optimizer = th.optim.SGD(lr_model.parameters(), lr=0.002)
# set number of iterations until you see the convergence in the loss function
# when you see similar minimum values for a large number of ending iterations
# it will give you the best values of LR parameters
>>> n_iter = 20000
>>> for i in range(n_iter):
...     # forward pass: predict with the current LR parameters
...     y_pred = lr_model(X1)
...     # calculate the loss function (MSE)
...     step_loss = mse_loss(y_pred, Y)
...     # reset any stored gradients to zero
...     optimizer.zero_grad()
...     # backward pass: compute derivatives of the loss with respect to the LR parameters
...     step_loss.backward()
...     # update the LR parameters for the current step
...     optimizer.step()
...     print('i [{}], Loss: {:.2f}'.format(i, step_loss.item()))

# see the last few iterations
i [19986], Loss: 44.51
i [19987], Loss: 44.51
i [19988], Loss: 44.51
i [19989], Loss: 44.51
i [19990], Loss: 44.51
i [19991], Loss: 44.51
i [19992], Loss: 44.51
i [19993], Loss: 44.51
i [19994], Loss: 44.51
i [19995], Loss: 44.51
i [19996], Loss: 44.51
i [19997], Loss: 44.51
i [19998], Loss: 44.51
i [19999], Loss: 44.51

Now, get the best LR parameters from this trained model,

# intercept (a)
>>> lr_model.bias.item()
568.7089233398438
# slope (b)
>>> lr_model.weight.item()
3.7700467109680176

Generate regression plot,

>>> from bioinfokit.visuz import stat
# detach() excludes the tensor from the gradient computational graph (no backpropagation)
>>> y_pred = lr_model(X1).detach()
>>> df['yhat']=y_pred.numpy()
>>> stat.regplot(df=df, x='X1', y='Y', yhat='yhat')

Now, let’s predict Y for a new value of X1,

>>> X1_data = 28
# predict Y value when X1 is 28
>>> y_pred = lr_model(th.tensor([[X1_data]], dtype=th.float32)).detach()
>>> y_pred.item()
674.2701416015625
# Y=674.27 when X1 is 28
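As a final sanity check, the gradient-descent estimates can be compared against the closed-form least-squares solution (a minimal NumPy sketch; the PyTorch intercept and slope above should approximate these values):

>>> import numpy as np
# closed-form least-squares fit of a degree-1 polynomial; returns [slope, intercept]
>>> slope, intercept = np.polyfit(df['X1'], df['Y'], 1)
# slope and intercept should be close to lr_model.weight.item() and lr_model.bias.item()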

Last updated: July 12, 2020