Exercise 3.9

%matplotlib inline

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm

sns.set(style="white")
plt.style.use('seaborn-white')
# missing horsepower values are encoded as '?' in the raw file
df = pd.read_csv('../data/Auto.csv', na_values='?').dropna()
df.head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |
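
As a quick sanity check on the na_values handling, we can compare the row counts before and after dropping the '?' entries:

```python
# the standard ISLR Auto.csv has 397 rows, 5 with horsepower recorded as '?'
raw = pd.read_csv('../data/Auto.csv')
print(len(raw), len(df))  # expect: 397 392
```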

(a)

# http://seaborn.pydata.org/generated/seaborn.PairGrid.html

g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3, legend=False);

*Figure: scatterplot matrix (PairGrid) of the Auto variables, with KDE contours below the diagonal.*

(b)

# pandas' corr() function takes care of excluding non-numeric data:
# https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/core/frame.py#L4721

df.corr()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| mpg | 1.000000 | -0.777618 | -0.805127 | -0.778427 | -0.832244 | 0.423329 | 0.580541 | 0.565209 |
| cylinders | -0.777618 | 1.000000 | 0.950823 | 0.842983 | 0.897527 | -0.504683 | -0.345647 | -0.568932 |
| displacement | -0.805127 | 0.950823 | 1.000000 | 0.897257 | 0.932994 | -0.543800 | -0.369855 | -0.614535 |
| horsepower | -0.778427 | 0.842983 | 0.897257 | 1.000000 | 0.864538 | -0.689196 | -0.416361 | -0.455171 |
| weight | -0.832244 | 0.897527 | 0.932994 | 0.864538 | 1.000000 | -0.416839 | -0.309120 | -0.585005 |
| acceleration | 0.423329 | -0.504683 | -0.543800 | -0.689196 | -0.416839 | 1.000000 | 0.290316 | 0.212746 |
| year | 0.580541 | -0.345647 | -0.369855 | -0.416361 | -0.309120 | 0.290316 | 1.000000 | 0.181528 |
| origin | 0.565209 | -0.568932 | -0.614535 | -0.455171 | -0.585005 | 0.212746 | 0.181528 | 1.000000 |

Extra

Why not a correlation heatmap as well?

http://seaborn.pydata.org/examples/network_correlations.html

corrmat = df.corr()
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 9))

# Draw the heatmap using seaborn
sns.heatmap(corrmat, vmax=.8, square=True)

f.tight_layout()

*Figure: correlation heatmap of the Auto variables.*

(c)

reg = smf.ols('mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin', df).fit()
reg.summary()
**OLS Regression Results**

| | | | |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.821 |
| Model: | OLS | Adj. R-squared: | 0.818 |
| Method: | Least Squares | F-statistic: | 252.4 |
| Date: | Fri, 08 Dec 2017 | Prob (F-statistic): | 2.04e-139 |
| Time: | 09:49:08 | Log-Likelihood: | -1023.5 |
| No. Observations: | 392 | AIC: | 2063. |
| Df Residuals: | 384 | BIC: | 2095. |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -17.2184 | 4.644 | -3.707 | 0.000 | -26.350 | -8.087 |
| cylinders | -0.4934 | 0.323 | -1.526 | 0.128 | -1.129 | 0.142 |
| displacement | 0.0199 | 0.008 | 2.647 | 0.008 | 0.005 | 0.035 |
| horsepower | -0.0170 | 0.014 | -1.230 | 0.220 | -0.044 | 0.010 |
| weight | -0.0065 | 0.001 | -9.929 | 0.000 | -0.008 | -0.005 |
| acceleration | 0.0806 | 0.099 | 0.815 | 0.415 | -0.114 | 0.275 |
| year | 0.7508 | 0.051 | 14.729 | 0.000 | 0.651 | 0.851 |
| origin | 1.4261 | 0.278 | 5.127 | 0.000 | 0.879 | 1.973 |

| | | | |
|---|---|---|---|
| Omnibus: | 31.906 | Durbin-Watson: | 1.309 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 53.100 |
| Skew: | 0.529 | Prob(JB): | 2.95e-12 |
| Kurtosis: | 4.460 | Cond. No. | 8.59e+04 |

i

Yes, there is a relationship between the predictors and the response. In the table above we can see that the value of the F-statistic is 252, which is much larger than 1, so we can reject the corresponding null hypothesis:

$$H_0 : \beta_{cylinders} = \beta_{displacement} = \beta_{horsepower} = \beta_{weight} = \beta_{acceleration} = \beta_{year} = \beta_{origin} = 0.$$

In fact, the probability of observing an F-statistic at least this large if $H_0$ were true is Prob (F-statistic) $= 2 \times 10^{-139}$, a ridiculously low value.
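
Both numbers can be read directly off the fitted results object:

```python
# overall F-statistic and its p-value
print(reg.fvalue, reg.f_pvalue)
```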

ii

We can see which predictors have a statistically significant relationship with the response by looking at the p-values in the table above. Weight, year, and origin are clearly significant, and displacement (p = 0.008) arguably is as well; cylinders, horsepower, and acceleration are not.
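
A quick way to list them, using the usual 5% threshold:

```python
# coefficients significant at the 5% level
reg.pvalues[reg.pvalues < 0.05]
```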

iii

The coefficient suggests that, on average, when the other variables are held constant, each additional year (of production) corresponds to an increase of 0.75 mpg (so, the more recent the car, the more efficient).
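
For reference, the point estimate and its 95% confidence interval:

```python
# coefficient of year with its 95% confidence interval
print(reg.params['year'])
print(reg.conf_int().loc['year'])
```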

(d)

In R [1], by default, plot() on a fit produces 4 plots:

* a plot of residuals against fitted values,
* a Scale-Location plot of sqrt(|residuals|) against fitted values,
* a Normal Q-Q plot,
* a plot of residuals against leverages.

Below, we reproduce each of these 4 plots, using the code published by Emre Can [2] with a few adaptations.

import statsmodels.formula.api as smf
from statsmodels.graphics.gofplots import ProbPlot

plt.style.use('seaborn') # pretty matplotlib plots
plt.rc('font', size=14)
plt.rc('figure', titlesize=18)
plt.rc('axes', labelsize=15)
plt.rc('axes', titlesize=18)

model_f = 'mpg ~ cylinders + \
                 displacement + \
                 horsepower + \
                 weight + \
                 acceleration + \
                 year + \
                 origin'

df.reset_index(drop=True, inplace=True)

model = smf.ols(formula=model_f, data=df)

model_fit = model.fit()

# fitted values (the formula API adds the intercept automatically)
model_fitted_y = model_fit.fittedvalues

# model residuals
model_residuals = model_fit.resid

# normalized residuals
model_norm_residuals = model_fit.get_influence().resid_studentized_internal

# absolute squared normalized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

# absolute residuals
model_abs_resid = np.abs(model_residuals)

# leverage, from statsmodels internals
model_leverage = model_fit.get_influence().hat_matrix_diag

# cook's distance, from statsmodels internals
model_cooks = model_fit.get_influence().cooks_distance[0]
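
As an aside, the same influence measures (and a few more) are available as a single table via summary_frame(); a quick look:

```python
# leverage, Cook's distance, and studentized residuals in one table
influence_table = model_fit.get_influence().summary_frame()
influence_table[['hat_diag', 'cooks_d', 'student_resid']].head()
```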

Residuals against fitted values

plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)

plot_lm_1.axes[0] = sns.residplot(model_fitted_y, 'mpg', data=df,
                                  lowess=True,
                                  scatter_kws={'alpha': 0.5},
                                  line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_1.axes[0].set_title('Residuals vs Fitted')
plot_lm_1.axes[0].set_xlabel('Fitted values')
plot_lm_1.axes[0].set_ylabel('Residuals')


# annotations
abs_resid = model_abs_resid.sort_values(ascending=False)
abs_resid_top_3 = abs_resid[:3]

for i in abs_resid_top_3.index:
    plot_lm_1.axes[0].annotate(i, 
                               xy=(model_fitted_y[i], 
                                   model_residuals[i]));

*Figure: Residuals vs Fitted, with the 3 largest residuals annotated.*

Normal Q-Q plot

QQ = ProbPlot(model_norm_residuals)
plot_lm_2 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)

plot_lm_2.set_figheight(8)
plot_lm_2.set_figwidth(12)

plot_lm_2.axes[0].set_title('Normal Q-Q')
plot_lm_2.axes[0].set_xlabel('Theoretical Quantiles')
plot_lm_2.axes[0].set_ylabel('Standardized Residuals');

# annotations
abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]

for r, i in enumerate(abs_norm_resid_top_3):
    plot_lm_2.axes[0].annotate(i, 
                               xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                   model_norm_residuals[i]));

*Figure: Normal Q-Q plot of the standardized residuals.*

Scale-Location plot of sqrt(|residuals|) against fitted values

plot_lm_3 = plt.figure(3)
plot_lm_3.set_figheight(8)
plot_lm_3.set_figwidth(12)

plt.scatter(model_fitted_y, model_norm_residuals_abs_sqrt, alpha=0.5)
sns.regplot(model_fitted_y, model_norm_residuals_abs_sqrt, 
            scatter=False, 
            ci=False, 
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_3.axes[0].set_title('Scale-Location')
plot_lm_3.axes[0].set_xlabel('Fitted values')
plot_lm_3.axes[0].set_ylabel('$\sqrt{|Standardized Residuals|}$');


for i in abs_norm_resid_top_3:
    plot_lm_3.axes[0].annotate(i, 
                               xy=(model_fitted_y[i], 
                                   model_norm_residuals_abs_sqrt[i]));

*Figure: Scale-Location plot.*

Residuals against leverages

plot_lm_4 = plt.figure(4)
plot_lm_4.set_figheight(8)
plot_lm_4.set_figwidth(12)

plt.scatter(model_leverage, model_norm_residuals, alpha=0.5)
sns.regplot(model_leverage, model_norm_residuals, 
            scatter=False, 
            ci=False, 
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_4.axes[0].set_xlim(0, 0.20)
plot_lm_4.axes[0].set_ylim(-3, 5)
plot_lm_4.axes[0].set_title('Residuals vs Leverage')
plot_lm_4.axes[0].set_xlabel('Leverage')
plot_lm_4.axes[0].set_ylabel('Standardized Residuals')

# annotations
leverage_top_3 = np.flip(np.argsort(model_cooks), 0)[:3]

for i in leverage_top_3:
    plot_lm_4.axes[0].annotate(i, 
                               xy=(model_leverage[i], 
                                   model_norm_residuals[i]))

# shenanigans for cook's distance contours
def graph(formula, x_range, label=None, ls='-'):
    x = x_range
    y = formula(x)
    plt.plot(x, y, label=label, lw=1, ls=ls, color='red')

p = len(model_fit.params) # number of model parameters

graph(lambda x: np.sqrt((0.5 * p * (1 - x)) / x), 
      np.linspace(0.001, 0.200, 50), 
      'Cook\'s distance = .5', ls='--') # 0.5 line

graph(lambda x: np.sqrt((1 * p * (1 - x)) / x), 
      np.linspace(0.001, 0.200, 50), 'Cook\'s distance = 1', ls=':') # 1 line

plt.legend(loc='upper right');

*Figure: Residuals vs Leverage, with Cook's distance contours at 0.5 and 1.*

Comments

No, there are no unusually large outliers, as per the Scale-Location plot. The residuals are, however, skewed in their distribution: the larger the fitted value, the larger the variance, since the spread of the residuals increases.

No. Even though there is an observation (number 13) with higher leverage than the rest, it still falls well within the Cook's distance = 0.5 contour.
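
For reference, the contour curves drawn in that plot follow from the definition of Cook's distance for a model with $p$ parameters,

$$D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}},$$

where $r_i$ is the standardized residual and $h_{ii}$ the leverage. Fixing $D_i = c$ and solving for $r_i$ gives $r = \sqrt{c \, p \, (1 - h)/h}$, which is exactly the curve that the graph() helper plots for $c = 0.5$ and $c = 1$.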

The normal Q-Q plot deviates at one extreme, which could indicate that there are other explanatory terms we are not considering (quadratic terms, for example). Additionally, the funnel shape of the residuals plot indicates heteroskedasticity.

References

[1] https://stat.ethz.ch/R-manual/R-devel/library/stats/html/plot.lm.html

[2] https://emredjan.github.io/blog/2017/07/11/emulating-r-plots-in-python/

(e)

Statsmodels uses patsy, a "mini-language" inspired by R and S for describing statistical models. The symbols ':' and '*' have the same meaning as in R: a:b adds an interaction term between a and b, while a*b is shorthand for a + b + a:b, i.e., it includes the main effects a and b as well.
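
We can verify the expansion directly with patsy (it ships as a statsmodels dependency); both formulas below produce the same design-matrix columns:

```python
import patsy

# a*b expands to the main effects plus the interaction
d1 = patsy.dmatrix('horsepower * year', df)
d2 = patsy.dmatrix('horsepower + year + horsepower:year', df)
print(d1.design_info.column_names)
print(d2.design_info.column_names)
# both print: ['Intercept', 'horsepower', 'year', 'horsepower:year']
```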


So, which pairs of variables would we expect to interact, both a priori (from our interpretation of the meaning of these variables) and from the pairs plot?

Perhaps horsepower and year? What would this mean? It would mean that, for different years, varying horsepower has a different effect on mpg. It seems plausible. We could also interpret it in the reverse order: for different values of horsepower, does varying year have a different effect on mpg? For example, does the change in mpg when varying year (i.e., the derivative $\partial\,mpg/\partial\,year$) differ when holding horsepower at either 130 or 160?

Let's find out.

reg = smf.ols('mpg ~ horsepower*year + displacement + weight + origin', df).fit()
reg.summary()
**OLS Regression Results**

| | | | |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.851 |
| Model: | OLS | Adj. R-squared: | 0.849 |
| Method: | Least Squares | F-statistic: | 367.0 |
| Date: | Fri, 08 Dec 2017 | Prob (F-statistic): | 7.51e-156 |
| Time: | 09:49:11 | Log-Likelihood: | -987.81 |
| No. Observations: | 392 | AIC: | 1990. |
| Df Residuals: | 385 | BIC: | 2017. |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -96.6688 | 9.667 | -10.000 | 0.000 | -115.675 | -77.663 |
| horsepower | 0.7993 | 0.092 | 8.687 | 0.000 | 0.618 | 0.980 |
| year | 1.8179 | 0.128 | 14.221 | 0.000 | 1.567 | 2.069 |
| horsepower:year | -0.0113 | 0.001 | -8.977 | 0.000 | -0.014 | -0.009 |
| displacement | 0.0068 | 0.005 | 1.344 | 0.180 | -0.003 | 0.017 |
| weight | -0.0054 | 0.001 | -10.170 | 0.000 | -0.006 | -0.004 |
| origin | 1.1866 | 0.253 | 4.684 | 0.000 | 0.688 | 1.685 |

| | | | |
|---|---|---|---|
| Omnibus: | 21.932 | Durbin-Watson: | 1.488 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 33.066 |
| Skew: | 0.411 | Prob(JB): | 6.60e-08 |
| Kurtosis: | 4.161 | Cond. No. | 5.60e+05 |

From the p-value of the coefficient of the interaction term between horsepower and year, it does seem that there is a statistically significant relationship between the response and horsepower:year.
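
To answer the question posed earlier, the fitted model implies a marginal effect of year that depends on horsepower, $\partial\,mpg/\partial\,year = \beta_{year} + \beta_{horsepower:year} \cdot horsepower$. A quick computation from the coefficients above:

```python
# d(mpg)/d(year) at two horsepower levels, from the fitted coefficients
b = reg.params
for hp in (130, 160):
    print(hp, b['year'] + b['horsepower:year'] * hp)
```

At 130 hp an extra year is worth roughly 0.35 mpg, while at 160 hp it is worth only about 0.01 mpg.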

With 7 predictors, there are a total of 21 possible pairwise interaction terms. For simplicity's sake, we will exclude all the terms involving cylinders and acceleration, leaving us with 10 interaction terms. Let's try to fit a model with these terms plus the 5 main effects, a total of 15 terms.
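
Rather than typing the formula out by hand (as we do below), we could also generate it programmatically; a minimal sketch:

```python
from itertools import combinations

# build the 10 pairwise 'a:b' terms from the 5 retained predictors
main_terms = ['displacement', 'horsepower', 'origin', 'weight', 'year']
interactions = ' + '.join('{}:{}'.format(a, b)
                          for a, b in combinations(main_terms, 2))
model = 'mpg ~ ' + ' + '.join(main_terms) + ' + ' + interactions
```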

model = 'mpg ~ displacement + horsepower + origin + weight + year \
               + displacement:horsepower + displacement:origin + displacement:weight + displacement:year \
               + horsepower:origin + horsepower:weight + horsepower:year + origin:weight + origin:year + weight:year'
reg = smf.ols(model, df).fit()
reg.summary()
**OLS Regression Results**

| | | | |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.880 |
| Model: | OLS | Adj. R-squared: | 0.875 |
| Method: | Least Squares | F-statistic: | 184.0 |
| Date: | Fri, 08 Dec 2017 | Prob (F-statistic): | 1.09e-162 |
| Time: | 09:49:11 | Log-Likelihood: | -945.49 |
| No. Observations: | 392 | AIC: | 1923. |
| Df Residuals: | 376 | BIC: | 1987. |
| Df Model: | 15 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -51.3746 | 26.175 | -1.963 | 0.050 | -102.843 | 0.093 |
| displacement | -0.1818 | 0.120 | -1.521 | 0.129 | -0.417 | 0.053 |
| horsepower | 0.9485 | 0.232 | 4.089 | 0.000 | 0.492 | 1.405 |
| origin | -3.0637 | 5.496 | -0.557 | 0.578 | -13.871 | 7.744 |
| weight | -0.0174 | 0.016 | -1.115 | 0.265 | -0.048 | 0.013 |
| year | 1.3975 | 0.328 | 4.267 | 0.000 | 0.754 | 2.042 |
| displacement:horsepower | -0.0001 | 0.000 | -0.815 | 0.416 | -0.000 | 0.000 |
| displacement:origin | 0.0282 | 0.013 | 2.172 | 0.030 | 0.003 | 0.054 |
| displacement:weight | 2.792e-05 | 5.99e-06 | 4.663 | 0.000 | 1.61e-05 | 3.97e-05 |
| displacement:year | 0.0010 | 0.001 | 0.710 | 0.478 | -0.002 | 0.004 |
| horsepower:origin | -0.0629 | 0.020 | -3.104 | 0.002 | -0.103 | -0.023 |
| horsepower:weight | -1.175e-05 | 1.77e-05 | -0.664 | 0.507 | -4.65e-05 | 2.3e-05 |
| horsepower:year | -0.0114 | 0.003 | -3.998 | 0.000 | -0.017 | -0.006 |
| origin:weight | 0.0014 | 0.001 | 1.200 | 0.231 | -0.001 | 0.004 |
| origin:year | 0.0322 | 0.069 | 0.464 | 0.643 | -0.104 | 0.169 |
| weight:year | 7.438e-05 | 0.000 | 0.394 | 0.694 | -0.000 | 0.000 |

| | | | |
|---|---|---|---|
| Omnibus: | 52.658 | Durbin-Watson: | 1.599 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 115.208 |
| Skew: | 0.710 | Prob(JB): | 9.61e-26 |
| Kurtosis: | 5.244 | Cond. No. | 1.81e+08 |

These results indicate that the interactions that appear to be statistically significant are displacement:weight, horsepower:origin, and horsepower:year. Interestingly, once these are included, the only first-order terms that remain statistically significant are horsepower and year. By the hierarchy principle (page 89), we should nonetheless include all of the main effects (for more on this, see these answers).

We could also try interaction plots, which are not covered in the book, but we will leave them as a mention only.

(f)

As an example, we fit the data with a model containing the indicated transformations applied to the variable horsepower, starting with $X^2$.

reg = smf.ols('mpg ~ horsepower + np.power(horsepower,2) + weight + year + origin', df).fit()
reg.summary()
**OLS Regression Results**

| | | | |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.851 |
| Model: | OLS | Adj. R-squared: | 0.849 |
| Method: | Least Squares | F-statistic: | 439.5 |
| Date: | Fri, 08 Dec 2017 | Prob (F-statistic): | 7.11e-157 |
| Time: | 09:49:11 | Log-Likelihood: | -988.57 |
| No. Observations: | 392 | AIC: | 1989. |
| Df Residuals: | 386 | BIC: | 2013. |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -6.6457 | 3.915 | -1.698 | 0.090 | -14.343 | 1.052 |
| horsepower | -0.2441 | 0.027 | -9.099 | 0.000 | -0.297 | -0.191 |
| np.power(horsepower, 2) | 0.0008 | 9.13e-05 | 9.170 | 0.000 | 0.001 | 0.001 |
| weight | -0.0044 | 0.000 | -10.426 | 0.000 | -0.005 | -0.004 |
| year | 0.7456 | 0.046 | 16.145 | 0.000 | 0.655 | 0.836 |
| origin | 1.0465 | 0.238 | 4.405 | 0.000 | 0.579 | 1.514 |

| | | | |
|---|---|---|---|
| Omnibus: | 21.819 | Durbin-Watson: | 1.500 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 32.447 |
| Skew: | 0.414 | Prob(JB): | 9.00e-08 |
| Kurtosis: | 4.140 | Cond. No. | 4.10e+05 |

fig = plt.figure()
fitted = reg.fittedvalues
sns.regplot(fitted, df.mpg - fitted,  lowess=True, line_kws={'color':'r', 'lw':1})
ax = fig.axes[0]
ax.axhline(color="grey", ls="--")
ax.set_title("Residuals vs Fitted Values")
ax.set_xlabel("Fitted Values")
ax.set_ylabel("Residuals");

*Figure: Residuals vs Fitted Values for the model with the quadratic horsepower term.*

It is clear that this quadratic term is statistically significant. Let's try adding a logarithmic term as well.

reg = smf.ols('mpg ~ horsepower + np.power(horsepower,2) + np.log(horsepower) + weight + year + origin', df).fit()
reg.summary()
**OLS Regression Results**

| | | | |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.855 |
| Model: | OLS | Adj. R-squared: | 0.853 |
| Method: | Least Squares | F-statistic: | 378.6 |
| Date: | Fri, 08 Dec 2017 | Prob (F-statistic): | 4.62e-158 |
| Time: | 09:49:11 | Log-Likelihood: | -982.62 |
| No. Observations: | 392 | AIC: | 1979. |
| Df Residuals: | 385 | BIC: | 2007. |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 80.3681 | 25.568 | 3.143 | 0.002 | 30.098 | 130.638 |
| horsepower | 0.2570 | 0.148 | 1.737 | 0.083 | -0.034 | 0.548 |
| np.power(horsepower, 2) | -0.0002 | 0.000 | -0.571 | 0.568 | -0.001 | 0.000 |
| np.log(horsepower) | -27.5412 | 8.000 | -3.443 | 0.001 | -43.270 | -11.812 |
| weight | -0.0048 | 0.000 | -11.098 | 0.000 | -0.006 | -0.004 |
| year | 0.7561 | 0.046 | 16.565 | 0.000 | 0.666 | 0.846 |
| origin | 0.9480 | 0.236 | 4.016 | 0.000 | 0.484 | 1.412 |

| | | | |
|---|---|---|---|
| Omnibus: | 17.905 | Durbin-Watson: | 1.575 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 29.299 |
| Skew: | 0.312 | Prob(JB): | 4.34e-07 |
| Kurtosis: | 4.185 | Cond. No. | 2.84e+06 |

Now the p-value for the square term is very large. This indicates that there is indeed a non-linearity, but it seems to be better captured by the logarithm than by the square.
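
One likely reason the square term loses significance here is that horsepower, its square, and its logarithm are highly collinear. We can check with variance inflation factors; a minimal sketch using the design matrix of the fit above:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF of each column of the current model's design matrix; very large
# values confirm the transformed horsepower terms are highly collinear
exog = reg.model.exog
for i, name in enumerate(reg.model.exog_names):
    print(name, variance_inflation_factor(exog, i))
```

With that caveat in mind, let's try adding the square root term as well.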

reg = smf.ols('mpg ~ horsepower + np.power(horsepower,2) + np.log(horsepower) + np.sqrt(horsepower) + weight + year + origin', df).fit()
reg.summary()
**OLS Regression Results**

| | | | |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.859 |
| Model: | OLS | Adj. R-squared: | 0.856 |
| Method: | Least Squares | F-statistic: | 332.9 |
| Date: | Fri, 08 Dec 2017 | Prob (F-statistic): | 9.16e-159 |
| Time: | 09:49:12 | Log-Likelihood: | -977.89 |
| No. Observations: | 392 | AIC: | 1972. |
| Df Residuals: | 384 | BIC: | 2004. |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -426.3991 | 167.288 | -2.549 | 0.011 | -755.314 | -97.485 |
| horsepower | 8.4452 | 2.676 | 3.156 | 0.002 | 3.184 | 13.706 |
| np.power(horsepower, 2) | -0.0060 | 0.002 | -3.117 | 0.002 | -0.010 | -0.002 |
| np.log(horsepower) | 416.0064 | 144.951 | 2.870 | 0.004 | 131.009 | 701.004 |
| np.sqrt(horsepower) | -229.6161 | 74.927 | -3.065 | 0.002 | -376.934 | -82.298 |
| weight | -0.0048 | 0.000 | -11.229 | 0.000 | -0.006 | -0.004 |
| year | 0.7475 | 0.045 | 16.522 | 0.000 | 0.659 | 0.836 |
| origin | 0.9088 | 0.234 | 3.886 | 0.000 | 0.449 | 1.369 |

| | | | |
|---|---|---|---|
| Omnibus: | 23.395 | Durbin-Watson: | 1.570 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 37.804 |
| Skew: | 0.411 | Prob(JB): | 6.18e-09 |
| Kurtosis: | 4.281 | Cond. No. | 2.50e+07 |

So now the square term is back to a small p-value, indicating that it is statistically significant in the presence of the square root and the logarithm.
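
Since these three models are nested, we could also compare them with sequential partial F-tests; a minimal sketch, assuming the three fits above were kept under the hypothetical names reg_sq, reg_log, and reg_sqrt:

```python
# sequential F-tests: does each added horsepower transformation
# significantly reduce the residual sum of squares?
# reg_sq, reg_log, reg_sqrt are hypothetical names for the three fits above
sm.stats.anova_lm(reg_sq, reg_log, reg_sqrt)
```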