Exercise 7.8

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from patsy import dmatrix

%matplotlib inline
df = pd.read_csv('../data/Auto.csv')
# Visualize dataset
df.head()
    mpg  cylinders  displacement horsepower  weight  acceleration  year  origin                       name
0  18.0          8         307.0        130    3504          12.0    70       1  chevrolet chevelle malibu
1  15.0          8         350.0        165    3693          11.5    70       1          buick skylark 320
2  18.0          8         318.0        150    3436          11.0    70       1         plymouth satellite
3  16.0          8         304.0        150    3433          12.0    70       1              amc rebel sst
4  17.0          8         302.0        140    3449          10.5    70       1                ford torino

Pre-processing data

# 'horsepower' is stored as a string
# We would like it to be an integer so that it can appear in scatter plots.
type(df['horsepower'][0])
str
# 'horsepower' is a string because one of its values is '?'
df['horsepower'].unique()
array(['130', '165', '150', '140', '198', '220', '215', '225', '190',
       '170', '160', '95', '97', '85', '88', '46', '87', '90', '113',
       '200', '210', '193', '?', '100', '105', '175', '153', '180', '110',
       '72', '86', '70', '76', '65', '69', '60', '80', '54', '208', '155',
       '112', '92', '145', '137', '158', '167', '94', '107', '230', '49',
       '75', '91', '122', '67', '83', '78', '52', '61', '93', '148', '129',
       '96', '71', '98', '115', '53', '81', '79', '120', '152', '102',
       '108', '68', '58', '149', '89', '63', '48', '66', '139', '103',
       '125', '133', '138', '135', '142', '77', '62', '132', '84', '64',
       '74', '116', '82'], dtype=object)
# Replace the '?' values with the string '0' so the column can be converted
df['horsepower'] = df['horsepower'].replace(to_replace='?', value='0')
df['horsepower'].unique()  # Check if it's ok
array(['130', '165', '150', '140', '198', '220', '215', '225', '190',
       '170', '160', '95', '97', '85', '88', '46', '87', '90', '113',
       '200', '210', '193', '0', '100', '105', '175', '153', '180', '110',
       '72', '86', '70', '76', '65', '69', '60', '80', '54', '208', '155',
       '112', '92', '145', '137', '158', '167', '94', '107', '230', '49',
       '75', '91', '122', '67', '83', '78', '52', '61', '93', '148', '129',
       '96', '71', '98', '115', '53', '81', '79', '120', '152', '102',
       '108', '68', '58', '149', '89', '63', '48', '66', '139', '103',
       '125', '133', '138', '135', '142', '77', '62', '132', '84', '64',
       '74', '116', '82'], dtype=object)
# Convert 'horsepower' to int
df['horsepower'] = df['horsepower'].astype(int)
type(df['horsepower'][0])  # Check if it's ok
numpy.int32
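Note that encoding the missing values as 0 keeps those rows but introduces an artificial horsepower of zero. A common alternative, sketched below but not used in this exercise, is to coerce the column to numeric and drop the missing rows:

# Alternative (not used below): treat '?' as missing and drop those rows
df_alt = pd.read_csv('../data/Auto.csv')
df_alt['horsepower'] = pd.to_numeric(df_alt['horsepower'], errors='coerce')  # '?' -> NaN
df_alt = df_alt.dropna(subset=['horsepower'])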

Relationships analysis

# Scatterplot matrix using Seaborn
sns.pairplot(df);

[Figure: scatterplot matrix of all pairs of Auto variables]

mpg appears to have a non-linear relationship with displacement, horsepower, and weight. Since the book compares mpg to horsepower, we only analyse that relationship in this exercise.
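For a closer look at the single relationship analysed below, a quick scatter plot (a sketch; the same panel already appears in the pairplot above):

# Focused scatter plot of mpg vs. horsepower
plt.scatter(df['horsepower'], df['mpg'], alpha=0.5)
plt.xlabel('horsepower')
plt.ylabel('mpg')
plt.title('mpg vs. horsepower')
plt.show()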

Non-linear models

Non-linear models investigated in Chapter 7:

  • Polynomial regression
  • Step function
  • Regression splines
  • Smoothing splines
  • Local regression
  • Generalized additive models (GAM)

We will develop models for the first three, because we did not find Python functions for smoothing splines, local regression, or GAMs (see the Notes below).

# Dataset: scikit-learn expects a 2-D feature matrix
X = df['horsepower'].values[:, np.newaxis]
y = df['mpg']

Polynomial regression

for i in range(1, 11):
    model = Pipeline([('poly', PolynomialFeatures(degree=i)),
                      ('linear', LinearRegression())])
    # cross_val_score clones and fits the pipeline on each fold,
    # so no separate fit is needed here
    score = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

    print("Degree: %i  CV mean squared error: %.3f" % (i, np.mean(np.abs(score))))
Degree: 1  CV mean squared error: 32.573
Degree: 2  CV mean squared error: 30.515
Degree: 3  CV mean squared error: 27.440
Degree: 4  CV mean squared error: 25.396
Degree: 5  CV mean squared error: 24.986
Degree: 6  CV mean squared error: 24.937
Degree: 7  CV mean squared error: 24.831
Degree: 8  CV mean squared error: 24.907
Degree: 9  CV mean squared error: 25.358
Degree: 10  CV mean squared error: 25.270

The polynomial degree that best fits the relationship between mpg and horsepower is 7.
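To see what the selected model looks like, we can refit the degree-7 pipeline on the full data and plot its predictions over a grid of horsepower values (a sketch following the same Pipeline pattern used above):

# Refit the chosen degree-7 polynomial and plot the fitted curve
best_poly = Pipeline([('poly', PolynomialFeatures(degree=7)),
                      ('linear', LinearRegression())])
best_poly.fit(X, y)

grid = np.linspace(X.min(), X.max(), 200)[:, np.newaxis]
plt.scatter(df['horsepower'], df['mpg'], alpha=0.3)
plt.plot(grid, best_poly.predict(grid), color='red')
plt.xlabel('horsepower')
plt.ylabel('mpg')
plt.show()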

Step function

for i in range(1, 11):
    groups = pd.cut(df['horsepower'], i)   # i equal-width bins over horsepower
    X_step = pd.get_dummies(groups)        # one indicator column per bin
    y_step = df['mpg']

    # Use a fresh linear regression on the bin indicators; reusing `model`
    # from the previous cell would wrap the dummies in PolynomialFeatures
    step_model = LinearRegression()
    score = cross_val_score(step_model, X_step, y_step, cv=5,
                            scoring='neg_mean_squared_error')

    print('Number of cuts: %i  CV mean squared error: %.3f' % (i, np.mean(np.abs(score))))
Number of cuts: 1  CV mean squared error: 74.402
Number of cuts: 2  CV mean squared error: 43.120
Number of cuts: 3  CV mean squared error: 36.934
Number of cuts: 4  CV mean squared error: 41.236
Number of cuts: 5  CV mean squared error: 30.880
Number of cuts: 6  CV mean squared error: 29.083
Number of cuts: 7  CV mean squared error: 29.269
Number of cuts: 8  CV mean squared error: 28.562
Number of cuts: 9  CV mean squared error: 27.776
Number of cuts: 10  CV mean squared error: 24.407

Of the bin counts tried (1 to 10), the step function that best fits the relationship between mpg and horsepower is the one with 10 bins.
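The 10-bin fit can be visualised the same way (a sketch; predictions are piecewise constant within each bin):

# Refit the 10-bin step function and plot the fitted steps
groups = pd.cut(df['horsepower'], 10)
X_step = pd.get_dummies(groups)
step_model = LinearRegression().fit(X_step, df['mpg'])

order = np.argsort(df['horsepower'].values)  # sort for a readable step plot
plt.scatter(df['horsepower'], df['mpg'], alpha=0.3)
plt.plot(df['horsepower'].values[order],
         step_model.predict(X_step)[order], color='red')
plt.xlabel('horsepower')
plt.ylabel('mpg')
plt.show()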

Regression splines

There are two types of regression splines: plain splines and natural splines. The difference is that a natural spline is a regression spline with additional boundary constraints: the function is required to be linear at the boundaries. This additional constraint means that natural splines generally produce more stable estimates at the boundaries.

Cubic splines are the usual choice; they are popular because the discontinuity at the knots is not detectable by the human eye. We use both cubic and natural cubic splines in this exercise.

To generate the spline basis representations, the patsy package is used. patsy is a Python package for describing statistical models (especially linear models, or models that have a linear component). See the References for more details.
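As a minimal illustration of the dmatrix call used below (a sketch with a toy variable x introduced just for this example):

# Minimal patsy example: a formula string plus a data dict yields a design matrix
toy = dmatrix("bs(x, df=4, degree=3)", {"x": np.linspace(40, 230, 50)},
              return_type='dataframe')
print(toy.shape)  # (50, 5): intercept + 4 spline basis columns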

Cubic splines

for i in range(3, 11):  # patsy's bs() with degree=3 requires df >= 3
    # Cubic spline basis representation with df degrees of freedom;
    # knots are placed automatically at quantiles of horsepower
    transformed = dmatrix("bs(df.horsepower, df=%i, degree=3)" % i,
                          {"df.horsepower": df.horsepower},
                          return_type='dataframe')
    lin = LinearRegression()
    score = cross_val_score(lin, transformed, y, cv=10, scoring='neg_mean_squared_error')

    print('Number of degrees of freedom: %i  CV mean squared error: %.3f' %(i, np.mean(np.abs(score))))
Number of degrees of freedom: 3  CV mean squared error: 24.268
Number of degrees of freedom: 4  CV mean squared error: 22.195
Number of degrees of freedom: 5  CV mean squared error: 22.319
Number of degrees of freedom: 6  CV mean squared error: 22.028
Number of degrees of freedom: 7  CV mean squared error: 22.099
Number of degrees of freedom: 8  CV mean squared error: 22.145
Number of degrees of freedom: 9  CV mean squared error: 21.783
Number of degrees of freedom: 10  CV mean squared error: 21.965

Natural cubic splines

for i in range(3, 11):  # Start at df = 3 to match the cubic spline comparison above
    # Natural cubic spline basis representation with df degrees of freedom
    transformed = dmatrix("cr(df.horsepower, df=%i)" % i,
                          {"df.horsepower": df.horsepower},
                          return_type='dataframe')
    lin = LinearRegression()
    score = cross_val_score(lin, transformed, y, cv=10, scoring='neg_mean_squared_error')

    print('Number of degrees of freedom: %i  CV mean squared error: %.3f' %(i, np.mean(np.abs(score))))
Number of degrees of freedom: 3  CV mean squared error: 27.132
Number of degrees of freedom: 4  CV mean squared error: 24.297
Number of degrees of freedom: 5  CV mean squared error: 22.317
Number of degrees of freedom: 6  CV mean squared error: 21.721
Number of degrees of freedom: 7  CV mean squared error: 21.904
Number of degrees of freedom: 8  CV mean squared error: 22.019
Number of degrees of freedom: 9  CV mean squared error: 21.926
Number of degrees of freedom: 10  CV mean squared error: 22.201

The regression spline that best fits the relationship between mpg and horsepower is a natural cubic spline with 6 degrees of freedom. This is also the model that gives the best results of the three families considered in this exercise (note, though, that the polynomial and step-function models were scored with 5-fold rather than 10-fold CV, so the comparison is only indicative).
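As before, we can refit the winning natural cubic spline (df = 6) on the full data and plot the fitted curve (a sketch reusing the dmatrix call from above):

# Refit the df=6 natural cubic spline and plot the fitted curve
transformed = dmatrix("cr(df.horsepower, df=6)",
                      {"df.horsepower": df.horsepower},
                      return_type='dataframe')
best_spline = LinearRegression().fit(transformed, y)

order = np.argsort(df['horsepower'].values)
plt.scatter(df['horsepower'], df['mpg'], alpha=0.3)
plt.plot(df['horsepower'].values[order],
         best_spline.predict(transformed)[order], color='red')
plt.xlabel('horsepower')
plt.ylabel('mpg')
plt.show()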

Notes regarding regression splines:

  • In practice it is common to specify the desired degrees of freedom and then have the software automatically place the corresponding number of knots at uniform quantiles of the data.
  • Knowing the degrees of freedom (df), we can derive the number of knots (K); a numeric check follows this list. For a cubic spline, df = 4 + K. For a natural cubic spline, df = K, because there are two additional natural constraints at each boundary to enforce linearity. Different accounting conventions exist: for example, one can say df = K - 1, because the basis includes a constant that is absorbed in the intercept, leaving only K - 1 degrees of freedom to count.
  • Cross-validation is an objective approach to choosing how many degrees of freedom our spline should contain.
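The df bookkeeping can be checked numerically: patsy returns one basis column per degree of freedom, and dmatrix adds an intercept column on top, so a bs() basis with df = 6 spans 7 functions including the intercept, matching df = 4 + K with K = 3 interior knots. A small sketch:

# Count basis columns to verify the df bookkeeping
basis_bs = dmatrix("bs(x, df=6, degree=3)", {"x": df.horsepower},
                   return_type='dataframe')
basis_cr = dmatrix("cr(x, df=6)", {"x": df.horsepower},
                   return_type='dataframe')
print(basis_bs.shape[1])  # 7: intercept + 6 spline columns
print(basis_cr.shape[1])  # 7: intercept + 6 spline columns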

Notes

  • The only set of solved exercises I found uses displacement instead of horsepower, so I could not verify my solutions against it.
  • According to https://github.com/JWarmenhoven/ISLR-python/blob/master/Notebooks/Chapter%207.ipynb, Python does not have functions to create smoothing splines, local regression or GAMs.

References

  • An Introduction to Statistical Learning with Applications in R, Chapter 7
  • http://www.science.smith.edu/~jcrouser/SDS293/labs/lab13/Lab%2013%20-%20Splines%20in%20Python.pdf
  • https://github.com/JWarmenhoven/ISLR-python/blob/master/Notebooks/Chapter%207.ipynb
  • http://patsy.readthedocs.io/en/latest/API-reference.html (search for: 'patsy.cr')