# Exercise 3.13

%matplotlib inline
import numpy as np
import pandas as pd

import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(1)
# does not generate the same sequences as R
# it seems there's no easy, elegant way of getting Python and R to generate the same random sequences
# http://stackoverflow.com/questions/22213298/creating-same-random-number-sequence-in-python-numpy-and-r


## (a)

x = np.random.normal(size=100)

# http://seaborn.pydata.org/examples/distplot_options.html
fig, ax = plt.subplots()
sns.despine(left=True)

# Plot a kernel density estimate and rug plot
# (sns.distplot has been removed from recent seaborn versions;
#  kdeplot + rugplot is the current equivalent)
sns.kdeplot(x, color="r", ax=ax)
sns.rugplot(x, color="r", ax=ax);


## (b)

eps = np.random.normal(scale = .25, size=100)

fig, ax = plt.subplots()
sns.despine(left=True)
sns.kdeplot(eps, color="r", ax=ax)  # distplot removed in recent seaborn
sns.rugplot(eps, color="r", ax=ax);


## (c)

y = -1 + .5*x + eps
print('Length of y = ' + str(len(y)))

Length of y = 100


Naturally, vector y has length 100, since it is computed elementwise from two vectors of length 100.

In this model $\beta_0 = -1$ and $\beta_1=1/2$.

## (d)

df = pd.DataFrame({'x': x, 'y': y})
sns.jointplot(x='x', y='y', data=df);


As expected, the relationship looks linear, with scatter consistent with the noise standard deviation of 0.25. There is no alarming indication of high-leverage points or outliers.
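Since this is simple linear regression, the R-squared of the upcoming fit equals the squared sample correlation between $x$ and $y$, so the fit quality can be previewed directly from the data. A minimal sketch (regenerating the data with the notebook's seed so it is self-contained):

```python
import numpy as np

np.random.seed(1)  # same seed and draw order as cells (a)-(c)
x = np.random.normal(size=100)
eps = np.random.normal(scale=.25, size=100)
y = -1 + .5 * x + eps

r = np.corrcoef(x, y)[0, 1]  # sample correlation between x and y
print(f'r = {r:.3f}, r^2 = {r**2:.3f}')
```

For simple regression $R^2 = r^2$, so this should match the R-squared reported by the OLS summary in (e).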

## (e)

reg = smf.ols('y ~ x', df).fit()
reg.summary()

Dep. Variable: y           R-squared:           0.800
Model: OLS                 Adj. R-squared:      0.798
No. Observations: 100      F-statistic:         391.4
Df Residuals: 98           Prob (F-statistic): 5.39e-36

              coef   std err         t     P>|t|    [0.025    0.975]
Intercept  -0.9632     0.023   -40.999     0.000    -1.010    -0.917
x           0.5239     0.026    19.783     0.000     0.471     0.576

The estimated coefficients ($\hat\beta_0 = -0.9632$, $\hat\beta_1 = 0.5239$) are close to the true values, although (naturally) not equal to them. In both cases the true value falls within the 95% confidence interval.
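As a sanity check, the intervals can be rebuilt by hand from the printed estimates and standard errors as $\hat\beta \pm t_{.975}\,\mathrm{SE}$; for 98 residual degrees of freedom $t_{.975} \approx 1.984$ (an approximate value, used here to avoid a scipy dependency):

```python
# Rebuild the 95% intervals from the summary table: estimate +/- t_crit * SE
t_crit = 1.984  # approx. 97.5% quantile of t with 98 df

for name, est, se, true in [('Intercept', -0.9632, 0.023, -1.0),
                            ('x',          0.5239, 0.026,  0.5)]:
    lo, hi = est - t_crit * se, est + t_crit * se
    print(f'{name}: [{lo:.3f}, {hi:.3f}], contains true value: {lo < true < hi}')
```

Up to rounding of the printed standard errors, both intervals match the summary output, and both contain the corresponding true coefficient.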

## (f)

model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)

plt.scatter(df.x, df.y);

xfit = np.linspace(df.x.min(), df.x.max(), 100)
yfit = model.predict(xfit[:, np.newaxis])
fit, = plt.plot(xfit, yfit, color='r');

xpop = np.linspace(df.x.min(), df.x.max(), 100)
ypop = -1 + .5*xpop
pop, = plt.plot(xpop, ypop, color='g');
plt.legend([fit, pop],['Fit line','Model line'])



## (g)

reg = smf.ols('y ~ x + I(x**2)', df).fit()
reg.summary()

Dep. Variable: y           R-squared:           0.800
Model: OLS                 Adj. R-squared:      0.796
No. Observations: 100      F-statistic:         193.8
Df Residuals: 97           Prob (F-statistic): 1.32e-34

              coef   std err         t     P>|t|    [0.025    0.975]
Intercept  -0.9663     0.029   -33.486     0.000    -1.024    -0.909
x           0.5234     0.027    19.582     0.000     0.470     0.576
I(x ** 2)   0.0039     0.021     0.181     0.856    -0.038     0.046

After fitting a model with basis functions $X$ and $X^2$, the table above shows a p-value of 0.856 for the squared term, so there is no evidence to reject the null hypothesis that its coefficient is zero. In other words, there is no statistically significant relationship between $X^2$ and $Y$: the quadratic term does not improve the fit. This is also evident in the R-squared of the two fits, which is 0.800 in both cases even though the second model has an additional variable (and the adjusted R-squared actually drops slightly, from 0.798 to 0.796).
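The same conclusion can be cross-checked without statsmodels: an ordinary least-squares quadratic fit via np.polyfit (used here purely for illustration) assigns essentially zero weight to the squared term. A sketch that regenerates the data with the notebook's seed:

```python
import numpy as np

np.random.seed(1)  # same seed and draw order as cells (a)-(c)
x = np.random.normal(size=100)
eps = np.random.normal(scale=.25, size=100)
y = -1 + .5 * x + eps

c2, c1, c0 = np.polyfit(x, y, 2)  # coefficients, highest degree first
print(f'x^2: {c2:.4f}   x: {c1:.4f}   intercept: {c0:.4f}')
```

The intercept and linear coefficients agree with the statsmodels table above, while the $x^2$ coefficient is negligible.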

## (h)

x = np.random.normal(size=100)
eps = np.random.normal(scale = .05, size=100)
y = -1 + .5*x + eps

df = pd.DataFrame({'x': x, 'y': y})
sns.jointplot(x='x', y='y', data=df)

model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)

plt.subplots()
plt.scatter(df.x, df.y);

xfit = np.linspace(df.x.min(), df.x.max(), 100)
yfit = model.predict(xfit[:, np.newaxis])
fit, = plt.plot(xfit, yfit, color='r');

xpop = np.linspace(df.x.min(), df.x.max(), 100)
ypop = -1 + .5*xpop
pop, = plt.plot(xpop, ypop, color='g');
plt.legend([fit, pop],['Fit line','Model line'])




reg = smf.ols('y ~ x', df).fit()
reg.summary()

Dep. Variable: y           R-squared:           0.989
Model: OLS                 Adj. R-squared:      0.989
No. Observations: 100      F-statistic:         8662.
Df Residuals: 98           Prob (F-statistic): 1.97e-97

              coef   std err         t     P>|t|    [0.025    0.975]
Intercept  -1.0010     0.005  -186.443     0.000    -1.012    -0.990
x           0.4972     0.005    93.071     0.000     0.487     0.508

As expected, we have a better fit, tighter confidence intervals, and a higher R-squared.

Our model fits this data set better than the previous one that was generated from a noisier distribution.
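These R-squared values are close to what the data-generating process implies. With $\mathrm{Var}(x) = 1$ and $\beta_1 = 1/2$, the population R-squared is $\mathrm{Var}(\beta_1 x) / (\mathrm{Var}(\beta_1 x) + \sigma^2) = 0.25 / (0.25 + \sigma^2)$. A quick check of this formula against the two fits so far:

```python
# Population R^2 for y = -1 + 0.5*x + eps, with Var(x) = 1 and noise sd sigma
for sigma, observed in [(.25, 0.800), (.05, 0.989)]:
    r2 = 0.25 / (0.25 + sigma ** 2)
    print(f'sigma = {sigma}: theoretical R^2 = {r2:.3f}, observed = {observed}')
```

The theoretical values are 0.800 and 0.990; the small gap from the observed 0.989 at $\sigma = .05$ is sampling variation.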

## (i)

x = np.random.normal(size=100)
eps = np.random.normal(scale = 1, size=100)
y = -1 + .5*x + eps

df = pd.DataFrame({'x': x, 'y': y})
sns.jointplot(x='x', y='y', data=df)

model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)

plt.subplots()
plt.scatter(df.x, df.y);

xfit = np.linspace(df.x.min(), df.x.max(), 100)
yfit = model.predict(xfit[:, np.newaxis])
fit, = plt.plot(xfit, yfit, color='r');

xpop = np.linspace(df.x.min(), df.x.max(), 100)
ypop = -1 + .5*xpop
pop, = plt.plot(xpop, ypop, color='g');
plt.legend([fit, pop],['Fit line','Model line'])





reg = smf.ols('y ~ x', df).fit()
reg.summary()

Dep. Variable: y           R-squared:           0.257
Model: OLS                 Adj. R-squared:      0.250
No. Observations: 100      F-statistic:         33.96
Df Residuals: 98           Prob (F-statistic): 7.19e-08

              coef   std err         t     P>|t|    [0.025    0.975]
Intercept  -0.8802     0.105    -8.375     0.000    -1.089    -0.672
x           0.5903     0.101     5.828     0.000     0.389     0.791

Now, on the other hand, we have a much worse fit: the R-squared is just 0.257 and the confidence intervals for the coefficients are much wider. Still, there is no doubt that the relationship is statistically significant, given the very low p-values.

## (j)

Here's a table with the confidence intervals taken from the three regression summaries above:

| Noise sd | lower $\beta_0$ | $\hat\beta_0$ | upper $\beta_0$ | lower $\beta_1$ | $\hat\beta_1$ | upper $\beta_1$ |
|---|---|---|---|---|---|---|
| $\sigma = .05$ | -1.012 | -1.0010 | -0.990 | 0.487 | 0.4972 | 0.508 |
| $\sigma = .25$ | -1.010 | -0.9632 | -0.917 | 0.471 | 0.5239 | 0.576 |
| $\sigma = 1$ | -1.089 | -0.8802 | -0.672 | 0.389 | 0.5903 | 0.791 |

As expected, the width of the intervals increases as the level of noise increases.
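This is also quantitative: each standard error is proportional to the noise sd $\sigma$, so the interval width should grow roughly linearly in $\sigma$. A sketch using the slope ($\beta_1$) intervals from the three regression summaries above:

```python
# Width of the 95% interval for beta_1 at each noise level,
# read off the three statsmodels summaries above
intervals = {.05: (0.487, 0.508), .25: (0.471, 0.576), 1.0: (0.389, 0.791)}

for sigma in sorted(intervals):
    lo, hi = intervals[sigma]
    w = hi - lo
    print(f'sigma = {sigma}: width = {w:.3f}, width/sigma = {w / sigma:.2f}')
```

The width-to-$\sigma$ ratio comes out nearly constant (about 0.4), consistent with the standard error scaling linearly in $\sigma$.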