Exercise 3.14
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm # To use statsmodels
import statsmodels.formula.api as smf # To use statsmodels with R-style formulas
from statsmodels.stats import outliers_influence
from sklearn.linear_model import LinearRegression
%matplotlib inline
(a)
np.random.seed(5) # Python and R random generators give different values
x1 = np.random.uniform(size=100)
x2 = 0.5 * x1 + np.random.normal(size=100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + np.random.normal(size=100)
# http://stackoverflow.com/questions/22213298/creating-same-random-number-sequence-in-python-numpy-and-r
Model form: Y = 2 + 2 X_1 + 0.3 X_2 + \epsilon, with \epsilon \sim N(0, 1).
(b)
# Get correlations
np.corrcoef(x1,x2)
array([[ 1. , 0.81936924],
[ 0.81936924, 1. ]])
The correlation coefficient between X_1 and X_2 is 0.819.
# Draw scatterplot
plt.scatter(x1,x2);
(c)
# Define data
X = pd.DataFrame({'x1':x1, 'x2':x2})
X = sm.add_constant(X) # No constant is added by the model unless we're using formulas, so we have to add it
# Create model
model = sm.OLS(y, X)
# Fit regression model
results = model.fit()
# Print results
results.summary()
| Dep. Variable: | y | R-squared: | 0.444 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.433 |
| Method: | Least Squares | F-statistic: | 38.74 |
| Date: | Fri, 08 Dec 2017 | Prob (F-statistic): | 4.31e-13 |
| Time: | 09:50:56 | Log-Likelihood: | -123.67 |
| No. Observations: | 100 | AIC: | 253.3 |
| Df Residuals: | 97 | BIC: | 261.1 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.8158 | 0.162 | 11.231 | 0.000 | 1.495 | 2.137 |
| x1 | 2.0758 | 0.488 | 4.257 | 0.000 | 1.108 | 3.044 |
| x2 | 0.7584 | 0.817 | 0.929 | 0.355 | -0.862 | 2.379 |

| Omnibus: | 0.718 | Durbin-Watson: | 1.960 |
|---|---|---|---|
| Prob(Omnibus): | 0.698 | Jarque-Bera (JB): | 0.574 |
| Skew: | -0.185 | Prob(JB): | 0.750 |
| Kurtosis: | 2.981 | Cond. No. | 12.5 |
According to the results we have:
- \hat{\beta_0} = 1.8158
- \hat{\beta_1} = 2.0758
- \hat{\beta_2} = 0.7584
These values are estimates of the true coefficients, which have the following values:
- \beta_0 = 2
- \beta_1 = 2
- \beta_2 = 0.3
As we can see, there are some differences between the coefficients, especially in the case of \hat{\beta_2} (0.7584 vs. 0.3).
H_0 : \beta_1 = 0. Whether we reject the null hypothesis depends on the t-statistic (t), which for \hat{\beta_1} is large (4.257). A large |t| corresponds to a small p-value, i.e. the probability of observing a value at least as extreme as |t| (reported as P>|t|) under the assumption that the true coefficient is zero. Since this p-value is essentially zero, we reject the null hypothesis and accept the alternative hypothesis.
H_0 : \beta_2 = 0. In this case the t-statistic is small (0.929) and the p-value is large (0.355). Accordingly, the null hypothesis can't be rejected.
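These quantities can also be read directly from the fitted statsmodels results object; a minimal sketch (not part of the original solution, using standard OLSResults attributes):
# t-statistics, p-values and 95% confidence intervals for each coefficient of model (c)
print(results.tvalues)
print(results.pvalues)
print(results.conf_int())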
Alternative solution
A different way to fit the same model, using R-style formulas in statsmodels.
# Define data
df = pd.DataFrame({'x1':x1, 'x2':x2, 'y':y}) # We don't need to add constant because we will use formulas
# Create model
mod = smf.ols(formula='y ~ x1 + x2', data=df) # R-style command
# Fit model
res = mod.fit()
# Print results
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.444
Model: OLS Adj. R-squared: 0.433
Method: Least Squares F-statistic: 38.74
Date: Fri, 08 Dec 2017 Prob (F-statistic): 4.31e-13
Time: 09:50:56 Log-Likelihood: -123.67
No. Observations: 100 AIC: 253.3
Df Residuals: 97 BIC: 261.1
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.8158 0.162 11.231 0.000 1.495 2.137
x1 2.0758 0.488 4.257 0.000 1.108 3.044
x2 0.7584 0.817 0.929 0.355 -0.862 2.379
==============================================================================
Omnibus: 0.718 Durbin-Watson: 1.960
Prob(Omnibus): 0.698 Jarque-Bera (JB): 0.574
Skew: -0.185 Prob(JB): 0.750
Kurtosis: 2.981 Cond. No. 12.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alternative solution
This is an alternative solution using scikit-learn instead of statsmodels.
# Create model
lr = LinearRegression()
# Fit model
mod = lr.fit(X,y)
# Get coefficients
mod.coef_
array([ 0. , 2.0758066 , 0.75840009])
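The leading 0 corresponds to the constant column we added: scikit-learn fits its own intercept by default (fit_intercept=True), so that column is redundant. The fitted intercept can be retrieved separately:
# Get intercept (estimated by scikit-learn itself, which is why the constant column's coefficient is 0)
mod.intercept_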
(d)
X = pd.DataFrame({'x1':x1})
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.439
Model: OLS Adj. R-squared: 0.433
Method: Least Squares F-statistic: 76.72
Date: Fri, 08 Dec 2017 Prob (F-statistic): 5.93e-14
Time: 09:50:56 Log-Likelihood: -124.11
No. Observations: 100 AIC: 252.2
Df Residuals: 98 BIC: 257.4
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.8229 0.161 11.295 0.000 1.503 2.143
x1 2.4468 0.279 8.759 0.000 1.892 3.001
==============================================================================
Omnibus: 0.357 Durbin-Watson: 1.986
Prob(Omnibus): 0.836 Jarque-Bera (JB): 0.272
Skew: -0.127 Prob(JB): 0.873
Kurtosis: 2.963 Cond. No. 4.17
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The coefficient estimate increased to 2.4468, and the null hypothesis can be rejected (and the alternative accepted) because the p-value is essentially zero. These results are in line with our expectations from (c).
(e)
X = pd.DataFrame({'x2':x2})
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.340
Model: OLS Adj. R-squared: 0.333
Method: Least Squares F-statistic: 50.53
Date: Fri, 08 Dec 2017 Prob (F-statistic): 1.92e-10
Time: 09:50:56 Log-Likelihood: -132.23
No. Observations: 100 AIC: 268.5
Df Residuals: 98 BIC: 273.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.1250 0.157 13.572 0.000 1.814 2.436
x2 3.6070 0.507 7.108 0.000 2.600 4.614
==============================================================================
Omnibus: 1.537 Durbin-Watson: 1.828
Prob(Omnibus): 0.464 Jarque-Bera (JB): 1.597
Skew: -0.272 Prob(JB): 0.450
Kurtosis: 2.704 Cond. No. 5.89
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The coefficient estimate increased to 3.6070, and the null hypothesis can be rejected (and the alternative accepted) because the p-value is essentially zero.
These results are clearly different from (c). In (c), the coefficient associated with x_2 was smaller and the p-value suggested the null hypothesis couldn't be rejected (the coefficient could be zero). Now the coefficient is larger (even larger than the coefficient obtained when only x_1 is used) and the null hypothesis can be rejected and the alternative hypothesis accepted.
(f)
The results do not contradict each other. What we see here is collinearity. As suggested by the high correlation, the scatterplot and, of course, the way Y was generated, x_1 can be predicted linearly from x_2 (and vice versa) with substantial accuracy. This is a sign of collinearity, and the regression models confirm it: when both variables enter the same linear model, one of them loses explanatory power because the variance it explains is already captured by the other. Accordingly, each variable considered individually leads to rejecting the null hypothesis, but when they are considered together one of them becomes dispensable.
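The degree of collinearity can also be quantified with variance inflation factors (VIFs), using the outliers_influence module imported at the top; a minimal sketch (the threshold mentioned in the comment is a common rule of thumb, not from the original text):
# Compute the VIF of each predictor in model (c); values well above 5-10 are
# usually taken as a sign of problematic collinearity
X = sm.add_constant(pd.DataFrame({'x1': x1, 'x2': x2}))
for i in [1, 2]: # Skip the constant column
    print(X.columns[i], outliers_influence.variance_inflation_factor(X.values, i))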
Finally, the coefficient values agree with what we know from the underlying model. Writing X_2 ≈ 0.5 X_1 and substituting it into the model gives a slope of 2 + 0.3 * 0.5 = 2.15 for X_1, well within the confidence interval computed in (d), namely [1.892, 3.001]. Likewise, writing X_1 ≈ 2 X_2 gives an expected slope of 2 * 2 + 0.3 = 4.3 for X_2, which lies inside the interval [2.600, 4.614] computed in (e).
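A quick check of this substitution arithmetic (a small sketch, not in the original solution):
# Implied slope when regressing y on x1 alone: substitute x2 = 0.5 * x1 into y = 2 + 2*x1 + 0.3*x2
print(2 + 0.3 * 0.5)   # 2.15, inside the interval [1.892, 3.001] from (d)
# Implied slope when regressing y on x2 alone: substitute x1 = 2 * x2
print(2 * 2 + 0.3)     # 4.3, inside the interval [2.600, 4.614] from (e)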
(g)
# Add observation
x1 = np.append(x1, 0.1) # To x1
x2 = np.append(x2, 0.8) # To x2
y = np.append(y, 6) # To y
# Add to dataframe (easier for outlier analysis plots)
sample = pd.DataFrame({'x1': [.1], 'x2': [.8], 'y': [6]}) # Create point as a one-row dataframe
df = pd.concat([df, sample], ignore_index=True) # Append the point to the existing dataframe (DataFrame.append was removed in pandas 2.0)
Analysis of the models
# Model (c)
X = pd.DataFrame({'x1':x1, 'x2':x2})
X = sm.add_constant(X) # No constant is added by the model unless we're using formulas, so we have to add it
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.425
Model: OLS Adj. R-squared: 0.414
Method: Least Squares F-statistic: 36.26
Date: Fri, 08 Dec 2017 Prob (F-statistic): 1.64e-12
Time: 09:50:56 Log-Likelihood: -129.50
No. Observations: 101 AIC: 265.0
Df Residuals: 98 BIC: 272.8
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.8697 0.168 11.111 0.000 1.536 2.204
x1 1.2421 0.432 2.876 0.005 0.385 2.099
x2 2.2711 0.698 3.254 0.002 0.886 3.656
==============================================================================
Omnibus: 0.673 Durbin-Watson: 1.803
Prob(Omnibus): 0.714 Jarque-Bera (JB): 0.465
Skew: -0.165 Prob(JB): 0.792
Kurtosis: 3.035 Cond. No. 10.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Model (d)
X = pd.DataFrame({'x1':x1})
X = sm.add_constant(X) # No constant is added by the model unless we're using formulas, so we have to add it
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.363
Model: OLS Adj. R-squared: 0.357
Method: Least Squares F-statistic: 56.46
Date: Fri, 08 Dec 2017 Prob (F-statistic): 2.60e-11
Time: 09:50:56 Log-Likelihood: -134.68
No. Observations: 101 AIC: 273.4
Df Residuals: 99 BIC: 278.6
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.9419 0.175 11.116 0.000 1.595 2.288
x1 2.2829 0.304 7.514 0.000 1.680 2.886
==============================================================================
Omnibus: 12.832 Durbin-Watson: 1.773
Prob(Omnibus): 0.002 Jarque-Bera (JB): 21.470
Skew: 0.522 Prob(JB): 2.18e-05
Kurtosis: 5.003 Cond. No. 4.14
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Model (e)
X = pd.DataFrame({'x2':x2})
X = sm.add_constant(X) # No constant is added by the model unless we're using formulas, so we have to add it
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.377
Model: OLS Adj. R-squared: 0.370
Method: Least Squares F-statistic: 59.84
Date: Fri, 08 Dec 2017 Prob (F-statistic): 8.79e-12
Time: 09:50:56 Log-Likelihood: -133.59
No. Observations: 101 AIC: 271.2
Df Residuals: 99 BIC: 276.4
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.0962 0.154 13.604 0.000 1.790 2.402
x2 3.7581 0.486 7.736 0.000 2.794 4.722
==============================================================================
Omnibus: 1.784 Durbin-Watson: 1.803
Prob(Omnibus): 0.410 Jarque-Bera (JB): 1.826
Skew: -0.299 Prob(JB): 0.401
Kurtosis: 2.726 Cond. No. 5.68
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Effect on the models:
- Model (c): R-squared decreased, which means the predictive capacity of the model was reduced. The regression coefficients also changed: the x1 coefficient decreased and the x2 coefficient increased, so x2 now has the larger coefficient. The null hypothesis is now rejected for both variables.
- Model (d): the only significant change was the reduction in R-squared.
- Model (e): the only significant change was a small increase in R-squared.
Outliers and high leverage points analysis
Outliers
An outlier is a point for which y_i is far from the value predicted by the model. This raises the question of whether it is representative of the population.
Outliers can be identified from a univariate, bivariate, or multivariate perspective based on the number of variables (characteristics) considered. We should use as many perspectives as possible, looking for a consistent pattern across them to identify outliers.
Cases that fall markedly outside the range of other observations will be seen as isolated points in the scatterplot. A drawback of the bivariate method in general is the potentially large number of scatterplots that arise as the number of variables increases. For 3 variables, it is only 3 graphs for all pairwise comparisons. But for 5 variables, it takes 10 graphs, and for 10 variables it takes 45 scatterplots! Since this analysis doesn't involve more than 2 variables, we can perform all pairwise comparisons.
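A convenient way to produce all pairwise scatterplots in a single figure (not used in the analysis below, which draws them one by one) is pandas' scatter_matrix:
# Draw all pairwise scatterplots among x1, x2 and y at once
pd.plotting.scatter_matrix(df[['x1', 'x2', 'y']], figsize=(8, 8));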
High leverage points
We just saw that outliers are observations for which the response y_i is unusual given the predictor x_i. In contrast, observations with high leverage have an unusual value for x_i. In statistics, particularly in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations.
In a simple linear regression, high leverage observations are fairly easy to identify, since we can simply look for observations for which the predictor value is outside of the normal range of the observations. With a single predictor, an extreme x value is simply one that is particularly high or low.
But in a multiple linear regression with many predictors, it is possible to have an observation that is well within the range of each individual predictor’s values, but that is unusual in terms of the full set of predictors. With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be "unusual" combinations of predictor values (e.g., with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor).
# Bivariate analysis (x1,x2)
sample = df.iloc[-1:] # To get the last observation
other = df.iloc[:-1] # To get all the observations but the last
ax = other.plot(kind='scatter',x='x1',y='x2', color='blue'); # Plot all observations but the last in blue
sample.plot(ax=ax, kind='scatter',x='x1',y='x2', color='red'); # Plot last observation added in red
- There is an unusual combination of predictor values, so it looks like a high leverage point.
Note: we are not comparing predictors with responses nor evaluating whether y_i is far from the value predicted by the model, so it doesn't make sense to discuss whether this is an outlier based on this plot.
# Bivariate analysis (x1,y)
sample = df.iloc[-1:] # To get the last observation
other = df.iloc[:-1] # To get all the observations but the last
ax = other.plot(kind='scatter',x='x1',y='y', color='blue'); # Plot all observations but the last in blue
sample.plot(ax=ax, kind='scatter',x='x1',y='y', color='red'); # Plot last observation added in red
- The red point does not follow the trend, so it looks like an outlier.
- The red point doesn't have an unusual x_1 value, so it doesn't look like a high leverage point.
# Bivariate analysis (x2,y)
sample = df.iloc[-1:] # To get the last observation
other = df.iloc[:-1] # To get all the observations but the last
ax = other.plot(kind='scatter',x='x2',y='y', color='blue'); # Plot all observations but the last in blue
sample.plot(ax=ax, kind='scatter',x='x2',y='y', color='red'); # Plot last observation added in red
- The red point follows the trend, so it doesn't look like an outlier.
- The red point has an extreme x_2 value, so it looks like a high leverage point.
In summary: the added observation significantly influences the model, in particular the regression model that includes both x_1 and x_2. In that model, x_2 went from being a dismissed variable to a significant one, which means that even as a single observation out of 100, it reduced the existing collinearity. The R-squared of the model also decreased, which signifies a loss in the model's predictive capacity.
According to the scatterplots, the added observation seems to be both an outlier and a high leverage point. This conclusion follows from comparing it visually with the remaining observations: it shows an unusual combination of predictor values, extreme predictor values and, in several plots, substantially different behaviour from the other observations.
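These visual conclusions can be complemented with statsmodels' influence diagnostics; a minimal sketch, refitting model (c) on the 101 observations (the thresholds in the comments are common rules of thumb, not from the original text):
# Refit model (c) with the added observation and compute influence measures
X = sm.add_constant(pd.DataFrame({'x1': x1, 'x2': x2}))
influence = sm.OLS(y, X).fit().get_influence()
leverage = influence.hat_matrix_diag                   # Diagonal of the hat matrix
student_resid = influence.resid_studentized_external   # Externally studentized residuals
n, p = len(y), 2
print(leverage[-1], 2 * (p + 1) / n)   # Leverage of the added point vs. the 2(p+1)/n rule of thumb
print(student_resid[-1])               # Values beyond roughly +/- 3 are usually flagged as outliers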
References
- https://onlinecourses.science.psu.edu/stat501/node/337
- Hair, J. F., Black, B., Babin, B., Anderson, R. E., & Tatham, R. L. (2010). Multivariate Data Analysis (7th ed.). Upper Saddle River, NJ: Prentice-Hall