Exercise 5.6

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf 
import numpy as np

df = pd.read_csv('../data/Default.csv', index_col=0)

df.head() #just checking

	default	student	balance	income
1	No	No	729.526495	44361.625074
2	No	Yes	817.180407	12106.134700
3	No	No	1073.549164	31767.138947
4	No	No	529.250605	35704.493935
5	No	No	785.655883	38463.495879

np.random.seed(0) #asked in the exercise

(a)

#using generalized linear models with statsmodel
#see the wikipedia reference to understand why family is binomial
mod1 = smf.glm(formula='default ~ income + balance', data=df, family=sm.families.Binomial()).fit() #create & fit model
print(mod1.summary()) #show results

                        Generalized Linear Model Regression Results                        
===========================================================================================
Dep. Variable:     ['default[No]', 'default[Yes]']   No. Observations:                10000
Model:                                         GLM   Df Residuals:                     9997
Model Family:                             Binomial   Df Model:                            2
Link Function:                               logit   Scale:                             1.0
Method:                                       IRLS   Log-Likelihood:                -789.48
Date:                             Fri, 05 Jan 2018   Deviance:                       1579.0
Time:                                     21:08:26   Pearson chi2:                 6.95e+03
No. Iterations:                                  9                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     11.5405      0.435     26.544      0.000      10.688      12.393
income     -2.081e-05   4.99e-06     -4.174      0.000   -3.06e-05    -1.1e-05
balance       -0.0056      0.000    -24.835      0.000      -0.006      -0.005
==============================================================================

Estimated standard errors for the coefficients associated with income and balance are 4.99e-06 and 0, respectively.

(b)

def boot_fn(default):
    mod1 = smf.glm(formula='default ~ income + balance', data=default, family=sm.families.Binomial()).fit()
    coef_income = mod1.params[1]
    coef_balance = mod1.params[2]
    return [coef_income, coef_balance]

boot_fn(df)

[-2.0808975528986941e-05, -0.005647102950316488]

(c)

Since there is not Python equivalent to R boot function, we will create a boot function for Python.

#bootstrap function
def boot(X, bootSample_size=None):
    '''
    Sampling observations from a dataframe

    Parameters
    ------------
    X : pandas dataframe
        Data to be resampled

    bootSample_size: int, optional
        Dimension of the bootstrapped samples

    Returns
    ------------
    bootSample_X : pandas dataframe
        Resampled data

    Examples
    ----------
    To resample data from the X dataframe:
        >> boot(X)
    The resampled data will have length equal to len(X).

    To resample data from the X dataframe in order to have length 5:
        >> boot(X,5)

    References
    ------------
    http://nbviewer.jupyter.org/gist/aflaxman/6871948

    '''

    #assign default size if non-specified
    if bootSample_size == None:
        bootSample_size = len(X)

    #create random integers to use as indices for bootstrap sample based on original data
    bootSample_i = (np.random.rand(bootSample_size)*len(X)).astype(int)
    bootSample_i = np.array(bootSample_i)
    bootSample_X = X.iloc[bootSample_i]

    return bootSample_X

Now, we will call the boot function n times, apply boot_fn and compute the coefficients average value. We used n = 100 to have convergent results. Other values could be used.

#running model for bootstrapped samples
coefficients = [] #variable initialization
n = 100 #number of bootstrapped samples

for i in range(0,n):
    coef_i = boot_fn(boot(df)) #determining coefficients for specific bootstrapped sample
    coefficients.append(coef_i) #saving coefficients value

print(pd.DataFrame(coefficients).mean()) #print average of coefficients

0   -0.000021
1   -0.005716
dtype: float64

(d)

Results (b): [-2.0808975528987073e-05, -0.005647102950316495]
Results (c): [-0.000021, -0.005672]

References

Wikipedia, Generalized linear model, https://en.wikipedia.org/wiki/Generalized_linear_model
http://nbviewer.jupyter.org/gist/aflaxman/6871948