Exercise 5.6

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf 
import numpy as np
df = pd.read_csv('../data/Default.csv', index_col=0)
df.head() #just checking
default student balance income
1 No No 729.526495 44361.625074
2 No Yes 817.180407 12106.134700
3 No No 1073.549164 31767.138947
4 No No 529.250605 35704.493935
5 No No 785.655883 38463.495879
np.random.seed(0) #asked in the exercise

(a)

#using generalized linear models with statsmodel
#see the wikipedia reference to understand why family is binomial
mod1 = smf.glm(formula='default ~ income + balance', data=df, family=sm.families.Binomial()).fit() #create & fit model
print(mod1.summary()) #show results
                        Generalized Linear Model Regression Results                        
===========================================================================================
Dep. Variable:     ['default[No]', 'default[Yes]']   No. Observations:                10000
Model:                                         GLM   Df Residuals:                     9997
Model Family:                             Binomial   Df Model:                            2
Link Function:                               logit   Scale:                             1.0
Method:                                       IRLS   Log-Likelihood:                -789.48
Date:                             Fri, 05 Jan 2018   Deviance:                       1579.0
Time:                                     21:08:26   Pearson chi2:                 6.95e+03
No. Iterations:                                  9                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     11.5405      0.435     26.544      0.000      10.688      12.393
income     -2.081e-05   4.99e-06     -4.174      0.000   -3.06e-05    -1.1e-05
balance       -0.0056      0.000    -24.835      0.000      -0.006      -0.005
==============================================================================

Estimated standard errors for the coefficients associated with income and balance are 4.99e-06 and 0, respectively.

(b)

def boot_fn(default):
    mod1 = smf.glm(formula='default ~ income + balance', data=default, family=sm.families.Binomial()).fit()
    coef_income = mod1.params[1]
    coef_balance = mod1.params[2]
    return [coef_income, coef_balance]
boot_fn(df)
[-2.0808975528986941e-05, -0.005647102950316488]

(c)

Since there is not Python equivalent to R boot function, we will create a boot function for Python.

#bootstrap function
def boot(X, bootSample_size=None):
    '''
    Sampling observations from a dataframe

    Parameters
    ------------
    X : pandas dataframe
        Data to be resampled

    bootSample_size: int, optional
        Dimension of the bootstrapped samples

    Returns
    ------------
    bootSample_X : pandas dataframe
        Resampled data

    Examples
    ----------
    To resample data from the X dataframe:
        >> boot(X)
    The resampled data will have length equal to len(X).

    To resample data from the X dataframe in order to have length 5:
        >> boot(X,5)

    References
    ------------
    http://nbviewer.jupyter.org/gist/aflaxman/6871948

    '''

    #assign default size if non-specified
    if bootSample_size == None:
        bootSample_size = len(X)

    #create random integers to use as indices for bootstrap sample based on original data
    bootSample_i = (np.random.rand(bootSample_size)*len(X)).astype(int)
    bootSample_i = np.array(bootSample_i)
    bootSample_X = X.iloc[bootSample_i]

    return bootSample_X

Now, we will call the boot function n times, apply boot_fn and compute the coefficients average value. We used n = 100 to have convergent results. Other values could be used.

#running model for bootstrapped samples
coefficients = [] #variable initialization
n = 100 #number of bootstrapped samples

for i in range(0,n):
    coef_i = boot_fn(boot(df)) #determining coefficients for specific bootstrapped sample
    coefficients.append(coef_i) #saving coefficients value

print(pd.DataFrame(coefficients).mean()) #print average of coefficients
0   -0.000021
1   -0.005716
dtype: float64

(d)

References