%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.display.float_format = '{:,.2f}'.format # Print only 2 decimal cases.
df = pd.read_csv('../data/auto.csv')
df
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
0 | 18.00 | 8 | 307.00 | 130 | 3504 | 12.00 | 70 | 1 | chevrolet chevelle malibu |
1 | 15.00 | 8 | 350.00 | 165 | 3693 | 11.50 | 70 | 1 | buick skylark 320 |
2 | 18.00 | 8 | 318.00 | 150 | 3436 | 11.00 | 70 | 1 | plymouth satellite |
3 | 16.00 | 8 | 304.00 | 150 | 3433 | 12.00 | 70 | 1 | amc rebel sst |
4 | 17.00 | 8 | 302.00 | 140 | 3449 | 10.50 | 70 | 1 | ford torino |
5 | 15.00 | 8 | 429.00 | 198 | 4341 | 10.00 | 70 | 1 | ford galaxie 500 |
6 | 14.00 | 8 | 454.00 | 220 | 4354 | 9.00 | 70 | 1 | chevrolet impala |
7 | 14.00 | 8 | 440.00 | 215 | 4312 | 8.50 | 70 | 1 | plymouth fury iii |
8 | 14.00 | 8 | 455.00 | 225 | 4425 | 10.00 | 70 | 1 | pontiac catalina |
9 | 15.00 | 8 | 390.00 | 190 | 3850 | 8.50 | 70 | 1 | amc ambassador dpl |
10 | 15.00 | 8 | 383.00 | 170 | 3563 | 10.00 | 70 | 1 | dodge challenger se |
11 | 14.00 | 8 | 340.00 | 160 | 3609 | 8.00 | 70 | 1 | plymouth 'cuda 340 |
12 | 15.00 | 8 | 400.00 | 150 | 3761 | 9.50 | 70 | 1 | chevrolet monte carlo |
13 | 14.00 | 8 | 455.00 | 225 | 3086 | 10.00 | 70 | 1 | buick estate wagon (sw) |
14 | 24.00 | 4 | 113.00 | 95 | 2372 | 15.00 | 70 | 3 | toyota corona mark ii |
15 | 22.00 | 6 | 198.00 | 95 | 2833 | 15.50 | 70 | 1 | plymouth duster |
16 | 18.00 | 6 | 199.00 | 97 | 2774 | 15.50 | 70 | 1 | amc hornet |
17 | 21.00 | 6 | 200.00 | 85 | 2587 | 16.00 | 70 | 1 | ford maverick |
18 | 27.00 | 4 | 97.00 | 88 | 2130 | 14.50 | 70 | 3 | datsun pl510 |
19 | 26.00 | 4 | 97.00 | 46 | 1835 | 20.50 | 70 | 2 | volkswagen 1131 deluxe sedan |
20 | 25.00 | 4 | 110.00 | 87 | 2672 | 17.50 | 70 | 2 | peugeot 504 |
21 | 24.00 | 4 | 107.00 | 90 | 2430 | 14.50 | 70 | 2 | audi 100 ls |
22 | 25.00 | 4 | 104.00 | 95 | 2375 | 17.50 | 70 | 2 | saab 99e |
23 | 26.00 | 4 | 121.00 | 113 | 2234 | 12.50 | 70 | 2 | bmw 2002 |
24 | 21.00 | 6 | 199.00 | 90 | 2648 | 15.00 | 70 | 1 | amc gremlin |
25 | 10.00 | 8 | 360.00 | 215 | 4615 | 14.00 | 70 | 1 | ford f250 |
26 | 10.00 | 8 | 307.00 | 200 | 4376 | 15.00 | 70 | 1 | chevy c20 |
27 | 11.00 | 8 | 318.00 | 210 | 4382 | 13.50 | 70 | 1 | dodge d200 |
28 | 9.00 | 8 | 304.00 | 193 | 4732 | 18.50 | 70 | 1 | hi 1200d |
29 | 27.00 | 4 | 97.00 | 88 | 2130 | 14.50 | 71 | 3 | datsun pl510 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
367 | 28.00 | 4 | 112.00 | 88 | 2605 | 19.60 | 82 | 1 | chevrolet cavalier |
368 | 27.00 | 4 | 112.00 | 88 | 2640 | 18.60 | 82 | 1 | chevrolet cavalier wagon |
369 | 34.00 | 4 | 112.00 | 88 | 2395 | 18.00 | 82 | 1 | chevrolet cavalier 2-door |
370 | 31.00 | 4 | 112.00 | 85 | 2575 | 16.20 | 82 | 1 | pontiac j2000 se hatchback |
371 | 29.00 | 4 | 135.00 | 84 | 2525 | 16.00 | 82 | 1 | dodge aries se |
372 | 27.00 | 4 | 151.00 | 90 | 2735 | 18.00 | 82 | 1 | pontiac phoenix |
373 | 24.00 | 4 | 140.00 | 92 | 2865 | 16.40 | 82 | 1 | ford fairmont futura |
374 | 36.00 | 4 | 105.00 | 74 | 1980 | 15.30 | 82 | 2 | volkswagen rabbit l |
375 | 37.00 | 4 | 91.00 | 68 | 2025 | 18.20 | 82 | 3 | mazda glc custom l |
376 | 31.00 | 4 | 91.00 | 68 | 1970 | 17.60 | 82 | 3 | mazda glc custom |
377 | 38.00 | 4 | 105.00 | 63 | 2125 | 14.70 | 82 | 1 | plymouth horizon miser |
378 | 36.00 | 4 | 98.00 | 70 | 2125 | 17.30 | 82 | 1 | mercury lynx l |
379 | 36.00 | 4 | 120.00 | 88 | 2160 | 14.50 | 82 | 3 | nissan stanza xe |
380 | 36.00 | 4 | 107.00 | 75 | 2205 | 14.50 | 82 | 3 | honda accord |
381 | 34.00 | 4 | 108.00 | 70 | 2245 | 16.90 | 82 | 3 | toyota corolla |
382 | 38.00 | 4 | 91.00 | 67 | 1965 | 15.00 | 82 | 3 | honda civic |
383 | 32.00 | 4 | 91.00 | 67 | 1965 | 15.70 | 82 | 3 | honda civic (auto) |
384 | 38.00 | 4 | 91.00 | 67 | 1995 | 16.20 | 82 | 3 | datsun 310 gx |
385 | 25.00 | 6 | 181.00 | 110 | 2945 | 16.40 | 82 | 1 | buick century limited |
386 | 38.00 | 6 | 262.00 | 85 | 3015 | 17.00 | 82 | 1 | oldsmobile cutlass ciera (diesel) |
387 | 26.00 | 4 | 156.00 | 92 | 2585 | 14.50 | 82 | 1 | chrysler lebaron medallion |
388 | 22.00 | 6 | 232.00 | 112 | 2835 | 14.70 | 82 | 1 | ford granada l |
389 | 32.00 | 4 | 144.00 | 96 | 2665 | 13.90 | 82 | 3 | toyota celica gt |
390 | 36.00 | 4 | 135.00 | 84 | 2370 | 13.00 | 82 | 1 | dodge charger 2.2 |
391 | 27.00 | 4 | 151.00 | 90 | 2950 | 17.30 | 82 | 1 | chevrolet camaro |
392 | 27.00 | 4 | 140.00 | 86 | 2790 | 15.60 | 82 | 1 | ford mustang gl |
393 | 44.00 | 4 | 97.00 | 52 | 2130 | 24.60 | 82 | 2 | vw pickup |
394 | 32.00 | 4 | 135.00 | 84 | 2295 | 11.60 | 82 | 1 | dodge rampage |
395 | 28.00 | 4 | 120.00 | 79 | 2625 | 18.60 | 82 | 1 | ford ranger |
396 | 31.00 | 4 | 119.00 | 82 | 2720 | 19.40 | 82 | 1 | chevy s-10 |
397 rows × 9 columns
Looks good so far, no missing values in sight.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
mpg 397 non-null float64
cylinders 397 non-null int64
displacement 397 non-null float64
horsepower 397 non-null object
weight 397 non-null int64
acceleration 397 non-null float64
year 397 non-null int64
origin 397 non-null int64
name 397 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB
It seems suspicious that 'horsepower' is of 'object' type. Let's have a closer look.
df.horsepower.unique()
array(['130', '165', '150', '140', '198', '220', '215', '225', '190',
'170', '160', '95', '97', '85', '88', '46', '87', '90', '113',
'200', '210', '193', '?', '100', '105', '175', '153', '180', '110',
'72', '86', '70', '76', '65', '69', '60', '80', '54', '208', '155',
'112', '92', '145', '137', '158', '167', '94', '107', '230', '49',
'75', '91', '122', '67', '83', '78', '52', '61', '93', '148', '129',
'96', '71', '98', '115', '53', '81', '79', '120', '152', '102',
'108', '68', '58', '149', '89', '63', '48', '66', '139', '103',
'125', '133', '138', '135', '142', '77', '62', '132', '84', '64',
'74', '116', '82'], dtype=object)
Ok, so there are some missing values represented by a question mark.
df = df[df.horsepower != '?'].copy() # [1]
df['horsepower']=pd.to_numeric(df['horsepower'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 396
Data columns (total 9 columns):
mpg 392 non-null float64
cylinders 392 non-null int64
displacement 392 non-null float64
horsepower 392 non-null int64
weight 392 non-null int64
acceleration 392 non-null float64
year 392 non-null int64
origin 392 non-null int64
name 392 non-null object
dtypes: float64(3), int64(5), object(1)
memory usage: 30.6+ KB
a) Quantitative and qualitative predictors
df.head()
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
0 | 18.00 | 8 | 307.00 | 130 | 3504 | 12.00 | 70 | 1 | chevrolet chevelle malibu |
1 | 15.00 | 8 | 350.00 | 165 | 3693 | 11.50 | 70 | 1 | buick skylark 320 |
2 | 18.00 | 8 | 318.00 | 150 | 3436 | 11.00 | 70 | 1 | plymouth satellite |
3 | 16.00 | 8 | 304.00 | 150 | 3433 | 12.00 | 70 | 1 | amc rebel sst |
4 | 17.00 | 8 | 302.00 | 140 | 3449 | 10.50 | 70 | 1 | ford torino |
Quantitative predictors:
quantitative = df.select_dtypes(include=['number']).columns
quantitative
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
'acceleration', 'year', 'origin'],
dtype='object')
Qualitative predictors:
qualitative = df.select_dtypes(exclude=['number']).columns
qualitative
Index(['name'], dtype='object')
b) Range of each quantitative predictor
a = df.describe()
a.loc['range'] = a.loc['max'] - a.loc['min']
a.loc['range']
mpg 37.60
cylinders 5.00
displacement 387.00
horsepower 184.00
weight 3,527.00
acceleration 16.80
year 12.00
origin 2.00
Name: range, dtype: float64
c) Mean and standard deviation
a.loc[['mean','std', 'range']]
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | |
---|---|---|---|---|---|---|---|---|
mean | 23.45 | 5.47 | 194.41 | 104.47 | 2,977.58 | 15.54 | 75.98 | 1.58 |
std | 7.81 | 1.71 | 104.64 | 38.49 | 849.40 | 2.76 | 3.68 | 0.81 |
range | 37.60 | 5.00 | 387.00 | 184.00 | 3,527.00 | 16.80 | 12.00 | 2.00 |
d) Mean and standard deviation, removing observations
df_b = df.drop(df.index[10:85])
b = df_b.describe()
b.loc['range'] = b.loc['max'] - b.loc['min']
b.loc[['mean','std', 'range']]
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | |
---|---|---|---|---|---|---|---|---|
mean | 24.37 | 5.38 | 187.88 | 101.00 | 2,938.85 | 15.70 | 77.12 | 1.60 |
std | 7.87 | 1.66 | 100.17 | 36.00 | 811.64 | 2.72 | 3.13 | 0.82 |
range | 35.60 | 5.00 | 387.00 | 184.00 | 3,348.00 | 16.30 | 12.00 | 2.00 |
e) Visualizing relationships between variables
We use some common visualization tools, namely:
- Scatterplots
- Box plots
- Histograms
g = sns.PairGrid(df, size=2)
g.map_upper(plt.scatter, s=3)
g.map_diag(plt.hist)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.fig.set_size_inches(12, 12)
- The histogram for 'acceleration' resembles a normal distribution.
- 'displacement' and 'weight' have a strong linear relationship.
- 'mpg' has a non-linear relationship with 'weight', 'horsepower' and 'displacement'.
f) Predicting mpg
Based on the previous question, we could use weight, horsepower and displacement. As seen in the scatterplots, these variables seem to have a non-linear relationship with mpg. Are these relationships statistically significant? Exercises 3.8 and 3.9 delve further into this matter.