Exercise 2.8

%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

pd.options.display.float_format = '{:,.2f}'.format # Show floats with 2 decimal places.

(a) Read csv

college = pd.read_csv("../data/College.csv") # Forward slashes in the relative path also work on Windows.
college
Unnamed: 0 Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.10 12 7041 60
1 Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.20 16 10527 56
2 Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.90 30 8735 54
3 Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.70 37 19016 59
4 Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.90 2 10922 15
5 Albertson College Yes 587 479 158 38 62 678 41 13500 3335 500 675 67 73 9.40 11 9727 55
6 Albertus Magnus College Yes 353 340 103 17 45 416 230 13290 5720 500 1500 90 93 11.50 26 8861 63
7 Albion College Yes 1899 1720 489 37 68 1594 32 13868 4826 450 850 89 100 13.70 37 11487 73
8 Albright College Yes 1038 839 227 30 63 973 306 15595 4400 300 500 79 84 11.30 23 11644 80
9 Alderson-Broaddus College Yes 582 498 172 21 44 799 78 10468 3380 660 1800 40 41 11.50 15 8991 52
10 Alfred University Yes 1732 1425 472 37 75 1830 110 16548 5406 500 600 82 88 11.30 31 10932 73
11 Allegheny College Yes 2652 1900 484 44 77 1707 44 17080 4440 400 600 73 91 9.90 41 11711 76
12 Allentown Coll. of St. Francis de Sales Yes 1179 780 290 38 64 1130 638 9690 4785 600 1000 60 84 13.30 21 7940 74
13 Alma College Yes 1267 1080 385 44 73 1306 28 12572 4552 400 400 79 87 15.30 32 9305 68
14 Alverno College Yes 494 313 157 23 46 1317 1235 8352 3640 650 2449 36 69 11.10 26 8127 55
15 American International College Yes 1420 1093 220 9 22 1018 287 8700 4780 450 1400 78 84 14.70 19 7355 69
16 Amherst College Yes 4302 992 418 83 96 1593 5 19760 5300 660 1598 93 98 8.40 63 21424 100
17 Anderson University Yes 1216 908 423 19 40 1819 281 10100 3520 550 1100 48 61 12.10 14 7994 59
18 Andrews University Yes 1130 704 322 14 23 1586 326 9996 3090 900 1320 62 66 11.50 18 10908 46
19 Angelo State University No 3540 2001 1016 24 54 4190 1512 5130 3592 500 2000 60 62 23.10 5 4010 34
20 Antioch University Yes 713 661 252 25 44 712 23 15476 3336 400 1100 69 82 11.30 35 42926 48
21 Appalachian State University No 7313 4664 1910 20 63 9940 1035 6806 2540 96 2000 83 96 18.30 14 5854 70
22 Aquinas College Yes 619 516 219 20 51 1251 767 11208 4124 350 1615 55 65 12.70 25 6584 65
23 Arizona State University Main campus No 12809 10308 3761 24 49 22593 7585 7434 4850 700 2100 88 93 18.90 5 4602 48
24 Arkansas College (Lyon College) Yes 708 334 166 46 74 530 182 8644 3922 500 800 79 88 12.60 24 14579 54
25 Arkansas Tech University No 1734 1729 951 12 52 3602 939 3460 2650 450 1000 57 60 19.60 5 4739 48
26 Assumption College Yes 2135 1700 491 23 59 1708 689 12000 5920 500 500 93 93 13.80 30 7100 88
27 Auburn University-Main Campus No 7548 6791 3070 25 57 16262 1716 6300 3933 600 1908 85 91 16.70 18 6642 69
28 Augsburg College Yes 662 513 257 12 30 2074 726 11902 4372 540 950 65 65 12.80 31 7836 58
29 Augustana College IL Yes 1879 1658 497 36 69 1950 38 13353 4173 540 821 78 83 12.70 40 9220 71
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
747 Westfield State College No 3100 2150 825 3 20 3234 941 5542 3788 500 1300 75 79 15.70 20 4222 65
748 Westminster College MO Yes 662 553 184 20 43 665 37 10720 4050 600 1650 66 70 12.50 20 7925 62
749 Westminster College Yes 996 866 377 29 58 1411 72 12065 3615 430 685 62 78 12.50 41 8596 80
750 Westminster College of Salt Lake City Yes 917 720 213 21 60 979 743 8820 4050 600 2025 68 83 10.50 34 7170 50
751 Westmont College No 950 713 351 42 72 1276 9 14320 5304 490 1410 77 77 14.90 17 8837 87
752 Wheaton College IL Yes 1432 920 548 56 84 2200 56 11480 4200 530 1400 81 83 12.70 40 11916 85
753 Westminster College PA Yes 1738 1373 417 21 55 1335 30 18460 5970 700 850 92 96 13.20 41 22704 71
754 Wheeling Jesuit College Yes 903 755 213 15 49 971 305 10500 4545 600 600 66 71 14.10 27 7494 72
755 Whitman College Yes 1861 998 359 45 77 1220 46 16670 4900 750 800 80 83 10.50 51 13198 72
756 Whittier College Yes 1681 1069 344 35 63 1235 30 16249 5699 500 1998 84 92 13.60 29 11778 52
757 Whitworth College Yes 1121 926 372 43 70 1270 160 12660 4500 678 2424 80 80 16.90 20 8328 80
758 Widener University Yes 2139 1492 502 24 64 2186 2171 12350 5370 500 1350 88 86 12.60 19 9603 63
759 Wilkes University Yes 1631 1431 434 15 36 1803 603 11150 5130 550 1260 78 92 13.30 24 8543 67
760 Willamette University Yes 1658 1327 395 49 80 1595 159 14800 4620 400 790 91 94 13.30 37 10779 68
761 William Jewell College Yes 663 547 315 32 67 1279 75 10060 2970 500 2600 74 80 11.20 19 7885 59
762 William Woods University Yes 469 435 227 17 39 851 120 10535 4365 550 3700 39 66 12.90 16 7438 52
763 Williams College Yes 4186 1245 526 81 96 1988 29 19629 5790 500 1200 94 99 9.00 64 22014 99
764 Wilson College Yes 167 130 46 16 50 199 676 11428 5084 450 475 67 76 8.30 43 10291 67
765 Wingate College Yes 1239 1017 383 10 34 1207 157 7820 3400 550 1550 69 81 13.90 8 7264 91
766 Winona State University No 3325 2047 1301 20 45 5800 872 4200 2700 300 1200 53 60 20.20 18 5318 58
767 Winthrop University No 2320 1805 769 24 61 3395 670 6400 3392 580 2150 71 80 12.80 26 6729 59
768 Wisconsin Lutheran College Yes 152 128 75 17 41 282 22 9100 3700 500 1400 48 48 8.50 26 8960 50
769 Wittenberg University Yes 1979 1739 575 42 68 1980 144 15948 4404 400 800 82 95 12.80 29 10414 78
770 Wofford College Yes 1501 935 273 51 83 1059 34 12680 4150 605 1440 91 92 15.30 42 7875 75
771 Worcester Polytechnic Institute Yes 2768 2314 682 49 86 2802 86 15884 5370 530 730 92 94 15.20 34 10774 82
772 Worcester State College No 2197 1515 543 4 26 3089 2029 6797 3900 500 1200 60 60 21.00 14 4469 40
773 Xavier University Yes 1959 1805 695 24 47 2849 1107 11520 4960 600 1250 73 75 13.30 31 9189 83
774 Xavier University of Louisiana Yes 2097 1915 695 34 61 2793 166 6900 4200 617 781 67 75 14.40 20 8323 49
775 Yale University Yes 10705 2453 1317 95 99 5217 83 19840 6510 630 2115 96 96 5.80 49 40386 99
776 York College of Pennsylvania Yes 2989 1855 691 28 63 2988 1726 4990 3560 500 1250 75 75 18.10 28 4509 99

777 rows × 19 columns
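
The column name Unnamed: 0 suggests that the first header field of the file is blank; pandas gives such columns placeholder names of the form Unnamed: N. A minimal sketch of that behavior, using a hypothetical three-column excerpt (csv_text) in place of the real file:

```python
import io

import pandas as pd

# Hypothetical excerpt mimicking College.csv: the first header field is blank.
csv_text = (
    ",Private,Apps\n"
    "Abilene Christian University,Yes,1660\n"
    "Adelphi University,Yes,2186\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

# pandas fills in a placeholder name for the unnamed first column.
print(sample.columns.tolist())
```

This is why part (b) refers to the column of university names as "Unnamed: 0".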

(b) University names as index

R's fix() function (similar to edit()) opens an editor for on-the-fly edits to a data frame. pandas has no built-in interactive editor, so here we simply set the university names as the index. Further details can be found here and here.

# [1]
college = college.set_index("Unnamed: 0") # drop=True (the default) removes the column once it becomes the index
college.index.name = 'Names'
college.head()
# The empty row below the column names (e.g. Private, Apps) appears because the index has a name, which pandas displays on its own row.
Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
Names
Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.10 12 7041 60
Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.20 16 10527 56
Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.90 30 8735 54
Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.70 37 19016 59
Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.90 2 10922 15

[1] https://campus.datacamp.com/courses/manipulating-dataframes-with-pandas/advanced-indexing?ex=1

# Alternative solution: read_csv can set the index directly, which saves one line:
college = pd.read_csv('../data/College.csv', index_col='Unnamed: 0')
college.index.name = 'Names'
college.head()
Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
Names
Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.10 12 7041 60
Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.20 16 10527 56
Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.90 30 8735 54
Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.70 37 19016 59
Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.90 2 10922 15
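
One practical benefit of using the names as the index is label-based selection with .loc. The sketch below uses a toy frame with two rows copied from the output above, not the full dataset, and the variable names toy and adrian_apps are made up for illustration:

```python
import pandas as pd

# Toy frame with two rows copied from the output above (not the full dataset).
toy = pd.DataFrame(
    {
        "Unnamed: 0": ["Adrian College", "Agnes Scott College"],
        "Private": ["Yes", "Yes"],
        "Apps": [1428, 417],
    }
)
toy = toy.set_index("Unnamed: 0").rename_axis("Names")

# With names as the index, a row can be selected by label instead of position.
adrian_apps = toy.loc["Adrian College", "Apps"]
print(adrian_apps)
```

Here rename_axis("Names") is just a chained equivalent of assigning to index.name as done above.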

(c)

i. Summary

college.describe(include='all')
# [2, 3, 4] Without include='all', the column 'Private' is not shown because it is non-numeric (object dtype)
Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
count 777 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00
unique 2 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
top Yes nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
freq 565 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
mean NaN 3,001.64 2,018.80 779.97 27.56 55.80 3,699.91 855.30 10,440.67 4,357.53 549.38 1,340.64 72.66 79.70 14.09 22.74 9,660.17 65.46
std NaN 3,870.20 2,451.11 929.18 17.64 19.80 4,850.42 1,522.43 4,023.02 1,096.70 165.11 677.07 16.33 14.72 3.96 12.39 5,221.77 17.18
min NaN 81.00 72.00 35.00 1.00 9.00 139.00 1.00 2,340.00 1,780.00 96.00 250.00 8.00 24.00 2.50 0.00 3,186.00 10.00
25% NaN 776.00 604.00 242.00 15.00 41.00 992.00 95.00 7,320.00 3,597.00 470.00 850.00 62.00 71.00 11.50 13.00 6,751.00 53.00
50% NaN 1,558.00 1,110.00 434.00 23.00 54.00 1,707.00 353.00 9,990.00 4,200.00 500.00 1,200.00 75.00 82.00 13.60 21.00 8,377.00 65.00
75% NaN 3,624.00 2,424.00 902.00 35.00 69.00 4,005.00 967.00 12,925.00 5,050.00 600.00 1,700.00 85.00 92.00 16.50 31.00 10,830.00 78.00
max NaN 48,094.00 26,330.00 6,392.00 96.00 100.00 31,643.00 21,836.00 21,700.00 8,124.00 2,340.00 6,800.00 103.00 100.00 39.80 64.00 56,233.00 118.00
# Alternative solution: call describe twice, once for numeric columns and once for object columns.
college.describe(include=['number'])
# or college.describe(include=[np.number])
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
count 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00 777.00
mean 3,001.64 2,018.80 779.97 27.56 55.80 3,699.91 855.30 10,440.67 4,357.53 549.38 1,340.64 72.66 79.70 14.09 22.74 9,660.17 65.46
std 3,870.20 2,451.11 929.18 17.64 19.80 4,850.42 1,522.43 4,023.02 1,096.70 165.11 677.07 16.33 14.72 3.96 12.39 5,221.77 17.18
min 81.00 72.00 35.00 1.00 9.00 139.00 1.00 2,340.00 1,780.00 96.00 250.00 8.00 24.00 2.50 0.00 3,186.00 10.00
25% 776.00 604.00 242.00 15.00 41.00 992.00 95.00 7,320.00 3,597.00 470.00 850.00 62.00 71.00 11.50 13.00 6,751.00 53.00
50% 1,558.00 1,110.00 434.00 23.00 54.00 1,707.00 353.00 9,990.00 4,200.00 500.00 1,200.00 75.00 82.00 13.60 21.00 8,377.00 65.00
75% 3,624.00 2,424.00 902.00 35.00 69.00 4,005.00 967.00 12,925.00 5,050.00 600.00 1,700.00 85.00 92.00 16.50 31.00 10,830.00 78.00
max 48,094.00 26,330.00 6,392.00 96.00 100.00 31,643.00 21,836.00 21,700.00 8,124.00 2,340.00 6,800.00 103.00 100.00 39.80 64.00 56,233.00 118.00
college.describe(include=['object'])
# or college.describe(include=['O'])
Private
count 777
unique 2
top Yes
freq 565
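
To see the effect of the include argument in isolation, here is a small sketch on a toy frame (three rows taken from the table in part (a); the variable names are illustrative). It also shows value_counts, which is often a more direct summary for a Yes/No column:

```python
import pandas as pd

# Toy frame: one text column and one numeric column.
toy = pd.DataFrame(
    {"Private": ["Yes", "Yes", "No"], "Apps": [1660, 2186, 3540]}
)

# By default describe() summarizes numeric columns only.
default_cols = toy.describe().columns.tolist()

# include='all' adds the non-numeric column, with count/unique/top/freq rows.
all_cols = toy.describe(include="all").columns.tolist()

# For a single categorical column, value_counts gives the full frequency table.
private_counts = toy["Private"].value_counts()

print(default_cols)
print(all_cols)
print(private_counts)
```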

ii. Pair plot

Unlike R, seaborn's pair grid does not plot categorical against numerical variables, so Private is used only as the hue (color) here. See more here.

g = sns.PairGrid(college, vars=college.columns[1:11], hue='Private') # vars expects column names: the ten numeric columns after Private
g.map_upper(plt.scatter, s=3)
g.map_diag(plt.hist)
g.map_lower(plt.scatter, s=3)
g.fig.set_size_inches(12, 12)

[Figure: pair plot of the ten selected numeric variables, colored by Private]
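
If the categorical column should get panels of its own, one possible workaround (an assumption on our part, not something the notebook above does) is to encode it numerically first. The column name Private01 is hypothetical:

```python
import pandas as pd

# Toy frame with three rows taken from the table in part (a).
toy = pd.DataFrame(
    {"Private": ["Yes", "No", "Yes"], "Outstate": [7440, 5130, 12280]}
)

# Hypothetical helper column: 1 for private colleges, 0 for public ones.
toy["Private01"] = toy["Private"].map({"Yes": 1, "No": 0})

# The encoded column now counts as numeric and could be listed in `vars`.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

A scatter plot against a 0/1 variable is not very informative, which is part of why the box plots in the next part are the more natural tool.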

iii. Box plots

sns.boxplot(x='Private', y='Outstate', data=college);

[Figure: box plot of Outstate by Private]
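
The center line of each box is the group median, which can also be computed directly with groupby. A minimal sketch on six rows taken from the table in part (a) (three private, three public), so the numbers are not those of the full dataset:

```python
import pandas as pd

# Six rows from the table in part (a): three private and three public colleges.
toy = pd.DataFrame(
    {
        "Private": ["Yes", "Yes", "Yes", "No", "No", "No"],
        "Outstate": [7440, 12280, 11250, 5130, 6806, 7434],
    }
)

# Median out-of-state tuition per group, i.e. the center line of each box.
medians = toy.groupby("Private")["Outstate"].median()
print(medians)
```

Replacing median() with describe() would give the quartiles that define the box edges as well.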

iv. Elite variable

college.loc[college['Top10perc']>50, 'Elite'] = 'Yes'
college['Elite'] = college['Elite'].fillna('No')

sns.boxplot(x='Elite', y='Outstate', data=college);

[Figure: box plot of Outstate by Elite]
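
An equivalent way to build the Elite column in a single step is np.where, which avoids the intermediate missing values and the fillna call. A sketch on four rows taken from the table in part (a):

```python
import numpy as np
import pandas as pd

# Top10perc values for four colleges from the table in part (a).
toy = pd.DataFrame(
    {"Top10perc": [23, 16, 60, 83]},
    index=[
        "Abilene Christian University",
        "Adelphi University",
        "Agnes Scott College",
        "Amherst College",
    ],
)

# 'Yes' where more than 50% of students come from the top 10% of their class.
toy["Elite"] = np.where(toy["Top10perc"] > 50, "Yes", "No")
print(toy)
```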

v. Histograms

To produce histograms with differing numbers of bins for a few quantitative variables, the approach taken here is to bin each variable first with pd.cut and then plot the bin counts as a bar chart. Binning transforms a continuous range of values into a discrete set of intervals. For the purposes of this exercise, we only consider equal-width bins. (Matplotlib's plt.hist and pandas' Series.hist also accept a bins argument and do the binning themselves.)

# Bins creation (note: this overwrites the original numeric columns with categorical bins)
college['PhD'] = pd.cut(college['PhD'], 3, labels=['Low', 'Medium', 'High'])
college['Grad.Rate'] = pd.cut(college['Grad.Rate'], 5, labels=['Very low', 'Low', 'Medium', 'High', 'Very high'])
college['Books'] = pd.cut(college['Books'], 2, labels=['Low', 'High'])
college['Enroll'] = pd.cut(college['Enroll'], 4, labels=['Very low', 'Low', 'High', 'Very high'])
# Plot histograms; sort_index() keeps the bars in bin order instead of ordering them by frequency
fig = plt.figure()

plt.subplot(221)
college['PhD'].value_counts().sort_index().plot(kind='bar', title='PhD');
plt.subplot(222)
college['Grad.Rate'].value_counts().sort_index().plot(kind='bar', title='Grad.Rate');
plt.subplot(223)
college['Books'].value_counts().sort_index().plot(kind='bar', title='Books');
plt.subplot(224)
college['Enroll'].value_counts().sort_index().plot(kind='bar', title='Enroll');

fig.subplots_adjust(hspace=1) # To add space between subplots

[Figure: 2×2 grid of bar charts showing bin counts for PhD, Grad.Rate, Books and Enroll]
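
As a cross-check on the binning approach, the sketch below compares equal-width bin counts from pd.cut with those from np.histogram, which is the routine plt.hist relies on. The values are hypothetical and chosen so that none falls exactly on an interior bin edge, since the two functions close their intervals on opposite sides:

```python
import numpy as np
import pandas as pd

# Hypothetical values; none lies exactly on an interior bin edge.
values = pd.Series([8, 20, 35, 50, 62, 75, 90, 103])

# Equal-width binning done by NumPy into 3 bins over [min, max].
counts, edges = np.histogram(values, bins=3)

# The same 3 equal-width bins via pd.cut, counted in bin order.
cut_counts = pd.cut(values, 3).value_counts().sort_index()

print(counts.tolist())
print(cut_counts.tolist())
```

Both approaches give the same counts here, so calling plt.hist(values, bins=3) on the numeric column would be a shorter route to the same picture without overwriting the data.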

vi. Continue exploring the data

"This exercise is trivial and is left to the reader." :)