An Introduction to Statistical Learning: with Applications in R... with Python!

This page contains the solutions to the exercises proposed in 'An Introduction to Statistical Learning with Applications in R' (ISLR) by James, Witten, Hastie and Tibshirani [1].

Both conceptual and applied exercises were solved. An effort was made to detail all the answers and to provide a set of bibliographical references that we found useful. The exercises were solved using Python instead of R. You are welcome to collaborate.

Note [03.October.2017]: we will release the solutions chapter by chapter, at least once a month.

Solutions

Chapter 2	Chapter 3	Chapter 4	Chapter 5	Chapter 6	Chapter 7	Chapter 8	Chapter 9	Chapter 10
2.1	3.1	4.1	5.1	6.1	7.1	8.1	9.1	10.1
2.2	3.2	4.2	5.2	6.2	7.2	8.2	9.2	10.2
2.3	3.3	4.3	5.3	6.3	7.3	8.3	9.3	10.3
2.4	3.4	4.4	5.4	6.4	7.4	8.4	9.4	10.4
2.5	3.5	4.5	5.5	6.5	7.5	8.5	9.5	10.5
2.6	3.6	4.6	5.6	6.6	7.6	8.6	9.6	10.6
2.7	3.7	4.7	5.7	6.7	7.7	8.7	9.7	10.7
2.8	3.8	4.8	5.8	6.8	7.8	8.8	9.8	10.8
2.9	3.9	4.9	5.9	6.9	7.9	8.9		10.9
2.10	3.10	4.10		6.10	7.10	8.10		10.10
	3.11	4.11		6.11	7.11	8.11		10.11
	3.12	4.12			7.12	8.12
	3.13	4.13
	3.14
	3.15

Motivation

The main motivation of this project was learning. Today there are several good books and other resources from which to learn the material we covered, and we spent some time choosing a good learning project. We chose ISLR because it is an excellent, clear introduction to statistical learning that keeps a nice balance between theory, intuition, mathematical rigour and programming. Our main goal was to use the exercises as an excuse to improve our proficiency with Python's data science stack. We had done other data science projects with Python but, as we imagined, we still had a bit more to learn (and still do!). Since the book was written with R in mind, using Python was a cool additional challenge. We are strong advocates of active learning principles, and this project once more reinforced them in our minds. If you're starting out in machine learning with Python (or R!), we recommend you try it!

Technical requirements and how to install

This project was developed using Python 3.5 on Jupyter notebooks (Jupyter Lab, in fact). We tried to stay within the standard Python data science stack as much as possible. Accordingly, our main Python packages were numpy, matplotlib, pandas, seaborn, statsmodels and scikit-learn. You should be able to run this with the standard Python setup, and the additional libraries we list below.

If you're just starting out with Python, here's a more complete 'how-to'. We recommend using Anaconda, whether you are on Linux, Mac or Windows. Anaconda lets you easily manage several Python environments, where an environment is a collection of installed Python packages. Imagine that you have two projects with different requirements: a recent one with, say, Python 3.5 and matplotlib 4.0, and a legacy project with Python 2.7 and matplotlib 3.5. A good environment manager helps you install libraries and lets you switch between both environments easily, avoiding dependency migraines. You can even work on both at the same time. You don't want to know what the alternative to using an environment manager looks like. After installing Anaconda, the easiest way to proceed is to create a new environment and install the libraries we list below one by one. Once this is done, just make sure the desired environment is active (for example, on Linux and Mac, type 'source activate ' followed by the environment name) and you're good to go.
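To confirm which environment is actually active before running the notebooks, a small check from inside Python can help. The following is a minimal sketch, not part of the original project: the helper name `describe_environment` is hypothetical, and it assumes conda's convention of setting the `CONDA_DEFAULT_ENV` variable when an environment is activated.

```python
import os
import sys


def describe_environment():
    """Return basic facts about the interpreter that is currently running.

    Hypothetical helper, written for illustration only.
    """
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Path of the Python binary in use; inside an environment this
        # points into that environment's directory.
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Assumption: conda sets CONDA_DEFAULT_ENV on activation.
        # The value is None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("{}: {}".format(key, value))
```

If the reported executable does not live inside the environment you created, the environment is probably not active in the shell that launched Python.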

Here's the list of packages we installed:

jupyterlab (but this should run just as well on regular ipython notebooks)
numpy
pandas
matplotlib
scikit-learn (imported as sklearn)
seaborn
ipywidgets (so that a seaborn import warning goes away)
statsmodels
mlxtend
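After installing the packages listed above, one way to check that they are all visible from the active environment is to look each one up by its import name. This is a minimal sketch under stated assumptions: the `REQUIRED_MODULES` list and the `find_missing` helper are hypothetical names introduced here, and the import names are assumed to match the packages in the list (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the packages in the list above (assumed, not taken
# from the project itself). scikit-learn is imported as "sklearn".
REQUIRED_MODULES = [
    "jupyterlab",
    "numpy",
    "pandas",
    "matplotlib",
    "sklearn",
    "seaborn",
    "ipywidgets",
    "statsmodels",
    "mlxtend",
]


def find_missing(modules):
    """Return the top-level module names that cannot be found."""
    return [
        name for name in modules
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All listed packages were found.")
```

The check only looks the modules up without importing them, so it is quick and does not trigger any import-time warnings.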

In addition, we chose mkdocs to present these solutions as a website. We might change to a different scheme in the future (suggestions are welcome), but meanwhile we used these libraries:

mkdocs
mkdocs-cinder
pymdown-extensions #for latex

How to collaborate

If you want to collaborate, you can open an issue in our GitHub project and give us your suggestions on how to improve these solutions. On GitHub, you can also fork this project and send a pull request to fix any mistakes that you have found. Alternatively, you can also go for the classical way of collaboration and send us an e-mail. Any effort to improve the quality of these solutions will be appreciated.

Main references

To solve the exercises, in addition to thinking hard about them, we followed several references. Besides ISLR [1], which is available for free and explains almost everything you need to know to solve the exercises, we also read some other books that provide a self-contained introduction to the field of statistical learning [2, 3, 4]. We also spent some quality time on Cross Validated. For the Python data science stack we think Wes McKinney's book [5] is a good choice, as well as Jake VanderPlas' [6]. Additional references for some of the exercises are scattered throughout the solutions.

Learning resources

Fortunately, online resources are increasingly an essential tool for self-learning. During the course of this project, we found several resources that can help you on your learning path. Here are the best ones we found.

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.