Exercise 2.7
import numpy as np
import pandas as pd
d = {'X1': pd.Series([0, 2, 0, 0, -1, 1]),
     'X2': pd.Series([3, 0, 1, 1, 0, 1]),
     'X3': pd.Series([0, 0, 3, 2, 1, 1]),
     'Y':  pd.Series(['Red', 'Red', 'Red', 'Green', 'Green', 'Red'])}
df = pd.DataFrame(d)
df.index = np.arange(1, len(df) + 1)  # number the observations 1-6, as in the exercise
df
Obs. | X1 | X2 | X3 | Y
---|---|---|---|---
1 | 0 | 3 | 0 | Red
2 | 2 | 0 | 0 | Red
3 | 0 | 1 | 3 | Red
4 | 0 | 1 | 2 | Green
5 | -1 | 0 | 1 | Green
6 | 1 | 1 | 1 | Red
(a) Euclidean distance to the test point X1 = X2 = X3 = 0
# Euclidean distance from each observation to the test point X1 = X2 = X3 = 0
df['distance'] = np.sqrt(df['X1']**2 + df['X2']**2 + df['X3']**2)
df
Obs. | X1 | X2 | X3 | Y | distance
---|---|---|---|---|---
1 | 0 | 3 | 0 | Red | 3.000000
2 | 2 | 0 | 0 | Red | 2.000000
3 | 0 | 1 | 3 | Red | 3.162278
4 | 0 | 1 | 2 | Green | 2.236068
5 | -1 | 0 | 1 | Green | 1.414214
6 | 1 | 1 | 1 | Red | 1.732051
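The same distances can also be obtained in a single call with np.linalg.norm; a minimal cross-check of the column above:
# Row-wise L2 norm of (X1, X2, X3), i.e. the distance from each observation to (0, 0, 0)
np.linalg.norm(df[['X1', 'X2', 'X3']].values, axis=1)
# array([3.        , 2.        , 3.16227766, 2.23606798, 1.41421356, 1.73205081])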
(b) K=1
# Sort the observations by distance to the test point
df.sort_values('distance')
Obs. | X1 | X2 | X3 | Y | distance
---|---|---|---|---|---
5 | -1 | 0 | 1 | Green | 1.414214
6 | 1 | 1 | 1 | Red | 1.732051
2 | 2 | 0 | 0 | Red | 2.000000
4 | 0 | 1 | 2 | Green | 2.236068
1 | 0 | 3 | 0 | Red | 3.000000
3 | 0 | 1 | 3 | Red | 3.162278
As we can see by sorting the observations by distance to the test point (the origin), for K=1 our prediction is Green, since that is the class of the nearest neighbor (observation 5, at distance 1.41).
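The same conclusion can be read off programmatically; a minimal sketch using the distance column computed in (a):
# Class of the single nearest neighbor (K=1)
df.sort_values('distance')['Y'].iloc[0]
# 'Green'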
(c) K=3
On the other hand, for K=3 our prediction is Red, because that is the most common class among the 3 nearest neighbors: Green, Red and Red (observations 5, 6 and 2, respectively).
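Again, this can be checked in code. The sketch below takes the most common label among the three nearest neighbors and then cross-checks both answers with scikit-learn's KNeighborsClassifier (scikit-learn is an extra dependency assumed here; the exercise itself does not require it):
# Most common class among the three nearest neighbors (K=3)
df.sort_values('distance')['Y'].head(3).mode()[0]
# 'Red'

# Optional cross-check with scikit-learn
from sklearn.neighbors import KNeighborsClassifier
X, y = df[['X1', 'X2', 'X3']], df['Y']
for k in (1, 3):
    print(k, KNeighborsClassifier(n_neighbors=k).fit(X, y).predict([[0, 0, 0]])[0])
# 1 Green
# 3 Red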
(d) Highly non-linear Bayes decision boundary
A large value of K leads to a smoother decision boundary, as if the non-linearities were averaged out. This happens because KNN uses majority voting, which places little weight on any individual point. For a large K, the decision boundary varies very little from point to point: the vote is usually decided by a comfortable majority, so a single neighbor switching from one class to the other leaves the prediction unchanged. By contrast, when K is very small, the decision boundary can follow local non-linearities, because with so few neighbors the majority can change from one point to the next. Accordingly, if the Bayes decision boundary is highly non-linear, we would expect the best value of K to be small.
Imagine this simple example: the true boundary is linear and splits the plane into two half-planes (classes A and B), but with an additional enclave, i.e., a small region of class A inside the class B half-plane. Would we be more likely to capture this region with a small or a large K? Say the enclave contains 3 class A points, surrounded by a class B region with, say, 50 points. Each enclave point has only 2 other class A points near it, so if K > 4 it will be classified as B: the class A votes always lose the majority vote (unless K is so large, say 100, that many points from the class A half-plane enter the voting). If, on the other hand, K < 4, each of the 3 class A points inside the enclave will be classified as class A, and we capture this non-linearity of the regions. A small simulation of this setup is sketched below.
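Here is that simulation as a minimal sketch (the grid coordinates, enclave positions, class sizes and K values are illustrative assumptions, not part of the exercise): class A fills the left half-plane, class B the right, and a pocket of three class A points sits deep inside the class B region. A test point at the center of the pocket is classified as A for small K and as B for large K.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# 50 class A points on a grid in the left half-plane, 50 class B points mirrored on the right
gx, gy = np.meshgrid(np.linspace(-2.75, -0.25, 5), np.linspace(-2.5, 2.5, 10))
A = np.column_stack([gx.ravel(), gy.ravel()])
B = np.column_stack([-gx.ravel(), gy.ravel()])

# A small class A enclave deep inside the class B half-plane
enclave = np.array([[2.0, 0.0], [2.1, 0.15], [1.9, -0.15]])

X = np.vstack([A, B, enclave])
y = np.array(['A'] * len(A) + ['B'] * len(B) + ['A'] * len(enclave))

# Classify a test point at the center of the enclave for small and large K
for k in (1, 3, 25):
    print(k, KNeighborsClassifier(n_neighbors=k).fit(X, y).predict([[2.0, 0.0]])[0])
# 1 A   (the enclave is captured)
# 3 A   (still captured: the 3 nearest neighbors are the enclave points)
# 25 B  (voted away by the surrounding class B points)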