Exercise 2.7

import numpy as np
import pandas as pd

# Observations from the exercise, indexed 1-6 as in the exercise's table
d = {'X1': pd.Series([0, 2, 0, 0, -1, 1]),
     'X2': pd.Series([3, 0, 1, 1, 0, 1]),
     'X3': pd.Series([0, 0, 3, 2, 1, 1]),
     'Y': pd.Series(['Red', 'Red', 'Red', 'Green', 'Green', 'Red'])}
df = pd.DataFrame(d)
df.index = np.arange(1, len(df) + 1)
df
   X1  X2  X3      Y
1   0   3   0    Red
2   2   0   0    Red
3   0   1   3    Red
4   0   1   2  Green
5  -1   0   1  Green
6   1   1   1    Red

(a) Euclidean distance

# Euclidean distance of each observation from the test point (0, 0, 0)
df['distance'] = np.sqrt(df['X1']**2 + df['X2']**2 + df['X3']**2)
df
   X1  X2  X3      Y  distance
1   0   3   0    Red  3.000000
2   2   0   0    Red  2.000000
3   0   1   3    Red  3.162278
4   0   1   2  Green  2.236068
5  -1   0   1  Green  1.414214
6   1   1   1    Red  1.732051
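As a quick sanity check, the distance is just the square root of the sum of squared coordinates; for observation 3, for example, sqrt(0^2 + 1^2 + 3^2) = sqrt(10) ≈ 3.162, which matches the table above.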

(b) K=1

df.sort_values(['distance'])
   X1  X2  X3      Y  distance
5  -1   0   1  Green  1.414214
6   1   1   1    Red  1.732051
2   2   0   0    Red  2.000000
4   0   1   2  Green  2.236068
1   0   3   0    Red  3.000000
3   0   1   3    Red  3.162278

As we can see by sorting the observations by their distance to the origin, for K=1 our prediction is Green, since that is the class of the single nearest neighbor (point 5, at distance ≈ 1.41).
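Purely as an optional cross-check (assuming scikit-learn is installed; it is not used anywhere else in this exercise), the same answer can be reproduced with KNeighborsClassifier:

from sklearn.neighbors import KNeighborsClassifier

X = df[['X1', 'X2', 'X3']].values   # drop the distance column; use the raw predictors
y = df['Y'].values

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
knn1.predict([[0, 0, 0]])   # predicts 'Green', agreeing with the manual calculation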

(c) K=3

On the other hand, for K=3 our prediction is Red, because that is the mode of the classes of the 3 nearest neighbors: Green, Red and Red (points 5, 6 and 2, respectively).
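Continuing the optional scikit-learn check from part (b), setting n_neighbors=3 gives the corresponding prediction:

knn3 = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn3.predict([[0, 0, 0]])   # predicts 'Red', the majority class among points 5, 6 and 2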

(d) Highly non-linear Bayes decision boundary

A large value of K leads to a smoother decision boundary, as if the non-linearities were averaged out. This happens because KNN classifies by majority vote among the K nearest neighbors, which places less emphasis on any individual point. For a large K, the prediction varies little from point to point: a single neighbor switching from one class to the other will usually not change the outcome of the vote, because the majority is large. By contrast, when K is very small, the decision boundary can capture local non-linearities, because the set of nearest neighbors, and therefore the vote, can change substantially from point to point when there are so few neighbors. Accordingly, if the Bayes decision boundary is highly non-linear, we would expect the best value of K to be small.
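To see this effect numerically, here is a minimal sketch (again assuming scikit-learn is available; the wavy boundary, the sample size and the particular K values are just illustrative choices) that fits KNN for several K on simulated data whose Bayes boundary is highly non-linear:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_sim = rng.uniform(-2, 2, size=(1000, 2))
# Highly non-linear (wavy) Bayes boundary: the class is determined by sin(3*x1) vs x2
y_sim = (np.sin(3 * X_sim[:, 0]) > X_sim[:, 1]).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_sim, y_sim, random_state=0)
for k in (1, 5, 25, 125):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    print(k, acc)
# We would expect test accuracy to deteriorate as K grows, since a large K
# smooths over the wiggles of the true boundary.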

Imagine this simple example: the true boundary is linear and splits the plane into two half-planes (classes A and B), but there is an additional enclave in one of the regions; that is, the true model has a small region of, say, class A inside the class B half-plane. Would we be more likely to capture this region with a small or a large K? Say we have 3 data points in the class A enclave, surrounded by a class B region containing, say, 50 points. If K > 4, each of the 3 enclave points will be classified as B, since the class A points always lose the majority vote (unless K is so large, say 100, that many points from the class A half-plane also enter the vote). If, on the other hand, K < 4, each of the 3 class A points inside the enclave will be classified as class A, and we will capture this non-linearity of the regions, as the sketch below illustrates.
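The same arithmetic can be sketched in code (again assuming scikit-learn; the grid of class B points and the enclave coordinates are made up for illustration). Here a test point inside the enclave, rather than the training points themselves, has all 3 class A points among its neighbors, so the vote flips between K=3 and K=7:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# 64 class B points on a regular grid, plus a tight enclave of 3 class A points near (0.5, 0.5)
grid = np.linspace(0, 1, 8)
xx, yy = np.meshgrid(grid, grid)
X_B = np.column_stack([xx.ravel(), yy.ravel()])
X_A = np.array([[0.50, 0.50], [0.51, 0.50], [0.50, 0.51]])
X_enc = np.vstack([X_B, X_A])
y_enc = np.array(['B'] * len(X_B) + ['A'] * len(X_A))

for k in (3, 7):
    pred = KNeighborsClassifier(n_neighbors=k).fit(X_enc, y_enc).predict([[0.505, 0.505]])
    print(k, pred)
# K=3: the 3 enclave points are the nearest neighbors -> 'A'
# K=7: 3 A votes vs 4 B votes -> 'B', the enclave is smoothed away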