### Overview

In my last post, I started working through some examples given by Hastie et al. in Elements of Statistical Learning. I looked at using a linear model for classification on a randomly generated training set. In this post I’ll use nearest neighbour methods to create a non-linear decision boundary over the same data.

## Nearest neighbour algorithm

There are much more learned folk than I who give good explanations of the maths behind nearest neighbours, so I won’t spend too long on the theory. Hastie et al. define the nearest neighbour approach as:

\[\hat{Y}(x)=\frac{1}{k}\sum_{x_i\in N_k(x)}y_i\]Here $k$ is the number of neighbours we consider, and is user defined. Our prediction $\hat{Y}(x)$ is the mean of the responses $y_i$ over $N_k(x)$, the neighbourhood containing the $k$ training examples closest to $x$.

How do we define this closeness? Hastie et al simply use Euclidean distance:

\[d(x_i, x) = \sqrt{\sum_{j=1}^{p}(x_{ij} - x_j)^2}\]where $p$ is the number of features (here $p=2$). So, simply put, all we need to do is find the $k$ training examples nearest to a given point and average their responses to create our prediction at that point.
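As a quick sketch, the Euclidean distance between two feature vectors can be computed directly in R (the function name `euclidean` and the example points are my own, for illustration):

```r
# Euclidean distance between two feature vectors
euclidean <- function(x, z) sqrt(sum((x - z)^2))

euclidean(c(0, 0), c(3, 4))  # 5
```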

## R Walkthrough

As ever, the code to produce this post is available on GitHub.

Using the data I generated in my previous post I’ll walk through the process of producing a nearest neighbour prediction.

Just to recap: this is a dataset with 300 training examples, two features ($x_1$ and $x_2$), and a binary coded response variable ($X\in\mathbb{R}^{300 \times 2}$, $y\in\{0,1\}$):

### Calculating Euclidean distance

The first thing I need to do is calculate the Euclidean distance from every training example to every other training example - i.e. create a distance matrix. Fortunately R does this very simply with the `dist()` command. `dist()` returns a compact lower-triangle object, which `as.matrix()` expands into the full $m \times m$ symmetric distance matrix.
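A minimal sketch of this step, using a small simulated stand-in for the 300-example training set:

```r
set.seed(1)
# stand-in for the training features: m = 10 examples, 2 features
X <- matrix(rnorm(20), ncol = 2)

# dist() returns a compact lower-triangle object;
# as.matrix() expands it to the full m x m distance matrix
D <- as.matrix(dist(X))
dim(D)  # 10 10
```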

I’m interested in the 15 nearest neighbour average, like Hastie et al., so I just need to extract the 15 shortest distances from each of these columns.
It helps at this point to break the matrix into a list using `split()`, with one vector element per column.
This will allow me to use `purrr::map()`, which has an easier syntax than other loop handlers like `apply()` and its cousins.
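Sketching that step on the same toy distance matrix (this assumes the `purrr` package is installed; the `mean` summary here is just to show the iteration):

```r
library(purrr)

set.seed(1)
D <- as.matrix(dist(matrix(rnorm(20), ncol = 2)))  # toy 10 x 10 distance matrix

# split() with col(D) gives one list element per column of D
D_list <- split(D, col(D))

# map_dbl() then iterates over the columns;
# e.g. each point's mean distance to all the others
head(map_dbl(D_list, mean))
```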

Now I need a small helper function to return the closest $k$ points, so that I can take an average.
For this I use `order()`, which gives, for each vector element in the list, the indices of $D$ corresponding to the $k$ closest training examples.
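A sketch of such a helper (the name `nearest_k` is mine; note that when predicting on the training set itself, each point’s nearest neighbour is itself, at distance zero):

```r
# indices of the k smallest distances in one column of D
nearest_k <- function(d, k = 15) order(d)[seq_len(k)]

d <- c(0, 2.5, 0.3, 1.7, 0.9)  # toy distance column
nearest_k(d, k = 3)  # 1 3 5
```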

So far so good: the function returns us 15 indices with which we can subset $y$ to get our 15 nearest neighbour majority vote. The values of $y$ are then averaged…

…and converted into a binary classification, such that $G\in\{0,1\}$: where $\hat{Y}>0.5$, $G=1$; otherwise $G=0$.
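In code, the averaging and thresholding step looks like this (the 15 response values here are made up for illustration):

```r
# responses of the 15 nearest neighbours (hypothetical values)
y_nn <- c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1)

y_hat <- mean(y_nn)           # 0.8
G <- as.integer(y_hat > 0.5)  # 1: the majority of neighbours are in class 1
```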

### Intuition

Before looking at the predictions, now is a good point for a quick recap on what the model is actually doing.

For the training examples $(10, 47, 120)$ I have run the code above, and plotted out the 15 nearest neighbours whose $y$ values are averaged to get our prediction $G$.

For the right hand point you can see that for all of the 15 nearest neighbours $y=1$, hence for our binary prediction $G=1$. The opposite can be said for the left hand point: again there is a unanimous vote, and so $G=0$. For the middle point, $y=1$ for most of the neighbours, so although there is not unanimity, our prediction for this point would be $G=1$.

You can imagine from this plot that whilst varying $k$ would have little effect on the points that are firmly within their respective classes, points close to the decision boundary are likely to be affected by small changes in $k$.
Set $k$ too low, and we invite *variance* (an overly flexible fit); set $k$ too high, and we are likely to increase *bias*.
I’ll come back to this.

### Predictions

So how do the predictions made by nearest neighbours ($k=15$) match up with the actual values of $y$ in this training set?

In general: pretty good, and marginally better than the linear classifier I used in the previous post. In just $3\%$ of cases does our classifier get it wrong.

### Decision boundary

For this next plot, I use the `class::knn()` function to replace the long-winded code I produced earlier.
This function allows us to train our classifier on a training set, and then apply it to a test set, all in one simple call.

In this case I produce a test set which is just a grid of points. By applying the model to this data, I can produce a decision boundary which can be plotted.
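A sketch of that workflow with `class::knn()`, on simulated stand-in data (the class-generating rule and grid resolution here are my own choices, not the post's):

```r
library(class)

set.seed(1)
# stand-in training set: 300 examples, 2 features, binary response
train <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
cl    <- factor(as.integer(train$x1 + train$x2 > 0))

# the "test set" is just a grid of points covering the feature space
grid <- expand.grid(x1 = seq(-3, 3, length.out = 50),
                    x2 = seq(-3, 3, length.out = 50))

# train and predict in one call; pred labels every grid point,
# and plotting those labels reveals the decision boundary
pred <- knn(train = train, test = grid, cl = cl, k = 15)
```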

### Varying k

I mentioned before the impact that varying $k$ might have.
Here I have run `knn()` on the same data but for multiple values of $k$.
For $k=1$ we get a perfect fit, with multiple polygons separating all points in each class perfectly.
As $k$ increases, we see that the more peripheral polygons start to break down, until at $k=15$ there is largely a single decision boundary which weaves its way between the two classes.
At $k=99$, this decision boundary is much more linear.
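The effect on training error can be sketched as follows, again on stand-in data (at $k=1$, each training point is its own nearest neighbour, so the training error is zero by construction):

```r
library(class)

set.seed(1)
train <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
# noisy class rule so the classes overlap near the boundary
cl    <- factor(as.integer(train$x1 + train$x2 + rnorm(300, sd = 0.5) > 0))

# training error for increasing k: 0 at k = 1, rising as the fit smooths out
errors <- sapply(c(1, 15, 99), function(k) {
  mean(knn(train, test = train, cl = cl, k = k) != cl)
})
errors
```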

In my next post I will address this problem of setting $k$ again, and try to quantify when the model is suffering from variance or bias.