In my previous post I completed an exercise using logistic regression to generate complicated non-linear decision boundaries. In this exercise I’m going to use much of the same code for handwriting recognition. These exercises are all part of Andrew Ng’s Machine Learning course on Coursera. All the exercises are done in MATLAB/Octave, but I’ve been stubborn and have worked solutions in R instead.

### The Data

For this exercise the dataset comprises 5000 training examples, where each example is a $20 \times 20$ pixel grayscale image of a digit from 0 to 9. The pixel values (which are floating point numbers) have been unrolled into a 400-dimensional vector, giving a $5000 \times 400$ matrix $X$, where each row is a training example.

To visualise a subset of the data, I have been using the `raster` package in R.

### 1.2 Visualising the data

In Andrew Ng’s Machine Learning course the raster drawing function is already written. I’m going to try to produce an R equivalent using the `raster` package.

I’ll start by loading the data and randomly selecting a 100-row subset of it.
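The original listing for this step is not shown here, so the following is only a minimal sketch of the subsetting. It assumes the data have already been read into a numeric matrix `X` and a label vector `y` (the course ships them as a MATLAB file, which could be read with something like `readMat()` from the `R.matlab` package, though that is an assumption). To keep the sketch runnable, a random stand-in of the same shape is used, and the names `idx`, `X_sub` and `y_sub` are hypothetical.

```r
# Stand-in for the course data: the real X is 5000 x 400 pixel values and
# y holds the digit labels. Random numbers are used here purely for shape.
set.seed(123)
X <- matrix(rnorm(5000 * 400), nrow = 5000, ncol = 400)
y <- sample(1:10, 5000, replace = TRUE)

# Draw 100 distinct row indices, then keep the matching rows and labels.
idx   <- sample(nrow(X), 100)
X_sub <- X[idx, ]
y_sub <- y[idx]

dim(X_sub)
```

Keeping `idx` around is handy later, because it lets a second subset be drawn from the rows that were not used the first time.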

One quirk of the `raster` package is that, for a grayscale image, it expects values between 0 and 1, which is not the case in the training data. The values are also unrolled, so to create a bitmap they need to be rolled back up.
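Those two steps can be sketched roughly as below. The helper names `rescale01` and `roll_digit` are hypothetical. The roll-up assumes the pixels were unrolled column by column (which is how Octave's `reshape` works and also R's default fill order for `matrix()`).

```r
# Rescale a numeric vector linearly onto [0, 1].
rescale01 <- function(v) {
  rng <- range(v)
  if (rng[1] == rng[2]) return(rep(0, length(v)))  # constant input: avoid 0/0
  (v - rng[1]) / (rng[2] - rng[1])
}

# Roll a 400-element row back up into a 20 x 20 matrix.
# matrix() fills column by column, matching the assumed unrolling order.
roll_digit <- function(row, size = 20) {
  stopifnot(length(row) == size * size)
  matrix(rescale01(row), nrow = size, ncol = size)
}

set.seed(1)
fake_row <- rnorm(400)        # stand-in for one row of X
img <- roll_digit(fake_row)
dim(img)
range(img)
```

Rescaling each row separately stretches every digit to the full grey range, which is fine for display but is not something you would necessarily want to do to the model inputs.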

Now we can plot a single digit using:
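The original plotting call is not reproduced here. As a stand-in, this sketch uses base R's `as.raster()` and its `plot()` method rather than any particular function from the `raster` package, so treat it as one possible approach and not the exact code used for the figures. The input row is random data standing in for a rescaled row of $X$.

```r
set.seed(1)
row <- runif(400)                          # stand-in for one rescaled row of X
img <- matrix(row, nrow = 20, ncol = 20)   # roll the row back up
digit <- as.raster(img)                    # grey levels from values in [0, 1]

# Draw only in an interactive session so the sketch also runs under Rscript.
if (interactive()) plot(digit)
```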

So that’s great for a single row, or a single training example. But it would be nice to plot the entire 100-row dataset that we are working from as a matrix. The following code loops through each row, and places each $20 \times 20$ pixel grid into a matrix of $100$ bitmaps.
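A sketch of such a loop, under the assumption that the 100 digits are laid out ten to a row on one large $200 \times 200$ matrix. The function `tile_digits` and its arguments are hypothetical names, and the input is random data standing in for the subset of $X$.

```r
# Rescale a numeric vector linearly onto [0, 1].
rescale01 <- function(v) {
  rng <- range(v)
  if (rng[1] == rng[2]) return(rep(0, length(v)))
  (v - rng[1]) / (rng[2] - rng[1])
}

# Place each row of X, rolled up into a size x size tile, onto one big matrix.
tile_digits <- function(X, size = 20, per_row = 10) {
  n <- nrow(X)
  sheet <- matrix(0, nrow = ceiling(n / per_row) * size, ncol = per_row * size)
  for (i in seq_len(n)) {
    tile_row <- (i - 1) %/% per_row          # zero-based position down the sheet
    tile_col <- (i - 1) %% per_row           # zero-based position across it
    rows <- tile_row * size + seq_len(size)
    cols <- tile_col * size + seq_len(size)
    sheet[rows, cols] <- matrix(rescale01(X[i, ]), size, size)
  }
  sheet
}

set.seed(2)
X <- matrix(rnorm(100 * 400), nrow = 100)    # stand-in for the 100-row subset
sheet <- tile_digits(X)
dim(sheet)

# Draw only in an interactive session so the sketch also runs under Rscript.
if (interactive()) plot(as.raster(sheet))
```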

Which gives us…

So great, this is what we are trying to classify.

### Multiclass classification

In this exercise I’m going to use the code I wrote in the previous post, which should be ready to go out of the box.

For multiclass classification with logistic regression we simply run a model for each possible class, then combine this ensemble of models, and pick the class whose model gives the highest likelihood.

Now, because the code is well vectorised, running ten models together is an absolute breeze. First we define the parameter matrix $\theta$.

Then we use a `for` loop to generate parameters for each of our ten models.
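The code from the previous post is not repeated here, so what follows is a self-contained sketch of the idea and not that code. It assumes a regularised logistic regression cost and gradient minimised with base R's `optim()`, one row of $\theta$ per class, and an intercept that is not penalised. All function names are hypothetical, and three synthetic clusters stand in for the ten digit classes.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Regularised logistic regression cost (probabilities clipped to avoid log(0)).
cost <- function(theta, X, y, lambda) {
  m <- nrow(X)
  h <- pmin(pmax(sigmoid(X %*% theta), 1e-12), 1 - 1e-12)
  -mean(y * log(h) + (1 - y) * log(1 - h)) + lambda / (2 * m) * sum(theta[-1]^2)
}

# Matching gradient; the intercept (first element) is not regularised.
grad <- function(theta, X, y, lambda) {
  m <- nrow(X)
  as.vector(t(X) %*% (sigmoid(X %*% theta) - y)) / m + lambda / m * c(0, theta[-1])
}

train_one_vs_all <- function(X, y, classes, lambda = 0) {
  Xb <- cbind(1, X)                                    # add the intercept column
  Theta <- matrix(0, nrow = length(classes), ncol = ncol(Xb))
  for (k in seq_along(classes)) {
    # Recode the labels as 1 for "this class" and 0 for "everything else".
    fit <- optim(par = Theta[k, ], fn = cost, gr = grad, X = Xb,
                 y = as.numeric(y == classes[k]), lambda = lambda,
                 method = "BFGS")
    Theta[k, ] <- fit$par
  }
  Theta
}

# Synthetic stand-in: three well separated clusters in two dimensions.
set.seed(42)
centres <- rbind(c(3, 0), c(-3, 0), c(0, 4))
y <- rep(1:3, each = 50)
X <- centres[y, ] + matrix(rnorm(150 * 2, sd = 0.5), ncol = 2)

Theta <- train_one_vs_all(X, y, classes = 1:3, lambda = 1)
dim(Theta)    # one row of parameters per class

# Training accuracy of the combined models.
pred <- max.col(sigmoid(cbind(1, X) %*% t(Theta)), ties.method = "first")
mean(pred == y)
```

For the digits, `classes` would be `1:10` and `X` would have 400 columns, giving a $10 \times 401$ parameter matrix.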

Now we run a logistic regression model using these parameters, which is simply $h_\theta=g(\theta^TX)$ where $g$ is the sigmoid function $g(z)=\frac{1}{1 + e^{-z}}$.
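With examples stored as rows of $X$, the same calculation can be written as a single matrix product followed by a row-wise maximum. The sketch below uses a small hand-picked parameter matrix so the result can be checked by eye. The function name `predict_one_vs_all` and the toy values are assumptions for illustration.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Theta has one row per class; X has one row per example (no intercept column).
predict_one_vs_all <- function(Theta, X, classes) {
  probs <- sigmoid(cbind(1, X) %*% t(Theta))        # examples x classes
  classes[max.col(probs, ties.method = "first")]    # most probable class per row
}

# Toy parameters: class 1 likes large x1, class 2 likes small x1,
# and the third class (labelled 10) likes large x2.
Theta <- rbind(c(0, 1, 0),
               c(0, -1, 0),
               c(-1, 0, 1))
X <- rbind(c(2, 0),
           c(-2, 0),
           c(0, 5))

predict_one_vs_all(Theta, X, classes = c(1, 2, 10))   # returns 1 2 10
```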

### The result

That was pretty straightforward. Let’s check the first few predictions against the bitmap plotted earlier:

So far so good. Note that zeros are classified as tens to avoid confusion, so a prediction of 10 means the digit 0. So how well does the model work on the training data overall?
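Overall accuracy is just the proportion of predictions that match the labels. A minimal sketch, with made-up vectors standing in for the model's predictions and the true labels:

```r
# Proportion of predictions that agree with the true labels.
accuracy <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  mean(predicted == actual)
}

predicted <- c(10, 1, 2, 3, 3)   # illustrative values only
actual    <- c(10, 1, 2, 3, 4)

accuracy(predicted, actual)      # 4 of 5 correct, so 0.8
table(predicted, actual)         # cross-tabulation shows which digits get confused
```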

So the model currently achieves 100% accuracy on the training data with $\lambda = 0$ (the regularisation parameter), i.e. with no regularisation at all.

### What about a test set?

I’ll wrap this all in a function, then try it on a different subset of the $X$ matrix.

So repeating the earlier code, I select a different random subset of 100 rows from the $X$ matrix.

And looping through a range of $\lambda$, how accurately is the model predicting the digits?
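A sketch of how the wrapped function and the $\lambda$ loop might look. It assumes the same regularised cost and gradient as before, fitted with `optim()`, and it draws the test rows only from rows not used for training, which may or may not match what was done for the figures quoted in this post. The name `ova_accuracy`, the $\lambda$ values and the synthetic three-class data are all assumptions.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Regularised logistic regression cost and gradient (intercept not penalised).
cost <- function(theta, X, y, lambda) {
  h <- pmin(pmax(sigmoid(X %*% theta), 1e-12), 1 - 1e-12)
  -mean(y * log(h) + (1 - y) * log(1 - h)) +
    lambda / (2 * nrow(X)) * sum(theta[-1]^2)
}
grad <- function(theta, X, y, lambda) {
  as.vector(t(X) %*% (sigmoid(X %*% theta) - y)) / nrow(X) +
    lambda / nrow(X) * c(0, theta[-1])
}

# Fit one model per class on the training rows, then score the test rows.
ova_accuracy <- function(X_train, y_train, X_test, y_test, lambda) {
  classes <- sort(unique(y_train))
  Xb <- cbind(1, X_train)
  Theta <- t(sapply(classes, function(k) {
    optim(rep(0, ncol(Xb)), cost, grad, X = Xb,
          y = as.numeric(y_train == k), lambda = lambda, method = "BFGS")$par
  }))
  probs <- sigmoid(cbind(1, X_test) %*% t(Theta))
  mean(classes[max.col(probs, ties.method = "first")] == y_test)
}

# Synthetic stand-in: three overlapping clusters, so accuracy is below 100%.
set.seed(7)
n <- 400
y <- sample(1:3, n, replace = TRUE)
centres <- rbind(c(1.5, 0), c(-1.5, 0), c(0, 2))
X <- centres[y, ] + matrix(rnorm(n * 2), ncol = 2)

train_idx <- sample(n, 100)
test_idx  <- sample(setdiff(seq_len(n), train_idx), 100)  # disjoint from training

lambdas <- c(0, 0.1, 1, 10)
results <- data.frame(
  lambda = lambdas,
  accuracy = sapply(lambdas, function(l)
    ova_accuracy(X[train_idx, ], y[train_idx], X[test_idx, ], y[test_idx], l))
)
results
```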

So not bad considering the model was trained on a dataset the same size as the test set. With varying levels of regularisation ($\lambda$) the model has between 74% and 77% accuracy.

Next time I’ll define training, test, and cross-validation sets with a 60:20:20 split to improve classification and better inform my choice of $\lambda$.