I’ve been doing Andrew Ng’s excellent Machine Learning course on coursera. The second exercise is to implement from scratch vectorised logistic regression for classification. Submissions to the exercises have to be made in Octave or Matlab; in this post I give the solution using R.

Andrew Ng uses the algorithm `fminunc`

in Matlab/Octave to optimise the logistic regression solution. In R you can use the `optim`

function, but I have been using the `ucminf`

function provided in the package `ucminf`

. `uncminf`

takes the following arguments:

`ucminf(par, fn, gr = NULL, ..., control = list(), hessian=0)`

The ones we are interested in are:

Arguments | |
---|---|

`par` |
Initial estimate of minimum for fn. |

`fn` |
Objective function to be minimized. |

`gr` |
Gradient of objective function If `NULL` a finite difference approximation is used. |

So I need to define three functions: logistic regression, a cost function, and a function which returns the gradient of that cost. These are defined in the course, helpfully:

Logistic regression is defined as:

\[h_{\theta}(x)=g(\theta^{T}x)\]where $g$ is the sigmoid function:

\[g(z)=\frac{1}{1+e^{-z}}\]The cost function is given by:

\[J(\theta)=\frac{1}{m}\sum^m_{i=1}[-y^{(i)}\log(h_\theta(x^{(i)}))-(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]\]And the gradient of the cost is a vector of the same length as $\theta$ where the $j^{th}$ element (for $j = 0,1,\cdots,n$) is defined as:

\[\frac{\delta J(\theta)}{\delta\theta_{j}}=\frac{1}{m}\sum^{m}_{i=1}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_j\]### Vectorised logistic regression

The first step is to implement a sigmoid function:

…and with this function, implementing $h_{\theta}$ is easy:

### Cost function and gradient

I’ll start by implementing an only partially vectorised version of the cost function $J(\theta)$:

…and the gradient…

### Testing it out…

First, I’ll plot the data:

And now try out logistic regression with `ucminf`

:

So this gives a lot of output. But importantly it gives us three coefficients (`$par`

), the final cost (`$value`

), and that convergence was reached (`$convergence`

).

Andrew Ng suggests that the final cost should be 0.203, which is what I get, so it seems to be working, and using `$par`

to plot the decision voundary, we get a pretty good fit:

### Vectorising

There is an excellent post on vectorising these functions on Stack Overflow which gives a better vectorised version of the algorithms above, e.g.:

It wasn’t immediately clear to me what’s going on here, so I’m going to break this down piece by piece.

First we create two vectors `c(y, 1 - y)`

and `c(log(h(theta,X)), log(1 - h(theta,X)))`

and compute the cross product of them. The first matrix is the concatenation of the $-y$ and $(1-y)$ terms for length $m$ from the equation:

The second vector concatenates the remaining terms:

\[\text{log}(h_{\theta(x^{(i)})})\]and

\[\text{log}(1-h_{\theta(x^{(i)})})\]The crossproduct of these two vectors is essentially the same as $\vec{a}^T\vec{b}$; basically the sum of every value of $\vec{a}$ multiplied by the corresponding value of $\vec{b}$. i.e.: (note that not all of the first vector equal zero: $\vec{a_{(i)}}\neq0$, $\vec{a}\in{0,1}$).

\[\begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \end{bmatrix}\begin{bmatrix} -0.6931472 \\ -0.6931472 \\ -0.6931472 \\ \vdots \\ -0.6931472 \end{bmatrix} = -0.6931472\]The $\delta$ function can also be speeded up slightly by employing `crossprod`

instead of `t(X) %*% h(theta,X)`

.

Ok so now that we have some additional vectorisation, let’s look at plugging it into the `ucminf`

function.

So great, the two are giving the same answer. But it would be interesting to see what the speed increase is like when comparing the non-vectorised, vectorised, and the usual `glm`

method

First let me just check that the `glm`

implementation returns the same parameters:

Perfect. Now to compare the three I’ll use the excellent `rbenchmark`

package.

So even with a relatively small dataset of just 100 rows, we find that a vectorised linear regression solved using an optimisation algorithm is many times quicker than applying a generalised linear model. Kinda makes it all worthwhile!

Next time I’ll look at implementing regularisation to fit more complicated decision boundaries.