 ### Matt Upson

Yo no soy marinero

# Implementing vectorised logistic regression

I’ve been doing Andrew Ng’s excellent Machine Learning course on coursera. The second exercise is to implement from scratch vectorised logistic regression for classification. Submissions to the exercises have to be made in Octave or Matlab; in this post I give the solution using R.

Andrew Ng uses the algorithm fminunc in Matlab/Octave to optimise the logistic regression solution. In R you can use the optim function, but I have been using the ucminf function provided in the package ucminf. uncminf takes the following arguments:

ucminf(par, fn, gr = NULL, ..., control = list(), hessian=0)

The ones we are interested in are:

Arguments
par Initial estimate of minimum for fn.
fn Objective function to be minimized.
gr Gradient of objective function If NULL a finite difference approximation is used.

So I need to define three functions: logistic regression, a cost function, and a function which returns the gradient of that cost. These are defined in the course, helpfully:

Logistic regression is defined as:

$h_{\theta}(x)=g(\theta^{T}x)$

where $g$ is the sigmoid function:

$g(z)=\frac{1}{1+e^{-z}}$

The cost function is given by:

$J(\theta)=\frac{1}{m}\sum^m_{i=1}[-y^{(i)}\log(h_\theta(x^{(i)}))-(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]$

And the gradient of the cost is a vector of the same length as $\theta$ where the $j^{th}$ element (for $j = 0,1,\cdots,n$) is defined as:

$\frac{\delta J(\theta)}{\delta\theta_{j}}=\frac{1}{m}\sum^{m}_{i=1}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_j$

### Vectorised logistic regression

The first step is to implement a sigmoid function: …and with this function, implementing $h_{\theta}$ is easy:

I’ll start by implementing an only partially vectorised version of the cost function $J(\theta)$:

### Testing it out…

First, I’ll plot the data: And now try out logistic regression with ucminf:

So this gives a lot of output. But importantly it gives us three coefficients ($par), the final cost ($value), and that convergence was reached ($convergence). Andrew Ng suggests that the final cost should be 0.203, which is what I get, so it seems to be working, and using $par to plot the decision voundary, we get a pretty good fit: ### Vectorising

There is an excellent post on vectorising these functions on Stack Overflow which gives a better vectorised version of the algorithms above, e.g.:

It wasn’t immediately clear to me what’s going on here, so I’m going to break this down piece by piece.

First we create two vectors c(y, 1 - y) and c(log(h(theta,X)), log(1 - h(theta,X))) and compute the cross product of them. The first matrix is the concatenation of the $-y$ and $(1-y)$ terms for length $m$ from the equation:

$J(\theta)\frac{1}{m}\sum^m_{i=1}[-y^{(i)}\log(h_\theta(x^{(i)}))-(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]$

The second vector concatenates the remaining terms:

$\text{log}(h_{\theta(x^{(i)})})$

and

$\text{log}(1-h_{\theta(x^{(i)})})$

The crossproduct of these two vectors is essentially the same as $\vec{a}^T\vec{b}$; basically the sum of every value of $\vec{a}$ multiplied by the corresponding value of $\vec{b}$. i.e.: (note that not all of the first vector equal zero: $\vec{a_{(i)}}\neq0$, $\vec{a}\in{0,1}$).

$\begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \end{bmatrix}\begin{bmatrix} -0.6931472 \\ -0.6931472 \\ -0.6931472 \\ \vdots \\ -0.6931472 \end{bmatrix} = -0.6931472$

The $\delta$ function can also be speeded up slightly by employing crossprod instead of t(X) %*% h(theta,X).

Ok so now that we have some additional vectorisation, let’s look at plugging it into the ucminf function.

So great, the two are giving the same answer. But it would be interesting to see what the speed increase is like when comparing the non-vectorised, vectorised, and the usual glm method

First let me just check that the glm implementation returns the same parameters:

Perfect. Now to compare the three I’ll use the excellent rbenchmark package.

So even with a relatively small dataset of just 100 rows, we find that a vectorised linear regression solved using an optimisation algorithm is many times quicker than applying a generalised linear model. Kinda makes it all worthwhile!

Next time I’ll look at implementing regularisation to fit more complicated decision boundaries.