I’m currently working through the excellent Machine Learning course by Andrew Ng, available on Coursera. I’ve been working through the exercises using R, rather than MATLAB or Octave as required in the course. This is the first programming exercise: implementing linear regression using the gradient descent algorithm rather than the normal equation method.
Gradient descent for a function with one parameter
Rather than calculating the optimal solution for the linear regression in a single step, in this exercise we use gradient descent to find a solution iteratively. To get a feel for the concept behind gradient descent, I start by implementing it for a function which takes just one parameter (rather than two, as in linear regression).
In this instance I have adapted code from Matt Bogard’s excellent blog Econometric Sense, and will use the same function:
\[h_{\theta}(x)=1.2(x-2)^2 + 3.2\]

So we can state our objective as minimising $J(\theta_1)$ with respect to $\theta_1$, where $\theta_1$ is a real number, or put mathematically: $\min\limits_{\theta_1}J(\theta_1)$, $\theta_1\in\mathbb{R}$.
Cost function
The derivative of the cost function $J(\theta_1)$, obtained using calculus, is $\frac{d}{d\theta_1}J(\theta_1)=2.4(x-2)$ (see Matt’s blog for the derivation). This derivative is the gradient that drives each step of the descent.
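To make this concrete, the function and its derivative can be written directly in R. This is a minimal sketch; the names `cost` and `cost_deriv` are just illustrative choices.

```r
# The function we want to minimise: 1.2(x - 2)^2 + 3.2
cost <- function(x) {
  1.2 * (x - 2)^2 + 3.2
}

# Its derivative: 2.4(x - 2), the gradient used in each update
cost_deriv <- function(x) {
  2.4 * (x - 2)
}
```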
Gradient descent is defined by Andrew Ng as:
\[\begin{multline} \text{repeat until convergence} \{\\ \theta_1:=\theta_1 - \alpha\frac{d}{d\theta_1}J(\theta_1)\\ \} \end{multline}\]

where $\alpha$ is the learning rate governing the size of the step taken with each iteration.
Here I define a function to plot the results of gradient descent, so we can get a sense of what is happening.
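A rough sketch of such a plotting function is shown below, assuming base R graphics and the `cost` function defined earlier; the name `plot_cost` and its arguments are illustrative.

```r
# Draw the curve 1.2(x - 2)^2 + 3.2 over a range of x values,
# ready for the points visited by gradient descent to be overlaid.
plot_cost <- function(from = -5, to = 10, title = "") {
  x <- seq(from, to, length.out = 200)
  plot(x, cost(x), type = "l",
       xlab = expression(theta[1]), ylab = expression(J(theta[1])),
       main = title)
}
```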
Below is the actual implementation of gradient descent.
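One way this can be written in R is sketched here: starting from an initial value, keep stepping in the direction of the negative gradient until the change between iterations drops below a tolerance, or a maximum number of iterations is reached. The function name and defaults are illustrative.

```r
# Gradient descent for the single-parameter function above.
# Returns the sequence of x values visited, stopping early once the
# change between successive values falls below tol.
grad_descent <- function(x_init = 0.1, alpha = 0.1,
                         tol = 1e-6, max_iter = 100) {
  x <- numeric(max_iter)
  x[1] <- x_init
  for (i in 2:max_iter) {
    x[i] <- x[i - 1] - alpha * cost_deriv(x[i - 1])
    if (abs(x[i] - x[i - 1]) < tol) {
      return(x[1:i])  # converged: return the path so far
    }
  }
  x  # not converged within max_iter iterations; return the full path
}
```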
Pretty simple! Now I use the plotting function to produce plots, and populate them with the points visited by the gradient descent algorithm.
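For example, something like the following overlays the path of the algorithm on the cost curve for a few learning rates; the starting value and the values of $\alpha$ here are just illustrative.

```r
# Run gradient descent with a few learning rates and overlay the
# visited points on the cost curve.
par(mfrow = c(1, 3))
for (a in c(0.01, 0.1, 0.8)) {
  path <- grad_descent(x_init = 8, alpha = a)
  plot_cost(title = bquote(alpha == .(a)))
  points(path, cost(path), col = "red", pch = 19)
}
par(mfrow = c(1, 1))
```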
Another way to look at the rate of convergence is to plot the output of $f(x)$ against the number of iterations. Vertical lines show when convergence occurs. When $\alpha$ is set very low, it takes much longer than necessary (although it does converge). When $\alpha$ is too high, convergence doesn’t occur at all within a hundred iterations.
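A sketch of that comparison, again with illustrative learning rates and starting value:

```r
# Plot the function value at each iteration for several learning rates,
# with a dotted vertical line at the final iteration of each run.
alphas <- c(0.01, 0.1, 0.9)
paths  <- lapply(alphas, function(a) grad_descent(x_init = 8, alpha = a))

plot(NULL, xlim = c(1, 100), ylim = c(3, 50),
     xlab = "iteration", ylab = "f(x)")
for (i in seq_along(paths)) {
  lines(cost(paths[[i]]), col = i)
  abline(v = length(paths[[i]]), col = i, lty = 3)
}
legend("topright", legend = paste("alpha =", alphas),
       col = seq_along(alphas), lty = 1)
```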