I’m currently working on the excellent Machine Learning course by Andrew Ng, available on Coursera. I’ve been working through the exercises using R, rather than MATLAB or Octave as required in the course. This is the first programming exercise: implementing linear regression using the gradient descent algorithm rather than the normal equation method.
Gradient descent for a function with one parameter
Rather than calculating the optimal solution for the linear regression in a single step with the normal equation, in this exercise we use gradient descent to find a solution iteratively. To get the concept behind gradient descent, I start by implementing it for a function which takes just one parameter (rather than two, like linear regression).
In this instance I have adapted code from Matt Bogard’s excellent blog Econometric Sense, and will use the same function:
hθ = 1.2(x − 2)² + 3.2

So we can state our objective: minimise J(θ₁) with respect to θ₁, where θ₁ is a real number. Put mathematically: min over θ₁ of J(θ₁), with θ₁ ∈ ℝ.
Cost function
The gradient descent update needs the derivative of the cost function J(θ₁). Differentiating the function above gives d/dθ₁ J(θ₁) = 2.4(x − 2) (see Matt’s blog).
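As a quick sanity check (my addition, not part of the original exercise), the hand-derived derivative can be compared against a central finite-difference estimate of f; for a quadratic the two agree to within floating-point error:

```r
# f and its hand-derived derivative, as used in this post
f <- function(x) 1.2 * (x - 2)^2 + 3.2
cost <- function(x) 2.4 * (x - 2)

# central finite-difference approximation of df/dx
num_deriv <- function(x, h = 1e-6) (f(x + h) - f(x - h)) / (2 * h)

all.equal(num_deriv(0.5), cost(0.5)) # TRUE: both are -3.6
```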
Gradient descent is defined by Andrew Ng as:
repeat until convergence: { θ₁ := θ₁ − α · (d/dθ₁) J(θ₁) }

where α is the learning rate governing the size of the step taken with each iteration.
Here I define a function to plot the results of gradient descent graphically so we can get a sense of what is happening.
library(dplyr)
library(magrittr)
library(ggplot2)

xs <- seq(0, 4, len = 100) # create some values

# define the function we want to minimise
f <- function(x) {
  1.2 * (x - 2)^2 + 3.2
}

# plot the function, marking the minimum (x = 2, f(x) = 3.2) in red
create_plot <- function(title) {
  plot(
    ylim = c(3, 8),
    x = xs,
    y = f(xs),
    type = "l",
    ylab = expression(1.2(x - 2)^2 + 3.2),
    xlab = "x",
    main = title
  )
  abline(
    h = 3.2,
    v = 2,
    col = "red",
    lty = 2 # dashed lines; abline() takes lty, not type
  )
}

# derivative of J(theta), used in the gradient descent update
cost <- function(x) {
  1.2 * 2 * (x - 2)
}
Below is the actual implementation of gradient descent.
# gradient descent implementation
grad <- function(x = 0.1, alpha = 0.6, j = 1000) {
  xtrace <- x
  ftrace <- f(x)
  for (i in 1:j) {
    x <- x - alpha * cost(x) # the update rule
    xtrace <- c(xtrace, x)
    ftrace <- c(ftrace, f(x))
  }
  data.frame(
    "x" = xtrace,
    "f_x" = ftrace
  )
}
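A quick check of the implementation (my addition, repeating the definitions above so the block is self-contained): with the default settings, the final row of the returned data frame should sit at the analytic minimum, x = 2 and f(x) = 3.2.

```r
f <- function(x) 1.2 * (x - 2)^2 + 3.2
cost <- function(x) 1.2 * 2 * (x - 2)

grad <- function(x = 0.1, alpha = 0.6, j = 1000) {
  xtrace <- x
  ftrace <- f(x)
  for (i in 1:j) {
    x <- x - alpha * cost(x)
    xtrace <- c(xtrace, x)
    ftrace <- c(ftrace, f(x))
  }
  data.frame(x = xtrace, f_x = ftrace)
}

res <- grad()
tail(res$x, 1)   # approximately 2, the minimiser
tail(res$f_x, 1) # approximately 3.2, the minimum
```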
Pretty simple! Now I use the plotting function to produce plots, and populate these with points using the gradient descent algorithm.
create_plot(expression(Low~alpha))
with(
  alpha_too_low <- grad(
    x = 0.1,     # initialisation of x
    alpha = 0.1, # learning rate
    j = 100      # iterations
  ),
  points(
    x,
    f_x,
    type = "b",
    col = "green"
  )
)

create_plot(expression(alpha~just~right))
with(
  alpha_just_right <- grad(
    x = 0.1,     # initialisation of x
    alpha = 0.6, # learning rate
    j = 100      # iterations
  ),
  points(
    x,
    f_x,
    type = "b",
    col = "blue"
  )
)

create_plot(expression(High~alpha))
with(
  alpha_too_high <- grad(
    x = 0.1,     # initialisation of x
    alpha = 0.8, # learning rate
    j = 100      # iterations
  ),
  points(
    x,
    f_x,
    type = "b",
    col = "red"
  )
)
Another way to look at the rate of convergence is to plot the value of x at each iteration. Vertical lines show the iteration at which convergence (to four decimal places) occurs. When α is set very low, convergence takes much longer than necessary (although it does occur). When α is too high, convergence doesn’t occur at all within a hundred iterations.
par(mfrow = c(1, 3))

plot(alpha_too_low$x, type = "l", col = "green")
abline(v = (round(alpha_too_low$x, 4) != 2) %>% which %>% length)

plot(alpha_just_right$x, type = "l", col = "blue")
abline(v = (round(alpha_just_right$x, 4) != 2) %>% which %>% length)

plot(alpha_too_high$x, type = "l", col = "red")
abline(v = (round(alpha_too_high$x, 4) != 2) %>% which %>% length)
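The same comparison can be made without plots. Here is a sketch (my addition; I define convergence as |x − 2| < 1e-4, a tolerance chosen purely for illustration) counting the iterations each learning rate needs:

```r
# derivative of the function being minimised, as defined in this post
cost <- function(x) 2.4 * (x - 2)

# iterations needed before x is within tol of the minimiser at x = 2
iters_to_converge <- function(alpha, x = 0.1, j = 1000, tol = 1e-4) {
  for (i in 1:j) {
    x <- x - alpha * cost(x)
    if (abs(x - 2) < tol) return(i)
  }
  NA_integer_ # no convergence within j iterations
}

sapply(c(low = 0.1, right = 0.6, high = 0.8), iters_to_converge)
```

Given enough iterations all three rates do converge here; the point is that both the low and high rates need several times as many steps as α = 0.6.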