
Genetic Programming (GP) is a collection of techniques from evolutionary computing (EC) for the automatic generation of computer programs that perform a user-defined task. Starting with a high-level problem definition, GP creates a population of random programs that are progressively refined through variation and selection until a satisfactory solution is found. A lack of foresight does not prevent this blind watchmaker from arriving at a sensible design solution for modelling the data as demonstrated in this post.

My attention was first drawn to this approach through the MarIO video, where a program evolves through selection to get better and better at playing a computer game. A great advantage of GP is that the user need not specify (or know) much about the form of the eventual solution before starting.

I studied Zoology to doctorate level, so I spotted parallels between some of these processes and the thinking tools and models of natural selection and evolution. That alone makes this inherently interesting and worth a blog post.

There are several packages that provide genetic programming tools for R. We will investigate a few of them and look at the scenarios in which they are useful compared with more typical machine learning methods. Essentially, GP is an evolutionary search heuristic for arbitrary symbolic expressions, i.e. mathematical or logical formulas.

library(tidyverse)
library(ggthemes)
library(rgp)
 
# vignette("rgp_introduction")

Modelling the orbit of a celestial body

I’ve picked a topic I don’t know too much about, so that the GP approach can compensate for my lack of expert domain knowledge. This scenario is receiving a lot of interest, as there is a shortage of data scientists; it would be nice if feature extraction based on expert knowledge did not limit the training and accuracy of our models (data smashing is a relevant concept here; or consider the Higgs boson problem). We investigate a problem tackled by Kepler in the 17th century (note: that’s why Pluto is included): the relationship between the planets’ distance from the Sun and their orbital periods (see here for an alternative take on this problem).

planets <- read.table(header = TRUE, text = "
planet    dist   period
Mercury   57.9    87.98
Venus    108.2   224.70
Earth    149.6   365.26
Mars     228.0   686.98
Ceres    413.8  1680.50
Jupiter  778.3  4332.00
Saturn  1427.0 10761.00
Uranus  2869.0 30685.00
Neptune 4498.0 60191.00
Pluto   5900.0 90742.00
")

With the data defined we can now set about looking for a relationship by dedicating some computing time. Prior to analysis we convert everything into Earth-relative units, the Earth being our relative centre of the universe: distance becomes astronomical units (AU) and orbital period becomes Earth years.

planets <- dplyr::mutate(
  planets, dist_au = dist / 149.6,
  years = period / 365.26
) %>%
  tibble::as_tibble()

As the planets move further from the Sun, their orbital period seems to increase faster than linearly. However, I’m not really sure whether the orbits are circular, elliptical or somewhere in-between, so I wouldn’t like to guess at the exact form. We can look at the data to confirm this suspicion.

dplyr::glimpse(planets)
## Observations: 10
## Variables: 5
## $ planet  <fctr> Mercury, Venus, Earth, Mars, Ceres, Jupiter, Saturn, ...
## $ dist    <dbl> 57.9, 108.2, 149.6, 228.0, 413.8, 778.3, 1427.0, 2869....
## $ period  <dbl> 87.98, 224.70, 365.26, 686.98, 1680.50, 4332.00, 10761...
## $ dist_au <dbl> 0.3870321, 0.7232620, 1.0000000, 1.5240642, 2.7660428,...
## $ years   <dbl> 0.2408695, 0.6151782, 1.0000000, 1.8807972, 4.6008323,...

Or plot it.

#devtools::install_github("UKGov-Data-Science/govstyle")
library(govstyle)
 
ggplot(planets, aes(years, dist_au)) +
  geom_point(color = "grey") + 
  geom_text(aes(label = planet), color = "yellow") +  # colour set outside aes() so it is applied literally
  xlab("Orbital period (years)") +
  ylab("Distance from the Sun (AU)") +
  theme_gov()

plot of chunk 2016-11-14_fancy_scatter

Regression Run

This section shows how to use the symbolicRegression function to solve symbolic modelling and regression tasks with minimal configuration work. The theme is the discovery of a mathematical formula describing the behaviour of a physical system based on measurement data, i.e. symbolic regression.

Symbolic regression

Symbolic regression is a type of regression analysis that searches the space of mathematical expressions to find the model that best fits a given dataset, both in terms of accuracy and simplicity. No particular model is provided as a starting point to the algorithm. Instead, initial expressions are formed by randomly combining mathematical building blocks such as mathematical operators, analytic functions, constants, and state variables. New equations are then formed by recombining previous equations, using genetic programming.
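To make the “building blocks” idea concrete, here is a minimal sketch of how the set of candidate operators could be restricted in rgp. It assumes the functionSet() constructor and the functionSet argument to symbolicRegression() (check ?symbolicRegression for the exact interface); in the runs below we simply rely on rgp’s defaults.

# a minimal sketch, assuming symbolicRegression() accepts a custom functionSet;
# a small set of building blocks keeps the search space and the evolved formulas simple
building_blocks <- rgp::functionSet("+", "-", "*", "/", "sqrt", "log")
 
model_set_custom <- rgp::symbolicRegression(
  dist_au ~ years,
  data = planets,
  functionSet = building_blocks,
  stopCondition = makeTimeStopCondition(30)  # a shorter, 30-second budget
)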

We are ready to start a symbolic regression run. We choose a time budget of one minute:

set.seed(255)
 
model_set1 <- rgp::symbolicRegression(
  dist_au ~ years,
  data = planets,
  stopCondition = makeTimeStopCondition(1 * 60)
)

Result Analysis

Selection and plotting of the model with best fitness can be performed as follows:

best_model1 <- model_set1$population[[which.min(model_set1$fitnessValues)]]
 
hyp_orbit_years <- 1:250
 
plot(
  x = hyp_orbit_years, 
  y = best_model1(hyp_orbit_years),  #  we predict distance on 250 hypothetical planets
  type = "l", lty = 1,
  xlab = "Orbital period (years)",
  ylab = "Distance from the Sun (AU)",
  ylim = c(0, 40), xlim = c(0, 250)
)
 
points(x = planets$years, y = planets$dist_au, col = "red")

plot of chunk 2016-11-14_m1_scatter

Hmm, not too bad; let’s give it a bit more time! Most of the data we’re training on is near the lower end of the scale, so we might expect the best-fit line to wiggle towards those points given more time, perhaps at the expense of prediction accuracy in the regions of space with a lower density of celestial objects (the gas giants hoover everything up?).

model_set2 <- rgp::symbolicRegression(
  dist_au ~ years,
  data = planets,
  stopCondition = makeTimeStopCondition(3 * 60)
)
 
best_model2 <- model_set2$population[[which.min(model_set2$fitnessValues)]]
 
plot(
  x = hyp_orbit_years, 
  y = best_model2(hyp_orbit_years),
  xlab = "Orbital period (years)",
  ylab = "Distance from the Sun (AU)",
  type = "l", lty = 1,
  ylim = c(0, 40), xlim = c(0, 250)
)
 
points(x = planets$years, y = planets$dist_au, col = "red")

plot of chunk 2016-11-14_m2_scatter
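
Rather than judging the two fits by eye alone, we can compute the root mean squared error of each evolved model on the training data. A quick sketch (with only ten observations the numbers are indicative at best):

# RMSE of each evolved model on the observed planets
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
 
rmse(planets$dist_au, best_model1(planets$years))
rmse(planets$dist_au, best_model2(planets$years))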

While sat twiddling your thumbs waiting for this to compute, consider Kepler, who said this regarding the lengthy arguments within the epic tome that helped spark the Enlightenment:

If thou art bored with this wearisome method of calculation, take pity on me, who had to go through with at least seventy repetitions of it, at a very great loss of time.

Not much better? Is it the lack of data or do we need to tune our parameters?

Sequential Parameter Optimization for Genetic Programming

Finding good algorithm parameter settings for Genetic Programming applications is a complex task. Sequential parameter optimization (SPO) provides a framework for applying modern statistical methods to solve this task. Fortunately a package exists to help us! We provide the same starting conditions as in our original experiment to allow comparison.

library(SPOT)
 
set.seed(255)
result1 <- symbolicRegression(
  dist_au ~ years,
  data = planets,
  stopCondition = makeTimeStopCondition(1 * 60)
)
 
bestFitness <- min(sapply(result1$population, result1$fitnessFunction))
fit_fun <- result1$population[[which.min(result1$fitnessValues)]]
 
plot(
  x = hyp_orbit_years, 
  y = fit_fun(hyp_orbit_years),
  type = "l", lty = 1,
  xlab = "Orbital period (years)",
  ylab = "Distance from the Sun (AU)",
  ylim = c(0, 40), xlim = c(0, 250)
)
 
points(x = planets$years, y = planets$dist_au, col = "red")

plot of chunk 2016-11-14_m3_scatter

We don’t even have to specify anything: we just load the package and it does the tuning for us. I ran all this code several times for comparison by adjusting the set.seed at the start of this post. You’ll notice that by adjusting this you get different solutions, or best-fitted models. This can be considered analogous to a phenomenon in Evolutionary Biology.

The great evolutionist Stephen Jay Gould often employed the metaphor “replaying life’s tape” to emphasize the preeminent role of contingency in the evolutionary process. In Gould’s view, the outcome of this Gedanken experiment would have been dramatically different from the actually observed course of events because evolution is essentially a stochastic phenomenon whereby trajectories that start infinitely close to each other soon diverge because the divergence is exponential. The actual evolutionary trajectory is fundamentally unpredictable because the selection of the fittest could occur along a great number of forking paths. Here a different seed provides a different starting point.
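To replay the tape ourselves, we can rerun the search from a few different seeds and inspect the best individual from each run. A minimal sketch, using a short 20-second budget per seed to keep things quick:

# rerun the evolutionary search from several starting points ("replaying the tape")
for (seed in c(1, 42, 255)) {
  set.seed(seed)
  run <- rgp::symbolicRegression(
    dist_au ~ years,
    data = planets,
    stopCondition = makeTimeStopCondition(20)
  )
  print(run$population[[which.min(run$fitnessValues)]])  # best evolved formula for this seed
}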

But does it provide any more information? Do we get the understanding of the relationship that Kepler was after?

In this simple example we see that we have produced identical solutions from the same starting position, even with the SPOT package loaded to tune our parameters automatically. Perhaps with a different example we would have seen a greater impact.

best_model1
## function (years) 
## sqrt(years)
## <environment: 0x1384bee8>
# using SPOT
fit_fun
## function (years) 
## sqrt(years)
## <environment: 0x1384bee8>

We have a function that takes the square root of the input. So the distance from the Sun (in AU) is roughly proportional to the square root of the orbital period (in years) of the celestial body of interest. We can compare this to the modern solution and other historical approximations of the truth here. This demonstrates how GP could be useful for feature selection, despite being quite slow, as it can help non-experts determine sensible relationships or transformations (other options exist, e.g. the Box-Cox transformation). Related functionality is provided in the caret package: caret::gafs conducts a supervised binary search of the predictor space using a genetic algorithm.
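
Our one-predictor planets data is too small to showcase caret::gafs, but a hedged sketch on a stand-in dataset with more candidate predictors (mtcars here, purely for illustration) would look roughly like this; see ?gafs for the definitive interface.

library(caret)
 
# genetic-algorithm feature selection on a stand-in dataset, purely illustrative;
# rfGA uses a random forest to score each candidate subset and "cv" cross-validates it
set.seed(255)
ga_ctrl <- gafsControl(functions = rfGA, method = "cv", number = 5)
 
ga_search <- gafs(
  x = mtcars[, -1],  # candidate predictors
  y = mtcars$mpg,    # outcome
  iters = 5,         # a tiny number of generations, just to keep the sketch quick
  gafsControl = ga_ctrl
)
 
ga_search$optVariables  # predictors retained by the genetic search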

Altogether a good reminder that:

nanos gigantum humeris insidentes (dwarfs standing on the shoulders of giants)

The “third law”

Kepler discovered his “third law” a decade after the publication of the Astronomia nova, as a result of his investigations in the 1619 Harmonices Mundi. He found that the ratio of the cube of the length of the semi-major axis of each planet’s orbit to the square of its orbital period is the same for all planets.

planets <- planets %>%
  mutate(
    third_law = (dist_au^3) / (years^2)
  )
 
round(planets$third_law, digits = 3)
##  [1] 0.999 1.000 1.000 1.001 1.000 1.001 1.000 0.999 1.001 0.994

We rearrange the formula and create a function using Kepler’s knowledge.

kepler <- function(years) {
  (years)^(2/3)
}
 
plot(
  x = hyp_orbit_years, 
  y = kepler(hyp_orbit_years),
  type = "l", lty = 1,
  xlab = "Orbital period (years)",
  ylab = "Distance from the Sun (AU)",
  ylim = c(0, 40), xlim = c(0, 250)
)
 
points(x = planets$years, y = planets$dist_au, col = "red")

plot of chunk 2016-11-14_kepler_scatter

This is an improvement, demonstrating the value of expert domain knowledge provided by careful observation and experimentation to inform model building. Kepler built upon Copernicus’ heliocentric theory by explaining how the planets’ speeds varied using elliptical orbits. A consequence of this was the third law.

The square of the orbital period of a planet is directly proportional to the cube of the semi-major axis of its orbit. This captures the relationship between the distance of planets from the Sun, and their orbital periods.
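
As a final sanity check, we can compare the evolved sqrt(years) model against Kepler’s years^(2/3) on the observed planets. A quick sketch (the evolved model, and hence the exact numbers, will vary with the seed):

# root mean squared error of each candidate formula on the observed planets
sqrt(mean((planets$dist_au - best_model1(planets$years))^2))  # evolved: sqrt(years)
sqrt(mean((planets$dist_au - kepler(planets$years))^2))       # Kepler:  years^(2/3)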

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] SPOT_1.1.0      govstyle_0.1.0  rgp_0.4-1       ggthemes_3.2.0 
##  [5] readr_1.0.0     tidyverse_1.0.0 tidyr_0.6.0     ggplot2_2.1.0  
##  [9] dplyr_0.5.0     magrittr_1.5    tibble_1.2      purrr_0.2.2    
## [13] rmd2md_0.1.2   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7         emoa_0.5-0          formatR_1.4        
##  [4] git2r_0.15.0        plyr_1.8.4          class_7.3-14       
##  [7] tools_3.3.2         rpart_4.1-10        digest_0.6.10      
## [10] evaluate_0.9        memoise_1.0.0       gtable_0.2.0       
## [13] DBI_0.5-1           curl_2.2            withr_1.0.2        
## [16] stringr_1.1.0       httr_1.2.1          knitr_1.14         
## [19] devtools_1.12.0     grid_3.3.2          R6_2.2.0           
## [22] rsm_2.7-4           mco_1.0-15.1        scales_0.4.0       
## [25] codetools_0.2-15    MASS_7.3-45         randomForest_4.6-12
## [28] assertthat_0.1      checkpoint_0.3.16   colorspace_1.2-7   
## [31] labeling_0.3        stringi_1.1.2       lazyeval_0.2.0     
## [34] munsell_0.4.3       AlgDesign_1.1-7.3