I’ve been doing a lot of programming in Python recently, and have taken my eye off the #RStats ball.
With a bit of time to play over the Easter weekend, I’ve been reading Hadley’s new R for Data Science book.
One thing I particularly like so far is the purrr package which he describes in the lists chapter.
I’ve always thought that the sapply(), lapply(), vapply() (etc.) commands are rather complicated.
The purrr package threatens to simplify this using the same left-to-right chaining framework that we have become used to in ggplot2, and more recently dplyr.
Something I find myself doing more and more is subsetting a dataframe by a factor, and applying the same or a similar model to each subset of the data.
There are some new ways to do this in purrr.
In this post I’ll briefly explore some of the functions of purrr, and use them together with dplyr and broom (as much for my own memory as anything else).
do()
In the past I have used dplyr::do() to apply a model like so.
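Something along these lines; a minimal sketch in which the built-in mtcars data and the formula mpg ~ disp are my assumptions (the 4, 6 and 8 cylinder groups point to mtcars, but the exact model is a guess). Note that do() is marked as superseded in recent dplyr releases, though it still works.

```r
library(dplyr)

# Fit one linear model per cylinder group.
# mtcars and the formula mpg ~ disp are illustrative assumptions.
models <- mtcars %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ disp, data = .))

# One row per group, with the fitted model stored in the list column `mod`
models
```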
This results in three models, one each for 4, 6, and 8 cylinders.
We can now use a second call to do(), dplyr::summarise() or dplyr::mutate() to extract elements from these models: for example, to extract the coefficients…
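A sketch of both routes, again assuming mtcars and the hypothetical model mpg ~ disp. The second do() call gives one row per coefficient, while summarise() collapses each model to a single number.

```r
library(dplyr)

# Assumed setup: one model per cylinder group (mtcars, mpg ~ disp)
models <- mtcars %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ disp, data = .))

# A second do() call: one row per coefficient per model
coefs <- models %>% do(data.frame(coef = coef(.$mod)))

# summarise(): one value per model, here R squared
rsq <- models %>% summarise(rsq = summary(mod)$r.squared)

coefs
rsq
```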
We can also use mutate() to extract one or more elements.
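For instance (same assumed mtcars setup; the columns rsq and n are names I made up for illustration), mutate() keeps the model column and adds the extracted values alongside it:

```r
library(dplyr)

# Assumed setup: one model per cylinder group (mtcars, mpg ~ disp)
models <- mtcars %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ disp, data = .))

# do() returns a row-wise data frame, so `mod` refers to a single model here
extracted <- models %>%
  mutate(
    rsq = summary(mod)$r.squared,  # R squared of each model
    n   = nobs(mod)                # number of observations used in the fit
  )

extracted
```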
The broom package
If we want to get a tidier output, we can use the broom package, which provides three levels of aggregation.
glance() gives a single line for each model, similar to the do() and summarise() calls above:
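A minimal sketch, assuming mtcars and mpg ~ disp as before. I wrap glance() inside do() here, which is one way of doing it that works with current broom releases, rather than necessarily the only way.

```r
library(dplyr)
library(broom)

# One row of model-level summaries (r.squared, AIC, ...) per cylinder group
model_summaries <- mtcars %>%
  group_by(cyl) %>%
  do(glance(lm(mpg ~ disp, data = .)))

model_summaries
```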
tidy() gives details of the model coefficients:
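Roughly like so, under the same assumptions (mtcars, mpg ~ disp, and tidy() wrapped in do()):

```r
library(dplyr)
library(broom)

# One row per coefficient per model: term, estimate, std.error, ...
model_coefs <- mtcars %>%
  group_by(cyl) %>%
  do(tidy(lm(mpg ~ disp, data = .)))

model_coefs
```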
augment() returns a row for each data point in the original data, with relevant model outputs.
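A sketch with the same assumed data and model. The fitted values and residuals arrive in columns prefixed with a dot, such as .fitted and .resid.

```r
library(dplyr)
library(broom)

# One row per observation, with fitted values and residuals added
model_points <- mtcars %>%
  group_by(cyl) %>%
  do(augment(lm(mpg ~ disp, data = .)))

model_points
```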
One nice use case of augment() is for plotting fitted models against the data.
In this simple example, we could achieve the same just with geom_smooth(aes(group=cyl), method="lm"); however this would not be so easy with a more complicated model.
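To make that concrete, here is a sketch of the augment() route, assuming mtcars and mpg ~ disp (my choices, for illustration only). The points are the raw data and the lines are the per-group fitted values.

```r
library(dplyr)
library(broom)
library(ggplot2)

# Fitted values for each cylinder group (assumed model: mpg ~ disp)
fitted_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  do(augment(lm(mpg ~ disp, data = .)))

# Raw data as points, model fits as lines, coloured by group
p <- ggplot(fitted_by_cyl, aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_line(aes(y = .fitted))

p
```

Because the lines are drawn from the .fitted column rather than computed by ggplot2, the same plotting code should carry over to models that geom_smooth() cannot fit itself.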
purrr
So what is new about purrr?
Well first off we can do similar things to do() using map():
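A minimal sketch, assuming mtcars and mpg ~ disp again: split() breaks the data frame into a named list, and map() fits a model to each piece.

```r
library(dplyr)  # loaded here for the %>% pipe
library(purrr)

# split() gives a list of data frames named "4", "6" and "8";
# map() then applies the formula-style anonymous function to each one
models <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ disp, data = .x))

models
```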
And we can keep adding map() functions to get the output we want:
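For example (same assumed data and model), passing a plain function and then a formula, which ends with a list of R squared values:

```r
library(dplyr)  # loaded here for the %>% pipe
library(purrr)

rsq_list <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ disp, data = .x)) %>%  # formula: fit a model per group
  map(summary) %>%                      # function: summarise each model
  map(~ .x$r.squared)                   # formula: pull out R squared

rsq_list
```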
Note the three types of input to map(): a function, a formula (converted to an anonymous function), or a string (used to extract named components). [1]
So to use a string this time, returning a double vector…
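A sketch under the same assumptions (mtcars, mpg ~ disp). map_dbl() behaves like map() but returns a double vector instead of a list:

```r
library(dplyr)  # loaded here for the %>% pipe
library(purrr)

rsq <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ disp, data = .x)) %>%
  map(summary) %>%
  map_dbl("r.squared")  # string: extract the component named "r.squared"

rsq
```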
Creating training and test splits
A more complicated example that is a purrrfect use case is: creating splits in a dataset on which a model can be trained and then validated.
Here I shamelessly copy Hadley’s example [1]. Note that you will need the latest dev version of dplyr to run this correctly due to this issue (fixed in the next dplyr release > 0.4.3).
First define a cost function on which to evaluate the models (in this case the mean squared difference, but this could be anything).
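A minimal sketch of such a function. The name msd comes from the text further down; the body is my reading of "mean squared difference" and may differ from the version in the book.

```r
# Mean squared difference between two numeric vectors of equal length
msd <- function(x, y) {
  mean((x - y)^2)
}

msd(c(1, 2, 3), c(1, 2, 5))
```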
And a function to generate $n$ random groups with a given probability.
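One way to write this, as a sketch: the name random_group and its arguments are my own labels, and the approach (deal out group labels in the right proportions, then shuffle them) is just one reasonable choice.

```r
# Assign n items to groups at random, in proportions given by the
# named vector `probs`, e.g. c(training = 0.8, test = 0.2)
random_group <- function(n, probs) {
  probs <- probs / sum(probs)
  # Map n evenly spaced points on [0, 1] onto the cumulative probabilities;
  # all.inside = TRUE keeps the end point 1 inside the last group
  g <- findInterval(seq(0, 1, length.out = n), c(0, cumsum(probs)),
                    all.inside = TRUE)
  # Shuffle the assignments and return the group names
  names(probs)[sample(g)]
}

set.seed(1)
groups <- random_group(32, c(training = 0.8, test = 0.2))
table(groups)
```

The group sizes are fixed by the proportions, and only the assignment of rows to groups is random.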
And wrap this up in a function to replicate it…
Note that this makes use of the new purrr::transpose() function which applies something like a matrix transpose to a list, and when coerced, returns a data_frame containing $n$ random splits of the data.
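Here is a sketch of how the wrapper might look. The names partition and random_group are assumptions of mine, and I use tibble::as_tibble() for the coercion step. transpose() still works, although purrr 1.0 marks it as superseded in favour of list_transpose().

```r
library(purrr)
library(tibble)

# Hypothetical helper: random group labels in the given proportions
random_group <- function(n, probs) {
  probs <- probs / sum(probs)
  g <- findInterval(seq(0, 1, length.out = n), c(0, cumsum(probs)),
                    all.inside = TRUE)
  names(probs)[sample(g)]
}

# Hypothetical wrapper: n random splits of df, one row per split
partition <- function(df, n, probs) {
  # A list of n elements, each a list(test = <df>, training = <df>)
  splits <- replicate(n, split(df, random_group(nrow(df), probs)),
                      simplify = FALSE)
  # transpose() turns it inside out: list(test = <n dfs>, training = <n dfs>),
  # which as_tibble() coerces to a data frame with two list columns
  as_tibble(transpose(splits))
}

set.seed(1)
boot <- partition(mtcars, 5, c(training = 0.8, test = 0.2))
boot
```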
Finally use map() to:
Fit simple linear models to the data as before.
Make predictions based on those models on the test dataset.
Evaluate model performance using the cost function (msd).
This still results in a data frame, but with three new list columns. We need to subset out the columns of interest:
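Putting the three steps and the final subsetting together, as a sketch: the helper names, mtcars, and the formula mpg ~ disp are all my assumptions, and the new columns models, preds and diffs are names I chose for illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical helpers, defined as sketched for the earlier steps
msd <- function(x, y) mean((x - y)^2)

random_group <- function(n, probs) {
  probs <- probs / sum(probs)
  g <- findInterval(seq(0, 1, length.out = n), c(0, cumsum(probs)),
                    all.inside = TRUE)
  names(probs)[sample(g)]
}

partition <- function(df, n, probs) {
  splits <- replicate(n, split(df, random_group(nrow(df), probs)),
                      simplify = FALSE)
  as_tibble(transpose(splits))
}

set.seed(1)
boot <- partition(mtcars, 100, c(training = 0.8, test = 0.2)) %>%
  mutate(
    models = map(training, ~ lm(mpg ~ disp, data = .x)),  # 1. fit
    preds  = map2(models, test, predict),                 # 2. predict on test
    diffs  = map2(preds, map(test, "mpg"), msd)           # 3. evaluate
  )

# Keep only the column of interest and flatten it to a plain double vector
results <- boot %>%
  select(diffs) %>%
  mutate(diffs = unlist(diffs))

summary(results$diffs)
```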
Rounding up
I’ve been playing with some things in this post that I am just getting to grips with, but which look to be some really powerful additions to the hadleyverse, and the R landscape in general.
Keeping an eye on the development of purrr would be a good move, I think.