In the past I have worked with a number of timeseries of sensor data that I collected using Raspberry Pis, Arduinos, and ESP8266 modules.
It’s not something I do regularly enough to remember the best way to do it, so I’m writing this post as a reminder to myself, and perhaps someone will benefit from my aide-memoire.
In previous posts I have combined data from two sensors I built, both based on Raspberry Pis (e.g. Measuring obsession).
The first sensor sampled internal and external temperature, internal humidity, and internal light levels at a frequency of once every three minutes.
Another sensor I built recorded my electricity usage every minute by essentially counting the pulses on my electricity meter.
The data are all in the machinegurning GitHub repo, so I’ll access them from there.
In the cleaned state that they are available in, the data consist of some 750,000 observations.
So what are the simple ways that we can visualise the data, first off?
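As a minimal sketch of the simplest possible plot, the block below uses simulated readings rather than the real repo data. The `sensor` data frame and its values are hypothetical stand-ins; only the column names `timestamp` and `int_temp` follow the post.

```r
library(ggplot2)

# Hypothetical stand-in for the sensor data: one simulated temperature
# reading every three minutes over two days (not the real measurements).
set.seed(1)
timestamps <- seq(
  from = as.POSIXct("2015-01-02 00:00:00", tz = "UTC"),
  by = "3 min",
  length.out = 960
)
sensor <- data.frame(
  timestamp = timestamps,
  int_temp = 17 + 2 * sin(2 * pi * seq_along(timestamps) / 480) +
    rnorm(length(timestamps), sd = 0.2)
)

# Mapping a POSIXct column to x is enough for ggplot2 to choose a
# datetime scale for the axis.
p <- ggplot(sensor, aes(x = timestamp, y = int_temp)) +
  geom_line()
p
```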
Great, so ggplot is smart enough to detect that we need time on the x-axis, and it gives us an appropriate scale - good job Hadley!
We can also set the breaks we want…
And these can be times, not just dates - smart.
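Setting the breaks might look something like the sketch below, which again uses simulated data. The break widths and label formats are illustrative choices, not necessarily the ones used for the original plots.

```r
library(ggplot2)

# Hypothetical data: simulated readings every three minutes for two days.
set.seed(1)
sensor <- data.frame(
  timestamp = seq(
    from = as.POSIXct("2015-01-02 00:00:00", tz = "UTC"),
    by = "3 min",
    length.out = 960
  ),
  int_temp = 17 + rnorm(960, sd = 0.5)
)

base_plot <- ggplot(sensor, aes(x = timestamp, y = int_temp)) +
  geom_line()

# Breaks expressed in days, labelled with day and month.
p_days <- base_plot +
  scale_x_datetime(date_breaks = "1 day", date_labels = "%d %b")

# Breaks expressed in hours, labelled with the time of day.
p_hours <- base_plot +
  scale_x_datetime(date_breaks = "6 hours", date_labels = "%H:%M")
p_hours
```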
Date aggregation
OK, so far so good, all very simple.
The fun begins when we start to aggregate this data.
In this case I use tidyr::spread to move this data from long format to wide format.
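The reshaping step might look roughly like this. The three rows are taken from the table in this post, but the long-format column names `key` and `value` are assumptions about how the data were stored.

```r
library(tidyr)

# A tiny long-format sample: one row per reading, with the variable name
# in `key` and the reading in `value` (assumed column names).
long <- data.frame(
  timestamp = as.POSIXct(
    c("2015-01-02 15:30:00", "2015-01-02 15:30:00", "2015-01-02 15:33:00"),
    tz = "UTC"
  ),
  key = c("ext_temp1", "int_temp", "int_light"),
  value = c(7.187, 16.8, 3719),
  stringsAsFactors = FALSE
)

# One column per variable; combinations with no reading become NA.
wide <- spread(long, key = key, value = value)
wide
```

`spread()` still works, but tidyr now recommends `pivot_wider()` for new code.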
Because we started by randomly sampling 100,000 values from a dataset of 750,000, and that dataset was in long format, most rows of the wide table have a reading for only one or two variables and NA everywhere else:
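| yday | timestamp           | week | elec | ext_temp1 | int_humidity | int_light | int_temp |
|------|---------------------|------|------|-----------|--------------|-----------|----------|
| 2    | 2015-01-02 15:18:00 | 1    | NA   | NA        | NA           | 2069.0    | NA       |
| 2    | 2015-01-02 15:30:00 | 1    | NA   | 7.187     | NA           | NA        | 16.8     |
| 2    | 2015-01-02 15:33:00 | 1    | NA   | NA        | NA           | 3719.0    | NA       |
| 2    | 2015-01-02 15:39:00 | 1    | NA   | NA        | NA           | 5384.0    | NA       |
| 2    | 2015-01-02 15:45:00 | 1    | NA   | 7.000     | NA           | NA        | NA       |
| 2    | 2015-01-02 16:00:00 | 1    | NA   | NA        | NA           | 1230.5    | NA       |
| 2    | 2015-01-02 16:12:00 | 1    | NA   | 6.687     | NA           | NA        | NA       |
| 2    | 2015-01-02 16:15:00 | 1    | NA   | NA        | 48.75        | NA        | NA       |
| 2    | 2015-01-02 16:18:00 | 1    | NA   | NA        | 49.05        | NA        | NA       |
| 2    | 2015-01-02 16:27:00 | 1    | NA   | NA        | NA           | NA        | 17.3     |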
Just looking at these rows, we can see that there are often multiple observations per minute.
Two problems I often have are:
- how to aggregate to the nearest unit of time, and
- how to aggregate across a unit of time
This is the difference between aggregating to every five minutes of every day, and aggregating to every five minutes across all days.
The former is easy, and can be achieved with lubridate::ceiling_date and lubridate::floor_date.
Ceiling rounds up, whilst floor rounds down, and we can choose any time period of interest:
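A short sketch of both functions, using three timestamps from the table above. The units chosen here (five minutes and one hour) are just examples.

```r
library(lubridate)

ts <- ymd_hms(
  c("2015-01-02 15:18:00", "2015-01-02 15:33:00", "2015-01-02 16:27:00"),
  tz = "UTC"
)

# Round down to the start of the five-minute interval containing each time.
floor_5min <- floor_date(ts, unit = "5 minutes")

# Round up to the next full hour.
ceiling_hour <- ceiling_date(ts, unit = "hour")

data.frame(ts, floor_5min, ceiling_hour)
```

Grouping on the rounded column and taking a mean then gives one value per interval per day.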
…you get the idea.
But if I want to plot the average temperature at five-minute intervals for each month, rounding the timestamps like this is not enough.
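To make the problem concrete, here is a sketch of the kind of attempt that falls short. It runs on simulated data covering January and February 2015; the `sensor` data frame is a hypothetical stand-in, and the exact code originally used may have differed.

```r
library(dplyr)
library(lubridate)

# Hypothetical data: a simulated reading every three minutes for two months.
set.seed(1)
timestamps <- seq(
  from = ymd_hms("2015-01-01 00:00:00", tz = "UTC"),
  to = ymd_hms("2015-02-28 23:57:00", tz = "UTC"),
  by = "3 min"
)
sensor <- data.frame(
  timestamp = timestamps,
  int_temp = 17 + rnorm(length(timestamps), sd = 0.5)
)

# Group by month and by the timestamp rounded down to five minutes.
naive <- sensor %>%
  mutate(
    month = month(timestamp),
    five_min = floor_date(timestamp, "5 minutes")
  ) %>%
  group_by(month, five_min) %>%
  summarise(int_temp = mean(int_temp), .groups = "drop")

# A day has only 288 five-minute slots, but the rounded timestamps still
# carry their dates, so there is one group per slot per day.
nrow(naive)
```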
This doesn’t give us what we want because there is still date information wrapped up within the timestamp, so we only get a timeseries of each value from each month.
Getting what we want is a little trickier, and there may well be a better way that I have not yet discovered, but this is what I have been doing so far.
First we need to extract the time from the timestamp without date information.
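In base R this step can be sketched with `format()`. The two timestamps below are made-up examples that share a time of day but fall on different dates.

```r
ts <- as.POSIXct(
  c("2015-01-02 15:18:00", "2015-02-10 15:18:00"),
  tz = "UTC"
)

# Keep only the time of day; the date part is dropped.
time_of_day <- format(ts, "%H:%M:%S")

time_of_day
class(time_of_day)
```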
The downside here is that format returns the time as a character vector, so we will not be able to rely on ggplot2 to cleverly adjust the axes.
To fix this, we can turn these times back into timestamps, but this time with all the same date.
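One way to do this is to paste a single placeholder date in front of each time and parse the result. The date `2015-01-01` below is an arbitrary assumption; any fixed date works, because only the time of day is plotted.

```r
ts <- as.POSIXct(
  c("2015-01-02 15:18:00", "2015-02-10 16:27:00"),
  tz = "UTC"
)
time_of_day <- format(ts, "%H:%M:%S")

# Attach the same arbitrary date to every time, then convert back to POSIXct
# so that ggplot2 can treat the column as a datetime again.
same_day <- as.POSIXct(paste("2015-01-01", time_of_day), tz = "UTC")
same_day
```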
Now we can get the plot we are after:
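Putting the steps together, a sketch of the whole pipeline could look like this. It runs on simulated data for January and February 2015; the `sensor` data frame, the placeholder date, and the plot styling are all assumptions made for illustration.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Hypothetical data: a simulated daily temperature cycle, sampled every
# three minutes for two months.
set.seed(1)
timestamps <- seq(
  from = ymd_hms("2015-01-01 00:00:00", tz = "UTC"),
  to = ymd_hms("2015-02-28 23:57:00", tz = "UTC"),
  by = "3 min"
)
sensor <- data.frame(
  timestamp = timestamps,
  int_temp = 17 + 2 * sin(2 * pi * hour(timestamps) / 24) +
    rnorm(length(timestamps), sd = 0.3)
)

daily_profile <- sensor %>%
  mutate(
    month = month(timestamp),
    # Round down to five minutes, then keep only the time of day by moving
    # every reading onto the same placeholder date.
    five_min = floor_date(timestamp, "5 minutes"),
    time_of_day = as.POSIXct(
      paste("2015-01-01", format(five_min, "%H:%M:%S")),
      tz = "UTC"
    )
  ) %>%
  group_by(month, time_of_day) %>%
  summarise(int_temp = mean(int_temp), .groups = "drop")

# One line per month, with a proper datetime axis showing only times.
p <- ggplot(daily_profile, aes(time_of_day, int_temp, colour = factor(month))) +
  geom_line() +
  scale_x_datetime(date_labels = "%H:%M") +
  labs(x = "time of day", colour = "month")
p
```

Each month now contributes exactly one value per five-minute slot, so the summary has 288 rows per month.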
If anyone knows a better way of doing this, I would love to know, but this works for now.