Character matching provides a powerful way to make lookup tables. There are more concise functions available in packages like dplyr that achieve the same end, but it is useful to understand how the same result can be produced with basic subsetting.
We start off by building an example dataframe.
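A minimal sketch of what such a dataframe might look like. The school names come from the text; the name students, the area codes and the attainment grades are invented purely for illustration.

```r
# Illustrative data only: codes and grades are made up for this sketch
students <- data.frame(
  school = c("Sweet Valley High", "Park View", "Sweet Valley High", "Park View"),
  superoutputarea = c("E01000001", "E01000002", "E01000001", "E01000002"),
  attainment = c(2.8, 1.2, 3.1, 1.9),
  stringsAsFactors = FALSE  # keep the codes as character, not factor
)

# Inspect the structure of the example data
str(students)
```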
If we look at the data, we notice that the variable superoutputarea is a nine-digit code that doesn’t tell a human much.
We are interested in how the area relates to the socio-economic classification of typical people who live in that area or a measure of deprivation of the area.
We must convert this code into a more informative proxy, which can then be used in our machine learning tools later.
What we need is a named vector that contains the necessary translation for each superoutputarea code. We define that here as lookup. Deprivation is measured on a 7-point scale, with Sweet Valley High in a wealthy area and Park View in a deprived area.
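As a rough sketch, lookup could be a named character vector like the one below. The area codes and scores are invented, and the direction of the scale (1 for least deprived, 7 for most deprived) is an assumption rather than something the text states.

```r
# Names are the area codes, values are the deprivation scores.
# Hypothetical values: 1 is assumed to mean least deprived, 7 most deprived.
lookup <- c(E01000001 = "1", E01000002 = "7")

lookup         # the full translation table
names(lookup)  # the codes that can be looked up
```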
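To convert, we simply subset lookup by the superoutputarea codes; each code is matched against the names of lookup and the corresponding value is returned.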
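The conversion step might look like this, assuming an invented vector of codes called areas and an invented lookup:

```r
# Hypothetical lookup and codes, for illustration only
lookup <- c(E01000001 = "1", E01000002 = "7")
areas <- c("E01000001", "E01000002", "E01000001", "E01000002")

# Character subsetting: each element of areas is matched against names(lookup)
converted <- lookup[areas]
converted

# Drop the names if only the values are wanted
unname(converted)
```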
Thus we can use this to create a new variable called depriv.
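A sketch of adding depriv as a column, again with invented data. Storing the score as an integer is a choice made here for convenience, not something the text prescribes.

```r
# Hypothetical data and lookup, for illustration only
students <- data.frame(
  school = c("Sweet Valley High", "Park View", "Sweet Valley High", "Park View"),
  superoutputarea = c("E01000001", "E01000002", "E01000001", "E01000002"),
  stringsAsFactors = FALSE
)
lookup <- c(E01000001 = "1", E01000002 = "7")

# Look up each code, drop the names, and store the score as an integer
students$depriv <- as.integer(unname(lookup[students$superoutputarea]))
students
```

Any code that does not appear among the names of lookup comes back as NA, which makes missing translations easy to spot.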
Great, now we can use this dataframe for machine learning.
What if we have a large dataframe? Are there more concise and faster ready-made functions to use?
Probably, but we won’t explore that here; we simply assume dplyr is fast because it passes much of its work to C++.
Plus, I like dplyr and its nice chaining.
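For comparison, one way the same lookup could be written in dplyr is as a join, with no claim made here about speed. This sketch assumes the dplyr package is installed; lookup_df and the data are invented, and the lookup is held as a two-column dataframe rather than a named vector.

```r
# Assumes the dplyr package is installed; all data are invented
library(dplyr)

students <- data.frame(
  school = c("Sweet Valley High", "Park View"),
  superoutputarea = c("E01000001", "E01000002"),
  stringsAsFactors = FALSE
)
lookup_df <- data.frame(
  superoutputarea = c("E01000001", "E01000002"),
  depriv = c(1L, 7L),
  stringsAsFactors = FALSE
)

# Chain the join onto the data; rows without a match would get NA
students_joined <- students %>%
  left_join(lookup_df, by = "superoutputarea")
students_joined
```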
Sometimes we might have a more complicated lookup table which has multiple columns of information.
Suppose we take our vector of attainment grades and round them to the nearest whole number.
We want to duplicate the rows of the info table so that we have a row for each value in grade. We can do this in two ways: with match() and integer subsetting, or with rownames() and character subsetting.
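Both approaches can be sketched as follows. The attainment values and the contents of info (a descriptor and a fail flag for each grade) are invented for illustration.

```r
# Invented attainment grades, rounded to whole numbers
attainment <- c(2.8, 1.2, 3.1, 1.9)
grade <- round(attainment)

# Hypothetical lookup table with several columns of information
info <- data.frame(
  grade = 3:1,
  desc = c("Excellent", "Good", "Poor"),
  fail = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Way 1: match() returns row positions, used for integer subsetting
id <- match(grade, info$grade)
by_match <- info[id, ]

# Way 2: use the grades as row names, then subset by character
rownames(info) <- info$grade
by_name <- info[as.character(grade), ]

by_match
by_name
```

The two results contain the same values and differ only in their row names.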
We have matched each student's grade with its appropriate descriptor and pass/fail status using a more complicated lookup table.
Conclusion
A named character vector can act as a simple lookup table. We could even read this in from a CSV file. Lookup is simple in R.
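A sketch of building the lookup from CSV data. The column names and values are invented, and the text argument of read.csv() stands in for a real file path so that the example is self-contained.

```r
# Invented CSV content; in practice this would be a file on disk
csv_text <- "superoutputarea,depriv
E01000001,1
E01000002,7"

lookup_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Turn the two columns into a named character vector
lookup <- setNames(as.character(lookup_df$depriv), lookup_df$superoutputarea)
lookup[c("E01000002", "E01000001")]
```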