This post is based on a paper by Cortez & Morais (2007). Forest fires are a major environmental issue, causing economic and ecological damage while endangering human lives. Fast detection is key to controlling such a phenomenon. To achieve this, one alternative is to use automatic tools based on local sensors, such as microclimate and weather data provided by meteorological stations.
All this data holds valuable information, such as trends and patterns, which can be used to improve decision making. Yet, human experts are limited and may overlook important details. Moreover, classical statistical analysis breaks down when such vast and/or complex data is present. Hence, the alternative is to use automated machine learning tools to analyze the raw data and extract high-level information for the decision-maker.
This is a very difficult regression task. We got the forest fire data from the UCI Machine Learning Repository. Specifically, we want to predict the burned area, or size, of forest fires in the northeast region of Portugal. We demonstrate the solution proposed by Cortez & Morais, which uses only four weather variables (rain, wind, temperature and humidity) in conjunction with a support vector machine (SVM) and is capable of predicting the burned area of small fires, which constitute the majority of fire occurrences.
An SVM uses a nonlinear mapping to transform the original training data into a higher-dimensional space. Within this new space, it searches for the optimal linear separating hyperplane.
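To make the idea of a nonlinear mapping concrete, here is a minimal base R sketch on made-up one-dimensional data. The mapping phi(x) = (x, x^2) is our own illustrative choice; a real SVM applies such mappings implicitly through a kernel rather than computing them explicitly like this.

```r
# Toy 1-D data (made up): class "inner" sits between two clusters of "outer",
# so no single threshold on x can separate the classes
x <- c(-3, -2.5, -2, -0.5, 0, 0.5, 2, 2.5, 3)
class <- ifelse(abs(x) < 1, "inner", "outer")

# Explicit nonlinear mapping into two dimensions: phi(x) = (x, x^2)
phi <- cbind(x1 = x, x2 = x^2)

# In the mapped space the horizontal line x2 = 1 separates the classes
separable <- all((phi[, "x2"] < 1) == (class == "inner"))
separable  # TRUE
```

In the original space the classes are interleaved; after the mapping a straight line does the job, which is the situation the SVM searches for.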
Objective
Perform a support vector regression.
Assess the accuracy of the model.
Demonstrate the workflow for fitting an SVM for classification in R.
The data
Upon glimpse()ing the data, we notice the area of the burn has a lot of zeroes. We investigate further with a histogram. We may want to apply a log(area + 1) transform to the area because of the heavy skew and the many zeroes (very small fires; the paper notes that a zero means less than 100 m² was burned). The variables are fully explained in the original paper.
We transform area into the new response variable y; this would be useful if we wanted to use the SVM for regression.
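The inspection and transform described above might look something like the sketch below. The data frame here is a simulated stand-in with made-up values, not the UCI file; only the column names (temp, RH, wind, rain, area) follow the real data set.

```r
# Simulated stand-in for the forest fires data (hypothetical values);
# with the real file, read.csv("forestfires.csv") would replace this step
set.seed(1)
n <- 200
fires <- data.frame(
  temp = runif(n, 2, 33),
  RH   = runif(n, 15, 100),
  wind = runif(n, 0.4, 9.4),
  rain = rexp(n, rate = 5),
  area = ifelse(runif(n) < 0.5, 0, rexp(n, rate = 0.05))
)

# Share of zero-area records and the raw, heavily skewed distribution
mean(fires$area == 0)
hist(fires$area, breaks = 40, main = "Burned area (ha)")

# log(area + 1) keeps zeroes at zero and compresses the long right tail
fires$y <- log(fires$area + 1)
hist(fires$y, breaks = 40, main = "log(area + 1)")
```

Adding one before taking the log is what lets the zero-area records survive the transform, since log(0) is undefined.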
We start at an advantage, as we know from the findings of the paper which model structure for the SVM was most effective for prediction. Thus we can limit our data preparation to a few variables. The proposed solution, which is based on an SVM and requires only four direct weather inputs (i.e. temperature, rain, relative humidity and wind speed), is capable of predicting small fires, which constitute the majority of the fire occurrences.
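We also need to normalise the continuous variables to control for their different ranges, for example by rescaling each to lie between zero and one. However, the function we use handles scaling for us! (ksvm() has a scaled argument, on by default, that standardises the variables to zero mean and unit variance.) We show what it would look like if we wanted to set this up manually, but we don’t evaluate it.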
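A manual min-max rescaling could be sketched as follows. The normalise() helper is our own hypothetical function, and the data frame is again a simulated stand-in for the four weather inputs.

```r
# Simulated stand-in for the four weather inputs (hypothetical values)
set.seed(1)
n <- 200
fires <- data.frame(
  temp = runif(n, 2, 33),
  RH   = runif(n, 15, 100),
  wind = runif(n, 0.4, 9.4),
  rain = rexp(n, rate = 5)
)

# Hypothetical helper: rescale a numeric vector to the range [0, 1]
normalise <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply it column by column, keeping the result as a data frame
fires_norm <- as.data.frame(lapply(fires, normalise))
summary(fires_norm)
```

Note that this helper would divide by zero for a constant column, so it assumes every input actually varies.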
Classification
SVM is better suited to a classification problem. Let’s pretend we’re interested in the weather conditions that give rise to small fires (arbitrarily set to < 5 hectares), compared to larger fires. Can we classify the type of fire we might expect to see if we send a firefighter out with remote meteorological data? This may help them bring the right gear.
These fires are split unevenly.
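One way to build the class label and check the balance is sketched below. The column name size and the simulated area values are assumptions for illustration; only the 5 hectare cut-off comes from the text.

```r
# Simulated stand-in for the burned area column (hypothetical values)
set.seed(1)
n <- 200
fires <- data.frame(area = ifelse(runif(n) < 0.5, 0, rexp(n, rate = 0.05)))

# Label fires under 5 ha as "small" and everything else as "large"
fires$size <- factor(ifelse(fires$area < 5, "small", "large"),
                     levels = c("small", "large"))

# Counts and proportions per class show how uneven the split is
table(fires$size)
prop.table(table(fires$size))
```

An uneven split matters later on, because a classifier that always predicts the majority class can already look fairly accurate.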
Splitting the data
As usual, we need a training and testing data set to assess how well the model predicts data it hasn’t seen before.
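A simple random split might look like this. The 70/30 proportion and the object names fire_train and fire_test are our assumptions, and the data frame is a simulated stand-in.

```r
# Simulated stand-in for the prepared data (hypothetical values)
set.seed(1)
n <- 200
fires <- data.frame(
  temp = runif(n, 2, 33),
  RH   = runif(n, 15, 100),
  wind = runif(n, 0.4, 9.4),
  rain = rexp(n, rate = 5),
  area = ifelse(runif(n) < 0.5, 0, rexp(n, rate = 0.05))
)

# Draw 70% of the row indices at random for training (assumed proportion)
set.seed(42)
train_idx <- sample(nrow(fires), size = floor(0.7 * nrow(fires)))

fire_train <- fires[train_idx, ]   # rows the model is fitted on
fire_test  <- fires[-train_idx, ]  # held-out rows the model never sees

c(train = nrow(fire_train), test = nrow(fire_test))
```

Setting the seed makes the split reproducible, so the reported accuracy does not change from run to run.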
Method
We use the kernlab package and its ksvm() function to fit an SVM using a non-linear kernel. We can use the argument kernel = "polydot" to set it to polynomial, "rbfdot" for a radial basis, or "tanhdot" for the complicated-sounding hyperbolic tangent sigmoid. Note the huge amount of parameter customisation that is possible at this stage. For simplicity we use the default settings, which will be far from optimal.
Using the simple defaults, the radial basis non-linear mapping for the SVM appears roughly equivalent to the polynomial based on the training error, with the polynomial slightly better. We should evaluate the model performance using the predict() function. To examine how well our classifier performed, we need to compare our predicted size of the fire with the actual size in the test dataset.
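Fitting the two kernels with default settings could be sketched as below. The data are a simulated stand-in with no real weather signal, so the errors it prints say nothing about the actual fires; the object names and the 70/30 split are assumptions.

```r
library(kernlab)

# Simulated stand-in for the training data (hypothetical values)
set.seed(1)
n <- 200
fires <- data.frame(
  temp = runif(n, 2, 33),
  RH   = runif(n, 15, 100),
  wind = runif(n, 0.4, 9.4),
  rain = rexp(n, rate = 5),
  area = ifelse(runif(n) < 0.5, 0, rexp(n, rate = 0.05))
)
fires$size <- factor(ifelse(fires$area < 5, "small", "large"),
                     levels = c("small", "large"))
fire_train <- fires[sample(n, size = floor(0.7 * n)), ]

# A factor response makes ksvm() fit a classifier; only the kernel changes
set.seed(42)
fit_rbf  <- ksvm(size ~ temp + RH + wind + rain, data = fire_train,
                 kernel = "rbfdot")
fit_poly <- ksvm(size ~ temp + RH + wind + rain, data = fire_train,
                 kernel = "polydot")

# Training error: the share of training rows each model misclassifies
err_rbf  <- error(fit_rbf)
err_poly <- error(fit_poly)
c(rbf = err_rbf, poly = err_poly)
```

Worth knowing when comparing the two: the default degree of polydot in kernlab is 1, so without something like kpar = list(degree = 2) the "polynomial" fit is effectively a linear one.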
Test with training data
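Comparing predictions with the actual classes on the held-out rows might look like the sketch below. It again runs on a simulated stand-in data set, so the numbers it produces are not the results reported in this post; the object names are assumptions.

```r
library(kernlab)

# Simulated stand-in data, split into training and test rows (hypothetical)
set.seed(1)
n <- 200
fires <- data.frame(
  temp = runif(n, 2, 33),
  RH   = runif(n, 15, 100),
  wind = runif(n, 0.4, 9.4),
  rain = rexp(n, rate = 5),
  area = ifelse(runif(n) < 0.5, 0, rexp(n, rate = 0.05))
)
fires$size <- factor(ifelse(fires$area < 5, "small", "large"),
                     levels = c("small", "large"))
train_idx  <- sample(n, size = floor(0.7 * n))
fire_train <- fires[train_idx, ]
fire_test  <- fires[-train_idx, ]

fit <- ksvm(size ~ temp + RH + wind + rain, data = fire_train,
            kernel = "rbfdot")

# Predict the class of each held-out fire and cross-tabulate with the truth
pred <- predict(fit, newdata = fire_test)
conf_mat <- table(predicted = pred, actual = fire_test$size)
conf_mat

# Accuracy, next to the share of the majority class as a naive baseline
accuracy <- mean(as.character(pred) == as.character(fire_test$size))
baseline <- max(prop.table(table(fire_test$size)))
c(accuracy = accuracy, baseline = baseline)
```

Because the classes are uneven, accuracy is only informative relative to that baseline: a model that labels every fire "small" would already score the majority-class share.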
Conclusion
A basic introduction to SVM in R showing the workflow. Bear in mind we have some way to go in optimising and validating this model! Changing parameters is likely to improve on the 70% accuracy achieved with the default settings.
References
Cortez, P., & Morais, A. (2007). A Data Mining Approach to Predict Forest Fires using Meteorological Data. New Trends in Artificial Intelligence, 512-523. Retrieved from http://www.dsi.uminho.pt/~pcortez/fires.pdf
Crawley (2004). Statistics: An Introduction Using R.
James et al. (2014). An Introduction to Statistical Learning with Applications in R.