Overview
Introduction
Classification problems occur often, perhaps even more so than regression problems. Consider the Cortez student maths attainment data discussed in previous posts. The response variable, final grade of the year (range 0-20), G3, can be classified into a binary pass or fail variable called final, based on a threshold mark. We previously modelled these data with a decision tree, which gave 95% accuracy and had the benefit of interpretability. We will now use logistic regression so that we can attach probabilities to our student pass or fail predictions.
Make the final grade binary (pass and fail)
G3 is roughly normally distributed, despite the dodgy tail. To simplify matters we convert G3 marks below 10 to a fail and marks of 10 or above to a pass. Often a school is judged by whether students meet a critical boundary; in the UK it is a C grade at GCSE, for example. Rather than modelling this response Y directly, logistic regression models the probability that Y belongs to a particular category.
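The recoding described above can be sketched in a couple of lines. The marks below are made-up values standing in for the real G3 column, not the Cortez data.

```r
# Hypothetical G3 marks (range 0-20) standing in for the real column
G3 <- c(0, 5, 9, 10, 11, 15, 20)

# Marks below 10 become "fail"; marks of 10 or above become "pass".
# Putting "fail" first makes "pass" the outcome that glm() will model.
final <- factor(ifelse(G3 >= 10, "pass", "fail"), levels = c("fail", "pass"))

table(final)
```

Fixing the level order explicitly avoids relying on alphabetical ordering to decide which outcome counts as the "success".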
Drawing on the decision tree results, we can include the variables that were shown to be important predictors in this multiple logistic regression.
Objective
- Using the training data estimate the regression coefficients using maximum likelihood.
- Use these coefficients to predict the test data and compare with reality.
- Evaluate the binary classifier with a receiver operating characteristic (ROC) curve.
- Evaluate the logistic regression performance with the resampling method cross-validation.
Training and test datasets.
We need to split the data so we can build the model and then test it, to see if it generalises well. The data arrived in a random order.
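A minimal sketch of such a split follows. The data frame is simulated, and holding out the last 45 rows is an assumption chosen to match the 45 test students mentioned later; the original split point is not shown.

```r
set.seed(42)

# Simulated stand-in for the student data (395 rows is an assumption)
d <- data.frame(id = seq_len(395),
                G2 = sample(0:20, 395, replace = TRUE))

# The rows are assumed to be in random order, so a positional split is enough.
# If they were not, we could shuffle first with d[sample(nrow(d)), ].
test_idx <- (nrow(d) - 44):nrow(d)   # last 45 rows held out
train <- d[-test_idx, ]
test  <- d[test_idx, ]

c(train = nrow(train), test = nrow(test))
```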
Now we need to train the model using the training data. From our decision tree we know that the prior attainment variables G1 and G2 are important, as are the Fjob and reason variables. We fit a logistic regression model to predict final using these four variables.
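The fit might look something like the sketch below. The training data are simulated (the factor levels and the way final depends on G2 are assumptions made purely for illustration), and m1 is a hypothetical object name.

```r
set.seed(1)
n <- 300

# Simulated stand-in for the training data; not the real Cortez values
train <- data.frame(
  G1 = sample(0:20, n, replace = TRUE),
  G2 = sample(0:20, n, replace = TRUE),
  Fjob = factor(sample(c("at_home", "health", "other", "services", "teacher"),
                       n, replace = TRUE)),
  reason = factor(sample(c("course", "home", "other", "reputation"),
                         n, replace = TRUE))
)
# Pass probability rises with G2 (an arbitrary choice for the simulation)
train$final <- factor(ifelse(runif(n) < plogis(0.6 * (train$G2 - 10)),
                             "pass", "fail"),
                      levels = c("fail", "pass"))

# Multiple logistic regression: binomial family with the default logit link,
# coefficients estimated by maximum likelihood
m1 <- glm(final ~ G1 + G2 + Fjob + reason, family = binomial, data = train)
summary(m1)
```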
The model does appear to suffer from overdispersion. The p-values associated with reason are all non-significant. Following Crawley's recommendation, we change the model family argument to family = quasibinomial and then attempt model simplification by removing this term from the model.
We use the more conservative “F-test” to compare models due to the quasibinomial error distribution, after Crawley.
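A sketch of this comparison, again on simulated data with hypothetical object names (m_full, m_red, cmp), might be:

```r
set.seed(2)
n <- 300

# Simulated stand-in for the training data; not the real Cortez values
train <- data.frame(
  G1 = sample(0:20, n, replace = TRUE),
  G2 = sample(0:20, n, replace = TRUE),
  Fjob = factor(sample(c("at_home", "health", "other", "services", "teacher"),
                       n, replace = TRUE)),
  reason = factor(sample(c("course", "home", "other", "reputation"),
                         n, replace = TRUE))
)
train$final <- factor(ifelse(runif(n) < plogis(0.6 * (train$G2 - 10)),
                             "pass", "fail"),
                      levels = c("fail", "pass"))

# Refit with quasibinomial errors so the dispersion parameter is estimated
m_full <- glm(final ~ G1 + G2 + Fjob + reason,
              family = quasibinomial, data = train)

# Drop reason and compare the nested models with an F-test
m_red <- update(m_full, . ~ . - reason)
cmp <- anova(m_red, m_full, test = "F")
cmp
```

A large p-value in the comparison table would indicate that dropping reason costs no meaningful explanatory power.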
There is no difference in explanatory power between the models, so there is no evidence that reason is associated with a student's pass or fail in their end of year maths exam. We continue model simplification after inspecting summary() (not shown).
We don’t need the earlier G1 exam result as we have G2 in the model already. What happens if we remove Fjob?
We lose explanatory power, so we need to keep Fjob in the model. This gives us our minimal adequate model. Fjob is a useful predictor, but perhaps we could reduce the number of levels by recoding the variable, as only some of the jobs seem useful as predictors.
Contrasts
For a better understanding of how R dealt with the categorical variables, we can use the contrasts() function. This function shows how the variables have been dummy coded by R and how to interpret them in a model. Note that by default R orders factor levels alphabetically and treats the first level as the reference.
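As a small illustration, here is what contrasts() returns for a factor with the five job categories assumed for Fjob (the level names are an assumption about the data set's coding):

```r
# Hypothetical Fjob values, deliberately given out of alphabetical order
Fjob <- factor(c("teacher", "other", "services", "health", "at_home"))

# Levels are sorted alphabetically, so at_home becomes the reference level
levels(Fjob)

# Treatment contrasts: one 0/1 dummy column for each non-reference level;
# the reference level is the row of all zeros
contrasts(Fjob)
```

Each Fjob coefficient in the model is therefore a comparison against the reference level. relevel() can be used to choose a different reference.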
Model interpretation
The smallest p-value here is associated with G2. The positive coefficient for this predictor suggests that an increase in G2 is associated with an increase in the probability of final = pass. To be precise, a one-unit increase in G2 is associated with an increase in the log odds of pass of 2.0357671.
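Log odds are hard to read directly, so it can help to exponentiate the coefficient to obtain an odds ratio. The snippet below only uses the value reported above; b_G2 is a hypothetical variable name.

```r
b_G2 <- 2.0357671          # G2 coefficient reported above (log-odds scale)

# exp() turns a change in log odds into a multiplicative change in the odds
odds_ratio <- exp(b_G2)
round(odds_ratio, 2)       # 7.66
```

In other words, each additional G2 mark multiplies the odds of a pass by roughly 7.7, holding the other predictors fixed.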
The first command predicts the probability that each test student's characteristics result in a pass, based on the glm() built using the training data. The second and third commands create a vector of 45 fails and then convert those with probabilities greater than 50% into pass. The predicted passes and failures are compared with the real ones in a table, giving a test error of 4.444%.
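A sketch of these steps on simulated data is given below. The object names (fit, probs, pred) and the data are assumptions, so the resulting error rate will not match the figure quoted above.

```r
set.seed(3)
n <- 395

# Simulated stand-in for the student data; not the real Cortez values
d <- data.frame(
  G2 = sample(0:20, n, replace = TRUE),
  Fjob = factor(sample(c("at_home", "health", "other", "services", "teacher"),
                       n, replace = TRUE))
)
d$final <- factor(ifelse(runif(n) < plogis(0.6 * (d$G2 - 10)), "pass", "fail"),
                  levels = c("fail", "pass"))
train <- d[1:350, ]
test  <- d[351:395, ]

fit <- glm(final ~ G2 + Fjob, family = binomial, data = train)

# Predicted probability of a pass for each test student
probs <- predict(fit, newdata = test, type = "response")

# Start with everyone as "fail", then flip those above the 50% cutoff
pred <- rep("fail", nrow(test))
pred[probs > 0.5] <- "pass"

# Confusion table and test error (proportion misclassified)
table(predicted = pred, actual = test$final)
mean(pred != test$final)
```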
Model performance
As a last step, we are going to plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier. The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.
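To make the definitions concrete, the sketch below builds the ROC points and the AUC by hand in base R, rather than with whichever package the original analysis used. The labels and predicted probabilities are simulated.

```r
set.seed(4)

# Hypothetical true labels (1 = pass, 0 = fail) and predicted probabilities
actual <- rbinom(100, 1, 0.6)
probs  <- plogis(rnorm(100, mean = ifelse(actual == 1, 1.5, -1.5)))

# ROC points: true and false positive rates at every possible threshold
thresholds <- sort(unique(c(0, probs, 1)), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(probs[actual == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(probs[actual == 0] >= t))

plot(fpr, tpr, type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # the line a random classifier would follow

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen pass is scored higher than a randomly chosen fail
r  <- rank(probs)
n1 <- sum(actual == 1)
n0 <- sum(actual == 0)
auc <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```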
Conclusion
The accuracy of about 0.95 on the test set and the AUC of 0.9884454 are quite good results. However, keep in mind that these results depend somewhat on the manual split of the data that I made earlier, so for a more reliable estimate you would be better off running some kind of cross-validation, such as k-fold cross-validation. The logistic regression also provides coefficients, which give a quantitative understanding of the association between a variable and the odds of success.
Leave-one-out cross-validation for Generalized Linear Models
As mentioned above, let’s conduct a cross-validation using the cv.glm() function from the boot package. This function calculates the estimated K-fold cross-validation prediction error for generalized linear models; leave-one-out cross-validation is the special case where K equals the number of observations, which is the default. We produce our model glm.fit based on what we learned earlier. We follow the guidance of the cross-validation lab (Section 5.3.2) in James et al., 2014.
The cv.glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results: the first is the raw estimate of prediction error and the second is a bias-adjusted version. Our cross-validation estimate for the test error is approximately 0.056.
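A leave-one-out run might look like the sketch below, on simulated data. With no cost argument, cv.glm() measures error as the average squared difference between the 0/1 response and the predicted probability; the misclassification cost function shown as an alternative is an assumption, not necessarily what produced the figure above.

```r
library(boot)
set.seed(5)
n <- 120

# Simulated stand-in for the student data; final is coded 1 = pass, 0 = fail
d <- data.frame(
  G2 = sample(0:20, n, replace = TRUE),
  Fjob = factor(sample(c("at_home", "health", "other", "services", "teacher"),
                       n, replace = TRUE))
)
d$final <- as.integer(runif(n) < plogis(0.6 * (d$G2 - 10)))

glm.fit <- glm(final ~ G2 + Fjob, family = binomial, data = d)

# K defaults to the number of rows, which gives leave-one-out cross-validation
cv.err <- cv.glm(d, glm.fit)
cv.err$delta   # raw and adjusted estimates of prediction error

# Alternative: misclassification rate at a 0.5 cutoff
miss <- function(y, p) mean(abs(y - p) > 0.5)
cv.miss <- cv.glm(d, glm.fit, cost = miss)
cv.miss$delta
```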
k-fold cross-validation
The cv.glm() function can also be used to implement k-fold cross-validation. Below we use k = 10, a common choice for k, on our data.
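The same call with K = 10 can be sketched as follows, again on simulated data with the default squared-error cost. Fold assignment is random, so set.seed() is needed for reproducible results.

```r
library(boot)
set.seed(6)
n <- 120

# Simulated stand-in for the student data; final is coded 1 = pass, 0 = fail
d <- data.frame(
  G2 = sample(0:20, n, replace = TRUE),
  Fjob = factor(sample(c("at_home", "health", "other", "services", "teacher"),
                       n, replace = TRUE))
)
d$final <- as.integer(runif(n) < plogis(0.6 * (d$G2 - 10)))

glm.fit <- glm(final ~ G2 + Fjob, family = binomial, data = d)

# 10-fold cross-validation: the model is refitted 10 times rather than n times
cv.err.10 <- cv.glm(d, glm.fit, K = 10)
cv.err.10$delta
```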
On this data set, using this model, the leave-one-out and K = 10 estimates are very close. The error estimates are small, suggesting the model may perform OK if applied to predict future student final pass or fail.
References
- Cortez and Silva (2008). Using data mining to predict secondary school performance.
- Crawley (2004). Statistics: an introduction using R.
- James et al. (2014). An introduction to statistical learning with applications in R. Springer.
- http://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
- https://archive.ics.uci.edu/ml/datasets/Student+Performance