Keras is a high-level neural networks API, developed with a focus on enabling fast experimentation (it runs on a TensorFlow backend). Being able to go from idea to result with the least possible delay is key to doing good research. This blog post celebrates the release of the new Deep Learning with R book by François Chollet (the creator of Keras), which provides a comprehensive introduction to both Keras and the concepts and practice of deep learning. The first few chapters are free and are definitely worth checking out.
For this blog post we assume the reader is familiar with neural networks and how they work.
Install and load the package, then install the required dependencies. We suggest you visit the links above and go through the MNIST example, which we reproduce here (we looked at this dataset before in our t-SNE post). I'm attempting to do this from memory as a way to practise my learning from the book (you can try using just the cheatsheet). Teaching others (through blogs or presentations) is a great way to consolidate your learning of new techniques, and a habit of autodidactic life-long learning is critical to your success as a data scientist.
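If you haven't set up yet, something like the following should work (assuming you want the default CPU-based TensorFlow backend):

```r
# install the R package from CRAN, then the underlying
# Python dependencies (TensorFlow and Keras itself)
install.packages("keras")
library(keras)
install_keras()  # installs the TensorFlow backend by default
```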
One of my favourite bits to paraphrase from the Deep Learning with R book is how a neural network is like a multi-stage information-distillation pipeline. With every layer we purify the message from the input, getting more and more dissimilar to the original input but more and more informative for completing a task, such as predicting what animal is represented by a bunch of pixels (cat or not). Thus to proceed, we need labelled input data and a loss function (a way to score how well our network is doing at predicting, by comparing predictions against "truth" labels).
Keras is like LEGO: it operates at a high enough level that we can stick our layers together and experiment readily. It also facilitates scalability; neural networks can take considerable computing effort (given the size of the data), so we need to be able to move our computation readily to a GPU or the cloud (AWS can be handy, but to reduce some of the GUI faff and complicated setup, check out this databox repo for creating some handy default instances from your terminal).
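A minimal sketch of loading the data, which conveniently ships with the keras package (the variable names are our own):

```r
mnist <- dataset_mnist()  # downloads on first use
x_train <- mnist$train$x  # 60,000 images, each 28 x 28 pixels
y_train <- mnist$train$y  # the "truth" labels, digits 0-9
x_test  <- mnist$test$x   # 10,000 held-out images
y_test  <- mnist$test$y
```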
In the above code chunk we loaded the data with its respective labels (i.e. which digit each handwritten pixel representation depicts).
Tensors and our digit data
As Keras is a TensorFlow frontend, the data is unsurprisingly stored as tensors (multidimensional arrays). You are probably familiar with a one-dimensional vector and a two-dimensional matrix; tensors extend this to N dimensions (in tensor parlance, people normally say axes rather than dimensions). For example, if you pack matrices (2D tensors) into an array, you create a 3D tensor, or a cube of numbers (interestingly and inconsistently, we just said that axes is preferred, but note how "D" is suffixed here!).
Our training data is a 3D tensor of integers. Surely the digit pixel data should be 2D? It is; more precisely, the tensor is an array of 60,000 matrices of 28 × 28 integers. When working with tensors, it is common for the first axis to be the sample number; here the samples are images of digits. This is useful for breaking the data up into "batches", as it would be hard to process it all at once, and we can parallelise the work to speed up computation. Each sample has an associated matrix that is a gray-scale image, with integers between 0 and 255.
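We can eyeball one of the samples; a sketch using base graphics (the transpose and reversal just orient the image the right way up):

```r
digit <- x_train[1, , ]  # the first sample, a 28 x 28 matrix of integers

# rotate the matrix so the digit displays the right way up
image(t(apply(digit, 2, rev)),
      col = gray(seq(1, 0, length.out = 256)), axes = FALSE)

y_train[1]  # the "truth" label for this sample
```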
This looks like a five. We can check that mental model against reality by comparing it to the "truth" label. We are correct; I wonder what our accuracy across all 70,000 samples would be?
Gray-scale might be unnecessary detail; we can simplify the problem by setting any non-zero pixel to 1 (so each pixel is either off or on). We should also convert the digit labels to a binary class matrix (one-hot encoding), the format Keras expects for the targets.
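A sketch of that preprocessing; we also flatten each image into a 784-long vector, since dense layers expect each sample as a flat vector:

```r
# flatten each 28 x 28 image into a vector of 784 pixels
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test  <- array_reshape(x_test, c(nrow(x_test), 784))

# any non-zero pixel becomes 1, so each pixel is simply off or on
x_train[x_train > 0] <- 1
x_test[x_test > 0] <- 1

# one-hot encode the labels into a binary class matrix
y_train <- to_categorical(y_train, 10)
y_test  <- to_categorical(y_test, 10)
```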
Using Keras we can add layers to our network. From our basic knowledge of neural networks, we know we need an input layer (the pixels), a hidden layer to distill relevant information from these pixels (learning is possible by updating the weights via back propagation, driven by our loss function and optimiser), and a final output layer with as many nodes as there are classes (ten digits: 0, 1, 2 … 9). See the comments in the code chunk below for extra detail.
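A sketch of the network; the 256 hidden units are an arbitrary choice for illustration:

```r
model <- keras_model_sequential() %>%
  # hidden layer: distills information from the 784 input pixels
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  # output layer: one node per digit class, softmax gives class probabilities
  layer_dense(units = 10, activation = "softmax")

summary(model)  # prints layer shapes and the number of trainable parameters
```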
The dense layers are fully connected: there is a direct link between every node in one layer and every node in the next (for available activation functions see here). We need to train a lot of weights (parameters)!
Prior to fitting the model to the training data we must configure it by specifying what optimiser and loss function to use.
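For example:

```r
model %>% compile(
  optimizer = optimizer_rmsprop(),    # how the weights get updated
  loss = "categorical_crossentropy",  # scores predictions against truth labels
  metrics = c("accuracy")             # a human-friendly metric to monitor
)
```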
We are now ready to fit the model to the training data, for a fixed number of iterations over the dataset (note that fit() modifies the model in place). The batch_size argument specifies the number of samples per gradient update, and you can use the view_metrics argument to watch the training in realtime.
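A sketch of the call (the epoch and batch sizes are typical MNIST choices rather than tuned values):

```r
history <- model %>% fit(
  x_train, y_train,
  epochs = 5,          # passes over the full training set
  batch_size = 128,    # samples per gradient update
  view_metrics = TRUE  # watch loss and accuracy in realtime
)
```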
Wow! You can see how quickly the accuracy improves and how fast the process is. On my machine it took about 10 seconds to run, with an accuracy of just under 99% on the training data. However, our model has inevitably overfit somewhat; let's assess how it performs on data it hasn't seen before.
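Evaluating on the held-out test set:

```r
model %>% evaluate(x_test, y_test)  # loss and accuracy on unseen data
```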
Almost 98%, not bad for the “Hello World!” of machine learning vision tasks. It was all put together with a few lines of code.
As usual in R, we can make use of predict functionality (albeit the Keras version of it), which could ultimately lead to our model being deployed. We can also save the model as JSON or in another machine-readable format for later use.
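For example (the file name is illustrative):

```r
# predicted digit for each test image
predictions <- model %>% predict_classes(x_test)

# persist the model for later use
save_model_hdf5(model, "mnist_model.h5")  # architecture + weights
json_string <- model_to_json(model)       # architecture only, as a JSON string
```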
Hospital readmissions prediction
Thus far we have simply copied and pasted an example. Let's use a dataset we are interested in and attempt to use Keras on a classification problem based on hospital data from the UCI machine learning repo (due to its size we have not put this in the data folder of the machine gurning repo). We use data that has not been pre-processed so we can reflect on this important step.
The dataset represents ten years (1999-2008) of clinical care at 130 U.S. hospitals. Hospital readmission of diabetic patients is a major concern in the United States. Predicting which patients will be readmitted within 30 days would be useful for efficiency savings, helping doctors spot patients at risk of readmission.
Searching for this dataset on RPubs reveals someone who has already tidied it somewhat, so we built on and adapted their work for our purpose. We also inspected the file by opening it and checking for any dodgy codings of "NA"; we found "?" was used, and that some variables had mostly missing data.
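A sketch of the read, mapping "?" to NA up front (diabetic_data.csv is the file name in the UCI download; the variable names are our own):

```r
# "?" denotes missing values in the raw file
diabetes <- read.csv("diabetic_data.csv",
                     na.strings = "?",
                     stringsAsFactors = FALSE)

# which variables are mostly missing?
sort(colMeans(is.na(diabetes)), decreasing = TRUE)[1:5]
```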
Race, weight and a few other variables are responsible for most of the missing data, as shown above (see here for help interpreting this figure). We drop them from our dataset.
For simplicity, we drop encounter_id and patient_nbr and assume that every row represents a unique admission to the hospital (not true, of course; we could use a recurrent neural network to learn from the historical experience of each patient, but we simplify things for now).
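Dropping those columns might look like the following (the exact set to drop should match what the missingness figure shows):

```r
# mostly-missing variables plus the two identifier columns
drop_cols <- c("race", "weight", "payer_code", "medical_specialty",
               "encounter_id", "patient_nbr")
diabetes <- diabetes[, !(names(diabetes) %in% drop_cols)]
```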
Look at the variables. Age is not useful in its current format; let's convert it into something informative. Some of the variables / features are also of the wrong type. We do some basic tidying below, with comments explaining what's happening. We would also benefit from converting the factors to dummy variables for modelling purposes; the caret package can help with this.
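A sketch of the tidying; the age-bracket-to-midpoint mapping is our own simplification, and we let caret's dummyVars do the expansion with its default settings first:

```r
# age arrives in brackets like "[0-10)"; map each bracket to its midpoint
age_levels <- sort(unique(diabetes$age))
age_midpoints <- seq(5, length.out = length(age_levels), by = 10)
diabetes$age <- age_midpoints[match(diabetes$age, age_levels)]

# make sure the remaining categorical columns are factors
diabetes[] <- lapply(diabetes, function(x) if (is.character(x)) factor(x) else x)

library(caret)

# expand each factor into one 0/1 dummy column per level (default settings)
dummies <- dummyVars(readmitted ~ ., data = diabetes)
features <- predict(dummies, newdata = diabetes)
ncol(features)  # count the columns we end up with
```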
This is problematic, as we have fallen foul of the dummy variable trap; we have perfect collinearity between some variables, for example genderFemale and genderMale: if we know one, we know the other (see here for more details). The fullRank parameter protects us against this. Let's turn on fullRank and try our data frame again.
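Turning it on:

```r
# fullRank = TRUE keeps n - 1 dummies per factor, so one level acts as baseline
dummies <- dummyVars(readmitted ~ ., data = diabetes, fullRank = TRUE)
features <- predict(dummies, newdata = diabetes)
ncol(features)  # fewer columns, no redundant pairs like genderFemale / genderMale
```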
This reduces our variables, or features, from 95 to 69 and, importantly, avoids perfect collinearity in the features. In essence, perfect collinearity is bad because you increase the number of parameters in the model without accruing any extra information. It also invalidates some of the assumptions of linear regression and least squares estimation.
Almost there! We just need to scale some of our variables to between zero and one so they don't have an excessive effect on our model (variables on larger scales would otherwise dominate the prediction of the response variable).
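A minimal min-max scaler applied column by column (the existing 0/1 dummies pass through unchanged):

```r
# scale each column to the [0, 1] range
range01 <- function(x) (x - min(x)) / (max(x) - min(x))
features <- apply(features, 2, range01)
```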
That was a bit long-winded, but we've scaled all our variables and removed some of the uninformative and unhelpful ones (some of the diag variables could be useful; it would be nice if we could link this dataset to others or fill in the NAs somehow, but we proceed with what we have).
The Keras Workflow
As in the MNIST example, we can follow a workflow:
Define your training data: input tensors and target tensors.
Define a network of layers (or model) that maps your inputs to your targets.
Configure the learning process by choosing a loss function, an optimizer, and some metrics to monitor.
Iterate on your training data by calling the fit() method of your model.
Defining the training data
Prior to this we inspect the data and notice it is imbalanced (we revisit this problem later; for now we continue as is).
Now we need to get our training data and our labels into the correct format. Let’s remind ourselves of what the MNIST data structure was (we probably want tensors right!?).
The train_images were 28 by 28 scaled pixel scores, or 784 pixels in total for each sample, with 60,000 samples or rows; the train_labels correspondingly had 60,000 entries. The caret package and its createDataPartition function would be a better approach, but we use a simple random split here for simplicity.
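A sketch of the simple split; the 80/20 proportion and the binary target construction are our own choices:

```r
set.seed(1337)  # reproducibility

# binary target: 1 if readmitted within 30 days, else 0
readmitted_binary <- as.numeric(diabetes$readmitted == "<30")

# simple random 80/20 split of the rows
train_idx <- sample(nrow(features), size = floor(0.8 * nrow(features)))
x_train <- features[train_idx, ]
x_test  <- features[-train_idx, ]
y_train_raw <- readmitted_binary[train_idx]
y_test_raw  <- readmitted_binary[-train_idx]
```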
Currently our data is not in the form of tensors. You can't feed lists of integers into a neural network, so we had better do something about it. We also need to convert our labels from integer to numeric and transform the target from a vector of class values into a matrix with a boolean for each class, indicating whether or not a given instance has that class.
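A sketch of the conversion:

```r
# neural networks want numeric matrices (2D tensors), not data frames or lists
x_train <- as.matrix(x_train)
x_test  <- as.matrix(x_test)

# one-hot encode the binary target into a two-column class matrix
y_train <- to_categorical(y_train_raw, 2)
y_test  <- to_categorical(y_test_raw, 2)
```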
Now we have reached the end of the exploration and preprocessing steps for the hospital data, with the caveat that we flew through the process, applied no expert knowledge, and did not inspect the data as closely as we should have; this blog is meant to be illustrative of using keras. We can now go on to building our neural network with keras!
Define the network architecture
Can we predict which patients will be readmitted to hospital within 30 days of discharge? The input data is vectors, and the labels are scalars (1s and 0s): this is the easiest setup we'll ever encounter (compared to image, video or time-series data). For this problem we probably want a fully-connected multi-layer perceptron (not just because it sounds cool!), as it should perform well on this type of problem.
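A sketch of such a network (the layer sizes are discussed below):

```r
model <- keras_model_sequential() %>%
  # hidden layer: 10 units, relu non-linearity, one input per feature
  layer_dense(units = 10, activation = "relu", input_shape = c(69)) %>%
  # output layer: one unit per readmission class, sigmoid squashes to [0, 1]
  layer_dense(units = 2, activation = "sigmoid")
```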
Note how the output layer creates 2 output values, one for each readmission class (not readmitted, or readmitted within 30 days). The first layer, which contains a somewhat arbitrary 10 hidden nodes (the dimensionality of its representation space), has an input_shape of 69 because our x_train has 69 columns.
Having more hidden units (a higher-dimensional representation space) allows your network to learn more-complex representations, but it slows things down and may result in overfitting to the training data. To save on computation time for the purposes of this blog, we start with a smaller number than we might otherwise use.
Using the relu activation function allows our model to learn non-linear transformations of the input data; without it, the dense layer would consist of two linear operations and our search through the hypothesis space would be restricted to linear representations of the data. Imagine you had a piece of blue paper crumpled up with a yellow piece: not using an activation function would be like trying to untangle the two pieces of paper with just your forehead.
The output layer has a sigmoid activation function so we can squash each output to resemble a probability, which fits the use case better. The score between 0 and 1 indicates how likely the sample is to have the target "1": how likely the patient is to be readmitted within 30 days, which is really the only number we need.
We can inspect our network:
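```r
summary(model)  # layer output shapes and parameter counts
```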
Choose loss function, optimiser and metric
Crossentropy has its origins in information theory and measures the distance between probability distributions, or, in this case, between the ground-truth distribution and our predictions. It's usually a good go-to when the model outputs probabilities.
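So a reasonable configuration might be:

```r
model %>% compile(
  optimizer = "rmsprop",          # a sensible default optimiser
  loss = "binary_crossentropy",   # distance between truth and predicted probabilities
  metrics = c("accuracy")
)
```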
We could also create a validation set to track progress as the model trains, checking it on data it has never seen. For simplicity we skip this step here, but it is often useful.
Training the model
We can now train the model for 5 epochs (5 iterations over all samples in the x_train and y_train tensors), in mini-batches of 512 samples.
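For example:

```r
history <- model %>% fit(
  x_train, y_train,
  epochs = 5,       # five full passes over the training data
  batch_size = 512  # mini-batches of 512 samples per gradient update
)
```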
We can plot the accuracy and the loss of the model through training. We load ggplot2 so it looks nicer.
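```r
library(ggplot2)
plot(history)  # keras provides a ggplot2-based plot method for the training history
```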
You might be pretty hyped by the accuracy but hold your horses! Let’s reflect on how well the null model would perform. What if we just predicted every patient would not be readmitted within 30 days?
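We can compute that baseline directly (y_test_raw is the 0/1 vector from our split above):

```r
# accuracy of always predicting "not readmitted within 30 days"
mean(y_test_raw == 0)
```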
Thus our neural network has the same accuracy as this null model! Agh, the curse of an imbalanced data set. This highlights an important point: it's tempting to get caught up in the wizardry of deep learning, but remember it's not magic; apply your standard thinking to the problem to avoid embarrassing outcomes like this (we should of course use a confusion matrix, as accuracy is not that useful a measure here, but we use it to make the point).
Improving the model
We need to penalise the model somehow for simply predicting the modal class (not readmitted), otherwise we will get stuck at this local minimum. A few approaches spring to mind: rebalance the data with SMOTE, customise the loss function (see alpha gov), or change the class weights (easier than customising the loss function).
SMOTE
Considering this SMOTE blog, our data should be treated as a rare-event problem (the positive class makes up < 15% of cases). The general idea of the method is to artificially generate new examples of the minority class using the nearest neighbours of existing cases.
We go back to our data frame version (the one where everything was scaled) to make handling easier, and randomly split the data.
The outcome balance in both splits is still around 11%, and therefore representative of the bigger set - we're in good shape.
Let's create extra positive observations using SMOTE. We set perc.over = 100 to double the quantity of positive cases, and perc.under = 200 to sample two negative cases for each synthetic positive case created.
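The attempted call, using the DMwR package, looked something like this (SMOTE wants a data frame with a factor response, hence the repackaging):

```r
library(DMwR)

train_df <- data.frame(x_train)
train_df$readmitted <- factor(y_train_raw)

# perc.over = 100: generate one synthetic positive per existing positive;
# perc.under = 200: sample two negatives per synthetic positive created
train_smote <- SMOTE(readmitted ~ ., data = train_df,
                     perc.over = 100, perc.under = 200)
```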
And I SMOTE his ruin upon the mountain-side. - Gandalf
Fail. Unfortunately this errors, and a quick debugging attempt did not fix it. Let's try one of our alternatives so we don't have to mess about with the data again. We leave the code in the blog as we acknowledge SMOTE as a strategy worth investigating further.
Custom class weights
Let's penalise the model for guessing zero every time; we can do that using the class_weight argument. In essence it makes it more expensive to guess zero each time. From ?fit and the documentation on class_weight, it looks like we pass it a named list. We force our algorithm to treat every instance of class 1 (readmission within 30 days) as 9 instances of class 0.
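A sketch of the weighted fit; the 1:9 ratio follows from the roughly 11% positive rate:

```r
history <- model %>% fit(
  x_train, y_train,
  epochs = 5,
  batch_size = 512,
  # mistakes on class 1 (readmitted < 30 days) cost nine times
  # as much as mistakes on class 0
  class_weight = list("0" = 1, "1" = 9)
)
```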
This returns a more realistic accuracy, as our model is penalised for just guessing "No readmission" for every patient.
Predict labels of test data
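A sketch, reusing the predict_classes helper and tabulating against the truth labels:

```r
# predicted class for each test sample
y_pred <- model %>% predict_classes(x_test)

# confusion matrix: predictions against truth
table(predicted = y_pred, actual = y_test_raw)
```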
As you can see, the model is performing poorly, but at least it isn't just guessing "No readmission" for every patient. It seems our class_weight specification has swung things the other way: perhaps by being too hard on the model for missing patients that are readmitted, we have pushed it to create lots of false positives.
We can also generate the likelihood of a patient being readmitted within 30 days. For some samples the model is more confident of a given outcome than for others.
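For example:

```r
# class probabilities for each test sample
# (second column: readmission within 30 days)
probs <- model %>% predict_proba(x_test)
head(probs)
```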
The model seems to be mostly unsure with likelihoods hovering around 50:50.
Model evaluation
This model is not particularly useful; it needs some work.
Experimentation and model fine tuning
As you explore deep learning, don't be disappointed that things aren't as easy as with the MNIST data set. We need to work for improvements to our model, and we have a few options:
Change the number of hidden layers
Adjust the number of nodes in each hidden layer
Given our data set isn't too small, this might be an OK approach (overfitting might be less of a problem than with a small-data problem).
Let's add another layer (changing one thing at a time so we can see its impact on accuracy). We create another pre-processing layer prior to the 10-unit hidden layer we had previously. It will take longer to run, but we might uncover more complicated features in the data that we might have missed before (or we might overfit).
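A sketch of the deeper network; the 20 units in the new layer are an arbitrary choice:

```r
model <- keras_model_sequential() %>%
  # new pre-processing layer ahead of the original 10-unit hidden layer
  layer_dense(units = 20, activation = "relu", input_shape = c(69)) %>%
  layer_dense(units = 10, activation = "relu") %>%
  layer_dense(units = 2, activation = "sigmoid")
```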
We perform better on the test data this time round, with a 20% improvement compared to the simpler network. However, we are still missing two thirds of the readmitted patients, and we create more work for ourselves by flagging four non-readmitted patients as a readmission risk for every genuinely at-risk one.
Twisting the knobs
Not bad for a first go, and there are other options, which include (see the sketch after this list):
Fine-tuning the optimiser
Reducing the learning rate
Adjusting the batch size and number of epochs
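For instance, dialling down the learning rate is a one-line change at compile time (the value here is illustrative):

```r
model %>% compile(
  optimizer = optimizer_rmsprop(lr = 0.0001),  # smaller steps per weight update
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
```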
Saving your model
You can save your model using the save_model_hdf5() function as well as a few other options.
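For example:

```r
save_model_hdf5(model, "readmission_model.h5")  # architecture, weights, optimiser state
```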
Learning from others
This field moves fast, and it's important to consult the literature for good setups. For example, a recent paper looked at this kind of problem, so let's use their method and model architecture (Jamei et al., 2017). Read the model training and evaluation section of this open PLOS ONE paper for a description of the design decisions taken and why.
We can combine it with the fine-tuning we used earlier, as this is a different data set and problem.
This compares to the aforementioned paper's sensitivity score of 59% for their neural network and 50% for traditional hospital risk-modelling methods in the United States.
Multiclass modelling
Using the same data, we could see how we perform in predicting "No readmission", readmission in less than 30 days, or readmission in more than 30 days. This would require a different loss function (categorical crossentropy). We leave this to you, the reader, to explore.
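A skeleton to get you started (the label encoding is an assumption):

```r
# three classes: encode "NO", ">30", "<30" as integers 0-2, then one-hot
y_multi <- to_categorical(as.numeric(factor(diabetes$readmitted)) - 1, 3)

model <- keras_model_sequential() %>%
  layer_dense(units = 10, activation = "relu", input_shape = c(69)) %>%
  layer_dense(units = 3, activation = "softmax")  # one unit per class

model %>% compile(optimizer = "rmsprop",
                  loss = "categorical_crossentropy",  # multi-class loss
                  metrics = c("accuracy"))
```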
Conclusion
Keras provides a simple and powerful LEGO-block approach to building a neural network. In this blog we demonstrated how to go from a reasonably clean patient data set to a model that predicts the readmission rates of those patients, albeit poorly.
References
Strack, B., DeShazo, J. P., Gennings, C., Olmo, J. L., Ventura, S., Cios, K. J., & Clore, J. N. (2014). Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records. BioMed Research International, 2014, 1–11. https://doi.org/10.1155/2014/781670
Billings, J., Blunt, I., Steventon, A., Georghiou, T., Lewis, G., & Bardsley, M. (2012). Development of a predictive model to identify inpatients at risk of re-admission within 30 days of discharge (PARR-30). BMJ Open, 2(4), 1–10. https://doi.org/10.1136/bmjopen-2012-001667
Jamei, M., Nisnevich, A., Wetchler, E., Sudat, S., & Liu, E. (2017). Predicting all-cause risk of 30-day hospital readmission using artificial neural networks. PLoS ONE, 12(7), 1–14. https://doi.org/10.1371/journal.pone.0181173