Predicting Vaccine Usage
Using Machine Learning
High Stakes Medicine
A once-in-a-century pandemic began spreading in March 2020. There was no living person old enough to remember something comparable, and few people knowledgeable enough to even estimate how much it would affect the world. As nations around the world recover, the production and distribution of a vaccine have become a source of hope and a catalyst for getting past it. Vaccines are widely regarded as one of the most life-saving innovations of modern technology.
Even with this hope, there is much fear and distrust of the institutions administering these vaccines. As fear swings between mortality and an “anti-vax” conspiracy, the human cost continues to rise while the long-term societal effects are yet to be seen. With this backdrop in mind, an understanding of why people take or don’t take vaccines has never been more valuable. Predicting what percentage of people have taken the vaccine is especially important for understanding the probability of another outbreak. Scientists around the world are being called to tackle this problem. Though we won’t address the current pandemic directly, we’ll be looking at a way to tackle this issue using H1N1 and Seasonal Flu vaccine datasets.
Using people’s information to predict their actions is nothing new. It is commonly used in recommendation systems and marketing. In order to apply this to vaccine usage, I used a dataset from a competition available on www.drivendata.org/.
For predicting whether someone took a vaccine, the dataset offers some obvious information, like whether the person was worried about the specific virus, how knowledgeable they were about the virus, and whether their doctor recommended the vaccine. There is also other, more subtle information, like the patient’s education, race, and gender.
There are two target vectors: seasonal flu vaccine usage and H1N1 vaccine usage. I split the training labels into two datasets so I could build models for each of these target vectors separately. I also used the H1N1 predictions as a feature in my Seasonal Flu model. The ‘respondent_id’ feature caused some data leakage when training my model, so I removed it. Besides that, the data was already pretty clean and required little wrangling.
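As a minimal sketch of that preparation step, the snippet below drops the identifier column and separates the two targets. The data is synthetic, and every column name other than respondent_id (h1n1_concern, doctor_recc_h1n1, h1n1_vaccine, seasonal_vaccine) is an assumption made for illustration rather than something taken from the steps described above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the competition files.
# All column names except "respondent_id" are assumptions for illustration.
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "respondent_id": np.arange(n),
    "h1n1_concern": rng.integers(0, 4, n),
    "doctor_recc_h1n1": rng.integers(0, 2, n),
})
labels = pd.DataFrame({
    "respondent_id": np.arange(n),
    "h1n1_vaccine": rng.integers(0, 2, n),
    "seasonal_vaccine": rng.integers(0, 2, n),
})

# Drop the identifier so a model cannot pick up patterns tied to row order.
X = features.drop(columns=["respondent_id"])

# One target vector per vaccine, so each can get its own model.
y_h1n1 = labels["h1n1_vaccine"]
y_seasonal = labels["seasonal_vaccine"]

print(X.columns.tolist())
```

Keeping the two targets as separate series makes it straightforward to fit one model per vaccine on the same feature matrix.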
After splitting the data, I established a baseline to compare the models against. The seasonal flu label set was balanced, so a simple accuracy score was enough. The H1N1 label set was imbalanced, so the AUC score was used instead.
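To see why accuracy is a weak yardstick on an imbalanced label set, here is a minimal sketch of a majority-class baseline using scikit-learn's DummyClassifier. The labels are synthetic, and the 20% positive rate is an assumption chosen for illustration, not the actual class balance of the competition data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # features are ignored by the dummy model
y_balanced = np.array([0, 1] * 500)             # hypothetical 50/50 labels
y_imbalanced = np.array([1] * 200 + [0] * 800)  # hypothetical 20% positive labels

majority = DummyClassifier(strategy="most_frequent")

# Balanced labels: always guessing one class is right half the time.
majority.fit(X, y_balanced)
acc_balanced = accuracy_score(y_balanced, majority.predict(X))

# Imbalanced labels: the same trivial strategy looks good on accuracy...
majority.fit(X, y_imbalanced)
acc_imbalanced = accuracy_score(y_imbalanced, majority.predict(X))

# ...but its AUC shows that it cannot rank positives above negatives.
auc_imbalanced = roc_auc_score(y_imbalanced, majority.predict_proba(X)[:, 1])

print(acc_balanced, acc_imbalanced, auc_imbalanced)
```

A constant prediction earns high accuracy on the imbalanced labels but an AUC no better than chance, which is why AUC makes the more honest baseline there.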
Since there are two target vectors, two sets of prediction models were made, one for the Seasonal Flu vaccine and one for the H1N1 vaccine. For a vaccine usage predictor, either recall or AUC score seemed like a suitable metric. The true positive rate allows scientists to estimate how many people have actually taken the vaccine, and from that deduce the probability of an outbreak occurring again. Since the competition was based on the AUC score, that was the metric I settled on. With that metric in mind, I created three types of predictors for each model set: random forest, gradient boosting, and XGBoost classifiers.
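The comparison could look roughly like the sketch below, which fits each classifier on the same split and scores it by AUC. The data comes from make_classification rather than the competition files, the hyperparameters are placeholder choices, and the XGBoost model is only added if the xgboost package happens to be installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for one of the two targets.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
try:
    # Optional dependency: skipped when xgboost is not available.
    from xgboost import XGBClassifier
    models["xgboost"] = XGBClassifier(n_estimators=100, random_state=0)
except ImportError:
    pass

auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC needs the predicted probability of the positive class, not hard labels.
    proba = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, proba)
    print(f"{name}: AUC = {auc_scores[name]:.3f}")
```

Running the same loop once per target vector would give the two model sets.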
To test these models for overfitting, I used cross-validation. The random forest model had the highest mean scores, while the XGBoost classifier scored higher on some runs, but not consistently. Along with that, the AUC scores for each of the models were reported.
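A minimal sketch of such a check, assuming 5-fold stratified cross-validation scored by AUC (the fold count and the synthetic data are illustrative choices, not details from the write-up):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# One AUC per held-out fold; a large spread hints at an unstable model.
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Comparing the mean and the spread across folds for each model is one way to tell a consistently good model from one that only wins occasionally.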
The ROC curves for each of the models were also plotted for easier visualization.
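For reference, the points of a ROC curve can be computed with scikit-learn's roc_curve, as in this sketch on synthetic data. It stops short of drawing the figure so that it does not depend on a plotting library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# False positive rate and true positive rate at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_test, proba)

# The area under those points matches the AUC computed directly.
area = auc(fpr, tpr)
print(f"area under curve = {area:.3f}, roc_auc_score = {roc_auc_score(y_test, proba):.3f}")
```

Passing fpr and tpr to a line plot (for example matplotlib's plt.plot(fpr, tpr)) once per model would overlay the curves on one chart.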
Lastly, a bar chart sorts the features by importance. With this chart, the less significant features can be identified and removed from the dataset to improve performance.
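One way to get such a ranking is the feature_importances_ attribute of a fitted random forest. The sketch below uses synthetic data with hypothetical feature names, and the 0.05 cutoff is an arbitrary value for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_arr, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                               n_redundant=0, random_state=0)
# Hypothetical feature names; the real dataset has its own columns.
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(X_arr.shape[1])])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Keep only the features above an arbitrary illustrative cutoff.
threshold = 0.05
kept = importances[importances >= threshold].index.tolist()
X_reduced = X[kept]
```

Calling importances.plot.barh() would draw the bar chart, provided matplotlib is installed.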
To find good parameters for these models, I used a randomized search. After each randomized search, the range of each parameter was narrowed around the best values found so far. This may not have produced the exact optimal parameters, but the result should still be among the best choices.
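The narrowing procedure can be sketched with scikit-learn's RandomizedSearchCV: search a wide range first, then search again in a tighter window around the best values. The parameters tuned here (max_depth, min_samples_leaf), their ranges, and the window size are assumptions for illustration, and the data is synthetic.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

def search(param_distributions):
    """Run one small randomized search scored by cross-validated AUC."""
    rs = RandomizedSearchCV(
        RandomForestClassifier(n_estimators=30, random_state=0),
        param_distributions=param_distributions,
        n_iter=5, scoring="roc_auc", cv=3, random_state=0)
    return rs.fit(X, y)

# Round 1: wide ranges (randint's upper bound is exclusive).
wide = search({"max_depth": randint(2, 21),
               "min_samples_leaf": randint(1, 21)})
best_depth = wide.best_params_["max_depth"]
best_leaf = wide.best_params_["min_samples_leaf"]

# Round 2: a narrower window centered on the round-1 winners.
narrow = search({
    "max_depth": randint(max(2, best_depth - 2), best_depth + 3),
    "min_samples_leaf": randint(max(1, best_leaf - 2), best_leaf + 3),
})
print(wide.best_params_, narrow.best_params_, round(narrow.best_score_, 3))
```

Because each round samples only a handful of candidates, the second round is not guaranteed to beat the first; the point is to spend the search budget where earlier results looked promising.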
To my surprise, the random forest classifier performed the best. The XGBoost classifier had the best scores on the training data but didn’t score quite as well on the test data, which I interpret as the model overfitting the training data. The random forest model predictions were submitted to the DrivenData competition.
With an AUC score of 0.855, it seems vaccine usage can be predicted fairly well. It turns out that not only medical data is useful in making these kinds of predictions, but also personal data like employment industry and income. Machine learning models add value to all sorts of studies and real-world uses. Though this model was built on H1N1 and seasonal flu datasets, there are definitely other data scientists applying this technology to fight the pandemic. By pooling our resources and using medicine in concert with technology, we improve our chances of saving lives and bringing the world back to normalcy.