Overview
This post demonstrates a strategy for rare class learning whereby out-of-sample predictions are made using data-centric ensembling. This strategy is described by Charu Aggarwal in section 7.2 of his book Outlier Analysis. This is an excellent book that presents a ton of interesting material on statistics and modeling in a very readable way.
A little bit about ensembling
The basic idea behind ensembling is to make predictions on out-of-sample data by combining the predictions from many models. There are two general approaches to ensembling: model-centric and data-centric. Model-centric ensembling combines predictions from several distinct algorithms (e.g. random forest, logistic regression, etc.), while data-centric ensembling combines predictions from the same algorithm trained on several distinct subsets of the training data. Of course, the two approaches can be combined by training several algorithms on several subsets of the training data. In general, ensembling is a robust modeling strategy that reduces prediction variance and improves accuracy. One drawback is that training, testing, and deploying an ensemble of models can be complex and time consuming. Another is reduced interpretability, since each model in the ensemble may be interpreted differently.
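To make the data-centric idea concrete, here is a minimal sketch in Python (scikit-learn is assumed; the synthetic dataset, logistic regression base learner, number of ensemble members, and subset size are illustrative choices, not details from the book or this post):

# A minimal sketch of data-centric ensembling: the same algorithm is trained on
# several random subsets of the training data and its predicted probabilities
# are averaged. All specifics here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
probs = []
for _ in range(10):
    # Draw a random subset of the training data for this ensemble member
    idx = rng.choice(len(X_train), size=len(X_train) // 2, replace=False)
    member = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    probs.append(member.predict_proba(X_test)[:, 1])

# Combine the members by averaging their predicted positive class probabilities
avg_probs = np.mean(probs, axis=0)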
Defining the strategy
The strategy shown in this post is an example of data-centric ensembling applied to binary classification with imbalanced classes. The goal is to make accurate out-of-sample predictions by combining predictions from several rounds of training (e.g. 25 rounds). Prior to training, a dedicated out-of-sample testing set is drawn from the available data, and the remainder is used for training. In each round of training, a balanced training subsample (a subset of the training data) is created by downsampling the negative class to match the size of the entire positive class, a model is trained on that subsample, and predictions are made on the dedicated out-of-sample testing set. Once the training rounds are complete, the predictions from each round are combined by averaging the predicted probabilities for each case in the testing set.
This strategy is efficient because downsampling keeps each round's training set manageable in size. As a result, the strategy can be repeated many times without being too computationally expensive. A rough sketch of the downsampling step follows.
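Here is one way the downsampling step might look in Python (pandas is assumed, and the DataFrame and the "target" column name are hypothetical placeholders):

import pandas as pd

def balanced_subsample(train: pd.DataFrame, seed: int) -> pd.DataFrame:
    # Keep the entire positive class and downsample the negative class to match its size
    pos = train[train["target"] == 1]
    neg = train[train["target"] == 0].sample(n=len(pos), random_state=seed)
    # Shuffle the combined, balanced subsample
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)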
Testing the strategy
It is useful to think of a single run of a training strategy as a single data point. One should not draw grand conclusions from a single data point; multiple data points are needed to gain a more complete understanding of the data. Likewise, any training strategy should be repeated several times to build a more complete picture of how the final implementation is expected to perform. This is why techniques such as repeated K-fold cross validation exist (to repeat the basic K-fold CV procedure).
For this reason, we do not want to draw conclusions from a single model, nor from a single run of the ensemble strategy. Instead, the ensemble strategy in this post is repeated several times in experimental fashion, and a second experiment tests a simple baseline strategy that omits the ensemble component. The outcomes across all repetitions of each experiment are then visualized to better understand how the ensemble strategy would be expected to perform in a live setting.
Pseudocode for training and testing
READ data
SAMPLE stratified training and testing data
SET number of ensemble iterations
FOR each ensemble iteration
    SAMPLE training data via downsampling
    TRAIN model
    PREDICT positive class probabilities for testing data
ENDFOR
COMPUTE average predicted positive class probabilities for testing data
SET array of candidate probability cutoffs
FOR each candidate probability cutoff
    PREDICT class labels for testing data using probability cutoff
    COMPUTE performance metrics
ENDFOR
COMPUTE optimal probability cutoff which maximizes performance metric
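The pseudocode above might translate to Python roughly as follows; the synthetic imbalanced dataset, random forest base learner, 25 rounds, candidate cutoff grid, and F1 as the performance metric are all illustrative assumptions rather than choices made in this post:

# A rough Python translation of the training and testing pseudocode.
# Dataset, base learner, round count, cutoff grid, and metric are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# READ data (synthetic imbalanced data stands in for the real data)
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)

# SAMPLE stratified training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# SET number of ensemble iterations
n_rounds = 25
rng = np.random.default_rng(0)
pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]

probs = []
for i in range(n_rounds):
    # SAMPLE training data via downsampling the negative class
    neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, neg_sample])
    # TRAIN model on the balanced subsample
    model = RandomForestClassifier(n_estimators=100, random_state=i)
    model.fit(X_train[idx], y_train[idx])
    # PREDICT positive class probabilities for the testing data
    probs.append(model.predict_proba(X_test)[:, 1])

# COMPUTE average predicted positive class probabilities for the testing data
avg_probs = np.mean(probs, axis=0)

# Sweep candidate probability cutoffs and keep the one that maximizes F1
cutoffs = np.arange(0.05, 0.91, 0.05)
scores = [f1_score(y_test, (avg_probs >= c).astype(int)) for c in cutoffs]
optimal_cutoff = cutoffs[int(np.argmax(scores))]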
Pseudocode for cross validation
READ data
SAMPLE stratified training and testing data
SET number of ensemble iterations
FOR each ensemble iteration
    SAMPLE training data via downsampling
    TRAIN model
    PREDICT positive class probabilities for testing data
ENDFOR
COMPUTE average predicted positive class probabilities for testing data
SET array of candidate probability cutoffs
FOR each candidate probability cutoff
    COMPUTE predicted positive class labels for testing data
    COMPUTE performance metric given predicted and known positive class labels
ENDFOR
COMPUTE performance metric associated with optimal probability cutoff
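The final step here reports the performance metric achieved at the optimal cutoff, rather than the cutoff itself. A minimal sketch of that step, using toy arrays and F1 as an assumed metric:

# Given averaged out-of-sample probabilities and known labels, sweep candidate
# cutoffs and report the metric at the best one. The arrays and F1 are toy assumptions.
import numpy as np
from sklearn.metrics import f1_score

avg_probs = np.array([0.9, 0.2, 0.65, 0.4, 0.8, 0.1])  # averaged predicted probabilities
y_test = np.array([1, 0, 1, 0, 1, 0])                  # known positive class labels

cutoffs = np.arange(0.05, 0.91, 0.05)
scores = [f1_score(y_test, (avg_probs >= c).astype(int)) for c in cutoffs]
best = int(np.argmax(scores))
print(f"optimal cutoff = {cutoffs[best]:.2f}, F1 at that cutoff = {scores[best]:.3f}")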
Pseudocode for production
READ training data
SET number of ensemble iterations
FOR each ensemble iteration
    SAMPLE training data via downsampling
    TRAIN model
ENDFOR
DEPLOY ensemble object
READ inference data
PREDICT positive class probabilities for inference data
SET optimal probability cutoff
COMPUTE predicted class labels using optimal probability cutoff
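A rough sketch of the production flow might look like the following; joblib for persistence, the file name, the base learner, the synthetic stand-in data, and the hard-coded cutoff value are all illustrative assumptions:

# A rough sketch of the production flow: train the ensemble, persist it, then
# load it and score new data with a previously chosen cutoff. The file name,
# base learner, synthetic data, and cutoff value are illustrative assumptions.
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# READ training data (synthetic data stands in for the real data)
X_train, y_train = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)

rng = np.random.default_rng(0)
pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]

# FOR each ensemble iteration: downsample, train, and keep the fitted model
ensemble = []
for i in range(25):
    neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, neg_sample])
    ensemble.append(RandomForestClassifier(n_estimators=100, random_state=i).fit(X_train[idx], y_train[idx]))

# DEPLOY ensemble object (persist the list of fitted models)
joblib.dump(ensemble, "ensemble.joblib")

# Later, at inference time: load the ensemble and score new cases
ensemble = joblib.load("ensemble.joblib")
X_new, _ = make_classification(n_samples=100, weights=[0.95, 0.05], random_state=1)  # stand-in for inference data

# PREDICT averaged positive class probabilities, then apply the fixed optimal cutoff
avg_probs = np.mean([m.predict_proba(X_new)[:, 1] for m in ensemble], axis=0)
optimal_cutoff = 0.5  # placeholder: in practice, use the cutoff chosen during testing
labels = (avg_probs >= optimal_cutoff).astype(int)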