Overview
This post demonstrates a strategy for rare class learning whereby out-of-sample predictions are made using data-centric ensembling. This strategy is described by Charu Aggarwal in section 7.2 of his book Outlier Analysis. This is an excellent book that presents a ton of interesting material on statistics and modeling in a very readable way.
A little bit about ensembling
The basic idea behind ensembling is to make predictions on out-of-sample data by combining the predictions from many models. There are two general approaches to ensembling: model-centric and data-centric. Model-centric ensembling combines predictions from several distinct algorithms (e.g. random forest, logistic regression, etc.), while data-centric ensembling combines predictions from the same algorithm trained on several distinct subsets of the training data. Of course, the two approaches can be combined by training several algorithms on several subsets of the training data. In general, ensembling is a robust modeling strategy that reduces prediction variance and improves accuracy. One drawback is that training, testing, and deploying an ensemble of models can be complex and time consuming. Another is reduced interpretability, since each model in the ensemble may be interpreted differently.
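To make the data-centric idea concrete, here is a minimal sketch in Python (scikit-learn is assumed; the synthetic dataset, logistic regression base learner, number of ensemble members, and subset size are illustrative choices, not details from the book or this post):

# A minimal sketch of data-centric ensembling: the same algorithm is trained on
# several random subsets of the training data and its predicted probabilities
# are averaged. All specifics here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
probs = []
for _ in range(10):
    # Draw a random subset of the training data for this ensemble member
    idx = rng.choice(len(X_train), size=len(X_train) // 2, replace=False)
    member = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    probs.append(member.predict_proba(X_test)[:, 1])

# Combine the members by averaging their predicted positive class probabilities
avg_probs = np.mean(probs, axis=0)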
Defining the strategy
The strategy shown in this post is an example of data-centric ensembling applied to binary classification with imbalanced classes. The goal is to make accurate out-of-sample predictions by combining predictions from several rounds of training (e.g. 25 rounds). Prior to training, a dedicated out-of-sample testing set is drawn from the available data, and the remainder is used for training. In each round of training, a balanced training subsample (a subset of the training data) is created by downsampling the negative class to match the size of the entire positive class, a model is trained on that subsample, and predictions are made on the dedicated out-of-sample testing set. Once the training rounds are complete, the predictions from each round are combined by averaging the predicted probabilities for each case in the testing set.
This strategy is efficient because downsampling keeps each round's training set manageable in size. As a result, the strategy can be repeated many times without being too computationally expensive. A rough sketch of the downsampling step follows.
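Here is one way the downsampling step might look in Python (pandas is assumed, and the DataFrame and the "target" column name are hypothetical placeholders):

import pandas as pd

def balanced_subsample(train: pd.DataFrame, seed: int) -> pd.DataFrame:
    # Keep the entire positive class and downsample the negative class to match its size
    pos = train[train["target"] == 1]
    neg = train[train["target"] == 0].sample(n=len(pos), random_state=seed)
    # Shuffle the combined, balanced subsample
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)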
Testing the strategy
It is useful to think of a single run of a training strategy as a single data point. One should not draw grand conclusions from a single data point; multiple data points are needed to gain a more complete understanding of the data. Likewise, any training strategy should be repeated several times to build a more complete picture of how the final implementation is expected to perform. This is why techniques such as repeated K-fold cross validation exist (to repeat the basic K-fold CV procedure).
For this reason, we do not want to draw conclusions from a single model, nor from a single run of the ensemble strategy. Instead, the ensemble strategy in this post is repeated several times in experimental fashion, and a second experiment tests a simple baseline strategy that omits the ensemble component. The outcomes across all repetitions of each experiment are then visualized to better understand how the ensemble strategy would be expected to perform in a live setting.
Pseudocode for training and testing
READ data
SAMPLE stratified training and testing data
SET number of ensemble iterations
FOR each ensemble iteration
    SAMPLE training data via downsampling
    TRAIN model
    PREDICT positive class probabilities for testing data
ENDFOR
COMPUTE average predicted positive class probabilities for testing data
SET array of candidate probability cutoffs
FOR each candidate probability cutoff
    PREDICT class labels for testing data using probability cutoff
    COMPUTE performance metrics
ENDFOR
COMPUTE optimal probability cutoff which maximizes performance metric
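The pseudocode above might translate to Python roughly as follows; the synthetic imbalanced dataset, random forest base learner, 25 rounds, candidate cutoff grid, and F1 as the performance metric are all illustrative assumptions rather than choices made in this post:

# A rough Python translation of the training and testing pseudocode.
# Dataset, base learner, round count, cutoff grid, and metric are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# READ data (synthetic imbalanced data stands in for the real data)
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)

# SAMPLE stratified training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# SET number of ensemble iterations
n_rounds = 25
rng = np.random.default_rng(0)
pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]

probs = []
for i in range(n_rounds):
    # SAMPLE training data via downsampling the negative class
    neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, neg_sample])
    # TRAIN model on the balanced subsample
    model = RandomForestClassifier(n_estimators=100, random_state=i)
    model.fit(X_train[idx], y_train[idx])
    # PREDICT positive class probabilities for the testing data
    probs.append(model.predict_proba(X_test)[:, 1])

# COMPUTE average predicted positive class probabilities for the testing data
avg_probs = np.mean(probs, axis=0)

# Sweep candidate probability cutoffs and keep the one that maximizes F1
cutoffs = np.arange(0.05, 0.91, 0.05)
scores = [f1_score(y_test, (avg_probs >= c).astype(int)) for c in cutoffs]
optimal_cutoff = cutoffs[int(np.argmax(scores))]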
Pseudocode for cross validation
READ data
SAMPLE stratified training and testing data
SET number of ensemble iterations
FOR each ensemble iteration
    SAMPLE training data via downsampling
    TRAIN model
    PREDICT positive class probabilities for testing data
ENDFOR
COMPUTE average predicted positive class probabilities for testing data
SET array of candidate probability cutoffs
FOR each candidate probability cutoff
    COMPUTE predicted positive class labels for testing data
    COMPUTE performance metric given predicted and known positive class labels
ENDFOR
COMPUTE performance metric associated with optimal probability cutoff
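The final step here reports the performance metric achieved at the optimal cutoff, rather than the cutoff itself. A minimal sketch of that step, using toy arrays and F1 as an assumed metric:

# Given averaged out-of-sample probabilities and known labels, sweep candidate
# cutoffs and report the metric at the best one. The arrays and F1 are toy assumptions.
import numpy as np
from sklearn.metrics import f1_score

avg_probs = np.array([0.9, 0.2, 0.65, 0.4, 0.8, 0.1])  # averaged predicted probabilities
y_test = np.array([1, 0, 1, 0, 1, 0])                  # known positive class labels

cutoffs = np.arange(0.05, 0.91, 0.05)
scores = [f1_score(y_test, (avg_probs >= c).astype(int)) for c in cutoffs]
best = int(np.argmax(scores))
print(f"optimal cutoff = {cutoffs[best]:.2f}, F1 at that cutoff = {scores[best]:.3f}")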
Pseudocode for production
READ training data
SET number of ensemble iterations
FOR each ensemble iteration
    SAMPLE training data via downsampling
    TRAIN model
ENDFOR
DEPLOY ensemble object
READ inference data
PREDICT positive class probabilities for inference data
SET optimal probability cutoff
COMPUTE predicted class labels using optimal probability cutoff
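A rough sketch of the production flow might look like the following; joblib for persistence, the file name, the base learner, the synthetic stand-in data, and the hard-coded cutoff value are all illustrative assumptions:

# A rough sketch of the production flow: train the ensemble, persist it, then
# load it and score new data with a previously chosen cutoff. The file name,
# base learner, synthetic data, and cutoff value are illustrative assumptions.
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# READ training data (synthetic data stands in for the real data)
X_train, y_train = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)

rng = np.random.default_rng(0)
pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]

# FOR each ensemble iteration: downsample, train, and keep the fitted model
ensemble = []
for i in range(25):
    neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, neg_sample])
    ensemble.append(RandomForestClassifier(n_estimators=100, random_state=i).fit(X_train[idx], y_train[idx]))

# DEPLOY ensemble object (persist the list of fitted models)
joblib.dump(ensemble, "ensemble.joblib")

# Later, at inference time: load the ensemble and score new cases
ensemble = joblib.load("ensemble.joblib")
X_new, _ = make_classification(n_samples=100, weights=[0.95, 0.05], random_state=1)  # stand-in for inference data

# PREDICT averaged positive class probabilities, then apply the fixed optimal cutoff
avg_probs = np.mean([m.predict_proba(X_new)[:, 1] for m in ensemble], axis=0)
optimal_cutoff = 0.5  # placeholder: in practice, use the cutoff chosen during testing
labels = (avg_probs >= optimal_cutoff).astype(int)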