A strategy for rare class learning using data-centric ensembling

Danny Morris

2019/07/01

Overview

This post demonstrates a strategy for predicting out-of-sample cases using a binary classification ensemble when the training data contains imbalanced classes. This strategy is described by Charu Aggarwal in section 7.2 of his book Outlier Analysis.

This ensemble strategy aims to reduce the variance of the classifier by fitting several models to different balanced subsamples of the training data, scoring each out-of-sample case with every model, and then averaging the resulting predictions into a single final prediction per case. It is a useful strategy when the outcome being modeled is rare, e.g. an unexpected health crisis.
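In outline, the procedure looks like the sketch below. This is schematic only: fit_model(), predict_prob(), and draw_balanced_subsample() are placeholder names standing in for the ranger and dplyr calls used later in this post.

k <- 25
fits  <- purrr::map(1:k, ~ fit_model(draw_balanced_subsample(train)))
probs <- purrr::map(fits, ~ predict_prob(.x, out_of_sample))

# average the k probability vectors, then threshold at 0.5
avg_prob   <- purrr::reduce(probs, `+`) / k
pred_label <- ifelse(avg_prob >= 0.5, "positive", "negative")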

Packages

library(tidyverse) 
library(rsample)   
library(ranger) 
library(imbalance)
library(yardstick)

# treat the second factor level ("positive") as the event of interest
options(yardstick.event_first = F)

Data

sample_df <- imbalance::banana %>%
  as_tibble()

Class distribution.

sample_df %>%
  count(Class) %>%
  mutate(pct_n = n/sum(n))
## # A tibble: 2 x 3
##   Class        n pct_n
##   <fct>    <int> <dbl>
## 1 negative  2376   0.9
## 2 positive   264   0.1

Training and out-of-sample splits

split_pct <- 0.8

set.seed(9560)

split_idx <- rsample::initial_split(
  data = sample_df, 
  prop = split_pct,
  strata = "Class"
)
train <- rsample::training(split_idx)

out_of_sample <- rsample::assessment(split_idx)

Generate subsamples from training data

# determine the number of rows to which the negative class is down-sampled
# by default, set to the number of positive cases
downsample_level <- train %>%
  dplyr::filter(Class == "positive") %>%
  nrow()

n_subsamples <- 25

subsamples <- purrr::map(seq_len(n_subsamples), function(i) {
  train %>%
    dplyr::group_by(Class) %>%
    dplyr::sample_n(size = downsample_level) %>%
    dplyr::ungroup()
})
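As a quick sanity check, each subsample should now be balanced, containing downsample_level rows per class:

subsamples[[1]] %>%
  count(Class)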

Fit ensemble of models

models <- purrr::map(subsamples, function(subsample) {
  ranger::ranger(
    formula = as.factor(Class) ~ .,
    data = subsample,
    probability = TRUE
  )
})

Predicted labels via ensemble

probs <- purrr::map(models, function(model) {
  predict(model, data = out_of_sample)$predictions[, "positive"]
})

# element-wise average of the probability vectors across the ensemble
avg_prob <- purrr::reduce(probs, `+`) / length(probs)

pred_label <- ifelse(avg_prob >= 0.5, "positive", "negative")

Results

ensemble_f1 <- yardstick::f_meas_vec(
  out_of_sample$Class,
  factor(pred_label, levels = levels(out_of_sample$Class))
)

map_dbl(probs, function(prob) {
  label <- ifelse(prob >= 0.5, "positive", "negative")
  yardstick::f_meas_vec(
    out_of_sample$Class,
    factor(label, levels = levels(out_of_sample$Class))
  )
}) %>%
  enframe(name = "Subsample", value = "Subsample_F1") %>%
  mutate(Ensemble_F1 = ensemble_f1) %>%
  gather(Metric, Value, -Subsample) %>%
  ggplot(aes(x = Subsample, y = Value)) +
  geom_line(aes(group = Metric, color = Metric), lwd = 1) +
  scale_color_manual(values = c("blue", "gray80")) +
  labs(title = "F1 across 25 subsamples",
       subtitle = "The ensemble's F1 is more stable than that of any single subsample model.") +
  theme_bw()