A Data-Centric Ensemble Approach for Supervised Classification with Imbalanced Classes

Danny Morris

2019/07/01

Overview

This article demonstrates an approach for supervised learning using an ensemble of classifiers applied to imbalanced data. At the core of the ensemble technique is data-centric subsampling. The basic steps are:

  1. Designate 80% of the original data for training and 20% for testing. These percentages can be modified as needed.

  2. Decide on the number of subsamples, S. Charu Aggarwal recommends between 10 and 25 in his book Outlier Analysis.

  3. Using functional programming, extract S subsamples from the training data. Within each subsample, the entire positive class is retained and the negative class is down-sampled to match the size of the positive class.

  4. Fit S models, one to each subsample.

  5. Apply each of the S models to the testing data to generate S sets of predictions.

  6. Combine the S sets of predictions into a single vector of predictions using majority vote.

R packages

library(tidyverse) 
library(aws.s3)
library(aws.signature)
library(rsample)   
library(ranger) 

Data

The data is located in my S3 bucket; however, a CSV of the raw data can be found here.

aws.signature::use_credentials()
credit_card <- s3read_using(
  bucket = "abn-distro-data",
  object = "creditcard.csv",
  FUN = read_csv
) %>%
  select(-Time) %>%
  mutate(Class = as.character(Class))

A smaller sample is taken to reduce computational burden.

sample_majority <- credit_card %>%
  filter(Class == "0") %>%
  sample_n(30000)

sample_df <- bind_rows(
  sample_majority,
  credit_card %>% filter(Class == "1")
)

Class distribution.

table(sample_df$Class)
## 
##     0     1 
## 30000   492

Global Train/Test Split

split_pct <- 0.8

set.seed(9560)

split_idx <- rsample::initial_split(
  data = sample_df, 
  prop = split_pct,
  strata = "Class"
)
train <- rsample::training(split_idx)
test <- rsample::testing(split_idx)
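
As a quick sanity check, the stratified split should preserve the original class ratio in both partitions:

# proportion of each class in the training and testing partitions
prop.table(table(train$Class))
prop.table(table(test$Class))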

Subsampling

Specify the parameters of the subsampling procedure.

# determine the number of rows to which the negative class is down-sampled
# by default, set to the number of positive cases
downsample_level <- train %>%
  dplyr::filter(Class == "1") %>%
  nrow()

# number of subsamples S
n_subsamples <- 25

# sequence of integers from 1 to S
# used for iteration
subsample_idx <- seq(1, n_subsamples, 1)

cat("Subsamples index:", subsample_idx)
## Subsamples index: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

Generate subsamples

subsamples <- purrr::map(subsample_idx, function(idx) {
  # within each class, sample `downsample_level` rows without replacement;
  # this retains every positive case and down-samples the negative class
  # to the same size
  train %>%
    dplyr::group_by(Class) %>%
    sample_n(size = downsample_level) %>%
    ungroup()
})
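
Each subsample should now be perfectly balanced, with every positive case retained and an equal number of down-sampled negative cases. Inspecting the first subsample confirms this:

# class counts within a single subsample
table(subsamples[[1]]$Class)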

Training

Fit an individual model to each subsample

models <- map(subsamples, function(subsample_df) {
  # fit a random forest to this subsample
  ranger::ranger(
    formula = as.factor(Class) ~ .,
    data = subsample_df
  )
})
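
Each element of models is a fitted ranger object. Printing the first one is a quick way to inspect its forest settings and out-of-bag prediction error:

models[[1]]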

Testing

Apply each individual model to the testing set

model_votes <- map(models, function(model) {
  predict(model, data = test)$predictions
})

Combine predictions via majority vote

# a function to compute the mode of a discrete vector
majority_vote <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
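
For example, given five votes in which "1" occurs most often:

majority_vote(c("0", "1", "1", "0", "1"))
## [1] "1"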

final_votes <- bind_cols(model_votes) %>%
  apply(1, majority_vote)
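
The result is a single character vector with one ensemble prediction per row of the testing set:

# distribution of the ensemble's final predictions
table(final_votes)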

Evaluate performance

caret::confusionMatrix(factor(final_votes, levels = c("0", "1")),
                       factor(test$Class, levels = c("0", "1")),
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5869    7
##          1  142   80
##                                           
##                Accuracy : 0.9756          
##                  95% CI : (0.9714, 0.9793)
##     No Information Rate : 0.9857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.5077          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.91954         
##             Specificity : 0.97638         
##          Pos Pred Value : 0.36036         
##          Neg Pred Value : 0.99881         
##              Prevalence : 0.01427         
##          Detection Rate : 0.01312         
##    Detection Prevalence : 0.03641         
##       Balanced Accuracy : 0.94796         
##                                           
##        'Positive' Class : 1               
## 

Why This Approach?

Each model is fit to a different random subsample, so the performance of any single model varies. Plotting the sensitivity of each model's individual test-set predictions shows this variation, which the ensemble's majority vote smooths out.

map_dbl(model_votes, function(model) {
  caret::sensitivity(model,
                     factor(test$Class, levels = c("0", "1")),
                     positive = "1")
}) %>%
  enframe(name = "Subsample", value = "Sensitivity") %>%
  ggplot(aes(x = Sensitivity)) +
  geom_density() +
  labs(title = "Distribution of Sensitivity Across 25 Subsamples",
       subtitle = "Sensitivity scores vary by subsample. The ensemble approach reduces variation.")