## Overview

This post demonstrates a strategy for predicting out-of-sample cases with a binary classification ensemble when the training data contains imbalanced classes. The strategy is described by Charu Aggarwal in section 7.2 of his book *Outlier Analysis*.

This ensemble strategy aims to reduce the overall variance of the classification model. It fits several models to different subsamples of the training data, generates a prediction for each out-of-sample case from each fitted model, and then combines those predictions into a single final prediction per case. It is a useful strategy when the outcome being modeled is rare, e.g., an unexpected health crisis.
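To make the variance-reduction claim concrete, here is a tiny base-R simulation (toy numbers, not the banana data from this post): a single noisy probability estimate is compared against the mean of 25 noisy estimates for the same case.

```
# Toy illustration: averaging k noisy "predictions" for the same case
# shrinks the spread of the final score by roughly 1/sqrt(k).
set.seed(1)
true_prob <- 0.7
single   <- replicate(1000, true_prob + rnorm(1, sd = 0.1))
ensemble <- replicate(1000, mean(true_prob + rnorm(25, sd = 0.1)))
sd(single)    # ~0.10
sd(ensemble)  # ~0.02
```

Both estimators are centered on the same true probability; only the spread differs, which is exactly what the ensemble below exploits.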

## Packages

```
library(tidyverse)
library(rsample)
library(ranger)
library(imbalance)
library(yardstick)
# treat the second factor level ("positive") as the event of interest
options(yardstick.event_first = FALSE)
```

## Data

```
sample_df <- imbalance::banana %>%
  as_tibble()
```

Class distribution.

```
sample_df %>%
  count(Class) %>%
  mutate(pct_n = n / sum(n))
```

```
## # A tibble: 2 x 3
##   Class        n pct_n
##   <fct>    <int> <dbl>
## 1 negative  2376   0.9
## 2 positive   264   0.1
```

## Training and out-of-sample splits

```
split_pct <- 0.8
set.seed(9560)
split_idx <- rsample::initial_split(
  data = sample_df,
  prop = split_pct,
  strata = "Class"
)
```

```
train <- rsample::training(split_idx)
out_of_sample <- rsample::assessment(split_idx)
```
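The `strata = "Class"` argument should keep the 90/10 class mix roughly intact in both partitions. A base-R sketch of that idea, using toy labels with the same counts as the banana data (not the post's actual split), looks like this:

```
# Stratified sampling by hand: sample 80% of the indices within each class,
# then check that the training mix matches the overall mix.
set.seed(7)
y <- rep(c("negative", "positive"), c(2376, 264))
in_train <- unlist(lapply(split(seq_along(y), y),
                          function(ix) sample(ix, floor(0.8 * length(ix)))))
prop.table(table(y))            # 0.9 negative / 0.1 positive
prop.table(table(y[in_train]))  # ~0.9 / ~0.1
```

`rsample::initial_split()` handles this bookkeeping for you; the sketch just shows what stratification is guarding against.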

## Generate subsamples from training data

```
# determine the number of rows to which the negative class is down-sampled;
# by default, set to the number of positive cases
downsample_level <- train %>%
  dplyr::filter(Class == "positive") %>%
  nrow()

# draw 25 balanced subsamples: within each class, sample `downsample_level`
# rows, which keeps every positive case and down-samples the negatives
subsample_idx <- seq(1, 25, 1)
subsamples <- purrr::map(subsample_idx, function(idx) {
  train %>%
    dplyr::group_by(Class) %>%
    sample_n(size = downsample_level) %>%
    ungroup()
})
```
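The grouped `sample_n()` call above is doing balanced down-sampling. The same idea in base R, on a hypothetical 900/100 imbalanced data frame (toy data, not the banana set), is:

```
# Keep all minority-class rows; draw an equal-sized random subset of the
# majority class, so each subsample is perfectly balanced.
set.seed(42)
toy <- data.frame(Class = rep(c("negative", "positive"), c(900, 100)))
n_pos <- sum(toy$Class == "positive")
one_subsample <- rbind(
  toy[toy$Class == "positive", , drop = FALSE],
  toy[sample(which(toy$Class == "negative"), n_pos), , drop = FALSE]
)
table(one_subsample$Class)  # negative: 100, positive: 100
```

Because each of the 25 draws uses a different random subset of negatives, the fitted models see different views of the majority class, which is where the ensemble's diversity comes from.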

## Fit ensemble of models

```
models <- map(subsamples, function(subsample) {
  ranger::ranger(
    formula = as.factor(Class) ~ .,
    data = subsample,
    probability = TRUE
  )
})
```

## Predicted labels via ensemble

```
probs <- map(models, function(model) {
  predict(model, data = out_of_sample)$predictions[, "positive"]
})

# average the 25 probability vectors element-wise: one mean per case
avg_prob <- probs %>%
  bind_cols() %>%
  apply(1, mean)
pred_label <- ifelse(avg_prob >= 0.5, "positive", "negative")
```
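The averaging step is simple enough to verify by hand. A base-R sketch with made-up probability vectors (each list element standing in for one model's predictions over three cases):

```
# Element-wise mean across a list of equal-length probability vectors,
# then the same 0.5 threshold used above.
probs_toy <- list(c(0.2, 0.8, 0.6), c(0.4, 0.9, 0.4), c(0.3, 0.7, 0.5))
avg_toy <- Reduce(`+`, probs_toy) / length(probs_toy)
labels  <- ifelse(avg_toy >= 0.5, "positive", "negative")
avg_toy  # 0.3 0.8 0.5
labels   # "negative" "positive" "positive"
```

Note that the third case lands exactly on the 0.5 threshold and is labeled positive, since the rule above uses `>=`.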

## Results

```
ensemble_f1 <- f_meas_vec(
  out_of_sample$Class,
  factor(pred_label, levels = levels(out_of_sample$Class))
)

map_dbl(probs, function(prob) {
  label <- ifelse(prob >= 0.5, "positive", "negative")
  yardstick::f_meas_vec(
    out_of_sample$Class,
    factor(label, levels = levels(out_of_sample$Class))
  )
}) %>%
  enframe(name = "Subsample", value = "Subsample_F1") %>%
  mutate(Ensemble_F1 = ensemble_f1) %>%
  gather(Metric, Value, -Subsample) %>%
  ggplot(aes(x = Subsample, y = Value)) +
  geom_line(aes(group = Metric, color = Metric), lwd = 1) +
  scale_color_manual(values = c("blue", "gray80")) +
  labs(
    title = "F1 across 25 subsamples",
    subtitle = "The ensemble reduces variation and improves F1."
  ) +
  theme_bw()
```
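For readers less familiar with the metric being plotted, F1 is the harmonic mean of precision and recall. A hand-rolled computation on a toy five-case confusion (`yardstick::f_meas_vec` computes the same quantity from real predictions):

```
# F1 from first principles on made-up labels.
truth <- c("positive", "positive", "negative", "negative", "positive")
pred  <- c("positive", "negative", "negative", "positive", "positive")
tp <- sum(pred == "positive" & truth == "positive")  # 2 true positives
precision <- tp / sum(pred == "positive")            # 2/3
recall    <- tp / sum(truth == "positive")           # 2/3
f1 <- 2 * precision * recall / (precision + recall)  # 0.667
```

F1 is a sensible headline metric here because plain accuracy is misleading on a 90/10 class split: always predicting "negative" scores 90% accuracy but an undefined F1.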