Bag of Words Text Classification in R

Danny Morris

2019/02/17

In this article, I’ll demonstrate bag of words text classification using a collection of Reuters news articles. I’ll touch on topics such as text extraction and cleansing, tokenization, feature vectorization, document-term matrices, vocabulary pruning, n-grams, feature hashing, TF-IDF, and training and testing penalized logistic regression and XGBoost classifiers.

The main text mining packages are tidytext for general text mining and processing and text2vec for feature vectorization and sparse document-term matrices. The penalized logistic regression classifier comes from the glmnet package, and the XGBoost implementation comes from the xgboost package.

Load R Packages

library(text2vec)     # NLP tools
library(tidytext)     # tidy text mining

library(glmnet)       # logistic regression training
library(xgboost)      # XGBoost implementation

library(tidyverse)    # general purpose data manipulation
library(textstem)     # word lemmatization
library(caret)        # model evaluation

Load Data

The corpus is a sample of documents that appeared on the Reuters newswire in 1987. These data are commonly used to demonstrate text mining applications.

The first step is to load the raw text documents into the R session. The readr::read_lines() function reads text line by line and converts each line of text to a row in a table. This is the general approach to converting text documents to a familiar tabular data structure.

reuters = list(
  reuters_train = readr::read_lines("reuters-train"),
  reuters_test = readr::read_lines("reuters-test")
) %>%
  map(., function(x) tibble::tibble(txt = x))

reuters
## $reuters_train
## # A tibble: 311,980 x 1
##    txt                                                                     
##    <chr>                                                                   
##  1 "<REUTERS TOPICS=\"YES\" LEWISSPLIT=\"TRAIN\" CGISPLIT=\"TRAINING-SET\"~
##  2 <DATE>26-FEB-1987 15:01:01.79</DATE>                                    
##  3 <TOPICS><D>cocoa</D></TOPICS>                                           
##  4 <PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>             
##  5 <PEOPLE></PEOPLE>                                                       
##  6 <ORGS></ORGS>                                                           
##  7 <EXCHANGES></EXCHANGES>                                                 
##  8 <COMPANIES></COMPANIES>                                                 
##  9 "<UNKNOWN> "                                                            
## 10 &#5;&#5;&#5;C T                                                         
## # ... with 311,970 more rows
## 
## $reuters_test
## # A tibble: 105,392 x 1
##    txt                                                                     
##    <chr>                                                                   
##  1 "<REUTERS TOPICS=\"YES\" LEWISSPLIT=\"TEST\" CGISPLIT=\"TRAINING-SET\" ~
##  2 <DATE> 8-APR-1987 01:03:47.52</DATE>                                    
##  3 <TOPICS><D>trade</D></TOPICS>                                           
##  4 <PLACES><D>hong-kong</D><D>usa</D><D>japan</D><D>taiwan</D><D>malaysia<~
##  5 <PEOPLE></PEOPLE>                                                       
##  6 <ORGS></ORGS>                                                           
##  7 <EXCHANGES></EXCHANGES>                                                 
##  8 <COMPANIES></COMPANIES>                                                 
##  9 "<UNKNOWN> "                                                            
## 10 &#5;&#5;&#5;RM C                                                        
## # ... with 105,382 more rows

Extract Individual News Articles

Our data is currently in the form of one row per line of text. For this particular collection of documents, each document ends with a “</REUTERS>” closing tag. When performing tokenization with the tidytext package, we can use this tag as the pattern that separates individual documents. Furthermore, we will parse each document in search of the body of the text (i.e. the actual article rather than the metadata).

The function below will perform this step in addition to some others, including:

  1. Separate individual documents (i.e. document tokenization)
  2. Extract text between <body> tags containing the body of the news article
  3. Remove non-alphabetic characters (e.g. numbers, symbols)
get_document_text = function(x) {
  x %>% 
    # document tokens
    tidytext::unnest_tokens(output = article, 
                            input = txt, 
                            token = "regex", 
                            pattern = "</REUTERS>") %>%
    # id docs
    mutate(doc_id = row_number()) %>%
    # extract text between body tags
    filter(str_detect(article, "<body>")) %>%
    mutate(body_text = sub(".*<body> *(.*?) *</body>.*", "\\1", article)) %>%
    # remove non-alphabetic characters and extra whitespace
    mutate(body_text = str_replace_all(body_text, "[^[:alpha:]]", " ")) %>%
    mutate(body_text = trimws(body_text)) %>%
    select(-article) 
}
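
A couple of details are worth noting. The raw files use uppercase tags such as <BODY>, but unnest_tokens() lowercases text by default, which is why the extraction pattern searches for <body>. To see what the sub() call captures, here is a toy illustration (the sample line below is made up):

# toy example of the body-extraction pattern used above
sample_line = "<date>26-feb-1987</date> <body>  showers continued in the cocoa zone  </body> reuter"
sub(".*<body> *(.*?) *</body>.*", "\\1", sample_line)
# should return "showers continued in the cocoa zone"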

Apply get_document_text() to the training and testing sets to obtain, for each set, a table with one row per document containing the body of the article.

document_text = reuters %>%
  map(., get_document_text)

document_text
## $reuters_train
## # A tibble: 8,762 x 2
##    doc_id body_text                                                        
##     <int> <chr>                                                            
##  1      1 showers continued throughout the week in the bahia cocoa zone  a~
##  2      2 the u s  agriculture department reported the farmer owned reserv~
##  3      3 argentine grain board figures show crop registrations of grains ~
##  4      4 moody s investors service inc said it lowered the debt and prefe~
##  5      5 champion products inc said its board of directors approved a two~
##  6      6 computer terminal systems inc said it has completed the sale of ~
##  7      7 shr    cts vs      dlrs     net         vs               assets ~
##  8      8 ohio mattress co said its first quarter  ending february     pro~
##  9      9 oper shr loss two cts vs profit seven cts     oper shr profit   ~
## 10     10 shr one dlr vs    cts     net      mln vs      mln     revs     ~
## # ... with 8,752 more rows
## 
## $reuters_test
## # A tibble: 3,009 x 2
##    doc_id body_text                                                        
##     <int> <chr>                                                            
##  1      1 mounting trade friction between the u s  and japan has raised fe~
##  2      2 a survey of    provinces and seven cities showed vermin consume ~
##  3      3 the ministry of international trade and industry  miti  will rev~
##  4      4 thailand s trade deficit widened to     billion baht in the firs~
##  5      5 indonesia expects crude palm oil  cpo  prices to rise sharply to~
##  6      7 tug crews in new south wales  nsw   victoria and western austral~
##  7      8 the indonesian commodity exchange is likely to start trading in ~
##  8      9 food department officials said the u s  department of agricultur~
##  9     10 western mining corp holdings ltd  lt wmng s   wmc  said it will ~
## 10     11 sumitomo bank ltd  lt sumi t  is certain to lose its status as j~
## # ... with 2,999 more rows

Get Documents with Topic “earn”

  1. Extract text between <TOPICS> tags for each document
  2. Search <TOPICS> tags for the term “earn.”
  3. Create a reference table of matching documents
# A function to search documents containing a specified topic
search_documents_for_topic = function(topic) {
  
  reuters %>%
    map(., function(x) {
      x %>%
        # get text between TOPICS tags
        filter(str_detect(txt, "<TOPICS>")) %>%
        mutate(doc_id = row_number()) %>%
        # retain document with matching topic
        filter(str_detect(txt, topic)) %>%
        mutate(earn_ind = 1L) %>%
        select(doc_id, earn_ind)
    })
}

Run search_documents_for_topic("earn") to obtain an indexed collection of documents containing “earn” in their set of topics.

# create reference table of matching docs
documents_with_earn = search_documents_for_topic("earn")

documents_with_earn
## $reuters_train
## # A tibble: 2,877 x 2
##    doc_id earn_ind
##     <int>    <int>
##  1      5        1
##  2      7        1
##  3      8        1
##  4      9        1
##  5     10        1
##  6     11        1
##  7     15        1
##  8     16        1
##  9     18        1
## 10     22        1
## # ... with 2,867 more rows
## 
## $reuters_test
## # A tibble: 1,087 x 2
##    doc_id earn_ind
##     <int>    <int>
##  1     21        1
##  2     22        1
##  3     31        1
##  4     32        1
##  5     33        1
##  6     34        1
##  7     50        1
##  8     54        1
##  9     58        1
## 10     67        1
## # ... with 1,077 more rows

Label Documents

Using the reference table of documents containing the topic of “earn,” label all training and testing documents with a binary indicator for the presence (1) or absence (0) of the topic “earn.”

labeled_documents = map2(document_text, documents_with_earn, function(x, y) {
  left_join(x = x, 
            y = y, 
            by = "doc_id") %>%
    mutate(earn_ind = ifelse(is.na(earn_ind), "0", earn_ind)) 
})

labeled_documents
## $reuters_train
## # A tibble: 8,762 x 3
##    doc_id body_text                                                earn_ind
##     <int> <chr>                                                    <chr>   
##  1      1 showers continued throughout the week in the bahia coco~ 0       
##  2      2 the u s  agriculture department reported the farmer own~ 0       
##  3      3 argentine grain board figures show crop registrations o~ 0       
##  4      4 moody s investors service inc said it lowered the debt ~ 0       
##  5      5 champion products inc said its board of directors appro~ 1       
##  6      6 computer terminal systems inc said it has completed the~ 0       
##  7      7 shr    cts vs      dlrs     net         vs             ~ 1       
##  8      8 ohio mattress co said its first quarter  ending februar~ 1       
##  9      9 oper shr loss two cts vs profit seven cts     oper shr ~ 1       
## 10     10 shr one dlr vs    cts     net      mln vs      mln     ~ 1       
## # ... with 8,752 more rows
## 
## $reuters_test
## # A tibble: 3,009 x 3
##    doc_id body_text                                                earn_ind
##     <int> <chr>                                                    <chr>   
##  1      1 mounting trade friction between the u s  and japan has ~ 0       
##  2      2 a survey of    provinces and seven cities showed vermin~ 0       
##  3      3 the ministry of international trade and industry  miti ~ 0       
##  4      4 thailand s trade deficit widened to     billion baht in~ 0       
##  5      5 indonesia expects crude palm oil  cpo  prices to rise s~ 0       
##  6      7 tug crews in new south wales  nsw   victoria and wester~ 0       
##  7      8 the indonesian commodity exchange is likely to start tr~ 0       
##  8      9 food department officials said the u s  department of a~ 0       
##  9     10 western mining corp holdings ltd  lt wmng s   wmc  said~ 0       
## 10     11 sumitomo bank ltd  lt sumi t  is certain to lose its st~ 0       
## # ... with 2,999 more rows

Create Word Tokens

The labeled data set contains an indexed collection of documents that have been parsed, cleansed, and labeled for classification. The next steps are to:

  1. Tokenize documents by word
  2. Remove stopwords
  3. Apply lemmatization to standardize and reduce the number of unique terms. Applying lemmatization necessitates that we again eliminate non-alphabetic characters as the lemmatization method converts some words to digits (e.g. “tenth” -> 10).
  4. Obtain vector of tokens (words) for each document in training and testing sets
tokens = labeled_documents %>%
  map(., function(x) {
    x %>%
      tidytext::unnest_tokens(word, body_text) %>%
      ungroup() %>%
      # remove stopwords
      anti_join(stop_words, by = "word") %>%
      # lemmatize
      mutate(word = textstem::lemmatize_words(word)) %>%
      # strip non-alphabetic characters introduced by lemmatization
      mutate(word = str_replace_all(word, "[^[:alpha:]]", " ")) %>% 
      # collapse repeated whitespace and replace a stray capital "B", if any, with lowercase
      mutate(word = str_replace(gsub("\\s+", " ", str_trim(word)), "B", "b")) %>%
      # obtain the vector of words (tokens) for each training and testing document
      split(.$doc_id) %>%
      map(., function(x) x %>% pull(word))
  })

Here is a sample of 10 words from the first document in the training set.

# first ten words found in the first training document
tokens$reuters_train$`1`[1:10]
##  [1] "shower"    "continue"  "week"      "bahia"     "cocoa"    
##  [6] "zone"      "alleviate" "drought"   "january"   "improve"

Vectorize Vocabulary from Training Documents

Using the set of tokens from the training set, create a training vocabulary that collects unique terms and corresponding statistics. This vocabulary defines the feature space into which each document is vectorized.

# collect unique terms and mark with id
iter_train = text2vec::itoken(iterable = tokens$reuters_train, 
                               ids = labeled_documents$reuters_train$doc_id,
                               progressbar = FALSE)

vocab = text2vec::create_vocabulary(iter_train)
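
The vocabulary is a data frame with one row per unique term along with its frequency statistics. A quick sketch for inspecting the most frequent terms (exact counts depend on the tokenization above):

# peek at the most frequent terms in the training vocabulary
head(vocab[order(-vocab$term_count), ])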

Create Document-Term Matrix

First we build a vectorizer from the vocabulary and use it to convert each training document into a row of features, yielding the document-term matrix.

vectorizer = text2vec::vocab_vectorizer(vocab)

doc_term_train = text2vec::create_dtm(iter_train, vectorizer)

head(doc_term_train)
## 6 x 19014 sparse Matrix of class "dgCMatrix"
##                                                                       
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 
##  .....suppressing columns in show(); maybe adjust 'options(max.print= *, width = *)'
##  ..............................

The document-term matrix is a sparse matrix (class dgCMatrix) from the Matrix package; most of its entries are zero because any single document contains only a small fraction of the vocabulary.
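A quick sketch for confirming its dimensions and just how sparse it is (the exact density depends on the corpus):

# dimensions and fraction of non-zero entries
dim(doc_term_train)
Matrix::nnzero(doc_term_train) / prod(dim(doc_term_train))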

Train Classifier

The author of text2vec demonstrates the effectiveness of the logistic classifier available in the glmnet package. This implementation offers penalization and cross-validation, and it trains relatively quickly. The feature matrix is the document-term matrix created in the previous step, and the vector of labels comes from the collection of labeled documents.

# 4-fold cross validation
n_folds = 4

glmnet_classifier = glmnet::cv.glmnet(
  x = doc_term_train, 
  y = labeled_documents$reuters_train %>% pull(earn_ind),
  # set binary classification
  family = 'binomial', 
  # L1 penalty
  alpha = 1,
  # interested in the area under ROC curve
  type.measure = "auc",
  # 4-fold cross-validation
  nfolds = n_folds,
  # high value is less accurate, but has faster training
  thresh = 1e-3,
  # again lower number of iterations for faster training
  maxit = 1e3)

To measure the performance of the classifier during training, we can look at the cross-validated AUROC.

paste("Maximum AUROC from training:", 
      max(glmnet_classifier$cvm) %>% round(., 4))
## [1] "Maximum AUROC from training: 0.9911"
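
Beyond a single summary number, cv.glmnet stores the full cross-validation curve, which is useful for seeing how the AUROC varies with the strength of the L1 penalty. A short sketch for inspecting it:

# cross-validated AUROC across the lambda path, plus the selected penalty values
plot(glmnet_classifier)
glmnet_classifier$lambda.min
glmnet_classifier$lambda.1se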

Predict Test Set

To make predictions for the test set, we first need to vectorize the test documents using the vectorizer created from the training data. This ensures the test document-term matrix has exactly the same columns (features) as the training matrix; words that never appeared in the training vocabulary are simply dropped.

  1. Create a document-term matrix from the vocabulary
iter_test = text2vec::itoken(iterable = tokens$reuters_test, 
                             ids = labeled_documents$reuters_test$doc_id,
                             progressbar = FALSE)

doc_term_test = create_dtm(iter_test, vectorizer)

Class Probabilities

# obtain probability of document containing "earn"
probs = predict(glmnet_classifier, doc_term_test, type = 'response')[, 1]

test_auc = glmnet:::auc(y = labeled_documents$reuters_test$earn_ind %>% as.integer(),
                         prob = probs)

paste("AUROC for testing set:", round(test_auc, 4))
## [1] "AUROC for testing set: 0.9941"

Class Labels

Alternatively, we can predict the class labels directly. Each document is assigned the class with the highest predicted probability, which for a binary model amounts to a 0.5 probability cutoff.

class_labels = predict(glmnet_classifier, doc_term_test, type = 'class') %>%
  as.factor()

caret::confusionMatrix(data = class_labels,
                       reference = as.factor(labeled_documents$reuters_test$earn_ind),
                       positive = "1",
                       mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1942   78
##          1   23  966
##                                           
##                Accuracy : 0.9664          
##                  95% CI : (0.9594, 0.9726)
##     No Information Rate : 0.653           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.925           
##  Mcnemar's Test P-Value : 7.735e-08       
##                                           
##               Precision : 0.9767          
##                  Recall : 0.9253          
##                      F1 : 0.9503          
##              Prevalence : 0.3470          
##          Detection Rate : 0.3210          
##    Detection Prevalence : 0.3287          
##       Balanced Accuracy : 0.9568          
##                                           
##        'Positive' Class : 1               
## 
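
For binary glmnet models, type = 'class' is equivalent to thresholding the predicted probabilities at 0.5. A minimal sketch, reusing the probs vector computed earlier, to verify that a manual cutoff reproduces the same labels:

# labels from a manual 0.5 cutoff should match predict(..., type = 'class')
manual_labels = factor(ifelse(probs > 0.5, "1", "0"))
table(manual_labels, class_labels)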

Replicate Model Using Pruned Vocabulary

Pruning removes very rare and very common terms, shrinking the vocabulary and speeding up both training and scoring.

The following function replicates the modeling process described above: given a vectorizer, it creates training and testing document-term matrices, trains the penalized logistic regression classifier, and measures the AUROC on the test set.

replicate_model = function(vectorizer) {
  
  doc_term_train  = create_dtm(iter_train, vectorizer)
  doc_term_test   = create_dtm(iter_test, vectorizer)
  
  train_labels = labeled_documents$reuters_train %>% pull(earn_ind)
  test_labels = labeled_documents$reuters_test %>% pull(earn_ind)
  
  n_folds = 4
  
  glmnet_classifier = cv.glmnet(
    x = doc_term_train, 
    y = train_labels,
    family = 'binomial', 
    alpha = 1,
    type.measure = "auc",
    nfolds = n_folds,
    thresh = 1e-3,
    maxit = 1e3
  )
  
  probs = predict(glmnet_classifier, doc_term_test, type = 'response')[, 1]
  class_labels = predict(glmnet_classifier, doc_term_test, type = 'class')
  
  auc = glmnet:::auc(y = test_labels %>% as.integer(),
                      prob = probs) 
  
  list(auc = auc,
       predicted_probs = probs,
       predicted_labels = class_labels,
       true_labels = test_labels) %>%
    return()
}

Here we create the vocabulary from the training data, prune it, and build a vectorizer from the pruned vocabulary.

vocab = text2vec::create_vocabulary(iter_train)

pruned_vocab = text2vec::prune_vocabulary(vocab, 
                                           term_count_min = 50, 
                                           doc_proportion_max = 0.5,
                                           doc_proportion_min = 0.001)

pruned_vectorizer = vocab_vectorizer(pruned_vocab)
replicate_model(pruned_vectorizer)$auc
## [1] 0.9948822
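
To see how much pruning shrinks the feature space, compare the number of terms before and after pruning (a quick check; exact counts depend on the tokenization above):

# vocabulary size before and after pruning
nrow(vocab)
nrow(pruned_vocab)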

N-grams

So far each feature is a single word. Adding bigrams (pairs of adjacent words) to the vocabulary lets the model pick up short phrases in addition to individual terms.

# unigrams and bigrams
bigrams = create_vocabulary(iter_train, ngram = c(1L, 2L))
bigram_vectorizer = vocab_vectorizer(bigrams)

replicate_model(bigram_vectorizer)$auc
## [1] 0.9951059
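
To confirm that phrase-level terms made it into the vocabulary, we can peek at a few of them. This sketch assumes text2vec's default behavior of joining the words of an n-gram with an underscore:

# a sample of bigram terms (words joined by "_")
head(grep("_", bigrams$term, value = TRUE))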

Feature Hashing

Feature hashing maps each term (or n-gram) directly to one of a fixed number of columns via a hash function, so no vocabulary has to be built or stored. Hash collisions are possible, but with a sufficiently large hash size they tend to have little effect on accuracy.

hashing_vectorizer = hash_vectorizer(hash_size = 2 ^ 14, ngram = c(1L, 2L))

replicate_model(hashing_vectorizer)$auc
## [1] 0.9960145
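
A quick way to see the effect of hashing is the width of the resulting document-term matrix: with hash_size = 2^14 it has 16,384 columns regardless of how large the corpus vocabulary grows. A short sketch:

# hashed DTM width is fixed by hash_size, not by the vocabulary
ncol(create_dtm(iter_train, hashing_vectorizer))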

TF-IDF

Term frequency-inverse document frequency (TF-IDF) weights each term by how informative it is for a given document: terms that appear in many documents are down-weighted, while terms concentrated in a few documents are up-weighted. Reweighting the document-term matrix this way helps a classifier focus on the terms that distinguish documents.
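
As a concrete illustration, here is a minimal sketch that fits text2vec's TfIdf model on a made-up three-document corpus; a term that appears in every document ("cocoa") receives a smaller weight than a term that appears in only one ("dividend"):

# toy corpus: "cocoa" appears in every document, "dividend" in only one
toy_tokens = list(d1 = c("cocoa", "price", "rise"),
                  d2 = c("cocoa", "dividend", "rise"),
                  d3 = c("cocoa", "price", "fall"))

toy_iter  = itoken(toy_tokens, progressbar = FALSE)
toy_vocab = create_vocabulary(toy_iter)
toy_dtm   = create_dtm(toy_iter, vocab_vectorizer(toy_vocab))

# common terms get small weights, rare terms get large weights
fit_transform(toy_dtm, TfIdf$new())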

# create vocabulary from training data
vocab = create_vocabulary(iter_train)

# vectorize documents and create DTM from training data
vectorizer = vocab_vectorizer(vocab)
doc_term_train = create_dtm(iter_train, vectorizer)

# define tf-idf model
tf_idf = TfIdf$new()

# fit tf-idf to training data
doc_term_train_tfidf = fit_transform(doc_term_train, tf_idf)

# apply pre-trained tf-idf transformation to testing data
doc_term_test_tfidf  = create_dtm(iter_test, vectorizer) %>% 
  transform(tf_idf)

# build classifier using tf-idf normalized documents as features
glmnet_classifier = cv.glmnet(
  x = doc_term_train_tfidf, 
  y = labeled_documents$reuters_train %>% pull(earn_ind),
  family = 'binomial', 
  alpha = 1,
  type.measure = "auc",
  nfolds = n_folds,
  thresh = 1e-3,
  maxit = 1e3)

# AUROC from training 
paste("Maximum AUROC from training:", 
      max(glmnet_classifier$cvm) %>% round(., 4))
## [1] "Maximum AUROC from training: 0.9941"

# class probabilities for documents in testing set
probs = predict(glmnet_classifier, doc_term_test_tfidf, type = 'response')[,1]

# AUROC from testing 
labeled_documents$reuters_test$earn_ind %>%
  as.integer() %>%
  glmnet::auc(., prob = probs)
## [1] 0.9971713

XGBoost

The XGBoost algorithm is a tree-based ensemble method for classification. It combines the output of many decision trees into an aggregate score (or label), with each new tree fit to the errors of the trees built before it (gradient boosting).

xgb = xgboost::xgboost(
  data = doc_term_train_tfidf, 
  label = labeled_documents$reuters_train[['earn_ind']], 
  nrounds = 100,
  objective = "binary:logistic")
## [1]  train-error:0.031386 
## [2]  train-error:0.027848 
## [3]  train-error:0.024652 
## [4]  train-error:0.023967 
## [5]  train-error:0.018375 
## [6]  train-error:0.016891 
## [7]  train-error:0.015636 
## [8]  train-error:0.014152 
## [9]  train-error:0.013125 
## [10] train-error:0.012782 
## [11] train-error:0.012098 
## [12] train-error:0.010728 
## [13] train-error:0.010386 
## [14] train-error:0.010386 
## [15] train-error:0.009587 
## [16] train-error:0.009244 
## [17] train-error:0.007875 
## [18] train-error:0.007190 
## [19] train-error:0.007190 
## [20] train-error:0.006505 
## [21] train-error:0.006505 
## [22] train-error:0.005592 
## [23] train-error:0.005592 
## [24] train-error:0.005136 
## [25] train-error:0.004908 
## [26] train-error:0.004565 
## [27] train-error:0.003880 
## [28] train-error:0.003766 
## [29] train-error:0.003538 
## [30] train-error:0.003424 
## [31] train-error:0.003310 
## [32] train-error:0.003081 
## [33] train-error:0.002853 
## [34] train-error:0.002967 
## [35] train-error:0.002967 
## [36] train-error:0.002739 
## [37] train-error:0.002625 
## [38] train-error:0.002511 
## [39] train-error:0.002168 
## [40] train-error:0.001940 
## [41] train-error:0.002168 
## [42] train-error:0.002054 
## [43] train-error:0.002054 
## [44] train-error:0.001712 
## [45] train-error:0.001598 
## [46] train-error:0.001484 
## [47] train-error:0.001484 
## [48] train-error:0.001370 
## [49] train-error:0.001370 
## [50] train-error:0.001141 
## [51] train-error:0.001027 
## [52] train-error:0.001027 
## [53] train-error:0.000913 
## [54] train-error:0.000913 
## [55] train-error:0.000913 
## [56] train-error:0.000799 
## [57] train-error:0.000799 
## [58] train-error:0.000685 
## [59] train-error:0.000571 
## [60] train-error:0.000685 
## [61] train-error:0.000799 
## [62] train-error:0.000571 
## [63] train-error:0.000457 
## [64] train-error:0.000457 
## [65] train-error:0.000342 
## [66] train-error:0.000342 
## [67] train-error:0.000228 
## [68] train-error:0.000228 
## [69] train-error:0.000228 
## [70] train-error:0.000342 
## [71] train-error:0.000228 
## [72] train-error:0.000228 
## [73] train-error:0.000228 
## [74] train-error:0.000228 
## [75] train-error:0.000228 
## [76] train-error:0.000228 
## [77] train-error:0.000228 
## [78] train-error:0.000228 
## [79] train-error:0.000228 
## [80] train-error:0.000228 
## [81] train-error:0.000228 
## [82] train-error:0.000228 
## [83] train-error:0.000228 
## [84] train-error:0.000228 
## [85] train-error:0.000228 
## [86] train-error:0.000228 
## [87] train-error:0.000228 
## [88] train-error:0.000228 
## [89] train-error:0.000228 
## [90] train-error:0.000228 
## [91] train-error:0.000228 
## [92] train-error:0.000228 
## [93] train-error:0.000228 
## [94] train-error:0.000228 
## [95] train-error:0.000228 
## [96] train-error:0.000228 
## [97] train-error:0.000228 
## [98] train-error:0.000228 
## [99] train-error:0.000228 
## [100]    train-error:0.000228

xgb$evaluation_log %>%
  as_tibble() %>%
  ggplot(aes(x = iter, y = train_error)) +
  geom_line() +
  labs(title = "XGBoost Training Error")

# predict() on the fitted booster returns class probabilities for binary:logistic
xgb_probs = predict(xgb, doc_term_test_tfidf)

labeled_documents$reuters_test$earn_ind %>%
  as.integer() %>%
  glmnet::auc(., prob = xgb_probs)
## [1] 0.9988267
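
For a side-by-side comparison with the glmnet results, the same caret::confusionMatrix() evaluation can be applied to the XGBoost predictions. A minimal sketch, assuming a 0.5 probability cutoff:

# convert predicted probabilities to class labels at a 0.5 cutoff
xgb_class = factor(ifelse(xgb_probs > 0.5, "1", "0"))

caret::confusionMatrix(data = xgb_class,
                       reference = as.factor(labeled_documents$reuters_test$earn_ind),
                       positive = "1",
                       mode = "prec_recall")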