# Bag of Words Text Classification in R

## 2019/02/17

In this article, I’ll demonstrate how to perform a type of text classification using a collection of Reuters news articles. I’ll touch on topics such as text extraction and cleansing, tokenization, feature vectorization, document-term matrices, vocabulary pruning, n-grams, feature hashing, TF-IDF, and training/testing penalized logistic regression and XGBoost.

The main text mining packages demonstrated are tidytext, for general text mining and processing, and text2vec, for sparse document-term matrices and feature vectorization. The penalized logistic regression classifier comes from the glmnet package, and the XGBoost implementation comes from the xgboost package.

library(text2vec)     # NLP tools
library(tidytext)     # tidy text mining

library(glmnet)       # logistic regression training
library(xgboost)      # XGBoost implementation

library(tidyverse)    # general purpose data manipulation
library(textstem)     # word lemmatization
library(caret)        # model evaluation

The collection of documents I am working with is a sample of documents that appeared on the Reuters newswire in 1987. These data are commonly used for demonstrating text mining applications.

The first step is to load the raw text documents into the R session. The readr::read_lines() function reads text line by line and converts each line of text to a row in a table. This is the general approach to converting text documents to a familiar tabular data structure.
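The contents of the list construction below are not shown. As a hypothetical sketch of the same pattern — using readr's ability to treat a literal string containing newlines as file contents, where real code would pass paths to the Reuters SGML files (the file names here are invented) — it looks like this:

```r
library(readr)
library(purrr)
library(tibble)

# stand-ins for the Reuters files; in practice these would be
# file paths such as "reut2-train.sgm" (hypothetical names)
raw_files = list(
  reuters_train = "<REUTERS LEWISSPLIT=\"TRAIN\">\n<DATE>26-FEB-1987</DATE>",
  reuters_test  = "<REUTERS LEWISSPLIT=\"TEST\">\n<DATE>8-APR-1987</DATE>"
)

# read line by line, then convert each line of text to a table row
reuters_sketch = raw_files %>%
  map(read_lines) %>%
  map(function(x) tibble(txt = x))

reuters_sketch$reuters_train
# a 2-row tibble with one line of raw text per row
```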

reuters = list(
) %>%
  map(., function(x) tibble::tibble(txt = x))

reuters
## $reuters_train
## # A tibble: 311,980 x 1
##    txt
##    <chr>
##  1 "<REUTERS TOPICS=\"YES\" LEWISSPLIT=\"TRAIN\" CGISPLIT=\"TRAINING-SET\"~
##  2 <DATE>26-FEB-1987 15:01:01.79</DATE>
##  3 <TOPICS><D>cocoa</D></TOPICS>
##  4 <PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
##  5 <PEOPLE></PEOPLE>
##  6 <ORGS></ORGS>
##  7 <EXCHANGES></EXCHANGES>
##  8 <COMPANIES></COMPANIES>
##  9 "<UNKNOWN> "
## 10 &#5;&#5;&#5;C T
## # ... with 311,970 more rows
##
## $reuters_test
## # A tibble: 105,392 x 1
##    txt
##    <chr>
##  1 "<REUTERS TOPICS=\"YES\" LEWISSPLIT=\"TEST\" CGISPLIT=\"TRAINING-SET\" ~
##  2 <DATE> 8-APR-1987 01:03:47.52</DATE>
##  4 <PLACES><D>hong-kong</D><D>usa</D><D>japan</D><D>taiwan</D><D>malaysia<~
##  5 <PEOPLE></PEOPLE>
##  6 <ORGS></ORGS>
##  7 <EXCHANGES></EXCHANGES>
##  8 <COMPANIES></COMPANIES>
##  9 "<UNKNOWN> "
## 10 &#5;&#5;&#5;RM C
## # ... with 105,382 more rows

# Extract Individual News Articles

Our data is currently in the form of one row per line of text. For this particular collection of documents, each document ends with a "</REUTERS>" tag. When performing tokenization with the tidytext package, we can specify this pattern to separate individual documents. Furthermore, we will parse each document in search of the body of the text (i.e. the actual article and not the metadata).

The function below will perform this step in addition to some others, including:

1. Separate individual documents (i.e. document tokenization)
2. Extract text between <body> tags containing the body of the news article
3. Remove non-alphabetic characters (e.g. numbers, symbols)

get_document_text = function(x) {
  x %>%
    # document tokens
    tidytext::unnest_tokens(output = article,
                            input = txt,
                            token = "regex",
                            pattern = "</REUTERS>") %>%
    # id docs
    mutate(doc_id = row_number()) %>%
    # extract text between body tags
    filter(str_detect(article, "<body>")) %>%
    mutate(body_text = sub(".*<body> *(.*?) *</body>.*", "\\1", article)) %>%
    # remove non-alphabetic characters and extra whitespace
    mutate(body_text = str_replace_all(body_text, "[^[:alpha:]]", " ")) %>%
    mutate(body_text = trimws(body_text)) %>%
    select(-article)
}
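To see what the sub() call inside get_document_text() extracts, here is the same pattern applied to a single toy document string (the tags are lowercase because unnest_tokens lowercases its output):

```r
doc = "<date>26-feb-1987</date> <body> showers continued all week </body> trailing metadata"

# keep only the text between the body tags
sub(".*<body> *(.*?) *</body>.*", "\\1", doc)
## [1] "showers continued all week"
```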

Apply get_document_text() to training and testing sets and return a data structure with one-row-per-document and the body of the article.

document_text = reuters %>%
  map(., get_document_text)

document_text
## $reuters_train
## # A tibble: 8,762 x 2
##    doc_id body_text
##     <int> <chr>
##  1      1 showers continued throughout the week in the bahia cocoa zone a~
##  2      2 the u s agriculture department reported the farmer owned reserv~
##  3      3 argentine grain board figures show crop registrations of grains~
##  4      4 moody s investors service inc said it lowered the debt and prefe~
##  5      5 champion products inc said its board of directors approved a two~
##  6      6 computer terminal systems inc said it has completed the sale of ~
##  7      7 shr cts vs dlrs net vs assets ~
##  8      8 ohio mattress co said its first quarter ending february pro~
##  9      9 oper shr loss two cts vs profit seven cts oper shr profit ~
## 10     10 shr one dlr vs cts net mln vs mln revs ~
## # ... with 8,752 more rows
##
## $reuters_test
## # A tibble: 3,009 x 2
##    doc_id body_text
##     <int> <chr>
##  1      1 mounting trade friction between the u s  and japan has raised fe~
##  2      2 a survey of    provinces and seven cities showed vermin consume ~
##  3      3 the ministry of international trade and industry  miti  will rev~
##  4      4 thailand s trade deficit widened to     billion baht in the firs~
##  5      5 indonesia expects crude palm oil  cpo  prices to rise sharply to~
##  6      7 tug crews in new south wales  nsw   victoria and western austral~
##  7      8 the indonesian commodity exchange is likely to start trading in ~
##  8      9 food department officials said the u s  department of agricultur~
##  9     10 western mining corp holdings ltd  lt wmng s   wmc  said it will ~
## 10     11 sumitomo bank ltd  lt sumi t  is certain to lose its status as j~
## # ... with 2,999 more rows

# Get Documents with Topic “earn”

1. Extract text between <TOPICS> tags for each document
2. Search <TOPICS> tags for the term “earn.”
3. Create a reference table of matching documents

# A function to search documents containing a specified topic
search_documents_for_topic = function(topic) {

  reuters %>%
    map(., function(x) {
      x %>%
        # get text between TOPICS tags
        filter(str_detect(txt, "<TOPICS>")) %>%
        mutate(doc_id = row_number()) %>%
        # retain documents with matching topic
        filter(str_detect(txt, topic)) %>%
        mutate(earn_ind = 1L) %>%
        select(doc_id, earn_ind)
    })
}

Run search_documents_for_topic("earn") to obtain an indexed collection of documents containing "earn" in the set of topics.

# create reference table of matching docs
documents_with_earn = search_documents_for_topic("earn")

documents_with_earn
## $reuters_train
## # A tibble: 2,877 x 2
##    doc_id earn_ind
##     <int>    <int>
##  1      5        1
##  2      7        1
##  3      8        1
##  4      9        1
##  5     10        1
##  6     11        1
##  7     15        1
##  8     16        1
##  9     18        1
## 10     22        1
## # ... with 2,867 more rows
##
## $reuters_test
## # A tibble: 1,087 x 2
##    doc_id earn_ind
##     <int>    <int>
##  1     21        1
##  2     22        1
##  3     31        1
##  4     32        1
##  5     33        1
##  6     34        1
##  7     50        1
##  8     54        1
##  9     58        1
## 10     67        1
## # ... with 1,077 more rows

# Label Documents

Using the reference table of documents containing the topic of “earn,” label all training and testing documents with a binary indicator for the presence (1) or absence (0) of the topic “earn.”

labeled_documents = map2(document_text, documents_with_earn, function(x, y) {
  left_join(x = x,
            y = y,
            by = "doc_id") %>%
    mutate(earn_ind = ifelse(is.na(earn_ind), "0", earn_ind))
})

labeled_documents
## $reuters_train
## # A tibble: 8,762 x 3
##    doc_id body_text                                                earn_ind
##     <int> <chr>                                                    <chr>
##  1      1 showers continued throughout the week in the bahia coco~ 0
##  2      2 the u s agriculture department reported the farmer own~  0
##  3      3 argentine grain board figures show crop registrations o~ 0
##  4      4 moody s investors service inc said it lowered the debt ~ 0
##  5      5 champion products inc said its board of directors appro~ 1
##  6      6 computer terminal systems inc said it has completed the~ 0
##  7      7 shr cts vs dlrs net vs ~                                 1
##  8      8 ohio mattress co said its first quarter ending februar~  1
##  9      9 oper shr loss two cts vs profit seven cts oper shr ~     1
## 10     10 shr one dlr vs cts net mln vs mln ~                      1
## # ... with 8,752 more rows
##
## $reuters_test
## # A tibble: 3,009 x 3
##    doc_id body_text                                                earn_ind
##     <int> <chr>                                                    <chr>
##  1      1 mounting trade friction between the u s  and japan has ~ 0
##  2      2 a survey of    provinces and seven cities showed vermin~ 0
##  3      3 the ministry of international trade and industry  miti ~ 0
##  4      4 thailand s trade deficit widened to     billion baht in~ 0
##  5      5 indonesia expects crude palm oil  cpo  prices to rise s~ 0
##  6      7 tug crews in new south wales  nsw   victoria and wester~ 0
##  7      8 the indonesian commodity exchange is likely to start tr~ 0
##  8      9 food department officials said the u s  department of a~ 0
##  9     10 western mining corp holdings ltd  lt wmng s   wmc  said~ 0
## 10     11 sumitomo bank ltd  lt sumi t  is certain to lose its st~ 0
## # ... with 2,999 more rows

# Create Word Tokens

The labeled data set contains an indexed collection of documents that have been parsed, cleansed, and labeled for classification. The next steps are to:

1. Tokenize documents by word
2. Remove stopwords
3. Apply lemmatization to standardize and reduce the number of unique terms. Applying lemmatization necessitates that we again eliminate non-alphabetic characters as the lemmatization method converts some words to digits (e.g. “tenth” -> 10).
4. Obtain vector of tokens (words) for each document in training and testing sets

tokens = labeled_documents %>%
  map(., function(x) {
    x %>%
      tidytext::unnest_tokens(word, body_text) %>%
      ungroup() %>%
      # remove stopwords
      anti_join(stop_words, by = "word") %>%
      # lemmatize
      mutate(word = textstem::lemmatize_words(word)) %>%
      mutate(word = str_replace_all(word, "[^[:alpha:]]", " ")) %>%
      mutate(word = str_replace(gsub("\\s+", " ", str_trim(word)), "B", "b")) %>%
      # obtain the vector of words (tokens) for each document
      split(.$doc_id) %>%
      map(., function(x) x %>% pull(word))
  })

Here is a sample of 10 words from the first document in the training set.

# first ten words found in the first training document
tokens$reuters_train$`1`[1:10]
##  [1] "shower"    "continue"  "week"      "bahia"     "cocoa"
##  [6] "zone"      "alleviate" "drought"   "january"   "improve"

# Vectorize Vocabulary from Training Documents

Using the set of tokens from the training set, create a training vocabulary that collects unique terms and corresponding statistics. Then transform the vocabulary into the feature vector space for each document.

# collect unique terms and mark with id
iter_train = text2vec::itoken(iterable = tokens$reuters_train,
                              ids = labeled_documents$reuters_train$doc_id,
                              progressbar = FALSE)

vocab = text2vec::create_vocabulary(iter_train)

# Create Document-Term Matrix

First we vectorize each training document to generate features and convert the data structure to a sparse matrix. Because most entries in a document-term matrix are zero, text2vec stores it using the sparse matrix classes from the Matrix package.

vectorizer = text2vec::vocab_vectorizer(vocab)

doc_term_train = text2vec::create_dtm(iter_train, vectorizer)

head(doc_term_train)
## 6 x 19014 sparse Matrix of class "dgCMatrix"
##
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
##
##  .....suppressing columns in show(); maybe adjust 'options(max.print= *, width = *)'
##  ..............................

The document-term matrix is a sparse matrix implementation from the Matrix package.
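To make that structure concrete, here is a tiny document-term matrix built by hand with Matrix::sparseMatrix (the terms and counts are invented for illustration); only the nonzero counts are stored:

```r
library(Matrix)

# nonzero (document, term) count triplets
dtm_toy = sparseMatrix(
  i = c(1, 2, 2, 3, 3),   # document (row) indices
  j = c(1, 2, 3, 3, 4),   # term (column) indices
  x = c(2, 1, 1, 1, 3),   # term counts
  dims = c(3, 4),
  dimnames = list(paste0("doc", 1:3),
                  c("cocoa", "profit", "shr", "cts"))
)

class(dtm_toy)          # the same "dgCMatrix" class shown above
dtm_toy["doc3", "cts"]  # counts are recovered by ordinary indexing
```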

# Train Classifier

The author of text2vec demonstrates the effectiveness of the logistic classifier available in the glmnet package. This implementation offers penalization and cross-validation, and it trains relatively quickly. The feature matrix is the document-term matrix created in the previous step. The vector of labels comes from the collection of labeled documents.

# 4-fold cross validation
n_folds = 4

glmnet_classifier = glmnet::cv.glmnet(
  x = doc_term_train,
  y = labeled_documents$reuters_train %>% pull(earn_ind),
  # set binary classification
  family = 'binomial',
  # L1 penalty
  alpha = 1,
  # interested in the area under ROC curve
  type.measure = "auc",
  # 4-fold cross-validation
  nfolds = n_folds,
  # higher value is less accurate, but has faster training
  thresh = 1e-3,
  # again, lower number of iterations for faster training
  maxit = 1e3)

To measure the performance of the classifier, we can look at the AUROC.

paste("Maximum AUROC from training:",
      max(glmnet_classifier$cvm) %>% round(., 4))
## [1] "Maximum AUROC from training: 0.9911"

# Predict Test Set

To make predictions for the test set, we first need to vectorize the documents in the test set using the vectorizer created from the training data. This maps the test documents into the same feature space as the training data: only words in the training vocabulary become features, and new words in the test set are ignored.

1. Create a document-term matrix from the vocabulary
iter_test = text2vec::itoken(iterable = tokens$reuters_test,
                             ids = labeled_documents$reuters_test$doc_id,
                             progressbar = FALSE)

doc_term_test = create_dtm(iter_test, vectorizer)

## Class Probabilities

# obtain probability of document containing "earn"
probs = predict(glmnet_classifier, doc_term_test, type = 'response')[, 1]

test_auc = glmnet:::auc(y = labeled_documents$reuters_test$earn_ind %>% as.integer(),
                        prob = probs)

paste("AUROC for testing set:", round(test_auc, 4))
## [1] "AUROC for testing set: 0.9941"

## Class Labels

Alternatively, we can predict the class labels directly. Class labels are determined based on the class with the highest predicted probability.

class_labels = predict(glmnet_classifier, doc_term_test, type = 'class') %>%
  as.factor()

caret::confusionMatrix(data = class_labels,
                       reference = as.factor(labeled_documents$reuters_test$earn_ind),
                       positive = "1",
                       mode = "prec_recall")
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 1942   78
##          1   23  966
##
##                Accuracy : 0.9664
##                  95% CI : (0.9594, 0.9726)
##     No Information Rate : 0.653
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.925
##  Mcnemar's Test P-Value : 7.735e-08
##
##               Precision : 0.9767
##                  Recall : 0.9253
##                      F1 : 0.9503
##              Prevalence : 0.3470
##          Detection Rate : 0.3210
##    Detection Prevalence : 0.3287
##       Balanced Accuracy : 0.9568
##
##        'Positive' Class : 1
##

# Replicate Model Using Pruned Vocabulary

Pruning the vocabulary reduces the size of the vocabulary, leading to faster performance in training and scoring. The following function replicates the modeling process that was previously described. The basic procedure is to create document-term matrices from feature vector spaces, train the classifier, and measure the AUROC.

replicate_model = function(vectorizer) {

  doc_term_train = create_dtm(iter_train, vectorizer)
  doc_term_test = create_dtm(iter_test, vectorizer)

  train_labels = labeled_documents$reuters_train %>% pull(earn_ind)
  test_labels = labeled_documents$reuters_test %>% pull(earn_ind)

  n_folds = 4

  glmnet_classifier = cv.glmnet(
    x = doc_term_train,
    y = train_labels,
    family = 'binomial',
    alpha = 1,
    type.measure = "auc",
    nfolds = n_folds,
    thresh = 1e-3,
    maxit = 1e3
  )

  probs = predict(glmnet_classifier, doc_term_test, type = 'response')[, 1]
  class_labels = predict(glmnet_classifier, doc_term_test, type = 'class')

  auc = glmnet:::auc(y = test_labels %>% as.integer(), prob = probs)

  list(auc = auc,
       predicted_probs = probs,
       predicted_labels = class_labels,
       true_labels = test_labels) %>%
    return()
}

Here we create the vocabulary from the training data, prune, and vectorize the documents.

vocab = text2vec::create_vocabulary(iter_train)

pruned_vocab = text2vec::prune_vocabulary(vocab,
                                          term_count_min = 50,
                                          doc_proportion_max = 0.5,
                                          doc_proportion_min = 0.001)

pruned_vectorizer = vocab_vectorizer(pruned_vocab)

replicate_model(pruned_vectorizer)$auc
## [1] 0.9948822
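As a sanity check, the precision, recall, and F1 reported by caret earlier can be recomputed in base R from the four confusion-matrix counts:

```r
# counts from the confusion matrix (positive class = "1")
tp = 966   # predicted 1, truly 1
fp = 23    # predicted 1, truly 0
fn = 78    # predicted 0, truly 1
tn = 1942  # predicted 0, truly 0

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

round(c(precision, recall, f1), 4)
## [1] 0.9767 0.9253 0.9503
```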

# N-grams
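An n-gram vocabulary treats short sequences of adjacent tokens as additional features, so some word order is preserved. text2vec joins the words of an n-gram with an underscore; a base R sketch of the bigram idea, on made-up tokens:

```r
words = c("first", "quarter", "net", "profit")

# pair each word with its successor, joined by "_"
bigrams = paste(head(words, -1), tail(words, -1), sep = "_")

bigrams
## [1] "first_quarter" "quarter_net"   "net_profit"

# with ngram = c(1L, 2L), both unigrams and bigrams become features
c(words, bigrams)
```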

# bi-grams
bigrams = create_vocabulary(iter_train, ngram = c(1L, 2L))
bigram_vectorizer = vocab_vectorizer(bigrams)

replicate_model(bigram_vectorizer)$auc
## [1] 0.9951059

# Feature Hashing

hash_vectorizer = hash_vectorizer(hash_size = 2 ^ 14, ngram = c(1L, 2L))

replicate_model(hash_vectorizer)$auc
## [1] 0.9960145
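Instead of building and storing a vocabulary, feature hashing maps each token (or n-gram) straight to one of a fixed number of columns with a hash function. Memory use is bounded by hash_size, at the cost of occasional collisions where unrelated terms share a column. A sketch of the idea using a toy polynomial hash (for illustration only — not the hash function text2vec uses internally):

```r
n_buckets = 2 ^ 4  # 16 columns; the model above uses 2 ^ 14

# toy string hash: maps any word to a column index in 1..size
bucket = function(word, size) {
  codes = utf8ToInt(word)
  (sum(codes * 31 ^ (seq_along(codes) - 1)) %% size) + 1
}

words = c("profit", "loss", "shr", "dividend")
sapply(words, bucket, size = n_buckets)
# every word lands in a column between 1 and n_buckets,
# regardless of how many distinct words the corpus contains
```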

# TF-IDF

Term frequency - inverse document frequency (TF-IDF) is a measure used to identify terms that make each document unique. This degree of uniqueness for a word in a document is used as a weight to normalize the documents. Common terms are given less weight, and rare terms are given more weight.
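A common formulation computed by hand on a toy count matrix (note that text2vec's TfIdf class applies its own term-frequency normalization, so its exact weights differ):

```r
# toy term counts: 3 documents x 3 terms
counts = matrix(c(2, 0, 1,
                  0, 1, 1,
                  1, 0, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("doc", 1:3),
                                c("cocoa", "profit", "say")))

# term frequency: each term's share of its document's tokens
tf = counts / rowSums(counts)

# inverse document frequency: rare terms score higher
idf = log(nrow(counts) / colSums(counts > 0))

# weight each column of tf by its idf
tf_idf_toy = sweep(tf, 2, idf, `*`)

# "say" appears in every document, so its idf -- and weight -- is zero
round(tf_idf_toy, 3)
```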

# create vocabulary from training data
vocab = create_vocabulary(iter_train)

# vectorize documents and create DTM from training data
vectorizer = vocab_vectorizer(vocab)
doc_term_train = create_dtm(iter_train, vectorizer)

# define tf-idf model
tf_idf = TfIdf$new()

# fit tf-idf to training data
doc_term_train_tfidf = fit_transform(doc_term_train, tf_idf)

# apply pre-trained tf-idf transformation to testing data
doc_term_test_tfidf = create_dtm(iter_test, vectorizer) %>%
  transform(tf_idf)

# build classifier using tf-idf normalized documents as features
glmnet_classifier = cv.glmnet(
  x = doc_term_train_tfidf,
  y = labeled_documents$reuters_train %>% pull(earn_ind),
  family = 'binomial',
  alpha = 1,
  type.measure = "auc",
  nfolds = n_folds,
  thresh = 1e-3,
  maxit = 1e3)

# AUROC from training
paste("Maximum AUROC from training:",
      max(glmnet_classifier$cvm) %>% round(., 4))
## [1] "Maximum AUROC from training: 0.9941"

# class probabilities for documents in testing set
probs = predict(glmnet_classifier, doc_term_test_tfidf, type = 'response')[, 1]

# AUROC from testing
labeled_documents$reuters_test$earn_ind %>%
  as.integer() %>%
  glmnet:::auc(., prob = probs)
## [1] 0.9971713

# XGBoost

The XGBoost algorithm is a tree-based method used for classification. An ensemble technique, it combines the output from a sequence of decision trees into an aggregate score (or label). Each tree is built to correct the errors made by the trees before it.

# earn_ind is stored as character, so convert it to integer for xgboost
xgb = xgboost::xgboost(
  data = doc_term_train_tfidf,
  label = labeled_documents$reuters_train[['earn_ind']] %>% as.integer(),
nrounds = 100,
objective = "binary:logistic")
## [1]  train-error:0.031386
## [2]  train-error:0.027848
## [3]  train-error:0.024652
## [4]  train-error:0.023967
## [5]  train-error:0.018375
## [6]  train-error:0.016891
## [7]  train-error:0.015636
## [8]  train-error:0.014152
## [9]  train-error:0.013125
## [10] train-error:0.012782
## [11] train-error:0.012098
## [12] train-error:0.010728
## [13] train-error:0.010386
## [14] train-error:0.010386
## [15] train-error:0.009587
## [16] train-error:0.009244
## [17] train-error:0.007875
## [18] train-error:0.007190
## [19] train-error:0.007190
## [20] train-error:0.006505
## [21] train-error:0.006505
## [22] train-error:0.005592
## [23] train-error:0.005592
## [24] train-error:0.005136
## [25] train-error:0.004908
## [26] train-error:0.004565
## [27] train-error:0.003880
## [28] train-error:0.003766
## [29] train-error:0.003538
## [30] train-error:0.003424
## [31] train-error:0.003310
## [32] train-error:0.003081
## [33] train-error:0.002853
## [34] train-error:0.002967
## [35] train-error:0.002967
## [36] train-error:0.002739
## [37] train-error:0.002625
## [38] train-error:0.002511
## [39] train-error:0.002168
## [40] train-error:0.001940
## [41] train-error:0.002168
## [42] train-error:0.002054
## [43] train-error:0.002054
## [44] train-error:0.001712
## [45] train-error:0.001598
## [46] train-error:0.001484
## [47] train-error:0.001484
## [48] train-error:0.001370
## [49] train-error:0.001370
## [50] train-error:0.001141
## [51] train-error:0.001027
## [52] train-error:0.001027
## [53] train-error:0.000913
## [54] train-error:0.000913
## [55] train-error:0.000913
## [56] train-error:0.000799
## [57] train-error:0.000799
## [58] train-error:0.000685
## [59] train-error:0.000571
## [60] train-error:0.000685
## [61] train-error:0.000799
## [62] train-error:0.000571
## [63] train-error:0.000457
## [64] train-error:0.000457
## [65] train-error:0.000342
## [66] train-error:0.000342
## [67] train-error:0.000228
## [68] train-error:0.000228
## [69] train-error:0.000228
## [70] train-error:0.000342
## [71] train-error:0.000228
## [72] train-error:0.000228
## [73] train-error:0.000228
## [74] train-error:0.000228
## [75] train-error:0.000228
## [76] train-error:0.000228
## [77] train-error:0.000228
## [78] train-error:0.000228
## [79] train-error:0.000228
## [80] train-error:0.000228
## [81] train-error:0.000228
## [82] train-error:0.000228
## [83] train-error:0.000228
## [84] train-error:0.000228
## [85] train-error:0.000228
## [86] train-error:0.000228
## [87] train-error:0.000228
## [88] train-error:0.000228
## [89] train-error:0.000228
## [90] train-error:0.000228
## [91] train-error:0.000228
## [92] train-error:0.000228
## [93] train-error:0.000228
## [94] train-error:0.000228
## [95] train-error:0.000228
## [96] train-error:0.000228
## [97] train-error:0.000228
## [98] train-error:0.000228
## [99] train-error:0.000228
## [100]    train-error:0.000228
xgb$evaluation_log %>%
  as_tibble() %>%
  ggplot(aes(x = iter, y = train_error)) +
  geom_line() +
  labs(title = "XGBoost Training Error")

xgb_probs = predict(xgb, doc_term_test_tfidf, type = 'response')

labeled_documents$reuters_test$earn_ind %>%
  as.integer() %>%
  glmnet:::auc(., prob = xgb_probs)
## [1] 0.9988267