Tidy Text Mining in R

Danny Morris

2019/02/10

R Packages

# install.packages(...)

library(tidytext)
library(SnowballC)
library(textstem)
## Warning: package 'textstem' was built under R version 3.5.3
## Loading required package: koRpus.lang.en
## Warning: package 'koRpus.lang.en' was built under R version 3.5.3
## Loading required package: koRpus
## Warning: package 'koRpus' was built under R version 3.5.3
## Loading required package: sylly
## Warning: package 'sylly' was built under R version 3.5.3
## For information on available language packages for 'koRpus', run
## 
##   available.koRpus.lang()
## 
## and see ?install.koRpus.lang()
library(tidyverse)
## -- Attaching packages ----------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0       v purrr   0.3.1  
## v tibble  2.1.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.5.3
## -- Conflicts -------------------------------- tidyverse_conflicts() --
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
## x readr::tokenize() masks koRpus::tokenize()
library(analogue)
## Warning: package 'analogue' was built under R version 3.5.3
## Loading required package: vegan
## Warning: package 'vegan' was built under R version 3.5.3
## Loading required package: permute
## Warning: package 'permute' was built under R version 3.5.3
## Loading required package: lattice
## This is vegan 2.5-4
## analogue version 0.17-1

Data

reuters <- readr::read_lines("reuters-train") %>%
  tibble(txt = .) 
reuters %>% head()
## # A tibble: 6 x 1
##   txt                                                                      
##   <chr>                                                                    
## 1 "<REUTERS TOPICS=\"YES\" LEWISSPLIT=\"TRAIN\" CGISPLIT=\"TRAINING-SET\" ~
## 2 <DATE>26-FEB-1987 15:01:01.79</DATE>                                     
## 3 <TOPICS><D>cocoa</D></TOPICS>                                            
## 4 <PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>              
## 5 <PEOPLE></PEOPLE>                                                        
## 6 <ORGS></ORGS>

Tokenization

The unnest_tokens() function uses the tokenizers package to split each line into tokens. The default is to tokenize into words, but other options include characters, ngrams, sentences, lines, paragraphs, or separation around a regex pattern.
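
As a quick illustration of the non-default tokenizers, the same function can produce bigrams or sentences. This is a minimal sketch on a made-up two-line tibble (not the Reuters data):

example_text <- tibble(txt = c("Showers continued throughout the week.",
                               "Cocoa deliveries remained slow."))

# bigram tokens: one row per pair of adjacent words
example_text %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)

# sentence tokens: one row per sentence
example_text %>%
  unnest_tokens(sentence, txt, token = "sentences")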

Reuters Words

word_tokens <- reuters %>%
  unnest_tokens(word, txt) 

word_tokens %>% head()
## # A tibble: 6 x 1
##   word      
##   <chr>     
## 1 reuters   
## 2 topics    
## 3 yes       
## 4 lewissplit
## 5 train     
## 6 cgisplit

Reuters Documents

For the Reuters data, each document ends with the closing </REUTERS> tag. Splitting the text around this tag results in one row per document, with each row containing the full document text.

document_tokens <- reuters %>%
  unnest_tokens(output = article, 
                input = txt, 
                token = "regex", 
                pattern = "</REUTERS>")

document_tokens %>% head()
## # A tibble: 6 x 1
##   article                                                                  
##   <chr>                                                                    
## 1 "<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"training-set\" ~
## 2 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"training-set\~
## 3 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"training-set\~
## 4 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"training-set\~
## 5 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"training-set\~
## 6 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"training-set\~

Stemming

Stemming algorithms work by cutting off the end or the beginning of a word, taking into account a list of common prefixes and suffixes found in inflected words. This indiscriminate cutting succeeds in some cases, but not always. Several algorithms can be used for stemming, but the most common for English is the Porter stemmer. Its rules are divided into five phases, numbered 1 to 5, and their purpose is to reduce each word to its root.

Use the wordStem() function from the SnowballC package to stem a vector of words. Note that wordStem() operates on individual words, so applying it to a whole string stems only the final word, as the second example below shows.

SnowballC::wordStem(c("talking", "ran"))
## [1] "talk" "ran"
SnowballC::wordStem("He stopped talking loudly")
## [1] "He stopped talking loudli"
word_tokens %>%
  filter(word %in% c('reuters', 'companies', 'authorized')) %>%
  distinct() %>%
  mutate(word_stem = SnowballC::wordStem(word))
## # A tibble: 3 x 2
##   word       word_stem
##   <chr>      <chr>    
## 1 reuters    reuter   
## 2 companies  compani  
## 3 authorized author

Lemmatization

Lemmatization takes the morphological analysis of words into consideration. To do so, it requires detailed dictionaries that the algorithm can look through to link each inflected form back to its lemma. The key to this methodology is linguistics: to extract the proper lemma, the morphology of each word must be analyzed.

Use lemmatize_words() from the textstem package to lemmatize a vector of individual words. Use lemmatize_strings() to lemmatize words within a string without extracting the words first.

textstem::lemmatize_words(c("talking", "ran"))
## [1] "talk" "run"
textstem::lemmatize_strings("He stopped talking loudly")
## [1] "He stop talk loudly"
word_tokens %>%
  filter(word %in% c('reuters', 'companies', 'authorized')) %>%
  distinct() %>%
  mutate(word_lemma = textstem::lemmatize_words(word))
## # A tibble: 3 x 2
##   word       word_lemma
##   <chr>      <chr>     
## 1 reuters    reuters   
## 2 companies  company   
## 3 authorized authorize

Developing a stemmer is far simpler than building a lemmatizer, but the tradeoff is a loss of quality. For lemmatization, deep linguistic knowledge is required to create the dictionaries that allow the algorithm to look up the proper form of each word. Once that is done, noise is reduced and the results of the information retrieval process are more accurate.
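
To see the tradeoff side by side, here is a small sketch on a few illustrative words (chosen for this example, not drawn from the Reuters data):

tibble(word = c("studies", "running", "better", "feet")) %>%
  mutate(word_stem  = SnowballC::wordStem(word),
         word_lemma = textstem::lemmatize_words(word))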

Stopwords

Stopwords are commonly used words that provide little to no information about the text and are best ignored. “The” and “so” are examples.

word_tokens %>%
  inner_join(stop_words, by = "word") %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 660 x 2
##    word       n
##    <chr>  <int>
##  1 the   196413
##  2 of    106221
##  3 to    103692
##  4 in     79146
##  5 and    76635
##  6 a      74727
##  7 said   47594
##  8 d      44786
##  9 for    38232
## 10 it     30801
## # ... with 650 more rows
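
The inner_join() above keeps only the stopwords, which is handy for counting them. To drop them from the tokens instead, swap in an anti_join() (a minimal sketch reusing word_tokens from above):

word_tokens %>%
  anti_join(stop_words, by = "word") %>%
  head()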

Application: Extracting Topics from Reuters Documents

Among other things, each Reuters document is defined by a list of topics. Topics are found within the <TOPICS> tags, and all topics appear on a single line within the text. Documents can have no topics, a single topic, or multiple topics.

The goal of this application is to extract all topics from each document for analysis.

Extract <TOPICS> Tags

reuters_topics <- reuters %>%
  filter(str_detect(txt, "<TOPICS>")) %>%
  mutate(doc_id = row_number())

reuters_topics %>% head()
## # A tibble: 6 x 2
##   txt                                                                doc_id
##   <chr>                                                               <int>
## 1 <TOPICS><D>cocoa</D></TOPICS>                                           1
## 2 <TOPICS><D>grain</D><D>wheat</D><D>corn</D><D>barley</D><D>oat</D~      2
## 3 <TOPICS><D>veg-oil</D><D>linseed</D><D>lin-oil</D><D>soy-oil</D><~      3
## 4 <TOPICS></TOPICS>                                                       4
## 5 <TOPICS><D>earn</D></TOPICS>                                            5
## 6 <TOPICS><D>acq</D></TOPICS>                                             6

Document-Topic Co-Occurrence Matrix

reuters_topics %>%
  mutate(txt = str_replace_all(txt, "D|TOPICS|-", "")) %>%
  unnest_tokens(word, txt) %>%
  mutate(topic_ind = 1) %>%
  distinct() %>%
  spread(word, topic_ind, fill = 0) 
## # A tibble: 7,775 x 116
##    doc_id   acq  alum austdlr barley   bop   can carcass castoroil
##     <int> <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl>   <dbl>     <dbl>
##  1      1     0     0       0      0     0     0       0         0
##  2      2     0     0       0      1     0     0       0         0
##  3      3     0     0       0      0     0     0       0         0
##  4      5     0     0       0      0     0     0       0         0
##  5      6     1     0       0      0     0     0       0         0
##  6      7     0     0       0      0     0     0       0         0
##  7      8     1     0       0      0     0     0       0         0
##  8      9     0     0       0      0     0     0       0         0
##  9     10     0     0       0      0     0     0       0         0
## 10     11     0     0       0      0     0     0       0         0
## # ... with 7,765 more rows, and 107 more variables: castorseed <dbl>,
## #   citruspulp <dbl>, cocoa <dbl>, coconut <dbl>, coconutoil <dbl>,
## #   coffee <dbl>, copper <dbl>, copracake <dbl>, corn <dbl>,
## #   cornglutenfeed <dbl>, cornoil <dbl>, cotton <dbl>, cottonoil <dbl>,
## #   cpi <dbl>, cpu <dbl>, crude <dbl>, cruzado <dbl>, dfl <dbl>,
## #   dkr <dbl>, dlr <dbl>, dmk <dbl>, earn <dbl>, fishmeal <dbl>,
## #   fuel <dbl>, gas <dbl>, gnp <dbl>, gold <dbl>, grain <dbl>,
## #   groundnut <dbl>, groundnutoil <dbl>, heat <dbl>, hog <dbl>,
## #   housing <dbl>, income <dbl>, instaldebt <dbl>, interest <dbl>,
## #   inventories <dbl>, ipi <dbl>, ironsteel <dbl>, jet <dbl>, jobs <dbl>,
## #   lcattle <dbl>, lead <dbl>, lei <dbl>, linmeal <dbl>, linoil <dbl>,
## #   linseed <dbl>, lit <dbl>, livestock <dbl>, lumber <dbl>,
## #   mealfeed <dbl>, moneyfx <dbl>, moneysupply <dbl>, naphtha <dbl>,
## #   natgas <dbl>, nickel <dbl>, nkr <dbl>, nzdlr <dbl>, oat <dbl>,
## #   oilseed <dbl>, orange <dbl>, palladium <dbl>, palmkernel <dbl>,
## #   palmoil <dbl>, peseta <dbl>, petchem <dbl>, platinum <dbl>,
## #   plywood <dbl>, porkbelly <dbl>, potato <dbl>, propane <dbl>,
## #   rand <dbl>, rapemeal <dbl>, rapeoil <dbl>, rapeseed <dbl>,
## #   redbean <dbl>, reserves <dbl>, retail <dbl>, rice <dbl>,
## #   ringgit <dbl>, rubber <dbl>, rupiah <dbl>, rye <dbl>, saudriyal <dbl>,
## #   ship <dbl>, silver <dbl>, skr <dbl>, sorghum <dbl>, soybean <dbl>,
## #   soymeal <dbl>, soyoil <dbl>, stg <dbl>, strategicmetal <dbl>,
## #   sugar <dbl>, sunmeal <dbl>, sunoil <dbl>, sunseed <dbl>,
## #   tapioca <dbl>, tea <dbl>, tin <dbl>, ...

It appears about 2,000 documents have no topics.
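
One way to check that observation directly is to count the <TOPICS> lines whose tag pair is empty, as in the sketch below (a quick sanity check, not shown in the output above):

reuters_topics %>%
  filter(str_detect(txt, "<TOPICS></TOPICS>")) %>%
  nrow()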

Documents Containing Topic “earn”

earn_topics <- reuters_topics %>%
  filter(str_detect(txt, "earn")) 

earn_docs <- document_tokens %>%
  mutate(doc_id = row_number()) %>%
  semi_join(earn_topics, by = "doc_id") 

earn_docs
## # A tibble: 2,877 x 2
##    article                                                           doc_id
##    <chr>                                                              <int>
##  1 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"train~      5
##  2 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"train~      7
##  3 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"train~      8
##  4 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"train~      9
##  5 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"train~     10
##  6 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"train~     11
##  7 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"train~     15
##  8 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"train~     16
##  9 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"train~     18
## 10 "\n<reuters topics=\"yes\" lewissplit=\"train\" cgisplit=\"train~     22
## # ... with 2,867 more rows

Topics Co-occurring with “earn”

earn_topics %>%
  mutate(txt = str_replace_all(txt, "D|TOPICS", "")) %>%
  unnest_tokens(word, txt) %>%
  mutate(topic_ind = 1) %>%
  spread(word, topic_ind, fill = 0) %>%
  select(doc_id, earn, everything()) %>%
  head()
## # A tibble: 6 x 15
##   doc_id  earn   acq  alum  chem copper crude   gas  iron metal   nat   pet
##    <int> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1      5     1     0     0     0      0     0     0     0     0     0     0
## 2      7     1     0     0     0      0     0     0     0     0     0     0
## 3      8     1     1     0     0      0     0     0     0     0     0     0
## 4      9     1     0     0     0      0     0     0     0     0     0     0
## 5     10     1     0     0     0      0     0     0     0     0     0     0
## 6     11     1     0     0     0      0     0     0     0     0     0     0
## # ... with 3 more variables: ship <dbl>, steel <dbl>, strategic <dbl>

Lemmatized Words Appearing in Documents Containing “earn”

earn_docs %>%
  unnest_tokens(word, article) %>%
  filter(!str_detect(word, "[^[:alpha:]]")) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(lemma = textstem::lemmatize_words(word)) %>%
  count(lemma) %>%
  arrange(desc(n)) %>%
  head()
## # A tibble: 6 x 2
##   lemma        n
##   <chr>    <int>
## 1 topic     8631
## 2 mln       7779
## 3 company   6949
## 4 unknown   5876
## 5 exchange  5855
## 6 date      5806

Application: Dictionary-based Sentiment Analysis

tidytext::sentiments
## # A tibble: 27,314 x 4
##    word        sentiment lexicon score
##    <chr>       <chr>     <chr>   <int>
##  1 abacus      trust     nrc        NA
##  2 abandon     fear      nrc        NA
##  3 abandon     negative  nrc        NA
##  4 abandon     sadness   nrc        NA
##  5 abandoned   anger     nrc        NA
##  6 abandoned   fear      nrc        NA
##  7 abandoned   negative  nrc        NA
##  8 abandoned   sadness   nrc        NA
##  9 abandonment anger     nrc        NA
## 10 abandonment fear      nrc        NA
## # ... with 27,304 more rows

The three general-purpose lexicons are

  • AFINN - from Finn Årup Nielsen,
  • bing - from Bing Liu and collaborators, and
  • nrc - from Saif Mohammad and Peter Turney.

All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. All of this information is tabulated in the sentiments dataset, and tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon.
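
The other two lexicons can be pulled the same way, as a quick look (note that in more recent tidytext versions the AFINN and nrc lexicons are downloaded via the textdata package, so the exact columns may differ slightly from what is shown here):

get_sentiments("bing") %>% head()

get_sentiments("afinn") %>% head()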

Get Text Within Body of Documents

Split the text at each <BODY> tag, then strip everything from the closing </BODY> tag onward so that only the body text of each document remains. The first slice, which holds the text before the first <BODY> tag, is dropped.

body_tokens <- reuters %>%
  unnest_tokens(output = body, 
                input = txt, 
                token = "regex", 
                pattern = "<BODY>")

body_text <- body_tokens %>%
  mutate(body_id = row_number()) %>%
  mutate(text = gsub("</body>.*", "", body)) %>%
  mutate(text = trimws(text)) %>%
  dplyr::slice(-1) %>%
  select(body_id, text) 

body_text %>% head()
## # A tibble: 6 x 2
##   body_id text                                                             
##     <int> <chr>                                                            
## 1       2 "showers continued throughout the week in\nthe bahia cocoa zone,~
## 2       3 "the u.s. agriculture department\nreported the farmer-owned rese~
## 3       4 "argentine grain board figures show\ncrop registrations of grain~
## 4       5 "moody's investors service inc said it\nlowered the debt and pre~
## 5       6 "champion products inc said its\nboard of directors approved a t~
## 6       7 "computer terminal systems inc said\nit has completed the sale o~

Tokenize Words

  • tokenize
  • remove non-alphabetic characters
  • remove stopwords
  • lemmatize
body_words <- body_text %>%
  unnest_tokens(word, text) %>%
  filter(!str_detect(word, "[^[:alpha:]]")) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(word_lemma = textstem::lemmatize_words(word)) %>%
  select(-word) 

body_words %>% head()
## # A tibble: 6 x 2
##   body_id word_lemma
##     <int> <chr>     
## 1       2 shower    
## 2       2 continue  
## 3       2 week      
## 4       2 bahia     
## 5       2 cocoa     
## 6       2 zone

Word Sentiment Categorization

Word sentiments based on the “nrc” lexicon.

word_sentiment <- body_words %>%
  left_join(get_sentiments("nrc"), by = c("word_lemma" = "word"))

word_sentiment %>% head()
## # A tibble: 6 x 3
##   body_id word_lemma sentiment   
##     <int> <chr>      <chr>       
## 1       2 shower     <NA>        
## 2       2 continue   anticipation
## 3       2 continue   positive    
## 4       2 continue   trust       
## 5       2 week       <NA>        
## 6       2 bahia      <NA>

Body-Sentiment Word Count Matrix

For each body (document), count the number of words associated with each sentiment from the nrc lexicon.

body_sentiment_counts <- word_sentiment %>%
  filter(!is.na(sentiment)) %>%
  group_by(body_id) %>%
  count(sentiment) %>%
  arrange(body_id, desc(n)) %>%
  spread(sentiment, n, fill = 0) %>%
  ungroup()

body_sentiment_counts %>% head()
## # A tibble: 6 x 11
##   body_id anger anticipation disgust  fear   joy negative positive sadness
##     <int> <dbl>        <dbl>   <dbl> <dbl> <dbl>    <dbl>    <dbl>   <dbl>
## 1       2     1           14       0     4     4        6       16       5
## 2       3     0            0       0     0     0        0        8       0
## 3       4     0            2       0     0     0        7        2       0
## 4       5     2            1       0     2     0        9        4       8
## 5       6     1            6       0     0     5        2        8       1
## 6       7     0           11       0     7     9        6       16       5
## # ... with 2 more variables: surprise <dbl>, trust <dbl>
convert_to_binary <- function(x) {
  # recode any nonzero count as 1 and keep zeros as 0
  ifelse(x != 0, 1, 0)
}

body_sentiment_binary <- body_sentiment_counts %>%
  mutate_at(vars(-body_id), funs(convert_to_binary))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
## 
## # Before:
## funs(name = f(.)
## 
## # After: 
## list(name = ~f(.))
## This warning is displayed once per session.
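
Following the warning's suggestion, the same step can be written with list() and a formula instead of funs(). This sketch should be equivalent under dplyr >= 0.8:

body_sentiment_binary <- body_sentiment_counts %>%
  mutate_at(vars(-body_id), list(~convert_to_binary(.)))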

Application: Clustering Body-Sentiment Word Count Matrix

Small Random Sample

sample_documents <- body_sentiment_binary %>%
  sample_n(1000) %>%
  select(-body_id)

Document Distance

doc_distances <- sample_documents %>%
  analogue::distance(method = "chi.square") %>%
  as.dist()

Complete-linkage Hierarchical Clustering

complete_hcl <- hclust(doc_distances, method = "complete")

plot(complete_hcl)

There appears to be some meaningful structure to the data.
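
Before profiling the clusters, a quick look at how many documents fall into each branch when the tree is cut (a sketch using the same six-cluster cut as in the next section):

table(cutree(complete_hcl, k = 6))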

Cluster Exemplars

n_clusters <- 6

labeled_documents <- sample_documents %>%
  mutate(clus_id = cutree(complete_hcl, n_clusters))

to_pct <- function(x) {
  # share of documents in the cluster with a nonzero (binary) indicator
  sum(x) / length(x)
}

labeled_documents %>%
  group_by(clus_id) %>%
  summarise_all(
    funs(to_pct)
  ) %>%
  gather(sentiment, score, -clus_id) %>%
  ggplot(aes(x = sentiment, y = score)) +
  facet_wrap(~clus_id) +
  geom_col() +
  geom_hline(aes(yintercept = 0.5)) +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 90)
  )

labeled_documents %>%
  group_by(clus_id) %>%
  summarise_all(
    funs(median)
  ) 
## # A tibble: 6 x 11
##   clus_id anger anticipation disgust  fear   joy negative positive sadness
##     <int> <dbl>        <dbl>   <dbl> <dbl> <dbl>    <dbl>    <dbl>   <dbl>
## 1       1     1            1       1     1     1        1        1       1
## 2       2     0            1       0     0     1        0        1       0
## 3       3     1            1       0     1     0        1        1       1
## 4       4     1            1       0     1     1        1        1       1
## 5       5     1            1       0     0     1        1        1       0
## 6       6     0            0       0     0     0        0        1       0
## # ... with 2 more variables: surprise <dbl>, trust <dbl>