2.6 Units other than words

Some sentiment analysis algorithms look beyond only unigrams (i.e., single words) to try to understand the sentiment of a sentence as a whole. Such algorithms try to understand that "I am not having a good day" is a sad sentence, not a happy one, because of negation.

We may want to tokenize text into sentences, and it makes sense to use a new name for the output column in such a case.

# tokenize Pride and Prejudice into one row per sentence
PandP_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")

PandP_sentences
#> # A tibble: 7,066 x 1
#>   sentence                                                                      
#>   <chr>                                                                         
#> 1 "pride and prejudice  by jane austen    chapter 1   it is a truth universally~
#> 2 "however little known the feelings or views of such a man may be on his first~
#> 3 "\"my dear mr."                                                               
#> 4 "bennet,\" said his lady to him one day, \"have you heard that netherfield pa~
#> 5 "mr."                                                                         
#> 6 "bennet replied that he had not."                                             
#> # ... with 7,060 more rows
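Under the hood, unnest_tokens() delegates tokenization to the tokenizers package, so token = "sentences" is handled by tokenizers::tokenize_sentences(). If you want to experiment with the sentence splitter on its own, here is a quick sketch (output not shown; the input string is just an illustration):

library(tokenizers)

# split a character vector into sentences directly
tokenize_sentences("It is a truth universally acknowledged. Is it not?")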

The sentence tokenizing does seem to have a bit of trouble with UTF-8 encoded text, especially with sections of dialogue; it does much better with punctuation in ASCII. One possibility, if this is important, is to try using iconv() in a mutate() statement before unnesting, converting to an encoding such as "latin1" or "ASCII" (as in the chunk below).

tibble(text = prideprejudice) %>% 
  mutate(text = iconv(text, to = "ASCII")) %>% 
  unnest_tokens(sentence, text, token = "sentences")
#> # A tibble: 7,066 x 1
#>   sentence                                                                      
#>   <chr>                                                                         
#> 1 "pride and prejudice  by jane austen    chapter 1   it is a truth universally~
#> 2 "however little known the feelings or views of such a man may be on his first~
#> 3 "\"my dear mr."                                                               
#> 4 "bennet,\" said his lady to him one day, \"have you heard that netherfield pa~
#> 5 "mr."                                                                         
#> 6 "bennet replied that he had not."                                             
#> # ... with 7,060 more rows
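Note that iconv() returns NA for any element it cannot convert to the target encoding, so whole lines of text can be lost. If that happens, one workaround is iconv()'s sub argument, which substitutes non-convertible bytes instead:

tibble(text = prideprejudice) %>% 
  # sub = "" strips non-convertible bytes rather than returning NA
  mutate(text = iconv(text, to = "ASCII", sub = "")) %>% 
  unnest_tokens(sentence, text, token = "sentences")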

Another option in unnest_tokens() is to split into tokens using a regex pattern. We could use this, for example, to split the text of Jane Austen’s novels into a data frame by chapter.


# split each novel at its chapter headings, one row per chapter
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

# 275 rows: one per chapter, plus a front-matter row for each novel
austen_chapters
#> # A tibble: 275 x 2
#>   book             chapter                                                      
#>   <fct>            <chr>                                                        
#> 1 Sense & Sensibi~ "sense and sensibility\n\nby jane austen\n\n(1811)\n\n\n\n\n"
#> 2 Sense & Sensibi~ "\n\n\nthe family of dashwood had long been settled in susse~
#> 3 Sense & Sensibi~ "\n\n\nmrs. john dashwood now installed herself mistress of ~
#> 4 Sense & Sensibi~ "\n\n\nmrs. dashwood remained at norland several months; not~
#> 5 Sense & Sensibi~ "\n\n\n\"what a pity it is, elinor,\" said marianne, \"that ~
#> 6 Sense & Sensibi~ "\n\n\nno sooner was her answer dispatched, than mrs. dashwo~
#> # ... with 269 more rows

# the same 275 rows: distinct book/chapter pairs in tidy_books
tidy_books %>%
  distinct(book, chapter)
#> # A tibble: 275 x 2
#>   book                chapter
#>   <fct>                 <int>
#> 1 Sense & Sensibility       0
#> 2 Sense & Sensibility       1
#> 3 Sense & Sensibility       2
#> 4 Sense & Sensibility       3
#> 5 Sense & Sensibility       4
#> 6 Sense & Sensibility       5
#> # ... with 269 more rows

In the austen_chapters data frame, each row corresponds to one chapter, plus an "extra" front-matter row for each novel (the rows that tidy_books counts as chapter 0); both approaches recover the same 275 rows.
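We can also check the split novel by novel; counting rows per book (output not shown) should match each novel's chapter count plus its one front-matter row:

# chapters recovered per novel, including the front-matter row
austen_chapters %>% 
  count(book)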

Near the beginning of this chapter, we used a similar regex to find where all the chapters were in Austen’s novels for a tidy data frame organized by one-word-per-row (Section 2.2). Using a regex as the token is somewhat similar to collapsing that one-word-per-row data frame back into one row per chapter:

# paste each chapter's words back together (str_c() comes from stringr)
tidy_books %>% 
  group_by(book, chapter) %>% 
  summarize(str_c(word, collapse = " "))
#> # A tibble: 275 x 3
#> # Groups:   book [6]
#>   book           chapter `str_c(word, collapse = " ")`                          
#>   <fct>            <int> <chr>                                                  
#> 1 Sense & Sensi~       0 sense and sensibility by jane austen 1811              
#> 2 Sense & Sensi~       1 chapter 1 the family of dashwood had long been settled~
#> 3 Sense & Sensi~       2 chapter 2 mrs john dashwood now installed herself mist~
#> 4 Sense & Sensi~       3 chapter 3 mrs dashwood remained at norland several mon~
#> 5 Sense & Sensi~       4 chapter 4 what a pity it is elinor said marianne that ~
#> 6 Sense & Sensi~       5 chapter 5 no sooner was her answer dispatched than mrs~
#> # ... with 269 more rows
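If you plan to work with the collapsed text downstream, it is tidier to name the new column; a minimal variant (assuming dplyr 1.0 or later for the .groups argument):

tidy_books %>% 
  group_by(book, chapter) %>% 
  summarize(text = str_c(word, collapse = " "), .groups = "drop")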

We can use tidy text analysis to ask questions such as what are the most negative chapters in each of Jane Austen’s novels? First, let’s get the list of negative words from the Bing lexicon. Second, let’s make a data frame of how many words are in each chapter so we can normalize for the length of chapters. Then, let’s find the number of negative words in each chapter and divide by the total words in each chapter. For each book, which chapter has the highest proportion of negative words?

# negative words from the Bing lexicon
bing_negative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

# total words per chapter, to normalize for chapter length
chapter_words <- tidy_books %>% 
  count(book, chapter)

tidy_books %>%
  semi_join(bing_negative, by = "word") %>%
  count(book, chapter, name = "negative_words") %>% 
  left_join(chapter_words, by = c("book", "chapter")) %>%
  mutate(ratio = negative_words / n) %>%
  filter(chapter != 0) %>%   # chapter 0 rows are front matter, not chapters
  group_by(book) %>% 
  top_n(1, ratio)            # keep each book's chapter with the highest ratio
#> # A tibble: 6 x 5
#> # Groups:   book [6]
#>   book                chapter negative_words     n  ratio
#>   <fct>                 <int>          <int> <int>  <dbl>
#> 1 Sense & Sensibility      43            161  3405 0.0473
#> 2 Pride & Prejudice        34            111  2104 0.0528
#> 3 Mansfield Park           46            173  3685 0.0469
#> 4 Emma                     15            151  3340 0.0452
#> 5 Northanger Abbey         21            149  2982 0.0500
#> 6 Persuasion                4             62  1807 0.0343
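
Note that top_n() has been superseded in recent versions of dplyr; an equivalent ending for the same pipeline (a sketch assuming dplyr 1.0 or later) uses slice_max(), which is explicit about which column to maximize:

tidy_books %>%
  semi_join(bing_negative, by = "word") %>%
  count(book, chapter, name = "negative_words") %>% 
  left_join(chapter_words, by = c("book", "chapter")) %>%
  mutate(ratio = negative_words / n) %>%
  filter(chapter != 0) %>%
  group_by(book) %>% 
  slice_max(ratio, n = 1) %>% 
  ungroup()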