2.1 The sentiments dataset

There are a variety of methods and dictionaries that exist for evaluating the opinion or emotion in text. The tidytext package contains several sentiment lexicons. Three general-purpose lexicons are

  • AFINN from Finn Årup Nielsen

  • bing from Bing Liu and collaborators

  • nrc from Saif Mohammad and Peter Turney.

A fourth lexicon, loughran, is domain-specific rather than general-purpose: the Loughran and McDonald dictionary of financial sentiment terms. This dictionary was developed based on analyses of financial reports, and intentionally avoids words like “share” and “fool”, as well as subtler terms like “liability” and “risk” that may not have a negative meaning in a financial context.

All four of these lexicons are based on unigrams, i.e., single words. They contain many English words, and the words are assigned scores for positive/negative sentiment, and possibly also for emotions like joy, anger, and sadness. The nrc lexicon categorizes words into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. The loughran lexicon divides words into six categories: constraining, litigious, negative, positive, superfluous, and uncertainty.

get_sentiments("nrc")
#> # A tibble: 13,901 x 2
#>   word      sentiment
#>   <chr>     <chr>    
#> 1 abacus    trust    
#> 2 abandon   fear     
#> 3 abandon   negative 
#> 4 abandon   sadness  
#> 5 abandoned anger    
#> 6 abandoned fear     
#> # ... with 13,895 more rows

# install.packages("textdata")
get_sentiments("bing")
#> # A tibble: 6,786 x 2
#>   word       sentiment
#>   <chr>      <chr>    
#> 1 2-faces    negative 
#> 2 abnormal   negative 
#> 3 abolish    negative 
#> 4 abominable negative 
#> 5 abominably negative 
#> 6 abominate  negative 
#> # ... with 6,780 more rows

get_sentiments("afinn")
#> # A tibble: 2,477 x 2
#>   word       value
#>   <chr>      <dbl>
#> 1 abandon       -2
#> 2 abandoned     -2
#> 3 abandons      -2
#> 4 abducted      -2
#> 5 abduction     -2
#> 6 abductions    -2
#> # ... with 2,471 more rows

get_sentiments("loughran") %>% 
  filter(sentiment == "superfluous")
#> # A tibble: 21 x 2
#>   word         sentiment  
#>   <chr>        <chr>      
#> 1 aegis        superfluous
#> 2 amorphous    superfluous
#> 3 anticipatory superfluous
#> 4 appertaining superfluous
#> 5 assimilate   superfluous
#> 6 assimilating superfluous
#> # ... with 15 more rows

Dictionary-based methods like the ones we are discussing find the total sentiment of a piece of text by adding up the individual sentiment scores for each word in the text.
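For example, here is a minimal sketch of that approach using the AFINN lexicon (the toy sentence below is invented for illustration, not drawn from any dataset): tokenize the text into unigrams, keep only the words that appear in the lexicon via an inner join, and sum the scores.

library(dplyr)
library(tidytext)

# a toy one-sentence document, invented for illustration
toy <- tibble(text = "I love this brilliant book but the ending was a disaster")

toy %>%
  unnest_tokens(word, text) %>%                        # one row per unigram
  inner_join(get_sentiments("afinn"), by = "word") %>% # keep only scored words
  summarise(sentiment = sum(value))                    # add up the AFINN scores

Note that words not found in the lexicon are simply dropped by the inner join; this is characteristic of dictionary-based methods.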

One caveat is that the size of the chunk of text that we use to add up unigram sentiment scores can have an effect on an analysis. A text the size of many paragraphs can often have positive and negative sentiment average out to about zero, while sentence-sized or paragraph-sized chunks often work better.
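As a sketch of working with smaller chunks (here using the janeaustenr package purely as a convenient source of example text; any data frame with one line of text per row would do), we can count bing sentiment within fixed-size sections and compute a net score per section:

library(dplyr)
library(tidyr)
library(tidytext)
library(janeaustenr)

# net bing sentiment per 80-line section of one novel
austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

Tracking the net score across sections like this reveals swings in sentiment that would cancel out if we summed over the whole text at once.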