1.3 Compare word frequency

Comparing word frequencies is a common task in text analysis and is often used to extract linguistic characteristics. A rule of thumb is to compare word proportions instead of raw counts, because raw counts depend on how much text each author wrote.
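To see why proportions are preferred, here is a minimal sketch with made-up counts. The corpora `A` and `B`, the data frame `toy_counts`, and all numbers are hypothetical and are not taken from the novels used in this section.

```r
library(dplyr)
library(tibble)

# Hypothetical counts: corpus A is ten times larger than corpus B
toy_counts <- tibble(
  corpus = c("A", "A", "B", "B"),
  word   = c("time", "other", "time", "other"),
  n      = c(100, 9900, 20, 980)
)

# Divide each count by the total number of words in its corpus
toy_props <- toy_counts %>%
  group_by(corpus) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

toy_props
```

By raw count, "time" looks five times more common in `A` (100 versus 20), but as a share of all words it is twice as common in `B` (0.02 versus 0.01).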

In this example, we compare novels by Jane Austen, H.G. Wells, and the Brontë sisters.

austen <- austen_books() %>% 
  select(-book) %>% 
  mutate(author = "Jane Austen")
bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767)) %>%
  select(-gutenberg_id) %>% 
  mutate(author = "Brontë Sisters")
hgwells <- gutenberg_download(c(35, 36, 5230, 159)) %>% 
  select(-gutenberg_id) %>% 
  mutate(author = "H.G. Wells")

# Tokenize one author's text into words and drop common stop words
tidy_book <- function(author) {
  author %>% 
    unnest_tokens(word, text) %>% 
    anti_join(stop_words, by = "word")
}

books <- bind_rows(tidy_book(austen),
          tidy_book(bronte),
          tidy_book(hgwells)) %>% 
  mutate(word = str_extract(word, "[:alpha:]+")) %>% 
  count(author, word, sort = TRUE)

books
#> # A tibble: 46,956 x 3
#>   author         word       n
#>   <chr>          <chr>  <int>
#> 1 Jane Austen    miss    1860
#> 2 Jane Austen    time    1339
#> 3 Brontë Sisters time    1065
#> 4 Jane Austen    fanny    977
#> 5 Jane Austen    emma     866
#> 6 Jane Austen    sister   865
#> # ... with 4.695e+04 more rows

Now, our goal is to use Jane Austen as a reference against which the other two authors are compared in terms of word frequency. The data manipulation requires a small trick: after computing the proportions of word usage, we first use pivot_wider to spread all three authors into separate columns, and then use pivot_longer to gather the other two authors back into a single column.

comparison_df <- books %>%
  add_count(author, wt = n, name = "total_word") %>% 
  mutate(proportion = n / total_word) %>% 
  select(-total_word, -n) %>% 
  pivot_wider(names_from = author, values_from = proportion, 
              values_fill = list(proportion = 0)) %>%
  pivot_longer(3:4, names_to = "other", values_to = "proportion")

comparison_df
#> # A tibble: 56,002 x 4
#>   word  `Jane Austen` other          proportion
#>   <chr>         <dbl> <chr>               <dbl>
#> 1 miss        0.00855 Brontë Sisters  0.00342  
#> 2 miss        0.00855 H.G. Wells      0.000120 
#> 3 time        0.00615 Brontë Sisters  0.00424  
#> 4 time        0.00615 H.G. Wells      0.00682  
#> 5 fanny       0.00449 Brontë Sisters  0.0000438
#> 6 fanny       0.00449 H.G. Wells      0        
#> # ... with 5.6e+04 more rows
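The reshaping trick is easier to follow on a tiny data set. The sketch below uses hypothetical authors (`Ref`, `X`, `Y`) and made-up proportions. It selects the columns to gather by name rather than by position, which is an alternative to the `3:4` used above and not a requirement.

```r
library(dplyr)
library(tidyr)

# Hypothetical proportions: "Ref" plays the role of the reference author
toy <- tibble(
  author     = c("Ref", "Ref", "X", "Y"),
  word       = c("tea", "ship", "tea", "ship"),
  proportion = c(0.5, 0.5, 1, 1)
)

# Step 1: one column per author; words an author never uses are filled with 0
toy_wide <- toy %>%
  pivot_wider(names_from = author, values_from = proportion,
              values_fill = list(proportion = 0))

# Step 2: gather only the non-reference authors back into rows,
# so every row pairs the reference proportion with one other author
toy_long <- toy_wide %>%
  pivot_longer(c(X, Y), names_to = "other", values_to = "proportion")

toy_long
```

Each word now appears once per non-reference author, with the reference proportion repeated in its own column, which is the layout the plot below relies on.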

library(scales)

comparison_df %>% 
  filter(proportion > 1 / 1e5) %>% 
  ggplot(aes(proportion, `Jane Austen`)) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(aes(color = abs(`Jane Austen` - proportion)),
              alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) + 
  scale_x_log10(labels = label_percent()) +
  scale_y_log10(labels = label_percent()) + 
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") + 
  facet_wrap(~ other) + 
  guides(color = "none")  # hide the color legend; `FALSE` is deprecated in recent ggplot2

Words that are close to the line in these plots have similar frequencies in both sets of texts, for example, in both Austen and Brontë texts (“miss”, “time”, “day” at the upper frequency end) or in both Austen and Wells texts (“time”, “day”, “brother” at the high frequency end). Words that are far from the line are words that are found more in one set of texts than another. For example, in the Austen-Brontë panel, words like “elizabeth”, “emma”, and “fanny” (all proper nouns) are found in Austen’s texts but not much in the Brontë texts, while words like “arthur” and “dog” are found in the Brontë texts but not the Austen texts. In comparing H.G. Wells with Jane Austen, Wells uses words like “beast”, “guns”, “feet”, and “black” that Austen does not, while Austen uses words like “family”, “friend”, “letter”, and “dear” that Wells does not.

Notice that the words in the Austen-Brontë panel are closer to the dashed diagonal line (slope 1, intercept 0, where both proportions are equal) than in the Austen-Wells panel. Also notice that the words extend to lower frequencies in the Austen-Brontë panel; there is empty space in the Austen-Wells panel at low frequency. These characteristics indicate that Austen and the Brontë sisters use more similar words than Austen and H.G. Wells. Also, we see that not all the words are found in all three sets of texts and there are fewer data points in the panel for Austen and H.G. Wells.

Further, we can quantify how similar the word frequencies are with a simple correlation test:

cor.test(data = filter(comparison_df, other == "Brontë Sisters"),
         ~ proportion + `Jane Austen`)
#> 
#> 	Pearson's product-moment correlation
#> 
#> data:  proportion and Jane Austen
#> t = 169, df = 27999, p-value <0.0000000000000002
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.705 0.716
#> sample estimates:
#>   cor 
#> 0.711

cor.test(data = filter(comparison_df, other == "H.G. Wells"),
         ~ proportion + `Jane Austen`)
#> 
#> 	Pearson's product-moment correlation
#> 
#> data:  proportion and Jane Austen
#> t = 72, df = 27999, p-value <0.0000000000000002
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.383 0.403
#> sample estimates:
#>   cor 
#> 0.393
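If only the correlation estimates are needed, both comparisons can be computed in one grouped summary. The sketch below uses a hypothetical stand-in, `toy_comparison`, with the same column layout as `comparison_df` but made-up values, so its numbers say nothing about the actual novels.

```r
library(dplyr)

# Hypothetical stand-in for comparison_df (same columns, invented values)
toy_comparison <- tibble(
  word          = rep(c("a", "b", "c", "d"), times = 2),
  `Jane Austen` = rep(c(0.4, 0.3, 0.2, 0.1), times = 2),
  other         = rep(c("Author X", "Author Y"), each = 4),
  proportion    = c(0.35, 0.35, 0.2, 0.1,   # roughly follows the reference
                    0.1, 0.2, 0.3, 0.4)     # exact reverse of the reference
)

# One Pearson correlation per comparison author
toy_cor <- toy_comparison %>%
  group_by(other) %>%
  summarise(correlation = cor(proportion, `Jane Austen`))

toy_cor
```

Applied to `comparison_df`, the same `group_by(other)` and `summarise()` steps should give the point estimates that `cor.test` reports, without the test statistic and confidence interval.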