1.2 The gutenbergr package | Notes for “Text Mining with R: A Tidy Approach”

1.2 The `gutenbergr` package

library(gutenbergr)

The gutenbergr package provides access to the public domain works in the Project Gutenberg collection. It includes both tools for downloading books (stripping out the unhelpful header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. In this book, we will mostly use the function gutenberg_download(), which downloads one or more works from Project Gutenberg by ID.

The dataset gutenberg_metadata contains information about each work, pairing each Gutenberg ID with its title, author, language, and other fields:

gutenberg_metadata
#> # A tibble: 51,997 x 8
#>   gutenberg_id title author gutenberg_autho~ language gutenberg_books~ rights
#>          <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
#> 1            0  <NA> <NA>                 NA en       <NA>             Publi~
#> 2            1 "The~ Jeffe~             1638 en       United States L~ Publi~
#> 3            2 "The~ Unite~                1 en       American Revolu~ Publi~
#> 4            3 "Joh~ Kenne~             1666 en       <NA>             Publi~
#> 5            4 "Lin~ Linco~                3 en       US Civil War     Publi~
#> 6            5 "The~ Unite~                1 en       American Revolu~ Publi~
#> # ... with 5.199e+04 more rows, and 1 more variable: has_text <lgl>
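
The printed tibble abbreviates several column names and leaves out the eighth column, has_text. As a minimal sketch (assuming the dplyr package is installed alongside gutenbergr; the name language_counts is only illustrative), the full set of columns can be listed and the works tallied by language:

```r
library(gutenbergr)
library(dplyr)

# Show every column of the bundled metadata table, including has_text
glimpse(gutenberg_metadata)

# Count works per language, most common language first
language_counts <- gutenberg_metadata %>%
  count(language, sort = TRUE)

language_counts
```

The number of rows and columns depends on the installed version of gutenbergr, so the output may not match the listing above exactly.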

For example, you could find the Gutenberg ID of Wuthering Heights by filtering on the title:

library(dplyr)  # provides filter() and the %>% pipe
gutenberg_metadata %>%
  filter(title == "Wuthering Heights")
#> # A tibble: 1 x 8
#>   gutenberg_id title author gutenberg_autho~ language gutenberg_books~ rights
#>          <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
#> 1          768 Wuth~ Bront~              405 en       Gothic Fiction/~ Publi~
#> # ... with 1 more variable: has_text <lgl>
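
An exact match with == misses records whose title differs in case or carries a subtitle. A hedged alternative, assuming the stringr package is available (the name heights is only illustrative), is to match on part of the title instead:

```r
library(gutenbergr)
library(dplyr)
library(stringr)

# Case-insensitive partial match on the title column;
# rows with a missing title are dropped by filter()
heights <- gutenberg_metadata %>%
  filter(str_detect(title, regex("wuthering heights", ignore_case = TRUE))) %>%
  select(gutenberg_id, title, author, language, has_text)

heights
```

A partial match can return more than one record, so it is worth checking the author and has_text columns before settling on an ID.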

Passing the ID 768 to gutenberg_download() retrieves the text, one line per row:

gutenberg_download(768)
#> # A tibble: 12,085 x 2
#>   gutenberg_id text               
#>          <int> <chr>              
#> 1          768 "WUTHERING HEIGHTS"
#> 2          768 ""                 
#> 3          768 ""                 
#> 4          768 "CHAPTER I"        
#> 5          768 ""                 
#> 6          768 ""                 
#> # ... with 1.208e+04 more rows

In many analyses, you may want to keep only English works, avoid duplicates, and include only books whose text can be downloaded. The gutenberg_works() function applies this pre-filtering. It also accepts filter conditions as arguments:

gutenberg_works(author == "Austen, Jane")
#> # A tibble: 10 x 8
#>   gutenberg_id title author gutenberg_autho~ language gutenberg_books~ rights
#>          <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
#> 1          105 Pers~ Auste~               68 en       <NA>             Publi~
#> 2          121 Nort~ Auste~               68 en       Gothic Fiction   Publi~
#> 3          141 Mans~ Auste~               68 en       <NA>             Publi~
#> 4          158 Emma  Auste~               68 en       <NA>             Publi~
#> 5          161 Sens~ Auste~               68 en       <NA>             Publi~
#> 6          946 Lady~ Auste~               68 en       <NA>             Publi~
#> # ... with 4 more rows, and 1 more variable: has_text <lgl>
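
To make the pre-filtering concrete, here is a rough sketch of the three steps described above, applied by hand with dplyr. It runs on a small hypothetical table (toy_metadata, with made-up IDs) that only borrows the column names of gutenberg_metadata. It illustrates the idea and is not the package's actual implementation, which may apply further conditions.

```r
library(dplyr)

# Hypothetical toy table; the IDs and rows are invented for illustration
toy_metadata <- data.frame(
  gutenberg_id        = c(9001L, 9002L, 9003L, 9004L, 9005L),
  title               = c("Emma", "Emma", "Emma", "Persuasion", "Persuasion"),
  author              = "Austen, Jane",
  gutenberg_author_id = 68L,
  language            = c("en", "en", "fr", "en", "en"),
  has_text            = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors    = FALSE
)

prefiltered <- toy_metadata %>%
  # keep English works whose text can be downloaded
  filter(language == "en", has_text) %>%
  # keep the first record for each title/author pair
  distinct(title, gutenberg_author_id, .keep_all = TRUE)

prefiltered
```

In this toy table, the French edition and the record without text are dropped first, and the duplicate English "Emma" is then removed, which leaves records 9001 and 9005.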