1.2 The gutenbergr package

library(gutenbergr)
library(dplyr)  # provides %>% and filter(), used in the examples below

The gutenbergr package provides access to the public domain works in the Project Gutenberg collection. It includes tools for downloading books (stripping out the unhelpful header/footer information) as well as a complete dataset of Project Gutenberg metadata that can be used to find works of interest. In this book, we will mostly use the function gutenberg_download(), which downloads one or more works from Project Gutenberg by ID.

The dataset gutenberg_metadata contains information about each work, pairing each Gutenberg ID with the work's title, author, language, and other fields:

gutenberg_metadata
#> # A tibble: 51,997 x 8
#>   gutenberg_id title author gutenberg_autho~ language gutenberg_books~ rights
#>          <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
#> 1            0  <NA> <NA>                 NA en       <NA>             Publi~
#> 2            1 "The~ Jeffe~             1638 en       United States L~ Publi~
#> 3            2 "The~ Unite~                1 en       American Revolu~ Publi~
#> 4            3 "Joh~ Kenne~             1666 en       <NA>             Publi~
#> 5            4 "Lin~ Linco~                3 en       US Civil War     Publi~
#> 6            5 "The~ Unite~                1 en       American Revolu~ Publi~
#> # ... with 5.199e+04 more rows, and 1 more variable: has_text <lgl>
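Because gutenberg_metadata is an ordinary tibble, the usual dplyr verbs apply to it. The following is a minimal sketch, assuming dplyr is loaded alongside gutenbergr; it uses only columns visible in the printout above. The variable names language_counts and downloadable are ours, and the counts you see will depend on the version of the package you have installed.

```r
library(gutenbergr)
library(dplyr)

# Number of works per language code, most common first
language_counts <- gutenberg_metadata %>%
  count(language, sort = TRUE)
language_counts

# Keep only the records flagged as having downloadable text
# (rows where has_text is FALSE or NA are dropped)
downloadable <- gutenberg_metadata %>%
  filter(has_text)
nrow(downloadable)
```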

For example, you could find the Gutenberg ID of Wuthering Heights and then download its text by that ID:

gutenberg_metadata %>%
  filter(title == "Wuthering Heights")
#> # A tibble: 1 x 8
#>   gutenberg_id title author gutenberg_autho~ language gutenberg_books~ rights
#>          <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
#> 1          768 Wuth~ Bront~              405 en       Gothic Fiction/~ Publi~
#> # ... with 1 more variable: has_text <lgl>

gutenberg_download(768)
#> # A tibble: 12,085 x 2
#>   gutenberg_id text               
#>          <int> <chr>              
#> 1          768 "WUTHERING HEIGHTS"
#> 2          768 ""                 
#> 3          768 ""                 
#> 4          768 "CHAPTER I"        
#> 5          768 ""                 
#> 6          768 ""                 
#> # ... with 1.208e+04 more rows
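Since gutenberg_download() takes one or more IDs, several works can be fetched in a single call. The sketch below assumes a working network connection and that ID 1260 corresponds to Jane Eyre (an assumption on our part, not something shown in the metadata above). The meta_fields argument asks the function to attach metadata columns, here the title, to each line of text; the variable name books is ours.

```r
library(gutenbergr)
library(dplyr)

# Download two works in one call.
# 768 is Wuthering Heights (from the lookup above);
# 1260 is assumed to be Jane Eyre.
books <- gutenberg_download(c(768, 1260), meta_fields = "title")

# Number of lines of text per work
books %>%
  count(title)
```

Keeping the title alongside each line makes it easy to group or compare works later without joining back to gutenberg_metadata.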

In many analyses, you may want to restrict the metadata to English works, avoid duplicates, and include only books whose text can be downloaded. The gutenberg_works() function does this pre-filtering. It also accepts filter conditions as arguments:

gutenberg_works(author == "Austen, Jane")
#> # A tibble: 10 x 8
#>   gutenberg_id title author gutenberg_autho~ language gutenberg_books~ rights
#>          <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
#> 1          105 Pers~ Auste~               68 en       <NA>             Publi~
#> 2          121 Nort~ Auste~               68 en       Gothic Fiction   Publi~
#> 3          141 Mans~ Auste~               68 en       <NA>             Publi~
#> 4          158 Emma  Auste~               68 en       <NA>             Publi~
#> 5          161 Sens~ Auste~               68 en       <NA>             Publi~
#> 6          946 Lady~ Auste~               68 en       <NA>             Publi~
#> # ... with 4 more rows, and 1 more variable: has_text <lgl>
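The result of gutenberg_works() can be passed directly to gutenberg_download(), which accepts a data frame containing a gutenberg_id column as well as a plain vector of IDs. The sketch below assumes a network connection and downloads only the first two matches to keep the example small. The variable names are ours. The languages argument shown at the end overrides the default English-only filter; "fr" is used purely as an illustration.

```r
library(gutenbergr)
library(dplyr)

# Pre-filtered metadata for one author
austen <- gutenberg_works(author == "Austen, Jane")

# Download the text of the first two works, keeping the title with each line
austen_text <- austen %>%
  head(2) %>%
  gutenberg_download(meta_fields = "title")

# Same pre-filtering, but for French-language works instead of English
french_works <- gutenberg_works(languages = "fr")
```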