B.3 Extracting text from PDF and other files

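library(pdftools)
# make sure the target folder exists; mode = "wb" keeps the binary file intact on Windows
dir.create("data", showWarnings = FALSE)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "data/1403.2805.pdf", mode = "wb")
# pdf_text() returns a character vector with one string per page
txt <- pdf_text("data/1403.2805.pdf")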

# one element per page, 29 pages in total
length(txt)
#> [1] 29

cat(txt[[1]])
#>                                               The jsonlite Package: A Practical and Consistent Mapping
#>                                                                    Between JSON Data and R Objects
#>                                                                                     Jeroen Ooms
#> arXiv:1403.2805v1 [stat.CO] 12 Mar 2014
#>                                                                               UCLA Department of Statistics
#>                                                                                              Abstract
#>                                                   A naive realization of JSON data in R maps JSON arrays to an unnamed list, and JSON objects to a
#>                                                named list. However, in practice a list is an awkward, inefficient type to store and manipulate data.
#>                                                Most statistical applications work with (homogeneous) vectors, matrices or data frames. Therefore JSON
#>                                                packages in R typically define certain special cases of JSON structures which map to simpler R types.
#>                                                Currently there exist no formal guidelines, or even consensus between implementations on how R data
#>                                                should be represented in JSON. Furthermore, upon closer inspection, even the most basic data structures
#>                                                in R actually do not perfectly map to their JSON counterparts and leave some ambiguity for edge cases.
#>                                                These problems have resulted in different behavior between implementations and can lead to unexpected
#>                                                output. This paper explicitly describes a mapping between R classes and JSON data, highlights potential
#>                                                problems, and proposes conventions that generalize the mapping to cover all common structures. We
#>                                                emphasize the importance of type consistency when using JSON to exchange dynamic data, and illustrate
#>                                                using examples and anecdotes. The jsonlite R package is used throughout the paper as a reference
#>                                                implementation.
#>                                           1    Introduction
#>                                           JavaScript Object Notation (JSON) is a text format for the serialization of structured data (Crockford, 2006a).
#>                                           It is derived from the object literals of JavaScript, as defined in the ECMAScript Programming Language
#>                                           Standard, Third Edition (ECMA, 1999). Design of JSON is simple and concise in comparison with other
#>                                           text based formats, and it was originally proposed by Douglas Crockford as a “fat-free alternative to XML”
#>                                           (Crockford, 2006b). The syntax is easy for humans to read and write, easy for machines to parse and generate
#>                                           and completely described in a single page at http://www.json.org. The character encoding of JSON text
#>                                           is always Unicode, using UTF-8 by default (Crockford, 2006a), making it naturally compatible with non-
#>                                           latin alphabets. Over the past years, JSON has become hugely popular on the internet as a general purpose
#>                                           data interchange format. High quality parsing libraries are available for almost any programming language,
#>                                           making it easy to implement systems and applications that exchange data over the network using JSON. For
#>                                           R (R Core Team, 2013), several packages that assist the user in generating, parsing and validating JSON
#>                                           are available through CRAN, including rjson (Couture-Beil, 2013), RJSONIO (Lang, 2013), and jsonlite
#>                                           (Ooms et al., 2014).
#>                                           The emphasis of this paper is not on discussing the JSON format or any particular implementation for using
#>                                                                                                  1
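
The page comes back as one long string in which the PDF layout is reproduced with padding spaces and newlines. A minimal base R sketch of how such a string could be broken into clean lines is shown below. The `page` string is a made-up stand-in for `txt[[1]]`, not the real page, so the example runs without downloading anything.

```r
# Hypothetical stand-in for one element of `txt` (not the real page text)
page <- "                The jsonlite Package\n        Jeroen Ooms\n\n  Abstract\n"

# Split the page on newlines, trim the layout padding, and drop empty lines
lines <- trimws(strsplit(page, "\n", fixed = TRUE)[[1]])
lines <- lines[nzchar(lines)]
lines
```

The same two steps should work on `txt[[1]]`, although multi-column layouts may need more careful handling.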

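library(tibble)  # enframe()
library(dplyr)   # rename() and %>%
# one row per page, then re-wrap the text into lines roughly 100 characters wide;
# nest_paragraphs() is assumed to come from a package attached earlier
enframe(txt) %>% 
  rename(page = name) %>% 
  nest_paragraphs(input = value, width = 100) %>%
  head()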
#>                                                                                                   text
#> 1 The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects Jeroen Ooms
#> 2   arXiv:1403.2805v1 [stat.CO] 12 Mar 2014 UCLA Department of Statistics Abstract A naive realization
#> 3    of JSON data in R maps JSON arrays to an unnamed list, and JSON objects to a named list. However,
#> 4    in practice a list is an awkward, inefficient type to store and manipulate data. Most statistical
#> 5    applications work with (homogeneous) vectors, matrices or data frames. Therefore JSON packages in
#> 6  R typically define certain special cases of JSON structures which map to simpler R types. Currently
#>   page
#> 1    1
#> 2    1
#> 3    1
#> 4    1
#> 5    1
#> 6    1
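
The output above pairs re-wrapped lines of text with the page they came from. How `nest_paragraphs()` does this is not shown here, so the following is only a rough base R approximation of the same idea, built on `strwrap()`. The helper name `wrap_pages()` and the toy `pages` vector are assumptions made for illustration.

```r
# Hypothetical two-page input standing in for `txt`
pages <- c("A naive realization\n   of JSON data in R\n maps JSON arrays",
           "to an unnamed list.")

# Illustrative helper (not part of any package): flatten each page,
# re-wrap it to `width`, and record the page number of every line
wrap_pages <- function(pages, width = 100) {
  pieces <- lapply(seq_along(pages), function(i) {
    # collapse runs of whitespace, including newlines, into single spaces
    flat <- gsub("\\s+", " ", trimws(pages[[i]]))
    text <- strwrap(flat, width = width)
    data.frame(text = text, page = rep(i, length(text)),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

out <- wrap_pages(pages, width = 20)
out
```

Unlike the output above, this sketch never lets a line run across a page boundary, and `strwrap()` keeps lines strictly shorter than `width`, so the line breaks will not match `nest_paragraphs()` exactly.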

B.3.1 Office documents

B.3.2 Images