5.3 Tidying corpus objects with metadata

Although they differ in form, a document-term matrix and a one-token-per-row data frame are interchangeable, because both store the same information after tokenization. A corpus object, by contrast, is a data structure that holds text before tokenization. One common example is the Corpus objects from the tm package. These store text alongside metadata, which may include an ID, date/time, title, or language for each document.
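To make the idea of "text plus metadata, before tokenization" concrete, here is a minimal sketch that builds a two-document corpus by hand. The texts, the object name toy_corpus, and the heading values are invented for illustration; only VCorpus(), VectorSource(), content(), and meta() come from tm.

```r
library(tm)

# Two made-up documents; the texts are illustrative only
texts <- c("The first short document.",
           "A second, slightly longer document.")
toy_corpus <- VCorpus(VectorSource(texts))

# Attach a document-level metadata field to each document
meta(toy_corpus[[1]], "heading") <- "First"
meta(toy_corpus[[2]], "heading") <- "Second"

# Each element holds the raw, untokenized text plus its metadata
content(toy_corpus[[1]])
meta(toy_corpus[[2]], "heading")
```

Note that the text is stored as whole strings: nothing has been split into words yet, which is what distinguishes a corpus from the tokenized structures discussed earlier.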

The tm package comes with the acq corpus, containing 50 articles from the news service Reuters.

library(tm)

data("acq")
acq
#> <<VCorpus>>
#> Metadata:  corpus specific: 0, document level (indexed): 0
#> Content:  documents: 50

A corpus object is structured like a list, with each item containing both text and metadata.

acq[[1]]
#> <<PlainTextDocument>>
#> Metadata:  15
#> Content:  chars: 1287
acq[[1]]$content
#> [1] "Computer Terminal Systems Inc said\nit has completed the sale of 200,000 shares of its common\nstock, and warrants to acquire an additional one mln shares, to\n<Sedio N.V.> of Lugano, Switzerland for 50,000 dlrs.\n    The company said the warrants are exercisable for five\nyears at a purchase price of .125 dlrs per share.\n    Computer Terminal said Sedio also has the right to buy\nadditional shares and increase its total holdings up to 40 pct\nof the Computer Terminal's outstanding common stock under\ncertain circumstances involving change of control at the\ncompany.\n    The company said if the conditions occur the warrants would\nbe exercisable at a price equal to 75 pct of its common stock's\nmarket price at the time, not to exceed 1.50 dlrs per share.\n    Computer Terminal also said it sold the technolgy rights to\nits Dot Matrix impact technology, including any future\nimprovements, to <Woodco Inc> of Houston, Tex. for 200,000\ndlrs. But, it said it would continue to be the exclusive\nworldwide licensee of the technology for Woodco.\n    The company said the moves were part of its reorganization\nplan and would help pay current operation costs and ensure\nproduct delivery.\n    Computer Terminal makes computer generated labels, forms,\ntags and ticket printers and terminals.\n Reuter"
acq[[1]]$meta
#>   author       : character(0)
#>   datetimestamp: 1987-02-26 15:18:06
#>   description  : 
#>   heading      : COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE
#>   id           : 10
#>   language     : en
#>   origin       : Reuters-21578 XML
#>   topics       : YES
#>   lewissplit   : TRAIN
#>   cgisplit     : TRAINING-SET
#>   oldid        : 5553
#>   places       : usa
#>   people       : character(0)
#>   orgs         : character(0)
#>   exchanges    : character(0)
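Besides the $content and $meta components, tm also provides the accessor functions content() and meta(). The following is a small sketch, assuming the acq corpus loaded above; the names first_doc and ids are introduced here only for illustration.

```r
library(tm)
data("acq")

first_doc <- acq[[1]]

# content() returns the raw text of the document
nchar(content(first_doc))

# meta() with a tag returns a single metadata field
meta(first_doc, "heading")
meta(first_doc, "id")

# Collect one field across every document in the corpus
ids <- vapply(acq, function(doc) as.character(meta(doc, "id")), character(1))
head(ids)
```

Pulling fields out one at a time like this works, but it becomes tedious when several metadata fields are needed side by side, which is the motivation for a tabular representation.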

We can thus use the tidy() method to construct a table with one row per document, including the metadata (such as id and datetimestamp) as columns alongside the text.

library(tidytext)

acq_tidy <- tidy(acq)
acq_tidy
#> # A tibble: 50 x 16
#>   author datetimestamp       description heading id    language origin topics
#>   <chr>  <dttm>              <chr>       <chr>   <chr> <chr>    <chr>  <chr> 
#> 1 <NA>   1987-02-26 23:18:06 ""          COMPUT~ 10    en       Reute~ YES   
#> 2 <NA>   1987-02-26 23:19:15 ""          OHIO M~ 12    en       Reute~ YES   
#> 3 <NA>   1987-02-26 23:49:56 ""          MCLEAN~ 44    en       Reute~ YES   
#> 4 By Ca~ 1987-02-26 23:51:17 ""          CHEMLA~ 45    en       Reute~ YES   
#> 5 <NA>   1987-02-27 00:08:33 ""          <COFAB~ 68    en       Reute~ YES   
#> 6 <NA>   1987-02-27 00:32:37 ""          INVEST~ 96    en       Reute~ YES   
#> # ... with 44 more rows, and 8 more variables: lewissplit <chr>,
#> #   cgisplit <chr>, oldid <chr>, places <named list>, people <lgl>, orgs <lgl>,
#> #   exchanges <lgl>, text <chr>