Notes for Text Mining with R
Preface
I Text Mining with R
1 Tidy text format
1.1 The unnest_tokens() function
1.2 The gutenbergr package
1.3 Compare word frequency
1.4 Other tokenization methods
2 Sentiment analysis with tidy data
2.1 The sentiments dataset
2.2 Sentiment analysis with inner join
2.3 Comparing 3 different dictionaries
2.4 Most common positive and negative words
2.5 Wordclouds
2.6 Units other than words
3 Analyzing word and document frequency
3.1 tf-idf
3.1.1 Term frequency in Jane Austen’s novels
3.1.2 Zipf’s law
3.1.3 Word rank slope chart
3.1.4 The bind_tf_idf() function
3.2 Weighted log odds ratio
3.2.1 Log odds ratio
3.2.2 Model-based approach: Weighted log odds ratio
3.2.3 Discussions
3.2.4 bind_log_odds()
3.3 A corpus of physics texts
4 Relationships between words: n-grams and correlations
4.1 Tokenizing by n-gram
4.1.1 Filtering n-grams
4.1.2 Analyzing bigrams
4.1.3 Using bigrams to provide context in sentiment analysis
4.1.4 Visualizing a network of bigrams with ggraph
4.1.5 Visualizing “friends”
4.2 Counting and correlating pairs of words with widyr
4.2.1 Counting and correlating among sections
4.2.2 Pairwise correlation
5 Converting to and from non-tidy formats
5.1 Tidying a document-term matrix
5.2 Casting tidy text data into a matrix
5.3 Tidying corpus objects with metadata
6 Topic modeling
6.1 Latent Dirichlet Allocation
6.1.1 Example: Associated Press
6.2 Example: the great library heist
6.2.1 LDA on chapters
6.2.2 Per-document classification
6.2.3 By word assignments: augment()
6.3 Tuning number of topics
7 Text classification
References
Appendix
A Reviews on regular expressions
A.1 Metacharacters and POSIX character classes
A.2 Unicode Code Points, Categories, Blocks, and Scripts
A.2.1 Unicode categories
A.2.2 Unicode scripts
A.2.3 Unicode blocks
A.3 Greedy and lazy quantifiers
A.4 Looking ahead and back
A.5 Backreferences
B Text processing examples in R
B.1 Replacing and removing
B.2 Combining and splitting
B.3 Extracting text from pdf and other files
B.3.1 Office documents
B.3.2 Images
Notes for “Text Mining with R: A Tidy Approach”
B.1 Replacing and removing