Text Mining with R Notes

Textbook

Tutorial

Definitions

  • A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.

Key functions

  • unnest_tokens(): tokenizes the text and returns it in one-token-per-row format (one word per row by default).

  • anti_join(get_stopwords()): removes stop words. get_stopwords() returns the stop words in tidy form, so an anti_join against it drops them from the tokenized data.
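
The two functions above can be sketched as follows. This is a minimal example, assuming the tidytext and stopwords packages are installed; the data frame `docs` and its columns `doc_id` and `text` are hypothetical names used only for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical example data: one row per document
docs <- tibble(
  doc_id = c(1, 2),
  text   = c("The phone has scratches.", "The phone has no scratches.")
)

# Tokenize: one word per row. By default unnest_tokens() lowercases
# the text and strips punctuation.
tidy_words <- docs %>%
  unnest_tokens(output = word, input = text)

# Remove stop words: keep only rows whose word is NOT in the stop word list
tidy_words_clean <- tidy_words %>%
  anti_join(get_stopwords(), by = "word")

tidy_words_clean
```

Before removing stop words for sentiment work, it may be worth checking whether negators such as "no" or "not" are in the stop word list you use, since dropping them can change the meaning of a sentence.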

R code

# Loading necessary libraries
library(sentimentr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)

# Example text
mytext <- c("The phone has scratches.", "The phone has no scratches.")

# Converting text into sentences
mytext <- get_sentences(mytext)

# Performing sentiment analysis
sentiment(mytext)
##    element_id sentence_id word_count  sentiment
## 1:          1           1          4 -0.3000000
## 2:          2           1          5  0.2683282

# Alternative: NRC emotion lexicon via syuzhet
library(syuzhet)
## 
## Attaching package: 'syuzhet'
## The following object is masked from 'package:sentimentr':
## 
##     get_sentences
# Example sentences
sentences <- c("The phone has scratches.", "The phone has no scratches.")

# Get sentiment scores
sentiment_scores <- get_nrc_sentiment(sentences)

# View scores
sentiment_scores
##   anger anticipation disgust fear joy sadness surprise trust negative positive
## 1     0            0       0    0   0       0        0     0        0        0
## 2     0            0       0    0   0       0        0     0        0        0
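# This does not work well: every NRC score is 0 for both sentences,
# so the negation is not detected, whereas sentimentr above scores
# the two sentences -0.30 and 0.27.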
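
Because loading syuzhet masks sentimentr's get_sentences(), any later call to get_sentences() resolves to the syuzhet version. A minimal sketch of one way around this, assuming both packages are installed, is to call the sentimentr functions with an explicit namespace prefix. The variable names `sents` and `scores` are only illustrative.

```r
library(sentimentr)

mytext <- c("The phone has scratches.", "The phone has no scratches.")

# Namespace-qualified calls are unaffected by masking from syuzhet
sents  <- sentimentr::get_sentences(mytext)
scores <- sentimentr::sentiment(sents)
scores

# sentiment_by() aggregates the sentence-level scores per element
sentimentr::sentiment_by(sents)
```

The same prefix style (syuzhet::get_nrc_sentiment()) keeps it clear which package each score comes from when both are compared side by side.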
Chen Xing
Founder & Data Scientist
