Sentiment Analysis Using R

Emoji Sentiment Analysis in R

Check this discussion: "VaderSentiment: unable to update emoji sentiment score"

"Vader transforms emojis to their word representation prior to extracting sentiment"

From what I tested, emoji values are hidden in the output but are part of the score and can influence it. If you need the score for a specific emoji, load library(lexicon) and run data.frame(hash_emojis_identifier) (a data frame that matches emojis to identifiers in lexicon format) and data.frame(hash_sentiment_emojis) to get each emoji's sentiment value. From those tables alone, however, you cannot determine the impact of a series of emojis on the total message score, because libraries such as vader and lexicon do not expose how the cumulative contribution is calculated.
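For a quick look at those two tables (a minimal sketch; both data sets ship with the lexicon package):

library(lexicon)

head(data.frame(hash_emojis_identifier)) # emoji byte sequences matched to identifiers
head(data.frame(hash_sentiment_emojis)) # identifiers matched to sentiment values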

You can evaluate the impact of the emojis, though, by taking the simple difference between the compound score of the message with emojis and the score of the same message without them:

library(vader)

# Score each message with emojis included
allvals <- NULL
for (i in 1:length(data_sample)){
  outs <- vader_df(data_sample[i])
  allvals <- rbind(allvals, outs)
}

# Score the same messages with emojis removed
allvalswithout <- NULL
for (i in 1:length(data_samplewithout)){
  outs <- vader_df(data_samplewithout[i])
  allvalswithout <- rbind(allvalswithout, outs)
}

emojiscore <- allvals$compound - allvalswithout$compound

Then:

allvals <- cbind(allvals, emojiscore)

For large datasets it would be ideal to automate the removal of emojis from the texts. Here I just removed them manually to illustrate the approach.
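One way to automate this (a rough sketch, assuming the emojis fall in the common Unicode emoji and symbol ranges; a dedicated emoji package would be more robust):

# Strip characters in common emoji/symbol Unicode ranges (an approximation)
strip_emojis <- function(x) {
  gsub("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F]", "", x, perl = TRUE)
}
data_samplewithout <- strip_emojis(data_sample)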

Dutch sentiment analysis using R

Sentiment analysis (using a dictionary) is basically just a pattern-matching task. I think this becomes clear when using the tidytext package and reading the accompanying book.

So I wouldn't bother with such a complex setup here. Instead, I would convert the dictionary they are using (the nl-sentiment.xml file from the CLiPS pattern library, loaded from GitHub in the code below) into a data.frame and then use tidytext. Unfortunately, the dictionary is stored in XML format and I'm not very familiar with that, so the code looks a little hacky:

library(tidyverse)
library(xml2)
library(tidytext)

sentiment_nl <- read_xml(
  "https://raw.githubusercontent.com/clips/pattern/master/pattern/text/nl/nl-sentiment.xml"
) %>%
  as_list() %>%
  .[[1]] %>%
  map_df(function(x) {
    tibble::enframe(attributes(x))
  }) %>%
  mutate(id = cumsum(str_detect("form", name))) %>% # new id at each "form" attribute, i.e. one per word entry
  unnest(value) %>%
  pivot_wider(id_cols = id) %>%
  mutate(form = tolower(form), # lowercase all words to ignore case during matching
         polarity = as.numeric(polarity),
         subjectivity = as.numeric(subjectivity),
         intensity = as.numeric(intensity),
         confidence = as.numeric(confidence))

But the output is correct for the purpose:

head(sentiment_nl)
#> # A tibble: 6 x 11
#> id form cornetto_id cornetto_synset… wordnet_id pos sense polarity
#> <int> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 1 amst… r_a-16677 "" "" JJ van … 0
#> 2 2 ange… r_a-8929 "" "" JJ Enge… 0.1
#> 3 3 arab… r_a-16693 "" "" JJ van … 0
#> 4 4 arde… r_a-17252 "" "" JJ van … 0
#> 5 5 arnh… r_a-16698 "" "" JJ van … 0
#> 6 6 asse… r_a-16700 "" "" JJ van … 0
#> # … with 3 more variables: subjectivity <dbl>, intensity <dbl>,
#> # confidence <dbl>

Now we can use the functions from tidytext and the broader tidyverse to look up the words in the dictionary and attach the scores to each word. summarise() is used to get exactly one value per text (that's also why you need the text_id).

df <- data.frame(text = c("Het eten was heerlijk en de bediening was fantastisch",
                          "Verschrikkelijk. Ik had een vlieg in mijn soep",
                          "Het was oké. De bediening kon wat beter, maar het eten was wel lekker. Leuk sfeertje wel!",
                          "Ondanks dat het druk was toch op tijd ons eten gekregen. Complimenten aan de kok voor het op smaak brengen van mijn biefstuk"))

df %>%
  mutate(text_id = row_number()) %>%
  unnest_tokens(output = word, input = text, drop = FALSE) %>%
  inner_join(sentiment_nl, by = c("word" = "form")) %>%
  group_by(text_id) %>%
  summarise(text = head(text, 1),
            polarity = mean(polarity),
            subjectivity = mean(subjectivity),
            .groups = "drop")
#> # A tibble: 4 x 4
#> text_id text polarity subjectivity
#> <int> <chr> <dbl> <dbl>
#> 1 1 Het eten was heerlijk en de bediening was fanta… 0.56 0.72
#> 2 2 Verschrikkelijk. Ik had een vlieg in mijn soep -0.5 0.9
#> 3 3 Het was oké. De bediening kon wat beter, maar h… 0.6 0.98
#> 4 4 Ondanks dat het druk was toch op tijd ons eten … -0.233 0.767

As I said, more on this (and NLP) is explained on tidytextmining.com, so don't worry if this looks complicated to you now.

Sentiment analysis using R

And there is this package:

sentiment: Tools for Sentiment Analysis

sentiment is an R package with tools for sentiment analysis, including Bayesian classifiers for positivity/negativity and emotion classification.

Update 14 Dec 2012: it has been removed to the CRAN archive...

Update 15 Mar 2013: the qdap package has a polarity function, based on Jeffrey Breen's work
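A minimal usage sketch of qdap's polarity() (the toy sentences are my own; note that qdap pulls in Java via rJava):

library(qdap)

reviews <- c("I love this product", "This was a terrible experience")
polarity(reviews)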

Vader Sentiment Analysis in R

To get all the outputs for data_sample that get_vader() would give you one at a time, you will need to modify your code a bit and use vader_df() instead:

library(vader)

allvals <- NULL
for (i in 1:length(data_sample)){
  outs <- vader_df(data_sample[i]) # one row of scores per text
  allvals <- rbind(allvals, outs)
}
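Note that vader_df() is vectorised over its text argument, so the loop above can usually be collapsed into a single call (a sketch, assuming data_sample is a plain character vector):

allvals <- vader_df(data_sample)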

R sentiment analysis applied to a whole column

We can use sapply() to apply the sentiment() function to each text individually.

library(sentimentr)

tweets$text <- as.character(tweets$text)
tweets$sentiment_score <- sapply(tweets$text, function(x)
  mean(sentiment(x)$sentiment))
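Alternatively (a sketch), sentimentr's sentiment_by() already returns a per-text average of the sentence-level scores, so the sapply() can be dropped:

tweets$sentiment_score <- sentiment_by(tweets$text)$ave_sentiment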

Missing rows after sentiment analysis using dplyr in R

As pointed out in @Bas's comment, some word forms are missing from the dictionary. You can solve this with a better dictionary, with stemming, or with lemmatization.

Ideally, you would use a lemmatizer, which is superior to stemming, but in the example you've given a stemmer works fine (a lemmatization sketch follows after the output below). So you can use this to construct the dictionary:

library(tidyverse)
library(xml2)
library(tidytext)
library(textstem)

sentiment_nl <- read_xml(
  "https://raw.githubusercontent.com/clips/pattern/master/pattern/text/nl/nl-sentiment.xml"
) %>%
  as_list() %>%
  .[[1]] %>%
  map_df(function(x) {
    tibble::enframe(attributes(x))
  }) %>%
  mutate(id = cumsum(str_detect("form", name))) %>%
  unnest(value) %>%
  pivot_wider(id_cols = id) %>%
  mutate(form = tolower(form),
         stem = textstem::stem_words(form), # this is the new line
         polarity = as.numeric(polarity),
         subjectivity = as.numeric(subjectivity),
         intensity = as.numeric(intensity),
         confidence = as.numeric(confidence))

And then also stem the words in the text before matching on the stems:

df %>%
  unnest_tokens(output = word, input = text, drop = FALSE) %>%
  mutate(stem = textstem::stem_words(word)) %>%
  inner_join(sentiment_nl, by = "stem") %>%
  group_by(identifier) %>%
  summarise(text = head(text, 1),
            polarity = mean(polarity),
            subjectivity = mean(subjectivity),
            .groups = "drop")
#> # A tibble: 6 x 4
#> identifier text polarity subjectivity
#> <chr> <chr> <dbl> <dbl>
#> 1 1 Het was oké. De bediening kon wat beter, maa… 0.6 0.98
#> 2 3 Slechte bediening, van begin tot eind -0.7 0.9
#> 3 4 Het eten was heerlijk en de bediening was fa… 0.56 0.72
#> 4 5 Ondanks dat het druk was toch op tijd ons et… -0.233 0.767
#> 5 6 Geweldige service en beleefde bediening 0.7 0.95
#> 6 7 Verschrikkelijk. Ik had een vlieg in mijn so… -0.3 0.733
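For actual lemmatization of Dutch (a sketch under my own assumptions; the answer above only uses a stemmer, and textstem's default lemma lookup is English), one option is the udpipe package, which ships a pre-trained Dutch model; its lemma column could then take the place of the stem column in the joins above:

library(udpipe)

model <- udpipe_download_model(language = "dutch") # downloads a pre-trained Dutch model
ud_model <- udpipe_load_model(model$file_model)
annotated <- as.data.frame(udpipe_annotate(ud_model, x = df$text))
head(annotated[, c("doc_id", "token", "lemma")])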

Calculate sentiment of each row in a big dataset using R

The algorithm used in sentiment appears to be O(N^2) once you get above 500 or so individual reviews, which is why it suddenly takes much longer when you increase the size of the dataset significantly. Presumably it's comparing every pair of reviews in some way?

I glanced through the help file (?sentiment) and it doesn't seem to do anything that depends on pairs of reviews, so that's a bit odd. In any case, these two approaches:

library(data.table)
library(sentimentr)

reviews <- iconv(e_data$review, "") # I had a problem with UTF-8, you may not need this
x1 <- rbindlist(lapply(reviews[1:10], sentiment_by)) # one review at a time
x1[, element_id := .I]
x2 <- sentiment_by(reviews[1:10]) # all ten reviews in a single call

produce effectively the same output, which suggests that the sentimentr package has a bug in it causing it to be unnecessarily slow.

One solution is just to batch the reviews. This will break the 'by' functionality in sentiment_by(), but I think you should be able to group them yourself before you send them in (or afterwards, as it doesn't seem to matter).

batch_sentiment_by <- function(reviews, batch_size = 200, ...) {
  review_batches <- split(reviews, ceiling(seq_along(reviews) / batch_size))
  x <- rbindlist(lapply(review_batches, sentiment_by, ...))
  x[, element_id := .I]
  x[]
}

batch_sentiment_by(reviews)

Takes about 45 seconds on my machine (and should be O(N) for bigger datasets).


