Emoticons in Twitter Sentiment Analysis in R

This should get rid of the emoticons, using iconv as suggested by ndoogan.

Some reproducible data:

require(twitteR) 
# note that I had to register my twitter credentials first
# here's the method: http://stackoverflow.com/q/9916283/1036500
s <- searchTwitter('#emoticons', cainfo="cacert.pem") 

# convert to data frame
df <- do.call("rbind", lapply(s, as.data.frame))

# inspect, yes there are some odd characters in row five
head(df)

                                                                                                                                                text
1                                                                      ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania  ;-)
2 “@teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons & \nall the other stuff i cant see on android!" \n#Emoticons
3                      E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4                                                #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5  I use emoticons too much. #addicted #admittingit #emoticons <ed><U+00A0><U+00BD><ed><U+00B8><U+00AC><ed><U+00A0><U+00BD><ed><U+00B8><U+0081> haha
6                                                                                         What you text What I see #Emoticons http://t.co/BKowBSLJ0s

Here's the key line that will remove the emoticons:

# Clean text to remove odd characters
df$text <- sapply(df$text,function(row) iconv(row, "latin1", "ASCII", sub=""))

Now inspect again to see whether the odd characters are gone (see row 5):

head(df)    
                                                                                                                               text
1                                                                     ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania  ;-)
2 @teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons & \nall the other stuff i cant see on android!" \n#Emoticons
3                     E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4                                               #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5                                                                                 I use emoticons too much. #addicted #admittingit #emoticons  haha
6                                                                                        What you text What I see #Emoticons http://t.co/BKowBSLJ0s
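
For anyone without Twitter credentials, the same cleaning step can be tried on a hand-made vector. This is only a sketch: the sample strings are made up, and it assumes the input is UTF-8 (hence from = "UTF-8" rather than "latin1"). Note that sub = "" drops every non-ASCII character, accented letters included.

```r
# Made-up sample tweets (not from the search above); the first contains two emoji
tweets <- c("I use emoticons too much. #emoticons \U0001F62C\U0001F601 haha",
            "What you text What I see #Emoticons")

# iconv() is vectorised, so sapply() is not strictly required;
# characters that cannot be represented in ASCII are replaced by sub = "" (dropped)
clean <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

print(clean)
```

The second string is pure ASCII and comes back unchanged; in the first, only the two emoji disappear, leaving a double space behind.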

Emoji Sentiment Analysis in R

Check this discussion: VaderSentiment: unable to update emoji sentiment score

"Vader transforms emojis to their word representation prior to extracting sentiment"

From what I tested, emoji values are hidden but are part of the score and can influence it. If you need the score for a specific emoji, load library(lexicon) and run data.frame(hash_emojis_identifier) (a data frame that maps each emoji to an identifier in lexicon format) and data.frame(hash_sentiment_emojis) to get each emoji's sentiment value. What you cannot determine from those tables is the impact of a series of emojis on the total message score, because that depends on how vader combines them into the cumulative score.

You can, however, estimate the impact of the emojis by taking the difference between the compound score of the message with emojis and the score of the same message without them:

library(vader)

# data_sample: character vector of messages with emojis
# data_samplewithout: the same messages with the emojis removed
# vader_df() takes a character vector, so no explicit loop or rbind() is needed
allvals <- vader_df(data_sample)
allvalswithout <- vader_df(data_samplewithout)

# difference in compound score attributable to the emojis
emojiscore <- allvals$compound - allvalswithout$compound

Then:

allvals <- cbind(allvals,emojiscore)

For large datasets it would be ideal to automate removing the emojis from the texts. Here I just removed them manually to illustrate the approach.
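
One way the removal could be automated is sketched below in base R. It is an assumption on my part, not what was done for the scores above: it strips every non-ASCII character, so it also drops accented letters, and the sample messages are made up.

```r
# Made-up stand-in for data_sample
data_sample <- c("Great day \U0001F600",
                 "Not sure about this \U0001F62C\U0001F62C",
                 "No emoji here")

# Replace each run of non-ASCII characters with a space, then tidy the whitespace.
# Caveat: this also removes accented letters and other legitimate non-ASCII text.
strip_non_ascii <- function(x) {
  x <- gsub("[^\x01-\x7F]+", " ", x)
  trimws(gsub("\\s+", " ", x))
}

data_samplewithout <- strip_non_ascii(data_sample)
print(data_samplewithout)

# With the vader package installed, the comparison then becomes:
# emojiscore <- vader::vader_df(data_sample)$compound -
#   vader::vader_df(data_samplewithout)$compound
```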

Analysis of emojis used in tweets with specific hashtags or keywords in R

When setting my access keys and tokens, I had mistakenly used access_token_secret rather than access_secret.

R tweets with emojis

In R there tends to be a package for most things. In this case it is textclean, which brings along the lexicon package and its many dictionaries. textclean offers two functions for this: replace_emoji and replace_emoji_identifier.

text = c("text text. \U0001f600", "i am here\U0001f600")

# replace emoji with identifier:
textclean::replace_emoji_identifier(text)
[1] "text text. lexiconvygwtlyrpywfarytvfis " "i am here lexiconvygwtlyrpywfarytvfis " 

# replace emoji with text representation
textclean::replace_emoji(text)
[1] "text text. grinning face " "i am here grinning face "

Next you could use sentimentr for sentiment scoring on the emojis, or quanteda for text analysis. If you just want to check for the presence of emojis, as in your expected output:

grepl("lexicon[[:alpha:]]{20}", textclean::replace_emoji_identifier(text))
[1] TRUE TRUE

remove emoticons in R using tm package

You can use gsub to get rid of all non-ASCII characters.

Texts = c("Let the stormy clouds chase, everyone from the place ☁  ♪ ♬",
    "See you soon brother ☮ ",
    "A boring old-fashioned message" ) 

gsub("[^\x01-\x7F]", "", Texts)
[1] "Let the stormy clouds chase, everyone from the place    "
[2] "See you soon brother  "                                  
[3] "A boring old-fashioned message"

Details:
You can specify character classes in regexes with [ ]. When the class description starts with ^ it means everything except these characters. Here, I have specified everything except characters 1-127, i.e. everything except standard ASCII, and that those characters should be replaced with the empty string.
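
To see which characters fall outside the 1-127 range, it can help to look at the code points directly. A small sketch using base R's utf8ToInt() on one of the sample strings:

```r
x <- "See you soon brother \u262E "

# Code point of every character in the string
codes <- utf8ToInt(x)

# Characters outside the ASCII range 1-127 are the ones the pattern removes
non_ascii <- codes[codes > 127]
print(sprintf("U+%04X", non_ascii))

# Rebuilding the string from the remaining code points matches the gsub() result
kept <- intToUtf8(codes[codes <= 127])
print(identical(kept, gsub("[^\x01-\x7F]", "", x)))
```

Only the peace symbol (U+262E) is above 127, so it is the only character the pattern strips from this string.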

Removing emojis in R

It depends a bit on what exactly your strings look like.

In your case, a plain regex may work. Replacing the emoji with a space may be preferable to just removing it; otherwise you risk ending up with two words merged into one.

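# Match one <U+XXXX> escape at a time; a greedy '<U.*>' would also
# swallow any text sitting between two escapes
stringr::str_replace_all(string = "It's mind-blowing! <U+0001F603>",
                         pattern = '<U\\+[0-9A-Fa-f]+>',
                         replacement = " ")

You may want to add stringr::str_squish() to drop redundant spaces.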
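
A base-R sketch of the same idea, for strings that contain more than one escape. It assumes the emoji appear as literal <U+XXXX> text (as in printed data frames), and the sample string is made up. The pattern matches one escape at a time, so the words between two escapes survive.

```r
x <- "It's mind-blowing! <U+0001F603> Really <U+0001F44D> great"

# Replace each <U+XXXX> escape with a space ...
cleaned <- gsub("<U\\+[0-9A-Fa-f]+>", " ", x)

# ... then collapse repeated whitespace, the base-R counterpart of str_squish()
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
```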

Twitter emoji encoding problems with twitteR and R

I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how the encoding for emoji works, but I stumbled upon the same problem and solved it.

You want to map \xed\xa0\xbd\xed\xb2\xaf to its name-decoded version: hundred points. A sensible way could be to scrape a dictionary online and use a key, such as the Unicode code point, to replace it. In this case it would be U+1F4AF.
The conversions you show are not different encodings but different notations for the same encoded emoji:

as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.

So using Unicode directly isn't feasible. Another way could be to use a dictionary that already encodes emoji in the <ed>...<ed>... way, like the one here: emoji list. Voilà! Only her list is incomplete because it comes from a dictionary that contains fewer emoticons.

The fast solution is to simply scrape a more complete dictionary and map each <ed>...<ed>... sequence to its corresponding English text translation. I have done that already and posted here.

It bugged me, though, that nobody else had posted a list with this encoding. In fact, most dictionaries I found used not an <ed>...<ed>... representation but rather <f0>.... It turns out both are byte encodings of the same code point U+1F4AF; only the bytes are produced differently. Strictly speaking, only the four-byte <f0>... form is valid UTF-8; the <ed>...<ed>... form encodes each UTF-16 surrogate separately (often called CESU-8).

Long answer. The tweet is read in UTF-16 and then converted to UTF-8, and this is where the conversions diverge. When each two-byte (16-bit) code unit is converted on its own, the result is <ed>...<ed>...; when the four bytes of the pair are first combined into a single code point, the result is the UTF-8 <f0>.... The difference comes from how the converter treats surrogate pairs, not from the architecture of your processor.

So a slower (but more deliberate) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, and convert it back to UTF-8 one 16-bit unit at a time, so that you end up with two <ed>... sequences. These two <ed>... sequences are the high-low surrogate pair representation of the Unicode code point U+xxxxx.

As an example:

unicode <- 0x1F4Af

# Multibyte Version
intToUtf8(unicode)

# Byte-pair Version
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)

Returns:

[1] "\xf0\u009f\u0092�"
[1] "\xed��\xed��"

Which, again, using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:

[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"
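
Both byte sequences can also be derived by plain arithmetic, which may make the relationship clearer. The sketch below is my own illustration (the helper three_bytes is hypothetical, not part of any package): it encodes U+1F4AF once as regular UTF-8 and once by splitting it into a surrogate pair and encoding each half with the three-byte UTF-8 bit layout.

```r
cp <- 0x1F4AF  # hundred points

# Standard UTF-8: the code point is encoded as a single four-byte sequence
utf8_bytes <- as.character(charToRaw(intToUtf8(cp)))

# Surrogate route: split the code point into a UTF-16 high/low pair ...
v  <- cp - 0x10000
hi <- 0xD800 + v %/% 0x400
lo <- 0xDC00 + v %% 0x400

# ... then encode each 16-bit unit with the three-byte UTF-8 bit layout
# (hypothetical helper, written out for illustration)
three_bytes <- function(u) {
  c(0xE0 + u %/% 0x1000,
    0x80 + (u %/% 0x40) %% 0x40,
    0x80 + u %% 0x40)
}
surrogate_bytes <- sprintf("%02x", as.integer(c(three_bytes(hi), three_bytes(lo))))

utf8_notation      <- paste0("<", utf8_bytes, ">", collapse = "")
surrogate_notation <- paste0("<", surrogate_bytes, ">", collapse = "")
print(utf8_notation)       # "<f0><9f><92><af>"
print(surrogate_notation)  # "<ed><a0><bd><ed><b2><af>"
```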

PS1.:
The function unicode2hilo is a simple linear transformation from a Unicode code point to its hi-lo surrogate pair; hilo2unicode is the inverse.

unicode2hilo <- function(unicode){
   hi = floor((unicode - 0x10000)/0x400) + 0xd800
   lo = (unicode - 0x10000) + 0xdc00 - (hi-0xd800)*0x400
   hilo = paste('0x', as.hexmode(c(hi,lo)), sep = '')
   return(hilo)
}

hilo2unicode <- function(hi,lo){
   unicode = (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000 
   unicode = paste('0x', as.hexmode(unicode), sep = '')
   return(unicode)
}

PS2.:
I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.
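
A quick comparison of the two targets, as a sketch with a made-up string (the variable names are mine). With ASCII as the target the accented letter is escaped together with the emoji; with latin1 it is kept.

```r
tweet <- "caf\u00e9 \U0001F4AF"

# Target ASCII: the accented letter is escaped along with the emoji
as_ascii <- iconv(tweet, from = "UTF-8", to = "ASCII", sub = "byte")

# Target latin1: the accented letter survives, only the emoji is escaped
as_latin1 <- iconv(tweet, from = "UTF-8", to = "latin1", sub = "byte")

print(as_ascii)
print(enc2utf8(as_latin1))
```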

PS3.:
To replace the emoji with its English text, tag, hash, or anything else you want to map it to, I would suggest using DFS in a graph of emojis, because some emojis are encoded as the concatenation of other, simpler code points. For example, <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is man cartwheeling, while on their own <f0><9f><a4><b8> is person cartwheeling, <e2><80><8d> is an invisible zero-width joiner (U+200D), <e2><99><82> is the male sign, and <ef><b8><8f> is an invisible variation selector (U+FE0F). While man cartwheeling and person cartwheeling male sign are obviously semantically related, I prefer the more faithful translation.
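
As a simpler stand-in for the DFS idea, the sketch below replaces the longest sequences first, so that a composite emoji is translated as a whole before its parts are considered. This is not a graph search, just longest-match-first replacement; the two-entry dictionary and the function name replace_emoji_longest_first are hypothetical.

```r
# Hypothetical two-entry dictionary; a real one would be scraped as described above
keys   <- c("\U0001F938\u200D\u2642\uFE0F",  # person cartwheeling + ZWJ + male sign + VS16
            "\U0001F938")                     # person cartwheeling on its own
values <- c("man cartwheeling", "person cartwheeling")

replace_emoji_longest_first <- function(x, keys, values) {
  # Try longer byte sequences first so a composite emoji is not split into its parts
  for (i in order(nchar(keys, type = "bytes"), decreasing = TRUE)) {
    x <- gsub(keys[i], paste0(" ", values[i], " "), x, fixed = TRUE)
  }
  trimws(gsub("\\s+", " ", x))
}

tweets <- c("look \U0001F938\u200D\u2642\uFE0F", "look \U0001F938")
translated <- replace_emoji_longest_first(tweets, keys, values)
print(translated)
```

Replacing in the opposite order would turn the first tweet into person cartwheeling followed by leftover joiner and male-sign characters, which is the less faithful translation mentioned above.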
