Twitter Emoji Encoding Problems with twitteR and R

I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how encoding for emoji works, but I stumbled upon the same problem and solved it.

You want to map \xed��\xed�� to its name-decoded version: hundred points. A sensible way could be to scrape a dictionary online and use a key, such as the Unicode code point, to replace it. In this case it would be U+1F4AF.
The conversions you show are not different encodings but different notation for the same encoded emoji:

  1. as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
  2. iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.

So using Unicode directly isn't feasible. Another way could be to use a dictionary that already encodes emoji in the <ed>...<ed>... way, like this emoji list. Voilà! Except that list is incomplete, because it comes from a dictionary that contains fewer emoticons.

The fast solution is to simply scrape a more complete dictionary and map each <ed>...<ed>... sequence to its corresponding English text translation. I have done that already and posted it here.

Still, the fact that nobody else had posted a list with the proper encoding bugged me. In fact, most dictionaries I found use a UTF-8 encoding with not an <ed>...<ed>... representation but rather <f0>.... It turns out both are in circulation as UTF-8 serializations of the same code point U+1F4AF; the bytes are just produced differently.

Long answer: the tweet is read as UTF-16 and then converted to UTF-8, and this is where the conversions diverge. When each two-byte UTF-16 code unit is converted on its own, each surrogate is encoded separately and the result is the six-byte sequence <ed>...<ed>... (this variant is known as CESU-8). When the surrogate pair is first combined into the code point, the result is the proper four-byte UTF-8 sequence <f0>....
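To make the two byte layouts concrete, here is a short Python sketch (my illustration, not part of the R pipeline) that produces both serializations of U+1F4AF:

```python
# Sketch: the same code point U+1F4AF serialized two ways.
emoji = "\U0001F4AF"  # HUNDRED POINTS SYMBOL

# Proper UTF-8: the code point becomes the four bytes <f0><9f><92><af>.
utf8 = emoji.encode("utf-8")

# CESU-8 style: build the UTF-16 surrogate pair and encode each half
# on its own, giving the six bytes <ed><a0><bd><ed><b2><af>.
hi = 0xD800 + ((0x1F4AF - 0x10000) >> 10)    # 0xD83D
lo = 0xDC00 + ((0x1F4AF - 0x10000) & 0x3FF)  # 0xDCAF
cesu8 = (chr(hi) + chr(lo)).encode("utf-8", "surrogatepass")

print(utf8.hex())   # f09f92af
print(cesu8.hex())  # eda0bdedb2af
```

Strict UTF-8 codecs reject the six-byte form, which is why the lenient surrogatepass error handler is needed to produce it here.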

So a slower (but more deliberate) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, and convert it back to UTF-8 by pairs; you'll end up with two <ed>... sequences. These two <ed>... sequences are known as the high-low surrogate pair representation of the code point U+xxxxx.

As an example:

unicode <- 0x1F4AF

# Multibyte Version
intToUtf8(unicode)

# Byte-pair Version
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)

Returns:

[1] "\xf0\u009f\u0092�"
[1] "\xed��\xed��"

Which, again, using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:

[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"

PS1.:
The function unicode2hilo is a simple linear transformation from a code point to its hi-lo surrogate pair, and hilo2unicode is its inverse:

unicode2hilo <- function(unicode){
  hi = floor((unicode - 0x10000)/0x400) + 0xd800
  lo = (unicode - 0x10000) + 0xdc00 - (hi - 0xd800)*0x400
  hilo = paste('0x', as.hexmode(c(hi, lo)), sep = '')
  return(hilo)
}

hilo2unicode <- function(hi, lo){
  unicode = (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
  unicode = paste('0x', as.hexmode(unicode), sep = '')
  return(unicode)
}
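For comparison, a Python sketch of the same two helpers (my addition; the names mirror the R functions above):

```python
def unicode2hilo(cp):
    # Code point (>= 0x10000) -> (high, low) UTF-16 surrogate pair.
    hi = 0xD800 + ((cp - 0x10000) >> 10)
    lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)
    return hi, lo

def hilo2unicode(hi, lo):
    # (high, low) surrogate pair -> code point (inverse of the above).
    return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000

print([hex(v) for v in unicode2hilo(0x1F4AF)])  # ['0xd83d', '0xdcaf']
print(hex(hilo2unicode(0xD83D, 0xDCAF)))        # 0x1f4af
```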

PS2.:
I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.

PS3.:
To replace an emoji with its English text, tag, hash, or anything else you want to map it to, I would suggest using DFS over a graph of emojis, because some emojis' byte sequences are the concatenation of simpler ones. For example, <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is "man cartwheeling", while independently <f0><9f><a4><b8> is "person cartwheeling", <e2><80><8d> is an invisible zero-width joiner, <e2><99><82> is the male sign, and <ef><b8><8f> is an invisible variation selector. While "man cartwheeling" and "person cartwheeling" plus "male sign" are obviously semantically related, I prefer the more faithful translation.

Emoji in R [UTF-8 encoding]

The string is invalid UTF-8, as indicated. What you have there is UTF-16 encoded as UTF-8: \xED\xA0\xBD is the high surrogate U+D83D, and \xED\xB2\x83 is the low surrogate U+DC83.

If you apply the magical High,Low -> Codepoint formula, you'll end up with the actual codepoint:

(0xD83D - 0xD800) * 0x400 + 0xDC83 - 0xDC00 + 0x10000 = 0x1F483
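A quick sanity check of that arithmetic (a Python sketch):

```python
# Apply the high/low -> code point formula to the pair above.
cp = (0xD83D - 0xD800) * 0x400 + 0xDC83 - 0xDC00 + 0x10000
print(hex(cp))                   # 0x1f483
print(chr(cp) == "\U0001F483")   # True
```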

You'll see this is the dancer emoji. Unfortunately I don't have a concrete suggestion for you, as I'm not that familiar with R, but I can say you'd certainly want to get yourself into a position where this data isn't double encoded. Hope that helps bump you along in the correct direction.

Emoticons in Twitter Sentiment Analysis in r

This should get rid of the emoticons, using iconv as suggested by ndoogan.

Some reproducible data:

require(twitteR) 
# note that I had to register my twitter credentials first
# here's the method: http://stackoverflow.com/q/9916283/1036500
s <- searchTwitter('#emoticons', cainfo="cacert.pem")

# convert to data frame
df <- do.call("rbind", lapply(s, as.data.frame))

# inspect, yes there are some odd characters in row five
head(df)

text
1 ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania ;-)
2 “@teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons & \nall the other stuff i cant see on android!" \n#Emoticons
3 E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4 #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5 I use emoticons too much. #addicted #admittingit #emoticons <ed><U+00A0><U+00BD><ed><U+00B8><U+00AC><ed><U+00A0><U+00BD><ed><U+00B8><U+0081> haha
6 What you text What I see #Emoticons http://t.co/BKowBSLJ0s

Here's the key line that will remove the emoticons:

# Clean text to remove odd characters
df$text <- sapply(df$text,function(row) iconv(row, "latin1", "ASCII", sub=""))

Now inspect again, to see if the odd characters are gone (see row 5)

head(df)    
text
1 ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania ;-)
2 @teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons & \nall the other stuff i cant see on android!" \n#Emoticons
3 E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4 #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5 I use emoticons too much. #addicted #admittingit #emoticons haha
6 What you text What I see #Emoticons http://t.co/BKowBSLJ0s

Emoji Sentiment Analysis in R

Check this discussion: VaderSentiment: unable to update emoji sentiment score

"Vader transforms emojis to their word representation prior to extracting sentiment"

Basically, from what I tested, emoji values are hidden but form part of the score and can influence it. If you need the score for a specific emoji, check library(lexicon) and run data.frame(hash_emojis_identifier) (a data frame that maps emoji identifiers to a lexicon format) and data.frame(hash_sentiment_emojis) to get each emoji's sentiment value. It is not possible, though, to determine from that what the impact of a series of emojis was on the total message score, because vader does not expose how it calculates their cumulative contribution.

You can evaluate the impact of the emojis, though, with a simple difference between the total score of the message with emojis and the score without them:

allvals <- NULL
for (i in seq_along(data_sample)) {
  outs <- vader_df(data_sample[i])
  allvals <- rbind(allvals, outs)
}

allvalswithout <- NULL
for (i in seq_along(data_samplewithout)) {
  outs <- vader_df(data_samplewithout[i])
  allvalswithout <- rbind(allvalswithout, outs)
}

emojiscore <- allvals$compound-allvalswithout$compound

Then:

allvals <- cbind(allvals,emojiscore) 

Now, for large datasets it would be ideal to automate the process of removing emojis from the texts. Here I just removed them manually to demonstrate this kind of approach to the problem.
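One hedged way to automate that removal (my sketch, not from the original answer; it uses Python's standard unicodedata module, and note that category "So" also catches non-emoji symbols):

```python
import unicodedata

def strip_emoji(text):
    # Drop characters in Unicode category "So" (Symbol, other), which
    # covers most emoji but also other pictographic symbols.
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")

print(strip_emoji("Nice work \U0001F44D!"))  # Nice work !
```

Compound emoji leave their invisible joiners (category Cf) behind, so strip those as well if they matter for your scores.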

R, after utf8 filtering still weird characters

I found a way to filter out the emoticons. After a lot of searching I found that there is a function that converts a character vector between encodings: see the iconv documentation.

...
text = tweets_df$text
# remove emoticons
text <- sapply(text,function(row) iconv(row, "latin1", "ASCII", sub=""))
corpus = Corpus(VectorSource(text))
...

Extract emojis from tweets in R

I wrote a function for this purpose in my package rwhatsapp.

As your example is a whatsapp dataset, you can test it directly using the package (install via remotes::install_github("JBGruber/rwhatsapp"))

df <- rwhatsapp::rwa_read("_chat.txt")
#> Warning in readLines(x, encoding = encoding, ...): incomplete final line found
#> on '_chat.txt'
df
#> # A tibble: 392 x 6
#> time author text source emoji emoji_name
#> <dttm> <fct> <chr> <chr> <list> <list>
#> 1 2015-06-25 01:42:12 <NA> : ‎Vishnu Gaud … /home/johan… <NULL> <NULL>
#> 2 2015-06-25 01:42:12 <NA> : ‎You were added /home/johan… <NULL> <NULL>
#> 3 2016-12-18 01:57:38 Shahain :<‎image omitted> /home/johan… <NULL> <NULL>
#> 4 2016-12-21 21:54:46 Pankaj S… :<‎image omitted> /home/johan… <NULL> <NULL>
#> 5 2016-12-21 21:57:45 Shahain :Wow /home/johan… <NULL> <NULL>
#> 6 2016-12-21 22:48:51 Sakshi :<‎image omitted> /home/johan… <NULL> <NULL>
#> 7 2016-12-21 22:49:00 Sakshi :<‎image omitted> /home/johan… <NULL> <NULL>
#> 8 2016-12-21 22:50:12 Neha Wip… :Awsum /home/johan… <chr … <chr [4]>
#> 9 2016-12-21 22:51:21 Sakshi : /home/johan… <chr … <chr [1]>
#> 10 2016-12-21 22:57:01 Ganguly : /home/johan… <chr … <chr [4]>
#> # … with 382 more rows

I extract the emojis from text and store them in a list column as each text can contain multiple emojis. Use unnest to unnest the list column.

library(tidyverse)
df %>%
  select(time, emoji) %>%
  unnest(emoji)
#> # A tibble: 654 x 2
#> time emoji
#> <dttm> <chr>
#> 1 2016-12-21 22:50:12
#> 2 2016-12-21 22:50:12
#> 3 2016-12-21 22:50:12
#> 4 2016-12-21 22:50:12
#> 5 2016-12-21 22:51:21
#> 6 2016-12-21 22:57:01
#> 7 2016-12-21 22:57:01
#> 8 2016-12-21 22:57:01
#> 9 2016-12-21 22:57:01
#> 10 2016-12-21 23:28:51
#> # … with 644 more rows

You can use this function with any text. The only thing you need to do first is to store the text in a data.frame in a column called text (I use tibble here as it prints nicer):

df <- tibble::tibble(
  text = readLines("/home/johannes/_chat.txt")
)
#> Warning in readLines("/home/johannes/_chat.txt"): incomplete final line found on
#> '/home/johannes/_chat.txt'
rwhatsapp::lookup_emoji(df, text_field = "text")
#> # A tibble: 764 x 3
#> text emoji emoji_name
#> <chr> <list> <list>
#> 1 25/6/15, 1:42:12 AM: ‎Vishnu Gaud created this group <NULL> <NULL>
#> 2 25/6/15, 1:42:12 AM: ‎You were added <NULL> <NULL>
#> 3 18/12/16, 1:57:38 AM: Shahain: <‎image omitted> <NULL> <NULL>
#> 4 21/12/16, 9:54:46 PM: Pankaj Sinha: <‎image omitted> <NULL> <NULL>
#> 5 21/12/16, 9:57:45 PM: Shahain: Wow <NULL> <NULL>
#> 6 21/12/16, 10:48:51 PM: Sakshi: <‎image omitted> <NULL> <NULL>
#> 7 21/12/16, 10:49:00 PM: Sakshi: <‎image omitted> <NULL> <NULL>
#> 8 21/12/16, 10:50:12 PM: Neha Wipro: Awsum <chr [4]> <chr [4]>
#> 9 21/12/16, 10:51:21 PM: Sakshi: <chr [1]> <chr [1]>
#> 10 21/12/16, 10:57:01 PM: Ganguly: <chr [4]> <chr [4]>
#> # … with 754 more rows

more details

The way this works under the hood is with a simple dictionary and matching approach. First I split the text into characters and put the characters in a data.frame together with the line id (this is a rewrite of unnest_tokens from tidytext):

lines <- readLines("/home/johannes/_chat.txt")
#> Warning in readLines("/home/johannes/_chat.txt"): incomplete final line found on
#> '/home/johannes/_chat.txt'
id <- seq_along(lines)
l <- stringi::stri_split_boundaries(lines, type = "character")

out <- tibble(id = rep(id, sapply(l, length)), emoji = unlist(l))

Then I match the characters against a dataset of emoji characters (see ?rwhatsapp::emojis for more info):

out <- add_column(out,
                  emoji_name = rwhatsapp::emojis$name[
                    match(out$emoji, rwhatsapp::emojis$emoji)
                  ])
out
#> # A tibble: 28,652 x 3
#> id emoji emoji_name
#> <int> <chr> <chr>
#> 1 1 "2" <NA>
#> 2 1 "5" <NA>
#> 3 1 "/" <NA>
#> 4 1 "6" <NA>
#> 5 1 "/" <NA>
#> 6 1 "1" <NA>
#> 7 1 "5" <NA>
#> 8 1 "," <NA>
#> 9 1 " " <NA>
#> 10 1 "1" <NA>
#> # … with 28,642 more rows

Now the new column contains either an emoji name or NA when no emoji was found. Removing the NAs, only the emojis are left.

out <- out[!is.na(out$emoji_name), ]
out
#> # A tibble: 656 x 3
#> id emoji emoji_name
#> <int> <chr> <chr>
#> 1 8 grinning face
#> 2 8 grinning face
#> 3 8 thumbs up: medium-light skin tone
#> 4 8 thumbs up: medium-light skin tone
#> 5 9 see-no-evil monkey
#> 6 10 slightly smiling face
#> 7 10 slightly smiling face
#> 8 10 thumbs up: light skin tone
#> 9 10 thumbs up: light skin tone
#> 10 11 face with tears of joy
#> # … with 646 more rows

The disadvantage of this approach is that you rely on the completeness of the emoji data. However, the dataset in the package includes all known emojis from the Unicode website (version 13).
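The dictionary-and-match idea can be sketched in a few lines outside R as well (a minimal Python illustration; the two-entry table is a stand-in for the package's full emoji dataset):

```python
# Split the text into characters and keep the ones found in an
# emoji -> name lookup table (stand-in for rwhatsapp::emojis).
emoji_names = {
    "\U0001F600": "grinning face",
    "\U0001F648": "see-no-evil monkey",
}

def lookup_emoji(text):
    return [(ch, emoji_names[ch]) for ch in text if ch in emoji_names]

print(lookup_emoji("Awsum \U0001F600\U0001F600"))
```

Splitting per code point, as here, misses multi-code-point emoji (ZWJ sequences, skin tones); the package avoids this by splitting on grapheme boundaries with stringi::stri_split_boundaries.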

Remove unicode encoded emojis from Twitter tweet

My suggestion would be to create an array of the values you would like to remove. In the pattern you need to escape each \ by adding another backslash, or prefix the string with r (a raw string) so backslashes do not need to be escaped.

import re

# Surrogate-pair escape sequences for the emoji to remove
to_remove_arr = [u"\ud83d\udcf8", u"\ud83c\uddeb\ud83c\uddf7"]
pattern_str = "|".join(map(re.escape, to_remove_arr))
text = re.sub(pattern_str, "", text)

Edit: the above solution removes specific Unicode characters. To remove all non-ASCII characters:

text = text.encode("ascii", "ignore").decode()

Edit: to remove only emojis I found:

def strip_emoji(text):
    # Ranges cover misc. symbols, emoticons/pictographs, and transport
    # symbols; this is not a complete list of emoji blocks.
    RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
    return RE_EMOJI.sub(r'', text)

