How to Clean Twitter Data in R

How do I clean Twitter data in R?

Using gsub and the stringr package.

I have figured out part of the solution for removing retweets, references to screen names, hashtags, spaces, numbers, punctuation, and URLs.

clean_tweet <- gsub("&", "", unclean_tweet)                         # remove ampersands
clean_tweet <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", clean_tweet) # remove retweet/via headers
clean_tweet <- gsub("@\\w+", "", clean_tweet)                       # remove @mentions
clean_tweet <- gsub("http\\S+", "", clean_tweet)                    # remove URLs (before stripping punctuation, or the pattern only catches "http")
clean_tweet <- gsub("[[:punct:]]", "", clean_tweet)                 # remove punctuation
clean_tweet <- gsub("[[:digit:]]", "", clean_tweet)                 # remove digits
clean_tweet <- gsub("[ \t]{2,}", " ", clean_tweet)                  # collapse runs of spaces/tabs to a single space
clean_tweet <- gsub("^\\s+|\\s+$", "", clean_tweet)                 # trim leading/trailing whitespace

ref: (Hicks, 2014)
After the above, I did the below.

library(stringr)

# get rid of unnecessary spaces
clean_tweet <- str_replace_all(clean_tweet, "\\s+", " ")
# get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "https?://t.co/[A-Za-z0-9]+", "")
# take out the retweet header (there is only one)
clean_tweet <- str_replace(clean_tweet, "RT @[A-Za-z0-9_]+: ", "")
# get rid of hashtags
clean_tweet <- str_replace_all(clean_tweet, "#[A-Za-z0-9_]+", "")
# get rid of references to other screen names
clean_tweet <- str_replace_all(clean_tweet, "@[A-Za-z0-9_]+", "")

ref: (Stanton, 2013)

Before doing any of the above, I collapsed the whole vector of tweets into a single long character string using the below.

unclean_tweet <- paste(mytweets, collapse = " ")

This cleaning process has worked quite well for me, compared with the tm_map transforms.

All that I am left with now is a set of proper words and a very few improper words.
Now I only have to figure out how to remove the words that aren't proper English.
Probably I will have to subtract my set of words from a dictionary of words.
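A minimal sketch of that subtraction, assuming the qdapDictionaries package (its GradyAugmented object is a large English word list; any other dictionary vector would work the same way):

library(qdapDictionaries)  # assumption: supplies the GradyAugmented English word list

tokens <- unlist(strsplit(clean_tweet, "\\s+"))              # split the cleaned text back into words
real_words <- tokens[tolower(tokens) %in% GradyAugmented]    # keep only words found in the dictionary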

Twitter data cleaning in R

I find the rtweet package much easier to work with than twitteR, which is no longer up to date.

library(rtweet)
tweets <- search_tweets("urban park", n = 2000, lang = "en", full_text = TRUE)

This returns a data frame. One of its columns is is_retweet, which makes filtering out retweets easy:

library(dplyr)
tweets <- tweets %>%
  filter(is_retweet == FALSE)
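Or skip retweets at collection time by passing include_rts = FALSE to search_tweets():

tweets <- search_tweets("urban park", n = 2000, lang = "en", include_rts = FALSE)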

I normally use the tidytext package for text analysis. For example, to split the tweet text into words, filter out words you don't want, and remove common "stop words":

library(tidytext)

tweets <- tweets %>% 
  filter(is_retweet == FALSE) %>%
  select(text) %>%
  unnest_tokens(word, text) %>%
  select(word) %>%
  filter(!word %in% c("https", "t.co", "amp"),      # and whatever else to ignore
         !word %in% tolower(tweets$screen_name),    # remove user names
         !grepl("^\\d+$", word)) %>%                # remove numbers
  anti_join(stop_words)

Streamlining tweet text cleaning with stringr

To answer your primary question, the clean_tweets() function is not working in the line "Clean <- tweets %>% clean_tweets" presumably because you are feeding it a data frame, while the function's internals (i.e., the str_ functions) require character vectors (strings).

cleaning issue

I say "presumably" because I don't know exactly what your tweets object looks like. However, at least on your test data, the following solves the problem.

df %>% 
  mutate(clean = clean_tweets(text))

If you just wanted the character vector back, you could also do

clean_tweets(df$text)

emoji issue

Regarding the possibility of retaining emojis and assigning them sentiments: yes, I think you would proceed in essentially the same way as with the rest of the text: tokenize them, assign a numeric value to each one, then aggregate.
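A minimal sketch of that idea in R, assuming a hand-built lookup table (the two emoji and their sentiment values below are illustrative, not from any package):

library(stringr)

# hypothetical emoji-to-sentiment lookup; extend with whatever emoji matter to you
emoji_sentiment <- c("\U0001F600" = 1,   # grinning face -> positive
                     "\U0001F622" = -1)  # crying face   -> negative

# extract the known emoji from each tweet, then sum their values per tweet
found <- str_extract_all(df$text, paste(names(emoji_sentiment), collapse = "|"))
emoji_score <- vapply(found, function(e) sum(emoji_sentiment[e], na.rm = TRUE), numeric(1))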

Cleaning text of tweet messages

Looks like you tried to pass the entire data.frame to gsub rather than just the text column; gsub works on character vectors. Instead, you should do

data1[, 2] <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", data1[, 2])

to just transform the second column.
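Equivalently, you can address the column by name, which is safer if the column order ever changes (here I'm assuming the text column is called text):

data1$text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", data1$text)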

Cleaning tweets: issue with random letters and numbers

Base R:

gsub("<.+>", "", s)
[1] "France haven't had a lot of time on the ball #WorldCupFinal"
[2] "In case it wasn’t already obvious why we must support France C’mon Afrique! #AllezLesBleus #WorldCupFinal"
[3] "replica goes to the winner original goes to Zürich #WorldCupFinal"

If you have tweets like this, where the chain of <...> strings is interrupted by stuff you want to keep:

"replica goes to the winner original goes to Zürich <U+0001F643> #WorldCupFinal <U+0001F643>"

you have to do this:

gsub("<[^>]+>", "", s)

This ensures the match does not extend greedily from the first < to the very last >, which would also remove the text in between.
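To see the difference on the example above:

s <- "replica goes to the winner original goes to Zürich <U+0001F643> #WorldCupFinal <U+0001F643>"
gsub("<.+>", "", s)     # greedy: also deletes " #WorldCupFinal " between the two tags
gsub("<[^>]+>", "", s)  # matches each <...> tag separately, keeping the text in between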

Cleaning Twitter data with pandas in Python

As you said, you are never storing the cleaned data back. Let's create a function that does all the work and then apply it to the dataframe with map; that's more efficient than looping through each value and storing results in a list (option B).

import re
import emoji  # note: emoji.UNICODE_EMOJI exists in emoji < 1.0; newer versions renamed it
import nltk
from nltk.corpus import words as word_corpus

words = set(word_corpus.words())  # English word list; requires nltk.download('words')

def cleaner(tweet):
    tweet = re.sub("@[A-Za-z0-9]+", "", tweet)  # remove @mentions
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet)  # remove links
    tweet = " ".join(tweet.split())  # collapse whitespace
    tweet = "".join(c for c in tweet if c not in emoji.UNICODE_EMOJI)  # remove emojis
    tweet = tweet.replace("#", "").replace("_", " ")  # remove hashtag sign but keep the text
    tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet)
                     if w.lower() in words or not w.isalpha())  # keep dictionary words and non-alphabetic tokens
    return tweet

trump_df['tweet'] = trump_df['tweet'].map(cleaner)
trump_df.to_csv('')  # specify location

This will overwrite the tweet column with the modifications.

Option B:

As stated, this will likely be a bit less efficient, but it's as easy as creating a list before the for loop and filling it with each cleaned tweet.

clean_tweets = []
for tweet in trump_df['tweet']:
    tweet = re.sub("@[A-Za-z0-9]+", "", tweet)  # remove @mentions
    # ...here's where all the rest of the cleaning takes place
    clean_tweets.append(tweet)
trump_df['tweet'] = clean_tweets
trump_df.to_csv('')  # specify location

