How do I clean twitter data in R?
Using gsub and the stringr package
I have figured out part of the solution for removing retweets, references to screen names, hashtags, spaces, numbers, punctuation, and URLs.
clean_tweet = gsub("&amp", "", unclean_tweet)
clean_tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", clean_tweet)
clean_tweet = gsub("@\\w+", "", clean_tweet)
clean_tweet = gsub("[[:punct:]]", "", clean_tweet)
clean_tweet = gsub("[[:digit:]]", "", clean_tweet)
clean_tweet = gsub("http\\w+", "", clean_tweet)
clean_tweet = gsub("[ \t]{2,}", " ", clean_tweet)
clean_tweet = gsub("^\\s+|\\s+$", "", clean_tweet)
ref: (Hicks, 2014)
After the above, I did the below.
library(stringr)
# Get rid of unnecessary spaces
clean_tweet <- str_replace_all(clean_tweet, "\\s+", " ")
# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "https?://t\\.co/[A-Za-z0-9]+", "")
# Take out the retweet header; there is only one
clean_tweet <- str_replace(clean_tweet, "RT @\\w*: ", "")
# Get rid of hashtags
clean_tweet <- str_replace_all(clean_tweet, "#\\w*", "")
# Get rid of references to other screen names
clean_tweet <- str_replace_all(clean_tweet, "@\\w*", "")
ref: (Stanton, 2013)
Before doing any of the above, I collapsed all the tweets into a single long character string using the below.
paste(mytweets, collapse=" ")
This cleaning process has worked quite well for me, in contrast to the tm_map transforms.
All that I am left with now is a set of proper words and a very few improper words.
Now I only have to figure out how to remove the words that are not proper English.
I will probably have to compare my set of words against a dictionary of words and drop the ones that do not appear in it.
twitter data cleaning in R
I find the rtweet package much easier to work with than twitteR, which is no longer up to date.
library(rtweet)
tweets <- search_tweets("urban park", n = 2000, lang = "en", full_text = TRUE)
This returns a data frame. One of the column names is is_retweet, which makes filtering for retweets easy. Or just use include_rts = FALSE in search_tweets().
library(dplyr)
tweets <- tweets %>%
  filter(is_retweet == FALSE)
I normally use the tidytext package for text analysis. For example, to split tweet text into words, filter out words that you don't want, and remove common "stop words":
library(tidytext)
tweets <- tweets %>%
  filter(is_retweet == FALSE) %>%
  select(text) %>%
  unnest_tokens(word, text) %>%
  select(word) %>%
  filter(!word %in% c("https", "t.co", "amp"),     # and whatever else to ignore
         !word %in% tolower(tweets$screen_name),   # remove user names
         !grepl("^\\d+$", word)) %>%               # remove numbers
  anti_join(stop_words, by = "word")               # stop_words ships with tidytext
Streamlining cleaning Tweet text with Stringr
To answer your primary question, the clean_tweets() function is not working in the line "Clean <- tweets %>% clean_tweets", presumably because you are feeding it a dataframe. However, the function's internals (i.e., the str_ functions) require character vectors (strings).
cleaning issue
I say "presumably" here because I don't know what your tweets object looks like. However, at least on your test data, the following solves the problem.
df %>%
  mutate(clean = clean_tweets(text))
If you just wanted the character vector back, you could also do
clean_tweets(df$text)
emoji issue
Regarding the possibility of retaining emojis and assigning them sentiments, yes, I think you would proceed in essentially the way you have with the rest of the text: tokenize them, assign numeric values to each one, then aggregate.
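The tokenize, score, aggregate idea above does not depend on the language. A minimal sketch follows, written in Python (not R) to match the pandas answer further down this page. The EMOJI_SCORES table and the emoji_sentiment name are hypothetical and made up for illustration; real scores would have to come from an emoji sentiment lexicon of your choosing. The sketch also assumes each emoji is a single code point.

```python
# Hypothetical sentiment scores, for illustration only.
EMOJI_SCORES = {
    "\U0001F600": 1.0,   # grinning face
    "\U0001F622": -1.0,  # crying face
    "\U0001F643": 0.0,   # upside-down face
}

def emoji_sentiment(text: str) -> float:
    """Treat each character as a token, score known emojis, and sum the scores."""
    return sum(EMOJI_SCORES.get(ch, 0.0) for ch in text)

tweet = "great match \U0001F600\U0001F600 but sad ending \U0001F622"
print(emoji_sentiment(tweet))  # 1.0
```

Emojis built from several code points (skin tones, flags, ZWJ sequences) would need a proper emoji tokenizer instead of a per-character loop.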
Cleaning text of tweet messages
Looks like you tried to pass the entire data.frame to gsub rather than just the text column. gsub prefers to work on character vectors. Instead you should do
data1[,2] = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", data1[,2])
to just transform the second column.
Cleaning tweets, issue with random letters and numbers in
Base R:
gsub("<.+>", "", s)
[1] "France haven't had a lot of time on the ball #WorldCupFinal"
[2] "In case it wasn’t already obvious why we must support France C’mon Afrique! #AllezLesBleus #WorldCupFinal"
[3] "replica goes to the winner original goes to Zürich #WorldCupFinal"
If you have tweets like this, where the chain of <...> strings is interrupted by stuff you want to keep:
"replica goes to the winner original goes to Zürich <U+0001F643> #WorldCupFinal <U+0001F643>"
you have to do this:
gsub("<[^>]+>", "", s)
This makes sure the match does not extend from the first < to the very last >, which would also remove everything in between.
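The difference between the two patterns is the same in most regex engines. Here is a small sketch in Python (not R, to match the pandas answer further down this page) that runs both patterns on the example string from above; the variable names are only illustrative.

```python
import re

s = "replica goes to the winner original goes to Zürich <U+0001F643> #WorldCupFinal <U+0001F643>"

# Greedy: .+ runs from the first "<" to the last ">", swallowing the hashtag.
greedy = re.sub(r"<.+>", "", s)

# Negated class: [^>]+ cannot cross a ">", so each <...> is removed on its own.
bounded = re.sub(r"<[^>]+>", "", s)

print(repr(greedy))
print(repr(bounded))
```

Only the second result still contains #WorldCupFinal. Both results keep the spaces that surrounded the removed strings, so a whitespace cleanup step is still needed afterwards.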
Cleaning Twitter data pandas python
As you correctly noted, you are never storing the data back. Let's create a function that does all the work and then apply it to the dataframe column using map. It's more efficient than looping through each value in the dataframe and storing it into a list (option B).
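import re
import emoji
import nltk

# Assumes the NLTK "words" corpus is installed: nltk.download("words")
words = set(nltk.corpus.words.words())

def cleaner(tweet):
    tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)  # Remove @mentions
    tweet = re.sub(r"(?:@|https?://|www)\S+", "", tweet)  # Remove links
    tweet = " ".join(tweet.split())  # Collapse repeated whitespace
    # Remove emojis (emoji >= 2.0; older releases used emoji.UNICODE_EMOJI)
    tweet = emoji.replace_emoji(tweet, replace="")
    tweet = tweet.replace("#", "").replace("_", " ")  # Remove hashtag sign but keep the text
    # Keep dictionary words and non-alphabetic tokens (numbers, punctuation)
    tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet)
                     if w.lower() in words or not w.isalpha())
    return tweet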
trump_df['tweet'] = trump_df['tweet'].map(lambda x: cleaner(x))
trump_df.to_csv('') #specify location
This will overwrite the tweet column with the modifications.
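To try the regex steps without installing emoji or nltk, here is a minimal standard-library sketch of the same idea. The names basic_clean, MENTION, and URL are hypothetical, and the patterns are illustrative, not exhaustive; it skips the emoji and dictionary filtering steps entirely.

```python
import re

# Illustrative patterns only; real tweets may need broader ones.
MENTION = re.compile(r"@\w+")
URL = re.compile(r"(?:https?://|www\.)\S+")

def basic_clean(tweet: str) -> str:
    """Strip links and mentions, keep hashtag text, collapse whitespace."""
    tweet = URL.sub("", tweet)
    tweet = MENTION.sub("", tweet)
    tweet = tweet.replace("#", "").replace("_", " ")
    return " ".join(tweet.split())

sample = "@bob look at https://t.co/abc123 #World_Cup   final"
print(basic_clean(sample))  # look at World Cup final
```

A function like this can be passed to map on the dataframe column in the same way as cleaner.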
Option B:
As stated, I expect this to be a bit less efficient, but it's as easy as creating a list before the for loop and filling it with each cleaned tweet.
clean_tweets = []
for tweet in trump_df['tweet']:
    tweet = re.sub("@[A-Za-z0-9]+", "", tweet)  # Remove @mentions
    ## Here's where all the cleaning takes place
    clean_tweets.append(tweet)
trump_df['tweet'] = clean_tweets
trump_df.to_csv('')  # Specify location