Hashtag Extract Function in R Programming

Select or extract words with a # in dataframe

Shot in the dark using dummy data

# Dummy data
data <- data.frame(title = c("#foo #bar",
"#qwerty #dvorak",
"#R>python"))
data$title <- as.character(data$title)
data
title
1 #foo #bar
2 #qwerty #dvorak
3 #R>python

# Extract hashtags
grep("#", unlist(strsplit(data$title, " ")), value = TRUE)
[1] "#foo" "#bar" "#qwerty" "#dvorak" "#R>python"
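If you also need to know which row each hashtag came from, a base R sketch along these lines may help. It assumes the same dummy data as above; the name tags_per_row is made up for illustration.

```r
# Same dummy data as above; stringsAsFactors = FALSE keeps title as character
data <- data.frame(title = c("#foo #bar", "#qwerty #dvorak", "#R>python"),
                   stringsAsFactors = FALSE)

# gregexpr() locates every match in each element,
# regmatches() extracts them as one character vector per row
tags_per_row <- regmatches(data$title,
                           gregexpr("#\\S+", data$title, perl = TRUE))
tags_per_row
```

The result is a list with one element per row, so the link between a title and its hashtags is kept, unlike with unlist(strsplit(...)).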

How do I extract hashtags from tweets in R?

Use "#\\S+" instead of "#\S+".

library(stringr)
str_extract_all("Hello peopllz! My new home is #crazy gr8! #wow", "#\\S+")
# [[1]]
# [1] "#crazy" "#wow"  

There are two levels of parsing going on here. Before the low-level regexp function within str_extract_all gets the pattern you want to search for (i.e. "#\S+"), the string literal is first parsed by R. R does not recognize \S as a valid escape sequence and throws an error. By escaping the backslash as \\, you tell R to pass \ and S as two ordinary characters to the regexp function instead of interpreting them as one escape sequence.
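To see what the regexp function actually receives, you can inspect the string after R has parsed it. This is only a small sketch; the raw string syntax on the last line assumes R 4.0.0 or newer.

```r
pattern <- "#\\S+"

# After R has parsed the literal, only one backslash is left
nchar(pattern)        # 4 characters: '#', '\', 'S', '+'
writeLines(pattern)   # prints #\S+

# Assumption: R >= 4.0.0, where raw strings avoid the doubled backslash
same <- identical(pattern, r"(#\S+)")
same
```

print(pattern) would show the escaped form again, which is why writeLines() is used here.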

Side track

This can produce rather bizarre expressions. Imagine that you have a list of addresses of computers on a Windows network of the form \\computer. Each literal backslash has to be escaped once for the regex engine and once more for R, so to search for it you would need to type str_extract(adr, "\\\\\\\\\\w+"), which reaches the regex engine as \\\\\w+ and matches two backslashes followed by word characters.
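A base R sketch of the backslash counting, with made-up host names. Every literal backslash in the text costs two in the regex and four in the R string literal.

```r
# Hypothetical addresses; each one starts with two real backslashes
adr <- c("\\\\computer", "\\\\fileserver01")
writeLines(adr)

# Ten backslashes in the source, five in the string:
# the regex engine sees \\\\\w+ (two literal backslashes, then \w+)
pattern <- "\\\\\\\\\\w+"
nchar(pattern)

hosts <- regmatches(adr, regexpr(pattern, adr, perl = TRUE))
hosts
```

Here hosts comes back identical to adr, since each whole address matches the pattern.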

Splitting hashtags in a data.frame object with R

hashtags[!lengths(hashtags)] <- NA

This will replace your length zero lists with NAs. (better solution for this via Dirty Sock Sniffer)

hashtags <- unlist(hashtags)

will give you a single vector of the values. If you'd like a data frame, you can now use as.data.frame.

hashtags_df <- as.data.frame(hashtags)

I don't know the best way to extract hashtags, etc., but this should answer the question as currently asked.
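Putting the steps together, here is a sketch with a made-up hashtags list (one character vector per tweet, one of them empty). The tweet column is an addition of mine to keep track of where each value came from.

```r
# Hypothetical input: hashtags per tweet, the second tweet has none
hashtags <- list(c("#foo", "#bar"), character(0), "#baz")

# Replace zero-length entries with NA so they survive unlist()
hashtags[!lengths(hashtags)] <- NA

# One row per hashtag, with the index of the tweet it belongs to
hashtags_df <- data.frame(
  tweet = rep(seq_along(hashtags), lengths(hashtags)),
  hashtag = unlist(hashtags),
  stringsAsFactors = FALSE
)
hashtags_df
```

Without the NA step, unlist() would silently drop the tweet that has no hashtags.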

How to build a Corpus of hashtags (Text Mining)

Your problem is you are using str_split. You should try:

str_extract_all("This all are hashtag #hello #I #am #a #buch #of #hashtags", "#\\S+")

This returns the following list:
[[1]]
[1] "#hello" "#I" "#am" "#a" "#buch" "#of"
[7] "#hashtags"

If you want the result as a character matrix instead of a list, use simplify = TRUE:

str_extract_all("This all are hashtag #hello #I #am #a #buch #of #hashtags", "#\\S+", simplify = TRUE)

Result:

     [,1]     [,2] [,3]  [,4] [,5]    [,6]  [,7]       
[1,] "#hello" "#I" "#am" "#a" "#buch" "#of" "#hashtags"

Extracting Tweets in R Based on Content (keywords)

You can construct a pattern along similar lines:

[hH]illary ?[Cc]linton

Demo: https://regex101.com/r/tEcDNY/2
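In R, a pattern like this can be passed to grepl() to keep only the matching tweets. A small sketch, where the tweets vector is invented for illustration:

```r
# Hypothetical tweets
tweets <- c("Hillary Clinton spoke today",
            "hillaryclinton is trending",
            "Nothing relevant here")

pattern <- "[hH]illary ?[Cc]linton"

# grepl() returns TRUE/FALSE per tweet, which is used to subset the vector
matched <- tweets[grepl(pattern, tweets)]
matched
```

If the casing varies more than this, grepl(pattern, tweets, ignore.case = TRUE) is probably simpler than listing both cases for each letter.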

word segmentation for hashtag using R

So ... this is an absolutely non-trivial task, and I don't think it can be solved in general. Since there is no delimiter between your words, you basically need to extract substrings and check them against a dictionary of your desired language.
A very crude method, which only extracts the longest matches it can find from left to right, uses hunspell. It is designed for spell checking but can be "misused" to maybe solve this task:

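split_words <- function(cat.string){
  split <- NULL
  start.char <- 1
  # <= so that a final one-letter word is not silently dropped
  while(start.char <= nchar(cat.string))
  {
    result <- NULL
    # Try every substring starting at start.char; the longest correctly spelled one wins
    for(cur.char in start.char:nchar(cat.string))
    {
      test.string <- substr(cat.string, start.char, cur.char)
      # hunspell() returns the misspelled words, so length 0 means test.string is a word
      test <- hunspell::hunspell(test.string)[[1]]
      if(length(test) == 0) result <- test.string
    }
    # No word starts at this position: give up
    if(is.null(result)) return("")
    split <- c(split, result)
    start.char <- start.char + nchar(result)
  }
  split
}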

input <- c("#sometrendingtopic","#anothertrendingtopic","#someveryboringtopic")

# Remove the leading # from the input
input <- sub("#","",input)
# Apply the word split
result <- lapply(input,split_words)
result
[[1]]
[1] "some" "trending" "topic"

[[2]]
[1] "another" "trending" "topic"

[[3]]
[1] "some" "very" "boring" "topic"

Please keep in mind that this method is far from perfect in multiple ways:

  1. It is relatively slow.
  2. It matches greedily from left to right. For example, for the hashtag
    input <- "#averyboringtopic" the result will be
[[3]]
[1] "aver" "y" "boring" "topic"

Since "aver" apparently is a possible word in this specific dictionary.
So: Use at your own risk and improve upon this!
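The greedy behaviour can be reproduced without hunspell by swapping in a tiny word list. This is only a sketch to illustrate the pitfall; dict and greedy_split are made-up names, and the toy dictionary stands in for a real one.

```r
# Toy dictionary (assumption: stands in for the hunspell dictionary)
dict <- c("a", "aver", "very", "y", "boring", "topic")

greedy_split <- function(s, dict) {
  out <- character(0)
  start <- 1
  while (start <= nchar(s)) {
    # All substrings that begin at the current position
    candidates <- substring(s, start, start:nchar(s))
    hits <- candidates[candidates %in% dict]
    # No dictionary word starts here: no segmentation found
    if (length(hits) == 0) return(character(0))
    # Greedy step: always take the longest word
    longest <- hits[which.max(nchar(hits))]
    out <- c(out, longest)
    start <- start + nchar(longest)
  }
  out
}

greedy_split("averyboringtopic", dict)
# "aver" "y" "boring" "topic"
```

Both "a" and "very" are in the list, but the greedy step commits to "aver" first and never looks back. Getting "a very boring topic" would need some form of backtracking or scoring of alternative splits.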

Best HashTag Regex

It depends on whether you want to match hashtags inside other strings ("Some#Word") or things that probably aren't hashtags ("We're #1"). The regex you gave, #\w+, will match in both of these cases. If you slightly modify your regex to \B#\w\w+, you eliminate both: \B requires that the # is not preceded by a word character, and \w\w+ requires at least two word characters after it.
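Here is a quick comparison of the two patterns in base R, using made-up test strings. perl = TRUE is set so that \B and \w are handled by PCRE.

```r
# Hypothetical test strings
x <- c("Some#Word", "We're #1", "Loving #rstats and #R4")

# Loose pattern: matches inside words and single-character tags
loose <- regmatches(x, gregexpr("#\\w+", x, perl = TRUE))

# Stricter pattern: no word character before '#', at least two after it
strict <- regmatches(x, gregexpr("\\B#\\w\\w+", x, perl = TRUE))

loose
strict
```

With the stricter pattern the first two strings yield no match at all, while "#rstats" and "#R4" are still found in the third.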


