How to Remove Hashtag, @User, Link of a Tweet Using Regular Expression

The following example is a close approximation. Unfortunately, there is no fully correct way to do this with a regular expression alone. The regex below strips URLs (any scheme, not just http), @user mentions, and every character that is not alphanumeric, a space, or a tab, so punctuation goes too. The words that remain are then joined with single spaces. If you want to parse tweets the way you intend, you need more intelligence in the system, such as some kind of learning algorithm, because there is no standard tweet format.

Here is what I am proposing.

' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())

And here is the result on your examples:

>>> x="@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I really love that shirt at Macy'
>>> x="@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x="I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
>>>

And here are a few examples where it is not perfect:

>>> x="I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x="RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # After you add # to the regex (matching [@#] instead of just @), the result is a bit better
>>> ' '.join(re.sub("([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x="New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> #See how miserably it performed?
>>>
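
For reuse, the one-liner above can be wrapped in a small function. This is only a sketch: the names TWEET_NOISE, TWEET_NOISE_WITH_TAGS and clean_tweet, as well as the drop_hashtag_words flag, are made up for illustration. The patterns are the same two regexes used in the session above, so the same limitations apply.

```python
import re

# Same pattern as in the session above: @mentions, any character that is not
# alphanumeric / space / tab, and scheme://... URLs.
TWEET_NOISE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")
# Variant that also drops the word following a '#'.
TWEET_NOISE_WITH_TAGS = re.compile(r"([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")


def clean_tweet(text, drop_hashtag_words=False):
    """Replace noise with spaces, then collapse runs of whitespace."""
    pattern = TWEET_NOISE_WITH_TAGS if drop_hashtag_words else TWEET_NOISE
    return " ".join(pattern.sub(" ", text).split())


print(clean_tweet("@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"))
print(clean_tweet("RT @AstrologyForYou: #Gemini recharges", drop_hashtag_words=True))
```

Compiling the patterns once avoids repeating the long regex at every call site, which is where typos tend to creep in.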

How to remove hashtags, user mentions & URLs from a tweet. The Twitter4j library (sentiment analysis) does not work properly with these noise words

Use a regular expression to filter out the # characters before passing the sentence through the sentiment analysis pipeline.
Use this:

String withoutHashTweet = originalTweet.replaceAll("[#]", "");

So "Hello great morning today #summermorning @evilpriest @holysinner" should return: "Hello great morning today summermorning @evilpriest @holysinner"

Similarly, replace the # in the pattern with @ to strip the @ signs, or use [#@] to strip both.

Python remove hashtag symbol and keep key words

>>> "this tweet is example #key1_key2_key3".replace("#", "").replace("_", " ")
'this tweet is example key1 key2 key3'
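
One caveat with chaining replace calls is that every underscore in the tweet becomes a space, not only those inside hashtags. A possible alternative, sketched here with a hypothetical helper named expand_hashtags, restricts the underscore handling to the hashtag itself by passing a replacement function to re.sub.

```python
import re


def expand_hashtags(text):
    """Drop the leading '#' and split hashtag words on underscores."""
    # \w+ covers letters, digits and underscores, so the whole tag is captured.
    return re.sub(r"#(\w+)", lambda m: m.group(1).replace("_", " "), text)


print(expand_hashtags("this tweet is example #key1_key2_key3 from my_bot"))
```

Here my_bot is left alone because it is not part of a hashtag.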

This Python regex removes the URL successfully, but if the URL is at the beginning of the tweet, the whole sentence is removed as well

The pattern https?:\/\/.*[\r\n]*\S+ first matches http (with an optional s) followed by ://

Then the .* part matches to the end of the line, [\r\n]* matches 0+ newline characters, and \S+ matches 1+ non-whitespace characters.

So the URL is matched together with the rest of the line, any newlines, and the first run of non-whitespace characters on the next line. If nothing follows the line, .* gives back one character so that \S+ can still match, and the whole line is removed anyway.

You could shorten the pattern to:

\bhttps?://\S+
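
To make the difference concrete, here is a small sketch that runs both patterns over a tweet starting with a URL. The sample tweet and the names GREEDY and SHORT are invented for illustration.

```python
import re

GREEDY = re.compile(r"https?:\/\/.*[\r\n]*\S+")  # pattern discussed above
SHORT = re.compile(r"\bhttps?://\S+")            # shortened pattern

tweet = "http://t.co/abc great news today"

greedy_result = GREEDY.sub("", tweet)        # the whole tweet is consumed
short_result = SHORT.sub("", tweet).strip()  # only the URL is removed

print(repr(greedy_result))
print(repr(short_result))
```

The strip() call only tidies the space left behind where the URL used to be.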

Remove Twitter mentions from Pandas column

You are misusing the replace method on a string: it does not accept regular expressions, only fixed strings (see the docs at https://docs.python.org/2/library/stdtypes.html#str.replace for more).

The right way to do this is with the re module, for example:

>>> import re
>>> re.sub("@[A-Za-z0-9]+","", "@thisisauser text")
' text'
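
To apply the same substitution to many tweets at once, the pattern can be compiled once and mapped over the values. The sketch below uses a plain list as a stand-in for the column; the tweets list and the strip_mentions helper are invented for illustration. With an actual pandas Series, the str.replace accessor called with regex=True is the usual counterpart.

```python
import re

MENTION = re.compile(r"@[A-Za-z0-9]+")  # same pattern as above


def strip_mentions(text):
    """Remove @mentions and tidy up the leftover whitespace."""
    return " ".join(MENTION.sub("", text).split())


# Stand-in for the values of a text column.
tweets = ["@thisisauser text", "thanks @someone for the tip", "no mention here"]
cleaned = [strip_mentions(t) for t in tweets]
print(cleaned)
```

Note that this pattern stops at an underscore, so a handle such as @my_friend would only be partly removed unless _ is added to the character class.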

regex for Twitter username

(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9-_]+)

I use this because it ignores email addresses.

Here is a sample tweet:

@Hello how are @you doing @my_friend, email @000 me @ whats.up@example.com @shahmirj

Matches:

  • @Hello
  • @you
  • @my_friend
  • @shahmirj

It also works for hashtags: I use the same expression with the @ changed to #.
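
As a quick check, the sample tweet can be run through Python's re module. This sketch rewrites the nested lookbehind as a single negative lookbehind, which is intended to be equivalent (start of string, or a preceding character outside the listed set). The MENTION name is just for illustration.

```python
import re

# '@' must not be preceded by a letter, digit, '-', '_' or '.',
# which is what keeps whats.up@example.com from matching.
MENTION = re.compile(r"(?<![A-Za-z0-9_.-])@([A-Za-z]+[A-Za-z0-9_-]+)")

tweet = ("@Hello how are @you doing @my_friend, email @000 me @ "
         "whats.up@example.com @shahmirj")

# findall returns the captured group, i.e. the name without the '@'.
names = MENTION.findall(tweet)
print(names)
```

Because the name part is [A-Za-z]+ followed by at least one more character, single-letter handles and handles that start with a digit (such as @000) are skipped.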


