Expression to remove URL links from Twitter tweet
Do this:
import re
result = re.sub(r"http\S+", "", subject)
http matches the literal characters "http".
\S+ matches all following non-whitespace characters (the rest of the URL).
The match is replaced with the empty string.
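As a minimal sketch of the call above wrapped in a function (the name remove_urls and the sample tweet are made up for illustration): note that the pattern leaves the surrounding spaces in place, and it would also match any other token that contains "http".

```python
import re

def remove_urls(text):
    # Delete every run that starts with "http" and continues
    # up to (but not including) the next whitespace character.
    return re.sub(r"http\S+", "", text)

tweet = "see http://example.com/page now"  # hypothetical sample text
print(repr(remove_urls(tweet)))  # the two spaces around the URL both remain
```

If the leftover double spaces matter, ' '.join(result.split()) collapses them afterwards.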
How to remove hashtag, @user, link of a tweet using regular expression
The following example is a close approximation. Unfortunately, there is no right way to do this with a regular expression alone. The regex below strips URLs (not just http ones), user names, punctuation and any other non-alphanumeric characters, and the surrounding code separates the remaining words with a single space. If you want to parse tweets the way you intend, you need more intelligence in the system, such as some kind of self-learning algorithm, because there is no standard tweet format.
Here is what I am proposing.
' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
And here is the result on your examples:
>>> x="@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I really love that shirt at Macy'
>>> x="@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x="I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
>>>
And here are a few examples where it is not perfect:
>>> x="I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x="RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # After adding # to the character class in the regex, the result gets a bit better
>>> ' '.join(re.sub("([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x="New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> #See how miserably it performed?
>>>
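One way to make the behaviour easier to adjust is to split the single alternation into separate passes. The following is only a sketch under a few assumptions: the names clean_tweet and keep_hashtag_words are hypothetical, and the mention pattern uses \w+ instead of [A-Za-z0-9]+ so that handles containing underscores are removed whole. It is still a heuristic and has the same basic limits as the one-liner.

```python
import re

URL = re.compile(r"\w+://\S+")            # any scheme, up to the next whitespace
MENTION = re.compile(r"@\w+")             # assumption: \w+ so underscores are covered
HASHTAG = re.compile(r"#(\w+)")           # group 1 is the word after '#'
NON_ALNUM = re.compile(r"[^0-9A-Za-z \t]")

def clean_tweet(text, keep_hashtag_words=True):
    # URLs go first, before the punctuation pass can break them apart.
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    # Either keep the hashtag's word or drop the whole hashtag.
    text = HASHTAG.sub(r" \1 " if keep_hashtag_words else " ", text)
    text = NON_ALNUM.sub(" ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

x = "@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
print(clean_tweet(x))
print(clean_tweet(x, keep_hashtag_words=False))
```

Running the passes in a fixed order means no pattern depends on how the alternatives of one large regex happen to be ordered.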
How to remove hashtags, user mentions & URLs from a tweet. The Twitter4j library (sentiment analysis) does not work properly with these noise words
Use regular expressions to filter out the # characters before passing a sentence through the sentiment analysis pipeline.
Use this:
String withoutHashTweet = originalTweet.replaceAll("[#]", "");
So "Hello great morning today #summermorning @evilpriest @holysinner " should return "Hello great morning today summermorning @evilpriest @holysinner ". Only the # characters are removed; the words after them stay.
Similarly, replace the # in the pattern with @ to strip the @ signs instead.
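The answer above is Java; the same idea can be sketched in Python, the language of the other snippets on this page. The two function names are hypothetical. The first mirrors the replaceAll call and drops only the signs, while the second is an assumption about what the question may actually want: dropping whole hashtags, mentions and URLs.

```python
import re

def strip_signs(tweet):
    # Drop only the '#' and '@' characters, keeping the words they prefix.
    return re.sub(r"[#@]", "", tweet)

def strip_tokens(tweet):
    # Drop whole hashtags, mentions and http(s) URLs, then tidy the whitespace.
    without_noise = re.sub(r"[#@]\w+|https?://\S+", " ", tweet)
    return " ".join(without_noise.split())

tweet = "Hello great morning today #summermorning @evilpriest @holysinner "
print(repr(strip_signs(tweet)))
print(repr(strip_tokens(tweet)))
```

Which variant suits a sentiment pipeline depends on whether the hashtag words carry meaning worth keeping.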
Excluding link at the end while pulling tweets in tweepy Streaming
In my experience with Twitter and tweepy, these URLs are included in a tweet's text whenever the actual tweet contains a URL of some sort, so we can't really avoid receiving them.
You can remove them after you get the tweet. This simple regex replaces URLs of this form with an empty string:
import re
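# re.sub returns a new string, so assign the result; the dot is escaped to match literally
tweet_text = re.sub(r' https://t\.co/\w{10}', '', tweet_text)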
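A slightly more tolerant variant is sketched below. It assumes the unwanted link is a t.co short URL at the very end of the text, and it leaves the path length open instead of fixing it at 10 characters; both are assumptions, not documented Twitter behaviour. The name strip_trailing_link is hypothetical.

```python
import re

# Anchored at the end of the string so links in the middle of a tweet are kept.
TRAILING_TCO = re.compile(r"\s*https://t\.co/\w+\s*$")

def strip_trailing_link(tweet_text):
    # Remove one trailing t.co link together with the whitespace around it.
    return TRAILING_TCO.sub("", tweet_text)

print(strip_trailing_link("Great talk today https://t.co/AbCdEf1234"))
print(strip_trailing_link("See https://t.co/AbCdEf1234 for details"))
```

The second call returns its input unchanged, because the link there is not at the end of the text.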