How do you extract a url from a string using python?
There may be a few ways to do this, but the cleanest is to use a regular expression:
>>> import re
>>> myString = "This is a link http://www.google.com"
>>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
http://www.google.com
If there can be multiple links, use re.findall instead:
>>> myString = "These are the links http://www.google.com and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print(re.findall(r'(https?://[^\s]+)', myString))
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
Extracting a URL in Python
In response to the OP's edit I hijacked Find Hyperlinks in Text using Python (twitter related) and came up with this:
import re
myString = "This is my tweet check it out http://example.com/blah"
print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
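Note that re.search returns None when the string contains no URL, so calling .group() directly raises AttributeError in that case. A minimal sketch of a guarded version follows; the helper name first_url is made up for illustration and is not part of the original answer.

```python
import re

# Same pattern as above, compiled once; \S is equivalent to [^\s].
URL_PATTERN = re.compile(r"(?P<url>https?://\S+)")

def first_url(text):
    """Return the first http(s) URL in text, or None if there is none."""
    match = URL_PATTERN.search(text)
    return match.group("url") if match else None

print(first_url("This is my tweet check it out http://example.com/blah"))
print(first_url("no links here"))
```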
How to remove any URL within a string in Python
Python script:
import re
text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
Output:
text1
text2
text3
text4
text5
text6
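Because the pattern above is anchored with ^, it only removes URLs that begin a line, together with the rest of that line. If the URLs can appear in the middle of a sentence, something along these lines may be closer to what is wanted. This is a sketch only; strip_urls and the sample text are invented for the example.

```python
import re

def strip_urls(text):
    # Remove every http(s) URL, wherever it appears in the text.
    without_urls = re.sub(r"https?://\S+", "", text)
    # Collapse the runs of spaces or tabs left behind by the removal.
    return re.sub(r"[ \t]{2,}", " ", without_urls).strip()

sample = "text1 http://example.com/a text2\nhttps://example.org/b\ntext3"
print(strip_urls(sample))
```

Line breaks are left untouched, so a line that contained nothing but a URL becomes an empty line.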
Extract all urls in a string with python3
Apart from what others mentioned, since you've asked for something that already exists, you might want to try URLExtract.
Apparently it tries to find any occurrence of a TLD in the given text. If a TLD is found, it expands the boundaries from that position in both directions, searching for a "stop character" (usually white space, a comma, or a single or double quote).
For example:
from urlextract import URLExtract
extractor = URLExtract()
urls = extractor.find_urls("Let's have URL youfellasleepwhilewritingyourtitle.com as an example.")
print(urls) # prints: ['youfellasleepwhilewritingyourtitle.com']
It seems that this module also has an update() method which lets you update the TLD list cache file.
However, if that doesn't fit your specific requirements, you can manually do some checks after you've processed the URLs using the above module (or any other way of parsing the URLs). For example, say you get a list of the URLs:
result = ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']
You can then build other lists which hold the allowed protocols / TLDs / domains / etc.:
allowed_protocols = ['protocol_1', 'protocol_2']
allowed_tlds = ['tld_1', 'tld_2', 'tld_3']
allowed_domains = ['domain_1']
for each_url in result:
    # here, check each url against your rules
    pass
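One way the check inside the loop might look, using urllib.parse from the standard library. This is only a sketch: the allowed values below are sample data, is_allowed is a hypothetical helper, and treating the last dot-separated label of the host as the TLD is a simplification (it reports "uk" for news.bbc.co.uk, not "co.uk").

```python
from urllib.parse import urlparse

# Sample rules, chosen only for this illustration.
allowed_protocols = {"https"}
allowed_tlds = {"com", "org"}

def is_allowed(url):
    parts = urlparse(url)
    host = parts.hostname or ""
    # Simplification: take the last dot-separated label as the TLD.
    tld = host.rsplit(".", 1)[-1]
    return parts.scheme in allowed_protocols and tld in allowed_tlds

result = ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']
filtered = [url for url in result if is_allowed(url)]
print(filtered)
```

URLs found without a scheme (for example a bare "example.com") have an empty scheme after urlparse, so this sketch rejects them.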
Regular expression to extract URLs with difficult formatting
You can forbid a period as the last character like this:
m = re.findall(r"((http:|https:)//[^ <]*[^ <.])", line)
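The pattern above contains capturing groups, so re.findall returns tuples. The sketch below uses the same idea without groups so that plain strings come back; the sample line is invented for the example.

```python
import re

# The last character of a match may not be a space, "<" or ".",
# so a sentence-ending period is left out of the URL.
pattern = re.compile(r"https?://[^ <]*[^ <.]")

line = "See http://example.com/page. Then <a>https://example.org/x.</a>"
print(pattern.findall(line))
```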
Regex to extract URLs from href attribute in HTML with Python
import re
url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)
print(urls)
# ['http://example.com', 'http://2.example']
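The character class in that regex does not include "/", so it stops at the host name, and it matches URLs anywhere in the markup, not only inside href attributes. If the goal is specifically the href values, an alternative (not the regex approach from the answer) is the standard library's html.parser. A minimal sketch, with HrefCollector as a made-up class name:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

html = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
collector = HrefCollector()
collector.feed(html)
print(collector.hrefs)
```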
Extracting URL link using regular expression re - string matching - Python
re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', str(STRING))
The [^\s<>"]+ part matches any character that is not whitespace, a double quote, or an angle bracket, which avoids over-matching in strings like:
<a href="http://www.example.com/stuff">
http://www.example.com/stuff</br>
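A quick check of that pattern against the two strings above, plus a scheme-less www. link. The third sample string is invented for the illustration.

```python
import re

URL_RE = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')

samples = [
    '<a href="http://www.example.com/stuff">',
    'http://www.example.com/stuff</br>',
    'visit www.example.com today',
]
for s in samples:
    # In the first two samples the match stops before the quote or bracket.
    print(URL_RE.findall(s))
```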
Use RegEx in Python to extract URL and optional query string from web server log data
You can use
^(?P<url>[^?]+)(?P<querystr>\?.*)?$
Details
^ - start of string
(?P<url>[^?]+) - Group "url": any one or more chars other than ?
(?P<querystr>\?.*)? - an optional Group "querystr": a ? char and then any zero or more chars other than line break chars, as many as possible
$ - end of string.
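Applied in Python, the pattern could be used roughly as follows. The request paths are made-up samples, not real log data, and REQUEST_RE is just a name chosen for this sketch.

```python
import re

REQUEST_RE = re.compile(r"^(?P<url>[^?]+)(?P<querystr>\?.*)?$")

for path in ["/index.html", "/search?q=python&page=2"]:
    m = REQUEST_RE.match(path)
    # "querystr" is None when the request has no query string.
    print(m.group("url"), m.group("querystr"))
```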