Regex to extract URLs from href attribute in HTML with Python
import re
html = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
# Note: this pattern stops at the first '/', so it captures only the scheme and host.
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', html)
>>> print(urls)
['http://example.com', 'http://2.example']
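The regex above matches any URL-like text in the string, whether or not it sits inside an href. As a rough alternative sketch, the standard library's html.parser can collect only the href values of a tags. The class name HrefCollector is a hypothetical helper, not part of the original answer.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


html = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
collector = HrefCollector()
collector.feed(html)
hrefs = collector.hrefs
print(hrefs)  # ['http://example.com', 'http://2.example']
```

Unlike the regex, this keeps any path or query string that is part of the href value.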
How to use regex to extract links from HTML tag-attributes (src, data, href, or others)
(?:data-full-resolution|src|href|data)=\"(.*?)\"
Regex Explanation
(?:                                  Non-capturing group
data-full-resolution|src|href|data   One of data-full-resolution, src, href, or data
)                                    Close non-capturing group
=\"                                  Match =" after an attribute name
(                                    Capturing group
.*?                                  Non-greedy capturing till the next quote
)                                    Close group
\"                                   Match the close quote
Python Example
import re
html = """<a href="<link-href>"></a>
<img src="<link-src>" data-full-resolution="<link-data-full-resolution>" />
<object data="<link-data>"/>"""
print(re.findall(r"(?:data-full-resolution|src|href|data)=\"(.*?)\"", html)) # ['<link-href>', '<link-src>', '<link-data-full-resolution>', '<link-data>']
Here re.findall returns the list of captured groups.
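If you also need to know which attribute each link came from, one possible variation is to capture the attribute name as well and accept either quote style. This is only a sketch: the name ATTR_RE is made up for illustration, and the single-quoted src in the sample is an assumption added to show the backreference at work.

```python
import re

# Group 1: attribute name, group 2: opening quote, group 3: value.
# \2 requires the closing quote to match the opening one.
ATTR_RE = re.compile(r"""(data-full-resolution|src|href|data)=(["'])(.*?)\2""")

html = """<a href="<link-href>"></a>
<img src='<link-src>' data-full-resolution="<link-data-full-resolution>" />
<object data="<link-data>"/>"""

pairs = [(m.group(1), m.group(3)) for m in ATTR_RE.finditer(html)]
print(pairs)
# [('href', '<link-href>'), ('src', '<link-src>'),
#  ('data-full-resolution', '<link-data-full-resolution>'), ('data', '<link-data>')]
```

Note that neither this pattern nor the original anchors the attribute name, so an attribute such as data-src would be reported as src.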
Python Regex to extract relative href links
pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'
test:
import re
s = """
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>
<br/>
<a href="data/self/dated/station1_1402010.txt">Saturday, February 10, 2014</a>
<br/>
<a href="data/self/dated/station1_1402012.txt">Saturday, February 12, 2014</a>
<br/>
"""
# Group 1: file name after the fixed directory prefix; group 2: the link text.
pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'
re.findall(pattern, s)
output:
[('station1_140208.txt', 'Saturday, February 08, 2014'), ('station1_1402010.txt', 'Saturday, February 10, 2014'), ('station1_1402012.txt', 'Saturday, February 12, 2014')]
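Relative links like these usually need to be resolved against the page they came from before they can be fetched. A minimal sketch with urllib.parse.urljoin follows; BASE_URL is a hypothetical page address, not something given in the question, and the capture group is widened here to keep the whole relative path.

```python
import re
from urllib.parse import urljoin

# Hypothetical address of the page the HTML was downloaded from.
BASE_URL = "http://example.com/reports/"

s = '<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>'

# Same idea as the pattern above, but group 1 keeps the full relative path.
pattern = r'href="(data/self/dated/[^"]*)"[^>]*>([\s\S]*?)</a>'

links = [(urljoin(BASE_URL, href), label) for href, label in re.findall(pattern, s)]
print(links)
# [('http://example.com/reports/data/self/dated/station1_140208.txt', 'Saturday, February 08, 2014')]
```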
How do you extract a url from a string using python?
There may be a few ways to do this, but the cleanest would be to use a regex:
>>> myString = "This is a link http://www.google.com"
>>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
http://www.google.com
If there can be multiple links, you can use something similar to the following:
>>> myString = "These are the links http://www.google.com and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print(re.findall(r'(https?://[^\s]+)', myString))
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
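One caveat with matching "everything up to the next whitespace" is that punctuation following a link in running text gets captured too. A small sketch of one way to trim it is below. The sample sentence is invented for illustration, and the set of stripped characters is a judgment call: a URL that legitimately ends in ")" would be cut short.

```python
import re

# Invented sample text with punctuation directly after each link.
text = "See http://www.google.com, then (http://stackoverflow.com/questions/839994)."

# Find candidate URLs, then strip trailing sentence punctuation from each one.
urls = [u.rstrip(".,;:!?)") for u in re.findall(r"https?://\S+", text)]
print(urls)
# ['http://www.google.com', 'http://stackoverflow.com/questions/839994']
```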
How to extract a link from the embedded link with python?
I think it would be better to use Beautiful Soup instead.
The text to parse is an iframe tag with a src attribute. You are trying to retrieve the URL after href= and before &width in the src attribute.
After that, you would need to decode the URL back to text.
First, you throw it into beautiful soup and get the attribute out of it:
text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly to avoid a warning
src_attribute = soup.find("iframe")["src"]
Then you could either use a regex or use .split() (quite hacky):
# Regex
link = re.search(r'href=(.*?)&', src_attribute).group(1)
# .split()
link = src_attribute.split("href=")[1].split("&")[0]
Lastly, you would need to decode the URL using urllib.parse.unquote (urllib2.unquote in Python 2; requires import urllib.parse):
link = urllib.parse.unquote(link)
and you are done!
So the resulting code would be:
from bs4 import BeautifulSoup
import urllib.parse  # Python 3; in Python 2 this was urllib2
import re
text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
soup = BeautifulSoup(text, "html.parser")
src_attribute = soup.find("iframe")["src"]
# Option 1: regex
link = re.findall(r'href=(.*?)&', src_attribute)[0]
# Option 2: .split() (gives the same result)
link = src_attribute.split("href=")[1].split("&")[0]
# Decode the percent-encoded URL back to text
link = urllib.parse.unquote(link)
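As an alternative to the regex and .split() steps, the query string can be parsed directly with the standard library, which also takes care of the percent-decoding. This is a sketch that assumes src_attribute already holds the iframe's src value as extracted above.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed to be the value already read from the iframe's src attribute.
src_attribute = (
    "https://www.facebook.com/plugins/post.php"
    "?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646"
    "&width=500"
)

# parse_qs maps each query parameter to a list of decoded values.
query = parse_qs(urlsplit(src_attribute).query)
link = query["href"][0]
print(link)  # https://www.facebook.com/DoctorTaniya/posts/1906676949620646
```

This does not depend on href being followed by another parameter, which the regex and .split() versions both assume.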
Extracting a URL in Python
In response to the OP's edit I hijacked Find Hyperlinks in Text using Python (twitter related) and came up with this:
import re
myString = "This is my tweet check it out http://example.com/blah"
print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
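Note that re.search returns None when the text contains no URL, so calling .group() on the result directly raises AttributeError in that case. A minimal guarded sketch follows; the function name first_url is made up for illustration.

```python
import re

URL_RE = re.compile(r"(?P<url>https?://\S+)")


def first_url(text):
    """Return the first URL found in text, or None if there is none."""
    match = URL_RE.search(text)
    return match.group("url") if match else None


print(first_url("This is my tweet check it out http://example.com/blah"))
print(first_url("This tweet has no link"))
```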
In Python, how to write a regex which catches a URL in an a href tag?
Everything that is not a " is matched by [^"]
so you can put: [^"]*"
and get: '<a href="[^"]*"'
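To get just the URL rather than the whole matched tag prefix, the [^"]* part can be wrapped in a capturing group. The sketch below is a variation, not the exact pattern from the answer: it also allows other attributes between <a and href, and the sample HTML is invented.

```python
import re

# Invented sample with one plain link and one link that has an extra attribute.
html = '<a href="http://example.com">More</a> <a class="x" href="/about">About</a>'

# [^>]*? lazily skips other attributes inside the same tag;
# ([^"]*) captures everything up to the closing quote.
hrefs = re.findall(r'<a\s[^>]*?href="([^"]*)"', html)
print(hrefs)  # ['http://example.com', '/about']
```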