Regex to Extract URLs from Href Attribute in HTML with Python

Regex to extract URLs from href attribute in HTML with Python

import re

url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'

urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)

>>> print(urls)
['http://example.com', 'http://2.example']
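
Note that the pattern above matches any http(s) URL in the string, whether or not it sits inside an href attribute. A minimal sketch that anchors the match on the attribute itself, assuming double-quoted values (the sample html string and the HREF_RE name are illustrative, not part of the original answer):

```python
import re

html = ('<p>See http://not-a-link.example</p>'
        '<a href="http://example.com">More Examples</a>'
        '<a href="http://2.example">Even More Examples</a>')

# Capture only the value of a double-quoted href attribute.
HREF_RE = re.compile(r'href="(https?://[^"]+)"')

hrefs = HREF_RE.findall(html)
print(hrefs)  # ['http://example.com', 'http://2.example']
```

The URL in the paragraph text is skipped because it is not preceded by href=".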

How to use regex to extract links from HTML tag-attributes (src, data, href, or others)

(?:data-full-resolution|src|href|data)=\"(.*?)\"

Regex Explanation

  • (?: Non-capturing group
    • data-full-resolution|src|href|data One of data-full-resolution, src, href or data
  • ) Close non-capturing group
  • =\" Match =" after an attribute name
  • ( Capturing group
    • .*? Non-greedy capturing till the next quote
  • ) Close group
  • \" Match the close quote


Python Example

import re

html = """<a href="<link-href>"></a>
<img src="<link-src>" data-full-resolution="<link-data-full-resolution>" />
<object data="<link-data>"/>"""

print(re.findall(r"(?:data-full-resolution|src|href|data)=\"(.*?)\"", html)) # ['<link-href>', '<link-src>', '<link-data-full-resolution>', '<link-data>']

Because the pattern contains exactly one capturing group, re.findall returns a list of that group's matches rather than the full matched text.
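
The pattern above only handles double-quoted values. A possible variant that also accepts single quotes, sketched with a backreference (the sample markup and the ATTR_RE name are assumptions made for illustration):

```python
import re

html = """<a href='page-1.html'></a>
<img src="image.png" data-full-resolution='image-full.png' />"""

# Group 1 remembers which quote opened the value;
# \1 requires the same quote character to close it.
ATTR_RE = re.compile(r"""(?:data-full-resolution|src|href|data)=(["'])(.*?)\1""")

# With two capturing groups, findall would return tuples,
# so iterate over the matches and keep only group 2 (the value).
values = [m.group(2) for m in ATTR_RE.finditer(html)]
print(values)  # ['page-1.html', 'image.png', 'image-full.png']
```

Unquoted attribute values are still not matched by this sketch.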

Python Regex to extract relative href links

pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'

test:

import re
s = """
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>
<br/>
<a href="data/self/dated/station1_1402010.txt">Saturday, February 10, 2014</a>
<br/>
<a href="data/self/dated/station1_1402012.txt">Saturday, February 12, 2014</a>
<br/>
"""
# Group 1: file name after the fixed prefix; group 2: the link text
pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'
print(re.findall(pattern, s))

output:

[('station1_140208.txt', 'Saturday, February 08, 2014'), ('station1_1402010.txt', 'Saturday, February 10, 2014'), ('station1_1402012.txt', 'Saturday, February 12, 2014')]
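
When the tuples become hard to read, named groups can label each part. A small sketch of the same pattern with names added (filename, label and LINK_RE are names chosen here for illustration):

```python
import re

s = """
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>
<a href="data/self/dated/station1_1402010.txt">Saturday, February 10, 2014</a>
"""

# Same pattern as above, with each capturing group given a name.
LINK_RE = re.compile(
    r'href="data/self/dated/(?P<filename>[^"]*)"[^>]*>(?P<label>[\s\S]*?)</a>'
)

# groupdict() turns each match into a {"filename": ..., "label": ...} dict.
links = [m.groupdict() for m in LINK_RE.finditer(s)]
for link in links:
    print(link["filename"], "->", link["label"])
```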

How do you extract a url from a string using python?

There may be a few ways to do this, but the cleanest would be to use a regex:

>>> myString = "This is a link http://www.google.com"
>>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
http://www.google.com

If there can be multiple links, you can use something similar to the following:

>>> myString = "These are the links http://www.google.com  and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print(re.findall(r'(https?://[^\s]+)', myString))
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
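
One caveat with [^\s]+ is that it also swallows punctuation that directly follows a URL in running text. A rough sketch of one way to trim it, assuming the URLs of interest never legitimately end in one of the stripped characters (the sample text is made up for illustration):

```python
import re

text = "See http://www.google.com, then (http://stackoverflow.com/questions/839994)."

# Match greedily up to the next whitespace, then trim punctuation
# that commonly trails a URL in prose.
urls = [u.rstrip('.,;:!?)"\'') for u in re.findall(r'https?://\S+', text)]
print(urls)  # ['http://www.google.com', 'http://stackoverflow.com/questions/839994']
```

A URL that really does end in ")" or "." would be shortened by this, so treat it as a heuristic.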

How to extract a link from the embedded link with python?

I think it would be better to use Beautiful Soup instead.

The text to parse is an iframe tag with a src attribute. You are trying to retrieve the URL after href= and before &width in that attribute.

After that, you would need to decode the percent-encoded URL back to plain text.

First, you pass the text to Beautiful Soup and get the attribute out of it:

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly to avoid a warning

src_attribute = soup.find("iframe")["src"]

From there, you could use either a regex or .split() (quite hacky):

# Regex
link = re.search(r'href=([^&]*)', src_attribute).group(1)

# .split()
link = src_attribute.split("href=")[1].split("&")[0]

Lastly, you need to decode the percent-encoded URL. In Python 3 this is urllib.parse.unquote (urllib2.unquote in Python 2):

from urllib.parse import unquote

link = unquote(link)

and you are done!

So the resulting code would be:

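import re
from urllib.parse import unquote  # Python 3; in Python 2 this was urllib2.unquote

from bs4 import BeautifulSoup

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
# Name the parser explicitly so Beautiful Soup does not have to guess one
soup = BeautifulSoup(text, "html.parser")

src_attribute = soup.find("iframe")["src"]

# Option 1: regex, capturing everything after href= up to the next &
link = re.search(r'href=([^&]*)', src_attribute).group(1)
# Option 2: .split(), which gives the same result
link = src_attribute.split("href=")[1].split("&")[0]

# Decode the percent-encoded URL
link = unquote(link)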
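
As an alternative to both the regex and .split(), the standard library can parse the query string directly. A minimal sketch for Python 3, assuming src_attribute already holds the iframe's src value (it is written out by hand here so the example runs without Beautiful Soup):

```python
from urllib.parse import urlparse, parse_qs

src_attribute = ("https://www.facebook.com/plugins/post.php"
                 "?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646"
                 "&width=500")

# parse_qs maps each query parameter to a list of values
# and percent-decodes them in the same step.
query = parse_qs(urlparse(src_attribute).query)
link = query["href"][0]
print(link)  # https://www.facebook.com/DoctorTaniya/posts/1906676949620646
```

This does not depend on the order of the parameters, and no separate unquote call is needed.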

Extracting a URL in Python

In response to the OP's edit I hijacked Find Hyperlinks in Text using Python (twitter related) and came up with this:

import re

myString = "This is my tweet check it out http://example.com/blah"

print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
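
Keep in mind that re.search returns None when the text contains no URL, so calling .group() on the result raises AttributeError. A small guarded sketch (first_url and URL_RE are hypothetical names introduced here):

```python
import re

URL_RE = re.compile(r"(?P<url>https?://\S+)")

def first_url(text):
    """Return the first http(s) URL in text, or None if there is none."""
    match = URL_RE.search(text)
    return match.group("url") if match else None

print(first_url("This is my tweet check it out http://example.com/blah"))  # http://example.com/blah
print(first_url("No links in this tweet"))  # None
```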

In Python, how to write a regex which catches a URL in an a href tag?

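Everything that is not a double quote is matched by the character class [^"]

so you can repeat it up to the closing quote:
[^"]*"

and, prefixed with the tag and attribute name, get:
'<a href="[^"]*"'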
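
To get the URL itself rather than the whole matched text, the [^"]* part can be wrapped in a capturing group. A short sketch, which also tolerates other attributes before href (the sample markup is invented for illustration):

```python
import re

html = ('<p>Intro</p><a href="http://example.com/page">link</a>'
        '<a class="x" href="/relative">other</a>')

# <a followed by whitespace, then lazily skip other attributes up to href="...",
# capturing everything between the quotes.
hrefs = re.findall(r'<a\s[^>]*?href="([^"]*)"', html)
print(hrefs)  # ['http://example.com/page', '/relative']
```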


