How to Get Href Links from HTML Using Python

How can I get href links from HTML using Python?

Try Beautiful Soup. In Python 2 with BeautifulSoup 3:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

In case you just want links starting with http://, you should use:

soup.findAll('a', attrs={'href': re.compile("^http://")})
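As a rough illustration of that filter (written for Python 3 with bs4, on made-up sample markup, so the URLs and variable names here are assumptions), the sketch below compares the `^http://` pattern with a slightly broader `^https?://` pattern that also accepts https links:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample_html = """
<a href="http://example.com/a">plain http</a>
<a href="https://example.com/b">https</a>
<a href="/relative/c">relative</a>
<a name="no-href">anchor without href</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "^http://" matches only plain-http links; "^https?://" also accepts https.
http_only = [a["href"] for a in soup.find_all("a", href=re.compile(r"^http://"))]
http_or_https = [a["href"] for a in soup.find_all("a", href=re.compile(r"^https?://"))]

print(http_only)
print(http_or_https)
```

Relative links and anchors without an href are skipped by both patterns.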

In Python 3 with BS4 it should be:

from bs4 import BeautifulSoup
import urllib.request

html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a'):  # find_all is the BS4 spelling of findAll
    print(link.get('href'))
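One thing to watch for: link.get('href') returns None for an <a> tag that has no href attribute. A minimal sketch, assuming a small inline sample instead of a downloaded page (the markup is made up), shows how passing href=True skips such anchors:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the second anchor has no href attribute.
sample_html = """
<a href="/first">first</a>
<a name="top">named anchor</a>
<a href="/second">second</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Without a filter, .get("href") yields None for the anchor that lacks href.
all_values = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only anchors that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(all_values)
print(hrefs)
```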

How to get href link by text in Python

You might want to try BeautifulSoup.

For example:

from bs4 import BeautifulSoup

sample_html = """
<a href="https://www.cnbeta.com/articles/science/1062069.htm"><strong>阅读全文</strong></a>
<a href="https://www.cnbeta.com/articles/science/1062068.htm"><strong>RANDOM TEXT!</strong></a>
"""

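# Keep only <a> tags whose text starts with "阅" (the first link's text, 阅读全文, means "read the full article").
links = BeautifulSoup(sample_html, "html.parser").find_all(
    lambda t: t.name == "a" and t.text.startswith("阅")
)

print([a["href"] for a in links])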

Output:

['https://www.cnbeta.com/articles/science/1062069.htm']
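If you need an exact match on the link text rather than a prefix, one possible variation is sketched below. The helper name hrefs_by_text is hypothetical (it is not part of Beautiful Soup), and the sample markup is the one used above:

```python
from bs4 import BeautifulSoup


def hrefs_by_text(html, text):
    """Return the href of every <a> whose visible text equals `text`."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        a["href"]
        for a in soup.find_all("a", href=True)
        if a.get_text(strip=True) == text
    ]


sample_html = """
<a href="https://www.cnbeta.com/articles/science/1062069.htm"><strong>阅读全文</strong></a>
<a href="https://www.cnbeta.com/articles/science/1062068.htm"><strong>RANDOM TEXT!</strong></a>
"""

print(hrefs_by_text(sample_html, "RANDOM TEXT!"))
```

get_text(strip=True) also collects text from nested tags such as <strong>, so the comparison works even though the text is not a direct child of the <a> tag.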

Get href links from a tag

Looking at your HTML code, you can use the CSS selector a.product-item. This selects all <a> tags with class="product-item":

from bs4 import BeautifulSoup

html_text = """
<div class="row product-layout-category product-layout-list">
<div class="product-col wow fadeIn animated" style="visibility: visible;">
<a href="the link I want" class="product-item">
<div class="product-item-image">
<img data-src="link to an image" alt="name of the product" title="name of the product" class="img-responsive lazy" src="link to an image">
</div>
<div class="product-item-desc">
<p><span><strong>brand</strong></span></p>
<p><span class="font-size-16">name of the product</span></p>
<p class="product-item-price">
<span>product price</span></p>
</div>
</a>
</div>
</div>
"""

soup = BeautifulSoup(html_text, "html.parser")

for link in soup.select("a.product-item"):
    print(link.get("href"))  # or link["href"]

Prints:

the link I want
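The same selector can be tightened and combined with a nested lookup. The sketch below is only an illustration on a trimmed, made-up version of the markup above (the hrefs and product names are assumptions): it adds [href] to the selector and pairs each link with the product name found inside it.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the markup above.
html_text = """
<div class="product-col">
  <a href="/products/1" class="product-item">
    <span class="font-size-16">first product</span>
  </a>
  <a class="product-item"><span class="font-size-16">no link</span></a>
  <a href="/about">unrelated link</a>
</div>
"""

soup = BeautifulSoup(html_text, "html.parser")

# [href] restricts the match to product anchors that have an href attribute.
products = [
    (link["href"], link.select_one("span.font-size-16").get_text(strip=True))
    for link in soup.select("a.product-item[href]")
]
print(products)
```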

Python, BeautifulSoup - get href link

You have to pull out the anchor tag <a> that contains the href:

import requests
from bs4 import BeautifulSoup

url = "https://mojmikolow.pl/informacje,0.html"
html = requests.get(url).content
data_entries = BeautifulSoup(html, "html.parser").find_all("section", {"class": "news"})

for data_entry in data_entries:
    # find() returns None when a section has no <a> with an href
    link_tag = data_entry.find("a", href=True)
    if link_tag is not None:
        print(link_tag.get("href"))
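The hrefs you get back may be relative. As a sketch that runs without network access, the example below uses made-up markup in place of the downloaded page and a hypothetical base_url, and resolves each link with urljoin from the standard library:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/"  # hypothetical site root, for illustration only

# Hypothetical markup standing in for the downloaded page.
page = """
<section class="news"><a href="/news/1.html">first</a></section>
<section class="news"><p>no link here</p></section>
<section class="other"><a href="/skip.html">skipped</a></section>
<section class="news"><a href="https://other.example/x">absolute</a></section>
"""

soup = BeautifulSoup(page, "html.parser")

links = []
for section in soup.find_all("section", class_="news"):
    link_tag = section.find("a", href=True)
    if link_tag is None:  # this section has no <a> with an href
        continue
    # urljoin resolves relative hrefs and leaves absolute URLs unchanged.
    links.append(urljoin(base_url, link_tag["href"]))

print(links)
```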

Scraping using Python BeautifulSoup: getting the URL of an href that is a link

I believe you will actually need some kind of browser automation tool (Selenium, etc.) to simulate the hover-over and get the data.

Get href link with selenium (python)

In your case, i is a web element. To get its text, do not print i itself; use print(i.text).

If you want to extract the href from an a tag, use .get_attribute('href').

I also think you should locate the container with CSS_SELECTOR

div.search-content-cards

instead of CLASS_NAME.

The a tags are descendants of that container,

so your effective code should look like this:

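from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

el = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.search-content-cards")))
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which Selenium 4 removed
el_hrefs = el.find_elements(By.XPATH, ".//descendant::a[@href]")
for i in el_hrefs:
    print(i.get_attribute('href'))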

How can I extract href and title from this HTML

Select your elements more specifically, e.g. with CSS selectors, and iterate over the ResultSet to get the attributes of each element as a list of tuples:

[(a.get('title'), a.get('href')) for a in soup.select('h3 a[href][title]')]

Example

from bs4 import BeautifulSoup
html = '''
<h3 class="foo1">
<a href="someLink" title="someTitle">SomeTitle</a>
</h3>
<h3 class="foo1">
<a href="OtherLink" title="OtherTitle">OtherTitle</a>
</h3>
'''
soup = BeautifulSoup(html, "html.parser")

print([(a.get('title'), a.get('href')) for a in soup.select('h3 a[href][title]')])

Output

[('someTitle', 'someLink'), ('OtherTitle', 'OtherLink')]
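The choice between h3 a[href] and h3 a[href][title] matters when some links have no title. A small sketch on made-up markup (the second anchor and its fallback to the link text are assumptions for illustration) compares the two:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second anchor has no title attribute.
html = '''
<h3 class="foo1"><a href="someLink" title="someTitle">SomeTitle</a></h3>
<h3 class="foo1"><a href="noTitleLink">NoTitle</a></h3>
'''
soup = BeautifulSoup(html, "html.parser")

# [href][title] requires both attributes, so the second anchor is skipped.
strict = [(a["title"], a["href"]) for a in soup.select("h3 a[href][title]")]

# With only [href], a missing title falls back to the link text.
lenient = [
    (a.get("title", a.get_text(strip=True)), a["href"])
    for a in soup.select("h3 a[href]")
]

print(strict)
print(lenient)
```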

