Beautifulsoup Getting Href

Python Beautifulsoup, get href tag, in a tag

I believe your problem lies in this line:

product_link = title.get('a')['href']

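You already have a list of "a" elements, so each title is presumably an "a" tag itself. title.get('a') looks up an attribute named a, which does not exist, so it returns None and the ['href'] lookup fails. You probably just need: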

product_link = title['href']
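
To illustrate the difference, here is a minimal sketch. The HTML snippet and the class name product-title are made up for the example and are not taken from the original question:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed for illustration only.
html = """
<a class="product-title" href="/item/1">First</a>
<a class="product-title" href="/item/2">Second</a>
"""

soup = BeautifulSoup(html, "html.parser")
titles = soup.find_all("a", class_="product-title")

# Each element of the result is already an <a> tag, so index it directly.
product_links = [title["href"] for title in titles]
print(product_links)

# .get('a') looks up an *attribute* named "a", which these tags do not have.
missing = titles[0].get("a")
print(missing)
```

If some of the tags might lack an href, title.get('href') returns None instead of raising a KeyError.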

retrieve links from web page using python and BeautifulSoup

Here's a short snippet using the SoupStrainer class in BeautifulSoup:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
# httplib2 returns a (response headers, body) pair
status, response = http.request('http://www.nytimes.com')

# Parse only the <a> tags and print every href found
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory and speed wise), if you know what you're parsing in advance.
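
For anyone who wants to try SoupStrainer without making a network request, here is a small sketch that runs on an inline string. The HTML is invented for the example; only the BeautifulSoup and SoupStrainer calls come from the answer above:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page content, assumed for illustration only.
html = """
<html><body>
<p>Intro <a href="/one">One</a></p>
<div><a name="anchor-only">No href</a> <a href="/two">Two</a></div>
</body></html>
"""

# Only <a> tags are turned into objects; everything else is skipped while parsing.
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_a_tags)

# href=True keeps only the <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

Using find_all("a", href=True) here is just an alternative to the has_attr('href') check in the loop above.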

python/beautifulsoup to find a href in p of specific div class

I have used your data as the HTML input. You can use a CSS selector to find the tag:

html="""<div class="post-body clearfix">
<div class="entry-content clearfix">
<a target="_blank" class="single-post-ad" href="https://deallite.uk/#pricing">...</a>
<ul>...</ul>
<p>
<em>Friday 15th October 2021. London, UK. </em>
"Fast-growing fintech "
<a href="http://withplum.com/" target="_blank" rel="noreferrer noopener">Plum</a>
" is today announcing a first close of new funding that will
supercharge the company's expansion, and cement Plum as
Europe's ultimate money management app. "
</p>"""

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"html.parser")

main_data=soup.select("div.entry-content.clearfix > p > a")
main_data[0]['href']

Output:

'http://withplum.com/'

Using the live page from your link:

import requests
res=requests.get("https://www.uktechnews.info/2021/10/13/humn-ai-secures-10-1-million-series-a-investment-led-by-bxr-group-and-shell-ventures/")
soup=BeautifulSoup(res.text,"html.parser")

Select the elements you need:

main_data=soup.select("div.entry-content.clearfix > p > a")
main_data[0]['href']

Output:

'http://humn.ai/'
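
Note that main_data[0] raises an IndexError when the selector matches nothing. A possible way to guard against that, sketched on a made-up fragment (first_href is a hypothetical helper name, not part of BeautifulSoup):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, assumed for illustration only.
html = """
<div class="entry-content clearfix">
  <p>Some text <a href="http://example.com/">Example</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

def first_href(soup, selector):
    """Return the href of the first match, or None if nothing matches."""
    tag = soup.select_one(selector)
    return tag.get("href") if tag is not None else None

found = first_href(soup, "div.entry-content.clearfix > p > a")
not_found = first_href(soup, "div.no-such-class > p > a")
print(found, not_found)
```

select_one() returns the first matching tag or None, so the None check replaces the index lookup.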

BeautifulSoup getting href of a list - need to simplify the script - replace multiprocessing

The following is one way of getting that information asynchronously (it should work in Colab notebooks). I got the diocese URLs from a different part of the site (Structured view - World Regions). I would expect the diocese count there to match the count from the letters list.

from httpx import Client, AsyncClient, Limits
from bs4 import BeautifulSoup as bs
import pandas as pd
import re
from datetime import datetime
import asyncio
import nest_asyncio

nest_asyncio.apply()

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

big_df_list = []

def all_dioceses():
    dioceses = []
    root_links = [f'https://www.catholic-hierarchy.org/diocese/qview{x}.html' for x in range(1, 8)]
    with Client(headers=headers, timeout=60.0, follow_redirects=True) as client:
        for x in root_links:
            r = client.get(x)
            soup = bs(r.text)
            soup.select_one('ul#menu2').decompose()
            for link in soup.select('ul > li > a'):
                dioceses.append('https://www.catholic-hierarchy.org/diocese/' + link.get('href'))
    return dioceses
# print(all_dioceses())

async def get_diocese_info(url):
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        try:
            r = await client.get(url)
            soup = bs(r.text)
            d_name = soup.select_one('h1[align="center"]').get_text(strip=True)
            info_table = soup.select_one('div[id="d1"] > table')
            d_bishops = ' | '.join([x.get_text(strip=True) for x in info_table.select('td')[0].select('li')])
            d_extra_info = ' | '.join([x.get_text(strip=True) for x in info_table.select('td')[1].select('li')])
            big_df_list.append((d_name, d_bishops, d_extra_info, url))
            print('done', d_name)
        except Exception as e:
            print(url, e)

async def scrape_dioceses():
    start_time = datetime.now()
    tasks = asyncio.Queue()
    for x in all_dioceses():
        tasks.put_nowait(get_diocese_info(x))

    async def worker():
        while not tasks.empty():
            await tasks.get_nowait()

    await asyncio.gather(*[worker() for _ in range(100)])
    end_time = datetime.now()
    duration = end_time - start_time
    print('diocese scraping took', duration)

asyncio.run(scrape_dioceses())
df = pd.DataFrame(big_df_list, columns = ['Name', 'Bishops', 'Info', 'Url'])
print(df)

Result in terminal:

done Eparchy of Mississauga (Syro-Malabar)
done Eparchy of Mar Addai of Toronto (Chaldean)
done Eparchy of Saint-Sauveur de Montréal (Melkite Greek)
done Diocese of Calgary
done Archdiocese of Winnipeg
[...]
diocese scraping took 0:03:02.366096

Name Bishops Info Url
0 Eparchy of Mississauga (Syro-Malabar) JoseKalluvelil, Bishop Type of Jurisdiction: Eparchy | Elevated:22 December2018 | Immediately Subject to the Holy See | Syro-Malabar Catholic Church of the Chaldean Tradition | Country:Canada | Mailing Address: Syro-Malabar Apostolic Exarchate, 6630 Turner Valley Rd., Mississauga, ON L5V 2P1, Canada | Telephone: (905)858-8200 | Fax: 858-8208 https://www.catholic-hierarchy.org/diocese/dmism.html
1 Eparchy of Mar Addai of Toronto (Chaldean) Robert SaeedJarjis, Bishop | Bawai (Ashur)Soro, Bishop Emeritus Type of Jurisdiction: Eparchy | Erected:10 June2011 | Immediately Subject to the Holy See | Chaldean Catholic Church of the Chaldean Tradition | Country:Canada | Conference Region:Ontario | Mailing Address: 2 High Meadow Place, Toronto, ON M9L 2Z5, Canada | Telephone: (416)746-5816 | Fax: 746-5850 https://www.catholic-hierarchy.org/diocese/dtoch.html
2 Eparchy of Saint-Sauveur de Montréal (Melkite Greek) MiladJawish, B.S., Bishop Type of Jurisdiction: Eparchy | Elevated:1 September1984 | Immediately Subject to the Holy See | Melkite Greek Catholic Church of the Byzantine Tradition | Country:Canada | Conference Region:Quebec | Web Site:http://www.melkite.com/ | Mailing Address: 10025 boul. de l'Arcadie, Montreal, QC H4N 2S1, Canada | Telephone: (514)272.6430 | Fax: 202.1274 https://www.catholic-hierarchy.org/diocese/dmome.html
3 Diocese of Calgary William TerrenceMcGrattan, Bishop | Frederick BernardHenry, Bishop Emeritus Type of Jurisdiction: Diocese | Erected:30 November1912 | Metropolitan: Archdiocese ofEdmonton | Rite: Latin (or Roman) | Province: Alberta | Country:Canada | Square Kilometers: 110,500 (42,680 Square Miles) | Conference Region:West (Ouest) | Catholic Directory Abbreviation: Cal | Official Web Site:http://www.calgarydiocese.ca/ | Mailing Address: Catholic Pastoral Centre, Room 290, The Iona Building, 120-17th Avenue S.W., Calgary, AB T2S 2T2, Canada | Telephone: (403)218-5528 | Fax: 264-0272 https://www.catholic-hierarchy.org/diocese/dcalg.html
4 Archdiocese of Winnipeg Richard JosephGagnon, Archbishop | James VernonWeisgerber, Archbishop Emeritus Type of Jurisdiction: Archdiocese | Erected:4 December1915 | Immediately Subject to the Holy See | Rite: Latin (or Roman) | Province: Manitoba | Country:Canada | Square Kilometers: 116,405 (44,961 Square Miles) | Conference Region:West (Ouest) | Catholic Directory Abbreviation: W | Official Web Site:http://www.archwinnipeg.ca/ | Mailing Address: Chancery Office, 1495 Pembina Highway, Winnipeg, MB R3T 2C6, Canada | Telephone: (204)452-2227 | Fax: 475-4409 https://www.catholic-hierarchy.org/diocese/dwinn.html
... ... ... ... ...
2619 Archiepiscopal Exarchate of Krym (Ukrainian) Vacant | Makariy BohdanLeniv, O.S.B.M., Apostolic Administrator | MykhayloBubniy, C.SS.R., Archiepiscopal Administrator Type of Jurisdiction: Archiepiscopal Exarchate | Split:13 February2014 | Metropolitan: Archeparchy ofKyiv-Halyč {Kiev} (Ukrainian) | Ukrainian Catholic Church of the Byzantine Tradition | Country:Ukraine | Mailing Address: vul. Schmidta 22/12, 65000 Odessa, Ukraina | Telephone: (0482)32.58.90 | Fax: 32.58.89 https://www.catholic-hierarchy.org/diocese/dkrym.html
2620 Diocese of Lutsk VitaliySkomarovskyi, Bishop | MarkijanTrofym’yak, Bishop Emeritus Type of Jurisdiction: Diocese | Split:28 October1925 | Metropolitan: Archdiocese ofLviv | Rite: Latin (or Roman) | Country:Ukraine | Square Kilometers: 40,190 (15,523 Square Miles) | Official Web Site:http://catholic.volyn.ua/ | Mailing Address: Kuria Diecezjalna, vul. Katedralna 17, 43016 Lutsk, Ukraina | Telephone: (0332)72.15.32 | Fax: (same) https://www.catholic-hierarchy.org/diocese/dluts.html
2621 Diocese of Stockholm AndersArborelius, O.C.D., Cardinal, Bishop Type of Jurisdiction: Diocese | Elevated:29 June1953 | Immediately Subject to the Holy See | Rite: Latin (or Roman) | Country:Sweden | Square Kilometers: 450,295 (173,926 Square Miles) | Official Web Site:https://www.katolskakyrkan.se | Mailing Address: Katolska Biskopsambetet, Gotgatan 68, P.O. Box 4114, S-102 62 Stockholm, Sverige | Telephone: (08)462.66.02 | Fax: 702.05.55 https://www.catholic-hierarchy.org/diocese/dstos.html
2622 Archeparchy of Diarbekir (Amida) (Chaldean) RamziGarmou, Ist. del Prado, Archbishop Type of Jurisdiction: Archeparchy | Elevated:3 January1966 | Chaldean Catholic Church of the Chaldean Tradition | Country:Turkey | Mailing Address: Archeveche Chaldeen, Hamalbasi Caddesi 20, Galatasaray, 34435 Beyoglu, Istanbul, Turkiye | Telephone: (0212)252.34.49 | Fax: (same) https://www.catholic-hierarchy.org/diocese/ddiar.html
2623 Eparchy of Kolomyia (Ukrainian) VasylIvasyuk, Bishop Type of Jurisdiction: Eparchy | Split:12 September2017 | Metropolitan: Archeparchy ofIvano-Frankivsk [Stanislaviv] (Ukrainian) | Ukrainian Catholic Church of the Byzantine Tradition | Country:Ukraine | Square Kilometers: 14,000 (5,407 Square Miles) | Official Web Site:https://kolugcc.org.ua | Mailing Address: vul. Ivana Franka 29, 78200 Kolomyia, Ukraina | Telephone: (06891)19.707 https://www.catholic-hierarchy.org/diocese/dkolo.html
2624 rows × 4 columns

As you can see, this code pulled the full info for about 2.6k dioceses (2,624 rows) in approximately 3 minutes, while using far fewer resources than multiprocessing or multithreading.
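
The worker-pool idea can be tried in isolation with a rough sketch like the one below. It is not the scraper itself: fake_fetch and run_pool are hypothetical names, the network call is replaced by a short sleep, and the queue holds plain URL strings rather than coroutine objects (a small variation on the code above):

```python
import asyncio

async def fake_fetch(url, results):
    # Stand-in for an HTTP request; sleeps briefly instead of hitting the network.
    await asyncio.sleep(0.01)
    results.append(url.upper())

async def run_pool(urls, n_workers=5):
    results = []
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    async def worker():
        # Each worker keeps pulling URLs until the queue is drained.
        while not queue.empty():
            url = queue.get_nowait()
            await fake_fetch(url, results)

    # At most n_workers requests are in flight at any moment.
    await asyncio.gather(*(worker() for _ in range(n_workers)))
    return results

urls = [f"page-{i}" for i in range(20)]
results = asyncio.run(run_pool(urls))
print(len(results))
```

The number of workers caps how many requests run concurrently, which is what keeps the load on the site and on your machine bounded.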

You will need to install the following (install or upgrade; just run these commands one by one in a Colab notebook). asyncio is part of the Python standard library, so it does not need to be installed with pip:

pip install -U nest-asyncio
pip install -U httpx
pip install -U beautifulsoup4
pip install -U pandas

I also imported re in case you want to extract the pieces of information one by one (Jurisdiction, Tradition, Address, website, and so on), each in its own try/except block to account for missing ones, and extend the list/dataframe accordingly. All of the packages above can be found on https://pypi.org/ and are documented.
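
As a rough idea of what that field-by-field extraction could look like, here is a sketch on an Info value shortened from the first row of the sample output. extract_field is a hypothetical helper, and it returns None for a missing field instead of using try/except:

```python
import re

# Shortened from the first row of the sample output.
info = ("Type of Jurisdiction: Eparchy | Elevated:22 December2018 | "
        "Immediately Subject to the Holy See | Country:Canada | "
        "Telephone: (905)858-8200 | Fax: 858-8208")

def extract_field(info, label):
    """Return the text after 'label:' up to the next '|', or None if absent."""
    match = re.search(rf"{re.escape(label)}:\s*([^|]+)", info)
    return match.group(1).strip() if match else None

jurisdiction = extract_field(info, "Type of Jurisdiction")
country = extract_field(info, "Country")
website = extract_field(info, "Official Web Site")  # not present in this row
print(jurisdiction, country, website)
```

Each extracted value could then become its own column in the dataframe.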

Getting href urls using beautifulsoup in python

Select your elements more specifically, e.g. with CSS selectors, and be aware that you have to concatenate each href with the base URL:

['https://emi.ea.govt.nz'+a['href'] for a in soup.select('td.csv a')]

Alternatively, change your code to use find() instead of findAll() to locate the table. Calling findAll() there returns a ResultSet, which is what causes the following attribute error:

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

market_dataset = soup.find("table",{"class":"table table-striped table-condensed table-clean"})

Note: In new code, strictly use find_all() rather than the old findAll() syntax, and do not mix the two.
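
The difference between the two calls can be seen in a small sketch. The table markup below is invented for illustration and is much simpler than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical table, assumed for illustration only.
html = """
<table class="table table-striped">
  <tr><td class="csv"><a href="/data/a.csv">a</a></td></tr>
  <tr><td class="csv"><a href="/data/b.csv">b</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag, so find_all() can be chained on it.
table = soup.find("table", {"class": "table"})
hrefs = [a["href"] for a in table.find_all("a")]
print(hrefs)

# find_all() returns a ResultSet (a list of tags); chaining on it fails.
tables = soup.find_all("table", {"class": "table"})
try:
    tables.find_all("a")
    chained_ok = True
except AttributeError:
    chained_ok = False
print(chained_ok)
```

If you do need find_all() for the tables, loop over the ResultSet and call find_all("a") on each table separately.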

Example

from bs4 import BeautifulSoup
import requests

url = 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

['https://emi.ea.govt.nz'+a['href'] for a in soup.select('td.csv a')]

Output

['https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220318_FinalEnergyPrices_I.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220317_FinalEnergyPrices_I.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220316_FinalEnergyPrices.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220315_FinalEnergyPrices.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220314_FinalEnergyPrices.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220313_FinalEnergyPrices.csv',
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices/20220312_FinalEnergyPrices.csv',...]
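
As an alternative to concatenating strings by hand, urljoin from the standard library can resolve each href against the page URL. A minimal sketch, in which the page URL is the one from the example and the two hrefs are made up:

```python
from urllib.parse import urljoin

# Page URL from the example; the hrefs below are hypothetical.
page_url = "https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices"
hrefs = [
    "/Wholesale/Datasets/FinalPricing/EnergyPrices/file.csv",  # root-relative
    "https://example.com/other.csv",                           # already absolute
]

# urljoin leaves absolute URLs untouched and resolves relative ones.
absolute = [urljoin(page_url, href) for href in hrefs]
print(absolute)
```

This avoids doubled or missing slashes and also copes with pages that mix relative and absolute links.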

