Beautiful Soup Findall Doesn't Find Them All

Beautiful Soup findAll doesn't find them all

Different HTML parsers deal with broken HTML differently. That page serves broken HTML, and the lxml parser does not handle it well:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18

The standard library html.parser has less trouble with this specific page:

>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44

Translating that to your specific code sample using urllib, you would specify the parser thus:

soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can do the reading

BeautifulSoup: findAll doesn't find the tags

The issue is the parser:

In [21]: req = requests.get("http://www.wired.com/2016/08/cape-watch-99/")

In [22]: soup = BeautifulSoup(req.content, "lxml")

In [23]: len(soup.select("article[itemprop=articleBody] p"))
Out[23]: 26

In [24]: soup = BeautifulSoup(req.content, "html.parser")

In [25]: len(soup.select("article[itemprop=articleBody] p"))
Out[25]: 1
In [26]: soup = BeautifulSoup(req.content, "html5lib")

In [27]: len(soup.select("article[itemprop=articleBody] p"))
Out[27]: 26

You can see that html5lib and lxml find all the p tags, but the standard html.parser does not handle the broken HTML as well. Running the article's HTML through validator.w3 produces a long list of errors.


Beautiful Soup findAll doesn't find value

The data you see is loaded via JavaScript. You can fetch it from the site's JSON endpoint with the requests and json modules:

import json
import requests


url = "https://www.migros.com.tr/rest/search/screens/temel-gida-c-2?sayfa=1"
data = requests.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for p in data["data"]["searchInfo"]["storeProductInfos"]:
    print(
        "{:<35} {:<10} {:<10}".format(
            p["name"], p["regularPrice"], p["salePrice"]
        )
    )

Prints:

Maydanoz Adet                       445        445       
Soğan Kuru Dökme Kg                 195        195
Migros Havuç Beypazarı Paket Kg     875        750
Domates Kg                          1495       1495
Kabak Sakız Kg                      1990       1990
Dereotu Adet                        930        930
Roka Demet                          603        603
Salata Kıvırcık Adet                1090       1090
Patlıcan Kemer Kg                   1990       1990
Soğan Taze Demet                    925        925
Hıyar Kg                            1790       1790
Domates Salkım Kg                   2290       2290
Biber Kırmızı Kg                    2190       2190
Brokoli Kg                          3450       3450
Atom Salata Adet                    1206       1206
Kereviz Kg                          875        875
Karnabahar Kg                       1390       1390
Ispanak Kg                          1450       1450
Patates Taze Kg                     556        556
Biber Köy Usulü Kg                  2990       2990
Nane Adet                           631        631
Biber Sivri Kg                      2690       2690
Pırasa Kg                           930        930
Lahana Beyaz Kg                     595        595
Biber Dolmalık Kg                   2702       2702
Domates Şeker 250 G                 837        837
Lahana Kırmızı Kg                   1206       1206
Patates Ekonomik Boy File Kg        445        445
Pancar Kg                           743        743
Domates Pembe Kg                    1850       1850

Beautiful Soup find_all not finding them all

The data you see is loaded from an external URL as JSON. You can use this example to load it:

import json
import requests


url = "https://www.mohfw.gov.in/data/datanew.json"
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for d in data:
    print(
        "{:<50} {:<10} {}".format(
            d["state_name"], d["new_active"], d["new_cured"]
        )
    )

Prints:

Andaman and Nicobar Islands                        232        5879
Andhra Pradesh                                     159597     1016142
Arunachal Pradesh                                  1632       17501
Assam                                              29407      237088
Bihar                                              110431     410484
Chandigarh                                         8170       37288
Chhattisgarh                                       124459     653542
Dadra and Nagar Haveli and Daman and Diu           1606       6564
Delhi                                              90419      1124771
Goa                                                26731      72799
Gujarat                                            148297     464396
Haryana                                            108830     429950
Himachal Pradesh                                   23572      85713
Jammu and Kashmir                                  37302      152109
Jharkhand                                          59707      194433
Karnataka                                          464383     1210013
Kerala                                             357215     1339257
Ladakh                                             1374       13035
Lakshadweep                                        1165       2078
Madhya Pradesh                                     86639      520024
Maharashtra                                        644068     4107092
Manipur                                            2391       30141
Meghalaya                                          2019       15810
Mizoram                                            1609       5168
Nagaland                                           1798       12801
Odisha                                             67437      410227
Puducherry                                         10849      51584
Punjab                                             61935      327976
Rajasthan                                          197045     466310
Sikkim                                             1930       6617
Tamil Nadu                                         125230     1109450
Telangana                                          77704      389491
Tripura                                            1905       33929
Uttarakhand                                        56627      144409
Uttar Pradesh                                      272568     1081817
West Bengal                                        120946     765843
                                                   3487229    16951731

BeautifulSoup findAll() not finding all, regardless of which parser I use

When you download a page through urllib (or the requests HTTP library), you get the original HTML source file.

Initially there is only a single tag with the class name 'ships-listing', because only that tag is present in the source page. Once you scroll down, the page generates additional <ul class='ships-listing'> elements, and these are created by JavaScript.


So when you download a page using urllib, the downloaded content contains only the original source page (you can confirm this with the browser's view-source option).

Beautiful soup findAll didn't find all of them

Works for me:

soup = BeautifulSoup.BeautifulSoup(xml)
for section in soup.findAll("section"):
    for post in section.findAll('a', attrs={'class': ['package-link']}):
        print post

results in:

<a href="/node/21537908" class="package-link">Democracy and its enemies</a>
<a href="/node/21537909" class="package-link">The year of self-induced stagnation</a>
<a href="/node/21537914" class="package-link">How to run the euro?</a>
<a href="/node/21537916" class="package-link">Wanted: a fantasy American president</a>
<a href="/node/21537917" class="package-link">Poking goes public</a>
<a href="/node/21537918" class="package-link">Varied company</a>
<a href="/node/21537919" class="package-link">All eyes on London</a>
<a href="/node/21537921" class="package-link">And now for some non-events</a>

Edit

Versions I use:

  • Python 2.7.3
  • BeautifulSoup 3.2.0

