Beautiful Soup findAll doesn't find them all
Different HTML parsers handle broken HTML differently. That page serves broken HTML, and the lxml parser does not cope with it very well:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18
The standard library html.parser has less trouble with this specific page:
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
Translating that to your specific code sample using urllib, you would specify the parser like this:
soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can do the reading
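If you are unsure which parsers are installed and how each one copes with a given document, one approach is to try each in turn. A sketch (the broken snippet and class name are invented for illustration):

```python
from bs4 import BeautifulSoup

# Deliberately broken HTML: the <p> tags are never closed.
broken = "<div><p class=x>one<p class=x>two</div>"

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken, parser)
    except Exception:
        # Parser not installed; lxml and html5lib are optional extras.
        continue
    print(parser, len(soup.find_all("p", class_="x")))
```

Whichever parser recovers the most of your target elements on the real page is the one to pass to BeautifulSoup.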
BeautifulSoup: findAll doesn't find the tags
The issue is the parser:
In [21]: req = requests.get("http://www.wired.com/2016/08/cape-watch-99/")
In [22]: soup = BeautifulSoup(req.content, "lxml")
In [23]: len(soup.select("article[itemprop=articleBody] p"))
Out[23]: 26
In [24]: soup = BeautifulSoup(req.content, "html.parser")
In [25]: len(soup.select("article[itemprop=articleBody] p"))
Out[25]: 1
In [26]: soup = BeautifulSoup(req.content, "html5lib")
In [27]: len(soup.select("article[itemprop=articleBody] p"))
Out[27]: 26
You can see that html5lib and lxml find all the p tags, but the standard html.parser does not handle the broken HTML as well. Running the article HTML through the W3C validator produces a long list of errors, which explains why the parsers disagree.
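Beautiful Soup also ships a diagnostics helper, bs4.diagnose.diagnose(), which parses the same markup with every installed tree builder and prints what each produced, which makes discrepancies like the one above visible at a glance:

```python
from bs4.diagnose import diagnose

# Parses the markup with every installed parser and prints each result.
diagnose("<article itemprop='articleBody'><p>one<p>two</article>")
```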
Beautiful Soup findAll doesn't find value
The data you see is loaded via JavaScript. You can use the requests and json modules to load it:
import json
import requests
url = "https://www.migros.com.tr/rest/search/screens/temel-gida-c-2?sayfa=1"
data = requests.get(url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for p in data["data"]["searchInfo"]["storeProductInfos"]:
    print(
        "{:<35} {:<10} {:<10}".format(
            p["name"], p["regularPrice"], p["salePrice"]
        )
    )
Prints:
Maydanoz Adet 445 445
Soğan Kuru Dökme Kg 195 195
Migros Havuç Beypazarı Paket Kg 875 750
Domates Kg 1495 1495
Kabak Sakız Kg 1990 1990
Dereotu Adet 930 930
Roka Demet 603 603
Salata Kıvırcık Adet 1090 1090
Patlıcan Kemer Kg 1990 1990
Soğan Taze Demet 925 925
Hıyar Kg 1790 1790
Domates Salkım Kg 2290 2290
Biber Kırmızı Kg 2190 2190
Brokoli Kg 3450 3450
Atom Salata Adet 1206 1206
Kereviz Kg 875 875
Karnabahar Kg 1390 1390
Ispanak Kg 1450 1450
Patates Taze Kg 556 556
Biber Köy Usulü Kg 2990 2990
Nane Adet 631 631
Biber Sivri Kg 2690 2690
Pırasa Kg 930 930
Lahana Beyaz Kg 595 595
Biber Dolmalık Kg 2702 2702
Domates Şeker 250 G 837 837
Lahana Kırmızı Kg 1206 1206
Patates Ekonomik Boy File Kg 445 445
Pancar Kg 743 743
Domates Pembe Kg 1850 1850
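The prices above appear to be integers in kuruş (1/100 of a lira); that is an assumption based on their magnitudes, not something the API documents. If so, a small helper turns them into readable amounts:

```python
def format_price(kurus: int) -> str:
    # Assumption: the API returns prices as integer kurus (1/100 TL).
    return "{:.2f} TL".format(kurus / 100)

print(format_price(1495))  # "Domates Kg" above -> 14.95 TL
print(format_price(750))   # sale price of the carrots -> 7.50 TL
```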
Beautiful Soup find_all not finding them all
The data you see is loaded from external URL via Json. You can use this example to load it:
import json
import requests
url = "https://www.mohfw.gov.in/data/datanew.json"
data = requests.get(url).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for d in data:
    print(
        "{:<50} {:<10} {}".format(
            d["state_name"], d["new_active"], d["new_cured"]
        )
    )
Prints:
Andaman and Nicobar Islands 232 5879
Andhra Pradesh 159597 1016142
Arunachal Pradesh 1632 17501
Assam 29407 237088
Bihar 110431 410484
Chandigarh 8170 37288
Chhattisgarh 124459 653542
Dadra and Nagar Haveli and Daman and Diu 1606 6564
Delhi 90419 1124771
Goa 26731 72799
Gujarat 148297 464396
Haryana 108830 429950
Himachal Pradesh 23572 85713
Jammu and Kashmir 37302 152109
Jharkhand 59707 194433
Karnataka 464383 1210013
Kerala 357215 1339257
Ladakh 1374 13035
Lakshadweep 1165 2078
Madhya Pradesh 86639 520024
Maharashtra 644068 4107092
Manipur 2391 30141
Meghalaya 2019 15810
Mizoram 1609 5168
Nagaland 1798 12801
Odisha 67437 410227
Puducherry 10849 51584
Punjab 61935 327976
Rajasthan 197045 466310
Sikkim 1930 6617
Tamil Nadu 125230 1109450
Telangana 77704 389491
Tripura 1905 33929
Uttarakhand 56627 144409
Uttar Pradesh 272568 1081817
West Bengal 120946 765843
3487229 16951731
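Since the JSON is already parsed into a list of dicts, you can also sort it before printing, for example by new_active, largest first. A sketch with a small inline sample so it runs offline; the int() conversion is defensive, since numeric fields in feeds like this often arrive as strings:

```python
# Small inline sample in the same shape as the feed.
sample = [
    {"state_name": "Delhi", "new_active": "90419"},
    {"state_name": "Goa", "new_active": "26731"},
    {"state_name": "Kerala", "new_active": "357215"},
]

# int() handles the values whether they arrive as strings or numbers.
top = sorted(sample, key=lambda d: int(d["new_active"]), reverse=True)
for d in top:
    print(d["state_name"], d["new_active"])
```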
BeautifulSoup findAll() not finding all, regardless of which parser I use
When you download a page through urllib (or the requests HTTP library), it downloads the original HTML source file.
Initially there is only a single tag with the class name 'ships-listing', because that tag comes with the source page. Once you scroll down, the page generates additional <ul class='ships-listing'> elements, and those are generated by JavaScript.
So when you download a page using urllib, the downloaded content contains only the original source page (you can see it with the view-source option in the browser).
Beautiful soup findAll didn't find all of them
Works for me:
soup = BeautifulSoup.BeautifulSoup(xml)
for section in soup.findAll("section"):
    for post in section.findAll('a', attrs={'class': ['package-link']}):
        print post
results in:
<a href="/node/21537908" class="package-link">Democracy and its enemies</a>
<a href="/node/21537909" class="package-link">The year of self-induced stagnation</a>
<a href="/node/21537914" class="package-link">How to run the euro?</a>
<a href="/node/21537916" class="package-link">Wanted: a fantasy American president</a>
<a href="/node/21537917" class="package-link">Poking goes public</a>
<a href="/node/21537918" class="package-link">Varied company</a>
<a href="/node/21537919" class="package-link">All eyes on London</a>
<a href="/node/21537921" class="package-link">And now for some non-events</a>
Edit
Versions I use:
- Python 2.7.3
- BeautifulSoup 3.2.0
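That answer predates Beautiful Soup 4. On Python 3 with bs4, the equivalent lookup is the following sketch (shown with a minimal inline document, since the original XML is not available):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the original document.
xml = ('<section><a href="/node/21537908" class="package-link">'
       'Democracy and its enemies</a></section>')

soup = BeautifulSoup(xml, "html.parser")
for section in soup.find_all("section"):
    for post in section.find_all("a", class_="package-link"):
        print(post)
```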