Beautiful Soup Findall Doesn't Find Them All

Beautiful Soup findAll doesn't find them all

Different HTML parsers deal with broken HTML differently. That page serves broken HTML, and the lxml parser does not handle it well:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18

The standard library html.parser has less trouble with this specific page:

>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44

Translating that to your specific code sample using urllib, you would specify the parser thus:

soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can do the reading

BeautifulSoup: findAll doesn't find the tags

The issue is the parser:

In [21]: req = requests.get("http://www.wired.com/2016/08/cape-watch-99/")

In [22]: soup = BeautifulSoup(req.content, "lxml")

In [23]: len(soup.select("article[itemprop=articleBody] p"))
Out[23]: 26

In [24]: soup = BeautifulSoup(req.content, "html.parser")

In [25]: len(soup.select("article[itemprop=articleBody] p"))
Out[25]: 1
In [26]: soup = BeautifulSoup(req.content, "html5lib")

In [27]: len(soup.select("article[itemprop=articleBody] p"))
Out[27]: 26

You can see that html5lib and lxml find all the p tags, but the standard html.parser does not handle the broken HTML as well. Running the article's HTML through validator.w3 produces a long list of errors.


Beautiful Soup findAll doesn't find value

The data you see is loaded via JavaScript. You can fetch it from the site's JSON endpoint with the requests and json modules:

import json
import requests


url = "https://www.migros.com.tr/rest/search/screens/temel-gida-c-2?sayfa=1"
data = requests.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for p in data["data"]["searchInfo"]["storeProductInfos"]:
    print(
        "{:<35} {:<10} {:<10}".format(
            p["name"], p["regularPrice"], p["salePrice"]
        )
    )

Prints:

Maydanoz Adet                       445        445       
Soğan Kuru Dökme Kg                 195        195
Migros Havuç Beypazarı Paket Kg     875        750
Domates Kg                          1495       1495
Kabak Sakız Kg                      1990       1990
Dereotu Adet                        930        930
Roka Demet                          603        603
Salata Kıvırcık Adet                1090       1090
Patlıcan Kemer Kg                   1990       1990
Soğan Taze Demet                    925        925
Hıyar Kg                            1790       1790
Domates Salkım Kg                   2290       2290
Biber Kırmızı Kg                    2190       2190
Brokoli Kg                          3450       3450
Atom Salata Adet                    1206       1206
Kereviz Kg                          875        875
Karnabahar Kg                       1390       1390
Ispanak Kg                          1450       1450
Patates Taze Kg                     556        556
Biber Köy Usulü Kg                  2990       2990
Nane Adet                           631        631
Biber Sivri Kg                      2690       2690
Pırasa Kg                           930        930
Lahana Beyaz Kg                     595        595
Biber Dolmalık Kg                   2702       2702
Domates Şeker 250 G                 837        837
Lahana Kırmızı Kg                   1206       1206
Patates Ekonomik Boy File Kg        445        445
Pancar Kg                           743        743
Domates Pembe Kg                    1850       1850

Beautiful Soup find_all not finding them all

The data you see is loaded from an external URL as JSON. You can use this example to load it:

import json
import requests


url = "https://www.mohfw.gov.in/data/datanew.json"
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for d in data:
    print(
        "{:<50} {:<10} {}".format(
            d["state_name"], d["new_active"], d["new_cured"]
        )
    )

Prints:

Andaman and Nicobar Islands                        232        5879
Andhra Pradesh                                     159597     1016142
Arunachal Pradesh                                  1632       17501
Assam                                              29407      237088
Bihar                                              110431     410484
Chandigarh                                         8170       37288
Chhattisgarh                                       124459     653542
Dadra and Nagar Haveli and Daman and Diu           1606       6564
Delhi                                              90419      1124771
Goa                                                26731      72799
Gujarat                                            148297     464396
Haryana                                            108830     429950
Himachal Pradesh                                   23572      85713
Jammu and Kashmir                                  37302      152109
Jharkhand                                          59707      194433
Karnataka                                          464383     1210013
Kerala                                             357215     1339257
Ladakh                                             1374       13035
Lakshadweep                                        1165       2078
Madhya Pradesh                                     86639      520024
Maharashtra                                        644068     4107092
Manipur                                            2391       30141
Meghalaya                                          2019       15810
Mizoram                                            1609       5168
Nagaland                                           1798       12801
Odisha                                             67437      410227
Puducherry                                         10849      51584
Punjab                                             61935      327976
Rajasthan                                          197045     466310
Sikkim                                             1930       6617
Tamil Nadu                                         125230     1109450
Telangana                                          77704      389491
Tripura                                            1905       33929
Uttarakhand                                        56627      144409
Uttar Pradesh                                      272568     1081817
West Bengal                                        120946     765843
                                                   3487229    16951731

BeautifulSoup findAll() not finding all, regardless of which parser I use

When you download a page through urllib (or the requests HTTP library), you get the original HTML source file.

Initially there is only a single tag with the class name 'ships-listing', because only that tag is present in the source page. Once you scroll down, the page generates additional <ul class='ships-listing'> elements, and these are created by JavaScript.


So when you download a page using urllib, the downloaded content contains only the original source page (you can confirm this with the browser's view-source option).

Beautiful soup findAll didn't find all of them

Works for me:

soup = BeautifulSoup.BeautifulSoup(xml)
for section in soup.findAll("section"):
    for post in section.findAll('a', attrs={'class': ['package-link']}):
        print post

results in:

<a href="/node/21537908" class="package-link">Democracy and its enemies</a>
<a href="/node/21537909" class="package-link">The year of self-induced stagnation</a>
<a href="/node/21537914" class="package-link">How to run the euro?</a>
<a href="/node/21537916" class="package-link">Wanted: a fantasy American president</a>
<a href="/node/21537917" class="package-link">Poking goes public</a>
<a href="/node/21537918" class="package-link">Varied company</a>
<a href="/node/21537919" class="package-link">All eyes on London</a>
<a href="/node/21537921" class="package-link">And now for some non-events</a>

Edit

Versions I use:

  • Python 2.7.3
  • BeautifulSoup 3.2.0

