Understand the Find() Function in Beautiful Soup

soup.find("div", {"class":"real number"})['data-value']

Here you are searching for a div element, but in your example HTML it is the span that has the "real number" class. Try this instead:

soup.find("span", {"class": "real number", "data-value": True})['data-value']

Here we are also checking for the presence of the data-value attribute.
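As a minimal sketch of why this works, here is the attribute-presence check run against a small hypothetical snippet (the class names and data-value figures are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML illustrating the point: the class is on a span, not a div
html = '''
<div class="score">
    <span class="real number" data-value="42">42</span>
    <span class="fake number" data-value="7">7</span>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# data-value=True matches any element that HAS the attribute, whatever its value
tag = soup.find("span", {"class": "real number", "data-value": True})
print(tag["data-value"])  # -> 42
```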


To find elements having "real number" or "fake number" classes, you can make a CSS selector:

for elm in soup.select(".real.number, .fake.number"):
    print(elm.get("data-value"))

To get the 69% value:

soup.find("div", {"class": "percentage good"}).get_text(strip=True)

Or, a CSS selector:

soup.select_one(".percentage.good").get_text(strip=True)
soup.select_one(".score .percentage").get_text(strip=True)

Or, locate the h6 element containing the "Audit score" text and then get its preceding sibling:

soup.find("h6", string="Audit score").previous_sibling.get_text(strip=True)
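One caveat worth knowing: in pretty-printed HTML, .previous_sibling is often a whitespace text node rather than the tag you want; find_previous_sibling() skips those. A small sketch using made-up markup mirroring the layout described above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the "Audit score" layout discussed above
html = '''
<div class="score">
    <div class="percentage good">69%</div>
    <h6>Audit score</h6>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# find_previous_sibling() skips the whitespace-only text nodes that
# .previous_sibling would return in indented markup like this
h6 = soup.find("h6", string="Audit score")
print(h6.find_previous_sibling().get_text(strip=True))  # -> 69%
```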

Beautiful Soup find() function returning none even though element exists and find() works for other elements on the page?

Here's the code I used to scrape all 100 songs and their authors. This website is really horrible to scrape because it doesn't use ids or classes in a scrapable manner, so instead I relied (mostly) on the current structure of the page.
I'm not sure exactly what was causing your problem. The page was made with a framework, so it is littered with styling classes, and your selection was fragile because it relied on these being consistent. Perhaps the first element was styled differently (actually, this is almost certainly the case; notice how the cover image is bigger on the actual page).

from bs4 import BeautifulSoup
import requests
x = requests.get("https://www.billboard.com/charts/hot-100/").text
soup = BeautifulSoup(x, "html.parser")
# Target rows: div.chart-results-list > div.o-chart-results-list-row-container > ul.o-chart-results-list-row

songNames = [x.text for x in soup.select("div.chart-results-list > div.o-chart-results-list-row-container > ul.o-chart-results-list-row > li:nth-child(4) > ul > li:nth-child(1) h3")]
authorNames = [x.text for x in soup.select("div.chart-results-list > div.o-chart-results-list-row-container > ul.o-chart-results-list-row > li:nth-child(4) > ul > li:nth-child(1) span")]
print(songNames)
#print(authorNames)
print(len(songNames))

beautifulsoup find function returns - when retrieving text

To find an element by id, you can use soup.find(id='your_id').

Try this:

from bs4 import BeautifulSoup as bs

html = '''
<div class="row align-items-center">
<div class="col-md-4 mb-1 mb-md-0">Transfers:</div>
<div class="col-md-8"></div>
<span id="totaltxns">266,765</span><hr class="hr-space">
</div>
'''

soup = bs(html, 'html.parser')

print(soup.find(id='totaltxns').text)

Outputs:

266,765

If you look at the page source for the link you've mentioned, the value in totaltxns is -. That's why it's returning -.

The value might just be populated with some javascript code on the page.


UPDATE

urlopen().read() simply returns the initial page source received from the server without any further client-side changes.

You can achieve your desired output using Selenium + Chrome WebDriver. The idea is we let the javascript in page run and parse the final page source.

Try this:

from bs4 import BeautifulSoup as bs
from selenium.webdriver import Chrome # pip install selenium
from selenium.webdriver.chrome.options import Options

url='https://etherscan.io/token/0x629cdec6acc980ebeebea9e5003bcd44db9fc5ce'

# Make it headless, i.e. run in the background without opening a Chrome window
chrome_options = Options()
chrome_options.add_argument("--headless")

# use Chrome to get page with javascript generated content
with Chrome(executable_path="./chromedriver", options=chrome_options) as browser:
    browser.get(url)
    page_source = browser.page_source

#Parse the final page source
soup = bs(page_source, 'html.parser')

print(soup.find(id='totaltxns').text)

Outputs:

995,632

More info on setting up the WebDriver, with an example, can be found in another StackOverflow question.

Using the BeautifulSoup find method to obtain data from a table row

Making use of the suggestion from d2718nis, you can do it in this way. Of course, many other ways would work too.

First, find the link that has the 'Edinburgh St Leonards' text in it. Then find the grandparent of that link element, which is a tr element. Now identify the td elements in the tr. When you examine the table you see that the columns you want are the 4th and 7th. Get those from all of the td elements as the (0-relative) 3rd and 6th. Finally, display the crude texts of these elements.

You will need to do something clever to extract properly readable strings from these results.

>>> import requests
>>> import bs4
>>> page = requests.get('https://uk-air.defra.gov.uk/latest/currentlevels', headers={'User-Agent': 'Not blank'}).content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> Edinburgh_link = soup.find_all('a',string='Edinburgh St Leonards')[0]
>>> Edinburgh_link
<a href="../networks/site-info?site_id=ED3">Edinburgh St Leonards</a>
>>> Edinburgh_row = Edinburgh_link.findParent('td').findParent('tr')
>>> Edinburgh_columns = Edinburgh_row.findAll('td')
>>> Edinburgh_columns[3]
<td class="center"><span class="bg_low1 bold">20 (1 Low)</span></td>
>>> Edinburgh_columns[6]
<td>05/08/2017<br/>14:00:00</td>
>>> Edinburgh_columns[3].text
'20\xa0(1\xa0Low)'
>>> Edinburgh_columns[6].text
'05/08/201714:00:00'
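For instance, building on the two cells printed above, replacing the non-breaking spaces and passing a separator to get_text() yields readable strings (the cells are reconstructed here so the snippet is self-contained):

```python
from bs4 import BeautifulSoup

# Reconstruct the two cells from the session above to show the cleanup
cell3 = BeautifulSoup(
    '<td class="center"><span class="bg_low1 bold">20\xa0(1\xa0Low)</span></td>',
    'html.parser')
cell6 = BeautifulSoup('<td>05/08/2017<br/>14:00:00</td>', 'html.parser')

# Replace non-breaking spaces, and use a separator so the <br/> becomes a space
value = cell3.get_text(strip=True).replace('\xa0', ' ')
timestamp = cell6.get_text(separator=' ', strip=True)
print(value)      # -> 20 (1 Low)
print(timestamp)  # -> 05/08/2017 14:00:00
```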

How to understand recursive with BeautifulSoup in Python

HTML documents are nested: tags have tags inside of them.
In the document you've provided (s), the structure looks like:

div
    p
        strong
            `text node A`
        `text node B`

The recursive argument tells Beautiful Soup whether to check all descendants of a node for matches, or (when set to False) only its direct children.

  1. There is only one root node (div). Because you tell Beautiful Soup NOT to check recursively, it will not look at the div's children, so it returns None since there are no top-level 'p' elements.

  2. This is actually two instances of 'find' chained together. The first 'find' looks for a 'p' (and looks recursively, since the default for recursive is True). It finds the 'div > p' as we'd expect. You then call 'find' AGAIN on the result of the first find, and since you didn't specify a node type, it matches anything. The first child of the 'p' is the 'strong' tag, so that is what is returned.
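Both cases can be reproduced on a minimal document with that div > p > strong structure:

```python
from bs4 import BeautifulSoup

# A minimal document matching the structure described above
s = BeautifulSoup('<div><p><strong>text A</strong>text B</p></div>', 'html.parser')

# Case 1: only top-level children are searched; the sole top-level tag is
# <div>, so no <p> is found
print(s.find('p', recursive=False))   # -> None

# Case 2: the first find() descends recursively and locates <p>;
# the chained find() with no name returns the first child tag, <strong>
print(s.find('p').find())             # -> <strong>text A</strong>
```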

beautiful soup find function to scrape number off google

To obtain information from pages served by Google, you need to specify a User-Agent header.

For example:

import requests
from bs4 import BeautifulSoup

url ='https://www.google.com/search?hl=en&q=corona+virus+uk'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

soup = BeautifulSoup( requests.get(url, headers=headers).content, 'html.parser' )

table1 = soup.select_one('div:has(span:contains("United Kingdom")) + table')
table2 = soup.select_one('div:has(span:contains("Worldwide")) + table')

print('UK:')
print('-'*80)
for td in table1.select('td'):
    print(td.get_text(strip=True, separator=' '))

print()

print('World:')
print('-'*80)
for td in table2.select('td'):
    print(td.get_text(strip=True, separator=' '))

Prints:

UK:
--------------------------------------------------------------------------------
Confirmed 276K 4,258 + 1,570
Recovered -
Deaths 39,045 602 + 0

World:
--------------------------------------------------------------------------------
Confirmed 6.06M 860 + 123K
Recovered -
Deaths 371K 53 + 4,000

EDIT: Running the code as of 6th July 2020 prints:

UK:
--------------------------------------------------------------------------------
Confirmed 285K 4,398 + 624
Recovered -
Deaths 44,220 681 + 67

World:
--------------------------------------------------------------------------------
Confirmed 11.4M 1,621 + 203K
Recovered 6.16M 874
Deaths 534K 76 + 5,193

Why find function is not working in BeautifulSoup?

The select() function expects a CSS selector as its parameter, whereas the find() function expects tag names and/or attributes as parameters.

The docs say (regarding find()):

Signature: find(name, attrs, recursive, string, **kwargs)

So, there are three ways you can get the tag you want:

  1. soup.select('.a-size-large')[0].text.strip() or

    soup.select_one('.a-size-large').text.strip()

  2. soup.find('span', class_='a-size-large').text.strip() or

    soup.find('span', {'class': 'a-size-large'}).text.strip()

  3. soup.find(class_='a-size-large').text.strip() or

    soup.find(True, {'class': 'a-size-large'}).text.strip()

All give Alien 3 as the output.
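A quick sketch confirming the three approaches are interchangeable, using a made-up span standing in for the page in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the page in the question
html = '<span id="productTitle" class="a-size-large">  Alien 3  </span>'
soup = BeautifulSoup(html, 'html.parser')

# All three approaches resolve to the same tag and the same text
assert (soup.select_one('.a-size-large').text.strip()
        == soup.find('span', class_='a-size-large').text.strip()
        == soup.find(class_='a-size-large').text.strip())
print(soup.select_one('.a-size-large').text.strip())  # -> Alien 3
```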

How to find children of nodes using BeautifulSoup

Try this

li = soup.find('li', {'class': 'text'})
children = li.findChildren("a", recursive=False)
for child in children:
    print(child)
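To see what recursive=False changes, here is a sketch with hypothetical markup containing one direct a child and one nested deeper (find_all() is the modern name for findChildren()):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <a> as a direct child, one nested inside a <span>
html = '''
<li class="text">
    <a href="/direct">direct child</a>
    <span><a href="/nested">nested link</a></span>
</li>
'''
soup = BeautifulSoup(html, 'html.parser')
li = soup.find('li', {'class': 'text'})

# recursive=False limits the search to direct children, skipping the nested <a>
direct = li.findChildren('a', recursive=False)
print([a['href'] for a in direct])            # -> ['/direct']

# The default (recursive=True) searches all descendants and finds both
print([a['href'] for a in li.find_all('a')])  # -> ['/direct', '/nested']
```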

