Understanding the find() function in Beautiful Soup
soup.find("div", {"class":"real number"})['data-value']
Here you are searching for a div element, but in your example HTML it is the span that has the "real number" class. Try instead:
soup.find("span", {"class": "real number", "data-value": True})['data-value']
Here we are also checking for the presence of the data-value attribute.
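A minimal, self-contained sketch of the attribute-presence check (the sample HTML here is invented, since the original markup isn't shown); passing True as an attribute value matches any tag that has the attribute at all:

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for the HTML in the question.
html = '''
<span class="real number">no data-value here</span>
<span class="real number" data-value="42">42</span>
'''
soup = BeautifulSoup(html, "html.parser")

# "data-value": True matches only tags where the attribute is present,
# so the first span (which lacks it) is skipped.
value = soup.find("span", {"class": "real number", "data-value": True})["data-value"]
print(value)  # 42
```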
To find elements having "real number" or "fake number" classes, you can make a CSS selector:
for elm in soup.select(".real.number,.fake.number"):
print(elm.get("data-value"))
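For example, with invented sample markup the selector matches both class combinations and nothing else:

```python
from bs4 import BeautifulSoup

# Invented sample markup with both class combinations plus a decoy.
html = '''
<span class="real number" data-value="1">1</span>
<span class="fake number" data-value="2">2</span>
<span class="other" data-value="3">3</span>
'''
soup = BeautifulSoup(html, "html.parser")

for elm in soup.select(".real.number, .fake.number"):
    print(elm.get("data-value"))  # prints 1, then 2
```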
To get the 69% value:
soup.find("div", {"class": "percentage good"}).get_text(strip=True)
Or, a CSS selector:
soup.select_one(".percentage.good").get_text(strip=True)
soup.select_one(".score .percentage").get_text(strip=True)
Or, locate the h6 element whose text is "Audit score" and then get its preceding tag sibling:
soup.find("h6", string="Audit score").find_previous_sibling().get_text(strip=True)
Beautiful Soup find() function returning none even though element exists and find() works for other elements on the page?
Here's the code I used to scrape ALL 100 songs and their authors. This website is really horrible to scrape because it doesn't use ids or classes in a scrapable manner, so instead I relied (mostly) on the current structure of the page.
I'm not sure what exactly was causing your problem. The page was made with a framework, so it is littered with styling classes. Your selection was fickle because it relied on those being consistent. Perhaps the first element was styled differently (in fact, this is almost certainly the case: notice how the cover image is bigger on the actual page).
from bs4 import BeautifulSoup
import requests
x = requests.get("https://www.billboard.com/charts/hot-100/").text
soup = BeautifulSoup(x, "html.parser")
# chart = soup.find("div", class_="lxml")  # unused; note "lxml" is a parser name, not a CSS class
#div.chart-results-list > div.o-chart-results-list-row-container > ul.o-chart-results-list-row"
songNames = [x.text for x in soup.select("div.chart-results-list > div.o-chart-results-list-row-container > ul.o-chart-results-list-row > li:nth-child(4) > ul > li:nth-child(1) h3")]
authorNames = [x.text for x in soup.select("div.chart-results-list > div.o-chart-results-list-row-container > ul.o-chart-results-list-row > li:nth-child(4) > ul > li:nth-child(1) span")]
print(songNames)
#print(authorNames)
print(len(songNames))
beautifulsoup find function returns "-" when retrieving text
To find an element by id you can use soup.find(id='your_id').
Try this:
from bs4 import BeautifulSoup as bs
html = '''
<div class="row align-items-center">
<div class="col-md-4 mb-1 mb-md-0">Transfers:</div>
<div class="col-md-8"></div>
<span id="totaltxns">266,765</span><hr class="hr-space">
</div>
'''
soup = bs(html, 'html.parser')
print(soup.find(id='totaltxns').text)
Outputs:
266,765
If you look at the page source for the link you've mentioned, the value in totaltxns is "-". That's why it's returning "-".
The value is likely populated by some JavaScript code running on the page.
UPDATE
urlopen().read() simply returns the initial page source received from the server, without any further client-side changes.
You can achieve your desired output using Selenium + Chrome WebDriver. The idea is to let the JavaScript on the page run and then parse the final page source.
Try this:
from bs4 import BeautifulSoup as bs
from selenium.webdriver import Chrome # pip install selenium
from selenium.webdriver.chrome.options import Options
url='https://etherscan.io/token/0x629cdec6acc980ebeebea9e5003bcd44db9fc5ce'
# Make it headless, i.e. run in the background without opening a Chrome window
chrome_options = Options()
chrome_options.add_argument("--headless")
# use Chrome to get page with javascript generated content
# Note: Selenium 4.10+ removed executable_path; newer code passes service=Service("./chromedriver")
with Chrome(executable_path="./chromedriver", options=chrome_options) as browser:
browser.get(url)
page_source = browser.page_source
#Parse the final page source
soup = bs(page_source, 'html.parser')
print(soup.find(id='totaltxns').text)
Outputs:
995,632
More info on setting up webdriver + example is in another StackOverflow question here.
Using the BeautifulSoup find method to obtain data from a table row
Making use of the suggestion from d2718nis, you can do it in this way. Of course, many other ways would work too.
First, find the link that has the 'Edinburgh St Leonards' text in it. Then find the grandparent of that link element, which is a tr element. Now identify the td elements in that tr. When you examine the table you see that the columns you want are the 4th and 7th, so get those from the list of td elements as the (0-relative) 3rd and 6th. Finally, display the crude texts of these elements.
You will need to do something clever to extract properly readable strings from these results.
>>> import requests
>>> import bs4
>>> page = requests.get('https://uk-air.defra.gov.uk/latest/currentlevels', headers={'User-Agent': 'Not blank'}).content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> Edinburgh_link = soup.find_all('a',string='Edinburgh St Leonards')[0]
>>> Edinburgh_link
<a href="../networks/site-info?site_id=ED3">Edinburgh St Leonards</a>
>>> Edinburgh_row = Edinburgh_link.findParent('td').findParent('tr')
>>> Edinburgh_columns = Edinburgh_row.findAll('td')
>>> Edinburgh_columns[3]
<td class="center"><span class="bg_low1 bold">20 (1 Low)</span></td>
>>> Edinburgh_columns[6]
<td>05/08/2017<br/>14:00:00</td>
>>> Edinburgh_columns[3].text
'20\xa0(1\xa0Low)'
>>> Edinburgh_columns[6].text
'05/08/201714:00:00'
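The "something clever" can be as simple as a get_text() separator plus replacing the non-breaking spaces; a sketch using the cells from the transcript above:

```python
from bs4 import BeautifulSoup

# The date/time cell from the transcript above.
td = BeautifulSoup('<td>05/08/2017<br/>14:00:00</td>', 'html.parser').td

# A separator keeps the date and time from running together...
print(td.get_text(' ', strip=True))  # 05/08/2017 14:00:00

# ...and replace() turns the non-breaking spaces (\xa0) into ordinary ones.
print('20\xa0(1\xa0Low)'.replace('\xa0', ' '))  # 20 (1 Low)
```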
How to understand recursive with BeautifulSoup in Python
HTML documents are nested: tags have tags inside of them.
In the document you've provided ('s'), the structure looks like:
div
  p
    strong
      `text node A`
    `text node B`
recursive instructs Beautiful Soup to check the children of a particular node for matches (or not to, if set to False).
There is only one root node (div). Because you tell Beautiful Soup NOT to search recursively, it will not look at the div's children, so it returns None, since there are no top-level 'p' elements.
This is actually two instances of find() chained together. The first find() looks for a 'p' (and looks recursively, since the default for recursive is True). It finds the 'div > p' as we'd expect. After this, you've called find() AGAIN on the result of the first find, which then matches any tag, since you didn't specify the node type you're looking for. The first child of the 'p' is the 'strong' tag, so that is what is returned.
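A sketch of both behaviours, using a small document shaped like the one described above:

```python
from bs4 import BeautifulSoup

# A document matching the structure in the question: div > p > strong.
s = "<div><p><strong>text node A</strong>text node B</p></div>"
soup = BeautifulSoup(s, "html.parser")

# recursive=False looks only at the soup's direct children (just the div),
# so no <p> is found.
print(soup.find("p", recursive=False))  # None

# The default recursive=True searches all descendants.
p = soup.find("p")

# find() with no arguments matches the first child tag: the <strong>.
print(p.find().name)  # strong
```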
beautiful soup find function to scrape number off google
To obtain information from pages served by Google, you need to specify a User-Agent header.
For example:
import requests
from bs4 import BeautifulSoup
url ='https://www.google.com/search?hl=en&q=corona+virus+uk'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
soup = BeautifulSoup( requests.get(url, headers=headers).content, 'html.parser' )
# Note: newer soupsieve versions require :-soup-contains() in place of the deprecated :contains()
table1 = soup.select_one('div:has(span:contains("United Kingdom")) + table')
table2 = soup.select_one('div:has(span:contains("Worldwide")) + table')
print('UK:')
print('-'*80)
for td in table1.select('td'):
print(td.get_text(strip=True, separator=' '))
print()
print('World:')
print('-'*80)
for td in table2.select('td'):
print(td.get_text(strip=True, separator=' '))
Prints:
UK:
--------------------------------------------------------------------------------
Confirmed 276K 4,258 + 1,570
Recovered -
Deaths 39,045 602 + 0
World:
--------------------------------------------------------------------------------
Confirmed 6.06M 860 + 123K
Recovered -
Deaths 371K 53 + 4,000
EDIT: Running the code as of 6th July 2020 prints:
UK:
--------------------------------------------------------------------------------
Confirmed 285K 4,398 + 624
Recovered -
Deaths 44,220 681 + 67
World:
--------------------------------------------------------------------------------
Confirmed 11.4M 1,621 + 203K
Recovered 6.16M 874
Deaths 534K 76 + 5,193
Why find function is not working in BeautifulSoup?
The select() function expects a CSS selector as its parameter, whereas the find() function expects tag names and/or attributes as parameters.
The docs say (regarding find()):
Signature: find(name, attrs, recursive, string, **kwargs)
So, there are three ways you can get the tag you want:
soup.select('.a-size-large')[0].text.strip()
or
soup.select_one('.a-size-large').text.strip()

soup.find('span', class_='a-size-large').text.strip()
or
soup.find('span', {'class': 'a-size-large'}).text.strip()

soup.find(class_='a-size-large').text.strip()
or
soup.find(True, {'class': 'a-size-large'}).text.strip()
All give Alien 3 as the output.
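A self-contained check that the variants agree, using an invented stand-in for the page in the question (which isn't shown):

```python
from bs4 import BeautifulSoup

# Invented stand-in snippet for the page in the question.
html = '<div><span class="a-size-large"> Alien 3 </span></div>'
soup = BeautifulSoup(html, "html.parser")

results = {
    soup.select('.a-size-large')[0].text.strip(),           # CSS selector, first match
    soup.select_one('.a-size-large').text.strip(),          # CSS selector, single match
    soup.find('span', class_='a-size-large').text.strip(),  # tag name + keyword argument
    soup.find(class_='a-size-large').text.strip(),          # attribute only
}
print(results)  # {'Alien 3'}
```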
How to find children of nodes using BeautifulSoup
Try this
li = soup.find('li', {'class': 'text'})
children = li.findChildren("a" , recursive=False)
for child in children:
print(child)
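A runnable sketch (with invented markup) showing why recursive=False matters here: it keeps only the direct a children of the li and skips links nested deeper:

```python
from bs4 import BeautifulSoup

# Invented markup: direct <a> children plus one nested deeper in a <span>.
html = '''
<li class="text">
  <a href="/one">one</a>
  <a href="/two">two</a>
  <span><a href="/nested">nested</a></span>
</li>
'''
soup = BeautifulSoup(html, "html.parser")
li = soup.find('li', {'class': 'text'})

# findChildren("a", recursive=False) returns only direct <a> children,
# so the nested link inside the <span> is excluded.
children = li.findChildren("a", recursive=False)
print([a["href"] for a in children])  # ['/one', '/two']
```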