BeautifulSoup: Difference Between .find() and .select()

To summarise the comments:

  • select finds multiple instances and returns a list, whereas find returns only the first match, so they don't do the same thing. select_one is the equivalent of find.
  • I almost always use CSS selectors when chaining tags or using tag.classname; when looking for a single element without a class I use find. Essentially it comes down to the use case and personal preference.
  • As far as flexibility goes, I think you know the answer: soup.select("div[id=foo] > div > div > div[class=fee] > span > span > a") would look pretty ugly written as multiple chained find/find_all calls.
  • The only issue with the CSS selectors in bs4, at the time of writing, is the very limited support: nth-of-type is the only pseudo-class implemented, and chaining attributes like a[href][src] is not supported, nor are many other parts of CSS selectors. (Beautiful Soup 4.7 and later delegate select() to the Soup Sieve package, which lifts most of these limits.) Still, things like a[href*=..], a[href^=..], a[href$=..] etc. are, I think, much nicer than find("a", href=re.compile(....)), but again that is personal preference.
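
To make the points above concrete, here is a minimal sketch comparing the two styles side by side. The markup, the URLs and the variable names are made up for illustration; only the bs4 calls themselves come from the discussion.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = """
<div id="foo">
  <div class="fee"><span><a href="https://example.com/a.pdf">first</a></span></div>
  <div class="fee"><span><a href="/local/b.html">second</a></span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() and select_one() both return the first match (or None).
first_by_find = soup.find("a")
first_by_select = soup.select_one("a")

# find_all() and select() both return every match.
all_by_find = soup.find_all("a")
all_by_select = soup.select("a")

# One CSS path versus the same walk written with chained find/find_all calls.
css_links = soup.select("div#foo > div.fee > span > a")
chained_links = [
    a
    for div in soup.find("div", id="foo").find_all("div", class_="fee", recursive=False)
    for span in div.find_all("span", recursive=False)
    for a in span.find_all("a", recursive=False)
]

# Attribute suffix selector versus a regular expression on the same attribute.
pdf_by_css = soup.select('a[href$=".pdf"]')
pdf_by_regex = soup.find_all("a", href=re.compile(r"\.pdf$"))

print(first_by_find.text, [a.text for a in all_by_select])
```

Both styles reach the same elements here; the difference is mostly in how much code it takes to describe the path.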

For performance we can run some tests. I modified the code from another answer and ran it on 800+ HTML files; it is not exhaustive, but it should give a clue to the readability of some of the options and to their performance:

The modified functions are:

from bs4 import BeautifulSoup
from glob import iglob

def parse_find(soup):
    # Extraction written with find/find_all
    author = soup.find("h4", class_="h12 talk-link__speaker").text
    title = soup.find("h4", class_="h9 m5").text
    date = soup.find("span", class_="meta__val").text.strip()
    soup.find("footer", class_="footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":")
    soup.find_all("span", class_="talk-transcript__fragment")

def parse_select(soup):
    # Same extraction written with select_one/select
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    date = soup.select_one("span.meta__val").text.strip()
    soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text
    soup.select("span.talk-transcript__fragment")

def test(patt, func):
    # Parse every file matching the glob pattern and run func on the soup
    for html in iglob(patt):
        with open(html) as f:
            func(BeautifulSoup(f, "lxml"))

Now for the timings:

In [7]: from testing import test, parse_find, parse_select

In [8]: timeit test("./talks/*.html",parse_find)
1 loops, best of 3: 51.9 s per loop

In [9]: timeit test("./talks/*.html",parse_select)
1 loops, best of 3: 32.7 s per loop

Like I said, this is not exhaustive, but in this test the CSS selectors were clearly faster (32.7 s versus 51.9 s per run).
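
If you want to repeat the comparison without the 800+ talk files, a rough sketch along these lines should do. The synthetic document, the function names and the repeat count are my own assumptions, and the numbers it prints depend on your machine and bs4 version, so treat it as a harness rather than a result.

```python
import timeit
from bs4 import BeautifulSoup

# Synthetic document standing in for one talk page (not the original data).
html = "<html><body>" + "".join(
    f'<div class="talk"><h4 class="h9 m5">Title {i}</h4>'
    f'<span class="meta__val"> 2016-01-{i % 28 + 1:02d} </span></div>'
    for i in range(200)
) + "</body></html>"
soup = BeautifulSoup(html, "html.parser")

def with_find(soup):
    return [s.text.strip() for s in soup.find_all("span", class_="meta__val")]

def with_select(soup):
    return [s.text.strip() for s in soup.select("span.meta__val")]

# Both approaches must return the same data before timing means anything.
same_result = with_find(soup) == with_select(soup)

find_seconds = timeit.timeit(lambda: with_find(soup), number=20)
select_seconds = timeit.timeit(lambda: with_select(soup), number=20)
print(f"find_all: {find_seconds:.3f} s, select: {select_seconds:.3f} s")
```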

bs4 soup.select() vs. soup.find()

While .find() deals only with the first occurrence of an element, .select() / .find_all() will give you a ResultSet you can iterate over.

There are many ways to reach your goal, but the basic pattern is mostly the same: select the rows of the table and iterate over them.

In this first case I selected the table by its id and, close to your initial approach, the <tr> elements by their id as well, using a CSS selector with [id^="row"], which matches an id attribute whose value starts with row. In addition I used .stripped_strings to extract the text from the elements, stored it in a list and picked the values by index:

for row in soup.select('#countriesTable tr[id^="row"]'):
    row = list(row.stripped_strings)
    print(row[2], row[3])

or, more precisely, selecting all <tr> in the <tbody> of the tag with id countriesTable:

for row in soup.select('#countriesTable tbody tr'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
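
Here is a small self-contained version of both loops, so you can see what .stripped_strings yields. The table below is a guess at the page's shape (four cells per row, with name and code in positions 2 and 3), not a copy of the real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the real page may have different columns.
html = """
<table id="countriesTable">
  <thead><tr><th>#</th><th>Flag</th><th>Name</th><th>ISO 2</th></tr></thead>
  <tbody>
    <tr id="row1"><td>1</td><td>flag</td><td> Afghanistan </td><td>AF</td></tr>
    <tr id="row2"><td>2</td><td>flag</td><td> Albania </td><td>AL</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

def name_and_code(selector):
    """Return (name, code) pairs for every row matched by the selector."""
    pairs = []
    for row in soup.select(selector):
        # stripped_strings skips whitespace-only text and trims the rest.
        cells = list(row.stripped_strings)
        pairs.append((cells[2], cells[3]))
    return pairs

by_id_prefix = name_and_code('#countriesTable tr[id^="row"]')
by_tbody = name_and_code("#countriesTable tbody tr")
print(by_id_prefix)
```

Note that picking by index only works while every row has the same number of non-empty cells; an empty cell shifts the positions.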

...


An alternative, and in my opinion the best, way to scrape tables is pandas.read_html(), which works with beautifulsoup under the hood and does most of the work for you:

import pandas as pd
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,:]

or to get only the two specific columns:

pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,[1,2]]
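
To try the same chain offline, you can feed read_html a string instead of the URL. This is only a sketch: the markup is invented, the trailing row stands in for whatever the last row on the real page is, and it assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO
import pandas as pd

# Hypothetical page with two tables; only the second has the id we want.
html = """
<table id="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table id="countriesTable">
  <thead><tr><th>Flag</th><th>Name</th><th>ISO 2</th></tr></thead>
  <tbody>
    <tr><td></td><td>Afghanistan</td><td>AF</td></tr>
    <tr><td></td><td>Albania</td><td>AL</td></tr>
    <tr><td></td><td>Footer row</td><td>--</td></tr>
  </tbody>
</table>
"""

# attrs narrows the result to tables whose attributes match.
tables = pd.read_html(StringIO(html), attrs={"id": "countriesTable"})

# Drop columns that are entirely empty, then drop the last row.
df = tables[0].dropna(axis=1, how="all").iloc[:-1, :]
print(df)
```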

   Name            ISO 2
0  Afghanistan     AF
1  Åland Islands   AX
2  Albania         AL
3  Algeria         DZ
4  American Samoa  AS
5  Andorra         AD

How to know difference between select and find

The page is loaded dynamically. There is nothing in the static html. You have to fetch the data from the api.

It's up to you to figure out the parameters you need by looking in the Dev Tools:

import requests
import pandas as pd

url = 'https://api.orderappetit.com/api/customers/stores'
payload = {
    'page': '1',
    'limit': '12',
    'location': '609b7a404575ff2493d02d54',
    'isRandom': '1',
    'type': ''
}

jsonData = requests.get(url, params=payload).json()
stores = jsonData['stores']

df = pd.DataFrame(stores)

Output:

print(df['name'])
0 Orazio's by Zarcone
1 Carbone's Pizza | South Park
2 Colter Bay
3 Saigon Bangkok
4 Marco’s Italian Restaurant
5 Lakeshore Cafe
6 Cookies and Cream | Seneca
7 Deep South Taco | Hertel
8 Macy's Place Pizzeria
9 Wise Guys Pizza
10 Tappo Pizza
11 Aguacates Mex Bar & Grill
Name: name, dtype: object

Or to search for a specific food and choose delivery/pickup:

search = 'pizza'

url = 'https://api.orderappetit.com/api/customers/stores/search'
payload = {
    'generic': search,
    'type': 'delivery',
    'page': '1',
    'limit': '8',
    'location': '609b7a404575ff2493d02d54'
}

jsonData = requests.get(url, params=payload).json()
stores = jsonData['stores']

df = pd.json_normalize(jsonData, record_path=['stores'])

Output:

print(df['store.name'])
0 Macy's Place Pizzeria
1 Prohibition 2020
2 Trattoria Aroma Williamsville
3 Sidelines Sports Bar and Grill
4 Marco’s Italian Restaurant
5 Orazio's by Zarcone
6 Butera's Craft Beer & Craft Pizza
7 Allentown pizza
Name: store.name, dtype: object
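
The column is called name in the first case and store.name in the second because the search endpoint apparently nests each record one level deeper. A rough sketch with made-up payloads (the field layout is my assumption from the column names above, and distance is a purely hypothetical field) shows how pd.json_normalize flattens that:

```python
import pandas as pd

# Hypothetical payloads mimicking the two response shapes; real fields may differ.
flat_response = {"stores": [{"name": "Tappo Pizza"}, {"name": "Colter Bay"}]}
nested_response = {
    "stores": [
        {"store": {"name": "Tappo Pizza"}, "distance": 1.2},
        {"store": {"name": "Allentown pizza"}, "distance": 3.4},
    ]
}

# Flat records map straight onto columns.
flat_df = pd.DataFrame(flat_response["stores"])

# Nested dicts are flattened into dotted column names such as "store.name".
nested_df = pd.json_normalize(nested_response, record_path=["stores"])

print(flat_df["name"].tolist())
print(nested_df["store.name"].tolist())
```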

Beautifulsoup select an element based on the innerHTML with Python

It's working

import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions"
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

title = [x.get_text(strip=True) for x in soup.select('[class="s-post-summary--content-title"] > a')]
print(title)
votes = [x.get_text(strip=True) for x in soup.select('div[class="s-post-summary--stats-item s-post-summary--stats-item__emphasized"] > span:nth-child(1)')]
print(votes)

Output:

['React Native - expo/vector-icons typescript type definition for icon name', 'React 25+5 Clock is working but fails all tests', 'Add weekly tasks, monthly tasks in google spreadsheet', 'Count number of change in values in Pandas column', "React-Select: How do I update the selected option dropdown's defaultValue on selected value onChange?", 'Block execution over a variable (TTS Use-Case), other than log statements (spooky)', "'npm install firebase' hangs in wsl. runs fine in windows", 'Kubernetes Dns service sometimes not working', 'Neo4j similarity of single node with entire graph', 'What is this error message? ORA-00932: inconsistent datatypes: expected DATE got NUMBER', 'Why getChildrenQueryBuilder of NestedTreeRepository say Too few parameters: the query defines 2 parameters but you only bound 0', 'Is is a security issue that Paypal uses dynamic certificate to verify webhook notification?', 'MessageBox to autoclose after a function done', 'Can someone clearly explain how this function is working?', 'Free open-sourced tools for obfuscating iOS app?', "GitHub page is not showing background image, FF console shows couldn't load images", 'Is possible to build a MLP model with the tidymodels framework?', 'How do I embed an interactive Tableau visual into an R Markdown script/notebook on Kaggle?', 'Dimensionality reduction methods for data including categorical variables', 'Reaching localhost api from hosted static site', 'Finding the zeros of a two term exponential function with python', 'optimizing synapse delta lake table not reducing the number of files', '(GAS) Email Spreadsheet range based on date input in the cell', 'EXCEL Formula to find and copy cell based on criteria', 'how to write function reduce_dimensionality?', 'Semi-Radial type Volume Slider in WPF C#', 'tippy.js tool tips stop working after "window.reload()"', 'is there some slice indices must be integers on FFT opencv python? because i think my coding is correct', 'NoParameterFoundException', 'How to get the two Input control elements look exactly same in terms of background and border?', 'My code is wrong because it requests more data than necessary, how can i solve it?', 'Express Session Not Saving', 'Which value should I search for when changing the date by dragging in FullCalendar?', 'Non-constant expression specified where only constant expressions are allowed', 'Cocoapods not updating even after latest version is installed', 'Ruby "Each with Index" starting at 1?', 'Converting images to Pytorch tensors loses label data', 'itemview in Adapter for recyclerview not getting id from xml', 'Use Margin Auto & Flex to Align Text', '(C++) URLDownloadToFile Function corrupting downloaded EXE', 'Search plugin for Woocommerce website (Free)', 'Create new folder when save image in Python Plotly', "What's the difference between avfilter_graph_parse_ptr() and avfilter_link()?", 'Inputs to toString (java) on a resultset from MySQL', 'Which language i learn in This time for better future? python or javaScript?', 'Hi everyone. I want to write a function in python for attached data frame. I can not figure out how can I do it', 'is there a way in R to mutate a cumulative subtraction to update the same mutated var?', 'making a simple reccommendation system in JavaScript', 'Amchart4 cursor does not match mouse position in screen with zoom', 'Bash curl command works in terminal, but not with Python os.system()']
['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '-2', '0', '1', '0', '0', '0']
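
One thing worth knowing about the selectors used above: [class="..."] compares the whole attribute string, so it only matches when the classes appear in exactly that order, while the dotted form .a.b matches the classes in any order. A minimal sketch on made-up markup (the two divs are my own, only the class names are borrowed from the answer):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same two classes, written in opposite order.
html = """
<div class="s-post-summary--stats-item s-post-summary--stats-item__emphasized">
  <span>3</span><span>votes</span>
</div>
<div class="s-post-summary--stats-item__emphasized s-post-summary--stats-item">
  <span>7</span><span>votes</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact attribute match: only the div whose class string is identical.
exact = soup.select(
    'div[class="s-post-summary--stats-item s-post-summary--stats-item__emphasized"]'
    " > span:nth-child(1)"
)

# Class selectors: any div carrying both classes, regardless of order.
by_class = soup.select(
    "div.s-post-summary--stats-item.s-post-summary--stats-item__emphasized"
    " > span:nth-child(1)"
)

print([s.get_text(strip=True) for s in exact])
print([s.get_text(strip=True) for s in by_class])
```

The exact form is stricter, which can be what you want, but it breaks as soon as the site adds or reorders a class.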

How to check if a soup contains an element?

You can try select_one instead of find. Something like this.

soup.select_one('details[data-level="2"] summary.section-heading h2#English')

The result will be

<h2 id="English">English</h2>
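
Since select_one returns None when nothing matches, the existence check itself can be a one-liner. A small sketch, with markup reconstructed from the selector above and a helper name (contains) that is mine:

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to fit the selector from the answer.
html = """
<details data-level="2">
  <summary class="section-heading"><h2 id="English">English</h2></summary>
</details>
"""
soup = BeautifulSoup(html, "html.parser")

def contains(soup, selector):
    """True if at least one element matches the CSS selector."""
    return soup.select_one(selector) is not None

has_english = contains(soup, 'details[data-level="2"] summary.section-heading h2#English')
has_french = contains(soup, 'details[data-level="2"] summary.section-heading h2#French')
print(has_english, has_french)
```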

Beautiful Soup select google image returns empty list

Unfortunately, the problem is not that you're using BeautifulSoup wrong. The webpage that you're requesting appears to be missing its content! I saved html.text to a file for inspection:

[Image: screenshot of the response HTML]

Why does this happen? Because the webpage actually loads its content using JavaScript. When you open the site in your browser, the browser executes the JavaScript, which adds all of the artist squares to the webpage. (You may even notice the brief moment during which the squares aren't there when you first load the site.) On the other hand, requests does NOT execute JavaScript—it just downloads the contents of the webpage and saves them to a string.
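
You can reproduce the effect without any network access. In this sketch the page and its script are invented; the point is only that BeautifulSoup sees the script as text and never runs it, so the elements the script would create are not in the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the container is empty until the script runs in a browser.
static_html = """
<div id="artists"></div>
<script>
  document.getElementById("artists").innerHTML = '<div class="artist">Name</div>';
</script>
"""
soup = BeautifulSoup(static_html, "html.parser")

# Nothing matches, because the script was never executed.
artists = soup.select("div.artist")

# The markup only exists as plain text inside the <script> element.
script_text = soup.find("script").string

print(artists)
print("artist" in script_text)
```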

What can you do about it? Unfortunately, this means that scraping the website will be really tough. In such cases, I would suggest looking for an alternative source of information or using an API provided by the website.


