How to find children of nodes using BeautifulSoup
Try this
li = soup.find('li', {'class': 'text'})
children = li.findChildren("a" , recursive=False)
for child in children:
    print(child)
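For a version that runs on its own, here is a minimal sketch. The HTML fragment is made up for illustration, and find_all(..., recursive=False) is used as the current spelling of findChildren:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = """
<ul>
  <li class="text"><a href="/one">one</a><span><a href="/nested">nested</a></span><a href="/two">two</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
li = soup.find("li", {"class": "text"})

# recursive=False restricts the search to direct children of <li>,
# so the <a> nested inside <span> is skipped.
direct_hrefs = [a["href"] for a in li.find_all("a", recursive=False)]

# The default (recursive=True) searches all descendants.
all_hrefs = [a["href"] for a in li.find_all("a")]

print(direct_hrefs)
print(all_hrefs)
```

If soup.find() might not match anything, check that li is not None before calling methods on it.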
How to get all direct children of a BeautifulSoup Tag?
You can set the recursive argument to False if you want to select only direct children.
An example with the html you provided:
from bs4 import BeautifulSoup
html = "<div class='body'><span>A</span><span><span>B</span></span><span>C</span></div>"
soup = BeautifulSoup(html, "lxml")
for j in soup.div.find_all(recursive=False):
    print(j)
Prints:
<span>A</span>
<span><span>B</span></span>
<span>C</span>
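It may help to see how this differs from iterating over .children, which also yields text nodes. A small sketch, using a slightly modified copy of the markup above (the stray " text " node is added here on purpose):

```python
from bs4 import BeautifulSoup, Tag

# Same structure as the example above, plus a loose text node between spans.
html = "<div class='body'><span>A</span> text <span><span>B</span></span><span>C</span></div>"
soup = BeautifulSoup(html, "html.parser")

# Only Tag objects that are direct children of the div.
tag_children = soup.div.find_all(recursive=False)

# Every direct child, including NavigableString text nodes.
all_children = list(soup.div.children)

# Filtering .children down to tags gives the same elements.
filtered = [child for child in all_children if isinstance(child, Tag)]

print(len(tag_children), len(all_children))
```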
beautifulsoup finding all children under certain child
You can use .find_all(recursive=False) with a list slice:
from bs4 import BeautifulSoup
html_doc = """
<ul>
<li class = "list_item_1">item 1</li>
<li class = "list_item_2">item 2</li>
<li class = "list_item_3">item 3</li>
<li class = "list_item_4">item 4</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.ul.find_all(recursive=False)[2:])
Prints:
[<li class="list_item_3">item 3</li>, <li class="list_item_4">item 4</li>]
Or, if you're open to using .select, you can use a CSS selector with the ~ sibling combinator:
print(soup.select(".list_item_2 ~ *"))
Prints:
[<li class="list_item_3">item 3</li>, <li class="list_item_4">item 4</li>]
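A third route, if you would rather stay with the find-style API, is find_next_siblings(). This is only a sketch on the same list; the variable names are mine:

```python
from bs4 import BeautifulSoup

html_doc = """
<ul>
<li class="list_item_1">item 1</li>
<li class="list_item_2">item 2</li>
<li class="list_item_3">item 3</li>
<li class="list_item_4">item 4</li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Locate the reference item, then collect the <li> siblings that follow it.
second = soup.find("li", class_="list_item_2")
later_items = second.find_next_siblings("li")
later_texts = [li.text for li in later_items]

# The CSS sibling selector should pick the same elements.
selected_texts = [li.text for li in soup.select(".list_item_2 ~ li")]

print(later_texts)
```

Unlike the slice, this does not depend on knowing the position of the reference item in advance.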
Beautiful Soup find children for particular div
It is useful to know that whatever elements BeautifulSoup finds within one element are themselves Tag objects, just like that parent element, so the same search methods can be called on them.
So this is somewhat working code for your example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html is the markup from the question
div_tags = soup.find_all("div", {"class": "tablebox"})
for div in div_tags:
    td_tags = div.find_all("td", {"class": "align-right"})
    for td in td_tags:
        print(td.text)
This will print the text of every td tag with the class "align-right" that sits inside a div with the class "tablebox".
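The same lookup can also be written as one CSS selector. Below is a sketch that compares both forms; the table markup is invented here, since the original page is not shown:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page from the question.
html = """
<div class="tablebox"><table>
<tr><td class="align-right">10</td><td>x</td></tr>
<tr><td class="align-right">20</td></tr>
</table></div>
<div class="other"><table>
<tr><td class="align-right">99</td></tr>
</table></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Nested search: first the divs, then the cells inside each div.
nested = [
    td.text
    for div in soup.find_all("div", class_="tablebox")
    for td in div.find_all("td", class_="align-right")
]

# Descendant selector: td.align-right anywhere below div.tablebox.
selected = [td.text for td in soup.select("div.tablebox td.align-right")]

print(nested, selected)
```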
BeautifulSoup: Extracting Value from Children nodes
You could try something like this. It basically does what you did above: it first iterates through all section-classed td's and then iterates through all span text within. This prints out the class, just in case you needed to be more restrictive:
In [1]: from bs4 import BeautifulSoup
In [2]: html = "..."  # your HTML here
In [3]: soup = BeautifulSoup(html, "html.parser")
In [4]: for td in soup.find_all('td', {'class': 'section'}):
   ...:     for span in td.find_all('span'):
   ...:         print(span.attrs['class'], span.text)
   ...:
['username'] xxUsername
['comment']
A test comment
Or with a more-convoluted-than-necessary one-liner that will store everything back in your list:
In [5]: results = [span.text for td in soup.find_all('td', {'class': 'section'}) for span in td.find_all('span')]
In [6]: results
Out[6]: ['xxUsername', '\nA test comment\n']
Or on that same theme, a dictionary with the keys being a tuple of the classes and the values being the text itself:
In [8]: results = dict((tuple(span.attrs['class']), span.text) for td in soup.find_all('td', {'class': 'section'}) for span in td.find_all('span'))
In [9]: results
Out[9]: {('username',): 'xxUsername', ('comment',): '\nA test comment\n'}
Assuming this one is a bit closer to what you want, I would suggest rewriting it as:
In [10]: results = {}
In [11]: for td in soup.find_all('td', {'class': 'section'}):
   ....:     for span in td.find_all('span'):
   ....:         results[tuple(span.attrs['class'])] = span.text
   ....:
In [12]: results
Out[12]: {('username',): 'xxUsername', ('comment',): '\nA test comment\n'}
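If the surrounding newlines in the comment text are unwanted, get_text(strip=True) can trim them. A runnable sketch follows; the HTML is my guess at the shape of the markup, not the asker's actual page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, reconstructed only from the output shown above.
html = (
    '<table><tr><td class="section">'
    '<span class="username">xxUsername</span>'
    '<span class="comment">\nA test comment\n</span>'
    '</td></tr></table>'
)

soup = BeautifulSoup(html, "html.parser")

# Key: tuple of the span's classes. Value: its text with whitespace trimmed.
results = {
    tuple(span["class"]): span.get_text(strip=True)
    for td in soup.find_all("td", {"class": "section"})
    for span in td.find_all("span")
}

print(results)
```

span["class"] raises KeyError for a span without a class attribute; span.get("class", []) is the forgiving variant.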
BeautifulSoup children of div
Use response.content instead of response.text.
You're also not requesting the correct URL in your code.
https://www.sailogy.com/en/search/?search_where=ibiza&trip_date=2020-06-06&weeks_count=1&skipper=False&search_src=home
only displays a single boat, hence your code only returns one row.
Use
https://www.sailogy.com/en/search/?search_where=ibiza&trip_date=2020-06-06&weeks_count=1&guests_count=&order_by=-rank&is_roundtrip=&coupon_code=&skipper=None
instead in this case.
You'll probably find it useful to adjust the URL parameters to filter boats at some point!
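On the first point: with requests, response.content is the raw bytes and response.text is a string already decoded with a guessed encoding. Passing bytes lets Beautiful Soup do its own encoding detection. The sketch below avoids the network entirely; raw_bytes is a made-up stand-in for response.content:

```python
from bs4 import BeautifulSoup

# Stand-in for response.content: UTF-8 bytes as they would arrive over the wire.
raw_bytes = (
    "<html><head><meta charset='utf-8'></head>"
    "<body><p class='boat'>Café del Mar</p></body></html>"
).encode("utf-8")

# Beautiful Soup accepts bytes and works out the encoding itself.
soup = BeautifulSoup(raw_bytes, "html.parser")
boat_name = soup.find("p", class_="boat").text

print(boat_name)
```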
Find elements which have a specific child with BeautifulSoup
There are multiple ways to approach the problem.
One option is to locate the Email div by text and get the next sibling (string= is the current name of the older text= argument):
print(soup.find("div", string="Email").next_sibling.strip())  # prints "info@blah.com"
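Here is a self-contained sketch of that approach with a None check added. The markup is hypothetical (the question's HTML is not reproduced here), and string= assumes a reasonably recent Beautiful Soup 4 release:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a label div followed by a bare text node.
html = '<div class="contact"><div>Email</div> info@blah.com <div>Phone</div> 555-0100</div>'

soup = BeautifulSoup(html, "html.parser")

# string="Email" matches a tag whose only content is exactly that text.
label = soup.find("div", string="Email")

email = None
if label is not None and label.next_sibling is not None:
    # The next sibling is a text node here, so str() and strip() are safe.
    email = str(label.next_sibling).strip()

print(email)
```

If whitespace-only nodes or other tags can sit between the label and the value, next_sibling may not be the node you expect, so inspect the real markup first.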
BeautifulSoup finding children with only 'dot', without 'find()' function
What you ask is well documented here: BS: navigating using tag names
The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head. You can use this trick again and again to zoom in on a certain part of the parse tree. soup.body.b gets the first <b> tag beneath the <body> tag. Using a tag name as an attribute will give you only the first tag by that name.
If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in Searching the tree, such as find_all()
(emphasis and omissions mine)
So page_soup.div finds the first div in the document, and page_soup.div.div finds the first div nested inside that first div (or None if that first div contains no div). It does not look at any later div.
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<div>first div</div>
<p>unrelated
</p>
<div>second div
<div>with another div inside</div>
</div>
<div>can't get this one by soup.div.div
<div>with another div inside</div>
</div>
</body>
</html>
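To check how dotted navigation behaves on markup like this, here is a short sketch on a trimmed-down copy of it. The select_one() call is my suggestion for reaching the first nested div wherever it sits:

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup above.
html = """
<body>
<div>first div</div>
<p>unrelated</p>
<div>second div
  <div>with another div inside</div>
</div>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# First <div> in document order.
first_div = soup.div

# First <div> inside that first <div>; it has none, so this is None.
nested_via_dots = soup.div.div

# First <div> anywhere that has a <div> ancestor.
first_nested = soup.select_one("div div")

print(first_div.text, nested_via_dots, first_nested.text)
```

Because each dotted step returns None when nothing matches, a longer chain such as soup.div.div.text raises AttributeError as soon as one step fails.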
BeautifulSoup: Classify parent and children element
Why use soup.find when you can use soup.select, get help from all the CSS wiz kids and test your criteria in a browser first?
There's a performance benchmark on SO and select is faster, or at least not significantly slower, so that's not it. Habit, I guess.
(works just as well without the <p> tag qualifier, i.e. just "[itemprop=name]")
found = soup.select("p[itemprop=name]")
results = dict()
for node in found:
    itemtype = node.parent.attrs.get("itemtype", "?")
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text
print(results)
output:
{'PostalAddress': '33 San Francisco', 'FoodEstablishment': "The Dormouse's story"}
It is what you asked for, but if many nodes existed with FoodEstablishment, the last one would win, because you are using a dictionary. A defaultdict with a list might work better; that is for you to judge.
step 1, before Python: rock that CSS!
And if you need to check higher-up ancestors for itemtype, it would help if you had html with that happening:
<div class="address" itemtype="http://schema.org/PostalAddress">
<div>
<p itemprop="name">33 San Francisco</p>
</div>
</div>
found = soup.select("[itemprop=name]")
results = dict()
for node in found:
    itemtype = None
    parent = node.parent
    # walk up the tree until an ancestor carries an itemtype attribute
    while itemtype is None and parent is not None:
        itemtype = parent.attrs.get("itemtype")
        if itemtype is None:
            parent = parent.parent
    itemtype = itemtype or "?"
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text
print(results)
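The manual walk up the tree can probably be shortened with find_parent(), which searches ancestors directly. This is a sketch of that alternative on the HTML fragment above, not the code this answer was tested with:

```python
from bs4 import BeautifulSoup

html = """
<div class="address" itemtype="http://schema.org/PostalAddress">
  <div>
    <p itemprop="name">33 San Francisco</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = {}
for node in soup.select("[itemprop=name]"):
    # Nearest ancestor that has an itemtype attribute, whatever its value.
    holder = node.find_parent(attrs={"itemtype": True})
    itemtype = holder["itemtype"] if holder is not None else "?"
    results[itemtype.split("/")[-1]] = node.text

print(results)
```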
same output.
using a defaultdict
everything stays the same except for declaring the results and putting data into it.
from collections import defaultdict
...
results = defaultdict(list)
...
    results[itemtype].append(node.text)
output (after I added a sibling to 33 San Francisco):
defaultdict(<class 'list'>, {'PostalAddress': ['33 San Francisco', '34 LA'], 'FoodEstablishment': ["The Dormouse's story"]})
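Put together as one runnable piece, the defaultdict variant might look like the sketch below. The HTML is invented to match the output shown, and only the direct parent is checked for itemtype:

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical markup, built only to reproduce the names in the output above.
html = """
<div itemtype="http://schema.org/FoodEstablishment">
  <p itemprop="name">The Dormouse's story</p>
  <div itemtype="http://schema.org/PostalAddress">
    <p itemprop="name">33 San Francisco</p>
    <p itemprop="name">34 LA</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each itemtype maps to a list, so repeated types accumulate
# instead of overwriting one another.
results = defaultdict(list)
for node in soup.select("[itemprop=name]"):
    itemtype = node.parent.attrs.get("itemtype", "?").split("/")[-1]
    results[itemtype].append(node.text)

print(dict(results))
```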