How to Find Children of Nodes Using BeautifulSoup

How to find children of nodes using BeautifulSoup

Try this

li = soup.find('li', {'class': 'text'})
children = li.findChildren("a", recursive=False)
for child in children:
    print(child)
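
For context, here is a self-contained sketch of the same idea; the HTML is invented just so the snippet runs, and findChildren is simply an older alias for find_all:

from bs4 import BeautifulSoup

html = """
<ul>
  <li class="text">
    <a href="/a">direct link</a>
    <span><a href="/b">nested link</a></span>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

li = soup.find('li', {'class': 'text'})
# recursive=False keeps only <a> tags that are direct children of the <li>
for child in li.findChildren("a", recursive=False):
    print(child)   # only <a href="/a">direct link</a>; the nested link is skipped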

How to get all direct children of a BeautifulSoup Tag?

You can set the recursive argument to False if you want to select only direct children.

An example with the HTML you provided:

from bs4 import BeautifulSoup

html = "<div class='body'><span>A</span><span><span>B</span></span><span>C</span></div>"
soup = BeautifulSoup(html, "lxml")
for j in soup.div.find_all(recursive=False):
    print(j)

<span>A</span>
<span><span>B</span></span>
<span>C</span>
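
For comparison, leaving recursive at its default (True) descends into nested tags as well, so the inner <span>B</span> shows up as an extra result:

for j in soup.div.find_all():
    print(j)

<span>A</span>
<span><span>B</span></span>
<span>B</span>
<span>C</span>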

beautifulsoup finding all children under certain child

You can use .find_all(recursive=False) with a list slice:

from bs4 import BeautifulSoup

html_doc = """
<ul>
<li class = "list_item_1">item 1</li>
<li class = "list_item_2">item 2</li>
<li class = "list_item_3">item 3</li>
<li class = "list_item_4">item 4</li>

</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.ul.find_all(recursive=False)[2:])

Prints:

[<li class="list_item_3">item 3</li>, <li class="list_item_4">item 4</li>]

Or if you're open to using .select, you can use a CSS selector with ~ (the general sibling combinator):

print(soup.select(".list_item_2 ~ *"))

Prints:

[<li class="list_item_3">item 3</li>, <li class="list_item_4">item 4</li>]
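
The * matches following siblings of any tag type; if you want to be explicit that only <li> siblings count, you can narrow the selector (a slightly stricter sketch, same output here):

print(soup.select("li.list_item_2 ~ li"))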

Beautiful Soup find children for particular div

It is useful to know that the elements BeautifulSoup finds inside another element are Tag objects just like their parent, so the same search methods (find, find_all, and so on) can be called on them.

So this is working code for your example, updated for Python 3:

soup = BeautifulSoup(html, "html.parser")
divTags = soup.find_all("div", {"class": "tablebox"})

for div in divTags:
    tdTags = div.find_all("td", {"class": "align-right"})
    for td in tdTags:
        print(td.text)

This will print all the text of all the td tags with the class of "align-right" that have a parent div with the class of "tablebox".
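
A runnable sketch with made-up HTML, just to show how the two find_all calls nest:

from bs4 import BeautifulSoup

html = """
<div class="tablebox">
  <table><tr>
    <td class="align-right">42</td>
    <td class="align-left">ignored</td>
  </tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

for div in soup.find_all("div", {"class": "tablebox"}):
    for td in div.find_all("td", {"class": "align-right"}):
        print(td.text)   # -> 42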

BeautifulSoup: Extracting Value from Children nodes

You could try something like this. It basically does what you did above: first iterate through all the tds with class "section", then through every span inside them. It also prints out each span's class, just in case you need to be more restrictive:

In [1]: from bs4 import BeautifulSoup

In [2]: html = # Your html here

In [3]: soup = BeautifulSoup(html)

In [4]: for td in soup.find_all('td', {'class': 'section'}):
   ...:     for span in td.find_all('span'):
   ...:         print(span.attrs['class'], span.text)
   ...:
['username'] xxUsername
['comment']
A test comment

Or with a more-convoluted-than-necessary one-liner that will store everything back in your list:

In [5]: results = [span.text for td in soup.find_all('td', {'class': 'section'}) for span in td.find_all('span')]

In [6]: results
Out[6]: ['xxUsername', '\nA test comment\n']

Or on that same theme, a dictionary with the keys being a tuple of the classes and the values being the text itself:

In [8]: results = dict((tuple(span.attrs['class']), span.text) for td in soup.find_all('td', {'class': 'section'}) for span in td.find_all('span'))

In [9]: results
Out[9]: {('username',): 'xxUsername', ('comment',): '\nA test comment\n'}

Assuming this one is a bit closer to what you want, I would suggest rewriting it as:

In [10]: results = {}

In [11]: for td in soup.find_all('td', {'class': 'section'}):
   ....:     for span in td.find_all('span'):
   ....:         results[tuple(span.attrs['class'])] = span.text
   ....:

In [12]: results
Out[12]: {('username',): 'xxUsername', ('comment',): '\nA test comment\n'}
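
For what it's worth, the one-liner from In [8] also reads a little better as a dict comprehension (a sketch of the same idea, with the same last-one-wins behaviour):

results = {tuple(span.attrs['class']): span.text
           for td in soup.find_all('td', {'class': 'section'})
           for span in td.find_all('span')}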

BeautifulSoup children of div

Use response.content instead of response.text.

You're also not requesting the correct URL in your code. https://www.sailogy.com/en/search/?search_where=ibiza&trip_date=2020-06-06&weeks_count=1&skipper=False&search_src=home only displays a single boat, hence your code is only returning one row.

Use https://www.sailogy.com/en/search/?search_where=ibiza&trip_date=2020-06-06&weeks_count=1&guests_count=&order_by=-rank&is_roundtrip=&coupon_code=&skipper=None instead in this case

You'll probably find it useful to adjust the URL parameters to filter boats at some point!
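
Put together, the fix looks roughly like this; the class name used to pick out the result rows is a placeholder, since the original selector isn't shown, so inspect the page for the real one:

import requests
from bs4 import BeautifulSoup

url = ("https://www.sailogy.com/en/search/?search_where=ibiza&trip_date=2020-06-06"
       "&weeks_count=1&guests_count=&order_by=-rank&is_roundtrip=&coupon_code=&skipper=None")

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")  # .content rather than .text

# "search-result" is a made-up class name -- replace it with the real one from the page
for row in soup.find_all("div", {"class": "search-result"}):
    print(row.text.strip())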

Find elements which have a specific child with BeautifulSoup

There are multiple ways to approach the problem.

One option is to locate the Email div by text and get the next sibling:

soup.find("div", text="Email").next_sibling.strip()  # prints "info@blah.com"

BeautifulSoup finding children with only 'dot', without 'find()' function

What you ask is well documented here: BS: navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head.

You can use this trick again and again to zoom in on a certain part of the parse tree. soup.body.b gets the first <b> tag beneath the <body> tag.

Using a tag name as an attribute will give you only the first tag by that name.

If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all()

(emphasis and omissions mine)

So page_soup.div finds the first div in the document, and page_soup.div.div finds the first div nested inside that first div.

<html>

<head>
  <title>The Dormouse's story</title>
</head>

<body>
  <div>first div</div>
  <p>unrelated</p>

  <div>second div
    <div>with another div inside</div>
  </div>

  <div>can't get this one by soup.div.div
    <div>with another div inside</div>
  </div>
</body>
</html>
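
A quick sketch of the dotted navigation on a small made-up fragment, separate from the page above:

from bs4 import BeautifulSoup

html = "<div><p>outer</p><div><p>inner</p></div></div>"
soup = BeautifulSoup(html, "html.parser")

print(soup.div)        # the whole outer <div>
print(soup.div.div)    # <div><p>inner</p></div> -- the first <div> inside the first <div>
print(soup.div.div.p)  # <p>inner</p>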

BeautifulSoup: Classify parent and children element

Why use soup.find when you can use soup.select, get help from all the CSS wiz kids and test your criteria in a browser first?

There's a performance benchmark on SO and select is faster, or at least not significantly slower, so that's not it. Habit, I guess.

(works just as well without the <p> tag qualifier, i.e. just "[itemprop=name]")

found = soup.select("p[itemprop=name]")

results = dict()

for node in found:
    itemtype = node.parent.attrs.get("itemtype", "?")
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text

print(results)

output:

{'PostalAddress': '33 San Francisco', 'FoodEstablishment': "The Dormouse's story"}

This is what you asked for, but if many nodes existed with FoodEstablishment, the last one would win, because you are using a dictionary. A defaultdict with a list might work better; that's for you to judge.

step 1, before Python: rock that CSS!


And if you need to check ancestors higher up for itemtype:

It would help if your question had HTML where that happens, for example:

    <div class="address" itemtype="http://schema.org/PostalAddress">
<div>
<p itemprop="name">33 San Francisco</p>
</div>

</div>
found = soup.select("[itemprop=name]")

results = dict()

for node in found:
    itemtype = None
    parent = node.parent
    while itemtype is None and parent is not None:
        itemtype = parent.attrs.get("itemtype")
        if itemtype is None:
            parent = parent.parent

    itemtype = itemtype or "?"
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text

print(results)

same output.

using a defaultdict

everything stays the same except for declaring the results and putting data into it.

from collections import defaultdict
...
results = defaultdict(list)
...

    results[itemtype].append(node.text)

output (after I added a sibling to 33 San Francisco):

defaultdict(<class 'list'>, {'PostalAddress': ['33 San Francisco', '34 LA'], 'FoodEstablishment': ["The Dormouse's story"]})
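
To run the whole thing end to end, you can build the soup from sample HTML first. A minimal sketch; the fragment below is invented to reproduce the output above (the PostalAddress block from earlier plus a made-up FoodEstablishment block):

from collections import defaultdict
from bs4 import BeautifulSoup

html = """
<div class="address" itemtype="http://schema.org/PostalAddress">
  <div>
    <p itemprop="name">33 San Francisco</p>
    <p itemprop="name">34 LA</p>
  </div>
</div>
<div itemtype="http://schema.org/FoodEstablishment">
  <p itemprop="name">The Dormouse's story</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

results = defaultdict(list)
for node in soup.select("[itemprop=name]"):
    # walk up until an ancestor carries an itemtype attribute
    itemtype = None
    parent = node.parent
    while itemtype is None and parent is not None:
        itemtype = parent.attrs.get("itemtype")
        if itemtype is None:
            parent = parent.parent
    results[(itemtype or "?").split("/")[-1]].append(node.text)

print(results)
# defaultdict(<class 'list'>, {'PostalAddress': ['33 San Francisco', '34 LA'],
#                              'FoodEstablishment': ["The Dormouse's story"]})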

