Difference Between .string and .text in BeautifulSoup

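Calling .string on a Tag object returns a NavigableString object (or None in some cases). .text, on the other hand, gets all the child strings and returns them concatenated using the given separator, which defaults to an empty string. The return type of .text is a plain Unicode string (unicode in Python 2, str in Python 3).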

From the documentation: a NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree.

From the documentation on .string, we can see that if the HTML looks like this:

<td>Some Table Data</td>
<td></td>

Then .string on the second td will return None.
But .text will return an empty string, which is still a Unicode string object.

In more detail, the library describes the two properties as follows.

string

  • Convenience property of a tag to get the single string within this tag.
  • If the tag has a single string child then the return value is that string.
  • If the tag has no children or more than one child then the return value is None.
  • If this tag has one child tag then the return value is the 'string' attribute of the child tag, recursively.

And text

  • Get all the child strings and return concatenated using the given separator.

If the html is like this:

<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>

.string on the four td elements will return:

some text
None
more text
None

.text will give this result (the second line is an empty string):

some text

more text
even more text
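
The results listed above can be reproduced with a short script. This is a minimal sketch, assuming bs4 is installed and using the built-in html.parser; the wrapping table and tr tags are an addition made here only to keep the fragment well-formed.

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string: the single string inside the tag (following a lone child tag
# recursively), or None when the tag has zero or several children
strings = [td.string for td in cells]

# .text: every descendant string, concatenated
texts = [td.text for td in cells]

print(strings)  # ['some text', None, 'more text', None]
print(texts)    # ['some text', '', 'more text', 'even more text']
```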

Difference between text and string in BeautifulSoup

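This question concerns the string argument to find_all() and the related search methods, not the .string attribute. From the docs:

With string you can search for strings instead of tags.
The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text.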
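
As an illustration of the string argument, here is a small sketch. The markup and the id values are made up for this example, and it assumes Beautiful Soup 4.4.0 or later so that the argument is named string.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# string= on its own returns the matching NavigableString objects, not tags
found_strings = soup.find_all(string="Elsie")

# combined with a tag name, it returns tags whose .string equals the value
found_tags = soup.find_all("a", string="Lacie")
found_ids = [tag["id"] for tag in found_tags]

print(found_strings)  # ['Elsie']
print(found_ids)      # ['link2']
```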

Differences between .text and .get_text()

It looks like .text is just a property that calls get_text. Therefore, calling get_text without arguments is the same thing as .text. However, get_text also accepts keyword arguments that change how it behaves (separator, strip, types). If you need more control over the result, use the method form.
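
A quick way to see the equivalence, and what the extra arguments buy you, is the sketch below. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div> Hello <b> world </b> ! </div>", "html.parser")
div = soup.div

plain = div.text        # property form
same = div.get_text()   # same call with default arguments

# strip=True trims each string and drops empty ones;
# separator is placed between the pieces that remain
tidy = div.get_text(separator=" ", strip=True)

print(plain == same)  # True
print(tidy)           # Hello world !
```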

BeautifulSoup find tag by string without children text

You might need to do the search manually rather than relying on the regular expression:

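from bs4 import BeautifulSoup

# html is assumed to hold the page markup as a string
soup = BeautifulSoup(html, "html.parser")
header_title = "Unique Title 2"

for h4 in soup.find_all('h4'):
    # .text also includes the text of any child tags
    if header_title in h4.text:
        ...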
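
To make the manual search concrete, here is a self-contained sketch. The question's HTML is not shown, so the markup below is hypothetical: it assumes the target h4 contains a child tag next to its text, which is the situation where a string= filter does not match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading has a child tag next to its text.
html = """
<h4>Unique Title 1</h4>
<h4>Unique Title 2 <span>new</span></h4>
<h4>Another heading</h4>
"""
soup = BeautifulSoup(html, "html.parser")
header_title = "Unique Title 2"

# string= is compared against h4.string, which is None for a tag
# with more than one child, so this filter misses the heading.
by_string = soup.find_all("h4", string=header_title)

# Manual search: test the tag's full text instead.
by_text = [h4 for h4 in soup.find_all("h4") if header_title in h4.get_text()]

print(len(by_string), len(by_text))
```

Note that the manual check also sees text from child tags, so a stricter comparison may be needed if child text can contain the title.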

getText() vs text() vs get_text()

They are very similar:

  • .get_text is a function that returns the text of a tag as a string
  • .text is a property that calls get_text (so it's identical, except you don't use parentheses)
  • .getText is an alias of get_text

I would use .text whenever possible, and .get_text(...) when you need to pass custom arguments (e.g. foo.get_text(strip=True, separator='\n')).

BeautifulSoup get links between strings

You can use str.join while iterating over the .contents of the div:

import bs4
html = '''<div>Some TEXT with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with following <a href="https//- next Link">some LINK</a> inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents)

Output:

'Some TEXT with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'

Edit: ignoring br tags:

html = '''<div>Some TEXT <br> with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with <br> following <a href="https//- next Link">some LINK</a> inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents \
if getattr(i, 'name', None) != 'br')

Edit 2: recursive solution:

def form_text(s):
    if isinstance(s, (str, bs4.element.NavigableString)):
        yield s
    elif s.name == 'a':
        yield f'{s.get_text(strip=True)} ({s["href"]})'
    else:
        for i in getattr(s, 'contents', []):
            yield from form_text(i)

html = '''<div>Some TEXT <i>other text in </i> <br> with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with <br> following <a href="https//- next Link">some LINK</a> inside.</div>'''
print(' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))

Output:

Some TEXT  other text in     with  some LINK (https// - actual Link)  and some continuing TEXT with   following  some LINK (https//- next Link)  inside.

Also, whitespace may become an issue due to the presence of br tags, etc. To work around this, you can use re.sub:

import re
result = re.sub(r'\s+', ' ', ' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))

Output:

'Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'

Can't find string after a tag with BeautifulSoup in Python?

You need to select the a from the parent td and call .text on it, because the text is inside the anchor, which is a child of the td:

print([td.a.text for td in soup.find_all('td', class_='tou')])

There obviously is a td with the class tou, or you would not be getting a list containing None:

In [10]: html = """<td class='tou'>
<a href="/analyze/default/index/49398962/1/34925733" target="_blank">
<img alt="Sample Image" class="ajax-tooltip shadow radius lazy" data-id="acctInfo:34925733_1" data-original="/upload/profileIconId/default.jpg" src="/images/common/transbg.png"/>
Jue VioIe Grace
</a>
</td>"""

In [11]: soup = BeautifulSoup(html,"html.parser")

In [12]: [a.string for a in soup.find_all('td', class_='tou')]
Out[12]: [None]

In [13]: [td.a.text for td in soup.find_all('td', class_='tou')]
Out[13]: [u'\n\n Jue VioIe Grace\n ']

You could also call .text on the td:

In [14]: [td.text for td in soup.find_all('td', class_='tou')]
Out[14]: [u'\n\n\n Jue VioIe Grace\n \n']

But that may return more than you want.
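
The same behavior can be sketched in Python 3 with a simplified cell. The markup below is made up to match the shape of the snippet above (the href and src values are placeholders), and get_text(strip=True) is used to drop the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup with the same shape as the cell shown above.
html = """<td class="tou">
<a href="/profile/1">
<img alt="icon" src="/icon.png"/>
Jue VioIe Grace
</a>
</td>"""
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="tou")

td_string = td.string                     # None: the td has several children
link_text = td.a.get_text(strip=True)     # text inside the anchor, trimmed

print(td_string)  # None
print(link_text)  # Jue VioIe Grace
```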

Using your full HTML from pastebin:

In [18]: import requests

In [19]: soup = BeautifulSoup(requests.get("http://pastebin.com/raw/4mvcMsJu").content,"html.parser")

In [20]: [td.a.text.strip() for td in soup.find_all('td', class_='tou')]
Out[20]:
[u'KElTHMCBRlEF',
u'game 5 loser',
u'Cris',
u'interestingstare',
u'ApoIlo Price',
u'Zary',
u'Adrian Ma',
u'Liquid Inori',
u'focus plz',
u'Shiphtur',
u'Cody Sun',
u'ApoIIo Price',
u'Pobelter',
u'Jue VioIe Grace',
u'Valkrin',
u'Piggy Kitten',
u'1 and 17',
u'BLOCK IT',
u'JiaQQ1035716423',
u'Twitchtv Flaresz']

In this case td.text.strip() gives you the same output:

In [23]: [td.text.strip() for td in soup.find_all('td', class_='tou')]
Out[23]:
[u'KElTHMCBRlEF',
u'game 5 loser',
u'Cris',
u'interestingstare',
u'ApoIlo Price',
u'Zary',
u'Adrian Ma',
u'Liquid Inori',
u'focus plz',
u'Shiphtur',
u'Cody Sun',
u'ApoIIo Price',
u'Pobelter',
u'Jue VioIe Grace',
u'Valkrin',
u'Piggy Kitten',
u'1 and 17',
u'BLOCK IT',
u'JiaQQ1035716423',
u'Twitchtv Flaresz']

But you should understand that the two are not equivalent in general. See also the difference between .string and .text.

beautifulsoup: get text (including html tags) between two different tags (/h3 and h2)

Maybe css selectors can help:

for s in soup.select('h3'):
    # find_next_siblings() is the current name of fetchNextSiblings()
    for ns in s.find_next_siblings():
        if ns.name == "h2":
            # stop at the next h2
            break
        else:
            if ns.name == "p":
                print(ns)

Output:

<p>text1-1</p>
<p>text1-2</p>
<p>text2-1</p>
<p>text2-2</p>
<p>text3-1</p>

