Difference between .string and .text BeautifulSoup
Calling .string on a Tag object returns a NavigableString object. .text, on the other hand, gets all the child strings and returns them concatenated using the given separator. The return type of .text is a unicode object (str in Python 3).
From the documentation: a NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree.
From the documentation on .string, we can see that if the HTML looks like this:
<td>Some Table Data</td>
<td></td>
then .string on the second td returns None, but .text returns an empty string, which is a unicode type object.
For greater convenience:
string - Convenience property of a tag to get the single string within this tag.
- If the tag has a single string child, the return value is that string.
- If the tag has no children or more than one child, the return value is None.
- If this tag has one child tag, the return value is the 'string' attribute of the child tag, recursively.
text - Get all the child strings and return them concatenated using the given separator.
If the HTML is like this:
<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>
.string on the four td elements returns:
some text
None
more text
None
.text gives this result (the second line is the empty string returned for the empty td):
some text

more text
even more text
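The comparison above can be checked with a short script. This is a minimal sketch that wraps the four td elements in a table (the wrapper and the variable names are assumptions added for illustration) and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

# The four cells from the example, wrapped in a table for illustration.
html = """
<table><tr>
<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string: the single string inside the tag, or None when there is
# no child or more than one child.
strings = [td.string for td in cells]

# .text: all descendant strings concatenated (empty string if none).
texts = [td.text for td in cells]

print(strings)  # ['some text', None, 'more text', None]
print(texts)    # ['some text', '', 'more text', 'even more text']
```

Note that the empty td is the case where the two differ most visibly: None from .string, an empty string from .text.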
Difference between text and string in BeautifulSoup
From the docs:
With string you can search for strings instead of tags.
The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text.
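As a rough illustration of searching by string, the sketch below assumes Beautiful Soup 4.4.0 or later and a made-up HTML snippet. Passing string= on its own returns the matching strings; combining it with a tag name returns tags whose .string matches.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = "<div><p>Hello world</p><p>Goodbye</p><span>Hello again</span></div>"
soup = BeautifulSoup(html, "html.parser")

# Exact match: returns NavigableString objects, not tags.
exact = soup.find_all(string="Goodbye")

# Regular expression match against every string in the document.
pattern = soup.find_all(string=re.compile("Hello"))

# Tag name plus string: returns the <p> tags whose .string matches.
tags = soup.find_all("p", string=re.compile("Hello"))

print(exact)    # ['Goodbye']
print(pattern)  # ['Hello world', 'Hello again']
print(tags)     # [<p>Hello world</p>]
```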
Differences between .text and .get_text()
It looks like .text is just a property that calls get_text. Therefore, calling get_text without arguments is the same thing as .text. However, get_text also supports various keyword arguments to change how it behaves (separator, strip, types). If you need more control over the result, then you need the method form.
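To make the difference concrete, here is a small sketch using an invented snippet. It only exercises the separator and strip arguments mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with padding spaces inside each paragraph.
html = "<div><p> first </p><p> second </p></div>"
div = BeautifulSoup(html, "html.parser").div

plain = div.text                                    # same as div.get_text()
joined = div.get_text(separator="|")                # separator between strings
stripped = div.get_text(separator="|", strip=True)  # strip each string first

print(repr(plain))     # ' first  second '
print(repr(joined))    # ' first | second '
print(repr(stripped))  # 'first|second'
```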
BeautifulSoup find tag by string without children text
You might need to do the search manually rather than relying on the regular expression:
from bs4 import BeautifulSoup

# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, "html.parser")
header_title = "Unique Title 2"
for h4 in soup.find_all('h4'):
    if header_title in h4.text:
        ...
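A self-contained version of that manual search might look like the following. The HTML is hypothetical (the original question's markup is not shown here); it is chosen so that each h4 has more than one child, which is why .string is None and a check against .text is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each <h4> mixes a child tag with its own text.
html = """
<h4>Unique Title 1 <span>badge</span></h4>
<h4><span>New</span> Unique Title 2</h4>
"""

soup = BeautifulSoup(html, "html.parser")
header_title = "Unique Title 2"

# Check the concatenated text of every <h4> by hand.
matches = [h4 for h4 in soup.find_all("h4") if header_title in h4.text]

print(matches[0].text)    # New Unique Title 2
print(matches[0].string)  # None, because the tag has two children
```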
getText() vs text() vs get_text()
They are very similar:
- .get_text is a function that returns the text of a tag as a string.
- .text is a property that calls get_text (so it's identical, except you don't use parentheses).
- getText is an alias of get_text.
I would use .text whenever possible, and .get_text(...) when you need to pass custom arguments (e.g. foo.get_text(strip=True, separator='\n')).
BeautifulSoup get links between strings
You can use str.join with an iteration over the contents of the div:
import bs4
html = '''<div>Some TEXT with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with following <a href="https//- next Link">some LINK</a> inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents)
Output:
'Some TEXT with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
Edit: ignoring br tags:
html = '''<div>Some TEXT <br> with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with <br> following <a href="https//- next Link">some LINK</a> inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents \
if getattr(i, 'name', None) != 'br')
Edit 2: recursive solution:
def form_text(s):
    if isinstance(s, (str, bs4.element.NavigableString)):
        yield s
    elif s.name == 'a':
        yield f'{s.get_text(strip=True)} ({s["href"]})'
    else:
        for i in getattr(s, 'contents', []):
            yield from form_text(i)
html = '''<div>Some TEXT <i>other text in </i> <br> with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with <br> following <a href="https//- next Link">some LINK</a> inside.</div>'''
print(' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output:
Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.
Also, whitespace may become an issue due to the presence of br tags, etc. To work around this, you can use re.sub to collapse runs of whitespace:
import re
result = re.sub(r'\s+', ' ', ' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output:
'Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
Can't find string after a tag with BeautifulSoup in Python?
You need to select the a from the parent td and call .text, the text is inside the anchor which is a child of the td:
print([td.a.text for td in soup.find_all('td', class_='tou')])
There obviously is a td with the class tou or you would not be getting a list with None:
In [10]: html = """<td class='tou'>
<a href="/analyze/default/index/49398962/1/34925733" target="_blank">
<img alt="Sample Image" class="ajax-tooltip shadow radius lazy" data-id="acctInfo:34925733_1" data-original="/upload/profileIconId/default.jpg" src="/images/common/transbg.png"/>
Jue VioIe Grace
</a>
</td>"""
In [11]: soup = BeautifulSoup(html,"html.parser")
In [12]: [a.string for a in soup.find_all('td', class_='tou')]
Out[12]: [None]
In [13]: [td.a.text for td in soup.find_all('td', class_='tou')]
Out[13]: [u'\n\n Jue VioIe Grace\n ']
You could also call .text on the td:
In [14]: [td.text for td in soup.find_all('td', class_='tou')]
Out[14]: [u'\n\n\n Jue VioIe Grace\n \n']
But that might get more than you want.
Using your full HTML from pastebin:
In [18]: import requests
In [19]: soup = BeautifulSoup(requests.get("http://pastebin.com/raw/4mvcMsJu").content,"html.parser")
In [20]: [td.a.text.strip() for td in soup.find_all('td', class_='tou')]
Out[20]:
[u'KElTHMCBRlEF',
u'game 5 loser',
u'Cris',
u'interestingstare',
u'ApoIlo Price',
u'Zary',
u'Adrian Ma',
u'Liquid Inori',
u'focus plz',
u'Shiphtur',
u'Cody Sun',
u'ApoIIo Price',
u'Pobelter',
u'Jue VioIe Grace',
u'Valkrin',
u'Piggy Kitten',
u'1 and 17',
u'BLOCK IT',
u'JiaQQ1035716423',
u'Twitchtv Flaresz']
In this case td.text.strip() gives you the same output:
In [23]: [td.text.strip() for td in soup.find_all('td', class_='tou')]
Out[23]:
[u'KElTHMCBRlEF',
u'game 5 loser',
u'Cris',
u'interestingstare',
u'ApoIlo Price',
u'Zary',
u'Adrian Ma',
u'Liquid Inori',
u'focus plz',
u'Shiphtur',
u'Cody Sun',
u'ApoIIo Price',
u'Pobelter',
u'Jue VioIe Grace',
u'Valkrin',
u'Piggy Kitten',
u'1 and 17',
u'BLOCK IT',
u'JiaQQ1035716423',
u'Twitchtv Flaresz']
But you should understand that there is a difference between the two; see the difference between .string and .text described above.
beautifulsoup: get text (including html tags) between two different tags (/h3 and h2)
Maybe css selectors can help:
for s in soup.select('h3'):
    # find_next_siblings is the current name for the deprecated fetchNextSiblings
    for ns in s.find_next_siblings():
        if ns.name == "h2":
            break
        elif ns.name == "p":
            print(ns)
Output:
<p>text1-1</p>
<p>text1-2</p>
<p>text2-1</p>
<p>text2-2</p>
<p>text3-1</p>
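The same sibling walk can be tried on a small document. The markup below is invented for illustration and is not the HTML from the original question; find_all is used in place of select, and find_next_siblings(True) is used so that only tags are visited.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: paragraphs after the <h2> should not be collected.
html = "<h3>A</h3><p>a1</p><div>skip</div><p>a2</p><h2>B</h2><p>b1</p>"
soup = BeautifulSoup(html, "html.parser")

collected = []
for h3 in soup.find_all("h3"):
    # True matches every tag, so text nodes between tags are ignored.
    for sibling in h3.find_next_siblings(True):
        if sibling.name == "h2":
            break  # stop at the next <h2>
        if sibling.name == "p":
            collected.append(str(sibling))

print(collected)  # ['<p>a1</p>', '<p>a2</p>']
```

Using str(sibling) keeps the surrounding tags, which matches the "including html tags" part of the question; use sibling.text instead if only the text is wanted.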