BeautifulSoup - search by text inside a tag
The problem is that your <a> tag, with the <i> tag inside it, doesn't have the string attribute you expect it to have. First, let's take a look at what the text="" argument of find() does.
NOTE: text is the old name of this argument; since Beautiful Soup 4.4.0 it's called string.
From the docs:
Although string is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
.string matches your value for string. This code finds the tags
whose .string is “Elsie”:
soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
Now let's take a look at what the Tag's string attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
title_tag.string
# u'The Dormouse's story'
(...)
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None:
print(soup.html.string)
# None
This is exactly your case. Your <a> tag contains both text and an <i> tag, so its .string is None. When find() compares that None against the string you are searching for, it can't match.
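To see this behaviour directly, here is a minimal sketch that parses a one-line version of the link from the question and inspects it. The variable names are only for illustration, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

html = '<a href="/customer-menu/1/accounts/1/update"><i class="fa fa-edit"></i> Edit</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# The <a> tag has two children (the <i> tag and the text " Edit"),
# so .string is None.
children = list(link.children)
link_string = link.string

# Filtering tags by string compares against .string, so nothing matches.
by_string = soup.find_all("a", string="Edit")

# get_text() concatenates all the text inside the tag instead.
link_text = link.get_text()

print(len(children), link_string, len(by_string), repr(link_text))
```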
How to solve this?
Maybe there is a better solution but I would probably go with something like this:
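import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""", "html.parser")

# Narrow the candidates by href first, then look for the text inside each one.
links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")
thelink = None  # stays None if no link contains the text
for link in links:
    # Without a tag name, find() searches the text nodes under the link.
    if link.find(string=re.compile("Edit")):
        thelink = link
        break
print(thelink)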
I think there are not too many links pointing to /customer-menu/1/accounts/1/update, so it should be fast enough.
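Another option, not part of the answer above, is to pass a function to find() and compare against the tag's combined text. This is only a sketch: is_edit_link is a hypothetical helper, and the second "Delete" link is invented to show that only the matching link is returned.

```python
from bs4 import BeautifulSoup

html = """
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
<a href="/customer-menu/1/accounts/1/delete">
<i class="fa fa-trash"></i> Delete
</a>
"""
soup = BeautifulSoup(html, "html.parser")


def is_edit_link(tag):
    # Hypothetical helper: match <a> tags whose combined, stripped text is "Edit".
    return tag.name == "a" and tag.get_text(strip=True) == "Edit"


edit_link = soup.find(is_edit_link)
print(edit_link["href"])
```

This avoids relying on .string at all, at the cost of calling the function for every tag in the document.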
How to find tag with particular text with Beautiful Soup?
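You can pass a regular expression to the text parameter of findAll (named string and find_all in current Beautiful Soup 4 releases), like so:
from bs4 import BeautifulSoup
import re
# soup is assumed to be an already parsed BeautifulSoup document
columns = soup.find_all('td', string=re.compile('your regex here'), attrs={'class': 'pos'})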
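Here is a small runnable sketch of that call, using the Beautiful Soup 4 spellings find_all and string. The table markup and the pattern "place" are invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Sample markup invented for this illustration.
html = """
<table>
  <tr><td class="pos">1st place</td><td class="name">Alice</td></tr>
  <tr><td class="pos">2nd place</td><td class="name">Bob</td></tr>
  <tr><td class="pos">n/a</td><td class="name">Carol</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Match <td class="pos"> cells whose .string contains "place".
columns = soup.find_all("td", string=re.compile(r"place"), attrs={"class": "pos"})
matched = [str(td.string) for td in columns]
print(matched)
```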
BeautifulSoup Find tag with text containing
The non-breaking space is parsed as \xa0, so you can either run:
text = soup.find('strong', text='Hello\xa0')
Or you could use regex:
import re
text = soup.find('strong', text=re.compile("Hello"))
Alternatively, you could use a lambda function that looks for Hello at the start of the string. A tag's .string can be None, so the guard avoids calling startswith on None:
text = soup.find("strong", text=lambda value: value and value.startswith("Hello"))
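To compare the three approaches side by side, here is a sketch on a made-up snippet. The markup is an assumption, and string= is used as the current name of the text argument.

```python
import re
from bs4 import BeautifulSoup

# Sample markup invented for this illustration; &nbsp; becomes \xa0 when parsed.
soup = BeautifulSoup("<p><strong>Hello&nbsp;</strong>world</p>", "html.parser")

# Exact match works only if the non-breaking space is included.
exact = soup.find("strong", string="Hello\xa0")
plain = soup.find("strong", string="Hello")

# A regex or a function can match on part of the string instead.
by_regex = soup.find("strong", string=re.compile("Hello"))
by_func = soup.find(
    "strong", string=lambda s: s is not None and s.startswith("Hello")
)

print(exact, plain, by_regex, by_func)
```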
Using BeautifulSoup to find a HTML tag that contains certain text
from bs4 import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""
soup = BeautifulSoup(html_text, "html.parser")
# Calling soup(...) is shorthand for soup.find_all(...).
# Restrict the search to <h2>; a text-only search would also return the <h1>.
for elem in soup('h2', string=re.compile(r' #\S{11}')):
    print(elem)
Prints:
<h2>this is cool #12345678901</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
Show text inside the tags BeautifulSoup
To get the text within the tags, there are a couple of approaches:
a) Use the .text attribute of the tag.
cars = soup.find_all('span', attrs={'class': 'listing-row__price'})
for tag in cars:
    print(tag.text.strip())
Output
$71,996
$75,831
$71,412
$75,476
....
b) Use get_text()
for tag in cars:
    print(tag.get_text().strip())
c) If there is only that string inside the tag, you can use these options also
.string
.contents[0]
next(tag.children)
next(tag.strings)
next(tag.stripped_strings)
For example:
for tag in cars:
    print(tag.string.strip())  # or uncomment any of the lines below
    #print(tag.contents[0].strip())
    #print(next(tag.children).strip())
    #print(next(tag.strings).strip())
    #print(next(tag.stripped_strings))
Outputs:
$71,996
$75,831
$71,412
$75,476
$77,001
...
Note: .text and .string are not the same. If there are other elements in the tag, .string returns None, while .text returns the text inside the tag.
from bs4 import BeautifulSoup
html="""
<p>hello <b>there</b></p>
"""
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')
print(p.string)
print(p.text)
Outputs
None
hello there
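When a tag does contain nested elements, stripped_strings and the optional arguments of get_text() give you more control over how the pieces are joined. This is a short sketch on the same markup; the separator values are arbitrary choices for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello <b>there</b></p>", "html.parser")
p = soup.find("p")

# Each text node under <p>, with surrounding whitespace stripped.
pieces = list(p.stripped_strings)

# get_text() takes an optional separator and a strip flag.
joined = p.get_text(" ", strip=True)
piped = p.get_text("|", strip=True)

print(pieces, joined, piped)
```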
Search for text inside a tag using beautifulsoup and returning the text in the tag after it
You can define a function to return the value for the key you enter:
def get_txt(soup, key):
    key_tag = soup.find('span', text=key).parent
    return key_tag.find_all('span')[1].text

color = get_txt(soup, 'Color')
print('Color: ' + color)
features = get_txt(soup, 'Features')
print('Features: ' + features)
Output:
Color: Slate, mykonos
Features: Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges
I hope this is what you are looking for.
Explanation:
soup.find('span', text=key) returns the <span> tag whose text is equal to key.
.parent returns the parent tag of the current <span> tag.
Example:
When key='Color', soup.find('span', text=key).parent will return
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
Now we've stored this in key_tag. The only thing left is getting the text of the second <span>, which is what the line key_tag.find_all('span')[1].text does.
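If a key might be missing from the page, get_txt raises an AttributeError because find() returns None. Here is a sketch of a more defensive variant. The name get_txt_or_none is hypothetical, the markup is reconstructed from the fragment shown above, and the Features row is assumed to have the same structure as the Color row.

```python
from bs4 import BeautifulSoup

# Markup reconstructed from the fragment above; the Features row is assumed
# to follow the same structure.
html = """
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
<div class="_JDu">
<span class="_IDu">Features</span>
<span class="_KDu">Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def get_txt_or_none(soup, key):
    # Hypothetical variant of get_txt: returns None instead of raising
    # when the key or its value <span> is not found.
    label = soup.find("span", string=key)
    if label is None:
        return None
    value = label.find_next_sibling("span")
    return value.get_text(strip=True) if value is not None else None


print(get_txt_or_none(soup, "Color"))
print(get_txt_or_none(soup, "Weight"))
```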
BeautifulSoup find text in specific tag
As you know the exact positions of the tags you want to find, you can use find_all()
which returns a list and then get the tag from the required index.
In this case (19th <tr> and 2nd <td>), use this:
result = soup.find_all('tr')[18].find_all('td')[1].text
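Indexing like this raises an IndexError if the page has fewer rows or cells than expected. Here is a sketch of a guarded version on a tiny invented table; cell_text is a hypothetical helper that takes zero-based, non-negative indexes.

```python
from bs4 import BeautifulSoup

# Small table invented for this illustration: 2 rows of 2 cells each.
html = """
<table>
<tr><td>a1</td><td>a2</td></tr>
<tr><td>b1</td><td>b2</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def cell_text(soup, row_index, col_index):
    # Hypothetical helper: return the cell's text, or None if either
    # zero-based index is out of range.
    rows = soup.find_all("tr")
    if row_index >= len(rows):
        return None
    cells = rows[row_index].find_all("td")
    if col_index >= len(cells):
        return None
    return cells[col_index].get_text(strip=True)


print(cell_text(soup, 1, 1))
print(cell_text(soup, 18, 1))
```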