Only extracting text from this element, not its children
What about .find(text=True)?
>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').find(text=True)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').find(text=True)
u'no'
EDIT:
I think that I've understood what you want now. Try this:
>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').html.find(text=True, recursive=False)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').html.find(text=True, recursive=False)
u'yes'
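For anyone on Beautiful Soup 4, here is a minimal sketch of the same idea. It assumes bs4 4.4 or newer, where the string= argument is the preferred spelling of the older text= argument. The helper name first_direct_text is made up for illustration and is not part of the library.

```python
from bs4 import BeautifulSoup


def first_direct_text(markup, tag_name):
    """Return the first text node that is a direct child of the named tag.

    Hypothetical helper: returns None when the tag is not found.
    """
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.find(tag_name)
    if tag is None:
        return None
    # recursive=False restricts the search to direct children,
    # so text inside nested tags such as <b> is skipped.
    return tag.find(string=True, recursive=False)


print(first_direct_text("<html><b>no</b>yes</html>", "html"))  # yes
print(first_direct_text("<html>yes<b>no</b></html>", "html"))  # yes
```

Note that this returns only the first direct text node; use find_all(string=True, recursive=False) if the tag has text on both sides of a child.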
Using .text() to retrieve only text not nested in child tags
I liked this reusable implementation based on the clone() method, which gets only the text inside the parent element.
Code provided for easy reference:
$("#foo")
    .clone()    // clone the element
    .children() // select all the children
    .remove()   // remove all the children
    .end()      // go back to the cloned element
    .text();
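If you are doing this in Python rather than jQuery, the same clone-and-strip idea can be sketched with Beautiful Soup 4. This is only a rough equivalent, assuming bs4 4.4 or newer (where copy.copy() works on tags); the helper name just_text and the sample markup are made up for illustration.

```python
import copy

from bs4 import BeautifulSoup


def just_text(tag):
    """Return the text of a tag with all child tags removed.

    Works on a detached copy, so the original tree is left untouched.
    """
    clone = copy.copy(tag)
    # find_all(True, recursive=False) returns the direct child tags as a list,
    # so it is safe to remove them while looping.
    for child in clone.find_all(True, recursive=False):
        child.decompose()
    return clone.get_text()


html = '<div id="foo">Hello <b>bold</b> world</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find(id="foo")

# Collapse the whitespace left behind where the child tag used to be.
print(" ".join(just_text(div).split()))  # Hello world
print(div.b.get_text())                  # bold (original is unchanged)
```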
Beautiful Soup 4: extracting text only from a tag containing children tags
You may be looking for the .next_element attribute, which points to whatever was parsed immediately after the element you grabbed. In your case, it would look something like this:
result = soup.find('div', class_='title').next_element.strip()
# -> $ 430000
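One caveat: .next_element is only a string when the tag starts with text. If the tag starts with a child tag, or is empty, you get something else back. Below is a defensive sketch; the helper name leading_text and the sample markup are assumptions for illustration, not taken from the question.

```python
from bs4 import BeautifulSoup, NavigableString


def leading_text(tag):
    """Return the stripped text that opens the tag, or "" if it does not start with text."""
    first = tag.next_element
    # Check the parent too: for an empty tag, next_element is whatever
    # follows the tag in the document, not something inside it.
    if isinstance(first, NavigableString) and first.parent is tag:
        return first.strip()
    return ""


html = '<div class="title">$ 430000 <span class="unit">USD</span></div>'
soup = BeautifulSoup(html, "html.parser")
print(leading_text(soup.find("div", class_="title")))  # $ 430000
```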
Extract text from inline children of an anchor tag
You can grab the shallow text using tag.find_all(text=True, recursive=False), then use a normal selector to pull out the deeper text from the nested <code> element.
This way, all of the data is separate from the start, and you're not parsing the individual pieces out of the smushed-together text of the parent after the fact.
from bs4 import BeautifulSoup
html = """
<a class="reference internal" href="optparse.html">
<code class="xref py py-mod docutils literal notranslate">
<span class="pre">
optparse
</span>
</code>
— Parser for command line options
</a>
"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select("a.reference.internal"):
    # Text nodes that are direct children of the <a> tag
    shallow = "".join(tag.find_all(text=True, recursive=False)).strip()
    # Text nested inside the <code> element
    deep = tag.find("code").text.strip()
    print(repr(shallow)) # => '— Parser for command line options'
    print(repr(deep)) # => 'optparse'
Extract Parent Text without Children Text; Parsing HTML
You can use .contents and access the 0th element:
for tag in soup.find_all("p"):
    print(tag.contents[0].strip())
Output:
Environment:
Basic Rules
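To try this yourself, here is a self-contained sketch. The markup is a guess at what the question's HTML looked like (the span contents are invented for illustration). Keep in mind that .contents[0] assumes every <p> starts with text; it raises IndexError on an empty tag, and it returns a tag rather than a string if a tag comes first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <span> contents are made up for this example.
html = """
<footer>
<p class="tags environment-tags">Environment: <span>Arctic</span></p>
<p class="source monster-source">Basic Rules<span>, pg. 1</span></p>
</footer>
"""

soup = BeautifulSoup(html, "html.parser")

# The first item of .contents is the text node that precedes each <span>.
leading = [tag.contents[0].strip() for tag in soup.find_all("p")]
print(leading)  # ['Environment:', 'Basic Rules']
```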
Or, building on your attempt, you can remove the <span>s using .extract():
for tag in soup.select("p span"):
    tag.extract()
print(soup.prettify())
Output:
<footer>
 <p class="tags environment-tags">
  Environment:
 </p>
 <p class="source monster-source">
  Basic Rules
 </p>
</footer>
Text from BeautifulSoup4 missing
If you have a look at the .contents of a tag, you'll see that the text you want belongs to a class called NavigableString.
from bs4 import BeautifulSoup, NavigableString
html = """<a aria-expanded="false" aria-owns="faqGen5" href="#">aaa <span class="nobreak">bbb</span> ccc?</a>"""
soup = BeautifulSoup(html, 'lxml')
for content in soup.find('a').contents:
    print(content, type(content))
# aaa <class 'bs4.element.NavigableString'>
# <span class="nobreak">bbb</span> <class 'bs4.element.Tag'>
# ccc? <class 'bs4.element.NavigableString'>
Now, you simply need to get the elements belonging to the NavigableString class and join them together.
text = ''.join([x for x in soup.find('a').contents if isinstance(x, NavigableString)])
print(text)
# aaa ccc?
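One thing to watch for with the isinstance check: in bs4, Comment is a subclass of NavigableString, so HTML comments would be joined into the result as well. Here is a sketch of a reusable helper that skips them; the name own_text and the comment added to the sample markup are my own, for illustration only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def own_text(tag):
    """Join the tag's direct text nodes, ignoring child tags and HTML comments."""
    return "".join(
        child
        for child in tag.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    )


# Same markup as above, with a comment inserted to show the difference.
html = '<a href="#">aaa <!-- note --><span class="nobreak">bbb</span> ccc?</a>'
soup = BeautifulSoup(html, "html.parser")

text = own_text(soup.find("a"))
print(" ".join(text.split()))  # aaa ccc?
```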
PyQuery: Get only text of element, not text of child elements
I don't think there is a clean way to do that. At least, I've found this solution:
>>> print doc('h1').html(doc('h1')('span').outerHtml())
<h1 class="price"><span class="strike">$325.00</span></h1>
You can use .text() instead of .outerHtml() if you don't want to keep the span tag.
Removing the first one is much easier:
>>> print doc('h1').remove('span')
<h1 class="price">
$295.00
</h1>
jquery - get text for element without children text
Hey, try this: http://jsfiddle.net/MtVxx/2/
There is a good write-up for your specific case here: http://viralpatel.net/blogs/jquery-get-text-element-without-child-element/ (this gets only the text of the element itself).
Hope this helps :)
Code:
jQuery.fn.justtext = function() {
    return $(this).clone()
        .children()
        .remove()
        .end()
        .text();
};
alert($('#mydiv').justtext());