How to select and extract text between two elements?
An extraction pattern I like to use for these cases is:
- loop over the "boundaries" (here, h4 elements),
- while enumerating them starting from 1,
- using XPath's following-sibling axis, like in @Andersson's answer, to get elements before the next boundary,
- and filtering them by counting the number of preceding "boundary" elements, since we know from our enumeration where we are.
This would be the loop:
$ scrapy shell 'http://www.imdb.com/title/tt0092455/trivia?tab=mc&ref_=tt_trv_cnn'
(...)
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     print(cnt, h4.xpath('normalize-space()').get())
...
1 Follows
2 Followed by
3 Edited into
4 Spun-off from
5 Spin-off
6 Referenced in
7 Featured in
8 Spoofed in
And this is one example of using the enumeration to get elements between boundaries (note that this uses an XPath variable, $cnt, in the expression and passes cnt=cnt to .xpath()):
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     print(cnt, h4.xpath('normalize-space()').get())
...     print(h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]',
...                    cnt=cnt).xpath(
...                        'string(.//a)').getall())
...
1 Follows
['Star Trek', 'Star Trek: The Animated Series', 'Star Trek: The Motion Picture', 'Star Trek II: The Wrath of Khan', 'Star Trek III: The Search for Spock', 'Star Trek IV: The Voyage Home']
2 Followed by
['Star Trek V: The Final Frontier', 'Star Trek VI: The Undiscovered Country', 'Star Trek: Deep Space Nine', 'Star Trek: Generations', 'Star Trek: Voyager', 'First Contact', 'Star Trek: Insurrection', 'Star Trek: Enterprise', 'Star Trek: Nemesis', 'Star Trek', 'Star Trek Into Darkness', 'Star Trek Beyond', 'Star Trek: Discovery', 'Untitled Star Trek Sequel']
3 Edited into
['Reading Rainbow: The Bionic Bunny Show', 'The Unauthorized Hagiography of Vincent Price']
4 Spun-off from
['Star Trek']
5 Spin-off
['Star Trek: The Next Generation - The Transinium Challenge', 'A Night with Troi', 'Star Trek: Deep Space Nine', "Star Trek: The Next Generation - Future's Past", 'Star Trek: The Next Generation - A Final Unity', 'Star Trek: The Next Generation: Interactive VCR Board Game - A Klingon Challenge', 'Star Trek: Borg', 'Star Trek: Klingon', 'Star Trek: The Experience - The Klingon Encounter']
6 Referenced in
(...)
Here's how you could use that to populate an item (here, I'm using a simple dict just for illustration):
>>> item = {}
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     key = h4.xpath('normalize-space()').get().strip()  # there are some non-breaking spaces
...     if key in ['Follows', 'Followed by', 'Spin-off']:
...         values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]',
...                           cnt=cnt).xpath(
...                               'string(.//a)').getall()
...         item[key] = values
...
>>> from pprint import pprint
>>> pprint(item)
{'Followed by': ['Star Trek V: The Final Frontier',
                 'Star Trek VI: The Undiscovered Country',
                 'Star Trek: Deep Space Nine',
                 'Star Trek: Generations',
                 'Star Trek: Voyager',
                 'First Contact',
                 'Star Trek: Insurrection',
                 'Star Trek: Enterprise',
                 'Star Trek: Nemesis',
                 'Star Trek',
                 'Star Trek Into Darkness',
                 'Star Trek Beyond',
                 'Star Trek: Discovery',
                 'Untitled Star Trek Sequel'],
 'Follows': ['Star Trek',
             'Star Trek: The Animated Series',
             'Star Trek: The Motion Picture',
             'Star Trek II: The Wrath of Khan',
             'Star Trek III: The Search for Spock',
             'Star Trek IV: The Voyage Home'],
 'Spin-off': ['Star Trek: The Next Generation - The Transinium Challenge',
              'A Night with Troi',
              'Star Trek: Deep Space Nine',
              "Star Trek: The Next Generation - Future's Past",
              'Star Trek: The Next Generation - A Final Unity',
              'Star Trek: The Next Generation: Interactive VCR Board Game - A '
              'Klingon Challenge',
              'Star Trek: Borg',
              'Star Trek: Klingon',
              'Star Trek: The Experience - The Klingon Encounter']}
>>>
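For readers without Scrapy at hand, the same "group siblings under the most recent boundary" idea can be sketched with only the standard library's xml.etree.ElementTree. This is not the XPath approach above, just the same grouping logic in plain Python. The markup is a made-up, well-formed stand-in for the IMDb page, and group_by_boundary is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the IMDb markup (not the real page).
SAMPLE = """
<div class="list">
  <h4 class="li_group">Follows</h4>
  <div><a href="#">Star Trek</a></div>
  <div><a href="#">Star Trek: The Animated Series</a></div>
  <h4 class="li_group">Followed by</h4>
  <div><a href="#">Star Trek V: The Final Frontier</a></div>
</div>
"""


def group_by_boundary(root, boundary_tag="h4"):
    """Map each boundary's text to the link texts of the siblings that follow it."""
    groups = {}
    current = None
    for child in root:
        if child.tag == boundary_tag:
            # A new boundary starts a new group (whitespace is normalized).
            current = " ".join("".join(child.itertext()).split())
            groups[current] = []
        elif current is not None:
            # Any other sibling belongs to the most recent boundary.
            link = child.find(".//a")
            if link is not None:
                groups[current].append("".join(link.itertext()).strip())
    return groups


groups = group_by_boundary(ET.fromstring(SAMPLE))
print(groups)
```

The XPath version has the advantage that the filtering happens inside the selector; this sketch trades that for having no dependencies.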
Retrieve Text Between Two Child Elements With Text
Try an XPath like this:
//div[@class="indemandProgress-raised ng-binding"]/text()
In Selenium, you cannot use an XPath that returns attribute or text nodes, because its find_element methods can only return elements (WebElement objects).
To get the text you want, you can use JavaScript to extract it from the text node.
Or select the element and then use .text:
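# find_element('xpath', ...) also works on Selenium versions where find_element_by_xpath has been removed
result = browser.find_element('xpath', '//div[contains(@class, "indemandProgress-raisedAmount")]').text.split()[1]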
So, ultimately, it is not possible to use XPath /text() in Selenium, and you have to rely on the alternative methods outlined above.
XPath - extracting text between two nodes
You should be able to just test the first preceding sibling h5
...
//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]
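To make the logic of that expression concrete, here is a rough standard-library emulation in Python. ElementTree has no preceding-sibling axis and no text nodes, so the sketch walks the siblings and tracks the most recent h5 instead. The sample markup and the helper name text_after_header are assumptions; only the header text 'SecondHeader' comes from the expression above. Unlike the XPath, it drops whitespace-only text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for illustration.
SAMPLE = """
<div>
  <h5>FirstHeader</h5>
  text one
  <br/>
  text two
  <h5>SecondHeader</h5>
  text three
  <br/>
  text four
  <h5>ThirdHeader</h5>
  text five
</div>
"""


def text_after_header(parent, header_text, header_tag="h5"):
    """Collect sibling text whose nearest preceding header matches header_text."""
    found = []
    last_header = None
    for child in parent:
        if child.tag == header_tag:
            # Remember the nearest preceding header, whitespace-normalized.
            last_header = " ".join("".join(child.itertext()).split())
        # In ElementTree, the text that follows an element is stored as its tail.
        tail = (child.tail or "").strip()
        if last_header == header_text and tail:
            found.append(tail)
    return found


result = text_after_header(ET.fromstring(SAMPLE), "SecondHeader")
print(result)
```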
Extract text between two (different) HTML tags using jsoup
Use the Element.nextSibling() method. In the example code below, the desired values are collected in a List<String>:
String html = "<td>\n"
+ " <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
+ " <span class=\"detailh2\">Total: </span> 31 704 \n"
+ " <span class=\"detailh2\">Last: </span> 30.12.2021 \n"
+ "</td>";
List<String> valuesList = new ArrayList<>();
Document doc = Jsoup.parse(html);
Elements elements = doc.select("span");
for (Element a : elements) {
    Node node = a.nextSibling();
    valuesList.add(node.toString().trim());
}
// Display valuesList in Console window:
for (String value : valuesList) {
    System.out.println(value);
}
This displays the following in the Console window:
2 145
31 704
30.12.2021
If you prefer to just get the value for Total: then you can try this:
String html = "<td>\n"
+ " <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
+ " <span class=\"detailh2\">Total: </span> 31 704 \n"
+ " <span class=\"detailh2\">Last: </span> 30.12.2021 \n"
+ "</td>";
String totalValue = "N/A";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("span");
for (Element a : elements) {
    // a.text() is enough here; calling before("</span>") would only modify the document
    if (a.text().contains("Total:")) {
        Node node = a.nextSibling();
        totalValue = "Total: --> " + node.toString().trim();
        break;
    }
}
// Display the value in Console window:
System.out.println(totalValue);
The above code will display the following within the Console Window:
Total: --> 31 704
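The same "take the text right after each span" idea can be sketched in Python with the standard library, where the text following an element is exposed as its tail attribute (roughly what nextSibling() returns for a text node in jsoup). The helper name values_after_spans is made up, and the sample cell is rewritten as well-formed XML so that ElementTree accepts it.

```python
import xml.etree.ElementTree as ET

# Same sample cell as above, written as one well-formed XML string.
CELL = (
    '<td>'
    '<span class="detailh2" style="margin:0px">This month: </span>2 145 '
    '<span class="detailh2">Total: </span> 31 704 '
    '<span class="detailh2">Last: </span> 30.12.2021 '
    '</td>'
)


def values_after_spans(cell):
    """Map each span label (without the colon) to the text that follows the span."""
    values = {}
    for span in cell.iter("span"):
        label = (span.text or "").strip().rstrip(":")
        values[label] = (span.tail or "").strip()
    return values


values = values_after_spans(ET.fromstring(CELL))
print(values["Total"])
```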
How to use sed/grep to extract text between two words?
sed -e 's/Here\(.*\)String/\1/'
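As a rough Python equivalent of that sed command, re.sub can replace the first "Here...String" match with its captured middle part. The function name between and the sample line are assumptions for illustration. As with sed, .* is greedy, so the match extends to the last "String" on the line.

```python
import re


def between(line, start="Here", end="String"):
    """Replace the first 'start...end' match with the text captured in between."""
    pattern = re.escape(start) + r"(.*)" + re.escape(end)
    # count=1 mirrors sed's s/// without the g flag.
    return re.sub(pattern, r"\1", line, count=1)


sample = "Here is a String"
extracted = between(sample)
print(repr(extracted))
```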
Extract text between html elements
Using the map() function, you can get the text of every p element like this:
var cities = $('.cities p').map(function () { return $(this).text();}).get().join();
$('.show').html(cities)
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script><div class="cities"> <div><p>Los Angeles</p><h5>Description</h5></div> <div><p>San Francisco</p><h5>Description</h5></div> <div><p>San Diego</p><h5>Description</h5></div> <div><p>Santa Barbara</p><h5>Description</h5></div> <div><p>Davis</p><h5>Description</h5></div> <div><p>San Jose</p><h5>Description</h5></div></div>
<h3>All cities</h3><div class="show"></div>
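For comparison, here is a rough Python sketch of the same collect-and-join step using only the standard library's html.parser. The class name ParagraphText is made up for illustration, and unlike the jQuery selector it collects every p element rather than only those inside .cities.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self.in_p:
            self.texts[-1] += data


html = (
    '<div class="cities">'
    '<div><p>Los Angeles</p><h5>Description</h5></div>'
    '<div><p>San Francisco</p><h5>Description</h5></div>'
    '<div><p>San Diego</p><h5>Description</h5></div>'
    '</div>'
)
parser = ParagraphText()
parser.feed(html)
cities = ",".join(parser.texts)  # jQuery's .join() also defaults to a comma
print(cities)
```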
How to get text between two Elements in DOM object?
You can always remove the user and the website elements like this (you can clone your submitted element if you do not want the remove actions to "damage" your document):
public static void main(String[] args) throws Exception {
    Document content = Jsoup.parse(
        "<div class=\"submitted\">" +
        " <strong><a title=\"View user profile.\" href=\"/user/1\">user1</a></strong>" +
        " on 27/09/2011 - 15:17 " +
        " <span class=\"via\"><a href=\"/goto/002\">www.google.com</a></span>" +
        "</div> ");
    // create a clone of the element so we do not destroy the original
    Element submitted = content.getElementsByClass("submitted").first().clone();
    // remove the elements that you do not need
    submitted.getElementsByTag("strong").remove();
    submitted.getElementsByClass("via").remove();
    // print the result (demo)
    System.out.println(submitted.text());
}
Outputs:
on 27/09/2011 - 15:17
Trying to get only the text between two strong tags
To answer the question as asked, while leaving the option of scraping the "Title of Article" and the footnotes: you can use findChildren() and then decompose() to remove the unwanted elements. From the output of this code you can extract the data you need quite easily. It works even if the text "PRESENT" and "Section Header" are not present, and it can easily be adapted to remove the elements before the first strong tag if needed.
from bs4 import BeautifulSoup, element
html = """
<div><p> blah blah</p></div>
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p> blah blah</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs = {'id' : 'article'})
counter = 0
# Iterate over children.
for i in notes.findChildren():
    if i.name == "strong":
        counter += 1
        if counter == 2:
            i.parent.decompose()  # Remove the second Strong tag's parent.
    if counter > 1:  # Remove all tags after second Strong tag.
        if isinstance(i, element.Tag):
            i.decompose()
print(notes)
Outputs:
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
</div>
Extract text from a line that is between two elements using Cheeriogs
Will this suffice?
$(str).children()[2].next.data.trim()
This is my first time using cheerio, so please correct me if I'm wrong.