How to Select and Extract Texts Between Two Elements

How can I select and extract the text between two elements?

An extraction pattern I like to use for these cases is:

  • loop over the "boundaries" (here, h4 elements)
  • while enumerating them starting from 1
  • using XPath's following-sibling axis, like in @Andersson's answer, to get elements before the next boundary,
  • and filtering them by counting the number of preceding "boundary" elements, since we know from our enumeration where we are

This would be the loop:

$ scrapy shell 'http://www.imdb.com/title/tt0092455/trivia?tab=mc&ref_=tt_trv_cnn'
(...)
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     print(cnt, h4.xpath('normalize-space()').get())
...
1 Follows 
2 Followed by 
3 Edited into 
4 Spun-off from 
5 Spin-off 
6 Referenced in 
7 Featured in 
8 Spoofed in 

And this is one example of using the enumeration to get the elements between boundaries (note that this uses an XPath variable, $cnt, in the expression, passed via cnt=cnt in .xpath()):

>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     print(cnt, h4.xpath('normalize-space()').get())
...     print(h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]',
...                    cnt=cnt).xpath(
...                        'string(.//a)').getall())
...
1 Follows 
['Star Trek', 'Star Trek: The Animated Series', 'Star Trek: The Motion Picture', 'Star Trek II: The Wrath of Khan', 'Star Trek III: The Search for Spock', 'Star Trek IV: The Voyage Home']
2 Followed by 
['Star Trek V: The Final Frontier', 'Star Trek VI: The Undiscovered Country', 'Star Trek: Deep Space Nine', 'Star Trek: Generations', 'Star Trek: Voyager', 'First Contact', 'Star Trek: Insurrection', 'Star Trek: Enterprise', 'Star Trek: Nemesis', 'Star Trek', 'Star Trek Into Darkness', 'Star Trek Beyond', 'Star Trek: Discovery', 'Untitled Star Trek Sequel']
3 Edited into 
['Reading Rainbow: The Bionic Bunny Show', 'The Unauthorized Hagiography of Vincent Price']
4 Spun-off from 
['Star Trek']
5 Spin-off 
['Star Trek: The Next Generation - The Transinium Challenge', 'A Night with Troi', 'Star Trek: Deep Space Nine', "Star Trek: The Next Generation - Future's Past", 'Star Trek: The Next Generation - A Final Unity', 'Star Trek: The Next Generation: Interactive VCR Board Game - A Klingon Challenge', 'Star Trek: Borg', 'Star Trek: Klingon', 'Star Trek: The Experience - The Klingon Encounter']
6 Referenced in 
(...)

Here's how you could use that to populate an item (here, I'm using a plain dict just for illustration):

>>> item = {}
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     key = h4.xpath('normalize-space()').get().strip()  # there are some non-breaking spaces
...     if key in ['Follows', 'Followed by', 'Spin-off']:
...         values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]',
...                           cnt=cnt).xpath(
...                               'string(.//a)').getall()
...         item[key] = values
...

>>> from pprint import pprint
>>> pprint(item)
{'Followed by': ['Star Trek V: The Final Frontier',
'Star Trek VI: The Undiscovered Country',
'Star Trek: Deep Space Nine',
'Star Trek: Generations',
'Star Trek: Voyager',
'First Contact',
'Star Trek: Insurrection',
'Star Trek: Enterprise',
'Star Trek: Nemesis',
'Star Trek',
'Star Trek Into Darkness',
'Star Trek Beyond',
'Star Trek: Discovery',
'Untitled Star Trek Sequel'],
'Follows': ['Star Trek',
'Star Trek: The Animated Series',
'Star Trek: The Motion Picture',
'Star Trek II: The Wrath of Khan',
'Star Trek III: The Search for Spock',
'Star Trek IV: The Voyage Home'],
'Spin-off': ['Star Trek: The Next Generation - The Transinium Challenge',
'A Night with Troi',
'Star Trek: Deep Space Nine',
"Star Trek: The Next Generation - Future's Past",
'Star Trek: The Next Generation - A Final Unity',
'Star Trek: The Next Generation: Interactive VCR Board Game - A '
'Klingon Challenge',
'Star Trek: Borg',
'Star Trek: Klingon',
'Star Trek: The Experience - The Klingon Encounter']}
>>>

Retrieve Text Between Two Child Elements With Text

Try selecting the text node of the div directly:

//div[@class="indemandProgress-raised ng-binding"]/text()

In Selenium, you cannot use an XPath expression that returns attribute or text nodes: find_element() only supports expressions that return element nodes.

To get the text you want, you can use JavaScript (via execute_script()) to extract it from the text node, or select the parent element and then use its .text attribute:

result = browser.find_element_by_xpath('//div[contains(@class, "indemandProgress-raisedAmount")]').text.split()[1]

So, ultimately, it is not possible to use an XPath /text() selector in Selenium, and you have to rely on alternative methods such as the ones outlined above.
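As a plain-Python illustration of the .text approach above (the element's text here is invented, so the real page may word it differently and need a different split index):

```python
# Hypothetical .text value of the matched <div>; the real page's
# wording may differ, so the split index would need adjusting.
element_text = "Raised: $1,234 of $10,000 goal"

# .split() breaks on whitespace; index 1 picks the second token.
amount = element_text.split()[1]
print(amount)  # $1,234
```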

XPath - extracting text between two nodes

You should be able to just test the first preceding sibling h5...

//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]
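A quick sketch of that expression with lxml (the markup below is invented for illustration, and lxml is assumed to be available):

```python
from lxml import html

# Invented sample markup: three sibling h5 "boundaries" with text between them.
doc = html.fromstring("""
<div>
  <h5>FirstHeader</h5>
  first block of text
  <h5>SecondHeader</h5>
  the text we want
  <h5>ThirdHeader</h5>
  more text
</div>
""")

# Keep only text nodes whose nearest preceding h5 sibling is 'SecondHeader';
# text after 'ThirdHeader' has a different nearest h5, so it is excluded.
texts = doc.xpath(
    "//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]")
print([t.strip() for t in texts if t.strip()])  # ['the text we want']
```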

Extract text between two (different) HTML tags using jsoup

Use the Element.nextSibling() method. In the example code below, the desired values are placed into a List<String>:

String html = "<td>\n"
+ " <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
+ " <span class=\"detailh2\">Total: </span> 31 704 \n"
+ " <span class=\"detailh2\">Last: </span> 30.12.2021 \n"
+ "</td>";

List<String> valuesList = new ArrayList<>();

Document doc = Jsoup.parse(html);
Elements elements = doc.select("span");
for (Element a : elements) {
    Node node = a.nextSibling();
    valuesList.add(node.toString().trim());
}

// Display valuesList in the Console window:
for (String value : valuesList) {
    System.out.println(value);
}

It will display the following in the Console window:

2 145
31 704
30.12.2021

If you prefer to just get the value for Total: then you can try this:

String html = "<td>\n"
+ " <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
+ " <span class=\"detailh2\">Total: </span> 31 704 \n"
+ " <span class=\"detailh2\">Last: </span> 30.12.2021 \n"
+ "</td>";
String totalValue = "N/A";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("span");
for (Element a : elements) {
    if (a.text().contains("Total:")) {
        Node node = a.nextSibling();
        totalValue = "Total: --> " + node.toString().trim();
        break;
    }
}

// Display the value in the Console window:
System.out.println(totalValue);

The above code will display the following in the Console window:

Total: --> 31 704

How to use sed/grep to extract text between two words?

sed -e 's/Here\(.*\)String/\1/'
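For example, with a made-up input line (the capture group keeps the spaces around the middle text):

```shell
echo 'Here is some sample text String' | sed -e 's/Here\(.*\)String/\1/'
# prints ' is some sample text ' (the surrounding spaces are part of the capture)
```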

Extract text between html elements

Using the map() function, you can get all the text in the p elements, like the following:

var cities = $('.cities p').map(function () {
    return $(this).text();
}).get().join();
$('.show').html(cities);

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div class="cities">
    <div><p>Los Angeles</p><h5>Description</h5></div>
    <div><p>San Francisco</p><h5>Description</h5></div>
    <div><p>San Diego</p><h5>Description</h5></div>
    <div><p>Santa Barbara</p><h5>Description</h5></div>
    <div><p>Davis</p><h5>Description</h5></div>
    <div><p>San Jose</p><h5>Description</h5></div>
</div>
<h3>All cities</h3>
<div class="show"></div>

How to get text between two Elements in DOM object?

You can always remove the user and the website elements like this (you can clone your submitted element if you do not want the remove actions to "damage" your document):

public static void main(String[] args) throws Exception {

    Document content = Jsoup.parse(
            "<div class=\"submitted\">" +
            "  <strong><a title=\"View user profile.\" href=\"/user/1\">user1</a></strong>" +
            "  on 27/09/2011 - 15:17 " +
            "  <span class=\"via\"><a href=\"/goto/002\">www.google.com</a></span>" +
            "</div> ");

    // create a clone of the element so we do not destroy the original
    Element submitted = content.getElementsByClass("submitted").first().clone();

    // remove the elements that you do not need
    submitted.getElementsByTag("strong").remove();
    submitted.getElementsByClass("via").remove();

    // print the result (demo)
    System.out.println(submitted.text());
}

Outputs:

on 27/09/2011 - 15:17

Trying to get only the text between two strong tags

To answer the question as asked, while keeping the opportunity to scrape the "Title of Article" and "Footnotes": you can use findChildren() and then decompose() to remove the unwanted elements. From the output of this code you can extract the data you need quite easily. It works even if the text "PRESENT" and "Section Header" are not present, and it can easily be adapted to remove elements before the first strong tag if needed.

from bs4 import BeautifulSoup, element

html = """
<div><p> blah blah</p></div>
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p> blah blah</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs = {'id' : 'article'})
counter = 0
# Iterate over children.
for i in notes.findChildren():
    if i.name == "strong":
        counter += 1
        if counter == 2:
            i.parent.decompose()  # Remove the second strong tag's parent.
    if counter > 1:  # Remove all tags after the second strong tag.
        if isinstance(i, element.Tag):
            i.decompose()
print(notes)

Outputs:

<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>

</div>

Extract text from a line that is between two elements using Cheeriogs

Will this suffice?

$(str).children()[2].next.data.trim()

First time using cheerio, so please correct me if I'm wrong.


