Extracting Data Between Two Tags in HTML File

How can I get the contents between two tags in an HTML page using Beautiful Soup?

A couple of issues:

  • You are selecting the link from the table of contents instead of the header. The header is not an a tag but a font tag (you can check details like this by inspecting the page in a browser). However, soup.find_all("font", text="Risk Factors") returns 2 results, because the table-of-contents link also contains a font tag, so you need to select the second one: soup.find_all("font", text="Risk Factors")[1].
  • The second header has a similar issue, with a twist: the header has an "invisible" trailing space just before the closing tag, while the link in the TOC does not, so you need to select it like this: soup.find_all("font", text="Unresolved Staff Comments ")[0].
  • The "text in between" is not a sibling of the elements we have selected, but a sibling of one of their ancestors. If you inspect the page source, you will see that each heading sits inside a div, inside a table cell (td), inside a table row (tr), inside a table, so we need to go 4 parent levels up: risk_factors_header.parent.parent.parent.parent.
  • There are several siblings that you are interested in, so it is better to use next_siblings and iterate through all of them.
  • Once you have all of that, you can use the table that wraps the second heading to break out of the iteration when you reach it.
  • Since you only want the text (ignoring all the HTML tags), you can use get_text() instead of content.

Ok, all together:

import requests
import bs4 as bs

file = requests.get('https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm')
soup = bs.BeautifulSoup(file.content, 'html.parser')

# [1] skips the table-of-contents link; note the trailing space in the second heading
risk_factors_header = soup.find_all("font", text="Risk Factors")[1]
staff_comments_header = soup.find_all("font", text="Unresolved Staff Comments ")[0]

# Climb 4 levels (div, td, tr, table), then walk the siblings of that table
for paragraph in risk_factors_header.parent.parent.parent.parent.next_siblings:
    # Stop once the table that wraps the second heading is reached
    if paragraph == staff_comments_header.parent.parent.parent.parent:
        break

    print(paragraph.get_text())
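
The exact-match lookups above are sensitive to details such as the trailing space in the second heading. As a rough alternative, here is a minimal sketch that uses only the standard library's html.parser (a swapped-in technique, not Beautiful Soup) and compares heading text after collapsing whitespace. The helper names (squash, SectionText, text_between), the start_occurrence parameter, and the sample markup are all hypothetical and only imitate the structure described in the answer; nested font tags are not handled.

```python
from html.parser import HTMLParser


def squash(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


class SectionText(HTMLParser):
    """Collect the text found between two headings (hypothetical helper)."""

    def __init__(self, start_text, end_text, heading_tag="font", start_occurrence=0):
        super().__init__()
        self.start_text = squash(start_text)
        self.end_text = squash(end_text)
        self.heading_tag = heading_tag
        self.matches_to_skip = start_occurrence  # earlier matches to ignore, e.g. TOC links
        self.state = "before"                    # "before" -> "inside" -> "done"
        self.heading_buffer = None               # text of the heading_tag element being read
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.heading_tag:
            self.heading_buffer = []

    def handle_data(self, data):
        if self.heading_buffer is not None:
            self.heading_buffer.append(data)
        elif self.state == "inside":
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag != self.heading_tag or self.heading_buffer is None:
            return
        text = squash("".join(self.heading_buffer))
        self.heading_buffer = None
        if self.state == "before":
            if text == self.start_text:
                if self.matches_to_skip == 0:
                    self.state = "inside"
                else:
                    self.matches_to_skip -= 1
        elif self.state == "inside":
            if text == self.end_text:
                self.state = "done"
            else:
                self.chunks.append(text)  # ordinary text that happens to use heading_tag


def text_between(html, start_text, end_text, **options):
    parser = SectionText(start_text, end_text, **options)
    parser.feed(html)
    parser.close()
    return [squash(chunk) for chunk in parser.chunks if chunk.strip()]


# Made-up markup: two TOC links first, then the real headings wrapped in tables
sample = """
<a href="#rf"><font>Risk Factors</font></a>
<a href="#usc"><font>Unresolved Staff Comments</font></a>
<table><tr><td><div><font>Risk Factors</font></div></td></tr></table>
<div><font>Competition may reduce revenue.</font></div>
<div>Plain text paragraph.</div>
<table><tr><td><div><font>Unresolved Staff Comments </font></div></td></tr></table>
<div>None.</div>
"""

section = text_between(sample, "Risk Factors", "Unresolved Staff Comments",
                       start_occurrence=1)
print(section)
```

Because both sides are normalized before comparison, the trailing space in the second heading no longer matters, and start_occurrence=1 plays the same role as the [1] index in the answer's code.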

Extracting data between two tags in HTML file

A file of size 50 MB isn't so big that you can't just load its contents directly into MATLAB as a string, which you can do with the function FILEREAD:

strContents = fileread('yourfile.html');

Assuming the file format you have above, you can then parse the contents with the function REGEXP (using named token capture):

expr = '<(?<tag>name|prodId|color)>''([^<>]+)''</\k<tag>>';
tokens = regexp(strContents,expr,'tokens');
tokens = vertcat(tokens{:});

And the contents of tokens using your sample file contents will be:

tokens = 

    'name'      'hat'
    'prodId'    '1829493'
    'color'     'cyan'
    'name'      'shirt'
    'prodId'    '193'
    'name'      'dress'
    'prodId'    '18'
    'color'     'dark purple'

You may then want to parse the resulting N-by-2 cell array and place the contents in a structure array with fields 'name', 'prodId', and 'color'. The difficulty is that not every entry will have all three fields. Assuming each 'name' will be followed by either a 'prodId', a 'color', or both (in the order 'prodId' then 'color'), then the following code should work for you:

s = struct('name',[],'prodId',[],'color',[]);  %# Initialize structure
nTokens = size(tokens,1); %# Get number of tokens
nameIndex = find(strcmp(tokens(:,1),'name')); %# Find indices of 'name'
[s(1:numel(nameIndex)).name] = deal(tokens{nameIndex,2}); %# Fill 'name' field

%# Find and fill 'prodId' that follows a 'name':
index = strcmp(tokens(min(nameIndex+1,nTokens),1),'prodId');
[s(index).prodId] = deal(tokens{nameIndex(index)+1,2});

%# Find and fill 'color' that follows a 'name':
index = strcmp(tokens(min(nameIndex+1,nTokens),1),'color');
[s(index).color] = deal(tokens{nameIndex(index)+1,2});

%# Find and fill 'color' that follows a 'prodId':
index = strcmp(tokens(min(nameIndex+2,nTokens),1),'color');
[s(index).color] = deal(tokens{min(nameIndex(index)+2,nTokens),2});

And the contents of s using your sample file contents will be:

>> s(1)

name: 'hat'
prodId: '1829493'
color: 'cyan'

>> s(2)

name: 'shirt'
prodId: '193'
color: []

>> s(3)

name: 'dress'
prodId: '18'
color: 'dark purple'
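
For comparison, the same idea (a named group for the tag, a backreference for the closing tag, then grouping the matches by 'name') can be sketched in Python. This is only an illustration: the sample text is an assumption reconstructed from the tokens shown above, since the original file is not reproduced here, and the record layout (a list of dicts) is a choice made for this sketch.

```python
import re

# Assumed file contents, reconstructed from the tokens listed in the answer
sample = """
<name>'hat'</name>
<prodId>'1829493'</prodId>
<color>'cyan'</color>
<name>'shirt'</name>
<prodId>'193'</prodId>
<name>'dress'</name>
<prodId>'18'</prodId>
<color>'dark purple'</color>
"""

# (?P<tag>...) names the opening tag; (?P=tag) requires the same closing tag
pattern = re.compile(r"<(?P<tag>name|prodId|color)>'([^<>]+)'</(?P=tag)>")

records = []
for match in pattern.finditer(sample):
    tag, value = match.group("tag"), match.group(2)
    if tag == "name":
        # Every 'name' starts a new record; missing fields stay None
        records.append({"name": value, "prodId": None, "color": None})
    elif records:
        records[-1][tag] = value

for record in records:
    print(record)
```

Starting a new record at each 'name' avoids the index arithmetic, at the cost of assuming that a 'name' always comes first in each entry.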

PHP Extract data between specific tags from an html file

It's not worth going into why your regex doesn't work, IMO. For general regex knowledge, though: a . doesn't match newlines (unless the s modifier is used), and .* inside a character class only matches either of those 2 literal characters.
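
Both regex points can be checked quickly. The sketch below uses Python's re module rather than PHP, on the assumption that the two engines agree on these particular rules; re.DOTALL plays the role of the s modifier, and the sample string is made up.

```python
import re

html = "<td id=\"income\">\n0.0225 RP\n</td>"

# Without the flag, . stops at the newline, so the pattern cannot reach </td>
no_flag = re.search(r"<td[^>]*>(.*)</td>", html)

# With DOTALL (the "s" modifier), . also matches newlines
with_flag = re.search(r"<td[^>]*>(.*)</td>", html, re.DOTALL)

# Inside a character class, . and * are plain literal characters
literal_chars = re.findall(r"[.*]", "a.b*c")

print(no_flag)
print(repr(with_flag.group(1)))
print(literal_chars)
```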

With DOMDocument you need to go further into the DOM tree to get the value. You can use XPath for this.

$html = '<tr>
<td>Income</td>
<td id="income">
<font color="green">
<span data-c="2250000">0.0225 RP</span>
</font>
</td>
</tr>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// item(0) is the first (and here the only) span that matches the path
echo $xpath->query('//tr/td[@id="income"]/font/span')->item(0)->nodeValue;
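
A rough Python analogue of the same path-based lookup, using only the standard library's xml.etree.ElementTree and its limited XPath subset. This works here only because the snippet happens to be well-formed XML; for real-world HTML an HTML-aware parser would be needed. Reading the data-c attribute is an extra added for this sketch.

```python
import xml.etree.ElementTree as ET

html = """<tr>
<td>Income</td>
<td id="income">
<font color="green">
<span data-c="2250000">0.0225 RP</span>
</font>
</td>
</tr>"""

# The root element is the <tr>, so the path starts at its children
row = ET.fromstring(html)
span = row.find("./td[@id='income']/font/span")

print(span.text)           # text content of the span
print(span.get("data-c"))  # raw value stored in the attribute
```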

Scraping/extracting content between two different html tags using python

You could use a regular expression:

import re

content = re.search(
    '<h2>WEB TRAFFIC BLOCK LIST</h2>(.*?)<h2>EMAILS</h2>',
    html,
    re.DOTALL
).group(1)

Or with Beautiful Soup, collect the nodes in between the start and end tags:

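from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
start = soup.find('h2', text='WEB TRAFFIC BLOCK LIST')
end = soup.find('h2', text='EMAILS')
content = ''
item = start.next_sibling

# Walk the siblings after the first heading; stop at the second heading
# (or at the end of the sibling list if that heading is not a sibling)
while item is not None and item is not end:
    content += str(item)
    item = item.next_sibling

print(content)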
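
Both snippets assume the two headings are actually present: re.search returns None when there is no match, so calling .group(1) on the result would then raise an AttributeError. Here is a minimal sketch of the regular-expression variant with that case handled. The helper name between and the sample page are made up for illustration.

```python
import re


def between(html, start_marker, end_marker):
    """Return the text between two literal markers, or None if either is missing."""
    # re.escape keeps characters such as . or ? in the markers literal
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1) if match else None


page = (
    "<h2>WEB TRAFFIC BLOCK LIST</h2>\n"
    "<p>example.com</p>\n"
    "<h2>EMAILS</h2>\n"
    "<p>a@example.com</p>"
)

block = between(page, "<h2>WEB TRAFFIC BLOCK LIST</h2>", "<h2>EMAILS</h2>")
missing = between(page, "<h2>NO SUCH HEADING</h2>", "<h2>EMAILS</h2>")

print(repr(block))
print(missing)
```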

How to extract text from 2 tags of html or replace first and last tag

Use

re.sub(r'^<p>(.*)</p>$', r'\1', d, flags=re.S)


EXPLANATION

^       the beginning of the string
<p>     the literal '<p>'
(       group and capture to \1:
  .*      any character, including \n because of the re.S flag
          (0 or more times, matching the most amount possible)
)       end of \1
</p>    the literal '</p>'
$       the end of the string, or just before a \n at the very end
        of the string
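
A short check of how the substitution behaves on a few inputs. The wrapper name strip_outer_p and the sample strings are made up for illustration. Because .* is greedy, only the first <p> and the last </p> are removed.

```python
import re


def strip_outer_p(d):
    # Same call as in the answer, wrapped in a function for reuse
    return re.sub(r'^<p>(.*)</p>$', r'\1', d, flags=re.S)


single = strip_outer_p("<p>line one\nline two</p>")
several = strip_outer_p("<p>first</p>\n<p>second</p>")
untouched = strip_outer_p("<div><p>inner</p></div>")

print(repr(single))
print(repr(several))
print(repr(untouched))
```

Strings that do not both start with <p> and end with </p> are returned unchanged, since the anchored pattern does not match them.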

Extract text between two (different) HTML tags using jsoup

Use the Element.nextSibling() method. In the example code below, the desired values are collected into a List<String>:

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;

String html = "<td>\n"
        + " <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
        + " <span class=\"detailh2\">Total: </span> 31 704 \n"
        + " <span class=\"detailh2\">Last: </span> 30.12.2021 \n"
        + "</td>";

List<String> valuesList = new ArrayList<>();

Document doc = Jsoup.parse(html);
Elements elements = doc.select("span");
for (Element a : elements) {
    // The value is the text node that directly follows each span
    Node node = a.nextSibling();
    valuesList.add(node.toString().trim());
}

// Display valuesList in the Console window:
for (String value : valuesList) {
    System.out.println(value);
}

It will display the following in the Console Window:

2 145
31 704
30.12.2021

If you prefer to just get the value for Total: then you can try this:

String html = "<td>\n"
        + " <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
        + " <span class=\"detailh2\">Total: </span> 31 704 \n"
        + " <span class=\"detailh2\">Last: </span> 30.12.2021 \n"
        + "</td>";
String totalValue = "N/A";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("span");
for (Element a : elements) {
    // Check the span's own text; before() would insert HTML into the document
    if (a.text().contains("Total:")) {
        Node node = a.nextSibling();
        totalValue = "Total: --> " + node.toString().trim();
        break;
    }
}

// Display the value in the Console window:
System.out.println(totalValue);

The above code will display the following in the Console Window:

Total: --> 31 704
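
For readers working in Python, a similar "text that follows each span" lookup can be sketched with the standard library's xml.etree.ElementTree, where the text after an element is exposed as its tail. This is only an analogue of the jsoup approach, and it relies on the snippet being well-formed XML; the dictionary layout is a choice made for this sketch.

```python
import xml.etree.ElementTree as ET

html = (
    "<td>\n"
    ' <span class="detailh2" style="margin:0px">This month: </span>2 145 \n'
    ' <span class="detailh2">Total: </span> 31 704 \n'
    ' <span class="detailh2">Last: </span> 30.12.2021 \n'
    "</td>"
)

cell = ET.fromstring(html)

# Map each label (the span text without its colon) to the text after the span
values = {
    span.text.strip().rstrip(":"): (span.tail or "").strip()
    for span in cell.iter("span")
}

print(values)
print(values.get("Total", "N/A"))
```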

Beautiful soup: Extract everything between two tags

One solution is to .extract() all content before the first <h1> and after the second <h1> tag:

from bs4 import BeautifulSoup

html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''

soup = BeautifulSoup(html_doc, 'html.parser')

for c in list(soup.contents):
    if c is soup.h1 or c.find_previous('h1') is soup.h1:
        continue
    c.extract()

for h1 in soup.select('h1'):
    h1.extract()

print(soup)

Prints:

Text <i>here</i> has no tag
<div>This is in a div</div>

