How can I get the contents between two tags in a html page using Beautiful Soup?
A couple of issues:
- You are selecting the link from the table of contents instead of the header: the header is not an a tag, but just a font tag (you can always inspect these details in a browser). However, if you try soup.find_all("font", text="Risk Factors") you will get 2 results, because the link from the table of contents also has a font tag, so you need to select the second one: soup.find_all("font", text="Risk Factors")[1].
- There is a similar issue for the second header, but this time something funny happens: the header has an "invisible" space just before the closing tag, although the link from the TOC doesn't, so you need to select it like this: soup.find_all("font", text="Unresolved Staff Comments ")[0].
- The "text in between" is not a sibling (or siblings) of the elements we've selected, but a sibling of one of their ancestors. If you inspect the page source, you will see that the headings sit inside a div, inside a table cell (td), inside a table row (tr), inside a table, so we need to go 4 parent levels up: risk_factors_header.parent.parent.parent.parent.
- There are several siblings that you are interested in, so it is better to use next_siblings and iterate through all of them.
- Once you've got all of that, you can use the second heading to break the iteration once you reach it.
- Since you want the text only (ignoring all the HTML tags), use get_text() instead of content.
Ok, all together:
import requests
import bs4 as bs
file = requests.get('https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm')
soup = bs.BeautifulSoup(file.content, 'html.parser')
risk_factors_header = soup.find_all("font", text="Risk Factors")[1]
staff_comments_header = soup.find_all("font", text="Unresolved Staff Comments ")[0]
for paragraph in risk_factors_header.parent.parent.parent.parent.next_siblings:
    if paragraph == staff_comments_header.parent.parent.parent.parent:
        break
    print(paragraph.get_text())
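The parent counting above depends on the exact nesting of this one page. As a rough alternative sketch, using only Python's standard library rather than Beautiful Soup, the text between two headings can be collected in document order, which sidesteps the sibling/ancestor question. The class name BetweenHeadings, its parameters, and the sample HTML are all made up for illustration. Heading text is compared after stripping whitespace (so the "invisible" space does not matter), and skip says how many earlier matches, such as the table-of-contents link, to ignore.

```python
from html.parser import HTMLParser


class BetweenHeadings(HTMLParser):
    """Collect text that appears between two heading strings, in document order."""

    def __init__(self, start, end, skip=0):
        super().__init__()
        self.start, self.end = start, end
        self.skip = skip          # earlier start matches to ignore (e.g. TOC links)
        self.state = "before"     # "before" -> "inside" -> "done"
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()       # stripping hides stray spaces around heading text
        if not text:
            return
        if self.state == "before":
            if text == self.start:
                if self.skip == 0:
                    self.state = "inside"
                else:
                    self.skip -= 1
        elif self.state == "inside":
            if text == self.end:
                self.state = "done"
            else:
                self.chunks.append(text)


# Hypothetical miniature of the page: a TOC link, then headings nested in tables.
sample = """
<a href="#rf"><font>Risk Factors</font></a>
<table><tr><td><div><font>Risk Factors</font></div></td></tr></table>
<div><font>First risk paragraph.</font></div>
<div><font>Second risk paragraph.</font></div>
<table><tr><td><div><font>Unresolved Staff Comments </font></div></td></tr></table>
<div><font>Not wanted.</font></div>
"""

parser = BetweenHeadings("Risk Factors", "Unresolved Staff Comments", skip=1)
parser.feed(sample)
parser.close()
between_text = parser.chunks
print(between_text)
```

This only works when each heading arrives as a single text node; a heading split across several inline tags would need the pieces joined before comparing.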
Extracting data between two tags in HTML file
A file of size 50 MB isn't so big that you can't just load its contents directly into MATLAB as a string, which you can do with the function FILEREAD:
strContents = fileread('yourfile.html');
Assuming the file format you have above, you can then parse the contents with the function REGEXP (using named token capture):
expr = '<(?<tag>name|prodId|color)>''([^<>]+)''</\k<tag>>';
tokens = regexp(strContents,expr,'tokens');
tokens = vertcat(tokens{:});
And the contents of tokens using your sample file contents will be:
tokens =
'name' 'hat'
'prodId' '1829493'
'color' 'cyan'
'name' 'shirt'
'prodId' '193'
'name' 'dress'
'prodId' '18'
'color' 'dark purple'
You may then want to parse the resulting N-by-2 cell array and place the contents in a structure array with fields 'name', 'prodId', and 'color'. The difficulty is that not every entry will have all three fields. Assuming each 'name' will be followed by either a 'prodId', a 'color', or both (in the order 'prodId' then 'color'), then the following code should work for you:
s = struct('name',[],'prodId',[],'color',[]); %# Initialize structure
nTokens = size(tokens,1); %# Get number of tokens
nameIndex = find(strcmp(tokens(:,1),'name')); %# Find indices of 'name'
[s(1:numel(nameIndex)).name] = deal(tokens{nameIndex,2}); %# Fill 'name' field
%# Find and fill 'prodId' that follows a 'name':
index = strcmp(tokens(min(nameIndex+1,nTokens),1),'prodId');
[s(index).prodId] = deal(tokens{nameIndex(index)+1,2});
%# Find and fill 'color' that follows a 'name':
index = strcmp(tokens(min(nameIndex+1,nTokens),1),'color');
[s(index).color] = deal(tokens{nameIndex(index)+1,2});
%# Find and fill 'color' that follows a 'prodId':
index = strcmp(tokens(min(nameIndex+2,nTokens),1),'color');
[s(index).color] = deal(tokens{min(nameIndex(index)+2,nTokens),2});
And the contents of s using your sample file contents will be:
>> s(1)
name: 'hat'
prodId: '1829493'
color: 'cyan'
>> s(2)
name: 'shirt'
prodId: '193'
color: []
>> s(3)
name: 'dress'
prodId: '18'
color: 'dark purple'
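For readers working in Python rather than MATLAB, the same two steps (tokenise with a backreferenced tag name, then group the tokens into records) might be sketched as follows. The sample string is only a guess at the file format, reconstructed from the regular expression and the output above (values wrapped in single quotes); the names sample, pattern, and records are illustrative.

```python
import re

# Assumed file format, inferred from the pattern: <tag>'value'</tag>
sample = """
<name>'hat'</name><prodId>'1829493'</prodId><color>'cyan'</color>
<name>'shirt'</name><prodId>'193'</prodId>
<name>'dress'</name><prodId>'18'</prodId><color>'dark purple'</color>
"""

# (?P=tag) is a backreference: the closing tag must repeat the opening tag name.
pattern = re.compile(r"<(?P<tag>name|prodId|color)>'([^<>]+)'</(?P=tag)>")

records = []
for tag, value in pattern.findall(sample):
    if tag == "name":
        # Each 'name' starts a new record; missing fields stay None.
        records.append({"name": value, "prodId": None, "color": None})
    elif records:
        records[-1][tag] = value

for record in records:
    print(record)
```

Starting a new record on every 'name' avoids the index arithmetic, at the cost of assuming that 'name' always comes first within an entry.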
PHP Extract data between specific tags from an html file
It's not worth going into detail on why your regex doesn't work, IMO. For general regex knowledge, though: a . doesn't match newlines (unless the s modifier is used), and .* inside a character class just matches either of those 2 literal characters.
With DOMDocument you need to go further into the DOM tree to get the value. You can use XPath for this.
$html = '<tr>
<td>Income</td>
<td id="income">
<font color="green">
<span data-c="2250000">0.0225 RP</span>
</font>
</td>
</tr>';
$dom = new domdocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
echo $xpath->query('//tr/td[@id="income"]/font/span')[0]->nodeValue;
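The same path-based lookup can be sketched in Python with the standard library's ElementTree, which supports a limited XPath subset. This is an illustration rather than a drop-in replacement: ElementTree only accepts well-formed markup (which this fragment happens to be), and the variable names are made up.

```python
import xml.etree.ElementTree as ET

html = """<tr>
<td>Income</td>
<td id="income">
<font color="green">
<span data-c="2250000">0.0225 RP</span>
</font>
</td>
</tr>"""

# ElementTree only accepts well-formed markup, which this fragment happens to be.
row = ET.fromstring(html)

# The root element is the <tr>, so the path starts at its children.
span = row.find("./td[@id='income']/font/span")
income_text = span.text
income_raw = span.get("data-c")
print(income_text, income_raw)
```

For real-world pages that are not well-formed, a forgiving HTML parser would be needed instead.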
Scraping/extracting content between two different html tags using python
You could use a regular expression:
content = re.search(
    '<h2>WEB TRAFFIC BLOCK LIST</h2>(.*?)<h2>EMAILS</h2>',
    html,
    re.DOTALL
).group(1)
Or with Beautiful Soup, collect the nodes in between the start and end tags:
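from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
start = soup.find('h2', text='WEB TRAFFIC BLOCK LIST')
end = soup.find('h2', text='EMAILS')
content = ''
item = start.next_sibling  # next_sibling replaces the deprecated nextSibling
# Stop at the end heading, or when the siblings run out
while item is not None and item != end:
    content += str(item)
    item = item.next_sibling
print(content)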
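One thing to watch with the regular-expression version: re.search returns None when either heading is missing, so calling .group(1) directly raises AttributeError. A guarded helper might look like the sketch below; the function name text_between and the sample HTML are illustrative only.

```python
import re


def text_between(html, start_marker, end_marker):
    """Return the markup between two literal markers, or None if either is missing."""
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    match = re.search(pattern, html, re.DOTALL)  # DOTALL lets . cross newlines
    return match.group(1) if match else None


html = """<h2>WEB TRAFFIC BLOCK LIST</h2>
<p>example.com</p>
<p>example.org</p>
<h2>EMAILS</h2>
<p>someone@example.com</p>"""

block = text_between(html, "<h2>WEB TRAFFIC BLOCK LIST</h2>", "<h2>EMAILS</h2>")
missing = text_between(html, "<h2>NO SUCH HEADING</h2>", "<h2>EMAILS</h2>")
print(repr(block))
print(missing)
```

re.escape keeps any special characters in the markers from being read as regex syntax.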
How to extract text from 2 tags of html or replace first and last tag
Use
re.sub(r'^<p>(.*)</p>$', r'\1', d, flags=re.S)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
<p> '<p>'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
.* any character, including \n because of the re.S flag (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
</p> '</p>'
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
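A short check of how the pattern behaves, using a made-up input string d with a newline and a nested pair of p tags:

```python
import re

d = "<p>first line\nsecond</p><p>another paragraph</p>"

# Greedy .* with re.S runs to the LAST </p>, so only the outer pair is removed.
stripped = re.sub(r'^<p>(.*)</p>$', r'\1', d, flags=re.S)
print(stripped)

# Without re.S the dot stops at the newline, nothing matches, and d comes back unchanged.
unchanged = re.sub(r'^<p>(.*)</p>$', r'\1', d)
print(unchanged == d)
```

The inner tags survive because only the first opening tag and the final closing tag sit outside the capture group.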
Extract text between two (different) HTML tags using jsoup
Use the Element.nextSibling() method. In the example code below, the desired values are collected into a List<String>:
String html = "<td>\n"
+ " <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
+ " <span class=\"detailh2\">Total: </span> 31 704 \n"
+ " <span class=\"detailh2\">Last: </span> 30.12.2021 \n"
+ "</td>";
List<String> valuesList = new ArrayList<>();
Document doc = Jsoup.parse(html);
Elements elements = doc.select("span");
for (Element a : elements) {
    Node node = a.nextSibling();
    valuesList.add(node.toString().trim());
}
// Display valuesList in the Console window:
for (String value : valuesList) {
    System.out.println(value);
}
It will display the following into the Console Window:
2 145
31 704
30.12.2021
If you prefer to just get the value for Total: then you can try this:
String html = "<td>\n"
+ " <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
+ " <span class=\"detailh2\">Total: </span> 31 704 \n"
+ " <span class=\"detailh2\">Last: </span> 30.12.2021 \n"
+ "</td>";
String totalValue = "N/A";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("span");
for (Element a : elements) {
    // The span's own text is enough for the check; before(...) is not needed here.
    if (a.text().contains("Total:")) {
        Node node = a.nextSibling();
        totalValue = "Total: --> " + node.toString().trim();
        break;
    }
}
// Display the value in the Console window:
System.out.println(totalValue);
The above code will display the following within the Console Window:
Total: --> 31 704
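The same "text node that follows each span" idea can be sketched in Python with the standard library's HTMLParser. This is not jsoup and not a translation of its API; the class name SpanFollowers and its attributes are made up for illustration. It builds a mapping from each span's label to the value that follows it, so a single value such as Total: can be looked up directly.

```python
from html.parser import HTMLParser


class SpanFollowers(HTMLParser):
    """Map each <span>'s text to the text node that follows its closing tag."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.label = None      # most recent span text still waiting for a value
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_span:
            self.label = text
        elif self.label is not None and text:
            self.values[self.label] = text
            self.label = None


html = """<td>
 <span class="detailh2" style="margin:0px">This month: </span>2 145
 <span class="detailh2">Total: </span> 31 704
 <span class="detailh2">Last: </span> 30.12.2021
</td>"""

parser = SpanFollowers()
parser.feed(html)
parser.close()
values = parser.values
print(values["Total:"])
```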
Beautiful soup: Extract everything between two tags
One solution is to .extract() all content in front of the first <h1> and after the second <h1> tag:
from bs4 import BeautifulSoup
html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''
soup = BeautifulSoup(html_doc, 'html.parser')
for c in list(soup.contents):
    if c is soup.h1 or c.find_previous('h1') is soup.h1:
        continue
    c.extract()
for h1 in soup.select('h1'):
    h1.extract()
print(soup)
Prints:
Text <i>here</i> has no tag
<div>This is in a div</div>