Python - Beautiful Soup - extract text between div and sup
I have resolved it. The problem was that the data was generated by JavaScript, so static parsing methods don't work on it. I tried several approaches (including Selenium and capturing the results of XHR requests).
Finally, inside my parsed data I found a static URL linking to a separate page, where this JavaScript code executes and whose output can be parsed by static methods.
The video "Python Web Scraping Tutorial: scraping dynamic JavaScript/Ajax websites with Beautiful Soup" explains a similar solution.
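The embedded-URL trick can be sketched offline; the markup, attribute, and URL below are made up purely for illustration:

```python
import re

# Sample of the kind of markup involved: the dynamic page embeds a static
# URL (here a hypothetical iframe src) whose target serves the
# already-rendered content.
html = '''
<div id="chart">
  <iframe src="https://example.com/widget/data-page?id=42"></iframe>
</div>
'''

# Pull the embedded URL out of the raw HTML with a plain regex --
# no JavaScript execution needed.
match = re.search(r'<iframe src="([^"]+)"', html)
embedded_url = match.group(1)
print(embedded_url)  # the static page can now be fetched and parsed normally
```

Once extracted, that URL can be passed to `requests.get` and parsed with Beautiful Soup as usual.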
Extract text between two different tags beautiful soup
All the paragraphs that you want are located inside the <div class="td-post-content"> tag, along with the paragraphs for the author information. But the required <p> tags are direct children of this <div>, while the unwanted <p> tags are not direct children (they are nested inside other <div> tags).
So, you can use recursive=False to access only those tags.
Code:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
r = requests.get('https://www.the-blockchain.com/2018/06/29/mcafee-labs-report-6x-increase-in-crypto-mining-malware-incidents-in-q1-2018/', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
container = soup.find('div', class_='td-post-content')
for para in container.find_all('p', recursive=False):
    print(para.text)
Output:
Cybersecurity giant McAfee released its McAfee Labs Threat Report: June 2018 on Wednesday, outlining the growth and trends of new malware and cyber threats in Q1 2018. According to the report, coin mining malware saw a 623 percent growth in the first quarter of 2018, infecting 2.9 million machines in that period. McAfee Labs counted 313 publicly disclosed security incidents in the first three months of 2018, a 41 percent increase over the previous quarter. In particular, incidents in the healthcare sector rose 57 percent, with a significant portion involving Bitcoin-based ransomware that healthcare institutions were often compelled to pay.
Chief Scientist at McAfee Raj Samani said, “There were new revelations this quarter concerning complex nation-state cyber-attack campaigns targeting users and enterprise systems worldwide. Bad actors demonstrated a remarkable level of technical agility and innovation in tools and tactics. Criminals continued to adopt cryptocurrency mining to easily monetize their criminal activity.”
Sizeable criminal organizations are responsible for many of the attacks in recent months. In January, malware dubbed Golden Dragon attacked organizations putting together the Pyeongchang Winter Olympics in South Korea, using a malicious word attachment to install a script that would encrypt and send stolen data to an attacker’s command center. The Lazarus cybercrime ring launched a highly sophisticated Bitcoin phishing campaign called HaoBao that targeted global financial organizations, sending an email attachment that would scan for Bitcoin activity, credentials and mining data.
Chief Technology Officer at McAfee Steve Grobman said, “Cybercriminals will gravitate to criminal activity that maximizes their profit. In recent quarters we have seen a shift to ransomware from data-theft, as ransomware is a more efficient crime. With the rise in value of cryptocurrencies, the market forces are driving criminals to crypto-jacking and the theft of cryptocurrency. Cybercrime is a business, and market forces will continue to shape where adversaries focus their efforts.”
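The same direct-children filtering can be seen on a minimal inline document (the HTML below is made up for demonstration, mimicking the page's structure):

```python
from bs4 import BeautifulSoup

# Only <p> tags that are direct children of the container are returned;
# the paragraph nested inside the inner <div> is skipped.
html = '''
<div class="td-post-content">
  <p>Article paragraph 1</p>
  <p>Article paragraph 2</p>
  <div class="author-box"><p>Author bio</p></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', class_='td-post-content')
direct = [p.text for p in container.find_all('p', recursive=False)]
print(direct)  # the nested "Author bio" paragraph is not included
```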
Beautiful soup: Extract everything between two tags
One solution is to .extract() all content before the first <h1> and after the second <h1> tag:
from bs4 import BeautifulSoup
html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''
soup = BeautifulSoup(html_doc, 'html.parser')
for c in list(soup.contents):
    if c is soup.h1 or c.find_previous('h1') is soup.h1:
        continue
    c.extract()
for h1 in soup.select('h1'):
    h1.extract()
print(soup)
Prints:
Text <i>here</i> has no tag
<div>This is in a div</div>
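If you'd rather collect the content between the two <h1> tags without mutating the soup, a variant (a sketch of the same idea, not the answer's own method) is to walk next_siblings from the first <h1> and stop at the second:

```python
from bs4 import BeautifulSoup

html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Walk the siblings that follow the first <h1>; stop at the next <h1>.
# getattr is used because bare text nodes have no meaningful tag name.
between = []
for sib in soup.h1.next_siblings:
    if getattr(sib, 'name', None) == 'h1':
        break
    between.append(str(sib))
print(''.join(between))
```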
how to get text between two SETS of tags in python
You could use the .next_sibling of each of those elements.
Code:
html = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
bs = soup.find_all('b')
for each in bs:
    eachFollowingText = each.next_sibling.strip()
    print(f'{each.text.strip()} {eachFollowingText}')
Output:
Doc Type: AABB
Doc No: BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045
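The same label/value pairing can be collected into a dict instead of printed (a small variant sketch, reusing the answer's HTML):

```python
from bs4 import BeautifulSoup

html = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />'''
soup = BeautifulSoup(html, 'html.parser')

# Pair each <b> label with the text node that immediately follows it,
# stripping whitespace and the trailing colon from the label.
record = {
    b.text.strip().rstrip(':'): b.next_sibling.strip()
    for b in soup.find_all('b')
}
print(record)
```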
Extract text between two varying sections using python and beautiful soup
You can use :contains (bs4 4.7.1+) and filter out the sibling divs that come after the next h2 from the h2 of interest. You then have all the relevant parent divs and can loop over them to extract whatever info you want and format it how you like.
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.physiology.org/toc/advances/43/2')
soup = bs(r.content, 'lxml')
divs = soup.select('h2:contains("HOW WE TEACH") ~ div:not(h2:contains("ILLUMINATIONS") ~ div)')
for div in divs:
    print(div.get_text(' '), '\n')
If you don't know what the next h2 header will be then you can generalise with:
divs = soup.select('h2:contains("HOW WE TEACH") ~ div:not(h2:contains("HOW WE TEACH") ~ h2 ~ div)')
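The combinator trick is easier to see on an inline snippet (the headings and divs below are made up to mimic the journal TOC). Note that recent soupsieve versions spell the non-standard pseudo-class :-soup-contains; the older :contains spelling still works but is deprecated:

```python
from bs4 import BeautifulSoup as bs

# Inline stand-in for the TOC structure: divs for the section of interest,
# followed by the next section's heading and its divs.
html = '''
<h2>HOW WE TEACH</h2>
<div>Teach article 1</div>
<div>Teach article 2</div>
<h2>ILLUMINATIONS</h2>
<div>Illuminations article</div>
'''
soup = bs(html, 'html.parser')

# Keep divs following the "HOW WE TEACH" h2, minus those that also
# follow the "ILLUMINATIONS" h2.
divs = soup.select('h2:-soup-contains("HOW WE TEACH") ~ div'
                   ':not(h2:-soup-contains("ILLUMINATIONS") ~ div)')
print([d.get_text() for d in divs])
```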
Beautiful soup extract text between span tags
Why use selenium? It's unnecessary: only use selenium if the page is JavaScript-rendered. Otherwise use the following:
from bs4 import BeautifulSoup
html = '<span id="priceblock_dealprice" class="a-size-medium a-color-price"><span class="currencyINR"> </span> 33,990.00 </span>'
soup = BeautifulSoup(html, 'lxml')
text = soup.select_one('span.a-color-price').text.strip()
Output:
33,990.00
Python & Beautiful Soup - Extract text between a specific tag and class combination
You can use, for example, tag.find_previous to find which block a paragraph belongs to:
from bs4 import BeautifulSoup
html_doc = """\
<h2 class = "thisYear" title = "Click here to display/hide information">
"First Post Title" </h2>
<p class ="pubdate" style="display: block;"> 2022-07-11</p>
<p style="display: block;"> "First paragraph of post 1"</p>
<p style="display: block;"> "Second paragraph of post 1"</p>
<h2 class = "thisYear" title = "Click here to display/hide information">
"Second Post Title" </h2>
<p class ="pubdate" style="display: block;"> 2022-07-07</p>
<p style="display: block;"> "First paragraph of post 2"</p>
<p style="display: block;"> "Second paragraph of post 2"</p>"""
soup = BeautifulSoup(html_doc, "html.parser")
out = {}
for p in soup.select("h2.thisYear ~ p:not(.pubdate)"):
    title = p.find_previous("h2").text.strip()
    pubdate = p.find_previous(class_="pubdate").text.strip()
    out.setdefault((title, pubdate), []).append(p.text.strip())
print(out)
Prints:
{
    ('"First Post Title"', "2022-07-11"): [
        '"First paragraph of post 1"',
        '"Second paragraph of post 1"',
    ],
    ('"Second Post Title"', "2022-07-07"): [
        '"First paragraph of post 2"',
        '"Second paragraph of post 2"',
    ],
}
EDIT: To transform out into a DataFrame you can do:
import pandas as pd
df = pd.DataFrame(
    [
        (title, date, "\n".join(paragraphs))
        for (title, date), paragraphs in out.items()
    ],
    columns=["Title", "Date", "Paragraphs"],
)
print(df)
Prints:
Title Date Paragraphs
0 "First Post Title" 2022-07-11 "First paragraph of post 1"\n"Second paragraph of post 1"
1 "Second Post Title" 2022-07-07 "First paragraph of post 2"\n"Second paragraph of post 2"