How can I use the Python HTMLParser library to extract data from a specific div tag?
import HTMLParser  # Python 2 module name; Python 3 moved it to html.parser

class LinksParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.recording = 0  # nesting depth inside the triggering div
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        if self.recording:
            # Already inside the triggering div: count the nested div.
            self.recording += 1
            return
        for name, value in attributes:
            if name == 'id' and value == 'remository':
                break
        else:
            # No matching id attribute: not a triggering div.
            return
        self.recording = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)
self.recording counts the number of nested div tags starting from a "triggering" one. When we're in the sub-tree rooted in a triggering tag, we accumulate the data in self.data.
The data at the end of the parse are left in self.data (a list of strings, possibly empty if no triggering tag was met). Your code from outside the class can access the list directly from the instance at the end of the parse, or you can add appropriate accessor methods for the purpose, depending on what exactly your goal is.
The class could easily be made a bit more general by using, in lieu of the constant literal strings seen in the code above ('div', 'id', and 'remository'), the instance attributes self.tag, self.attname and self.attvalue, set by __init__ from arguments passed to it -- I avoided that cheap generalization step in the code above to avoid obscuring the core points (keep track of a count of nested tags and accumulate data into a list when the recording state is active).
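The generalization described above could look roughly like the following. This is only a sketch, written for Python 3 (html.parser rather than the Python 2 HTMLParser module); the class name TagDataParser and the sample markup are made up for illustration and are not from the original answer.

```python
from html.parser import HTMLParser


class TagDataParser(HTMLParser):
    """Collect text found inside <tag attname="attvalue"> ... </tag>."""

    def __init__(self, tag, attname, attvalue):
        super().__init__()
        self.tag = tag
        self.attname = attname
        self.attvalue = attvalue
        self.recording = 0  # nesting depth inside the triggering tag
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != self.tag:
            return
        if self.recording:
            # Already inside a triggering tag: track one more nesting level.
            self.recording += 1
            return
        # attributes is a list of (name, value) pairs.
        if (self.attname, self.attvalue) in attributes:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)


# Hypothetical sample markup, for illustration only.
sample = (
    '<div id="other">skip</div>'
    '<div id="remository">keep <div>nested</div> tail</div>'
    '<div>after</div>'
)
parser = TagDataParser('div', 'id', 'remository')
parser.feed(sample)
parser.close()
print(''.join(parser.data))  # text collected from the triggering div
```

The collected strings are read straight off parser.data once the parse is over, which is the "access the list directly from the instance" option mentioned above.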
Extract data from span tag within a div tag
I would use the BeautifulSoup library. Here is how I would grab this info, knowing that you already have the HTML file:
from bs4 import BeautifulSoup

with open(html_path) as html_file:
    html_page = BeautifulSoup(html_file, 'html.parser')

div = html_page.find('div', class_='playbackTimeline__duration')
span = div.find('span', {'aria-hidden': 'true'})
text = span.get_text()
I'm not sure if it works, but it gives you an idea on how to do this kind of stuff. Check for "web scraping" if you want more information about that. :)
Extracting text from HTML file using Python
html2text is a Python program that does a pretty good job at this.
return data from HTMLParser handle_starttag
The feed() method doesn't return anything, which is why you are getting None. Instead, read the value of the data attribute after calling feed():
from HTMLParser import HTMLParser

class YoutubeLinkParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        self.data = attrs[2][1].split('/')[-1]

iframe = open('iframe.html').read()
parser = YoutubeLinkParser()
parser.feed(iframe)
print parser.data
Prints:
fY9UhIxitYM
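For comparison, here is a Python 3 sketch of the same idea that looks the src attribute up by name instead of relying on it being the third attribute. The iframe string below is a hypothetical stand-in for the contents of iframe.html, which are not shown in the question.

```python
from html.parser import HTMLParser


class YoutubeLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = None  # stays None if no <iframe src=...> is seen

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; look src up by name.
        src = dict(attrs).get('src')
        if tag == 'iframe' and src:
            # Keep the last path segment of the src URL.
            self.data = src.split('/')[-1]


# Hypothetical markup standing in for the contents of iframe.html.
iframe = ('<iframe width="560" height="315" '
          'src="https://www.youtube.com/embed/fY9UhIxitYM"></iframe>')
parser = YoutubeLinkParser()
result = parser.feed(iframe)
print(result)       # feed() has no return value
print(parser.data)  # the value stored by handle_starttag
```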
How to extract text from inside div tag using BeautifulSoup
In the edited question, the data is loaded by JavaScript, so you can't get it with BeautifulSoup alone; you need a library like Selenium.
This answer is for the original question:
If you have multiple class="subPrice" elements, you can use find_all() and get each price with .text, like below:
from bs4 import BeautifulSoup
html="""
<div class="nowPrice">
<div class="showPrice" style="color: rgb(14, 203, 129);">47,864.58</div>
<div class="subPrice">$47,864.58</div>
<div class="subPrice">$57,864.58</div>
<div class="subPrice">$67,864.58</div>
<div class="subPrice">$77,864.58</div>
</div>
"""
soup=BeautifulSoup(html,"html.parser")
for sp in soup.find_all("div",class_="subPrice"):
    print(sp.text)
output:
$47,864.58
$57,864.58
$67,864.58
$77,864.58
How to use html.parser
Here's a good start that might require specific tuning:
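import html.parser

class MyParser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.matches = []
        self.match_count = 0
        self.in_match = False  # True while inside a matching <div>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Start recording at a <div> that has a non-empty product-cost attribute.
        if tag == "div" and attrs.get("product-cost"):
            self.in_match = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_match = False

    def handle_data(self, data):
        # Only keep text found inside a matching <div>.
        if self.in_match:
            self.matches.append(data)
            self.match_count += 1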
The usage is along the lines of
request_html = the_request_method(url, ...)
parser = MyParser()
parser.feed(request_html)
for item in parser.matches:
    print(item)
Get return value from HTMLParser class to main class
You store information you want to collect on your parser instance:
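from HTMLParser import HTMLParser  # Python 2; in Python 3 use html.parser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so convert it to a dict first
        attrs = dict(attrs)
        if tag == "a" and 'href' in attrs:
            self.links.append(attrs['href'])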
Then, after you have fed HTML into the parser, you can retrieve the links attribute from the instance:
parser = MyHTMLParser()
parser.feed(html)
print parser.links
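As a rough Python 3 illustration of storing results on the parser instance, the sketch below collects href values from anchor tags. The class name LinkCollector and the sample markup are assumptions made for the example, not part of the original answer.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []  # filled in while feed() runs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('href'):
            self.links.append(attrs['href'])


# Hypothetical sample markup, for illustration only.
html = (
    '<p><a href="/first">one</a> '
    '<a name="anchor">no href</a> '
    '<a href="https://example.com/second">two</a></p>'
)
parser = LinkCollector()
parser.feed(html)
print(parser.links)
```

The anchor without an href is skipped, so only the two real links end up in parser.links.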
For parsing HTML, I can heartily recommend you look at BeautifulSoup instead:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
links = [a['href'] for a in soup.find_all('a', href=True)]
Parsing HTML using Python
So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = #the HTML code you've written above
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class':'container'}).text)
You don't need performance descriptions I guess - just read how BeautifulSoup works. Look at its official documentation.