How to Use the Python HTMLParser Library to Extract Data from a Specific Div Tag

How can I use the Python HTMLParser library to extract data from a specific div tag?

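from html.parser import HTMLParser  # Python 3; the Python 2 module was named HTMLParser

class LinksParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.recording = 0  # nesting depth inside the triggering div; 0 means not recording
        self.data = []      # text chunks collected while recording

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        if self.recording:
            # Already inside the triggering div: track the nested div.
            self.recording += 1
            return
        for name, value in attributes:
            if name == 'id' and value == 'remository':
                break
        else:
            # No id="remository" attribute, so this div is not the trigger.
            return
        self.recording = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)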

self.recording counts the number of nested div tags starting from a "triggering" one. When we're in the sub-tree rooted in a triggering tag, we accumulate the data in self.data.

At the end of the parse, the data is left in self.data (a list of strings, possibly empty if no triggering tag was met). Code outside the class can read the list directly from the instance once parsing is done, or you can add accessor methods for the purpose, depending on your goal.

The class could easily be made more general by replacing the literal strings 'div', 'id', and 'remository' with instance attributes self.tag, self.attname, and self.attvalue, set by __init__ from its arguments. I left that generalization out of the code above to avoid obscuring the core points: keep a count of nested tags, and accumulate data into a list while the recording state is active.
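For what it's worth, here is a rough sketch of that generalization for Python 3. The class name SubtreeTextParser and the sample markup are made up for illustration; only the attribute names self.tag, self.attname, and self.attvalue come from the description above.

```python
from html.parser import HTMLParser


class SubtreeTextParser(HTMLParser):
    """Collect the text found inside a tag that carries a given attribute value."""

    def __init__(self, tag, attname, attvalue):
        super().__init__()
        self.tag = tag
        self.attname = attname
        self.attvalue = attvalue
        self.recording = 0  # nesting depth inside the triggering tag
        self.data = []      # text chunks collected while recording

    def handle_starttag(self, tag, attributes):
        if tag != self.tag:
            return
        if self.recording:
            # Nested tag of the same kind inside the triggering one.
            self.recording += 1
            return
        # attributes is a list of (name, value) pairs.
        if (self.attname, self.attvalue) in attributes:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)


# Hypothetical sample markup, just to exercise the parser.
sample = (
    '<div id="other">skip</div>'
    '<div id="remository">keep <div>nested</div> tail</div>'
    '<div>after</div>'
)
parser = SubtreeTextParser('div', 'id', 'remository')
parser.feed(sample)
parser.close()
print(parser.data)  # ['keep ', 'nested', ' tail']
```

Text outside the matching div ("skip" and "after") is ignored, while text in the nested div is kept because the counter is still above zero at that point.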

Extract data from span tag within a div tag

I would use the BeautifulSoup library. Here is how I would grab this info, knowing that you already have the HTML file:

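from bs4 import BeautifulSoup

# html_path is the path to the HTML file you already have.
with open(html_path) as html_file:
    html_page = BeautifulSoup(html_file, 'html.parser')

# find() returns None when nothing matches, so check the results if the markup may vary.
div = html_page.find('div', class_='playbackTimeline__duration')
span = div.find('span', {'aria-hidden': 'true'})
text = span.get_text()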

I'm not sure if it works, but it gives you an idea on how to do this kind of stuff. Check for "web scraping" if you want more information about that. :)

Extracting text from HTML file using Python

html2text is a Python program that does a pretty good job at this.
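If installing a third-party package is not an option, something along these lines with the standard library's html.parser gives a rough idea of what text extraction involves. This is not how html2text works (it produces Markdown and handles far more cases); TextExtractor, html_to_text, and the sample markup are names I made up for the sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # above zero while inside a script or style tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text chunks that are not inside a skipped tag.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup):
    """Return the text chunks of markup, one per line."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return '\n'.join(extractor.parts)


# Hypothetical sample markup, just to exercise the function.
sample = (
    '<html><head><style>p {color: red}</style></head>'
    '<body><h1>Title</h1><p>Hello <b>world</b>.</p>'
    '<script>var x = 1;</script></body></html>'
)
text = html_to_text(sample)
print(text)
```

Each text chunk ends up on its own line, so inline tags such as b split a sentence across lines; a real tool deals with that kind of detail for you.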

Return data from HTMLParser handle_starttag

The feed() method doesn't return anything, which is why you are getting None. Instead, read the data attribute of the parser after calling feed():

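from html.parser import HTMLParser  # Python 3; the Python 2 module was named HTMLParser

class YoutubeLinkParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; this assumes the third
        # attribute of the tag holds the URL, as in the question's iframe.
        self.data = attrs[2][1].split('/')[-1]

with open('iframe.html') as f:
    iframe = f.read()

parser = YoutubeLinkParser()
parser.feed(iframe)
print(parser.data)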

Prints:

fY9UhIxitYM
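Indexing attributes by position breaks as soon as the attribute order changes, so a slightly more defensive variant could look the attribute up by name. This is only a sketch: the IframeSrcParser name and the sample iframe markup (including the embed URL) are assumptions of mine, since the actual iframe.html was not shown.

```python
from html.parser import HTMLParser


class IframeSrcParser(HTMLParser):
    """Remember the last path segment of the first iframe's src attribute."""

    def __init__(self):
        super().__init__()
        self.video_id = None

    def handle_starttag(self, tag, attrs):
        if tag != 'iframe' or self.video_id is not None:
            return
        # attrs is a list of (name, value) pairs; look src up by name.
        src = dict(attrs).get('src')
        if src:
            self.video_id = src.rstrip('/').split('/')[-1]


# Hypothetical markup standing in for the contents of iframe.html.
sample = (
    '<iframe width="560" height="315" '
    'src="https://www.youtube.com/embed/fY9UhIxitYM"></iframe>'
)
parser = IframeSrcParser()
result = parser.feed(sample)
print(result)           # None: feed() has no return value
print(parser.video_id)  # fY9UhIxitYM
```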

How to extract text from inside div tag using BeautifulSoup

In the edited question, the data is loaded by JavaScript, so BeautifulSoup alone can't get it; you need a library like Selenium for that.

The following answers the original question:

If you have multiple elements with class="subPrice", you can use find_all() and read each price with .text, like below:

from bs4 import BeautifulSoup

html="""
<div class="nowPrice">
<div class="showPrice" style="color: rgb(14, 203, 129);">47,864.58</div>
<div class="subPrice">$47,864.58</div>
<div class="subPrice">$57,864.58</div>
<div class="subPrice">$67,864.58</div>
<div class="subPrice">$77,864.58</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for sp in soup.find_all("div", class_="subPrice"):
    print(sp.text)

Output:

$47,864.58
$57,864.58
$67,864.58
$77,864.58

How to use html.parser

Here's a good start that might require specific tuning:

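import html.parser

class MyParser(html.parser.HTMLParser):

    def __init__(self):
        super().__init__()
        self.matches = []
        self.match_count = 0
        self.capturing = False  # True while inside a matching div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Start capturing at a div that has a non-empty product-cost attribute.
        if tag == "div" and attrs.get("product-cost"):
            self.capturing = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.capturing = False

    def handle_data(self, data):
        # The parser calls this with each text chunk; keep only text inside a match.
        if self.capturing:
            self.matches.append(data)
            self.match_count += 1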

The usage is along the lines of:

request_html = the_request_method(url, ...)  # placeholder: fetch the page however you like

parser = MyParser()
parser.feed(request_html)

for item in parser.matches:
    print(item)

Get return value from HTMLParser class to main class

You store information you want to collect on your parser instance:

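from html.parser import HTMLParser  # Python 3; the Python 2 module was named HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so convert it before looking up by name.
        attrs = dict(attrs)
        if tag == "a" and 'href' in attrs:
            self.links.append(attrs['href'])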

Then, after you have fed HTML into the parser, you can retrieve the links attribute from the instance:

parser = MyHTMLParser()
parser.feed(html)
print(parser.links)
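One detail that trips people up here: handle_starttag receives attrs as a list of (name, value) tuples, not as a dict, so name lookups need a conversion first. A small sketch to make that visible (AttrDump and the sample markup are made up for illustration):

```python
from html.parser import HTMLParser


class AttrDump(HTMLParser):
    """Record every start tag together with the raw attrs value it came with."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


dump = AttrDump()
dump.feed('<a href="/docs" class="nav">Docs</a><a name="top">Top</a>')
print(dump.seen)
# [('a', [('href', '/docs'), ('class', 'nav')]), ('a', [('name', 'top')])]

# Converting each attrs list to a dict makes lookups by name straightforward.
links = [
    dict(attrs)['href']
    for tag, attrs in dump.seen
    if tag == 'a' and 'href' in dict(attrs)
]
print(links)  # ['/docs']
```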

For parsing HTML, I can heartily recommend you look at BeautifulSoup instead:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly; bs4 warns if it has to guess
links = [a['href'] for a in soup.find_all('a', href=True)]

Parsing HTML using Python

So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

try:
    from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3
except ImportError:
    from bs4 import BeautifulSoup  # BeautifulSoup 4

html = "..."  # placeholder: the HTML code you've written above
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)

I guess you don't need performance descriptions; just read how BeautifulSoup works in its official documentation.


