Parsing HTML Using Python

I want to be able to ask it for the content/text of the div tag with class='container' inside the body tag, or something similar.

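# BeautifulSoup 4; the old "from BeautifulSoup import BeautifulSoup" is version 3 (Python 2 only)
from bs4 import BeautifulSoup

# Placeholder markup; substitute the HTML code you've written above
html = "<html><body><div class='container'>some text</div></body></html>"
parsed_html = BeautifulSoup(html, 'html.parser')
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)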

I guess you don't need performance descriptions; just read how BeautifulSoup works in its official documentation.
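The same lookup can also be written with a CSS selector. This is only a minimal sketch, assuming BeautifulSoup 4 is installed; the sample markup and the helper name container_text are made up for illustration. select_one() returns None when nothing matches, so the helper checks for that instead of letting .text raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the page from the question.
SAMPLE_HTML = """
<html><body>
  <div class="header">ignored</div>
  <div class="container">  Hello, world  </div>
</body></html>
"""

def container_text(html, css="body div.container"):
    """Return the stripped text of the first match, or None if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(css)
    return node.get_text(strip=True) if node is not None else None

print(container_text(SAMPLE_HTML))
print(container_text("<html><body></body></html>"))
```

The selector is passed as a parameter so that the same helper can be pointed at a different tag or class.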

Parsing an HTML Document with Python

Based on the code you provided, it looks like you are trying to open an HTML file that you have.

Instead of parsing the HTML file line by line as you are doing, just feed the parser the entire HTML file.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()

with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
# feed() expects a str, so pass the file contents directly
parser.feed(page)

Python's HTML parser requires the argument to feed() to be a string.
Alternatively, you could paste the entire HTML into the feed() call as a string literal. That might not be best practice, but it should read and parse the HTML.

parser.feed("THE ENTIRE HTML AS STRING HERE")
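If the goal is to collect text rather than print every event, the handler methods can keep some state. The following is a rough sketch using only the standard library; the class name ContainerTextParser, the target class "container", and the sample page string are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

class ContainerTextParser(HTMLParser):
    """Collect text found inside <div class="container"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # <div> nesting depth while inside a target div
        self.chunks = []    # text fragments found inside a target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # A nested <div> inside the target: track it so its end tag is matched.
            self.depth += 1
        elif "container" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

page = '<body><div class="container"><p>Hello</p><div>nested</div></div><div>outside</div></body>'
parser = ContainerTextParser()
parser.feed(page)   # feed() takes a str, not a file object or a parsed tree
parser.close()
print(parser.chunks)
```

Counting nested div tags is what lets the parser tell the closing tag of the target apart from the closing tags of divs inside it.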

I hope this helps

Edit:
Have you tried reading the HTML into a string as you do now and then calling str.strip() on it to remove leading and trailing whitespace?

FYI, you can also use sentence.replace(" ", "") to remove all spaces from the string.
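The two calls behave quite differently, which a small sketch makes visible. The sample string raw is made up; note that replace(" ", "") also removes the spaces between words and leaves newlines in place.

```python
raw = "  \n <p>Hello world</p> \n  "

trimmed = raw.strip()             # removes leading and trailing whitespace only
no_spaces = raw.replace(" ", "")  # removes every space character; newlines remain

print(repr(trimmed))
print(repr(no_spaces))
```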

Hope this helps

How to parse html table in python

You can use the CSS selector methods select() and select_one() to get "3text" and "6text", as shown below:

from bs4 import BeautifulSoup

html_doc = '''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text 2text</td>
    <td>3text </td>
    </tr>
    <tr>
    <td>4text 5text</td>
    <td>6text </td>
    </tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, 'lxml')
soup1 = soup.select('tr')

for i in soup1:
    print(i.select_one('td:nth-child(2)').text)

You can also use the find_all() method:

trs = soup.find('table').find_all('tr')

for i in trs:
    tds = i.find_all('td')
    print(tds[1].text)

Result:

3text 
6text
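If more than one column is needed, the whole table can be collected into a list of rows first. This is a sketch under the assumption that every row holds plain td cells; it uses a trimmed copy of the table above and the built-in html.parser, so lxml is not required.

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
  <tr><td>1text 2text</td><td>3text </td></tr>
  <tr><td>4text 5text</td><td>6text </td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# One inner list per <tr>, one whitespace-stripped string per <td>.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]
# Skip rows that are too short instead of raising IndexError.
second_column = [row[1] for row in rows if len(row) > 1]

print(rows)
print(second_column)
```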

Reading and parsing HTML files starting from a specific line using Python

You can extract the lines from line 415 to the end of the file with itertools.islice and pass that block to BeautifulSoup as a single string. Here is the code.

from itertools import islice
from bs4 import BeautifulSoup
import os

folder = "TestFolder"
for filename in os.listdir(folder):
    if filename.endswith('.html'):
        fname = os.path.join(folder, filename)
        print('Filename: {}'.format(fname))
        with open(fname, 'r', encoding='utf8') as f:
            # islice is zero-based: start=414 is line 415; stop=None reads to the end
            # (pass a stop value such as 600 instead of None to cap the range)
            block = ''.join(islice(f, 414, None))
        # Parse the whole block once so tags spanning several lines stay intact
        soup = BeautifulSoup(block, 'html.parser')
        info = soup.find_all('div', class_='panel-body')
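itertools.islice counts from zero and excludes the stop index, so it is easy to be off by one line. The sketch below wraps it in a helper that takes 1-based, inclusive line numbers; the helper name read_from_line and the in-memory stand-in for a file are assumptions made for illustration.

```python
import io
from itertools import islice

def read_from_line(handle, first_line, last_line=None):
    """Return lines first_line..last_line (1-based, inclusive) as one string."""
    return "".join(islice(handle, first_line - 1, last_line))

# A ten-line in-memory stand-in for a real file.
fake_file = io.StringIO("".join("line {}\n".format(n) for n in range(1, 11)))
print(read_from_line(fake_file, 8))      # line 8 to the end

fake_file.seek(0)
print(read_from_line(fake_file, 3, 4))   # lines 3 and 4 only
```

The returned string can then be handed to BeautifulSoup in one call.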

Parse HTML using Python

html = '''<span class="passingAlert bar">
     <span class="fold-buttons">
         <a href="#" onclick="fold();">Fold</a> | 
         <a href="#" onclick="unfold();">Unfold</a>
     </span>149 specs, 0 failed, 0 pending
  </span>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# get <span class="fold-buttons">
c = soup.find(class_="fold-buttons")

# get the text node that follows the inner span
print(c.next_sibling.strip())

Extracting text from HTML file using Python

html2text is a Python program that does a pretty good job at this.

Parsing HTML table (lxml, XPath) with enclosed tags

Use cell.xpath('string()') instead of cell.text to simply read out the string value of each cell.
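To see the difference, here is a minimal sketch assuming lxml is installed; the table fragment is made up. cell.text only returns the text before the first child element, while string() concatenates the text of all descendants.

```python
from lxml import html

fragment = html.fromstring(
    "<table><tr><td>Score: <b>911</b> points</td><td>plain</td></tr></table>"
)

cells = fragment.xpath("//td")
for cell in cells:
    # .text stops at the first child element; string() includes nested tags' text
    print(repr(cell.text), "->", repr(cell.xpath("string()")))
```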

HTML Parsing using Python

My preferred solution for parsing HTML or XML is lxml and xpath.

A quick and dirty example of how you might use xpath:

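from lxml import etree

# Read the file and close it again before parsing
with open('result.html', 'r') as f:
    data = f.read()
doc = etree.HTML(data)

# Print the text of the <td> children of every matching row
for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
    print(tr.xpath('./td/text()'))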

Yields:

['Registration Number: ', ' CS 2047103']
['Name of the Candidate: ', 'PATIL SANTOSH KUMARRAO        ']
['Examination Paper: ', 'CS - Computer Science and Information Technology']
['Marks Obtained: ', '75.67 Out of 100']
['GATE Score: ', '911']
['All India Rank: ', '34']
['No of Candidates Appeared in CS: ', '156780']
['Qualifying Marks for CS: ', '\r\n\t\t\t\t\t']
['General', 'OBC ', '(Non-Creamy)', 'SC / ST / PD ']
['31.54', '28.39', '21.03 ']

This code parses the HTML data into an lxml element tree. Using XPath, it selects all <tr> elements that have the attribute class="trmenu1". Then, for each <tr>, it selects and prints the text of its <td> children.

Parsing HTML Using Python