Parsing HTML using Python
The goal is to ask it for the content/text of the div tag with class='container' inside the body tag, or something similar.
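# BeautifulSoup 4 lives in the bs4 package; the old BeautifulSoup 3 package is no longer maintained
from bs4 import BeautifulSoup

html = "..."  # placeholder: replace with the HTML code you've written above
# name the parser explicitly so the result does not depend on what is installed
parsed_html = BeautifulSoup(html, "html.parser")
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)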
You don't need performance descriptions I guess - just read how BeautifulSoup works. Look at its official documentation.
Parsing an HTML Document with python
Based on the code you provided, it looks like you are trying to open an HTML file that you have.
Instead of parsing the HTML file line by line like you are doing, just feed the parser the entire HTML file.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()
with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()  # read the whole file into one string

# feed() expects a str, so pass the file contents directly
parser.feed(page)
Python's HTML parser requires the argument to feed() to be a string.
Alternatively, you could copy and paste the entire HTML into the feed() call as a string. It might not be best practice, but it should read and parse the HTML:
parser.feed("THE ENTIRE HTML AS STRING HERE")
I hope this helps
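Here is a minimal sketch of feeding a whole document to the parser in one call and collecting text instead of printing it. The class name DivTextCollector, its attributes, and the sample string are made up for illustration; only HTMLParser and its handler methods come from the standard library.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Hypothetical helper: collect the text inside <div class="container">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while the parser is inside the target div
        self.chunks = []  # text fragments seen inside the target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "container" in classes:
            # Entering the target div, or a div nested inside it.
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Made-up sample document, used only for illustration.
sample = (
    "<html><body><div class='container'>Hello <b>world</b></div>"
    "<p>ignored</p></body></html>"
)

collector = DivTextCollector()
collector.feed(sample)  # feed() takes the whole document as one str
collector.close()
container_text = "".join(collector.chunks).strip()
print(container_text)
```

The depth counter is there so that a div nested inside the target div does not end the collection early.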
Edit:
Have you tried getting the HTML into a string like you have and then calling str.strip() on it to remove all leading and trailing whitespace?
FYI, you can also use sentence.replace(" ", "") to remove all spaces from the string.
Hope this helps
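To make the difference between the two calls concrete, here is a small sketch on a made-up string (the variable names are only for illustration). Note that replace(" ", "") also removes the spaces inside the text, which may not be what you want for HTML.

```python
# Made-up input with leading spaces and a trailing newline.
raw = "  <p>Hello world</p>\n"

trimmed = raw.strip()             # drops leading/trailing whitespace only
no_spaces = raw.replace(" ", "")  # drops every space, including inside the text

print(repr(trimmed))
print(repr(no_spaces))
```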
How to parse html table in python
You can use the CSS selector methods select() and select_one() to get "3text" and "6text" like below:
from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''
soup = BeautifulSoup(html_doc, 'lxml')  # the 'lxml' parser requires the lxml package
soup1 = soup.select('tr')
for i in soup1:
    # second <td> of each row
    print(i.select_one('td:nth-child(2)').text)
You can also use the find_all method:
trs = soup.find('table').find_all('tr')
for i in trs:
    tds = i.find_all('td')
    print(tds[1].text)
Result:
3text
6text
Reading and parsing HTML files starting from a specific line using Python
You can extract just the block of lines you need (here, everything after the first 415 lines, up to line 600) and pass that block to BeautifulSoup to get data out of the HTML. Here is the code.
from itertools import islice
from bs4 import BeautifulSoup
import os

fname = "TestFile"
folder = "TestFolder"
for filename in os.listdir(folder):
    if filename.endswith('.html'):
        fname = os.path.join(folder, filename)
        print('Filename: {}'.format(fname))
        with open(fname, 'r', encoding='utf8') as f:
            # Skip the first 415 lines and keep everything up to line 600;
            # use islice(f, 415, None) to read to the end of the file instead.
            block = islice(f, 415, 600)
            # Parse the block as one document so tags spanning several lines stay intact.
            soup = BeautifulSoup(''.join(block), 'html.parser')
        info = soup.find_all('div', class_='panel-body')
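The start and stop arguments of islice are easy to get off by one, so here is a small sketch of how they behave on a file-like object. It uses an in-memory io.StringIO with ten made-up lines as a stand-in for a real file; the variable names are only illustrative.

```python
import io
from itertools import islice

# Stand-in for an open file: ten numbered lines (made-up data).
fake_file = io.StringIO("".join(f"line {n}\n" for n in range(1, 11)))

# islice(f, 3, 6) skips the first 3 lines and yields lines 4, 5 and 6.
block = list(islice(fake_file, 3, 6))
print(block)

fake_file.seek(0)
# A stop of None keeps reading until the end of the file.
tail = list(islice(fake_file, 3, None))
print(len(tail))

# Joining the lines gives one string that can be handed to a parser.
joined = "".join(block)
```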
Parse HTML using Python
html = '''<span class="passingAlert bar">
<span class="fold-buttons">
<a href="#" onclick="fold();">Fold</a> |
<a href="#" onclick="unfold();">Unfold</a>
</span>149 specs, 0 failed, 0 pending
</span>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# get <span class="fold-buttons">
c = soup.find(class_="fold-buttons")
# get the text node right after that `span` (next_sibling is the bs4 spelling of nextSibling)
print(c.next_sibling.strip())
Extracting text from HTML file using Python
html2text is a Python program that does a pretty good job at this.
Parsing HTML table (lxml, XPath) with enclosed tags
Use cell.xpath('string()') instead of cell.text to simply read out the string value of each cell.
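The reason .text falls short is that it only holds the text before the first child element. A rough sketch of that behavior, using the standard library's xml.etree.ElementTree as a stand-in for lxml (it exposes the same .text attribute); here itertext() plays the role that string() plays in lxml's XPath, and the cell markup is made up:

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed cell with an enclosed <b> tag.
cell = ET.fromstring("<td>Marks: <b>75.67</b> out of 100</td>")

only_leading = cell.text               # text before the first child element
full_value = "".join(cell.itertext())  # all text, including nested tags

print(repr(only_leading))
print(repr(full_value))
```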
HTML Parsing using Python
My preferred solution for parsing HTML or XML is lxml and xpath.
A quick and dirty example of how you might use xpath:
from lxml import etree

# read the whole file and close it when done
with open('result.html', 'r') as f:
    data = f.read()
doc = etree.HTML(data)
for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
    print(tr.xpath('./td/text()'))
Yields:
['Registration Number: ', ' CS 2047103']
['Name of the Candidate: ', 'PATIL SANTOSH KUMARRAO ']
['Examination Paper: ', 'CS - Computer Science and Information Technology']
['Marks Obtained: ', '75.67 Out of 100']
['GATE Score: ', '911']
['All India Rank: ', '34']
['No of Candidates Appeared in CS: ', '156780']
['Qualifying Marks for CS: ', '\r\n\t\t\t\t\t']
['General', 'OBC ', '(Non-Creamy)', 'SC / ST / PD ']
['31.54', '28.39', '21.03 ']
This code creates an ElementTree out of the HTML data. Using xpath, it selects all <tr> elements where there is an attribute of class="trmenu1". Then for each <tr> it selects and prints the text of any <td> children.
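If lxml is not at hand, the same select-rows-then-cells pattern can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration under assumptions: findall() supports a limited XPath subset (no text() step, so .text is read instead), it needs well-formed markup, and the tiny table below is made up to resemble the output above.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup; real-world HTML usually needs lxml's etree.HTML.
markup = """
<html><body><table>
  <tr class="trmenu1"><td>GATE Score: </td><td>911</td></tr>
  <tr class="other"><td>skipped</td></tr>
  <tr class="trmenu1"><td>All India Rank: </td><td>34</td></tr>
</table></body></html>
""".strip()

doc = ET.fromstring(markup)
rows = []
# Select every <tr class="trmenu1"> directly under a <table>.
for tr in doc.findall(".//table/tr[@class='trmenu1']"):
    # Collect the text of each <td> child of the row.
    rows.append([td.text for td in tr.findall("./td")])

print(rows)
```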