Parsing HTML using Python
The goal is to get the content/text of the div tag with class='container' inside the body tag, or something similar.
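try:
    from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3
except ImportError:
    from bs4 import BeautifulSoup  # BeautifulSoup 4
html = "..."  # placeholder: the HTML code you've written above
# With bs4, naming a parser, e.g. BeautifulSoup(html, "html.parser"), avoids a warning
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)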
You probably don't need performance comparisons; just read how BeautifulSoup works in its official documentation.
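As a rough illustration of the lookup above, here is a minimal sketch that runs on its own. The sample_html string is made up for the example (the HTML from the question is not shown here), and the built-in "html.parser" backend is an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; stands in for the HTML from the question.
sample_html = """
<html><body>
  <div class="header">Site title</div>
  <div class="container">Hello, <b>world</b>!</div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
container = soup.body.find("div", attrs={"class": "container"})

# find() returns None when nothing matches, so guard before reading the text
container_text = container.get_text() if container is not None else None
print(container_text)
```

The None check matters in practice: if the div is missing, calling .text on the result of find() raises an AttributeError.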
Parsing an HTML Document with Python
Based on the code you provided, it looks like you are trying to open an HTML file that you have.
Instead of parsing the HTML file line by line like you are doing, just feed the parser the entire HTML file.
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    # Each handler is called by feed() as the parser walks the document
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data :", data)
parser = MyHTMLParser()
with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
# feed() expects a str, so pass the file contents directly
parser.feed(page)
Python's HTML parser requires the argument to feed() to be a string.
What you could do is copy and paste the entire HTML that you have into the feed() call. It might not be best practice, but it should read and parse the HTML:
parser.feed("THE ENTIRE HTML AS STRING HERE")
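If printing every event is too noisy, a variant of the same idea is to collect the events and inspect them afterwards. This is only a sketch: the class name CollectingParser and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical variant of MyHTMLParser that stores events instead of printing."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        # Skip whitespace-only chunks that sit between tags
        if data.strip():
            self.text_chunks.append(data.strip())


collector = CollectingParser()
collector.feed("<html><body><h1>Title</h1><p>Some text</p></body></html>")
collector.close()  # flush anything still buffered

print(collector.start_tags)
print(collector.text_chunks)
```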
I hope this helps
Edit:
Have you tried reading the HTML into a string, as you already do, and then calling str.strip() on it to remove leading and trailing whitespace?
FYI, you can also use sentence.replace(" ", "") to remove all spaces from the string.
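The two calls behave quite differently, which a small sketch makes visible. The raw string here is invented for the example.

```python
raw = "  \n <p>Hello world</p> \n "

# strip() removes leading and trailing whitespace only (spaces and newlines)
trimmed = raw.strip()

# replace(" ", "") removes every space character, including those inside the
# text, but leaves newlines untouched
no_spaces = raw.replace(" ", "")

print(repr(trimmed))
print(repr(no_spaces))
```

Because replace() also deletes the spaces between words, strip() is usually the safer choice before feeding HTML to a parser.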
Hope this helps
How to parse an HTML table in Python
You can use the CSS selector methods select() and select_one() to get "3text" and "6text" like below:
from bs4 import BeautifulSoup
html_doc='''
<table cellspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''
soup = BeautifulSoup(html_doc, 'lxml')
soup1 = soup.select('tr')
for i in soup1:
    print(i.select_one('td:nth-child(2)').text)
You can also use the find_all() method:
trs = soup.find('table').find_all('tr')
for i in trs:
    tds = i.find_all('td')
    print(tds[1].text)
Result:
3text
6text
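If you want the whole table rather than one column, one possible approach is to build a list of rows first and index into it afterwards. A minimal sketch, assuming the same table shape as above and using the built-in "html.parser" backend instead of lxml:

```python
from bs4 import BeautifulSoup

table_html = """
<table>
  <tr><td>1text 2text</td><td>3text </td></tr>
  <tr><td>4text 5text</td><td>6text </td></tr>
</table>
"""

soup = BeautifulSoup(table_html, "html.parser")

# One inner list per <tr>, one whitespace-stripped string per <td>
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

# Guard against short rows before picking the second cell
second_column = [row[1] for row in rows if len(row) > 1]
print(second_column)
```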
How to parse HTML code saved as text?
As many of the comments stated, it is possible to feed the contents of a .txt file to BeautifulSoup():
from bs4 import BeautifulSoup
path = 'path/to/file.txt'
with open(path) as f:
    text = f.read()
# Keep the parsed tree so it can be searched afterwards
soup = BeautifulSoup(text, 'lxml')
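To try this without an existing file, the sketch below writes a throwaway .txt file first and then parses it. The file content is made up, and the explicit utf-8 encoding and the "html.parser" backend are assumptions, not requirements.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a temporary .txt file so the example is self-contained (hypothetical content)
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html><head><title>Saved page</title></head><body><p>Hi</p></body></html>")
    path = tmp.name

try:
    # The file extension does not matter; only the text content is parsed
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
finally:
    os.remove(path)

page_title = soup.title.string
print(page_title)
```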
How to extract a string value in HTML with parsing in Python
Use .stripped_strings to generate the strings of the elements in your selection, then pick or slice the result. In this case, pick the first element to get Pennsylvania:
[list(x.stripped_strings)[0] for x in soup.find_all('li',{'class':'facetbox-shownrow'})]
Note: In new code, find_all() should be used; findAll() still works but is very old syntax.
To get the href:
[x.a['href'] for x in soup.find_all('li',{'class':'facetbox-shownrow'})]
Example
With multiple li tags:
from bs4 import BeautifulSoup
html="""
<li class="facetbox-shownrow ">
<a href="/bill/116th-congress/house-bill/9043/cosponsors?r=1&s=1&q=%7B%22search%22%3A%5B%22H.R.9043%22%2C%22H.R.9043%22%5D%2C%22cosponsor-state%22%3A%22Pennsylvania%22%7D" title="include this search constraint" id="facetItemcosponsor-statePennsylvania">
Pennsylvania <span id="facetItemcosponsor-statePennsylvaniacount" class="count">[1]</span> </a>
</li>
<li class="facetbox-shownrow ">
<a href="/bill/116th-congress/house-bill/9043/cosponsors?r=1&s=1&q=%7B%22search%22%3A%5B%22H.R.9043%22%2C%22H.R.9043%22%5D%2C%22cosponsor-state%22%3A%22Pennsylvania%22%7D" title="include this search constraint" id="facetItemcosponsor-statePennsylvania">
Main <span id="facetItemcosponsor-statePennsylvaniacount" class="count">[1]</span> </a>
</li>
<li class="facetbox-shownrow ">
<a href="/bill/116th-congress/house-bill/9043/cosponsors?r=1&s=1&q=%7B%22search%22%3A%5B%22H.R.9043%22%2C%22H.R.9043%22%5D%2C%22cosponsor-state%22%3A%22Pennsylvania%22%7D" title="include this search constraint" id="facetItemcosponsor-statePennsylvania">
California <span id="facetItemcosponsor-statePennsylvaniacount" class="count">[1]</span> </a>
</li>
"""
soup=BeautifulSoup(html,"html.parser")
[list(x.stripped_strings)[0] for x in soup.find_all('li',{'class':'facetbox-shownrow'})]
Output
['Pennsylvania', 'Main', 'California']
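The name and the href can also be collected in one pass. The following is a sketch under assumptions: the markup is a shortened, made-up version of the example above, and the hrefs are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup in the same shape as the example above
html = """
<li class="facetbox-shownrow "><a href="/search?state=Pennsylvania"> Pennsylvania <span class="count">[1]</span></a></li>
<li class="facetbox-shownrow "><a href="/search?state=California"> California <span class="count">[2]</span></a></li>
"""

soup = BeautifulSoup(html, "html.parser")

links = {}
for li in soup.find_all("li", class_="facetbox-shownrow"):
    # stripped_strings is a generator, so next() gets the first string
    # without building the whole list
    name = next(li.stripped_strings, None)
    if name is not None and li.a is not None:
        links[name] = li.a["href"]

print(links)
```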
Parsing HTML with Python with no regard for correct tag hierarchy
BeautifulSoup should handle this fine. It would be a case of:
from bs4 import BeautifulSoup
import requests
url = "https://example.com"  # placeholder: replace with the page you want to parse
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
Then you'd search "soup" for whatever you're looking for.
Python Beautiful Soup html.parser returns None
There is no HTML on the site. You can just print r.content directly (however, I prefer r.text, as it is a str, not a bytes object), and it will contain the string on the page. Remember, when you use developer tools in Chrome (or other browsers), the HTML you see when you inspect is not necessarily the same as what requests will get. Usually, viewing the page source directly in your browser (or printing requests.get(url).text or .content) will give a more accurate picture of the HTML you are dealing with.
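Since find() returns None when the element is not in the text that requests actually received, one defensive pattern is to check for None before reading anything from the result. A minimal sketch; the helper name first_text and both sample strings are made up for illustration.

```python
from bs4 import BeautifulSoup


def first_text(markup, tag_name):
    """Return the stripped text of the first <tag_name>, or None if it is absent."""
    soup = BeautifulSoup(markup, "html.parser")
    element = soup.find(tag_name)
    return element.get_text(strip=True) if element is not None else None


# Hypothetical response bodies: one without any tags, one with real HTML
plain_response = "just a plain string, no tags"
html_response = "<html><body><h1> Hello </h1></body></html>"

missing = first_text(plain_response, "h1")
found = first_text(html_response, "h1")
print(missing, found)
```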