Parsing HTML in Python

Parsing HTML using Python

The goal is to ask the parser for the content/text of the div tag with class='container' inside the body tag, or something similar.

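try:
    from bs4 import BeautifulSoup  # BeautifulSoup 4
except ImportError:
    from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3 (Python 2 only)

html = "..."  # placeholder: the HTML code you've written above
# With bs4 you can also name the parser explicitly: BeautifulSoup(html, "html.parser")
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)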

You probably don't need performance comparisons; just read how BeautifulSoup works in its official documentation.
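For a version that runs end to end, here is a minimal sketch. The sample markup and the names sample_html and container_text are made up for illustration, and it assumes bs4 is installed and that the built-in "html.parser" backend is good enough for your page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the asker's page.
sample_html = """
<html>
  <body>
    <div class="header">Site title</div>
    <div class="container">
      <p>Hello, world!</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns None when nothing matches, so guard before reading the text.
container = soup.body.find("div", attrs={"class": "container"})
container_text = container.get_text(strip=True) if container is not None else None

print(container_text)
```

The None check matters in practice: if the div is missing, calling .text on the result raises AttributeError.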

Parsing an HTML Document with python

Based on the code you provided, it looks like you are trying to open an HTML file that you have.

Instead of parsing the HTML file line by line as you are doing, just feed the parser the entire HTML file.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()

with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()  # read the whole file into one string

parser.feed(page)  # feed() expects a str, not an lxml tree

Python's HTML parser requires the argument to feed() to be a string.
What you could do is copy and paste the entire HTML that you have into feed(). It might not be best practice, but it should read and parse the HTML:

parser.feed("THE ENTIRE HTML AS STRING HERE")
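If you want to stay with the standard library and pull out the text of one element, something along these lines might work. It is only a sketch: ContainerTextParser, VOID_TAGS and the sample string are hypothetical names introduced here, and the depth counting assumes every non-void tag in the input is properly closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContainerTextParser(HTMLParser):
    """Collects the text found inside <div class="container"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside the target div; 0 means "outside"
        self.chunks = []  # text fragments collected while inside the div

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif tag == "div" and "container" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


sample = ('<html><body><div class="container"><p>Hello, <b>world</b></p></div>'
          '<p>outside</p></body></html>')

parser = ContainerTextParser()
parser.feed(sample)  # feed() takes the whole document as one str
parser.close()

container_text = "".join(parser.chunks).strip()
print(container_text)
```

For messy real-world pages the depth counter can drift, which is one reason BeautifulSoup is usually the easier choice.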

I hope this helps

Edit:
Have you tried reading the HTML into a string as you do now and then calling str.strip() on it to remove leading and trailing whitespace?

FYI, you can also use sentence.replace(" ", "") to remove all spaces from the string.

Hope this helps

How to parse html table in python

You can use the CSS selector methods select() and select_one() to get "3text" and "6text", as shown below:

from bs4 import BeautifulSoup
html_doc='''
<table cellspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text 2text</td>
    <td>3text </td>
    </tr>
    <tr>
    <td>4text 5text</td>
    <td>6text </td>
    </tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, 'lxml')
soup1 = soup.select('tr')

for i in soup1:
    print(i.select_one('td:nth-child(2)').text)

You can also use the find_all() method:

trs = soup.find('table').find_all('tr')

for i in trs:
    tds = i.find_all('td')
    print(tds[1].text)

Result:

3text 
6text
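If you need the whole table rather than one column, a nested list comprehension is one possible approach. This is a sketch that reuses the table from above; the names rows and second_column are mine, and it uses the built-in "html.parser" backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

html_doc = """
<table cellspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text 2text</td>
    <td>3text </td>
    </tr>
    <tr>
    <td>4text 5text</td>
    <td>6text </td>
    </tr>
</tbody></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# One inner list per <tr>, one stripped string per <td>.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

# Second column only, matching the select_one('td:nth-child(2)') version.
second_column = [row[1] for row in rows]

print(rows)
print(second_column)
```

Using get_text(strip=True) also removes the trailing space that shows up after "3text" in the printed result.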

How to parse html code saved as text?

As many of the comments stated, it is possible to feed a .txt file to BeautifulSoup():

from bs4 import BeautifulSoup

path = 'path/to/file.txt'
with open(path) as f:
    text = f.read()
soup = BeautifulSoup(text, 'lxml')  # keep the parsed tree so you can query it
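To try this without an existing file, the sketch below writes a small snippet to a temporary .txt file first. The file name page.txt and its contents are made up, and it assumes the file is UTF-8 encoded and uses the built-in "html.parser" backend.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small hypothetical HTML snippet to a .txt file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>Saved page</title></head>"
                "<body><p>Hi</p></body></html>")

    # The file extension does not matter; BeautifulSoup only sees the string.
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

title_text = soup.title.string
print(title_text)
```

Passing encoding explicitly to open() avoids surprises on systems where the default encoding is not UTF-8.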

How to extract string value in html with parsing in python

Use .stripped_strings to generate the strings of each element in your selection, convert them to a list, and pick or slice the result. In this case, pick the first element to get Pennsylvania:

[list(x.stripped_strings)[0] for x in soup.find_all('li',{'class':'facetbox-shownrow'})]

Note: In new code, use find_all(); findAll() still works but is very old syntax.

To get the href:

[x.a['href'] for x in soup.find_all('li',{'class':'facetbox-shownrow'})]

Example

With multiple li tags:

from bs4 import BeautifulSoup

html="""
<li class="facetbox-shownrow ">
    <a href="/bill/116th-congress/house-bill/9043/cosponsors?r=1&s=1&q=%7B%22search%22%3A%5B%22H.R.9043%22%2C%22H.R.9043%22%5D%2C%22cosponsor-state%22%3A%22Pennsylvania%22%7D" title="include this search constraint" id="facetItemcosponsor-statePennsylvania">
        Pennsylvania        <span id="facetItemcosponsor-statePennsylvaniacount" class="count">[1]</span>    </a>
</li>
<li class="facetbox-shownrow ">
    <a href="/bill/116th-congress/house-bill/9043/cosponsors?r=1&s=1&q=%7B%22search%22%3A%5B%22H.R.9043%22%2C%22H.R.9043%22%5D%2C%22cosponsor-state%22%3A%22Pennsylvania%22%7D" title="include this search constraint" id="facetItemcosponsor-statePennsylvania">
        Main        <span id="facetItemcosponsor-statePennsylvaniacount" class="count">[1]</span>    </a>
</li>
<li class="facetbox-shownrow ">
    <a href="/bill/116th-congress/house-bill/9043/cosponsors?r=1&s=1&q=%7B%22search%22%3A%5B%22H.R.9043%22%2C%22H.R.9043%22%5D%2C%22cosponsor-state%22%3A%22Pennsylvania%22%7D" title="include this search constraint" id="facetItemcosponsor-statePennsylvania">
        California        <span id="facetItemcosponsor-statePennsylvaniacount" class="count">[1]</span>    </a>
</li>
"""
soup=BeautifulSoup(html,"html.parser")

[list(x.stripped_strings)[0] for x in soup.find_all('li',{'class':'facetbox-shownrow'})]

Output

['Pennsylvania', 'Main', 'California']

Parsing HTML with Python with no regard for correct tag hierarchy

BeautifulSoup should do this fine.

It would be a case of:

from bs4 import BeautifulSoup
import requests

r = requests.get(url)  # url: the address of the page you want to fetch
soup = BeautifulSoup(r.text,'html.parser')

Then you'd search soup for whatever you're looking for.
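To see that searching does not depend on the tags being nested correctly, here is a small offline sketch. The broken_html string is invented for illustration (unclosed p and span tags), and it skips the network request so it can run anywhere.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with unclosed tags: the <p> and <span> elements are never closed.
broken_html = (
    '<div><p>First<p>Second <a href="/x">link</a></div>'
    '<span>tail <a href="/y">other</a>'
)

soup = BeautifulSoup(broken_html, "html.parser")

# find_all() walks whatever tree the parser managed to build,
# so the links are found even though the hierarchy is broken.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(hrefs)
```

Different parser backends may repair the tree differently, so the exact nesting can vary even though the search still finds the same tags.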

Python Beautiful Soup html.parser returns none

There is no HTML on the page. You can just print r.content directly (I prefer r.text, as it is a str rather than a bytes object), and it will contain the string on the page.

Remember, when you use the developer tools in Chrome (or other browsers), the HTML you see when you inspect is not necessarily what requests will get. Viewing the page source directly in your browser (or printing requests.get(url).text or .content) usually gives a more accurate picture of the HTML you are dealing with.
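The None result can be reproduced without any network access. In this sketch, first_div_text is a hypothetical helper and the two input strings are made up; it assumes the "html.parser" backend, which does not wrap plain text in extra tags.

```python
from bs4 import BeautifulSoup


def first_div_text(body_text):
    """Return the text of the first <div>, or None if the body has no <div> (hypothetical helper)."""
    soup = BeautifulSoup(body_text, "html.parser")
    div = soup.find("div")  # None when the response contains no such tag
    return div.get_text(strip=True) if div is not None else None


# A response body that is just a string: there is nothing for find() to match.
plain = first_div_text("just a plain string, no markup")

# A response body that really contains markup.
marked = first_div_text("<div>some markup</div>")

print(plain)
print(marked)
```

If you get None on a real page, printing r.text first is a quick way to check whether the tag you expect is in the response at all.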