Parsing HTML in Python - lxml or BeautifulSoup? Which of These Is Better for What Kinds of Purposes?

Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

For starters, the BeautifulSoup 3 series is no longer actively maintained, and the author even recommends alternatives such as lxml.

Quoting from the linked page:

Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does. The most common problems are handling <table> tags incorrectly, "malformed start tag" errors, and "bad end tag" errors. This page explains what happened, how the problem will be addressed, and what you can do right now.

This page was originally written in March 2009. Since then, the 3.2 series has been released, replacing the 3.1 series, and development of the 4.x series has gotten underway. This page will remain up for historical purposes.

tl;dr

Use 3.2.0 instead.

BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?

From the docs' summarized table of advantages and disadvantages (a short usage sketch follows the list):

  1. html.parser - BeautifulSoup(markup, "html.parser")

    • Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.)

    • Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

  2. lxml - BeautifulSoup(markup, "lxml")

    • Advantages: Very fast, Lenient

    • Disadvantages: External C dependency

  3. html5lib - BeautifulSoup(markup, "html5lib")

    • Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5

    • Disadvantages: Very slow, External Python dependency
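
To make the table concrete, here is a minimal sketch (the markup string is an invented example). The parser is simply the second argument to the BeautifulSoup constructor; lxml and html5lib must be installed separately:

from bs4 import BeautifulSoup

markup = '<p>Hello, <b>world</b>'  # deliberately unclosed <p>

# same interface, three different underlying parsers
soup_std = BeautifulSoup(markup, 'html.parser')   # stdlib, no extra dependency
soup_lxml = BeautifulSoup(markup, 'lxml')         # fast C extension
soup_h5 = BeautifulSoup(markup, 'html5lib')       # browser-grade leniency

print(soup_std.b.text)  # all three expose the same Beautiful Soup API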

BeautifulSoup and lxml.html - what to prefer?

The simple answer, imo, is that if you trust your source to be well-formed, go with the lxml solution. Otherwise, BeautifulSoup all the way.

Edit:

This answer is three years old now; it's worth noting, as Jonathan Vanasco does in the comments, that BeautifulSoup4 now supports using lxml as the internal parser, so you can use the advanced features and interface of BeautifulSoup without most of the performance hit, if you wish (although I still reach straight for lxml myself -- perhaps it's just force of habit :)).

What's the relationship between 'BeautifulSoup' and 'lxml'?

There is nothing confusing about the relationship between the BS parser and the lxml.html parser: BS has its own HTML parser, and lxml has its own HTML parser.

The BS documentation you quoted simply says that you can parse HTML into a BS soup object using the lxml parser, or other third-party parsers, as an alternative to the default BS parser:

BeautifulSoup(markup, "lxml")

Similarly, the lxml documentation says that you can parse HTML into an lxml tree object using the BS parser, as an alternative to the default lxml.html parser:

root = lxml.html.soupparser.fromstring(tag_soup)
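
To see both directions side by side, here is a minimal sketch (the tag_soup string is an invented broken-markup example; both libraries must be installed):

from bs4 import BeautifulSoup
import lxml.html.soupparser

tag_soup = '<p>Some <b>bad <i>HTML'

# lxml as the engine behind a Beautiful Soup object
soup = BeautifulSoup(tag_soup, 'lxml')

# Beautiful Soup as the engine behind an lxml tree
root = lxml.html.soupparser.fromstring(tag_soup)

print(soup.b.get_text())      # navigate with the Beautiful Soup API
print(root.findtext('.//b'))  # navigate with the lxml/ElementTree API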

For web scraping and XML parsing, which is the best library to learn?

Scrapy is used for scraping web pages (extracting data from web pages), hence the name.

Beautiful Soup is a library for parsing and pulling data out of XML and HTML files.

xml.etree.ElementTree provides an object representation of an XML file; it is the XML processing module of Python's standard xml package. It is neat to use for parsing and manipulating data in XML format.

lxml is, as its authors claim, compatible with yet superior to the ElementTree module, and essentially does the same job. However, I have never used it for parsing HTML files.

In my experience, I used Scrapy for fetching data from various user panels that did not have any kind of API for pulling the data. Parsing HTML files, however, I mostly did with Beautiful Soup, as it is really neat and easy to use.
For XML parsing I mostly used the standard xml package; I never had any complicated XML parsing to perform, so it covered everything I needed.

The right tool really depends on your requirements. If you need a library that parses both XML and HTML files, I would go with Beautiful Soup, as it is really easy to use and there is vast documentation online.
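
For the standard-library route mentioned above, a minimal ElementTree sketch (the XML document is invented for illustration) looks like this:

import xml.etree.ElementTree as ET

xml_data = '<catalog><book id="1"><title>A</title></book><book id="2"><title>B</title></book></catalog>'

root = ET.fromstring(xml_data)
for book in root.findall('book'):  # iterate over the direct <book> children
    print(book.get('id'), book.findtext('title'))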

Beautiful Soup and Table Scraping - lxml vs html parser

There is a special section in the BeautifulSoup documentation called Differences between parsers; it states that:

Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers.

The differences become clear on HTML documents that are not well-formed.

The moral is just that you should use the parser that works in your particular case.

Also note that you should always explicitly specify which parser you are using. This will help you avoid surprises when running the code on different machines or in different virtual environments.
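
A quick way to see these differences for yourself is to feed the same invalid fragment to each parser. This sketch assumes lxml and html5lib are installed; the exact trees can vary with parser versions, so run it to see yours:

from bs4 import BeautifulSoup

for parser in ('html.parser', 'lxml', 'html5lib'):
    # each parser repairs the invalid '<a></p>' fragment differently
    print(parser, '->', BeautifulSoup('<a></p>', parser))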

Python HTML web scraping

I would suggest going with BeautifulSoup, as its CSS selectors are more convenient than XPath expressions.

Using Beautiful Soup, the code for your problem would be:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.cardbinlist.com/search.html?bin=371793')
soup = BeautifulSoup(page.content, 'html.parser')
# select the <th> element whose text is 'Brand (Financial Service)'
brand_parent = soup.find('th', string='Brand (Financial Service)')
# the value sits in the sibling <td>; output: AMEX
brand = brand_parent.find_next_sibling('td').text

If you want to go with XPath instead, change the expression to //td//following::td[5]/a and try that; a minimal lxml sketch follows.
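
Here is that XPath route with lxml, using the expression suggested above (whether it still matches depends on the live page's current markup):

import requests
from lxml import html

page = requests.get('https://www.cardbinlist.com/search.html?bin=371793')
tree = html.fromstring(page.content)
# returns a list of the text nodes matched by the XPath expression
brand = tree.xpath('//td//following::td[5]/a/text()')
print(brand)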

Read the following answers to choose your method of scraping:

Xpath vs DOM vs BeautifulSoup vs lxml vs other: Which is the fastest approach to parse a webpage?

Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

Robustly Parsing HTML in Python

In my experience, among the many Python XML/HTML libraries, Beautiful Soup is really good at processing broken HTML.

Raw:

<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. 
<sarcasm>It's so great!</sarcasm>

Parsed with BeautifulSoup:

 <i>This 
<span title="a">is
<br /> some
<html>invalid HTML.
<sarcasm>It's so great!
</sarcasm>
</html>
</span>
</i>
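
To reproduce something like this yourself, here is a minimal sketch. Note that the output quoted above came from the old Beautiful Soup 3 default parser; with Beautiful Soup 4 you choose a parser explicitly, and the repaired tree depends on which one you pick:

from bs4 import BeautifulSoup

raw = '''<i>This <span title="a">is<br> some <html>invalid</htl %> HTML.
<sarcasm>It's so great!</sarcasm>'''

soup = BeautifulSoup(raw, 'html.parser')  # BS4 requires naming a parser
print(soup.prettify())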

Parsing and Modifying the HTML with BeautifulSoup or lxml. Surround a text with some HTML tag which is directly under the body tag

I think you want to surround tail text, and lxml handles that better than BeautifulSoup. The following script searches for any element that has tail text, creates a new <div> tag (substitute your own), and inserts it there. It uses a regular expression to check that the text looks like a price, which skips text such as Ships from and sold by Amazon.com or Gift-wrap available.:

from lxml import etree
import re

tree = etree.parse('htmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    # tail text is the text that follows an element's closing tag
    if elem.tail is not None and elem.tail.strip() and re.search(r'\$\d+', elem.tail):
        e = etree.Element('div')
        e.text = elem.tail  # move the tail into the new <div>
        elem.tail = ''
        elem.addnext(e)     # insert the <div> right after the element

print(etree.tostring(root))

It yields:

<html><body>
<b>List Price:</b>
<strike>$150.00</strike><br/>
<b>Price</b><div>
$117.80</div><br/>
<b>You Save:</b><div>
$32.20(21%)</div><br/>
<font size="-1" color="#009900">In Stock</font>
<br/>
<a href="/gp/aw/help/id=sss/ref=aw_d_sss_shoes">Free Shipping</a>
<br/>
Ships from and sold by Amazon.com<br/>
Gift-wrap available.<br/></body></html>
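
For context on what tail text means here, a minimal sketch (the fragment is invented): in lxml, .tail is the text that follows an element's closing tag but sits outside the element, which is exactly what the script above wraps in a new <div>:

from lxml import etree

frag = etree.fromstring('<body><b>Price</b> $117.80<br/></body>')
b = frag.find('b')
print(b.text)  # 'Price'    - the text inside <b>
print(b.tail)  # ' $117.80' - the tail text after </b>, the part the script moves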

