Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?
For starters, BeautifulSoup 3 is no longer actively maintained, and the author even recommends alternatives such as lxml.
Quoting from the linked page:
Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does. The most common problems are handling tags incorrectly, "malformed start tag" errors, and "bad end tag" errors. This page explains what happened, how the problem will be addressed, and what you can do right now.

This page was originally written in March 2009. Since then, the 3.2 series has been released, replacing the 3.1 series, and development of the 4.x series has gotten underway. This page will remain up for historical purposes.

tl;dr: Use 3.2.0 instead.
BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?
From the docs' summarized table of advantages and disadvantages:

html.parser - BeautifulSoup(markup, "html.parser")
Advantages: Batteries included, decent speed, lenient (as of Python 2.7.3 and 3.2)
Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

lxml - BeautifulSoup(markup, "lxml")
Advantages: Very fast, lenient
Disadvantages: External C dependency

html5lib - BeautifulSoup(markup, "html5lib")
Advantages: Extremely lenient, parses pages the same way a web browser does, creates valid HTML5
Disadvantages: Very slow, external Python dependency
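As a quick sketch of the table above (the markup string here is made up): the only thing that changes between parsers is the string you pass as the second argument to BeautifulSoup.

```python
from bs4 import BeautifulSoup

markup = "<p>Hello, <b>world</b></p>"

# Stdlib parser -- always available, no extra install needed.
soup = BeautifulSoup(markup, "html.parser")
print(soup.b.text)  # -> world

# lxml parser -- requires `pip install lxml`; much faster on large documents.
soup = BeautifulSoup(markup, "lxml")
print(soup.b.text)  # -> world

# html5lib parser -- requires `pip install html5lib`; browser-grade leniency.
# soup = BeautifulSoup(markup, "html5lib")
```

Both lxml and html5lib are optional dependencies; if you ask for a parser that isn't installed, BeautifulSoup raises a FeatureNotFound error.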
BeautifulSoup and lxml.html - what to prefer?
The simple answer, imo, is that if you trust your source to be well-formed, go with the lxml solution. Otherwise, BeautifulSoup all the way.
Edit:
This answer is three years old now; it's worth noting, as Jonathan Vanasco does in the comments, that BeautifulSoup4 now supports using lxml as the internal parser, so you can use the advanced features and interface of BeautifulSoup without most of the performance hit, if you wish (although I still reach straight for lxml myself -- perhaps it's just force of habit :)).
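For the well-formed case the answer describes, going straight to lxml looks roughly like this minimal sketch (the HTML snippet is a made-up example):

```python
import lxml.html

# A hypothetical, well-formed snippet standing in for your trusted source.
html = "<html><body><ul><li>alpha</li><li>beta</li></ul></body></html>"

root = lxml.html.fromstring(html)

# XPath support is built in -- no extra dependency beyond lxml itself.
items = [li.text for li in root.xpath("//li")]
print(items)  # -> ['alpha', 'beta']
```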
What's the relationship between 'BeautifulSoup' and 'lxml'?
Nothing should be confusing about the BS parser and the lxml.html parser. BS has its own HTML parser, and lxml has its own HTML parser.

The BS documentation you quoted simply says that you can parse HTML into a BS soup object using the lxml parser, or another third-party parser, as an alternative to the default BS parser:

BeautifulSoup(markup, "lxml")

Similarly, the lxml documentation says that you can parse HTML into an lxml tree object using the BS parser, as an alternative to the default lxml.html parser:

root = lxml.html.soupparser.fromstring(tag_soup)
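Both directions can be seen in one runnable sketch; tag_soup here is a stand-in string, and the soupparser module itself requires BeautifulSoup to be installed:

```python
import lxml.html
import lxml.html.soupparser
from bs4 import BeautifulSoup

tag_soup = "<p>Unclosed paragraph<b>bold"

# Direction 1: a BeautifulSoup object built with lxml's parser.
soup = BeautifulSoup(tag_soup, "lxml")
print(soup.b.text)

# Direction 2: an lxml tree built with BeautifulSoup's parser.
root = lxml.html.soupparser.fromstring(tag_soup)
print(lxml.html.tostring(root))
```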
For web scraping and xml parsing, which is best library to learn
Scrapy is used for scraping web pages (extracting data from them), hence the name.
Beautiful Soup is a library for parsing/pulling data from XML and HTML files.
xml.etree.ElementTree provides an object representation of an XML file; it is the XML processing module in Python's standard library. It is neat for parsing and manipulating data in XML format.
lxml is, as its authors claim, compatible with yet superior to ElementTree; it does essentially the same job. However, I have never used it for parsing HTML files.
In my experience, I used Scrapy for fetching data from various user panels that did not have any kind of API for pulling the data. Parsing of HTML files I mostly did with Beautiful Soup, as it is really neat and easy to use.
Regarding XML parsing, I mostly used the standard library's XML package; I never had any complicated XML parsing to perform, so it covered everything I needed.
The right tool really depends on your requirements. If you need a library that parses both XML and HTML files, I would go with Beautiful Soup, as it is easy to use and there is vast documentation online.
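For the simple XML cases mentioned above, the standard library alone goes a long way. A minimal ElementTree sketch (the catalog document is invented for illustration):

```python
import xml.etree.ElementTree as ET

# A hypothetical XML document.
xml_data = """
<catalog>
  <book id="1"><title>Dive Into Python</title></book>
  <book id="2"><title>Fluent Python</title></book>
</catalog>
"""

root = ET.fromstring(xml_data)

# findall() matches direct children; find() returns the first match.
titles = [book.find("title").text for book in root.findall("book")]
print(titles)  # -> ['Dive Into Python', 'Fluent Python']
```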
Beautiful Soup and Table Scraping - lxml vs html parser
There is a dedicated section in the BeautifulSoup documentation called Differences between parsers; it states that:
Beautiful Soup presents the same interface to a number of different
parsers, but each parser is different. Different parsers will create
different parse trees from the same document. The biggest differences
are between the HTML parsers and the XML parsers.
The differences become clear on non well-formed HTML documents.
The moral is just that you should use the parser that works in your particular case.
Also note that you should always explicitly specify which parser you are using. This helps you avoid surprises when running the code on different machines or in different virtual environments.
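You can see the divergence for yourself by feeding the same broken snippet to two parsers; the exact trees vary by parser and version, so treat the output as illustrative rather than definitive:

```python
from bs4 import BeautifulSoup

broken = "<a></p>"

# The stdlib parser ignores the unmatched </p> and keeps the bare <a>.
builtin_tree = str(BeautifulSoup(broken, "html.parser"))

# lxml also drops the stray end tag, but wraps the result in <html><body>.
lxml_tree = str(BeautifulSoup(broken, "lxml"))

print(builtin_tree)
print(lxml_tree)
```

Note that the parser name is passed explicitly in both calls, as the answer above advises.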
Python HTML web scraping
I would suggest going with BeautifulSoup, as its CSS selectors are more convenient than XPaths.
Using Beautiful Soup, the code for your problem would be:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.cardbinlist.com/search.html?bin=371793')
soup = BeautifulSoup(page.content, 'html.parser')
brand_parent = soup.find('th', string='Brand (Financial Service)') # selects <th> element which contains text 'Brand (Financial Service)'
brand = brand_parent.find_next_sibling('td').text # output: AMEX
If you want to go with XPath instead, change the XPath expression to //td//following::td[5]/a and try again.
Read the following answers to choose your method of scraping:
Xpath vs DOM vs BeautifulSoup vs lxml vs other Which is the fastest approach to parse a webpage?
Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?
Robustly Parsing HTML in Python
In my experience, among the many Python XML/HTML libraries, Beautiful Soup is really good at processing broken HTML.
Raw:
<i>This <span title="a">is<br> some <html>invalid</htl %> HTML.
<sarcasm>It's so great!</sarcasm>
Parsed with BeautifulSoup:
<i>This
<span title="a">is
<br /> some
<html>invalid HTML.
<sarcasm>It's so great!
</sarcasm>
</html>
</span>
</i>
Parsing and Modifying the html with BeautifulSoup or lxml. Surround a text with some html tag which is directly under the body tag
I think you want to surround tail text, and I would choose lxml over BeautifulSoup to handle it. The following script searches for any element that has tail text, creates a new <div> tag (choose your own) and inserts it there. It uses a regular expression to check that the text looks like a price, and this way skips the text at the end, Ships from and sold by Amazon.com or Gift-wrap available.:
from lxml import etree
import re

tree = etree.parse('htmlfile')
root = tree.getroot()
for elem in root.iter('*'):
    # wrap price-like tail text in a new <div> element
    if elem.tail is not None and elem.tail.strip() and re.search(r'\$\d+', elem.tail):
        e = etree.Element('div')
        e.text = elem.tail
        elem.tail = ''
        elem.addnext(e)
print(etree.tostring(root))
It yields:
<html><body>
<b>List Price:</b>
<strike>$150.00</strike><br/>
<b>Price</b><div>
$117.80</div><br/>
<b>You Save:</b><div>
$32.20(21%)</div><br/>
<font size="-1" color="#009900">In Stock</font>
<br/>
<a href="/gp/aw/help/id=sss/ref=aw_d_sss_shoes">Free Shipping</a>
<br/>
Ships from and sold by Amazon.com<br/>
Gift-wrap available.<br/></body></html>