HTML parser on Node.js

If you want to build a DOM you can use jsdom.
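
For example, a minimal sketch using jsdom's current API (assumes jsdom installed from npm; the sample HTML is purely illustrative):

    // Parse a string of HTML into a full DOM and query it with standard methods.
    const { JSDOM } = require("jsdom");

    const dom = new JSDOM('<body><p class="greeting">Hello, world</p></body>');
    const document = dom.window.document;

    // Any W3C DOM accessor works on the resulting tree.
    console.log(document.querySelector("p.greeting").textContent); // "Hello, world"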

There's also cheerio, which offers a jQuery-style interface. It was a lot faster than older versions of jsdom, although these days the two are similar in performance.
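
A small sketch of cheerio's jQuery-style usage (assumes cheerio installed from npm; the markup is just an example):

    // Load a fragment and query it with jQuery-like selectors, no browser needed.
    const cheerio = require("cheerio");

    const $ = cheerio.load("<ul><li>one</li><li>two</li></ul>");

    $("li").each((i, el) => {
      console.log($(el).text()); // "one", then "two"
    });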

You might also want to have a look at htmlparser2, a streaming parser that, according to its benchmarks, appears to be faster than the others. It builds no DOM by default, but it is bundled with a handler that can create one. This is the parser used by cheerio.
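
A rough sketch of the streaming, callback-based usage (handler names follow htmlparser2's documented events; the input is illustrative):

    // Feed chunks of HTML to the parser and react to tags and text as they arrive.
    const { Parser } = require("htmlparser2");

    const parser = new Parser({
      onopentag(name, attributes) {
        if (name === "a") console.log("link to:", attributes.href);
      },
      ontext(text) {
        // No DOM is built here unless you attach the bundled DOM handler.
        if (text.trim()) console.log("text:", text.trim());
      },
    });

    parser.write('<p>See <a href="https://example.com">this site</a></p>');
    parser.end();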

parse5 also looks like a good solution. It's fairly active (11 days since the last commit as of this update), WHATWG-compliant, and is used in jsdom, Angular, and Polymer.
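
A minimal sketch, assuming a parse5 release that exposes parse() and serialize():

    // Run the full WHATWG parsing algorithm, including error recovery, then
    // re-serialize the resulting tree back to HTML.
    const parse5 = require("parse5");

    const document = parse5.parse("<!DOCTYPE html><html><body><p>Hi</p></body></html>");
    console.log(parse5.serialize(document));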

If the website you're trying to scrape is dynamic, then you should use a headless browser like PhantomJS. Also have a look at CasperJS if you're considering PhantomJS; you can control CasperJS from Node with SpookyJS.
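
For reference, a PhantomJS page script looks roughly like this (it runs under the phantomjs binary, not under node; the URL is a placeholder):

    // Load a page, let its JavaScript run, then dump the rendered HTML.
    var page = require("webpage").create();

    page.open("https://example.com", function (status) {
      if (status === "success") {
        console.log(page.content); // HTML after client-side scripts have executed
      }
      phantom.exit();
    });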

Besides PhantomJS there's Zombie.js. Unlike PhantomJS, which cannot be embedded in Node.js, Zombie.js is just a Node module.
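
A rough sketch of driving Zombie.js from a plain Node script (method names follow Zombie's documented API; the URL is a placeholder):

    // Zombie is a headless browser that runs in-process as an ordinary module.
    const Browser = require("zombie");

    const browser = new Browser();

    browser.visit("https://example.com", function () {
      // The page's scripts have run by the time the callback fires.
      console.log(browser.text("title"));
    });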

There's a Nettuts+ tutorial covering the latter solutions.

How do I parse an HTML page with Node.js?

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.

Other options include:

  • BeautifulSoup for Python
  • you can convert your HTML to XHTML and use XSLT
  • HTMLAgilityPack for .NET
  • CsQuery for .NET (my new favorite)
  • The SpiderMonkey and Rhino JS engines have native E4X support. This may be useful, but only if you convert your HTML to XHTML.

Out of all these options, I prefer the Node.js one, because it uses the standard W3C DOM accessor methods and lets me reuse code on both the client and the server (see the sketch below). I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML just to write XSLT is plain sadistic.
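
To illustrate, here is a hypothetical helper written only against the standard DOM API, so the same function can run in the browser as-is or against a jsdom document on the server:

    // extractLinks() is an illustrative name; it relies only on W3C DOM methods.
    function extractLinks(document) {
      return Array.from(document.querySelectorAll("a[href]")).map(function (a) {
        return a.getAttribute("href");
      });
    }

    // Server-side usage with jsdom (assumes jsdom installed from npm).
    const { JSDOM } = require("jsdom");
    const dom = new JSDOM('<a href="/one">1</a> <a href="/two">2</a>');
    console.log(extractLinks(dom.window.document)); // [ "/one", "/two" ]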


