HTML-parser on Node.js
If you want to build a DOM, you can use jsdom.
There's also cheerio, which has a jQuery-style interface and is a lot faster than older versions of jsdom, although these days the two are similar in performance.
You might also want to look at htmlparser2, a streaming parser that builds no DOM by default and, according to its own benchmark, seems to be faster than the others. It can also produce a DOM, since it is bundled with a handler that creates one. This is the parser that is used by cheerio.
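To illustrate the difference between streaming (event-style) parsing and DOM building, here is a minimal dependency-free sketch. It is not htmlparser2's implementation: `parseEvents`, `collectLinks`, and `buildTree` are hypothetical names, and the regex tokenizer only copes with simple, well-formed markup. Only the callback names (`onopentag`, `ontext`, `onclosetag`) are borrowed from htmlparser2's handler interface.

```javascript
// Toy tokenizer (an assumption for illustration, not a real HTML parser):
// no comments, scripts, void elements, or '>' inside attribute values.
function parseEvents(html, handler) {
  const token = /<(\/?)([a-zA-Z][\w-]*)([^>]*)>|([^<]+)/g;
  const attr = /([\w-]+)\s*=\s*"([^"]*)"/g;
  let m;
  while ((m = token.exec(html)) !== null) {
    if (m[4] !== undefined) {
      // Plain text between tags.
      if (handler.ontext) handler.ontext(m[4]);
    } else if (m[1] === '/') {
      // Closing tag.
      if (handler.onclosetag) handler.onclosetag(m[2].toLowerCase());
    } else {
      // Opening tag: collect double-quoted attributes into an object.
      const attribs = {};
      for (const a of m[3].matchAll(attr)) attribs[a[1]] = a[2];
      if (handler.onopentag) handler.onopentag(m[2].toLowerCase(), attribs);
    }
  }
}

// Streaming use: react to events as they arrive, never build a tree.
function collectLinks(html) {
  const links = [];
  parseEvents(html, {
    onopentag(name, attribs) {
      if (name === 'a' && attribs.href) links.push(attribs.href);
    },
  });
  return links;
}

// DOM-style use: a handler that turns the same events into a tree.
function buildTree(html) {
  const root = { name: 'root', attribs: {}, children: [] };
  const stack = [root];
  parseEvents(html, {
    onopentag(name, attribs) {
      const node = { name, attribs, children: [] };
      stack[stack.length - 1].children.push(node);
      stack.push(node);
    },
    ontext(text) {
      stack[stack.length - 1].children.push({ text });
    },
    onclosetag() {
      if (stack.length > 1) stack.pop();
    },
  });
  return root;
}
```

The point of the sketch is that the tree is optional: `collectLinks` does its job from the events alone, while `buildTree` is just another handler layered on the same event stream.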
parse5 also looks like a good solution. It's actively maintained (11 days since the last commit as of this update), WHATWG-compliant, and used in jsdom, Angular, and Polymer.
If the website you're trying to scrape is dynamic, you should use a headless browser such as phantomjs. If you're considering phantomjs, also have a look at casperjs, which you can control from Node with SpookyJS.
Besides phantomjs, there's zombiejs. Unlike phantomjs, which cannot be embedded in Node.js, zombiejs is just a Node module.
There's a nettuts+ tutorial for the latter solutions.
How do I parse an HTML page with Node.js
You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.
Other options include:
- BeautifulSoup for Python
- Converting your HTML to XHTML and using XSLT
- HTMLAgilityPack for .NET
- CsQuery for .NET (my new favorite)
- The SpiderMonkey and Rhino JS engines, which have native E4X support. This is useful only if you convert your HTML to XHTML.
Out of all these options, I prefer the Node.js option, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML to write XSLT is just plain sadistic.
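As a small illustration of that reuse, here is a sketch of a function written only against standard DOM methods, so it does not care where its `document` comes from. The name `extractHeadings` and the `h1, h2` selector are hypothetical choices made for the example.

```javascript
// Works with any object exposing the standard DOM querySelectorAll API:
// the browser's global `document`, or a document created on the server.
function extractHeadings(document) {
  const nodes = document.querySelectorAll('h1, h2');
  // Array.from handles a NodeList (or any iterable/array-like) uniformly.
  return Array.from(nodes, (el) => el.textContent.trim());
}
```

In a browser you would call `extractHeadings(document)`. On the server, assuming jsdom is installed, the same function should accept `new JSDOM(html).window.document`.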