Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?



The Problem: DOM Requires <tbody/> Tags

Firebug, Chrome's Developer Tools, XPath functions in JavaScript, and others work on the DOM, not on the raw HTML source code.

The DOM for HTML requires that all table rows not contained in a table header or footer (<thead/>, <tfoot/>) be wrapped in table body tags <tbody/>. Browsers therefore add this tag while parsing (X)HTML if it is missing. For example, Microsoft's DOM documentation says:

The tbody element is exposed for all tables, even if the table does not explicitly define a tbody element.

There is an in-depth explanation in another answer on Stack Overflow.

On the other hand, HTML does not necessarily require that tag to be used:

The TBODY start tag is always required except when the table contains only one table body and no table head or foot sections.

Most XPath Processors Work on Raw XML

Excluding JavaScript, most XPath processors work on raw XML, not on the DOM, and thus do not add <tbody/> tags. Likewise, HTML parser libraries like TagSoup and HTML Tidy only output XHTML, not "DOM-HTML".

This is a common problem posted on Stack Overflow for PHP, Ruby, Python, Java, C#, Google Docs (Spreadsheets) and lots of others. Selenium runs inside the browser and works on the DOM -- so it is not affected!

Reproducing the Issue

Compare the source shown by Firebug (or Chrome's Dev Tools) with the one you get by right-clicking and selecting "Show Page Source" (or whatever it's called in your browser) -- or by using curl http://your.example.org on the command line. The latter will probably not contain any <tbody/> elements (they're rarely written out by hand), while Firebug will always show them.
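You can also reproduce this programmatically: a non-browser parser keeps the source structure as-is. A minimal sketch using Python's lxml (the one-row table is made up for illustration):

```python
from lxml import html

# The kind of markup a server typically sends: no <tbody> written out.
raw = '<table id="t"><tr><td>x</td></tr></table>'
tree = html.fromstring(raw)

# A browser's DOM would report table > tbody > tr, but lxml's HTML
# parser keeps the source structure and inserts no <tbody>:
print([el.tag for el in tree.iter()])  # ['table', 'tr', 'td']
```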


Solution 1: Remove /tbody Axis Step

Check if the table you're stuck at really does not contain a <tbody/> element (see the previous section). If it does, you've probably got another kind of problem.

Now remove the /tbody axis step, so your query will look like

//table[@id="example"]/tr[2]/td[1]
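A quick check of this fix with Python's lxml (the table markup here is invented for illustration):

```python
from lxml import html

page = '''<table id="example">
  <tr><td>first</td></tr>
  <tr><td>second</td></tr>
</table>'''
tree = html.fromstring(page)

# The query copied from the browser inspector finds nothing in raw HTML:
print(tree.xpath('//table[@id="example"]/tbody/tr[2]/td[1]/text()'))  # []

# With the /tbody axis step removed, it matches:
print(tree.xpath('//table[@id="example"]/tr[2]/td[1]/text()'))  # ['second']
```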

Solution 2: Skip <tbody/> Tags

This is a rather dirty solution and is likely to fail for nested tables (it can jump into inner tables). I would only recommend doing this in very rare cases.

Replace the /tbody axis step by a descendant-or-self step:

//table[@id="example"]//tr[2]/td[1]
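The nested-table pitfall is easy to demonstrate; a sketch with lxml and invented markup:

```python
from lxml import html

page = '''<table id="example">
  <tr><td>outer row 1
    <table>
      <tr><td>inner row 1</td></tr>
      <tr><td>inner row 2</td></tr>
    </table>
  </td></tr>
  <tr><td>outer row 2</td></tr>
</table>'''
tree = html.fromstring(page)

# //tr[2] matches the second row of *every* table under #example,
# so the inner table's second row is picked up as well:
print(tree.xpath('//table[@id="example"]//tr[2]/td[1]/text()'))
# ['inner row 2', 'outer row 2']
```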

Solution 3: Allow Both Input With and Without <tbody/> Tags

If you're not sure in advance whether your table contains a <tbody/> element, or you need the query to work in both "HTML source" and DOM contexts, and you don't want to (or cannot) use the hack from solution 2, provide an alternative query (for XPath 1.0) or use an "optional" axis step (XPath 2.0 and higher).

  • XPath 1.0:

    //table[@id="example"]/tr[2]/td[1] | //table[@id="example"]/tbody/tr[2]/td[1]
  • XPath 2.0: //table[@id="example"]/(tbody, .)/tr[2]/td[1]
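The XPath 1.0 alternative can be verified against both kinds of input; a small lxml sketch with made-up tables:

```python
from lxml import html

# Union of both variants: matches whether <tbody> is present or not.
query = ('//table[@id="example"]/tr[2]/td[1]'
         ' | //table[@id="example"]/tbody/tr[2]/td[1]')

with_tbody = ('<table id="example"><tbody>'
              '<tr><td>r1</td></tr><tr><td>r2</td></tr>'
              '</tbody></table>')
without_tbody = ('<table id="example">'
                 '<tr><td>r1</td></tr><tr><td>r2</td></tr>'
                 '</table>')

for source in (with_tbody, without_tbody):
    cell = html.fromstring(source).xpath(query)[0]
    print(cell.text)  # r2, both times
```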

Having problems filtering by XPath

If the problem is what I think it is, I've been battered by this one a couple of times. Chrome implicitly adds any missing <tbody> tags to the DOM, so if you then copy the XPath or CSS path, you may also have copied tags that don't necessarily exist in the source document. Try viewing the page's source and see if the DOM reported by your browser's console corresponds to the original source HTML. If the <tbody> tags are absent, be sure to exclude them in your filterXPath() call.

XPath, DOMDocument, Scraping table

I am able to get the data you are looking for by using the XPath helper ($x) in Chrome in the following manner (these lines are typed into the Chrome console):

All chemicals / first chemical:

> allChemicals = $x("descendant::tr/td[position() = 1]")
> firstChemical = allChemicals[0].innerText

All links / first link:

> allLinks = $x("descendant::tr/td[position() = 1]/a")
> firstLink = allLinks[0].href

All parts / first part:

> allParts = $x("descendant::tr/td[position() = 2]")
> firstPart = allParts[0].innerText

Hope that helps.

Selenium Script Fails even though Xpath , Firebug show the correct element

If Selenium fails to find an element you know is present, the problem is commonly synchronization: Selenium tries to access the element too fast, before it appears on the page (when you inspect the element even a second later, you can see it, since it has been rendered by then). Try to WAIT for the very same element before doing anything else. Examples of the wait can be found here
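Selenium ships this as WebDriverWait together with expected_conditions; as a browser-free illustration of the underlying idea, here is a minimal polling sketch in plain Python (wait_for and the simulated find_element are hypothetical stand-ins, not Selenium API):

```python
import time

def wait_for(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns a truthy value or timeout expires,
    which is essentially what WebDriverWait.until() does internally."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1fs" % timeout)
        time.sleep(poll)

# Simulated page: the "element" only appears after a short delay,
# like a table row rendered by JavaScript after page load.
appears_at = time.monotonic() + 0.2
find_element = lambda: "row-element" if time.monotonic() >= appears_at else None

print(wait_for(find_element, timeout=2.0, poll=0.05))  # row-element
```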

XPath expression returning empty list in scrapy

The <tbody> element is not part of the initial HTML source; it is generated by the browser's parser, so you shouldn't use it in your XPath expression.

You can use the link text to match the exact element:

//a[text()="One-Day Internationals"]
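Scrapy's selectors evaluate standard XPath, so the same expression can be sanity-checked with lxml; the page fragment below is invented for illustration:

```python
from lxml import html

page = '''<table>
  <tr><td><a href="/odi">One-Day Internationals</a></td></tr>
  <tr><td><a href="/tests">Test matches</a></td></tr>
</table>'''
tree = html.fromstring(page)

# Match the link by its exact text, ignoring any table structure:
link = tree.xpath('//a[text()="One-Day Internationals"]')[0]
print(link.get('href'))  # /odi
```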

Can not get Xpath to fetch a nodeList

If I'm right, you'd like to get all the titles in that table. I'd suggest an easier, yet more specific XPath query, i.e.

$nodeList = $x->query('//div[@class="detName"]');

See it in action

Trouble with scraping text from site using lxml / xpath()

Try removing /tbody from the XPath.

The browser might be adding the <tbody> tag, whereas it might not appear in the raw HTML.

Read more here and here.

xpath doesn't work in this website

The following works perfectly in lxml.html (which modern Scrapy uses):

sel.xpath('.//div[@class="info_div"]//td[text()="Ref:"]/following-sibling::td[1]/text()')

Note that I'm using // to get between the div and the td, rather than laying out the explicit path. I'd have to take a closer look at the document to grok why, but the path given in that area was incorrect.
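A sketch of that query against invented markup with the assumed structure (a "Ref:" label cell followed by its value cell):

```python
from lxml import html

page = '''<html><body>
<div class="info_div">
  <table>
    <tr><td>Ref:</td><td>ABC-123</td></tr>
    <tr><td>Price:</td><td>99</td></tr>
  </table>
</div>
</body></html>'''
tree = html.fromstring(page)

# The label cell anchors the query; following-sibling picks its value.
print(tree.xpath('.//div[@class="info_div"]//td[text()="Ref:"]'
                 '/following-sibling::td[1]/text()'))  # ['ABC-123']
```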


