Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?
The Problem: DOM Requires <tbody/> Tags
Firebug, Chrome's Developer Tool, XPath functions in JavaScript and others work on the DOM, not the basic HTML source code.
The DOM for HTML requires that all table rows not contained in a table header or footer (<thead/>, <tfoot/>) are included in table body tags <tbody/>. Thus, browsers add this tag if it's missing while parsing (X)HTML. For example, Microsoft's DOM documentation says:
The tbody element is exposed for all tables, even if the table does not explicitly define a tbody element.
There is an in-depth explanation in another answer on stackoverflow.
On the other hand, HTML does not necessarily require that tag to be used:
The TBODY start tag is always required except when the table contains only one table body and no table head or foot sections.
Most XPath Processors Work on raw XML
Excluding JavaScript, most XPath processors work on raw XML, not the DOM, and thus do not add <tbody/> tags. HTML parser libraries like tag-soup and htmltidy likewise output only XHTML, not "DOM-HTML".
This is a common problem posted on Stackoverflow for PHP, Ruby, Python, Java, C#, Google Docs (Spreadsheets) and lots of others. Selenium runs inside the browser and works on the DOM -- so it is not affected!
Reproducing the Issue
Compare the source shown by Firebug (or Chrome's Dev Tools) with the one you get by right-clicking and selecting "Show Page Source" (or whatever it's called in your browser), or by using curl http://your.example.org on the command line. The latter will probably not contain any <tbody/> elements (they're rarely used), whereas Firebug will always show them.
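The same check can be done programmatically. Below is a minimal sketch using only Python's standard library; `TbodyCounter`, `count_tbody` and the sample markup are hypothetical names and data made up for illustration. `html.parser.HTMLParser` only reports the tags that literally occur in the text and does not build a browser-style DOM, so it never adds implied elements.

```python
from html.parser import HTMLParser


class TbodyCounter(HTMLParser):
    """Counts <tbody> start tags that literally appear in the markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "tbody":
            self.count += 1


def count_tbody(source: str) -> int:
    """Return how many explicit <tbody> tags the raw source contains."""
    parser = TbodyCounter()
    parser.feed(source)
    parser.close()
    return parser.count


# Hypothetical raw source, as curl or "Show Page Source" would deliver it.
raw_source = "<table id='example'><tr><td>a</td></tr><tr><td>b</td></tr></table>"
print(count_tbody(raw_source))  # prints 0: the source has no explicit tbody
```

If the count is 0 for the page you are scraping, any /tbody step in your query comes from the browser, not from the document.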
Solution 1: Remove /tbody Axis Step
Check whether the table you're stuck at really does not contain a <tbody/> element (see the previous paragraph). If it does, you've probably got another kind of problem.
Now remove the /tbody axis step, so your query will look like
//table[@id="example"]/tr[2]/td[1]
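To see the effect outside a browser, here is a minimal sketch with Python's standard-library `xml.etree.ElementTree`. It rests on a few assumptions: the markup is a made-up, well-formed snippet (real-world HTML usually needs a tolerant parser first), and ElementTree implements only a subset of XPath, so the paths are written relative to the root with a leading `.`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup as the server might send it: no <tbody>.
RAW = """<html><body>
  <table id="example">
    <tr><td>r1c1</td><td>r1c2</td></tr>
    <tr><td>r2c1</td><td>r2c2</td></tr>
  </table>
</body></html>"""

root = ET.fromstring(RAW)

# A path copied from the browser contains a tbody step and matches nothing here.
with_tbody = root.findall(".//table[@id='example']/tbody/tr[2]/td[1]")

# Without that step, the first cell of the second row is found.
without_tbody = root.findall(".//table[@id='example']/tr[2]/td[1]")

print(len(with_tbody))                    # prints 0
print([td.text for td in without_tbody])  # prints ['r2c1']
```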
Solution 2: Skip <tbody/> Tags
This is a rather dirty solution and likely to fail for nested tables (it can jump into inner tables). I would only recommend doing this in very rare cases.
Replace the /tbody axis step with a descendant-or-self step:
//table[@id="example"]//tr[2]/td[1]
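The nested-table pitfall is easy to reproduce. The sketch below again uses Python's standard-library `xml.etree.ElementTree` on made-up, well-formed markup; ElementTree supports only a subset of XPath, but the `//` step and the positional predicate behave as in the query above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the first outer row contains a nested table.
NESTED = """<html><body>
  <table id="example">
    <tr>
      <td>
        <table>
          <tr><td>inner r1</td></tr>
          <tr><td>inner r2</td></tr>
        </table>
      </td>
    </tr>
    <tr><td>outer r2</td></tr>
  </table>
</body></html>"""

root = ET.fromstring(NESTED)

# The descendant step also reaches the second row of the inner table.
loose = [td.text for td in root.findall(".//table[@id='example']//tr[2]/td[1]")]

# The strict child step only considers rows directly below the outer table.
strict = [td.text for td in root.findall(".//table[@id='example']/tr[2]/td[1]")]

print(loose)   # prints ['inner r2', 'outer r2']
print(strict)  # prints ['outer r2']
```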
Solution 3: Allow Both Input With and Without <tbody/> Tags
If you're not sure in advance whether your table contains a <tbody/> element, or you use the query in both "HTML source" and DOM contexts, and you don't want to (or cannot) use the hack from solution 2, provide an alternative query (for XPath 1.0) or use an "optional" axis step (XPath 2.0 and higher).
- XPath 1.0:
//table[@id="example"]/tr[2]/td[1] | //table[@id="example"]/tbody/tr[2]/td[1]
- XPath 2.0:
//table[@id="example"]/(tbody, .)/tr[2]/td[1]
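Some processors support neither form. Python's standard-library `xml.etree.ElementTree`, for instance, implements only a subset of XPath and has no union operator. In that situation the same idea can be approximated by trying both paths in turn. This is a sketch under that assumption, not the union itself: `second_row_first_cell` and both sample documents are hypothetical, and results come back in path order rather than document order.

```python
import xml.etree.ElementTree as ET

# The two alternatives of the XPath 1.0 union, tried one after the other.
PATHS = (
    ".//table[@id='example']/tr[2]/td[1]",
    ".//table[@id='example']/tbody/tr[2]/td[1]",
)


def second_row_first_cell(root):
    """Return matching cells whether or not the table has an explicit tbody."""
    cells = []
    for path in PATHS:
        cells.extend(root.findall(path))
    return cells


# Hypothetical input as found in the raw source (no tbody) ...
source_like = ET.fromstring(
    '<body><table id="example">'
    "<tr><td>a</td></tr><tr><td>b</td></tr>"
    "</table></body>"
)
# ... and as a browser DOM would serialize it (tbody added).
dom_like = ET.fromstring(
    '<body><table id="example"><tbody>'
    "<tr><td>a</td></tr><tr><td>b</td></tr>"
    "</tbody></table></body>"
)

print([td.text for td in second_row_first_cell(source_like)])  # prints ['b']
print([td.text for td in second_row_first_cell(dom_like)])     # prints ['b']
```

A given cell can only match one of the two paths, so no de-duplication is needed.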
Having problems filtering by xPath
If the problem is what I think it is, I've been battered by this one a couple of times. Chrome implicitly adds any missing <tbody> tags to the DOM, so if you then copy the XPath or CSS path, you may also have copied tags that don't necessarily exist in the source document. Try viewing the page's source and see if the DOM reported by your browser's console corresponds to the original source HTML. If the <tbody> tags are absent, be sure to exclude them in your filterXPath() call.
xPath, DomDocument, Scraping table
I am able to get the data you are looking for by using the xPath helper in Chrome in the following manner (these lines are typed into the Chrome console):
All chemicals / first chemical:
> allChemicals = $x("descendant::tr/td[(position() =1)]")
> firstChemical = allChemicals[0].innerText
All links / first link:
> allLinks = $x("descendant::tr/td[(position() =1)]/a")
> firstLink = allLinks[0].href
All parts / first part:
> allParts = $x("descendant::tr/td[(position() =2)]")
> firstPart = allParts[0].innerText
Hope that helps.
Selenium Script Fails even though Xpath , Firebug show the correct element
If Selenium fails to find an element you know is present, the problem is commonly one of synchronization: Selenium tries to access the element too early, before it appears on the page (when you inspect the element even a second later, you can see it, since it has been rendered by then). Try to WAIT for the very same element before doing anything else. Examples of the wait can be found here.
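Selenium's own explicit-wait helpers (such as `WebDriverWait` in the Python bindings) are the usual tool for this. To show what such a wait does under the hood, here is a library-free sketch of the polling idea; `wait_until`, `find_element` and the timing values are hypothetical and are not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page on which the element only shows up on the third lookup.
attempts = {"n": 0}


def find_element():
    attempts["n"] += 1
    return "element" if attempts["n"] >= 3 else None


print(wait_until(find_element, timeout=2.0, poll_interval=0.01))  # prints element
```

With a real driver, `condition` would be a function that looks the element up and returns it, or a falsy value while it is still missing.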
XPath expression returning empty list in scrapy
The <tbody> element is not part of the initial HTML source; it is generated by the browser's parser, so you shouldn't use it in your XPath expression.
You can use link text to match exact element:
//a[text()="One-Day Internationals"]
Can not get Xpath to fetch a nodeList
If I'm right you'd like to get all the titles in that table. I'd suggest an easier, yet more specific XPath query, i.e.
$nodeList = $x->query('//div[@class="detName"]');
See it in action
Trouble with scraping text from site using lxml / xpath()
Try removing '/tbody' from the XPath.
The browser might be adding the 'tbody' tag even though it does not appear in the raw HTML.
Read more here and here.
xpath doesn't work in this website
The following works perfectly in lxml.html (which modern Scrapy uses):
sel.xpath('.//div[@class="info_div"]//td[text()="Ref:"]/following-sibling::td[1]/text()')
Note that I'm using // to get between the div and the td, not laying out the explicit path. I'd have to take a closer look at the document to grok why, but the path given in that area was incorrect.