Hpricot with Firebug's Xpath

The tbody tag in XPath produced by Firebug

To work around this problem, use XPath expressions of the following kind:

 /locStep1/locStep2/.../table/YourSubExpression
|
/locStep1/locStep2/.../table/tbody/YourSubExpression

If the table doesn't have a tbody child, then the second argument of the union operator (|) selects no nodes and the first argument of the union selects the wanted nodes.

Conversely, if the table does have a tbody child, then the first argument of the union operator selects no nodes and the second argument of the union selects the wanted nodes.

The end result: in both cases the wanted nodes are selected.
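For example, here is a minimal sketch of that union pattern with Nokogiri (which has full XPath support); the location steps and file name are placeholders:

require 'nokogiri'

# Select the rows of the second table whether or not the parser
# inserted a tbody element between the table and its rows.
doc = Nokogiri::HTML(File.read('page.html'))
rows = doc.xpath('//body/table[2]/tr | //body/table[2]/tbody/tr')
rows.each { |row| puts row.text }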

HPricot css search: How do I select the parent/ancestor of a particular element using a string selector?

Without seeing the whole page it's hard to give a definitive answer, but often the way you're going about it is the right answer. You have to find a decent landmark, then navigate from there, and if it involves backing up the chain then that's what you do.

You might be able to use XPath to find the table then look inside it for the link, but that doesn't really improve things, it only changes them. Firebug, the Firefox plugin, makes it easy to get the XPath to an element in the page, so you could find the table in question and have Firebug show you the path, or just copy it by right-clicking on the node in the XPath display, and paste that into your lookup.

"It is ugly", well, maybe, but not all code is beautiful or elegant because not all problems lend themselves to beautiful and/or elegant solutions. Sometimes we have to be happy with "it works". As long as it works reliably and you know why then you're ahead of many other coders.

"... what if the folks who maintain the web page remove the tbody?", almost all parsing of HTML or XML suffers from the same concern because we're not in control of the source. You write your code as best as you can, comment the spots that are likely to fail if content changes, then cross your fingers and move on. Even if you were parsing tabular data from a TPS report you could run into the same problem.

The only thing I'd suggest doing differently is to use % (AKA "at") instead of / (AKA search). % returns only the first occurrence, so you can drop the [0] index.

(page%"a[@name=a1]").parent.parent.parent.parent.parent

or

page%'//a[@name="a1"]/../../../../../..'

which uses the XPath engine to step back up the chain. That should be a little faster if speed is a consideration.

If you know that the target table is the only one with that width and height, you can use a more specific xpath:

page%'//table[@height=61 and @width=700]'
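Combining the two ideas, you can locate the table by its attributes and then search inside it for the anchor, which avoids the long parent chain. A sketch, reusing the selectors from this example:

table = page % '//table[@height=61 and @width=700]'
# Search within the matched table for the named anchor instead of
# walking back up from the anchor.
link = table % "a[@name='a1']" if table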

I recommend Nokogiri over Hpricot.


You can also use XPath from the top of the document down:

irb(main):039:0> print (doc/'//body/table[2]/tr/td[2]/table[2]').to_html[0..100]
<table height="61" width="700"><tbody>
<tr><td width="700" colspan="7" align="center"> <font size="3p=> nil

Basically the XPath pattern means:

Find the body tag, then the third table, then its row's third cell. In the cell locate the third table.

Note: Firefox automatically adds the <tbody> tag to the source, even if it wasn't there in the HTML file received. That can really mess you up trying to use Firefox to view the source to develop your own XPaths.

The other table you are after is /html/body/table[2]/tbody/tr/td[2]/table[3] according to Firefox so you have to strip the tbody. Also you don't need to anchor at /html.
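With the tbody stripped and the /html anchor dropped, the lookup for that other table might look like this (a sketch based on the path above):

# The Firebug path with tbody removed and the /html prefix dropped.
other_table = doc % '//body/table[2]/tr/td[2]/table[3]'
puts other_table.to_html[0..100] if other_table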

XPath and Hpricot -- works on some machines, not others?

Since I couldn't find any enlightenment here or elsewhere, I switched from Hpricot to Nokogiri and it works flawlessly across all machines now. The APIs are almost exactly compatible, so it took less than 10 minutes to switch over. Also, I get the feeling that Nokogiri is more actively maintained, although it does have a dependency on libxml2.

Failing to extract html table rows

The HTML document in question is invalid. (See http://validator.w3.org/check?uri=http%3A%2F%2Fwww.utm.utoronto.ca%2F7800.0.html.) Hpricot parses it differently than your browser does (hence the different results), but it can't really be blamed. Until HTML5, there was no standard for how to parse invalid HTML documents.

I tried replacing Hpricot with Nokogiri and it seems to give the expected parse. Code:

require 'open-uri'
require 'nokogiri'

faculty = Nokogiri.HTML(open("http://www.utm.utoronto.ca/7800.0.html"))

faculty.search("/html/body/center/table/tr").each do |text|
  puts text
end

Maybe you should switch?

Xpath to url for import.io

If you create an API from URL 2.0 and reload the website with JS on but CSS off, you should be able to see the collapsible menu:

The DOM on this website is constructed in such a way that the odd rows hold the job titles, whereas more information about each job is hidden in the even rows. For that we can use XPath's position() function, so you can use the following XPath in manual row training:

/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[position() mod 2 = 0]

This highlights only the "more information" boxes, giving you access to the data inside. From here you can simply target the specific attributes of the elements that carry the title and link.

Link xpath: .//a[@class='forward jobadview']/@href
Title xpath: .//div[@class='info']//h3
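Outside of import.io, the same even-row expression works with an ordinary parser. Here is a minimal Nokogiri sketch; the URL is a placeholder and the container path is taken from the expression above:

require 'open-uri'
require 'nokogiri'

# Placeholder URL; substitute the actual job-listing page.
doc = Nokogiri::HTML(open('http://example.com/jobs'))

# Even rows carry the extra job information, per the expression above.
doc.xpath('/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[position() mod 2 = 0]').each do |row|
  link  = row.at_xpath(".//a[@class='forward jobadview']/@href")
  title = row.at_xpath(".//div[@class='info']//h3")
  puts "#{title && title.text.strip}: #{link && link.value}"
end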

Having said that, due to the heavy use of JS on the website it may fail to publish, so we have created an API for you to query, and you can retrieve the same data using that here:

https://import.io/data/mine/?id=0626d49d-5233-469d-9429-707f73f1757a

Xpath - How to select related cousin data

I'm not sure that this is the best solution, but you might try

//th[not(.="bbb") and not(.="ddd") and not(.="eee")] | //tr[2]/td[not(position()=index-of(//th, "bbb")) and not(position()=index-of(//th, "ddd")) and not(position()=index-of(//th, "eee"))]

or a shorter version

//th[not(.=("bbb", "ddd", "eee"))]| //tr[2]/td[not(position()=(index-of(//th, "bbb"), index-of(//th, "ddd"),index-of(//th, "eee")))]

that returns

<th>aaa</th>
<th>ccc</th>
<th>fff</th>
<td>111</td>
<td>333</td>
<td>666</td>

You can avoid complicated XPath expressions and still get the required output. Try using Python + Selenium instead:

# Get list of th elements
th_elements = driver.find_elements_by_xpath('//th')
# Get list of td elements
td_elements = driver.find_elements_by_xpath('//tr[2]/td')
# Get indexes of required th elements - [0, 2, 5]
ok_index = [th_elements.index(i) for i in th_elements if i.text not in ('bbb', 'ddd', 'eee')]
for i in ok_index:
    print(th_elements[i].text)
for i in ok_index:
    print(td_elements[i].text)

Output is

'aaa'
'ccc'
'fff'
'111'
'333'
'666'

If you need an XPath 1.0 solution:

//th[not(.=("bbb", "ddd", "eee"))]| //tr[2]/td[not(position()=(count(//th[.="bbb"]/preceding-sibling::th)+1, count(//th[.="ddd"]/preceding-sibling::th)+1, count(//th[.="eee"]/preceding-sibling::th)+1))]
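Since most of this page is Ruby, here is the same index-filtering idea as a Nokogiri sketch (assuming the table markup is already loaded into an html string):

require 'nokogiri'

doc = Nokogiri::HTML(html)

headers = doc.xpath('//th')
cells   = doc.xpath('//tr[2]/td')

# Keep the positions whose header text is not in the exclusion list.
wanted = (0...headers.length).select { |i| !%w[bbb ddd eee].include?(headers[i].text) }

wanted.each { |i| puts headers[i].text }
wanted.each { |i| puts cells[i].text }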

Html / Script Scraping Google Map using Hpricot (Ruby On Rails)

This was a fun one. It can be done, but it's going to take more than Hpricot. I noticed while sniffing that a webservice is being called to populate the latitude and longitude. Here's what you can do to get to that information:

Scrape the site like you're normally doing, but look for a call to the LoadMapByDetail JavaScript function. The line will look something like:

<script type='text/javascript'>LoadMapByDetail(1668154, 0, 1)</script>
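Pulling the id out of that script line might look like this (a sketch, assuming the page has already been loaded into an Hpricot page object):

# Grab the first argument of the LoadMapByDetail call.
detail_id = page.to_html[/LoadMapByDetail\((\d+)/, 1]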

Parse the id out as shown above and call the webservice. This will end up looking something like:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'soap/wsdlDriver'

WSDL_URL="http://yellowpages.com.mt/Web_Service/SearchMap.asmx?WSDL"
soap = SOAP::WSDLDriverFactory.new(WSDL_URL).create_rpc_driver
response = soap.GetCoordByDetail(:mainDetailID => '1668154', :type => '1')
soap.reset_stream
response.getCoordByDetailResult.anyType.each { |x| puts x.anyType }

You see the latitude and longitude in the output:

35.88805
14.46627

Hope this helps. Good luck!


