Parse Table Using Nokogiri

How to parse a HTML table with Nokogiri?

The crux of the problem is that calling #text on a set of matched nodes returns the concatenation of the #text of each individual element.

Let's examine what each step does:

# Finds all <table>s with class open
# I'm assuming you have only one <table> so
# you don't actually have to loop through
# all tables, instead you can just operate
# on the first one. If that is not the case,
# you can use a loop the way you did
tables = doc.css('table.open')

# The text of all <th>s in the first <tr> of the table
title = table.css('tr[1] > th').text

# The text of all <td>s in all <tr>s in the table
# You obviously wanted just the <td>s in one <tr>
cell_data = table.css('tr > td').text

# The text of all <th>s in all <tr>s in the table
# You obviously wanted just the <th>s in one <tr>
raw_name = table.css('tr > th').text

Now that we know what is wrong, here is a possible solution:

require 'nokogiri'

html = <<EOT
<table class="open">
<tr>
<th>Table name</th>
<th>Column name 1</th>
<th>Column name 2</th>
<th>Column name 3</th>
<th>Column name 4</th>
<th>Column name 5</th>
</tr>
<tr>
<th>Raw name 1</th>
<td>1001</td>
<td>1002</td>
<td>1003</td>
<td>1004</td>
<td>1005</td>
</tr>
<tr>
<th>Raw name 2</th>
<td>2001</td>
<td>2002</td>
<td>2003</td>
<td>2004</td>
<td>2005</td>
</tr>
<tr>
<th>Raw name 3</th>
<td>3001</td>
<td>3002</td>
<td>3003</td>
<td>3004</td>
<td>3005</td>
</tr>
</table>
EOT

doc = Nokogiri::HTML(html, nil, 'UTF-8')

# Fetches only the first <table>. If you have
# more than one, you can loop the way you
# originally did.
table = doc.css('table.open').first

# Fetches all rows (<tr>s)
rows = table.css('tr')

# The column names are the first row (shift returns
# the first element and removes it from the array).
# On that row we get the text of each individual <th>
# This will be Table name, Column name 1, Column name 2...
column_names = rows.shift.css('th').map(&:text)

# On each of the remaining rows
text_all_rows = rows.map do |row|

  # We get the name (<th>)
  # On the first row this will be Raw name 1
  # on the second - Raw name 2, etc.
  row_name = row.css('th').text

  # We get the text of each individual value (<td>)
  # On the first row this will be 1001, 1002, 1003...
  # on the second - 2001, 2002, 2003... etc
  row_values = row.css('td').map(&:text)

  # We return the name, followed by all the values
  [row_name, *row_values]
end

p column_names # => ["Table name", "Column name 1", "Column name 2",
# "Column name 3", "Column name 4", "Column name 5"]
p text_all_rows # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"],
# ["Raw name 2", "2001", "2002", "2003", "2004", "2005"],
# ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]]

# If you want to combine them
text_all_rows.each do |row_as_text|
  p column_names.zip(row_as_text).to_h
end # =>
# {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"}
# {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"}
# {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"}

How do I parse an HTML table with Nokogiri?

#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
(The HTML from the question goes here)
EOS

doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
  detail = {}
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp details

# => [{:time=>"23:35",
# => :title=>"Vb4 Gold Released",
# => :number=>"24",
# => :date=>"06 Jan 2010",
# => :views=>"1,320",
# => :name=>"Paul M"}]

Nokogiri: parse, extract and return tr content in HTML table

This way:

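require 'open-uri'
require 'nokogiri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
data = Nokogiri::HTML(URI.open(url))
rows = data.css("td[valign='top'] table tr") # All the <tr>this is a line</tr>
rows.each do |row|
  puts row.text # Will print all the 'this is a line'
end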

Nokogiri parse HTML table in Ruby

  1. The HTML is flawed: more than one element has the same id attribute: <tr onclick="colorize(this);" id="item_2" tcolor=""> <td class="firmname" id="item_2">.
  2. The id of the table element in the HTML is "gvResult" while in the Ruby code you are asking Nokogiri to look for a table where "id=gvResultTable".
  3. Nokogiri uses UTF-8 encoding to store strings internally, so you shouldn't have any problem with Russian characters.

Provided the HTML can be fixed, this works fine:

HTML:

<table id="gvResult">
<tbody>
<tr id="item_1">
<td class="firmname">Example1</td>
<td class="price">42.00</td>
</tr>
<tr id="item_2">
<td class="firmname">Example2</td>
<td class="price">24.00</td>
</tr>
</tbody>
</table>

Ruby:

require 'open-uri'
require 'nokogiri'
require 'pp'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
html = URI.open('http://www.example.com/page')

doc = Nokogiri::HTML(html)
doc.encoding = 'utf-8'

rows = doc.search('//tr[starts-with(@id, "item_")]')
@details = rows.collect do |row|
  detail = {}
  [
    [:firmname, 'td[1]/text()'],
    [:price, 'td[2]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp @details

I presumed that you want the data from all the tr elements with an id like "item_\d+", hence I used doc.search('//tr[starts-with(@id, "item_")]'). Change it to suit your needs.

Parse table using Nokogiri

Use:

td//text()[normalize-space()]

This selects all non-white-space-only text node descendants of any td child of the current node (the tr already selected in your code).

Or, if you want to select all text-node descendants, regardless of whether they are white-space-only or not:

td//text()

UPDATE:

The OP has signaled in a comment that he is getting an unwanted td whose content is just '&nbsp;' (a non-breaking space, U+00A0).

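To also exclude tds whose content consists only of (one or more) non-breaking-space characters, use the following, written as a Ruby double-quoted string so that \u00A0 becomes the actual U+00A0 character rather than an ordinary space:

"td//text()[translate(normalize-space(), '\u00A0', '')]"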

How parse a table and extract data for last 6 months Nokogiri

Generally, it's helpful if you can post some example HTML rather than a screenshot of the page, particularly as this task is about parsing HTML.

Why do you need to check the date beforehand? Nokogiri is pretty fast, and I can't imagine the table is so big that checking as you parse will be useful. Having reviewed the Nokogiri docs, I can't see any way to do what you're describing. You'll need to grab the data from the table, and then reject any rows that have a date older than six months.
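A minimal sketch of the "grab everything, then reject old rows" step. It assumes the rows have already been extracted as [date_text, value] pairs and that the dates look like "06 Jan 2010"; both the sample data and the helper name last_six_months are hypothetical, since the question's HTML was not posted:

```ruby
require 'date'

# Hypothetical rows standing in for data already pulled out of the table.
rows = [
  ['06 Jan 2010', 'old entry'],
  ['15 Mar 2024', 'recent entry'],
  ['01 Jun 2024', 'newest entry']
]

# Keeps the rows dated on or after the day six months before `today`.
def last_six_months(rows, today: Date.today)
  cutoff = today << 6 # Date#<< steps back by whole months
  rows.select { |date_text, _| Date.strptime(date_text, '%d %b %Y') >= cutoff }
end

# A fixed `today` keeps the example reproducible.
recent = last_six_months(rows, today: Date.new(2024, 6, 30))
p recent  # => [["15 Mar 2024", "recent entry"], ["01 Jun 2024", "newest entry"]]
```

In real use you would leave out the today: argument so the cutoff follows the current date, and adjust the strptime format to match the page.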

How to parse TABLE text with Nokogiri?

If you want all text in the cell, hyperlinked or not:

doc.xpath('//td[1]').each do |cell|
  puts cell.text.strip
end

Note: in a valid HTML document, a td will always be within a table and a tr. If you don't have any other selector requirements, you can simplify as above.

Parse a HTML table using Ruby, Nokogiri omitting the column headers

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::HTML(x)
rows = doc.xpath('//table/tbody/tr[position() > 1]')

# OR

rows = doc.xpath("//table/tbody/tr")
header = rows.shift

After you've run either one of the two snippets above, rows will contain every <tr>...</tr> after the first one. For example, puts rows.to_xml prints the following:

<tr><td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>

To get the inner text, removing all the html tags, run puts rows.text

ding
dong
ling

To get the inner text of the td tags only, run rows.xpath('td').map {|td| td.text }

["ding", "dong", "ling"]

Parsing links from a table using Nokogiri

A solution using Nokogiri's #css selectors (probably not the most efficient way, but it works):

require 'open-uri'
require 'nokogiri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
page = Nokogiri::HTML(URI.open("http://en.wikipedia.org/w/index.php?title=Katie_Holmes&action=edit&section=10"))
puts page.css('span.mw-headline#Filmography').text

page.css('table').each do |tab|
  if tab.css('caption').text == "Film"
    tab.css('th').css('a').each do |a|
      puts "Title: #{a['title']} URL:#{a['href']}"
    end
  end
end

#=> Filmography
#=> Title: The Ice Storm (film) URL:/wiki/The_Ice_Storm_(film)
#=> Title: Disturbing Behavior URL:/wiki/Disturbing_Behavior
#=> .....So on

