How to parse a HTML table with Nokogiri?
The key to the problem is that calling #text on multiple results returns the concatenation of the #text of each individual element.
Let's examine what each step does:
# Finds all <table>s with class open
# I'm assuming you have only one <table> so
# you don't actually have to loop through
# all tables, instead you can just operate
# on the first one. If that is not the case,
# you can use a loop the way you did
tables = doc.css('table.open')
# The text of all <th>s in the first <tr> of the table
title = table.css('tr[1] > th').text
# The text of all <td>s in all <tr>s in the table
# You obviously wanted just the <td>s in one <tr>
cell_data = table.css('tr > td').text
# The text of all <th>s in all <tr>s in the table
# You obviously wanted just the <th>s in one <tr>
raw_name = table.css('tr > th').text
Now that we know what is wrong, here is a possible solution:
require 'nokogiri'
html = <<EOT
<table class="open">
<tr>
<th>Table name</th>
<th>Column name 1</th>
<th>Column name 2</th>
<th>Column name 3</th>
<th>Column name 4</th>
<th>Column name 5</th>
</tr>
<tr>
<th>Raw name 1</th>
<td>1001</td>
<td>1002</td>
<td>1003</td>
<td>1004</td>
<td>1005</td>
</tr>
<tr>
<th>Raw name 2</th>
<td>2001</td>
<td>2002</td>
<td>2003</td>
<td>2004</td>
<td>2005</td>
</tr>
<tr>
<th>Raw name 3</th>
<td>3001</td>
<td>3002</td>
<td>3003</td>
<td>3004</td>
<td>3005</td>
</tr>
</table>
EOT
doc = Nokogiri::HTML(html, nil, 'UTF-8')
# Fetches only the first <table>. If you have
# more than one, you can loop the way you
# originally did.
table = doc.css('table.open').first
# Fetches all rows (<tr>s)
rows = table.css('tr')
# The column names are the first row (shift returns
# the first element and removes it from the array).
# On that row we get the text of each individual <th>
# This will be Table name, Column name 1, Column name 2...
column_names = rows.shift.css('th').map(&:text)
# On each of the remaining rows
text_all_rows = rows.map do |row|
# We get the name (<th>)
# On the first row this will be Raw name 1
# on the second - Raw name 2, etc.
row_name = row.css('th').text
# We get the text of each individual value (<td>)
# On the first row this will be 1001, 1002, 1003...
# on the second - 2001, 2002, 2003... etc
row_values = row.css('td').map(&:text)
# We map the name, followed by all the values
[row_name, *row_values]
end
p column_names # => ["Table name", "Column name 1", "Column name 2",
# "Column name 3", "Column name 4", "Column name 5"]
p text_all_rows # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"],
# ["Raw name 2", "2001", "2002", "2003", "2004", "2005"],
# ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]]
# If you want to combine them
text_all_rows.each do |row_as_text|
p column_names.zip(row_as_text).to_h
end # =>
# {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"}
# {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"}
# {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"}
How do I parse an HTML table with Nokogiri?
#!/usr/bin/ruby1.8
require 'nokogiri'
require 'pp'
html = <<-EOS
(The HTML from the question goes here)
EOS
doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
detail = {}
[
[:title, 'td[3]/div[1]/a/text()'],
[:name, 'td[3]/div[2]/span/a/text()'],
[:date, 'td[4]/text()'],
[:time, 'td[4]/span/text()'],
[:number, 'td[5]/a/text()'],
[:views, 'td[6]/text()'],
].each do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
pp details
# => [{:time=>"23:35",
# => :title=>"Vb4 Gold Released",
# => :number=>"24",
# => :date=>"06 Jan 2010",
# => :views=>"1,320",
# => :name=>"Paul M"}]
Nokogiri: parse, extract and return tr content in HTML table
This way:
require 'open-uri'
require 'nokogiri'
data = Nokogiri::HTML(URI.open(url))
rows = data.css("td[valign='top'] table tr") # All the <tr>this is a line</tr>
rows.each do |row|
puts row.text # Will print all the 'this is a line'
end
Nokogiri parse HTML table in Ruby
- The HTML is flawed: more than one element has the same id attribute: <tr onclick="colorize(this);" id="item_2" tcolor=""> <td class="firmname" id="item_2">.
- The id of the table element in the HTML is "gvResult", while in the Ruby code you are asking Nokogiri to look for a table where "id=gvResultTable".
- Nokogiri uses UTF-8 encoding to store strings internally, so you shouldn't have any problem with Russian characters.
Provided the HTML can be fixed, this works fine:
HTML:
<table id="gvResult">
<tbody>
<tr id="item_1">
<td class="firmname">Example1</td>
<td class="price">42.00</td>
</tr>
<tr id="item_2">
<td class="firmname">Example2</td>
<td class="price">24.00</td>
</tr>
</tbody>
</table>
Ruby:
require 'rubygems'
require 'nokogiri'
require 'pp'
require 'open-uri'
html = URI.open('http://www.example.com/page')
doc = Nokogiri::HTML(html)
doc.encoding = 'utf-8'
rows = doc.search('//tr[starts-with(@id, "item_")]')
@details = rows.collect do |row|
detail = {}
[
[:firmname, 'td[1]/text()'],
[:price, 'td[2]/text()'],
].each do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
pp @details
I presumed that you want to get the data from all the tr elements with an id like "item_\d+", hence I used doc.search('//tr[starts-with(@id, "item_")]'). Change it to suit your needs.
Parse table using Nokogiri
Use:
td//text()[normalize-space()]
This selects all non-white-space-only text node descendants of any td child of the current node (the tr already selected in your code).
Or, if you want to select all text-node descendants, regardless of whether they are white-space-only or not:
td//text()
UPDATE:
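The OP has signaled in a comment that he is getting an unwanted td whose content is just a non-breaking space (&nbsp;, U+00A0).
To also exclude text nodes that consist only of one or more non-breaking space characters, use the following expression, in which the first string argument of translate() is a single non-breaking space character rather than an ordinary space: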
td//text()[translate(normalize-space(), ' ', '')]
How parse a table and extract data for last 6 months Nokogiri
Generally, it's helpful if you can post some example HTML rather than a screenshot of the page. Particularly as this task is about parsing HTML.
Why do you need to check the date beforehand? Nokogiri is pretty fast, and I can't imagine the table is so big that checking as you parse will be useful. Having reviewed the Nokogiri docs, I can't see any way to do what you're describing. You'll need to grab the data from the table, and then reject any rows that have a date older than six months.
How to parse TABLE text with Nokogiri?
If you want all text in the cell, hyperlinked or not:
doc.xpath('//td[1]').each do |cell|
puts cell.text.strip
end
Note: in a valid HTML document, a td will always be within a table and a tr. If you don't have any other selector requirements, you can simplify as above.
Parse a HTML table using Ruby, Nokogiri omitting the column headers
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(x)
rows = doc.xpath('//table/tbody/tr[position() > 1]')
# OR
rows = doc.xpath("//table/tbody/tr")
header = rows.shift
After you've run either one of the above 2 snippets, rows will contain every <tr>...</tr> after the first one. For example, puts rows.to_xml prints the following:
<tr><td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
To get the inner text, removing all the html tags, run puts rows.text
ding
dong
ling
To get the inner text of the td tags only, run rows.xpath('td').map {|td| td.text }
["ding", "dong", "ling"]
Parsing links from a table using Nokogiri
A solution using Nokogiri's css selectors (probably not the most efficient way, but it works):
require 'open-uri'
require 'nokogiri'
page = Nokogiri::HTML(URI.open("http://en.wikipedia.org/w/index.php?title=Katie_Holmes&action=edit&section=10"))
puts page.css('span.mw-headline#Filmography').text
page.css('table').each do |tab|
if tab.css('caption').text == "Film"
tab.css('th').css('a').each do |a|
puts "Title: #{a['title']} URL:#{a['href']}"
end
end
end
#=> Filmography
#=> Title: The Ice Storm (film) URL:/wiki/The_Ice_Storm_(film)
#=> Title: Disturbing Behavior URL:/wiki/Disturbing_Behavior
#=> .....So on