How to Parse an HTML Table with Nokogiri

How to parse an HTML table with Nokogiri?

The crux of the problem is that calling #text on a result set containing multiple elements returns the concatenation of the #text of each individual element.

Let's examine what each step does:

# Finds all <table>s with class open
# I'm assuming you have only one <table> so
# you don't actually have to loop through
# all tables, instead you can just operate
# on the first one. If that is not the case,
# you can use a loop the way you did
tables = doc.css('table.open')

# The text of all <th>s in the first <tr> of the table
title = table.css('tr[1] > th').text

# The text of all <td>s in all <tr>s in the table
# You obviously wanted just the <td>s in one <tr>
cell_data = table.css('tr > td').text

# The text of all <th>s in all <tr>s in the table
# You obviously wanted just the <th>s in one <tr>
raw_name = table.css('tr > th').text

Now that we know what is wrong, here is a possible solution:

require 'nokogiri'

html = <<EOT
<table class="open">
<tr>
<th>Table name</th>
<th>Column name 1</th>
<th>Column name 2</th>
<th>Column name 3</th>
<th>Column name 4</th>
<th>Column name 5</th>
</tr>
<tr>
<th>Raw name 1</th>
<td>1001</td>
<td>1002</td>
<td>1003</td>
<td>1004</td>
<td>1005</td>
</tr>
<tr>
<th>Raw name 2</th>
<td>2001</td>
<td>2002</td>
<td>2003</td>
<td>2004</td>
<td>2005</td>
</tr>
<tr>
<th>Raw name 3</th>
<td>3001</td>
<td>3002</td>
<td>3003</td>
<td>3004</td>
<td>3005</td>
</tr>
</table>
EOT

doc = Nokogiri::HTML(html, nil, 'UTF-8')

# Fetches only the first <table>. If you have
# more than one, you can loop the way you
# originally did.
table = doc.css('table.open').first

# Fetches all rows (<tr>s)
rows = table.css('tr')

# The column names are the first row (shift returns
# the first element and removes it from the array).
# On that row we get the text of each individual <th>
# This will be Table name, Column name 1, Column name 2...
column_names = rows.shift.css('th').map(&:text)

# On each of the remaining rows
text_all_rows = rows.map do |row|

  # We get the name (<th>)
  # On the first row this will be Raw name 1
  # on the second - Raw name 2, etc.
  row_name = row.css('th').text

  # We get the text of each individual value (<td>)
  # On the first row this will be 1001, 1002, 1003...
  # on the second - 2001, 2002, 2003... etc
  row_values = row.css('td').map(&:text)

  # We map the name, followed by all the values
  [row_name, *row_values]
end

p column_names # => ["Table name", "Column name 1", "Column name 2",
# "Column name 3", "Column name 4", "Column name 5"]
p text_all_rows # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"],
# ["Raw name 2", "2001", "2002", "2003", "2004", "2005"],
# ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]]

# If you want to combine them
text_all_rows.each do |row_as_text|
  p column_names.zip(row_as_text).to_h
end # =>
# {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"}
# {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"}
# {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"}

Nokogiri: parse, extract and return tr content in HTML table

This way:

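require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
data = Nokogiri::HTML(URI.open(url))
rows = data.css("td[valign='top'] table tr") # All the <tr>this is a line</tr>
rows.each do |row|
  puts row.text # Will print all the 'this is a line'
end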

Nokogiri parse HTML table in Ruby

  1. The HTML is flawed: more than one element has the same id attribute, e.g. <tr onclick="colorize(this);" id="item_2" tcolor=""> and <td class="firmname" id="item_2">.
  2. The id of the table element in the HTML is "gvResult" while in the Ruby code you are asking Nokogiri to look for a table where "id=gvResultTable".
  3. Nokogiri uses UTF-8 encoding to store strings internally, so you shouldn't have any problem with Russian characters.

Provided the HTML can be fixed, this works fine:

HTML:

<table id="gvResult">
<tbody>
<tr id="item_1">
<td class="firmname">Example1</td>
<td class="price">42.00</td>
</tr>
<tr id="item_2">
<td class="firmname">Example2</td>
<td class="price">24.00</td>
</tr>
</tbody>
</table>

Ruby:

require 'nokogiri'
require 'open-uri'
require 'pp'

# URI.open (from open-uri) is needed to fetch a URL on Ruby 3.0+
html = URI.open('http://www.example.com/page')

doc = Nokogiri::HTML(html)
doc.encoding = 'utf-8'

rows = doc.search('//tr[starts-with(@id, "item_")]')
@details = rows.collect do |row|
  detail = {}
  [
    [:firmname, 'td[1]/text()'],
    [:price, 'td[2]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp @details

I presumed that you want to get the data from all the tr elements with an id like "item_\d+" hence I used doc.search('//tr[starts-with(@id, "item_")]'). Change it to suit your needs.

How do I parse an HTML table with Nokogiri?

#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
(The HTML from the question goes here)
EOS

doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
  detail = {}
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp details

# => [{:time=>"23:35",
# => :title=>"Vb4 Gold Released",
# => :number=>"24",
# => :date=>"06 Jan 2010",
# => :views=>"1,320",
# => :name=>"Paul M"}]

Ruby - nokogiri - parse only specific html table

Have you tried the XPath //table[2]/tr/td to get the second table? If you can change the source of the HTML, the best solution would be to give your tables id attributes.

How to parse a table and extract data for the last 6 months with Nokogiri

Generally, it's more helpful to post some example HTML rather than a screenshot of the page, particularly as this task is about parsing HTML.

Why do you need to check the date beforehand? Nokogiri is pretty fast, and I can't imagine the table is so big that checking as you parse will be useful. Having reviewed the Nokogiri docs, I can't see any way to do what you're describing. You'll need to grab the data from the table, and then reject any rows that have a date older than six months.

Parsing a table with Nokogiri

If you look at the actual HTML returned when you visit that page, you'll see that the table is empty and its contents are loaded dynamically by JavaScript. Because of that, you can't do what you want with Nokogiri alone. You'll need either a tool that controls a real browser (or emulates one with JavaScript support) so that the page fully loads before you read its contents, or you'll need to work out which URL the page loads the table data from and see whether you can access it directly (this may not be possible).

Parse a HTML table using Ruby, Nokogiri omitting the column headers

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::HTML(x)
rows = doc.xpath('//table/tbody/tr[position() > 1]')

# OR

rows = doc.xpath("//table/tbody/tr")
header = rows.shift

After you've run either of the two snippets above, rows will contain every <tr>...</tr> after the first one. For example, puts rows.to_xml prints the following:

<tr><td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>

To get the inner text with all the HTML tags removed, run puts rows.text:

ding
dong
ling

To get the inner text of the td tags only, run rows.xpath('td').map {|td| td.text }

["ding", "dong", "ling"]

How do I parse a plain HTML table with Nokogiri?

As a quick and dirty first pass I'd do:

html = <<EOT
<div id="__DailyStat__">
<table>
<tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
<tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
<tr class="blr">
<td>3</td>
<td>455</td>
<td>34</td>
<td class="r">3454</td>
<td class="r">5656</td>
<td class="r">3</td>
</tr>

<tr class="bla">
<td>1</td>
<td>1300</td>
<td>3664</td>
<td class="r">3545</td>
<td class="r">1000</td>
<td class="r">10</td>
</tr>

<tr class="blr">
<td>10</td>
<td>100000</td>
<td>3444</td>
<td class="r">3411</td>
<td class="r">36223</td>
<td class="r">15</td>
</tr>
</table>
</div>
EOT

# Today Yesterday
# Qnty Size Length Length Size Qnty
# 3 455 34 3454 5656 3
# 1 1300 3664 3545 1000 10
# 10 100000 3444 3411 36223 15

require 'nokogiri'

doc = Nokogiri::HTML(html)

Use CSS to find the start of the table, and define some places to hold the data we're capturing:

table = doc.at('div#__DailyStat__ table')

today_data = []
yesterday_data = []

Loop over the rows in the table, rejecting the headers:

table.search('tr').each do |tr|

  next if (tr['class'] == 'blh')

Initialize arrays to capture the pertinent data from each row, then selectively push the data into the appropriate array:

  today_td_data     = [ 'Today'     ]
  yesterday_td_data = [ 'Yesterday' ]

  tr.search('td').each do |td|
    if (td['class'] == 'r')
      yesterday_td_data << td.text.to_i
    else
      today_td_data << td.text.to_i
    end
  end

  today_data << today_td_data
  yesterday_data << yesterday_td_data

end

And output the data:

puts today_data.map{ |a| a.join(',') }
puts yesterday_data.map{ |a| a.join(',') }

> Today,3,455,34
> Today,1,1300,3664
> Today,10,100000,3444
> Yesterday,3454,5656,3
> Yesterday,3545,1000,10
> Yesterday,3411,36223,15

Just to help you visualize what's going on: at the exit from the "tr" loop, the today_data and yesterday_data arrays are arrays-of-arrays looking like:

[["Today", 3, 455, 34], ["Today", 1, 1300, 3664], ["Today", 10, 100000, 3444]]

Alternatively, instead of looping over the "td" tags and sensing the class for the tag, I could have grabbed the contents of the "tr" and then used scan to grab the numbers and sliced the resulting array into "today" and "yesterday" arrays:

  tr_data = tr.text.scan(/\d+/).map{ |i| i.to_i }

  today_td_data = [ 'Today', *tr_data[0, 3] ]
  yesterday_td_data = [ 'Yesterday', *tr_data[3, 3] ]

In real-world development, like at work, I'd use that instead of what I first wrote because it's succinct.

And notice that I didn't use XPath. It's very doable in Nokogiri to use XPath and accomplish this, but for simplicity I prefer CSS accessors. XPath would have allowed accessing individual "td" tag contents, but it also would begin to look like line-noise, which is something we want to avoid when writing code, because it impacts maintenance. I could also have used CSS to drill down to the correct "td" tags like 'tr td.r', but I don't think it would improve the code, it would just be an alternate way of doing it.


