How to Parse a HTML Table with Nokogiri

How to parse a HTML table with Nokogiri?

The key to the problem is that calling #text on a set of multiple results returns the concatenation of the #text of each individual element.

Let's examine what each step of your code does:

# Finds all <table>s with class open
# I'm assuming you have only one <table> so
#  you don't actually have to loop through
#  all tables, instead you can just operate
#  on the first one. If that is not the case,
#  you can use a loop the way you did
tables = doc.css('table.open')

# The text of all <th>s in the first <tr> of the table
title = table.css('tr[1] > th').text

# The text of all <td>s in all <tr>s in the table
# You obviously wanted just the <td>s in one <tr>
cell_data = table.css('tr > td').text

# The text of all <th>s in all <tr>s in the table
# You obviously wanted just the <th>s in one <tr>
raw_name = table.css('tr > th').text

Now that we know what is wrong, here is a possible solution:

require 'nokogiri'

html = <<EOT
    <table class="open">
        <tr>
            <th>Table name</th>
            <th>Column name 1</th>
            <th>Column name 2</th>
            <th>Column name 3</th>
            <th>Column name 4</th>
            <th>Column name 5</th>
        </tr>
        <tr>
            <th>Raw name 1</th>
            <td>1001</td>
            <td>1002</td>
            <td>1003</td>
            <td>1004</td>
            <td>1005</td>         
        </tr>
        <tr>
            <th>Raw name 2</th>
            <td>2001</td>
            <td>2002</td>
            <td>2003</td>
            <td>2004</td>
            <td>2005</td>         
        </tr>
        <tr>
            <th>Raw name 3</th>
            <td>3001</td>
            <td>3002</td>
            <td>3003</td>
            <td>3004</td>
            <td>3005</td>         
        </tr>
    </table>
EOT

doc = Nokogiri::HTML(html, nil, 'UTF-8')

# Fetches only the first <table>. If you have
#  more than one, you can loop the way you
#  originally did.
table = doc.css('table.open').first

# Fetches all rows (<tr>s)
rows = table.css('tr')

# The column names are the first row (shift returns
#  the first element and removes it from the array).
# On that row we get the text of each individual <th>
# This will be Table name, Column name 1, Column name 2...
column_names = rows.shift.css('th').map(&:text)

# On each of the remaining rows
text_all_rows = rows.map do |row|

  # We get the name (<th>)
  # On the first row this will be Raw name 1
  #  on the second - Raw name 2, etc.
  row_name = row.css('th').text

  # We get the text of each individual value (<td>)
  # On the first row this will be 1001, 1002, 1003...
  #  on the second - 2001, 2002, 2003... etc
  row_values = row.css('td').map(&:text)

  # We map the name, followed by all the values
  [row_name, *row_values]
end

p column_names  # => ["Table name", "Column name 1", "Column name 2",
                #     "Column name 3", "Column name 4", "Column name 5"]
p text_all_rows # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"],
                #     ["Raw name 2", "2001", "2002", "2003", "2004", "2005"],
                #     ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]]

# If you want to combine them
text_all_rows.each do |row_as_text|
  p column_names.zip(row_as_text).to_h
end # =>
    # {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"}
    # {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"}
    # {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"}

Nokogiri: parse, extract and return tr content in HTML table

This way:

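require 'nokogiri'
require 'open-uri'

# url holds the address of the page; URI.open needs open-uri
data = Nokogiri::HTML(URI.open(url))
rows = data.css("td[valign='top'] table tr") # All the <tr>this is a line</tr> rows
rows.each do |row|
  puts row.text # Prints the text of each row ('this is a line')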
end

How do I parse an HTML table with Nokogiri?

#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
  (The HTML from the question goes here)
EOS

doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
  detail = {}
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp details

# => [{:time=>"23:35",
# =>   :title=>"Vb4 Gold Released",
# =>   :number=>"24",
# =>   :date=>"06 Jan 2010",
# =>   :views=>"1,320",
# =>   :name=>"Paul M"}]

Nokogiri parse HTML table in Ruby

The HTML is flawed: more than one element has the same id attribute, for example <tr onclick="colorize(this);" id="item_2" tcolor=""> and <td class="firmname" id="item_2">.
The id of the table element in the HTML is "gvResult", while the Ruby code asks Nokogiri to look for a table with the id "gvResultTable".
Nokogiri stores strings internally as UTF-8, so you shouldn't have any problem with Russian characters.

Provided the HTML can be fixed, this works fine:

HTML:

<table id="gvResult">
  <tbody>
    <tr id="item_1">
        <td class="firmname">Example1</td>
        <td class="price">42.00</td>
    </tr>
    <tr id="item_2">
       <td class="firmname">Example2</td>
       <td class="price">24.00</td>
    </tr>
  </tbody>
</table>

Ruby:

require 'nokogiri'
require 'open-uri'
require 'pp'

# URI.open (from open-uri) fetches the page
html = URI.open('http://www.example.com/page')

doc = Nokogiri::HTML(html)
doc.encoding = 'utf-8'

# Every <tr> whose id starts with "item_"
rows = doc.search('//tr[starts-with(@id, "item_")]')
@details = rows.collect do |row|
  detail = {}
  [
    [:firmname, 'td[1]/text()'],
    [:price, 'td[2]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp @details

I presumed that you want the data from all the tr elements with an id like "item_\d+", hence I used doc.search('//tr[starts-with(@id, "item_")]'). Change it to suit your needs.

Ruby - nokogiri - parse only specific html table

Have you tried the XPath //table[2]/tr/td to get the second table? If you can change the source of the HTML, the best solution would be to give your tables id attributes.

Parse a HTML table using Ruby, Nokogiri omitting the column headers

require 'nokogiri'

# x holds the HTML string to parse
doc = Nokogiri::HTML(x)
rows = doc.xpath('//table/tbody/tr[position() > 1]')

# OR

rows = doc.xpath("//table/tbody/tr")
header = rows.shift

After you've run either of the two snippets above, rows will contain every <tr>...</tr> after the first one. For example, puts rows.to_xml prints the following:

<tr><td>ding</td>
       <td>dong</td>
       <td>ling</td>
    </tr>

To get the inner text with all the HTML tags removed, run puts rows.text:

ding
       dong
       ling

To get the inner text of the td tags only, run rows.xpath('td').map {|td| td.text }

["ding", "dong", "ling"]

How do I parse a plain HTML table with Nokogiri?

As a quick and dirty first pass I'd do:

html = <<EOT
<div id="__DailyStat__">
  <table>
    <tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
    <tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
    <tr class="blr">
      <td>3</td>
      <td>455</td>
      <td>34</td>
      <td class="r">3454</td>
      <td class="r">5656</td>
      <td class="r">3</td>
    </tr>

    <tr class="bla">
      <td>1</td>
      <td>1300</td>
      <td>3664</td>
      <td class="r">3545</td>
      <td class="r">1000</td>
      <td class="r">10</td>
    </tr>

    <tr class="blr">
      <td>10</td>
      <td>100000</td>
      <td>3444</td>
      <td class="r">3411</td>
      <td class="r">36223</td>
      <td class="r">15</td>
    </tr>
  </table>
</div>
EOT

#    Today              Yesterday
#    Qnty Size   Length Length Size  Qnty
#    3    455    34     3454   5656  3
#    1    1300   3664   3545   1000  10
#    10   100000 3444   3411   36223 15

require 'nokogiri'

doc = Nokogiri::HTML(html)

Use CSS to find the start of the table, and define some places to hold the data we're capturing:

table = doc.at('div#__DailyStat__ table')

today_data     = []
yesterday_data = []

Loop over the rows in the table, rejecting the headers:

table.search('tr').each do |tr|

  next if (tr['class'] == 'blh')

Initialize arrays to capture the pertinent data from each row, selectively push the data into the appropriate array:

  today_td_data     = [ 'Today'     ]
  yesterday_td_data = [ 'Yesterday' ]

  tr.search('td').each do |td|
    if (td['class'] == 'r')
      yesterday_td_data << td.text.to_i
    else
      today_td_data << td.text.to_i
    end
  end

  today_data     << today_td_data
  yesterday_data << yesterday_td_data

end

And output the data:

puts today_data.map{ |a| a.join(',') }
puts yesterday_data.map{ |a| a.join(',') }

> Today,3,455,34
> Today,1,1300,3664
> Today,10,100000,3444
> Yesterday,3454,5656,3
> Yesterday,3545,1000,10
> Yesterday,3411,36223,15

Just to help you visualize what's going on: at the exit from the "tr" loop, the today_data and yesterday_data arrays are arrays-of-arrays looking like:

[["Today", 3, 455, 34], ["Today", 1, 1300, 3664], ["Today", 10, 100000, 3444]]

Alternatively, instead of looping over the "td" tags and sensing the class for the tag, I could have grabbed the contents of the "tr" and then used scan to grab the numbers and sliced the resulting array into "today" and "yesterday" arrays:

  tr_data = tr.text.scan(/\d+/).map{ |i| i.to_i }

  today_td_data     = [ 'Today',     *tr_data[0, 3] ]
  yesterday_td_data = [ 'Yesterday', *tr_data[3, 3] ]

In real-world code, like at work, I'd use that instead of what I first wrote because it's more succinct.

Notice that I didn't use XPath. It's very doable to accomplish this with XPath in Nokogiri, but for simplicity I prefer CSS accessors. XPath would have allowed accessing the contents of individual "td" tags, but it would also begin to look like line noise, which we want to avoid when writing code because it hurts maintainability. I could also have used CSS to drill down to the correct "td" tags, with something like 'tr td.r', but I don't think it would improve the code; it would just be an alternate way of doing it.

Parse an HTML table with Nokogiri in Ruby

Here is one approach I tried. You can take it further from here to meet your needs:

require 'nokogiri'
require 'pp'

doc = Nokogiri::HTML.parse(File.read("#{__dir__}/out1.html"))

data = doc.css('.TTdata, .TTdata_lgrey').map do |tr|
  %i(position year name).zip(tr.css("td:nth-child(-n+3)").map(&:text)).to_h
end

pp data

Output:

[{:position=>"1.", :year=>"2015", :name=>"Yasmani Grandal"},
 {:position=>"3.", :year=>"2015", :name=>"Francisco Cervelli"},
 {:position=>"5.", :year=>"2015", :name=>"Caleb Joseph"},
 {:position=>"7.", :year=>"2015", :name=>"Jason Castro"},
 {:position=>"9.", :year=>"2015", :name=>"Martin Maldonado"},
 {:position=>"11.", :year=>"2015", :name=>"Rene Rivera"},
 {:position=>"13.", :year=>"2015", :name=>"Kevin Plawecki"},
 {:position=>"15.", :year=>"2015", :name=>"Roberto Perez"},
 {:position=>"17.", :year=>"2015", :name=>"Hank Conger"},
 {:position=>"19.", :year=>"2015", :name=>"Tucker Barnhart"}]
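
The zip and to_h step can be looked at in isolation, without Nokogiri. In this sketch the first row is taken from the output above, and the short row is a hypothetical case showing that zip pads missing cells with nil.

```ruby
keys = %i[position year name]

full_row  = ['1.', '2015', 'Yasmani Grandal']
short_row = ['3.', '2015'] # hypothetical row with a missing cell

# zip pairs each key with the value at the same index;
# to_h turns the pairs into a hash.
full  = keys.zip(full_row).to_h
short = keys.zip(short_row).to_h

p full
p short # :name maps to nil because the third cell is missing
```

If nil values are not acceptable, rows could be filtered on their cell count before zipping.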

How to parse TABLE text with Nokogiri?

If you want all text in the cell, hyperlinked or not:

doc.xpath('//td[1]').each do |cell|
   puts cell.text.strip
end

Note: in a valid HTML document, a td will always be within a table and a tr. If you don't have any other selector requirements, you can simplify as above.
