Import from CSV into Ruby Array, with 1st Field as Hash Key, Then Lookup a Field's Value Given Header Row

import from CSV into Ruby array, with 1st field as hash key, then lookup a field's value given header row

To get the best of both worlds (very fast reading from a huge file AND the benefits of a native Ruby CSV object), my code has since evolved into this method:

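require 'csv'

$stock = "XTEX"
# sed prints line 1 (the headers) plus every line that starts with the ticker
sed_output = IO.popen(['sed', '-n', "1p; /^#{$stock},/p", 'stocks.csv'], &:read)
csv_data = CSV.parse(sed_output, headers: true, return_headers: false, header_converters: :symbol, converters: :all)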

# Now the 1-row CSV object is ready for use, eg:
$company = csv_data[:company][0]
$volatility_month = csv_data[:volatility_month][0].to_f
$sector = csv_data[:sector][0]
$industry = csv_data[:industry][0]
$rsi14d = csv_data[:relative_strength_index_14][0].to_f

This is closer to my original method, but it reads in only one record plus line 1 of the input CSV file, which contains the headers. The inline sed instructions take care of that, and the whole thing is noticeably instant. It is better than the previous version because now I can access all the fields from Ruby associatively, no longer caring about column numbers as was the case with awk.
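For comparison, here is a minimal sketch of the same "header line plus matching record" selection done in Ruby alone, in case sed is not available. The file contents, the company name and the column names below are made up for illustration, and this variant still scans every line in Ruby, so it may well be slower than sed on a huge file.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data standing in for stocks.csv (values are invented).
sample = <<~CSV
  Ticker,Company,Sector
  AAAA,Example Holdings,Technology
  XTEX,Example Energy LP,Energy
CSV

file = Tempfile.new(['stocks', '.csv'])
file.write(sample)
file.close

stock = 'XTEX'

# Keep line 1 (the headers) plus the lines that start with the ticker,
# which is the same selection the sed script "1p; /^XTEX,/p" makes.
lines = File.foreach(file.path).each_with_index.select do |line, index|
  index.zero? || line.start_with?("#{stock},")
end.map(&:first)

csv_data = CSV.parse(lines.join, headers: true, header_converters: :symbol, converters: :all)

company = csv_data[:company][0]
sector  = csv_data[:sector][0]

file.unlink
```

As in the sed version, csv_data is a one-row table whose columns are addressed by symbolized header names.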

Parse tab delimited CSV file to array of hashes in Ruby 2.0

It seems that the options you can pass to parse are the ones listed under CSV::new:

>> CSV.parse("qwe\tq\twe", col_sep: "\t"){|a| p a}
["qwe", "q", "we"]
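To get from there to the array of hashes the question asks for, one possible sketch combines col_sep with headers: true and converts each row with to_h. The sample string and its column names are invented for illustration.

```ruby
require 'csv'

# Hypothetical tab-delimited input with a header line.
tsv = "name\tqty\tprice\nwidget\t3\t1.50\ngadget\t10\t0.25\n"

# headers: true makes each row a CSV::Row; to_h turns it into a plain Hash.
rows = CSV.parse(tsv, col_sep: "\t", headers: true).map(&:to_h)

first_name = rows.first['name']
last_qty   = rows.last['qty']
```

All values stay strings here; adding converters: :numeric would be one way to get numbers instead.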

Selecting a single value field from CSV file

Yes, hashes are the way to go:

require 'csv'

data = 'Name,Times arrived,Total $ spent,Food feedback
Dan,34,2548,Lovin it!
Maria,55,5054,"Good, delicious food"
Carlos,22,4352,"I am ""pleased"", but could be better"
Stephany,34,6542,I want bigger steaks!!!!!
'

CSV.parse(data, headers: :first_row).map{ |row| row["Total $ spent"] }
# => ["2548", "5054", "4352", "6542"]

Pretend that

CSV.parse(data, headers: :first_row)

is really

CSV.foreach('some/file.csv', headers: :first_row)

and the data is really in a file.

The reason you want to use headers: :first_row is that it tells CSV to gobble up the first line as headers. It then returns a hash-like CSV::Row for each record, keyed by the associated header field, which makes it easier to retrieve specific fields by name.

From the documentation:

:headers

If set to :first_row or true, the initial row of the CSV file will be treated as a row of headers.
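A small sketch of what that option changes, using a shortened, made-up version of the data above: each record comes back as a CSV::Row that can be indexed by header name or converted to a plain Hash.

```ruby
require 'csv'

# Shortened sample in the same shape as the data above.
sample = "Name,Total $ spent\nDan,2548\nMaria,5054\n"

table = CSV.parse(sample, headers: :first_row)
row   = table.first

row_class = row.class            # CSV::Row
amount    = row['Total $ spent'] # field looked up by header name
as_hash   = row.to_h             # plain Hash of header => value
headers   = table.headers        # header names taken from the first line
```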

Alternate ways of doing this are:

spent = CSV.parse(data).map{ |row| row[2] }
spent.shift

spent
# => ["2548", "5054", "4352", "6542"]

spent.shift drops the first element from the array, which was the header field for that column, leaving the array containing only values.

Or:

spent = []
skip_headers = true
CSV.parse(data).each do |row|

  if skip_headers
    skip_headers = false
    next
  end

  spent << row[2]
end

spent
# => ["2548", "5054", "4352", "6542"]

Similar to the shift statement above, the next is telling Ruby to skip to the next iteration of the loop and not process the rest of the instructions in the block, which results in the header record being skipped in the final output.

Once you have the values from the fields you want you can selectively extract specific ones. If you want the values "2548" and "4352", you have to have a way of determining which rows those are in. Using arrays (the non-header method) makes it more awkward to do, so I'd do it using hashes again:

spent = CSV.parse(data, headers: :first_row).each_with_object([]) do |row, ary|
  case row['Name']
  when 'Dan', 'Carlos'
    ary << row['Total $ spent']
  end
end

spent
# => ["2548", "4352"]

Notice that it's very clear what is going on, which is important in code. Using case and when allows me to easily add additional names to include. It acts like a chained "or" conditional test in an if statement, but without the additional noise.

each_with_object is similar to inject, except it is cleaner when we need to aggregate values into an Array, Hash or some object.
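To illustrate the difference, here is a short side-by-side sketch that builds the same Hash both ways. The list of names is just sample input.

```ruby
names = %w[Dan Maria Carlos]

# inject: the block's return value becomes the next accumulator,
# so the accumulator has to be returned explicitly on the last line.
with_inject = names.inject({}) do |lengths, name|
  lengths[name] = name.length
  lengths
end

# each_with_object: the same object is passed to every iteration and the
# block's return value is ignored, so no trailing "lengths" line is needed.
# Note that the block parameters come in the opposite order.
with_object = names.each_with_object({}) do |name, lengths|
  lengths[name] = name.length
end
```

Both variables end up holding the same name-to-length Hash.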

Summing the array is easy and there are many different ways to get there, but I'd use:

spent.map(&:to_i).inject(:+) # => 6900

Basically that converts the individual elements to integers and adds them together. (There's more to it but that's not important until farther up your learning curve.)
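As a hedged aside, on Ruby 2.4 and later the same total can also be written with sum, which additionally behaves differently from inject on an empty array. A minimal sketch using the two values selected above:

```ruby
spent = %w[2548 4352]

total_inject = spent.map(&:to_i).inject(:+)  # convert, then fold with +
total_sum    = spent.sum(&:to_i)             # Ruby 2.4+: convert and add in one step

# With nothing to add, inject(:+) has no starting value and returns nil,
# whereas sum starts from 0.
empty_inject = [].inject(:+)
empty_sum    = [].sum
```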


I am just wondering if it is possible to replace the contents of the 'when' condition with an array of strings to iterate over rather than hard coded strings?

Here's a solution using an Array:

NAMES = %w[Dan Carlos]

spent = CSV.parse(data, headers: :first_row).each_with_object([]) do |row, ary|
  case row['Name']
  when *NAMES
    ary << row['Total $ spent']
  end
end

spent
# => ["2548", "4352"]

If the list of names is large, I think this solution will run slower than necessary. Arrays are great for storing data you're going to walk through in order, such as a queue, or for remembering insertion order, like a stack, but they're bad when you have to scan them just to find something. Even a sorted Array with a binary search is likely to be slower than a Hash because of the extra steps involved. Here's an alternate way of doing this, using a Hash:

NAMES = %w[Dan Carlos].map{ |n| [n, true] }.to_h

spent = CSV.parse(data, headers: :first_row).each_with_object([]) do |row, ary|
  case
  when NAMES[row['Name']]
    ary << row['Total $ spent']
  end
end

spent
# => ["2548", "4352"]

But that can be refactored to be more readable:

NAMES = %w[Dan Carlos].each_with_object({}) { |a, h| h[a] = true }
# => {"Dan"=>true, "Carlos"=>true}

spent = CSV.parse(data, headers: :first_row).each_with_object([]) do |row, ary|
  ary << row['Total $ spent'] if NAMES[row['Name']]
end

spent
# => ["2548", "4352"]
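Another option, not part of the answer above, is a Set from Ruby's standard library, which also gives hash-based membership tests while reading as "a collection of names". A minimal sketch, with WANTED as a hypothetical constant name:

```ruby
require 'csv'
require 'set'

data = <<~'CSV'
  Name,Times arrived,Total $ spent,Food feedback
  Dan,34,2548,Lovin it!
  Maria,55,5054,"Good, delicious food"
  Carlos,22,4352,"I am ""pleased"", but could be better"
  Stephany,34,6542,I want bigger steaks!!!!!
CSV

# WANTED is a made-up name; Set#include? does a hash lookup rather than a scan.
WANTED = Set.new(%w[Dan Carlos])

spent = CSV.parse(data, headers: :first_row).each_with_object([]) do |row, ary|
  ary << row['Total $ spent'] if WANTED.include?(row['Name'])
end
```

The loop body is the same as in the refactored Hash version; only the lookup structure changes.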

How to read CSV data into a hash

Let's first create the CSV file.

str =<<~_
date,name,st,code,num
2020-03-25,AB,53,2585,130
2020-03-26,AB,53,3208,151
2020-03-26,BA,35,136,1
2020-03-27,BA,35,191,1
_

FName = 't'
File.write(FName, str)
#=> 120

Now we can simply read the file line-by-line, using CSV::foreach, which, without a block, returns an enumerator, and build the hash as we go along.

require 'csv'

CSV.foreach(FName, headers: true).
  with_object(Hash.new { |h,k| h[k] = [] }) do |row,h|
    h[row['name'].to_sym] << [row['date'], row['code']]
  end
#=> {:AB=>[["2020-03-25", "2585"], ["2020-03-26", "3208"]],
# :BA=>[["2020-03-26", "136"], ["2020-03-27", "191"]]}

I've used Hash::new with a block to create a hash h such that, if h does not have a key k, evaluating h[k] first assigns h[k] = [] and then returns that empty array. That way h[k] << 123, when h has no key k, results in h[k] #=> [123].
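The behaviour of the default block can be seen in isolation in the following sketch, which also shows why Hash.new([]) is not an equivalent shortcut. The keys and values are arbitrary.

```ruby
# A plain hash returns nil for a missing key, so << would raise NoMethodError.
plain = {}
missing = plain[:a]

# With a default block, the first access stores a fresh array under the key.
auto = Hash.new { |h, k| h[k] = [] }
auto[:a] << 123
auto_value = auto[:a]
auto_keys  = auto.keys

# Hash.new([]) hands out one shared array and never stores it under a key.
shared = Hash.new([])
shared[:a] << 1
shared[:b] << 2
shared_value = shared[:a]
shared_keys  = shared.keys
```

In the shared case both appends land in the same array and the hash itself stays empty, which is rarely what is wanted when grouping rows.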

Alternatively, one could write:

CSV.foreach(FName, headers: true).with_object({}) do |row,h|
  (h[row['name'].to_sym] ||= []) << [row['date'], row['code']]
end

One could also use a converter to convert the values of name to symbols, but some might see that as overkill here:

CSV.foreach(FName, headers: true,
            converters: [->(v) { v.match?(/\p{Alpha}+/) ? v.to_sym : v }] ).
  with_object(Hash.new { |h,k| h[k] = [] }) do |row,h|
    h[row['name']] << [row['date'], row['code']]
  end
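Yet another way to arrive at the same hash, sketched here as an alternative rather than as the answer's method, is group_by followed by transform_values (Ruby 2.4+). To stay self-contained the sketch parses the string directly instead of reading the file.

```ruby
require 'csv'

str = <<~CSV
  date,name,st,code,num
  2020-03-25,AB,53,2585,130
  2020-03-26,AB,53,3208,151
  2020-03-26,BA,35,136,1
  2020-03-27,BA,35,191,1
CSV

# Group the rows by name, then reduce each group to [date, code] pairs.
grouped = CSV.parse(str, headers: true)
             .group_by { |row| row['name'].to_sym }
             .transform_values { |rows| rows.map { |row| [row['date'], row['code']] } }
```

Unlike the foreach versions, this holds every parsed row in memory before grouping, so it is better suited to small files.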

How to parse a Hash of Hashes from a CSV file

I would only store rows in the data hash that are within the range. IMO that performs better, because it needs less memory than reading all the data into data and removing the unwanted entries in a second step.

require 'csv'

DATE_RANGE = (1403321503..1406082945)
data = {} # filled below, keyed by the first field of each row

CSV.foreach("sample_data.csv",
            :headers => true,
            :header_converters => :symbol,
            :converters => :all) do |row|
  attrs = Hash[row.headers[1..-1].zip(row.fields[1..-1])]
  data[row.fields[0]] = attrs if DATE_RANGE.cover?(attrs[:created_at])
end

It might make sense to check the condition before actually creating the hash by checking DATE_RANGE.cover? against the column number (is created_at in row.fields[1]?).
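A minimal sketch of that idea, checking the range first and building the hash only for rows that pass. The sample rows and the column names id, created_at and name are assumptions made for illustration, and the check uses the header name rather than a column number.

```ruby
require 'csv'

DATE_RANGE = (1403321503..1406082945)

# Hypothetical sample data; only the first and last rows fall inside the range.
sample = <<~CSV
  id,created_at,name
  a1,1403321600,first
  a2,1500000000,too_late
  a3,1406082945,last
CSV

data = {}

CSV.parse(sample, headers: true, header_converters: :symbol, converters: :all) do |row|
  # Skip the row before any per-row hash is built.
  next unless DATE_RANGE.cover?(row[:created_at])

  data[row.fields[0]] = Hash[row.headers[1..-1].zip(row.fields[1..-1])]
end
```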

Parse CSV into multiple lines where each value is printed after its header

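require 'csv'

# `filename` is assumed to hold the path of the CSV file
lineN = 0
headers = nil # declared outside the block so it survives between iterations

CSV.read( filename ).each do |arr|
  if lineN == 0
    headers = arr
  else
    puts "line #{lineN}"

    headers.zip(arr).each do |a|
      puts "#{a.first} : #{a.last}"
    end
  end
  lineN += 1
end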

creates:

line 1
key1 : a
key2 : b
key3 : c

line 2
key1 : d
key2 :
key3 : f
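The same listing can also be produced by letting CSV handle the header row, which removes the manual line counting. This is a sketch under the assumption that the input looks like the sample string below, reconstructed from the output shown above.

```ruby
require 'csv'

# Sample input reconstructed from the output above; the empty field is key2 on line 2.
sample = "key1,key2,key3\na,b,c\nd,,f\n"

output = []

CSV.parse(sample, headers: true).each_with_index do |row, index|
  output << "line #{index + 1}"
  # Iterating a CSV::Row yields [header, value] pairs.
  row.each { |header, value| output << "#{header} : #{value}" }
end

puts output
```

An empty field comes back as nil, which interpolates to an empty string, so the key2 line for the second record has nothing after the colon.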

Parse CSV file with header fields as attributes for each row

Using Ruby 1.9 and above, you can get an indexable object:

CSV.foreach('my_file.csv', :headers => true) do |row|
  puts row['foo'] # prints 1 the 1st time, "blah" 2nd time, etc
  puts row['bar'] # prints 2 the first time, 7 the 2nd time, etc
end

It's not dot syntax but it is much nicer to work with than numeric indexes.

As an aside, for Ruby 1.8.x FasterCSV is what you need to use the above syntax.
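If dot syntax really is wanted, one possible sketch builds a Struct from the header row. Record is a hypothetical name, the sample data mirrors the comments in the example above, and the approach assumes every header is a valid Ruby method name.

```ruby
require 'csv'

# Sample data matching the comments in the example above.
sample = "foo,bar\n1,2\nblah,7\n"

table = CSV.parse(sample, headers: true)

# Build a Struct whose members are the header names (assumed to be valid identifiers).
Record  = Struct.new(*table.headers.map(&:to_sym))
records = table.map { |row| Record.new(*row.fields) }

first_foo = records.first.foo
last_bar  = records.last.bar
```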


