Ruby CSV.each in while loop not executing second time through

According to the docs, a CSV object behaves much like a regular IO object: it keeps track of its current position in the file, which advances as you read through it, generally line by line. So your first attendees.each reads through the entire file. Subsequent calls to .each try to read the next line, but there isn't one, since you are already at the end of the file; hence the loop body never executes again.

You can fix this by rewinding the underlying IO instance to the beginning of the file with #rewind. In your specific case, call it after iterating through the attendees:

attendees.each do |attendee|
  # ...
end
attendees.rewind
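
If the file is small enough to hold in memory, another option is to read it all up front with CSV.read and iterate the resulting table as many times as you like. A small sketch, where attendees.csv and the first_name column are assumed names:

require 'csv'

# CSV.read parses the whole file up front, so repeated iteration
# does not depend on an underlying IO position.
attendees = CSV.read('attendees.csv', headers: true)

attendees.each { |attendee| puts attendee['first_name'] }
attendees.each { |attendee| puts attendee['first_name'] }  # works the second time too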

Ruby: reading two files

You are executing your Ruby program 400 times, once for each host. Instead, try making the program more flexible so you can just execute it once. That way, it only needs to parse that 9000-line CSV file once. You could read myhosts.txt with Ruby instead of with a Bash script.

Another problem is that you are iterating through the 9000-line CSV data with Array#find to locate rows. Each lookup takes O(N) time, which gets slow when repeated for every host. Instead, build an index so you can look up rows in O(1) time on average; a simple Ruby Hash works fine as that index.

Here is a script I came up with and tested:

#!/usr/bin/ruby
require 'csv'

raise if ARGV.size != 2
hosts_fname, csv_fname = ARGV

row_by_src = {}
row_by_dst = {}
CSV.foreach(csv_fname, headers: true) do |row|
  row_by_src[row['src-hostclass']] = row
  row_by_dst[row['dst-hostclass']] = row
end

File.foreach(hosts_fname) do |host|
  host = host.chomp
  s = row_by_src[host] and puts s
  d = row_by_dst[host] and puts d
end
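
You would then run the script just once, passing both file names on the command line, for example (the script name and CSV file name here are placeholders):

ruby find_hosts.rb myhosts.txt data.csv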

Choose starting row for CSV.foreach or similar method? Don't want to load file into memory

I think you have the right idea. Since you've said you're not worried about fields spanning multiple lines, you can seek to a certain line in the file using IO methods and start parsing there. Here's how you might do it:

begin
  file = File.open(FILENAME)

  # Get the headers from the first line
  headers = CSV.parse_line(file.gets)

  # Read forward in the file until we find a matching line
  match = "2,"
  while line = file.gets
    break if line.start_with?(match)
  end

  # Rewind the cursor to the beginning of that line
  # (seek works in bytes, so use bytesize rather than size)
  file.seek(-line.bytesize, IO::SEEK_CUR)

  csv = CSV.new(file, headers: headers)

  # ...do whatever you want...
ensure
  # Don't forget to close the file
  file.close
end

The result of the above is that csv will be a CSV object whose first row is the row that starts with 2,.
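
From there you can use csv like any other CSV object; for example, to peek at the next few rows (CSV includes Enumerable, so this is just a sketch of ordinary iteration):

csv.take(3).each { |row| puts row.inspect }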

I benchmarked this with an 8MB (170k rows) CSV file (from Lahman's Baseball Database) and found that it was much, much faster than using CSV.foreach alone. For a record in the middle of the file it was about 110x faster, and for a record toward the end about 66x faster. If you want, you can take a look at the benchmark here: https://gist.github.com/jrunning/229f8c2348fee4ba1d88d0dffa58edb7

Obviously 8MB is nothing like 10GB, so this is still going to take a while. But I'm pretty sure this will be quite a bit faster for you, while also accomplishing your goal of not reading all of the data into memory at once.

Execute loop n times with a sleep delay

You can keep track of the iteration index and sleep every 150 iterations:

conversations.each.with_index do |conversation, index|
  customer = conversation.customer
  threaded_conversation = helpscout.conversation(conversation.id)
  # skip index 0 so there is no delay before the very first conversation
  sleep 90 if index > 0 && index % 150 == 0
end
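
An alternative that avoids the modulo check is Enumerable#each_slice: process the conversations in batches of 150 and sleep between batches. A sketch using the same variable names as above:

conversations.each_slice(150).with_index do |batch, batch_index|
  sleep 90 unless batch_index.zero?  # no delay before the very first batch
  batch.each do |conversation|
    customer = conversation.customer
    threaded_conversation = helpscout.conversation(conversation.id)
    # ... process customer / threaded_conversation ...
  end
end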

RUBY CSV Calculate Return

Your code leads into some new requirements, such as finding the nth business day, but these are not clearly defined in the question. It may be better to open another question about the quickest way of finding the nth business day in Ruby.

So let's focus only on the requirement you described in the result sample.

Main points of the requirement:

  • read a large CSV file containing date-formatted strings
  • for each day, find the price n days later (n = 2) within the same group
  • append to each day's record a ratio calculated from the two days' prices; if there is no price n days later, leave it blank
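
The exact input is not shown here, but judging from the sample output near the end of this answer, each row appears to be group, date, price, something like:

a,2014-6-1,1
a,2014-6-2,2
a,2014-6-4,3
b,2014-6-1,1
b,2014-6-2,2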

Basic Benchmark:

Repeating the sample data 45,000 times gives a 10MB CSV file with 360,000 records in it.
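
For reference, a file like that can be generated with a few lines of Ruby; sample.csv is an assumed name for the small sample file, and data.csv is the name the benchmarks below read from:

sample = File.read("sample.csv")     # the small sample data from the question
File.open("data.csv", "w") do |f|
  45_000.times { f.write(sample) }   # 45,000 repetitions of the sample rows
end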

My first thought is to write a Buffer class that holds the records which have not yet met their record from n days later. When a new record is pushed into the buffer, the buffer shifts out all the records dated at least n days before the new record.

But first I need to know the processing times of some basic operations that might be used in this implementation, so I can estimate a lower limit on the total processing time and choose the more efficient operations:

  1. convert a date-formatted string to a Date, at least 360,000 times
  2. compare two dates, 360,000 times
  3. get the date that is n days after another date, 360,000 times
  4. calculate the number of days between two dates, 360,000 times
  5. compare two dates stored in an array of arrays, 360,000 times
  6. push a row into the buffer and shift one out, 360,000 times
  7. append a ratio or an empty string to every record, 360,000 times

I have also heard that CSV is quite inefficient, so I will compare the processing time of two ways of parsing the file as well:

  1. using CSV.foreach to read the CSV file row by row, parsing each row into an array
  2. using IO.read to read the whole CSV file into a string at once, then splitting the string into arrays

Basic benchmark scripts:

require 'csv'
require 'date'
require 'benchmark'

Benchmark.bm{|x|
  epoch = Date.new(1970,1,1)
  date1 = Date.today
  date2 = Date.today.next
  i1 = 1
  i2 = 200000
  date_str = '2014-6-1'
  a = [[1,2,4,date2], 2, [1,2,4,date1]]
  res = []
  # 1. convert date formatted string to date at least 360,000 times
  x.report("1.string2date"){
    360000.times{ Date.strptime(date_str, "%Y-%m-%d") }
  }
  # 2. compare two dates for 360,000 times
  x.report("2.DateCompare"){
    360000.times{ date2 >= date1 }
  }
  # 3. get the date that is n days after another date for 360,000 times
  x.report("3.DateAdd2   "){
    360000.times{ date1 + 2 }
  }
  # 4. calculate the days between two dates for 360,000 times
  x.report("4.Date Differ"){
    360000.times{ date2 - date1 }
  }
  # 5. compare two dates stored in an array of arrays for 360,000 times
  x.report("5.ArrDateComp"){
    360000.times{ a.last[3] > a.first[3] }
  }
  # 6. push a row into buffer and shift out for 360,000 times
  x.report("6.array shift"){
    360000.times{ a << [1,2,3]; a.shift }
  }
  # 7. append a ratio or empty string to every record for 360,000 times
  x.report("7.Add Ratio  "){
    360000.times{ res << (['1','2014-6-1',"3"] << (2 == 2 ? (3.to_f/2.to_f).round(2) : "")) }
  }

  x.report('CSVparse     '){
    CSV.foreach("data.csv"){|row| }
  }
  x.report('IOread       '){
    data = IO.read("data.csv").split.inject([]){|memo,o| memo << o.split(',')}.each{|x| }
  }
}

The result:

                    user     system      total        real
1.string2date   0.827000   0.000000   0.827000 (  0.820001)
2.DateCompare   0.078000   0.000000   0.078000 (  0.070000)
3.DateAdd2      0.109000   0.000000   0.109000 (  0.110000)
4.Date Differ   0.359000   0.000000   0.359000 (  0.360000)
5.ArrDateComp   0.109000   0.000000   0.109000 (  0.110001)
6.array shift   0.094000   0.000000   0.094000 (  0.090000)
7.Add Ratio     0.530000   0.000000   0.530000 (  0.530000)
CSVparse        2.902000   0.016000   2.918000 (  2.910005)
IOread          0.515000   0.015000   0.530000 (  0.540000)

Analyzing the results

  • Converting a date-formatted string to a Date is the slowest of these operations, so it should be done while parsing the file, ensuring the string-to-date conversion is executed only once per record.
  • Comparing two dates is several times faster than calculating the number of days between two dates (0.078 s vs 0.359 s above), so the buffer will store the date n days after each record's date rather than an integer count of days since the epoch.
  • The total processing time includes at least parts 1, 2, 3, 5, 6 and 7, so the lower limit of the estimated processing time is about 1.75 seconds, not counting various other overheads.
  • With CSV parsing, the lower limit would be about 4.65 seconds (the same 1.75 seconds plus the 2.90-second CSV parse).
  • With IO#read and split, the lower limit would be about 2.26 seconds.

The implementation of the Buffer class and its push method

class Buff
  def initialize
    @buff = []
    @epoch = Date.new(1970,1,1)
    @n = 2
  end

  def push_date( row )
    # store rows in the buffer with two date values appended, e.g.
    # ["a", "2014-6-1", "1", #<Date: 2014-06-01 ...>, #<Date: 2014-06-03 ...>]
    # the last element is the date n days after the record's date
    res = []
    @buff << (row << (row[3] + @n))
    while (@buff.last[3] >= @buff.first[4] || row[0] != @buff.first[0])
      v = (@buff.last[3] == @buff.first[4] && row[0] == @buff.first[0] ? (row[2].to_f/@buff.first[2].to_f).round(2) : "")
      res << (@buff.shift[0..2] << v)
    end
    return res
  end

  def tails
    @buff.inject([]) {|res,x| res << (x[0..2] << "")}
  end

  def clear
    @buff = []
  end
end
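
A minimal usage sketch of the class above, with rows shaped like the sample data; the caller appends the parsed Date as the fourth element, exactly as the benchmark below does:

require 'date'

buff = Buff.new
rows = [
  ["a", "2014-6-1", "1"],
  ["a", "2014-6-2", "2"],
  ["a", "2014-6-4", "3"],
]
rows.each do |row|
  finished = buff.push_date(row << Date.strptime(row[1], "%Y-%m-%d"))
  finished.each { |r| p r }   # rows whose n-days-later ratio is now known (or known to be blank)
end
buff.tails.each { |r| p r }   # whatever remains in the buffer gets an empty ratio
# => ["a", "2014-6-1", "1", ""]
#    ["a", "2014-6-2", "2", 1.5]
#    ["a", "2014-6-4", "3", ""]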

Benchmark

buff = Buff.new
res = []
Benchmark.bm{|x|
  buff.clear
  res = []
  x.report("CSVdate"){
    CSV.foreach("data.csv"){|row|
      buff.push_date(row << Date.strptime(row[1],"%Y-%m-%d")).each{|x| res << x}
    }
    buff.tails.each{|x| res << x}
  }

  buff.clear
  res = []
  x.report("IOdate"){
    IO.read("data.csv").split.inject([]){|memo,o| memo << o.split(',')}.each {|row|
      buff.push_date(row << Date.strptime(row[1],"%Y-%m-%d")).each{|x| res << x}
    }
    buff.tails.each{|x| res << x}
  }
}
puts "output result count:#{res.size}"
puts "Here are the first 12 sample outputs:"
res[0..11].each{|x| puts x.to_s}

Result

              user     system      total        real
CSVdate   6.411000   0.047000   6.458000 (  6.500009)
IOdate    3.557000   0.109000   3.666000 (  3.710005)

output result count:360000
Here are the first 12 sample outputs:
["a", "2014-6-1", "1", ""]
["a", "2014-6-2", "2", 1.5]
["a", "2014-6-4", "3", ""]
["a", "2014-6-5", "4", ""]
["b", "2014-6-1", "1", 3.0]
["b", "2014-6-2", "2", 2.0]
["b", "2014-6-3", "3", 1.67]
["b", "2014-6-4", "4", ""]
["b", "2014-6-5", "5", ""]
["a", "2014-6-1", "1", ""]
["a", "2014-6-2", "2", 1.5]
["a", "2014-6-4", "3", ""]

Conclusion

  • The real processing time is 3.557 seconds, about 57% slower than the estimated lower limit, but there are still some overheads not accounted for.
  • The CSV version is about 2 times slower than the IO#read version.
  • We should read the input file block by block with IO#read to avoid running out of memory on very large files (see the sketch after this list).
  • There should still be some room for tuning.
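
The block-by-block reading mentioned above could look roughly like this. It is only a sketch: the 1 MB block size and the data.csv file name are assumptions, and any partial trailing line is carried over so rows are never split across blocks:

leftover = ""
File.open("data.csv") do |f|
  # IO#read(length) returns nil at end of file, which ends the loop
  while block = f.read(1024 * 1024)
    block = leftover + block
    last_nl = block.rindex("\n")
    if last_nl
      complete = block[0..last_nl]         # everything up to the last newline: whole rows
      leftover = block[(last_nl + 1)..-1]  # partial row, kept for the next block
    else
      complete, leftover = "", block
    end
    complete.each_line do |line|
      row = line.chomp.split(',')
      # feed `row` to Buff#push_date here, just as in the benchmark above
    end
  end
end
# if the file does not end with a newline, `leftover` still holds the final row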

UPDATE1:

Tuning

A faster push, achieved by changing the order of the group comparison and the date comparison:

class Buff
  def push_fast( row )
    # same buffer layout as push_date: the row gets the parsed date and the
    # date n days later appended as its last two elements
    res = []
    row << (row[3] + @n)
    # changing the order of the two comparisons reduces the number of date comparisons
    while @buff.first && (row[0] != @buff.first[0] || row[3] >= @buff.first[4])
      v = (row[0] == @buff.first[0] && row[3] == @buff.first[4] ? (row[2].to_f/@buff.first[2].to_f).round(2) : "")
      res << (@buff.shift[0..2] << v)
    end
    @buff << row
    return res
  end
end

Benchmark result

             user     system      total        real
IOdate   3.806000   0.031000   3.837000 (  3.830005)
IOfast   3.323000   0.062000   3.385000 (  3.390005)

This gains about 0.48 seconds. Comparing the group first saves many date comparisons: when the group changes, all buffered records are shifted out without any date comparison at all.

Ruby to Search and Combine CSV files when dealing with large files

As far as I was able to determine, OCLC IDs are alphanumeric. This means we want to use a Hash to store these IDs. A Hash has a general lookup complexity of O(1), while your unsorted Array has a lookup complexity of O(n).

If you use an Array, your worst-case lookup is 18 million comparisons (to find a single element, Ruby may have to go through all 18 million IDs), while with a Hash it is effectively one. To put it simply: using a Hash will be millions of times faster than your current implementation.
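
A rough way to see the difference yourself is a small, self-contained benchmark (not part of the original question; one million generated IDs stand in for the 18 million real ones):

require 'set'
require 'benchmark'

ids = Array.new(1_000_000) { |i| "ocn#{i}" }   # made-up alphanumeric IDs
id_set = ids.to_set
probe = "ocn999999"                            # worst case for the Array: the last element

Benchmark.bm(15) do |x|
  x.report("Array#include?") { 100.times     { ids.include?(probe) } }
  x.report("Set#include?")   { 100_000.times { id_set.include?(probe) } }
end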

The pseudocode below will give you an idea how to proceed. We will use a Set, which is like a Hash, but handy when all you need to do is check for inclusion:

require 'set'

oclc_ids = Set.new

CSV.foreach(...) { |row|
  oclc_ids.add(row[:oclc]) # Add ID to the Set
  ...
}

# No need to call uniq on a Set.
# The elements in a Set are always unique.

processed_keys = Set.new

CSV.foreach(...) { |row|
  next unless oclc_ids.include?(row[:oclc_num])   # Extremely fast lookup
  next if processed_keys.include?(row[:oclc_num]) # Extremely fast lookup
  ...
  processed_keys.add(row[:oclc_num])
}

Write to a file in Ruby with every iteration of a loop

Let's use IO::write to create two input files.

FNameIn1 = 'in1'
File.write(FNameIn1, "cow\npig\ngoat\nhen\n")
#=> 17

We can use IO::read to confirm what was written.

puts File.read(FNameIn1)
cow
pig
goat
hen

FNameIn2 = 'in2'
File.write(FNameIn2, "12\n34\n56\n78\n")
#=> 12
puts File.read(FNameIn2)
12
34
56
78

Next, use File::open to open the two input files for reading, obtaining a file handle for each.

f1 = File.open(FNameIn1)
#=> #<File:in1>
f2 = File.open(FNameIn2)
#=> #<File:in2>

Now open a file for writing.

FNameOut = 'out'
f = File.open(FNameOut, "w")
#=> #<File:out>

Assuming the two input files have the same number of lines, loop until the first file hits end-of-file, reading lines from each file, combining them in some way, and writing the resulting line to the output file.

until f1.eof
  line11 = f1.gets.chomp
  line12 = f1.gets.chomp
  line21 = f2.gets.chomp
  line22 = f2.gets.chomp
  f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
end

See IO#eof, IO#gets and IO#puts.

Lastly, use IO#close to close the files.

f1.close
f2.close
f.close

Let's see what FNameOut looks like.

puts File.read(FNameOut)
cow 12, pig 34
goat 56, hen 78

We can have Ruby close the files by using a block for each File::open:

File.open(FNameIn1) do |f1|
  File.open(FNameIn2) do |f2|
    File.open(FNameOut, "w") do |f|
      until f1.eof
        line11 = f1.gets.chomp
        line12 = f1.gets.chomp
        line21 = f2.gets.chomp
        line22 = f2.gets.chomp
        f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
      end
    end
  end
end
puts File.read FNameOut
cow 12, pig 34
goat 56, hen 78

This is in fact how it's normally done in Ruby, in part to avoid the possibility of forgetting to close files.

Here's another way, using IO::foreach, which, without a block, returns an enumerator, allowing the use of Enumerable#each_slice, as referenced in the question.

e1 = File.foreach(FNameIn1).each_slice(2)
#=> #<Enumerator: #<Enumerator: File:foreach("in1")>:each_slice(2)>
e2 = File.foreach(FNameIn2).each_slice(2)
#=> #<Enumerator: #<Enumerator: File:foreach("in2")>:each_slice(2)>

File.open(FNameOut, "w") do |f|
  loop do
    line11, line12 = e1.next.map(&:chomp)
    line21, line22 = e2.next.map(&:chomp)
    f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
  end
end
puts File.read(FNameOut)
cow 12, pig 34
goat 56, hen 78

We may observe the values generated by the enumerator

e1 = File.foreach(FNameIn1).each_slice(2)

by repeatedly executing Enumerator#next:

e1.next
#=> ["cow\n", "pig\n"]
e1.next
#=> ["goat\n", "hen\n"]
e1.next
#=> StopIteration (iteration reached an end)

The StopIteration exception, when raised, is handled by Kernel#loop by breaking out of the loop (which is one reason why loop is so useful).
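
A tiny illustration of that behaviour:

e = [1, 2].each      # an enumerator over two elements
loop do
  puts e.next        # prints 1, then 2; the third call raises StopIteration
end
puts "done"          # loop rescues StopIteration and exits normally, so this line runs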


