Ruby CSV.each in while loop not executing second time through

According to the docs, a CSV object behaves much like a regular IO object: it keeps track of its current position in the file, which advances as you read through it, generally line by line. So your first attendees.each reads through the entire file. Subsequent calls to .each try to read the next line, but there isn't one, since you are already at the end of the file; hence the loop body never executes again.

You can fix this by rewinding the underlying IO instance to the beginning of the file with #rewind. In your specific case, call it after iterating through the attendees:

attendees.each do |attendee|
  # ...
end
attendees.rewind
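
If the file is small enough to hold in memory, another option is to read it all up front with CSV.read and iterate the resulting table as many times as you like. A small sketch, where attendees.csv and the first_name column are assumed names:

require 'csv'

# CSV.read parses the whole file up front, so repeated iteration
# does not depend on an underlying IO position.
attendees = CSV.read('attendees.csv', headers: true)

attendees.each { |attendee| puts attendee['first_name'] }
attendees.each { |attendee| puts attendee['first_name'] }  # works the second time too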

Ruby: reading two files

You are executing your Ruby program 400 times, once for each host. Instead, try making the program more flexible so you can just execute it once. That way, it only needs to parse that 9000-line CSV file once. You could read myhosts.txt with Ruby instead of with a Bash script.

Another problem is that you are iterating through the 9000-line CSV data with Array#find to locate rows. Each lookup takes O(N) time, which gets slow when repeated for every host. Instead, build an index so you can look up rows in O(1) time on average; a simple Ruby Hash works fine as that index.

Here is a script I came up with and tested:

#!/usr/bin/ruby
require 'csv'

raise if ARGV.size != 2
hosts_fname, csv_fname = ARGV

row_by_src = {}
row_by_dst = {}
CSV.foreach(csv_fname, headers: true) do |row|
  row_by_src[row['src-hostclass']] = row
  row_by_dst[row['dst-hostclass']] = row
end

File.foreach(hosts_fname) do |host|
  host = host.chomp
  s = row_by_src[host] and puts s
  d = row_by_dst[host] and puts d
end
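
You would then run the script just once, passing both file names on the command line, for example (the script name and CSV file name here are placeholders):

ruby find_hosts.rb myhosts.txt data.csv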

Choose starting row for CSV.foreach or similar method? Don't want to load file into memory

I think you have the right idea. Since you've said you're not worried about fields spanning multiple lines, you can seek to a certain line in the file using IO methods and start parsing there. Here's how you might do it:

begin
  file = File.open(FILENAME)

  # Get the headers from the first line
  headers = CSV.parse_line(file.gets)

  # Read forward in the file until we find a matching line
  match = "2,"
  while line = file.gets
    break if line.start_with?(match)
  end

  # Rewind the cursor to the beginning of that line
  # (seek works in bytes, so use bytesize rather than size)
  file.seek(-line.bytesize, IO::SEEK_CUR)

  csv = CSV.new(file, headers: headers)

  # ...do whatever you want...
ensure
  # Don't forget to close the file
  file.close
end

The result of the above is that csv will be a CSV object whose first row is the row that starts with 2,.
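
From there you can use csv like any other CSV object; for example, to peek at the next few rows (CSV includes Enumerable, so this is just a sketch of ordinary iteration):

csv.take(3).each { |row| puts row.inspect }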

I benchmarked this with an 8MB (170k rows) CSV file (from Lahman's Baseball Database) and found that it was much, much faster than using CSV.foreach alone. For a record in the middle of the file it was about 110x faster, and for a record toward the end about 66x faster. If you want, you can take a look at the benchmark here: https://gist.github.com/jrunning/229f8c2348fee4ba1d88d0dffa58edb7

Obviously 8MB is nothing like 10GB, so this is still going to take a while. But I'm pretty sure this will be quite a bit faster for you, while also accomplishing your goal of not reading all of the data into memory at once.

Execute loop n times with a sleep delay

You can keep track of the iteration index and sleep every 150 iterations:

conversations.each.with_index do |conversation, index|
  customer = conversation.customer
  threaded_conversation = helpscout.conversation(conversation.id)
  # skip index 0 so there is no delay before the very first conversation
  sleep 90 if index > 0 && index % 150 == 0
end
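
An alternative that avoids the modulo check is Enumerable#each_slice: process the conversations in batches of 150 and sleep between batches. A sketch using the same variable names as above:

conversations.each_slice(150).with_index do |batch, batch_index|
  sleep 90 unless batch_index.zero?  # no delay before the very first batch
  batch.each do |conversation|
    customer = conversation.customer
    threaded_conversation = helpscout.conversation(conversation.id)
    # ... process customer / threaded_conversation ...
  end
end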

RUBY CSV Calculate Return

Your code leads into some new requirements, such as finding the nth business day, but these are not clearly defined in the question. It may be better to open another question about the quickest way of finding the nth business day in Ruby.

So let's focus only on the requirement you described in the result sample.

Main points of the requirement:

  • read a large CSV file containing date-formatted strings
  • for each day, find the price n days later (n = 2) within the same group
  • append to each day's record a ratio calculated from the two days' prices; if there is no price n days later, leave it blank
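
The exact input is not shown here, but judging from the sample output near the end of this answer, each row appears to be group, date, price, something like:

a,2014-6-1,1
a,2014-6-2,2
a,2014-6-4,3
b,2014-6-1,1
b,2014-6-2,2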

Basic Benchmark:

Repeating the sample data 45,000 times gives a 10MB CSV file with 360,000 records in it.
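
For reference, a file like that can be generated with a few lines of Ruby; sample.csv is an assumed name for the small sample file, and data.csv is the name the benchmarks below read from:

sample = File.read("sample.csv")     # the small sample data from the question
File.open("data.csv", "w") do |f|
  45_000.times { f.write(sample) }   # 45,000 repetitions of the sample rows
end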

My first thought is to write a Buffer class that holds the records which have not yet met their record from n days later. When a new record is pushed into the buffer, the buffer shifts out all the records dated at least n days before the new record.

But first I need to know the processing times of some basic operations that might be used in this implementation, so I can estimate a lower limit on the total processing time and choose the more efficient operations:

  1. convert a date-formatted string to a Date, at least 360,000 times
  2. compare two dates, 360,000 times
  3. get the date that is n days after another date, 360,000 times
  4. calculate the number of days between two dates, 360,000 times
  5. compare two dates stored in an array of arrays, 360,000 times
  6. push a row into the buffer and shift one out, 360,000 times
  7. append a ratio or an empty string to every record, 360,000 times

I have also heard that CSV is quite inefficient, so I will compare the processing time of two ways of parsing the file as well:

  1. using CSV.foreach to read the CSV file row by row, parsing each row into an array
  2. using IO.read to read the whole CSV file into a string at once, then splitting the string into arrays

Basic benchmark scripts:

require 'csv'
require 'date'
require 'benchmark'

Benchmark.bm{|x|
  epoch = Date.new(1970,1,1)
  date1 = Date.today
  date2 = Date.today.next
  i1 = 1
  i2 = 200000
  date_str = '2014-6-1'
  a = [[1,2,4,date2], 2, [1,2,4,date1]]
  res = []
  # 1. convert date formatted string to date at least 360,000 times
  x.report("1.string2date"){
    360000.times{ Date.strptime(date_str, "%Y-%m-%d") }
  }
  # 2. compare two dates for 360,000 times
  x.report("2.DateCompare"){
    360000.times{ date2 >= date1 }
  }
  # 3. get the date that is n days after another date for 360,000 times
  x.report("3.DateAdd2   "){
    360000.times{ date1 + 2 }
  }
  # 4. calculate the days between two dates for 360,000 times
  x.report("4.Date Differ"){
    360000.times{ date2 - date1 }
  }
  # 5. compare two dates stored in an array of arrays for 360,000 times
  x.report("5.ArrDateComp"){
    360000.times{ a.last[3] > a.first[3] }
  }
  # 6. push a row into buffer and shift out for 360,000 times
  x.report("6.array shift"){
    360000.times{ a << [1,2,3]; a.shift }
  }
  # 7. append a ratio or empty string to every record for 360,000 times
  x.report("7.Add Ratio  "){
    360000.times{ res << (['1','2014-6-1',"3"] << (2 == 2 ? (3.to_f/2.to_f).round(2) : "")) }
  }

  x.report('CSVparse     '){
    CSV.foreach("data.csv"){|row| }
  }
  x.report('IOread       '){
    data = IO.read("data.csv").split.inject([]){|memo,o| memo << o.split(',')}.each{|x| }
  }
}

The result:

                    user     system      total        real
1.string2date   0.827000   0.000000   0.827000 (  0.820001)
2.DateCompare   0.078000   0.000000   0.078000 (  0.070000)
3.DateAdd2      0.109000   0.000000   0.109000 (  0.110000)
4.Date Differ   0.359000   0.000000   0.359000 (  0.360000)
5.ArrDateComp   0.109000   0.000000   0.109000 (  0.110001)
6.array shift   0.094000   0.000000   0.094000 (  0.090000)
7.Add Ratio     0.530000   0.000000   0.530000 (  0.530000)
CSVparse        2.902000   0.016000   2.918000 (  2.910005)
IOread          0.515000   0.015000   0.530000 (  0.540000)

Analyzing the results

  • Converting a date-formatted string to a Date is the slowest of these operations, so it should be done while parsing the file, ensuring the string-to-date conversion is executed only once per record.
  • Comparing two dates is several times faster than calculating the number of days between two dates (0.078 s vs 0.359 s above), so the buffer will store the date n days after each record's date rather than an integer count of days since the epoch.
  • The total processing time includes at least parts 1, 2, 3, 5, 6 and 7, so the lower limit of the estimated processing time is about 1.75 seconds, not counting various other overheads.
  • With CSV parsing, the lower limit would be about 4.65 seconds (the same 1.75 seconds plus the 2.90-second CSV parse).
  • With IO#read and split, the lower limit would be about 2.26 seconds.

The implementation of the Buffer class and its push method

class Buff
  def initialize
    @buff = []
    @epoch = Date.new(1970,1,1)
    @n = 2
  end

  def push_date( row )
    # store rows in the buffer with two date values appended, e.g.
    # ["a", "2014-6-1", "1", #<Date: 2014-06-01 ...>, #<Date: 2014-06-03 ...>]
    # the last element is the date n days after the record's date
    res = []
    @buff << (row << (row[3] + @n))
    while (@buff.last[3] >= @buff.first[4] || row[0] != @buff.first[0])
      v = (@buff.last[3] == @buff.first[4] && row[0] == @buff.first[0] ? (row[2].to_f/@buff.first[2].to_f).round(2) : "")
      res << (@buff.shift[0..2] << v)
    end
    return res
  end

  def tails
    @buff.inject([]) {|res,x| res << (x[0..2] << "")}
  end

  def clear
    @buff = []
  end
end
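
A minimal usage sketch of the class above, with rows shaped like the sample data; the caller appends the parsed Date as the fourth element, exactly as the benchmark below does:

require 'date'

buff = Buff.new
rows = [
  ["a", "2014-6-1", "1"],
  ["a", "2014-6-2", "2"],
  ["a", "2014-6-4", "3"],
]
rows.each do |row|
  finished = buff.push_date(row << Date.strptime(row[1], "%Y-%m-%d"))
  finished.each { |r| p r }   # rows whose n-days-later ratio is now known (or known to be blank)
end
buff.tails.each { |r| p r }   # whatever remains in the buffer gets an empty ratio
# => ["a", "2014-6-1", "1", ""]
#    ["a", "2014-6-2", "2", 1.5]
#    ["a", "2014-6-4", "3", ""]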

Benchmark

buff = Buff.new
res = []
Benchmark.bm{|x|
  buff.clear
  res = []
  x.report("CSVdate"){
    CSV.foreach("data.csv"){|row|
      buff.push_date(row << Date.strptime(row[1],"%Y-%m-%d")).each{|x| res << x}
    }
    buff.tails.each{|x| res << x}
  }

  buff.clear
  res = []
  x.report("IOdate"){
    IO.read("data.csv").split.inject([]){|memo,o| memo << o.split(',')}.each {|row|
      buff.push_date(row << Date.strptime(row[1],"%Y-%m-%d")).each{|x| res << x}
    }
    buff.tails.each{|x| res << x}
  }
}
puts "output result count:#{res.size}"
puts "Here are the first 12 sample outputs:"
res[0..11].each{|x| puts x.to_s}

Result

              user     system      total        real
CSVdate   6.411000   0.047000   6.458000 (  6.500009)
IOdate    3.557000   0.109000   3.666000 (  3.710005)

output result count:360000
Here are the first 12 sample outputs:
["a", "2014-6-1", "1", ""]
["a", "2014-6-2", "2", 1.5]
["a", "2014-6-4", "3", ""]
["a", "2014-6-5", "4", ""]
["b", "2014-6-1", "1", 3.0]
["b", "2014-6-2", "2", 2.0]
["b", "2014-6-3", "3", 1.67]
["b", "2014-6-4", "4", ""]
["b", "2014-6-5", "5", ""]
["a", "2014-6-1", "1", ""]
["a", "2014-6-2", "2", 1.5]
["a", "2014-6-4", "3", ""]

Conclusion

  • The real processing time is 3.557 seconds, about 57% slower than the estimated lower limit, but there are still some overheads not accounted for.
  • The CSV version is about 2 times slower than the IO#read version.
  • We should read the input file block by block with IO#read to avoid running out of memory on very large files (see the sketch after this list).
  • There should still be some room for tuning.
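
The block-by-block reading mentioned above could look roughly like this. It is only a sketch: the 1 MB block size and the data.csv file name are assumptions, and any partial trailing line is carried over so rows are never split across blocks:

leftover = ""
File.open("data.csv") do |f|
  # IO#read(length) returns nil at end of file, which ends the loop
  while block = f.read(1024 * 1024)
    block = leftover + block
    last_nl = block.rindex("\n")
    if last_nl
      complete = block[0..last_nl]         # everything up to the last newline: whole rows
      leftover = block[(last_nl + 1)..-1]  # partial row, kept for the next block
    else
      complete, leftover = "", block
    end
    complete.each_line do |line|
      row = line.chomp.split(',')
      # feed `row` to Buff#push_date here, just as in the benchmark above
    end
  end
end
# if the file does not end with a newline, `leftover` still holds the final row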

UPDATE1:

Tuning

A faster push, achieved by changing the order of the group comparison and the date comparison:

class Buff
  def push_fast( row )
    # same buffer layout as push_date: the row gets the parsed date and the
    # date n days later appended as its last two elements
    res = []
    row << (row[3] + @n)
    # changing the order of the two comparisons reduces the number of date comparisons
    while @buff.first && (row[0] != @buff.first[0] || row[3] >= @buff.first[4])
      v = (row[0] == @buff.first[0] && row[3] == @buff.first[4] ? (row[2].to_f/@buff.first[2].to_f).round(2) : "")
      res << (@buff.shift[0..2] << v)
    end
    @buff << row
    return res
  end
end

Benchmark result

             user     system      total        real
IOdate   3.806000   0.031000   3.837000 (  3.830005)
IOfast   3.323000   0.062000   3.385000 (  3.390005)

This gains about 0.48 seconds. Comparing the group first saves many date comparisons: when the group changes, all buffered records are shifted out without any date comparison at all.

Ruby to Search and Combine CSV files when dealing with large files

As far as I was able to determine, OCLC IDs are alphanumeric. This means we want to use a Hash to store these IDs. A Hash has a general lookup complexity of O(1), while your unsorted Array has a lookup complexity of O(n).

If you use an Array, your worst-case lookup is 18 million comparisons (to find a single element, Ruby may have to go through all 18 million IDs), while with a Hash it is effectively one. To put it simply: using a Hash will be millions of times faster than your current implementation.
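
A rough way to see the difference yourself is a small, self-contained benchmark (not part of the original question; one million generated IDs stand in for the 18 million real ones):

require 'set'
require 'benchmark'

ids = Array.new(1_000_000) { |i| "ocn#{i}" }   # made-up alphanumeric IDs
id_set = ids.to_set
probe = "ocn999999"                            # worst case for the Array: the last element

Benchmark.bm(15) do |x|
  x.report("Array#include?") { 100.times     { ids.include?(probe) } }
  x.report("Set#include?")   { 100_000.times { id_set.include?(probe) } }
end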

The pseudocode below will give you an idea how to proceed. We will use a Set, which is like a Hash, but handy when all you need to do is check for inclusion:

require 'set'

oclc_ids = Set.new

CSV.foreach(...) { |row|
  oclc_ids.add(row[:oclc]) # Add ID to the Set
  ...
}

# No need to call uniq on a Set.
# The elements in a Set are always unique.

processed_keys = Set.new

CSV.foreach(...) { |row|
  next unless oclc_ids.include?(row[:oclc_num])   # Extremely fast lookup
  next if processed_keys.include?(row[:oclc_num]) # Extremely fast lookup
  ...
  processed_keys.add(row[:oclc_num])
}

Write to a file in Ruby with every iteration of a loop

Let's use IO::write to create two input files.

FNameIn1 = 'in1'
File.write(FNameIn1, "cow\npig\ngoat\nhen\n")
#=> 17

We can use IO::read to confirm what was written.

puts File.read(FNameIn1)
cow
pig
goat
hen

FNameIn2 = 'in2'
File.write(FNameIn2, "12\n34\n56\n78\n")
#=> 12
puts File.read(FNameIn2)
12
34
56
78

Next, use File::open to open the two input files for reading, obtaining a file handle for each.

f1 = File.open(FNameIn1)
#=> #<File:in1>
f2 = File.open(FNameIn2)
#=> #<File:in2>

Now open a file for writing.

FNameOut = 'out'
f = File.open(FNameOut, "w")
#=> #<File:out>

Assuming the two input files have the same number of lines, loop until the first file hits end-of-file, reading lines from each file, combining them in some way, and writing the resulting line to the output file.

until f1.eof
  line11 = f1.gets.chomp
  line12 = f1.gets.chomp
  line21 = f2.gets.chomp
  line22 = f2.gets.chomp
  f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
end

See IO#eof, IO#gets and IO#puts.

Lastly, use IO#close to close the files.

f1.close
f2.close
f.close

Let's see what FNameOut looks like.

puts File.read(FNameOut)
cow 12, pig 34
goat 56, hen 78

We can have Ruby close the files by using a block for each File::open:

File.open(FNameIn1) do |f1|
  File.open(FNameIn2) do |f2|
    File.open(FNameOut, "w") do |f|
      until f1.eof
        line11 = f1.gets.chomp
        line12 = f1.gets.chomp
        line21 = f2.gets.chomp
        line22 = f2.gets.chomp
        f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
      end
    end
  end
end
puts File.read FNameOut
cow 12, pig 34
goat 56, hen 78

This is in fact how it's normally done in Ruby, in part to avoid the possibility of forgetting to close files.

Here's another way, using IO::foreach, which, without a block, returns an enumerator, allowing the use of Enumerable#each_slice, as referenced in the question.

e1 = File.foreach(FNameIn1).each_slice(2)
#=> #<Enumerator: #<Enumerator: File:foreach("in1")>:each_slice(2)>
e2 = File.foreach(FNameIn2).each_slice(2)
#=> #<Enumerator: #<Enumerator: File:foreach("in2")>:each_slice(2)>

File.open(FNameOut, "w") do |f|
  loop do
    line11, line12 = e1.next.map(&:chomp)
    line21, line22 = e2.next.map(&:chomp)
    f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
  end
end
puts File.read(FNameOut)
cow 12, pig 34
goat 56, hen 78

We may observe the values generated by the enumerator

e1 = File.foreach(FNameIn1).each_slice(2)

by repeatedly executing Enumerator#next:

e1.next
#=> ["cow\n", "pig\n"]
e1.next
#=> ["goat\n", "hen\n"]
e1.next
#=> StopIteration (iteration reached an end)

The StopIteration exception, when raised, is handled by Kernel#loop by breaking out of the loop (which is one reason why loop is so useful).
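
A tiny illustration of that behaviour:

e = [1, 2].each      # an enumerator over two elements
loop do
  puts e.next        # prints 1, then 2; the third call raises StopIteration
end
puts "done"          # loop rescues StopIteration and exits normally, so this line runs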


