How can I speed up my Ruby/Rake task, which counts occurrences of dates among 300K date strings?
Yes, you don't need to parse the dates at all if they are formatted the same. Knowing your data is one of the most powerful tools you can have.
If the datetime strings are all in the same format (yyyy-mm-dd HH:MM:SS) then you could do something like
data_array.group_by{|datetime| datetime[0..9]}
This will give you a hash with the date strings as the keys and arrays of the matching datetime strings as the values
{
"2007-05-06" => [...],
"2007-05-07" => [...],
...
}
So you'd have to get the length of each array
data_array.group_by{|datetime| datetime[0..9]}.each do |date_string, date_array|
  puts "#{date_string} occurred #{date_array.length} times."
end
Of course, that method wastes memory by building arrays of dates that you never use.
A more memory-efficient method
date_counts = {}
date_array.each do |date_string|
  date = date_string[0..9]
  date_counts[date] ||= 0 # initialize count if necessary
  date_counts[date] += 1
end
You'll end up with a hash with the date strings as the keys and the counts as values
{
"2007-05-06" => 123,
"2007-05-07" => 456,
...
}
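The same counting loop can be written a little more compactly with a hash whose default value is zero. This is only a sketch: `count_by_date` is a hypothetical helper name, and `sample` is made-up data standing in for the real array. It assumes, as above, that every string starts with a 10-character yyyy-mm-dd prefix. On Ruby 2.7 or later, `Enumerable#tally` can do the same counting in a single call.

```ruby
# Hypothetical helper: counts how many datetime strings fall on each date.
# Assumes every string starts with a 10-character "yyyy-mm-dd" prefix.
def count_by_date(datetime_strings)
  counts = Hash.new(0)            # missing keys read as 0, so no ||= is needed
  datetime_strings.each do |datetime|
    counts[datetime[0..9]] += 1   # the first 10 characters are the date
  end
  counts
end

# Made-up sample data, for illustration only.
sample = [
  "2007-05-06 10:00:00",
  "2007-05-06 23:59:59",
  "2007-05-07 08:30:00"
]

date_counts = count_by_date(sample)
date_counts.each { |date, count| puts "#{date} occurred #{count} times." }
```

Because the hash defaults to 0, looking up a date that never occurred returns 0 rather than nil.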
Putting everything together
require 'date' # provides Date.parse

date_counts = {}
date_array.each do |date_string|
  date = date_string[0..9]
  date_counts[date] ||= 0 # initialize count if necessary
  date_counts[date] += 1
end
# Walk every date in the range; dates with no entries print 0 because nil.to_i is 0
Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
  puts "#{date_to_count} occurred #{date_counts[date_to_count.to_s].to_i} times."
end
How to diagnose slow rails / rake / rspec tasks
Thanks to @MaxWilliams for the link to this post: "How do I debug a slow rails app boot time?"
I started using Mark Ellul's Bumbler - http://github.com/mark-ellul/Bumbler
It gave me exactly what I wanted: an insight into what's going on in the background and which gems are taking the time. Of course, I still need to speed up the slow ones (fog and authlogic seem to be two of the main culprits), but that's a different question.
How do I output performance times for rake tasks
There is a simple benchmarking library in Ruby's Stdlib:
require 'benchmark'
puts Benchmark.measure { "a"*1_000_000 }
You could drop that into your rake tasks. As for an automatic "benchmark all rake task executions", that would take a little digging into the innards of rake.
More info at: http://ruby-doc.org/stdlib/libdoc/benchmark/rdoc/index.html
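To time a whole task rather than a one-liner, one option is to wrap the task body in `Benchmark.realtime`, which returns the elapsed wall-clock time in seconds. The sketch below uses a hypothetical helper named `with_timing` and a stand-in workload; in a Rakefile, the block would be the body of the task.

```ruby
require 'benchmark'

# Hypothetical helper: runs the block, prints how long it took,
# and returns the block's own result.
def with_timing(label)
  result = nil
  seconds = Benchmark.realtime { result = yield }
  puts format("%s finished in %.3f s", label, seconds)
  result
end

# Stand-in workload; in a Rakefile this would be the task body.
total = with_timing("sum") { (1..100_000).sum }
```

Returning the block's result means the helper can be wrapped around existing code without changing what that code produces.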
Ruby rake tasks thread optimization
When the number of such files is low, you do not care about the order of execution, and you can afford some extra memory, the simplest solution is just to run them in different processes via cron (for example, with the gem 'whenever').
If there are more, use an HTTP gem that supports parallel downloading: typhoeus, curb, em-http-request, etc.
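If adding one of those gems is not an option, a small pool of plain Ruby threads is another way to overlap the waiting time of several downloads. This is a standard-library sketch, not what the gems above do internally: `process_in_parallel` and its `workers` parameter are hypothetical names, and the block passed in stands for whatever fetches a single file (for example, a `Net::HTTP.get` call).

```ruby
# Hypothetical helper: runs the block once per item on a small pool of threads
# and returns the results in the same order as the input.
def process_in_parallel(items, workers: 4, &block)
  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }

  results = Array.new(items.size)
  threads = Array.new(workers) do
    Thread.new do
      loop do
        begin
          item, index = queue.pop(true)   # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break                           # nothing left to do, so this worker stops
        end
        results[index] = block.call(item) # each item writes only to its own slot
      end
    end
  end
  threads.each(&:join)
  results
end

# Stand-in for a real download: the block just measures the file name.
sizes = process_in_parallel(%w[a.csv bb.csv ccc.csv], workers: 2) { |name| name.length }
```

Threads help here because downloads spend most of their time waiting on the network; for CPU-heavy work, separate processes (as with the cron approach) are usually the better fit on MRI.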