Stream and unzip large csv file with ruby
It's been a while since I posted this question, and in case anyone else comes across it, I thought it might be worth sharing what I found.
- For the number of rows I was dealing with, Ruby's standard library CSV was too slow. My csv file was simple enough that I didn't need all that machinery to deal with quoted strings or type coercion anyway. It was much easier to just use IO#gets and then split each line on commas.
- I was unable to stream the entire thing from http through a Zip::InputStream to some IO containing the csv data. This is because the zip file structure has the End of Central Directory (EOCD) at the end of the file. The EOCD is needed in order to extract the file, so streaming it from http doesn't seem like it would work.
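The gets-and-split approach mentioned above can be sketched like this; a minimal example assuming an unquoted CSV (no embedded commas or quotes in any field), with an in-memory IO standing in for the real file:

```ruby
require 'stringio'

# Stand-in for the real file IO; any IO object works the same way.
io = StringIO.new("id,name\n1,Alice\n2,Bob\n")

# First line is the header; split it on commas.
header = io.gets.chomp.split(',')

# Read the remaining lines one at a time without loading the whole file.
rows = []
while (line = io.gets)
  rows << line.chomp.split(',')
end
```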
The solution I ended up going with was to download the file to disk and then use Ruby's IO.popen with the Linux unzip utility to stream the uncompressed csv file out of the zip.
IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
  while (line = io.gets)
    # do stuff to process the CSV line
  end
end
The -p switch on unzip sends the extracted file to stdout. IO.popen then uses pipes to expose that as an IO object in Ruby. Works pretty nicely. You could feed it to the CSV library too if you wanted that extra processing; it was just too slow for me.
require 'csv'

IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
  # CSV.foreach expects a file path, so wrap the IO in CSV.new instead
  CSV.new(io).each do |row|
    # process the row
  end
end
Ruby to Search and Combine CSV files when dealing with large files
As far as I was able to determine, OCLC IDs are alphanumeric. This means we want to use a Hash to store these IDs. A Hash has a general lookup complexity of O(1), while your unsorted Array has a lookup complexity of O(n).
If you use an Array, your worst-case lookup is 18 million comparisons (to find a single element, Ruby has to go through all 18 million IDs), while with a Hash it will take roughly one comparison. To put it simply: using a Hash will be millions of times faster than your current implementation.
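A rough illustration of that lookup gap, using made-up IDs; the sizes here are scaled down from 18 million so it runs quickly:

```ruby
require 'set'
require 'benchmark'

# Build the same collection of IDs as both an Array and a Set.
ids = (1..100_000).map { |i| "ocm#{i}" }
id_array = ids
id_set = ids.to_set

# Looking up an absent ID forces the Array to scan every element,
# while the Set resolves it with a hash lookup.
missing = 'ocm-not-there'
array_time = Benchmark.realtime { 100.times { id_array.include?(missing) } }
set_time = Benchmark.realtime { 100.times { id_set.include?(missing) } }
```

On any realistic run, set_time comes out orders of magnitude smaller than array_time, and the gap only widens as the collection grows.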
The pseudocode below will give you an idea how to proceed. We will use a Set, which is like a Hash, but handy when all you need to do is check for inclusion:
oclc_ids = Set.new
CSV.foreach(...) do |row|
  oclc_ids.add(row[:oclc_num]) # Add ID to Set
  ...
end
# No need to call unique on a Set.
# The elements in a Set are always unique.
processed_keys = Set.new
CSV.foreach(...) do |row|
  next unless oclc_ids.include?(row[:oclc_num]) # Extremely fast lookup
  next if processed_keys.include?(row[:oclc_num]) # Extremely fast lookup
  ...
  processed_keys.add(row[:oclc_num])
end
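To make the pseudocode concrete, here is a runnable version of the same two-pass idea using small in-memory CSVs; the oclc_num and title column names and the sample data are made up for illustration:

```ruby
require 'csv'
require 'set'

ids_csv  = "oclc_num\n100\n200\n100\n"
data_csv = "oclc_num,title\n100,First\n300,Other\n100,Duplicate\n200,Second\n"

# Pass 1: collect the IDs of interest into a Set (duplicates collapse).
oclc_ids = Set.new
CSV.parse(ids_csv, headers: true) { |row| oclc_ids.add(row['oclc_num']) }

# Pass 2: keep only rows whose ID is in the Set, once per ID.
matched = []
processed_keys = Set.new
CSV.parse(data_csv, headers: true) do |row|
  key = row['oclc_num']
  next unless oclc_ids.include?(key)   # fast Set lookup
  next if processed_keys.include?(key) # skip already-seen IDs
  matched << row['title']
  processed_keys.add(key)
end
# matched => ["First", "Second"]
```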
ruby fastercsv, pre- and append identical text to every cell in file
Do you necessarily have to use fastercsv? If your input is really as simple as you show, the following should suffice:
pre_text = '%text'
post_text = '%'
File.open('outfile.csv', 'w') do |of|
  File.foreach('input_file.csv') do |line|
    of.puts line.strip.split(',').map { |x| pre_text + x + post_text }.join(',')
  end
end
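If the input can contain quoted fields, a naive split on commas breaks. A sketch of the same transform using the standard CSV library (FasterCSV became Ruby's built-in CSV in 1.9), which handles the quoting for you; the sample input here is made up:

```ruby
require 'csv'

pre_text = '%text'
post_text = '%'

# A row with a quoted field containing a comma, which split(',') would mangle.
input = "a,\"b,c\",d\n"

# Parse, wrap each field, and re-emit; to_csv re-quotes fields as needed.
output = CSV.parse(input).map do |row|
  row.map { |x| "#{pre_text}#{x}#{post_text}" }.to_csv
end.join
```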