Dealing with Large CSV Files (20G) in Ruby

How to persist large amounts of data by reading from a CSV file

You can try to first split the file into several smaller files, then you will be able to process several files in parallel.

Probably for splinting the file it will be faster to user a tool like split

split -l 1000000 ./test.txt ./out-files-

Then while you are processing each of the files and assuming you are inserting records instead of inserting them one by one, you can combine them into batches and do bulk inserts. Something like:

INSERT INTO some_table 
VALUES
(1,'data1'),
(2, 'data2')

For better performance you'll need to build the SQL statement yourself and execute it:

ActiveRecord::Base.connection.execute('INSERT INTO <whatever you have built>')

In Ruby, what is the best way to convert alphanumeric entries to integers for a column of a CSV containing a huge number of rows?

I think this should work:

udids = {}
unique_count = 1

output_csv = CSV.open("Processed.csv", "w")

CSV.foreach("Original.csv").with_index do |row, i|
output_csv << row and next if i == 0 # skip first row (header info)
val = row[9]
if udids[val.to_sym]
row[9] = udids[val.to_sym]
else
udids[val.to_sym] = unique_count
row[9] = unique_count
unique_count += 1
end
output_csv << row
end

output_csv.close

The performance depends heavily on how many duplicates there are (the more the better), but basically it keeps track of each value as a key in a hash, and checks to see if it has encountered that value yet or not. If so, it uses the corresponding value, and if not it increments a counter, stores that count as the new value for that key and continues.

I was able to process a 10 million line test CSV file in about 3 minutes.

Insert white space between double quotation marks using gsub in Ruby

You need to match locations that are at the start/end of string or double quotation marks.

You can use

replace_empty_string = read_file.gsub(/(?<![^,])""(?![^,])/, '" "')

See the regex demo.

Details

  • (?<![^,]) - (same as (?<=\A|,)) - a location at the start of string or immediately preceded with a comma
  • "" - two " chars
  • (?![^,]) - (same as (?=\z|,)) - a location at the end of string or immediately followed with a comma.

Why does it output one document, when it used to output multiple?

When you read CSV file, you read it line by line moving internal pointer. Once you reached the end of file, this pointer stays there so every time you try to fetch new row you'll get nil unless you rewind the file. So, your code started iteration on this line:

contents.each do |row|

This fetched the first row and moved the cursor to the next line. However inside the loop you did contents.map {...} which read the whole csv file and left the curses at the end of the file.
So to fix it you need to move the statistic bits outside the loop (before or after) and rewind the file (reset the cursor) before second iteration:

contents.each do |row|
id = row[0]
name = row[:first_name]
zipcode = clean_zipcode(row[:zipcode])

phone = clean_phonenumber(row[:homephone])

legislators = legislators_by_zipcode(zipcode)

form_letter = erb_template.result(binding)

save_thank_you_letters(id,form_letter)

end

contents.rewind
times = contents.map { |row| row[:regdate] }
target_times = Hash[times.group_by do |t|
DateTime.strptime(t, '%m/%d/%y %H:%M').hour
end.map do |k,v|
[k, v.count]
end.sort_by do |k,v|
v
end.reverse]

target_days = Hash[times.group_by do |t|
DateTime.strptime(t, '%m/%d/%y %H:%M').wday
end.map do |k,v|
[Date::ABBR_DAYNAMES[k], v.count]
end.sort_by do |k,v|
v
end.reverse]
puts target_times
puts target_days

count specific lines in specific files in a folder

Untested, since I have no input files, but likely working:

# `Dir[]` expects it’s own format
# ⇓ will inject results into hash
Dir['/Volumes/.../*.bowtie.txt'].inject({}) do |memo, file|
memo[file] = File.readlines(file).select do |line|
line =~ /^[0-9]+\s*(\+|\-)/ # only those, matching
end.count
memo
end

Additional references: IO#readlines, Enumerable#select, Enumerable#inject.

Native Makefile alternative for windows

VisualStudio comes with nmake which would not require any 3rd party tools.

When should space be encoded to plus (+) or %20?

+ means a space only in application/x-www-form-urlencoded content, such as the query part of a URL:

http://www.example.com/path/foo+bar/path?query+name=query+value

In this URL, the parameter name is query name with a space and the value is query value with a space, but the folder name in the path is literally foo+bar, not foo bar.

%20 is a valid way to encode a space in either of these contexts. So if you need to URL-encode a string for inclusion in part of a URL, it is always safe to replace spaces with %20 and pluses with %2B. This is what, e.g., encodeURIComponent() does in JavaScript. Unfortunately it's not what urlencode does in PHP (rawurlencode is safer).

See Also

HTML 4.01 Specification application/x-www-form-urlencoded



Related Topics



Leave a reply



Submit