How to persist large amounts of data by reading from a CSV file
You can try to first split the file into several smaller files, then you will be able to process several files in parallel.
Probably for splinting the file it will be faster to user a tool like split
split -l 1000000 ./test.txt ./out-files-
Then while you are processing each of the files and assuming you are inserting records instead of inserting them one by one, you can combine them into batches and do bulk inserts. Something like:
INSERT INTO some_table
VALUES
(1,'data1'),
(2, 'data2')
For better performance you'll need to build the SQL statement yourself and execute it:
ActiveRecord::Base.connection.execute('INSERT INTO <whatever you have built>')
In Ruby, what is the best way to convert alphanumeric entries to integers for a column of a CSV containing a huge number of rows?
I think this should work:
udids = {}
unique_count = 1
output_csv = CSV.open("Processed.csv", "w")
CSV.foreach("Original.csv").with_index do |row, i|
output_csv << row and next if i == 0 # skip first row (header info)
val = row[9]
if udids[val.to_sym]
row[9] = udids[val.to_sym]
else
udids[val.to_sym] = unique_count
row[9] = unique_count
unique_count += 1
end
output_csv << row
end
output_csv.close
The performance depends heavily on how many duplicates there are (the more the better), but basically it keeps track of each value as a key in a hash, and checks to see if it has encountered that value yet or not. If so, it uses the corresponding value, and if not it increments a counter, stores that count as the new value for that key and continues.
I was able to process a 10 million line test CSV file in about 3 minutes.
Insert white space between double quotation marks using gsub in Ruby
You need to match locations that are at the start/end of string or double quotation marks.
You can use
replace_empty_string = read_file.gsub(/(?<![^,])""(?![^,])/, '" "')
See the regex demo.
Details
(?<![^,])
- (same as(?<=\A|,)
) - a location at the start of string or immediately preceded with a comma""
- two"
chars(?![^,])
- (same as(?=\z|,)
) - a location at the end of string or immediately followed with a comma.
Why does it output one document, when it used to output multiple?
When you read CSV file, you read it line by line moving internal pointer. Once you reached the end of file, this pointer stays there so every time you try to fetch new row you'll get nil unless you rewind the file. So, your code started iteration on this line:
contents.each do |row|
This fetched the first row and moved the cursor to the next line. However inside the loop you did contents.map {...}
which read the whole csv file and left the curses at the end of the file.
So to fix it you need to move the statistic bits outside the loop (before or after) and rewind the file (reset the cursor) before second iteration:
contents.each do |row|
id = row[0]
name = row[:first_name]
zipcode = clean_zipcode(row[:zipcode])
phone = clean_phonenumber(row[:homephone])
legislators = legislators_by_zipcode(zipcode)
form_letter = erb_template.result(binding)
save_thank_you_letters(id,form_letter)
end
contents.rewind
times = contents.map { |row| row[:regdate] }
target_times = Hash[times.group_by do |t|
DateTime.strptime(t, '%m/%d/%y %H:%M').hour
end.map do |k,v|
[k, v.count]
end.sort_by do |k,v|
v
end.reverse]
target_days = Hash[times.group_by do |t|
DateTime.strptime(t, '%m/%d/%y %H:%M').wday
end.map do |k,v|
[Date::ABBR_DAYNAMES[k], v.count]
end.sort_by do |k,v|
v
end.reverse]
puts target_times
puts target_days
count specific lines in specific files in a folder
Untested, since I have no input files, but likely working:
# `Dir[]` expects it’s own format
# ⇓ will inject results into hash
Dir['/Volumes/.../*.bowtie.txt'].inject({}) do |memo, file|
memo[file] = File.readlines(file).select do |line|
line =~ /^[0-9]+\s*(\+|\-)/ # only those, matching
end.count
memo
end
Additional references: IO#readlines
, Enumerable#select
, Enumerable#inject
.
Native Makefile alternative for windows
VisualStudio comes with nmake which would not require any 3rd party tools.
When should space be encoded to plus (+) or %20?
+
means a space only in application/x-www-form-urlencoded
content, such as the query part of a URL:
http://www.example.com/path/foo+bar/path?query+name=query+value
In this URL, the parameter name is query name
with a space and the value is query value
with a space, but the folder name in the path is literally foo+bar
, not foo bar
.
%20
is a valid way to encode a space in either of these contexts. So if you need to URL-encode a string for inclusion in part of a URL, it is always safe to replace spaces with %20
and pluses with %2B
. This is what, e.g., encodeURIComponent()
does in JavaScript. Unfortunately it's not what urlencode does in PHP (rawurlencode is safer).
See Also
HTML 4.01 Specification application/x-www-form-urlencoded
Related Topics
Is Assignment in a Conditional Clause Good Ruby Style
How to Get a Substring from Position N to the Last Char in Ruby
Does Begin . . . End While Denote a 'Block'
Machinist VS Factorygirl - Pros and Cons
How to Select the Longest String from a Ruby Array
Controlling Tor Client with Ruby
Rails 3: User Created Custom Forms
Refactoring Activerecord Models with a Base Class Versus a Base Module
Use Separate Authentication Model with Devise on Rails
Ruby, Tor and Net::Http::Proxy
Ruby on Rails Model Inside Namespace Can't Be Found in Controller
Why Doesn't Module.Method_Defined(:Method) Work Correctly
Differences Between Literals and Constructors? ([] VS Array.New and {} VS Hash.New)
Safe Navigation Equivalent to Rails Try for Hashes