Uploading Large File to S3 with Ruby Fails with Out of Memory Error, How to Read and Upload in Chunks


The v2 AWS SDK for Ruby (the aws-sdk gem) supports streaming objects directly over the network without loading them into memory. Your example requires only a small correction to do this:

File.open(filepath, 'rb') do |file|
  resp = s3.put_object(
    :bucket => bucket,
    :key    => s3key,
    :body   => file
  )
end

This works because it allows the SDK to call #read on the file object, passing in a small number of bytes each time. Calling #read on a Ruby IO object, such as a File, without a length argument reads the entire object into memory and returns it as a single String. That is what caused your out-of-memory errors.
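
To see the difference, compare the two forms of #read (a minimal sketch, assuming filepath points at a large file):

File.open(filepath, 'rb') do |file|
  chunk = file.read(16 * 1024) # length given: returns at most 16 KB, or nil at end of file
  rest  = file.read            # no length: slurps everything remaining into one String
end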

That said, the aws-sdk gem provides another, more useful interface for uploading files to Amazon S3. This alternative interface automatically:

  • Uses multipart APIs for large objects
  • Can use multiple threads to upload parts in parallel, improving upload speed
  • Computes MD5 checksums of the data client-side for service-side data integrity checks

A simple example:

# notice this uses Resource, not Client
s3 = Aws::S3::Resource.new(
  :access_key_id     => accesskeyid,
  :secret_access_key => accesskey,
  :region            => region
)

s3.bucket(bucket).object(s3key).upload_file(filepath)

This is part of the aws-sdk gem's resource interfaces, which contain quite a few helpful utilities; the Client class provides only the basic API operations.
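
upload_file also accepts options that tune the multipart behaviour, for instance the size at which it switches to the multipart API (a sketch; :multipart_threshold is an upload_file option in the v2 SDK, though the default threshold may vary by version):

# switch to the multipart API for anything larger than ~100 MB
s3.bucket(bucket).object(s3key).upload_file(filepath,
  :multipart_threshold => 100 * 1024 * 1024)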

How can I pipe a Rack file upload directly to S3?

You're exactly right that you need to turn the request body into a readable stream. Specifically, S3 expects a Ruby IO-like object (in that it wants a #read method and an #eof? method). Rack request bodies don't have #eof? defined, however, so you have to write a little wrapper class:

class RackS3Wrapper
  def initialize(body)
    @body = body
    @eof  = false
  end

  # Delegate to the underlying Rack input, remembering when it runs dry.
  def read(*args)
    ret = @body.read(*args)
    @eof = true if ret.nil? || ret == ""
    ret
  end

  def eof?
    @eof
  end
end

Then you can use this wrapper to stream the request to S3 directly:

s3.buckets['com.mydomain.mybucket'].objects['filename'].write(
  :data => RackS3Wrapper.new(request.body),
  :content_length => request.env['CONTENT_LENGTH'].to_i)

This hasn't been tested in production or anything, but it should work fine.
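
If you're on the v2 SDK instead (the write call above uses the v1 bucket/object interface), roughly the same thing can be expressed with the client's put_object, which also accepts an IO-like body plus an explicit content length. A sketch, equally untested:

s3.put_object(
  :bucket         => 'com.mydomain.mybucket',
  :key            => 'filename',
  :body           => RackS3Wrapper.new(request.body),
  :content_length => request.env['CONTENT_LENGTH'].to_i
)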

Read a file in chunks in Ruby

Adapted from the Ruby Cookbook page 204:

FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024

class File
def each_chunk(chunk_size = MEGABYTE)
yield read(chunk_size) until eof?
end
end

open(FILENAME, "rb") do |f|
f.each_chunk { |chunk| puts chunk }
end

Disclaimer: I'm a Ruby newbie and haven't tested this.
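
If you'd rather not reopen File, the same idea works as a plain loop, since #read(length) returns nil once the end of the file is reached (a minimal sketch along the same lines):

MEGABYTE = 1024 * 1024

File.open("d:\\tmp\\file.bin", "rb") do |f|
  while (chunk = f.read(MEGABYTE))
    puts chunk
  end
end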

Reading large CSV files in a Rails app takes up a lot of memory - strategy to reduce memory consumption?

You can use CSV.foreach to read your CSV file one row at a time instead of loading the whole file into memory:

path = Rails.root.join('data/uploads/.../upload.csv') # or, whatever
CSV.foreach(path) do |row|
  # process row[i] here
end

If it's run in a background job, you could additionally call GC.start every n rows.
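
Something like this, for instance (a sketch; the 1_000-row interval is arbitrary):

CSV.foreach(path).with_index(1) do |row, i|
  # process the row here
  GC.start if (i % 1_000).zero?
end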


How it works

CSV.foreach operates on an IO stream, as you can see here:

# from Ruby's csv.rb
def CSV.foreach(path, options = Hash.new, &block)
  # ...
  open(path, options) do |csv|
    csv.each(&block)
  end
end

The csv.each part is a call to IO#each, which reads the file line by line (an rb_io_getline_1 invocation) and leaves each line it has yielded to be garbage collected:

static VALUE
rb_io_each_line(int argc, VALUE *argv, VALUE io)
{
    // ...
    while (!NIL_P(str = rb_io_getline_1(rs, limit, io))) {
        rb_yield(str);
    }
    // ...
}

