Ruby Create Tar Ball in Chunks to Avoid Out of Memory Error


It looks like the problem could be in this line:

File.open(file, "rb") { |f| tf.write f.read }

You are "slurping" your input file by doing f.read. slurping means the entire file is being read into memory, which isn't scalable at all, and is the result of using read without a length.

Instead, I'd read and write the file in blocks so memory usage stays consistent. This reads in roughly 1 MB blocks; you can adjust that for your own needs:

BLOCKSIZE_TO_READ = 1024 * 1000

File.open(file, "rb") do |fi|
  while buffer = fi.read(BLOCKSIZE_TO_READ)
    tf.write buffer
  end
end

Here's what the documentation says about read:

If length is a positive integer, it try to read length bytes without any conversion (binary mode). It returns nil or a string whose length is 1 to length bytes. nil means it met EOF at beginning. The 1 to length-1 bytes string means it met EOF after reading the result. The length bytes string means it doesn’t meet EOF. The resulted string is always ASCII-8BIT encoding.
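As a small illustration (the file name is just a stand-in), read(length) returns at most length bytes and nil once EOF has been reached, which is exactly what ends the while loop above:

File.open("example.bin", "rb") do |f|
  while buffer = f.read(16)    # nil at EOF, so the loop exits
    puts "read #{buffer.bytesize} bytes, encoding #{buffer.encoding}"    # always ASCII-8BIT
  end
end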

An additional problem is it looks like you're not opening the output file correctly:

tarfile = File.open("#{Pathname.new(path).realpath.to_s}.tar","w")

You're opening it in "text" mode because of "w". Instead, you need to open it in binary mode, "wb", because tarballs contain binary data:

tarfile = File.open("#{Pathname.new(path).realpath.to_s}.tar","wb")

Rewriting the original code to be more like I'd want to see it results in:

BLOCKSIZE_TO_READ = 1024 * 1000

def create_tarball(path)
  tar_filename = Pathname.new(path).realpath.to_path + '.tar'

  File.open(tar_filename, 'wb') do |tarfile|
    Gem::Package::TarWriter.new(tarfile) do |tar|
      Dir[File.join(path, '**/*')].each do |file|
        mode = File.stat(file).mode
        relative_file = file.sub(/^#{ Regexp.escape(path) }\/?/, '')

        if File.directory?(file)
          tar.mkdir(relative_file, mode)
        else
          tar.add_file(relative_file, mode) do |tf|
            File.open(file, 'rb') do |f|
              while buffer = f.read(BLOCKSIZE_TO_READ)
                tf.write buffer
              end
            end
          end
        end
      end
    end
  end

  tar_filename
end

BLOCKSIZE_TO_READ should be at the top of your file since it's a constant and is a "tweakable" - something more likely to be changed than the body of the code.

The method returns the path to the tarball, not an IO handle like the original code. Using the block form of File.open automatically closes the output, so any subsequent open starts from the beginning of the file. I much prefer passing around path strings rather than IO handles for files.

I also wrapped some of the method parameters in enclosing parentheses. While parentheses aren't required around method parameters in Ruby, and some people eschew them, I think they make the code more maintainable by delimiting where the parameters start and end. They also avoid confusing Ruby when you're passing parameters and a block to a method -- a well-known cause of bugs.
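As a quick, contrived illustration of that block-binding issue (not from the original code): braces bind to the nearest method call, do/end binds to the outermost one, and parentheses make the intent explicit:

puts [1, 2, 3].map { |n| n * 2 }        # block goes to map  -> prints 2, 4, 6
puts [1, 2, 3].map do |n| n * 2 end     # block goes to puts -> prints an Enumerator
puts([1, 2, 3].map do |n| n * 2 end)    # parentheses restore the intended binding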

Create a tar.gz with contents of a specific path (without chdir) with Ruby

As described in the OP, the solution is to use the add_file_simple method instead of add_file; this also requires obtaining the file size via File.stat.

Here's a working method:

# similar to 'tar -C cdpath -zcf targzfile srcs'; the difference is that 'srcs'
# is relative to the current working directory instead of 'cdpath'
def self.cdtargz(cdpath, targzfile, *src)
  path = Pathname.new(cdpath)
  raise "path #{cdpath} should be an absolute path" unless path.absolute?
  raise "path #{cdpath} should be a directory" unless File.directory? cdpath
  raise "Destination tar.gz file #{targzfile} already exists" if File.exist? targzfile
  raise "no file or directory to tar" if !src || src.length == 0

  src.each { |p| p.sub! /^/, "#{cdpath}/" }
  File.open targzfile, 'wb' do |otargzfile|
    Zlib::GzipWriter.wrap otargzfile do |gz|
      Gem::Package::TarWriter.new gz do |tar|
        Find.find *src do |f|
          relative_path = f.sub "#{cdpath}/", ""
          mode = File.stat(f).mode
          size = File.stat(f).size
          if File.directory? f
            tar.mkdir relative_path, mode
          else
            tar.add_file_simple relative_path, mode, size do |tio|
              File.open f, 'r' do |rio|
                tio.write rio.read
              end
            end
          end
        end
      end
    end
  end
end

EDIT: After reviewing the answer in this question, I revised the above slightly to avoid "slurping" the files; in my case 95% of the files are quite small, but there are a few very big ones, so this makes a lot of sense. Here's the updated version:

BLOCKSIZE_TO_READ = 1024 * 1000

def self.cdtargz(cdpath, targzfile, *src)
  path = Pathname.new(cdpath)
  raise "path #{cdpath} should be an absolute path" unless path.absolute?
  raise "path #{cdpath} should be a directory" unless File.directory? cdpath
  raise "Destination tar.gz file #{targzfile} already exists" if File.exist? targzfile
  raise "no file or directory to tar" if !src || src.length == 0

  src.each { |p| p.sub! /^/, "#{cdpath}/" }
  File.open targzfile, 'wb' do |otargzfile|
    Zlib::GzipWriter.wrap otargzfile do |gz|
      Gem::Package::TarWriter.new gz do |tar|
        Find.find *src do |f|
          relative_path = f.sub "#{cdpath}/", ""
          mode = File.stat(f).mode
          size = File.stat(f).size
          if File.directory? f
            tar.mkdir relative_path, mode
          else
            tar.add_file_simple relative_path, mode, size do |tio|
              File.open f, 'rb' do |rio|
                while buffer = rio.read(BLOCKSIZE_TO_READ)
                  tio.write buffer
                end
              end
            end
          end
        end
      end
    end
  end
end
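For reference, a hypothetical call might look like the following; the module name Archiver and the paths are made up, and the libraries the method depends on are required explicitly. Note that the method prefixes each src entry with cdpath before archiving:

require 'pathname'
require 'find'
require 'zlib'
require 'rubygems/package'

# Archive two subdirectories of /var/data into /tmp/data.tar.gz;
# entries inside the tarball are stored relative to /var/data.
Archiver.cdtargz('/var/data', '/tmp/data.tar.gz', 'reports', 'logs')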

TarWriter help adding multiple directories and files

You can pass multiple glob patterns to Dir[]:

Dir[File.join(path1, '**/*'), File.join(path2, '**/*')]

After that, the code would be something like this:

BLOCKSIZE_TO_READ = 1024 * 1000

def create_tarball(path1, path2)
  tar_filename = Pathname.new(path1).realpath.to_path + '.tar'

  File.open(tar_filename, 'wb') do |tarfile|
    Gem::Package::TarWriter.new(tarfile) do |tar|
      Dir[File.join(path1, '**/*'), File.join(path2, '**/*')].each do |file|
        mode = File.stat(file).mode
        # strip whichever source directory this entry came from
        base = [path1, path2].find { |p| file.start_with?(p) }
        relative_file = file.sub(/^#{ Regexp.escape(base) }\/?/, '')

        if File.directory?(file)
          tar.mkdir(relative_file, mode)
        else
          tar.add_file(relative_file, mode) do |tf|
            File.open(file, 'rb') do |f|
              while buffer = f.read(BLOCKSIZE_TO_READ)
                tf.write buffer
              end
            end
          end
        end
      end
    end
  end

  tar_filename
end
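A hypothetical call, with placeholder paths:

create_tarball('/data/photos', '/data/documents')
# => "/data/photos.tar"  (the tarball is named after the first path)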

How do I specify heap size configuration in a config file

This error and message are actually coming from JRuby, not from gem itself. Fortunately, JRuby checks the JRUBY_OPTS environment variable for default options.

So, try export JRUBY_OPTS=-J-Xmx1024m; then, whenever you call gem install, JRuby should automatically run with a 1024 MB heap limit.

Read a file in chunks in Ruby

Adapted from the Ruby Cookbook page 204:

FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024

class File
  def each_chunk(chunk_size = MEGABYTE)
    yield read(chunk_size) until eof?
  end
end

open(FILENAME, "rb") do |f|
  f.each_chunk { |chunk| puts chunk }
end

Disclaimer: I'm a ruby newbie and haven't tested this.
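If you'd rather not reopen the File class, the same chunked read works as a plain method; this is just a sketch along the same lines:

MEGABYTE = 1024 * 1024

# Chunked read without monkey-patching File.
def each_chunk(path, chunk_size = MEGABYTE)
  File.open(path, "rb") do |f|
    while buffer = f.read(chunk_size)   # nil at EOF ends the loop
      yield buffer
    end
  end
end

each_chunk("d:\\tmp\\file.bin") { |chunk| puts chunk }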

Why are the top-most if conditions ignored?

There's no explicit return in selected_servers, so it's returning the value of the last expression it runs, which is usually a failing unless. A failing if/unless returns nil.
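A tiny example of the behaviour (the method here is made up):

def small_label(n)
  "small" if n < 10    # when n >= 10 this expression evaluates to nil
end

small_label(5)     # => "small"
small_label(42)    # => nil -- no explicit return, so the failed `if` is the result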

Zlib in Ruby to uncompress .gz

Zlib::GzipReader works like most IO-like classes do in Ruby. You have an open call, and when you pass a block to it, the block receives the IO-like object. Think of it as a convenient way of doing something with a file or resource for the duration of the block.

But that means that in your example gz is an IO-like object, and not actually the contents of the gzip file, as you expect. You still need to read from it to get to that. The simplest fix would then be:

g.write(gz.read)

Note that this will read the entire contents of the uncompressed gzip into memory.

If all you're really doing is copying from one file to another, you can use the more efficient IO.copy_stream method. Your example might then look like:

Zlib::GzipReader.open('PRIDE_Exp_Complete_Ac_1015.xml.gz') do |input_stream|
  File.open("PRIDE_Exp_Complete_Ac_1015.xml", "w") do |output_stream|
    IO.copy_stream(input_stream, output_stream)
  end
end

Behind the scenes, this will try to use the sendfile syscall, which is available in some specific situations on Linux. Otherwise, it will do the copying in fast C code, 16 KB blocks at a time. This I learned from the Ruby 1.9.1 source code.

What Ruby gem should I use to handle tar archive manipulation?

I ended up giving up on using a gem to manipulate the tar archives, and just did it by shelling out to the command line.

`cd #{container} && tar xvfz sdk.tar.gz`
`cd #{container} && tar xvfz Wizard.tar.gz`

# update the framework packaged with the wizard
FileUtils.rm_rf(container + "/Wizard.app/Contents/Resources/SDK.bundle")
FileUtils.rm_rf(container + "/Wizard.app/Contents/Resources/SDK.framework")
FileUtils.mv(container + "/resources/SDK.bundle", container + "/Wizard.app/Contents/Resources/")
FileUtils.mv(container + "/resources/SDK.framework", container + "/Wizard.app/Contents/Resources/")

config_plist = render_to_string({
  file: 'site/_wizard_config',
  layout: false,
  locals: { app_id: @version.app.id },
  formats: 'xml'
})

File.open(container + "/Wizard.app/Contents/Resources/Configuration.plist", 'w') { |file| file.write(config_plist) }

`cd #{container} && rm Wizard.tar.gz`
`cd #{container} && tar -cvf Wizard.tar 'Wizard.app'`
`cd #{container} && gzip Wizard.tar`

All these backticks make me feel like I'm writing Perl again.
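If the backticks start to grate, one alternative (a sketch, not what I actually ran) is Kernel#system with the :chdir option, plus exception: true on Ruby 2.6+ so failures raise instead of being silently ignored:

# Rough equivalent of the first two extraction steps; `container` is the same variable as above.
system('tar', 'xvfz', 'sdk.tar.gz',    chdir: container, exception: true)
system('tar', 'xvfz', 'Wizard.tar.gz', chdir: container, exception: true)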


