Read a File in Chunks in Ruby

Read a file in chunks in Ruby

Adapted from the Ruby Cookbook page 204:

FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024

class File
  def each_chunk(chunk_size = MEGABYTE)
    yield read(chunk_size) until eof?
  end
end

File.open(FILENAME, "rb") do |f|
  f.each_chunk { |chunk| puts chunk }
end

Disclaimer: I'm a Ruby newbie and haven't tested this.
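If reopening the core File class is undesirable, the same idea can be written as a standalone method. This is a minimal sketch, not the Cookbook's code: the method name each_chunk, the path-based signature, and the temporary demo file are assumptions made for illustration. It relies on IO#read(n) returning nil at end of file.

```ruby
require "tempfile"

# Yields successive chunks of at most chunk_size bytes from the file at path.
# Returns an Enumerator when called without a block.
# (Hypothetical helper; the name and signature are illustrative.)
def each_chunk(path, chunk_size = 1024 * 1024)
  return to_enum(__method__, path, chunk_size) unless block_given?

  File.open(path, "rb") do |f|
    # IO#read(n) returns nil at end of file, which ends the loop.
    while (chunk = f.read(chunk_size))
      yield chunk
    end
  end
end

# Demo on a throwaway 10-byte file, read 4 bytes at a time.
demo = Tempfile.new("chunks")
demo.binmode
demo.write("abcdefghij")
demo.close

each_chunk(demo.path, 4) { |chunk| puts chunk.bytesize }
demo.unlink
```

Testing the result of read instead of calling eof? also behaves well on an empty file: the block is simply never called.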

Read binary file in chunks of different size with ruby

Just wrap this code:

puts "############### Chunk No. 1 ######################"

chunkheader = z.read(16)
chunksize = z.read(4).unpack('H*')[0].hex
data = z.read(chunksize).unpack('H*')

puts chunkheader.unpack('H*')
puts chunksize
puts data

in a loop:

while (chunkheader = z.read(16))   # read returns nil at end of file
  puts "############### Chunk ######################"
  chunksize = z.read(4).unpack('H*')[0].hex   # 4-byte length, most significant byte first
  data = z.read(chunksize).unpack('H*')

  puts chunkheader.unpack('H*')
  puts chunksize
  puts data
end

The loop above terminates when no data remains in the file. Please note that the snippet is error-prone in general: it expects the file not to be corrupted, and it will fail if the last chunk header reports a wrong number of bytes.

But in your case it seems to be OK.
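For files that might be truncated or corrupted, a more defensive reader could check every read. The following is only a sketch under the layout used in this answer (a 16-byte header, then a 4-byte big-endian length, then the payload); the helper names read_exactly and each_record and the two constants are hypothetical.

```ruby
require "stringio"

HEADER_SIZE = 16 # bytes, as in the snippet above
LENGTH_SIZE = 4  # bytes, interpreted as a big-endian unsigned integer

# Reads exactly n bytes from io or raises EOFError if the stream ends early.
def read_exactly(io, n)
  data = io.read(n)
  actual = data ? data.bytesize : 0
  raise EOFError, "expected #{n} bytes, got #{actual}" if actual < n
  data
end

# Yields [header, payload] pairs until the stream ends cleanly.
def each_record(io)
  return to_enum(__method__, io) unless block_given?

  until io.eof?
    header = read_exactly(io, HEADER_SIZE)
    # "N" is a 32-bit big-endian unsigned integer,
    # the same value that unpack('H*')[0].hex produces for 4 bytes.
    size = read_exactly(io, LENGTH_SIZE).unpack1("N")
    yield header, read_exactly(io, size)
  end
end

# Demo with an in-memory stream holding one record with a 3-byte payload.
stream = StringIO.new(("H" * 16) + [3].pack("N") + "abc")
each_record(stream) do |header, payload|
  puts "#{header.bytesize}-byte header, payload #{payload.inspect}"
end
```

A truncated header, length field, or payload raises EOFError instead of failing later with a NoMethodError on nil.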

Ruby: how to read an mp4 file into chunks

I suppose I would do:

chnk_size = 4 * 1024 * 1024
File.open(fn, 'rb') do |f|   # the block form closes the file automatically
  until f.eof?
    chnk = f.read(chnk_size)
    # process the chnk
  end
end

Ruby - Read file in batches

There's no universal way.

1) You can read the file in chunks:

File.open('filename','r') do |f|
  chunk = f.read(2048)
  ...
end

Disadvantage: you can miss a substring that straddles a chunk boundary. For example, if you search for "SOME_TEXT", "SOME_" may be the last 5 bytes of the first 2048-byte chunk and "TEXT" the first 4 bytes of the second chunk.

2) You can read the file line by line:

File.open('filename','r') do |f|
  line = f.gets
  ...
end

Disadvantage: this can be 2x to 5x slower than the first method.
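The boundary problem of the chunked approach can be worked around by carrying the tail of each chunk over to the next one. This is a sketch of that idea, assuming a plain byte-wise substring search; the method name stream_include? is hypothetical.

```ruby
require "stringio"

# Returns true if pattern occurs anywhere in io, reading chunk_size bytes
# at a time. The last (pattern length - 1) bytes of each window are kept
# and prepended to the next chunk, so a match that straddles a chunk
# boundary is still found.
def stream_include?(io, pattern, chunk_size = 2048)
  pattern = pattern.b                # compare raw bytes
  return true if pattern.empty?

  overlap = pattern.bytesize - 1
  tail = "".b
  while (chunk = io.read(chunk_size))
    window = tail + chunk
    return true if window.include?(pattern)
    # byteslice returns nil when the window is shorter than overlap;
    # in that case keep the whole window.
    tail = window.byteslice(-overlap, overlap) || window
  end
  false
end

# "SOME_" ends the first 2048-byte chunk, "TEXT" starts the second one.
data = ("x" * 2043) + "SOME_TEXT" + ("y" * 100)
puts stream_include?(StringIO.new(data), "SOME_TEXT")
```

The extra memory cost is bounded by the pattern length, so the chunk size can stay small.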

How to read whole file in Ruby?

IO.read("filename")

File.read("filename")
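Both calls accept an optional length and offset, and File.binread does the same in binary mode, which is usually what you want for non-text data. A small sketch on a throwaway file (the file contents and variable names are made up for the demo):

```ruby
require "tempfile"

demo = Tempfile.new("whole")
demo.binmode
demo.write("0123456789")
demo.close

whole  = File.read(demo.path)          # entire file as a String
binary = File.binread(demo.path)       # same bytes, binary (ASCII-8BIT) encoding
slice  = File.binread(demo.path, 4, 3) # 4 bytes starting at offset 3

puts whole
puts binary.encoding
puts slice
demo.unlink
```

Reading the whole file is convenient, but it loads everything into memory at once, so it is best kept for files that are known to be small.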

How to efficiently split file into arbitrary byteranges in ruby

The easiest way to generate an enumerator for each "byterange" would be to let the part function handle the opening of the file:

def part(filepath, from, to = nil, chunk_size = 4096, &block)
  return to_enum(__method__, filepath, from, to, chunk_size) unless block_given?
  size = File.size(filepath)
  # Fall back to the last byte when `to` is missing or out of range (`to` is inclusive)
  to = size - 1 unless to && to >= from && to < size
  io = File.open(filepath, "rb")
  io.seek(from, IO::SEEK_SET)
  while io.pos <= to
    # Read a full chunk, or only the bytes remaining up to `to`
    length = (io.pos + chunk_size <= to) ? chunk_size : 1 + to - io.pos
    chunk = io.read(length)
    yield chunk
  end
ensure
  io.close if io
end

Warning: I have not verified the chunk size calculation yet, so it may be wrong.

Note: You may want to improve this function so that it always reads a full physical disk block (or a multiple of one), as this can greatly speed up the I/O. Reads will be misaligned whenever from is not a multiple of the physical block size.
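One way to approach the alignment idea is to make the first read short, so that every later read starts on a block boundary. The sketch below only computes the read sizes; the method name aligned_read_sizes is hypothetical, and the default of 4096 bytes is an assumption, since the real block size depends on the device and file system.

```ruby
# Returns the sizes of the reads needed to cover bytes from..to (inclusive)
# so that every read after the first starts on a multiple of block_size.
def aligned_read_sizes(from, to, block_size = 4096)
  sizes = []
  pos = from
  while pos <= to
    # Bytes left until the next block boundary (a full block if already aligned).
    to_boundary = block_size - (pos % block_size)
    size = [to_boundary, to - pos + 1].min
    sizes << size
    pos += size
  end
  sizes
end

p aligned_read_sizes(100, 1300, 512) # prints [412, 512, 277]
```

Whether this helps in practice depends on the operating system's caching, so it is worth measuring before adopting it.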

The part function now returns an Enumerator when called without a block:

part("bigFile", 0, 1300, 512)
#=> #<Enumerator: main:part("bigFile", 0, 1300, 512)>

And of course you can call it directly with a block:

part("bigFile", 0, 1300, 512) do |chunk|
  puts chunk.inspect
end

Ruby: Split and Read part of a file Depending on Thread Count

The file is read into an array, which is split into chunks using Enumerable#each_slice.

require 'benchmark'
require 'csv'

@file_name = 'xxx.txt'
# Generate a test file with one number per line
File.open(@file_name, 'w') do |file|
  1000.times { |i| file.puts i }
end

@lines = []
CSV.foreach(@file_name) { | line | @lines << line }
FILE_RECORD_COUNT = @lines.size
puts FILE_RECORD_COUNT

def setup(thread_count)
  puts "----- thread_count=#{thread_count}"
  threads = []
  fetches_per_thread = FILE_RECORD_COUNT / thread_count
  puts "----- fetches_per_thread=#{fetches_per_thread}"
  raise 'invalid slice size' if fetches_per_thread < 1

  @lines.each_slice(fetches_per_thread) do | slice |
    threads << Thread.new do
      puts "===== slice from #{slice.first} to #{slice.last}"
      slice.each do | id |
        # puts id
        # response = RestClient.get("https://api.examplerest/names/#{id}", { accept: :json })
        # do some quick validation...
      end # slice.each
    end # Thread.new
  end # @lines.each_slice

  threads.each(&:join)
end # def setup

def run_benchmark
  Benchmark.bm(20) do |bm|
    [1, 2, 3, 5, 6, 10, 15, 30, 100].each do |thread_count|
      bm.report("with #{thread_count} threads") do
        setup(thread_count)
      end
    end
  end
end

run_benchmark

Execution:

$ --------------------------------
-bash: --------------------------------: command not found
$ ruby -w t.rb 
1000
                           user     system      total        real
with 1 threads       ----- thread_count=1
----- fetches_per_thread=1000
===== slice from ["0"] to ["999"]
  0.000000   0.000000   0.000000 (  0.000288)
with 2 threads       ----- thread_count=2
----- fetches_per_thread=500
===== slice from ["0"] to ["499"]
===== slice from ["500"] to ["999"]
  0.000000   0.000000   0.000000 (  0.000318)
with 3 threads       ----- thread_count=3
----- fetches_per_thread=333
===== slice from ["0"] to ["332"]
===== slice from ["666"] to ["998"]
===== slice from ["999"] to ["999"]
===== slice from ["333"] to ["665"]
  0.000000   0.000000   0.000000 (  0.000549)
with 5 threads       ----- thread_count=5
----- fetches_per_thread=200
===== slice from ["0"] to ["199"]
===== slice from ["200"] to ["399"]
===== slice from ["400"] to ["599"]
===== slice from ["600"] to ["799"]
===== slice from ["800"] to ["999"]
  0.000000   0.000000   0.000000 (  0.000536)
with 6 threads       ----- thread_count=6
----- fetches_per_thread=166
===== slice from ["166"] to ["331"]
===== slice from ["664"] to ["829"]
===== slice from ["830"] to ["995"]
===== slice from ["996"] to ["999"]
===== slice from ["0"] to ["165"]
===== slice from ["332"] to ["497"]
===== slice from ["498"] to ["663"]
  0.000000   0.000000   0.000000 (  0.000735)
with 10 threads      ----- thread_count=10
----- fetches_per_thread=100
===== slice from ["900"] to ["999"]
...
===== slice from ["190"] to ["199"]
===== slice from ["200"] to ["209"]
===== slice from ["210"] to ["219"]
===== slice from ["220"] to ["229"]
===== slice from ["230"] to ["239"]
===== slice from ["240"] to ["249"]
...
===== slice from ["970"] to ["979"]
===== slice from ["980"] to ["989"]
===== slice from ["990"] to ["999"]
===== slice from ["20"] to ["29"]
===== slice from ["30"] to ["39"]
  0.000000   0.000000   0.000000 (  0.011656)

I type a line of dashes at the prompt before each run (hence the "command not found" message above). Afterwards I use the terminal's find command to search for the dashes and jump to the beginning of the execution.
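In the run above, 3 threads produced four slices and 6 threads produced seven, because the integer division FILE_RECORD_COUNT / thread_count rounds down and the remainder ends up in an extra slice. If the number of slices must not exceed the thread count, the slice size can be rounded up instead. A minimal sketch, where slices_for is a hypothetical helper:

```ruby
# Splits items into at most group_count slices of nearly equal size
# by rounding the slice size up instead of down.
def slices_for(items, group_count)
  raise ArgumentError, "group_count must be positive" if group_count < 1
  return [] if items.empty?

  # Integer ceiling division: (a + b - 1) / b
  per_group = (items.size + group_count - 1) / group_count
  items.each_slice(per_group).to_a
end

lines = (0...1000).map(&:to_s)
p slices_for(lines, 3).map(&:size) # prints [334, 334, 332]
```

With this change each slice maps to exactly one thread, at the cost of the last slice being slightly smaller than the others.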
