Read a File in Chunks in Ruby

Adapted from the Ruby Cookbook page 204:

FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024

class File
  # Yields successive chunks of at most chunk_size bytes until end of file.
  def each_chunk(chunk_size = MEGABYTE)
    yield read(chunk_size) until eof?
  end
end

File.open(FILENAME, "rb") do |f|
  f.each_chunk { |chunk| puts chunk }
end

Disclaimer: I'm a Ruby newbie and haven't tested this.
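
If reopening the core File class feels too invasive, the same idea can be sketched as a standalone helper. This is only a sketch under my own assumptions: the name each_chunk_of and its parameters are made up for illustration and do not come from the Cookbook. It relies on IO#read(length) returning nil at end of file.

```ruby
# Hypothetical helper (not from the Cookbook): yields successive chunks
# of the file at `path` without modifying the File class.
def each_chunk_of(path, chunk_size = 1024 * 1024)
  # Without a block, hand back an Enumerator so callers can chain methods.
  return to_enum(__method__, path, chunk_size) unless block_given?

  File.open(path, "rb") do |f|
    # IO#read(n) returns nil once the end of the file is reached.
    while (chunk = f.read(chunk_size))
      yield chunk
    end
  end
end
```

Called with a block it behaves like the version above; called without one, something like each_chunk_of(path, 4096).map(&:bytesize) should work as well.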

Read binary file in chunks of different size with ruby

Just wrap this:

puts "############### Chunk No. 1 ######################"

chunkheader = z.read(16)
chunksize = z.read(4).unpack('H*')[0].hex
data = z.read(chunksize).unpack('H*')

puts chunkheader.unpack('H*')
puts chunksize
puts data

in a loop:

while chunkheader = z.read(16) do
  puts "############### Chunk ######################"
  chunksize = z.read(4).unpack('H*')[0].hex
  data = z.read(chunksize).unpack('H*')

  puts chunkheader.unpack('H*')
  puts chunksize
  puts data
end

The loop above terminates once no data remains in the file, because read returns nil at end of file. Note that the snippet is error-prone in general: it assumes the file is not corrupted, and it will fail if the last chunk header reports a wrong number of bytes.

But in your case it seems to be OK.
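
For files that might be truncated, a more defensive variant could check every read before using it. The following is a sketch, assuming the same layout as above (16-byte header, 4-byte size, payload); the method name each_record and the choice of EOFError are mine, not part of the original answer.

```ruby
require "stringio"

# Reads records laid out as: 16-byte header, 4-byte big-endian size, payload.
# `io` is any IO-like object opened in binary mode.
def each_record(io)
  return to_enum(__method__, io) unless block_given?

  while (header = io.read(16))
    raise EOFError, "truncated chunk header" if header.bytesize < 16

    size_field = io.read(4)
    raise EOFError, "truncated size field" if size_field.nil? || size_field.bytesize < 4

    # "N" decodes a 32-bit unsigned big-endian integer, the same value
    # that unpack('H*')[0].hex produces for four bytes.
    size = size_field.unpack1("N")

    # read(n) returns nil at end of file when n > 0; treat that as empty.
    data = io.read(size) || "".b
    raise EOFError, "chunk payload shorter than declared size" if data.bytesize < size

    yield header, data
  end
end

# Illustrative input: one record with a 3-byte payload.
sample = StringIO.new(("H" * 16) + [3].pack("N") + "abc")
each_record(sample) { |header, data| puts "#{header.unpack1('H*')} #{data.bytesize}" }
```

Raising on a short read is one possible policy; logging the problem and stopping quietly would be another.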

Ruby: how to read an mp4 file into chunks

I suppose I would do:

chnk_size = 4 * 1024 * 1024
File.open(fn, 'rb') do |f|
  until f.eof?
    chnk = f.read(chnk_size)
    # process the chnk
  end
end

Ruby - Read file in batches

There's no universal way.

1) You can read the file in chunks:

File.open('filename', 'r') do |f|
  chunk = f.read(2048)
  ...
end

Disadvantage: you can miss a substring that straddles a chunk boundary. For example, if you look for "SOME_TEXT", "SOME_" may be the last 5 bytes of the first 2048-byte chunk and "TEXT" the first 4 bytes of the second chunk.

2) You can read the file line by line:

File.open('filename', 'r') do |f|
  line = f.gets
  ...
end

Disadvantage: this can be 2x to 5x slower than the first method.
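
The boundary problem noted for chunked reads can be worked around by carrying the last few bytes of each chunk over to the next read. The sketch below assumes a plain byte-wise substring search; the method name file_includes? and its default chunk size are made up for illustration.

```ruby
# Sketch: report whether `pattern` occurs in the file at `path`, reading
# `chunk_size` bytes at a time. Name and defaults are illustrative.
def file_includes?(path, pattern, chunk_size = 2048)
  pattern = pattern.b                # compare raw bytes
  return true if pattern.empty?

  overlap = pattern.bytesize - 1     # most bytes of a match that can precede a boundary
  tail = "".b

  File.open(path, "rb") do |f|
    while (chunk = f.read(chunk_size))
      window = tail + chunk
      return true if window.include?(pattern)

      # Keep the last `overlap` bytes so a match split across chunks is seen.
      tail = overlap.zero? ? "".b : (window.byteslice(-overlap, overlap) || window)
    end
  end
  false
end
```

Only pattern.bytesize - 1 bytes are carried over, so memory use stays close to one chunk regardless of file size.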

How to read whole file in Ruby?


IO.read("filename")

or

File.read("filename")
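
For binary data, File.binread may be the safer choice, since it reads in binary mode and returns raw bytes. A minimal sketch; the temporary file exists only for the demonstration and its name is a placeholder:

```ruby
require "tempfile"

# Placeholder file created only for this demonstration.
sample = Tempfile.new("whole-file")
sample.binmode
sample.write("\x00\x01\r\n\xFF".b)
sample.close

text  = File.read(sample.path)      # text mode, tagged with the default external encoding
bytes = File.binread(sample.path)   # binary mode, raw bytes

puts text.encoding
puts bytes.encoding
puts bytes.bytesize == File.size(sample.path)

sample.unlink                       # remove the temporary file
```

Either way the whole file ends up in memory, so for very large files the chunked approaches above are probably the better fit.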

How to efficiently split file into arbitrary byteranges in ruby

The easiest way to generate an enumerator for each "byterange" would be to let the part function handle the opening of the file:

def part(filepath, from, to = nil, chunk_size = 4096, &block)
  return to_enum(__method__, filepath, from, to, chunk_size) unless block_given?
  size = File.size(filepath)
  # Clamp `to` to the last byte when it is missing or out of range.
  to = size - 1 unless to and to >= from and to < size
  io = File.open(filepath, "rb")
  io.seek(from, IO::SEEK_SET)
  while io.pos <= to
    # Read a full chunk, or only the bytes left up to and including `to`.
    length = [chunk_size, to - io.pos + 1].min
    yield io.read(length)
  end
ensure
  io.close if io
end

Warning: the chunk size calculation may be wrong; I have not double-checked it yet.

Note: You may want to improve this function so that it always reads whole physical disk blocks (or a multiple of the block size), which can speed up the I/O. Reads are misaligned whenever from is not a multiple of the physical block size.

The part function now returns an Enumerator when called without a block:

part("bigFile", 0, 1300, 512)
#=> #<Enumerator: main:part("bigFile", 0, 1300, 512)>

And of course you can call it directly with a block:

part("bigFile", 0, 1300, 512) do |chunk|
  puts chunk.inspect
end

Ruby: Split and Read part of a file Depending on Thread Count

The file is read into an array, which is split in chunks using Enumerable#each_slice.

require 'benchmark'
require 'csv'

@file_name = 'xxx.txt'
file = File.open(@file_name, 'w')
1000.times do |i|
  file.puts i
end
file.close

@lines = []
CSV.foreach(@file_name) { |line| @lines << line }
FILE_RECORD_COUNT = @lines.size
puts FILE_RECORD_COUNT

def setup(thread_count)
  puts "----- thread_count=#{thread_count}"
  threads = []
  fetches_per_thread = FILE_RECORD_COUNT / thread_count
  puts "----- fetches_per_thread=#{fetches_per_thread}"
  raise 'invalid slice size' if fetches_per_thread < 1

  @lines.each_slice(fetches_per_thread) do |slice|
    threads << Thread.new do
      puts "===== slice from #{slice.first} to #{slice.last}"
      slice.each do |id|
        # puts id
        # response = RestClient.get("https://api.examplerest/names/#{id}", {accept: :json})
        # do some quick validation...
      end # slice.each
    end # Thread.new
  end # @lines.each_slice

  threads.each(&:join)
end # def setup

def run_benchmark
  Benchmark.bm(20) do |bm|
    [1, 2, 3, 5, 6, 10, 15, 30, 100].each do |thread_count|
      bm.report("with #{thread_count} threads") do
        setup(thread_count)
      end
    end
  end
end

run_benchmark

Execution:

$ --------------------------------
-bash: --------------------------------: command not found
$ ruby -w t.rb
1000
user system total real
with 1 threads ----- thread_count=1
----- fetches_per_thread=1000
===== slice from ["0"] to ["999"]
0.000000 0.000000 0.000000 ( 0.000288)
with 2 threads ----- thread_count=2
----- fetches_per_thread=500
===== slice from ["0"] to ["499"]
===== slice from ["500"] to ["999"]
0.000000 0.000000 0.000000 ( 0.000318)
with 3 threads ----- thread_count=3
----- fetches_per_thread=333
===== slice from ["0"] to ["332"]
===== slice from ["666"] to ["998"]
===== slice from ["999"] to ["999"]
===== slice from ["333"] to ["665"]
0.000000 0.000000 0.000000 ( 0.000549)
with 5 threads ----- thread_count=5
----- fetches_per_thread=200
===== slice from ["0"] to ["199"]
===== slice from ["200"] to ["399"]
===== slice from ["400"] to ["599"]
===== slice from ["600"] to ["799"]
===== slice from ["800"] to ["999"]
0.000000 0.000000 0.000000 ( 0.000536)
with 6 threads ----- thread_count=6
----- fetches_per_thread=166
===== slice from ["166"] to ["331"]
===== slice from ["664"] to ["829"]
===== slice from ["830"] to ["995"]
===== slice from ["996"] to ["999"]
===== slice from ["0"] to ["165"]
===== slice from ["332"] to ["497"]
===== slice from ["498"] to ["663"]
0.000000 0.000000 0.000000 ( 0.000735)
with 10 threads ----- thread_count=10
----- fetches_per_thread=100
===== slice from ["900"] to ["999"]
...
===== slice from ["190"] to ["199"]
===== slice from ["200"] to ["209"]
===== slice from ["210"] to ["219"]
===== slice from ["220"] to ["229"]
===== slice from ["230"] to ["239"]
===== slice from ["240"] to ["249"]
...
===== slice from ["970"] to ["979"]
===== slice from ["980"] to ["989"]
===== slice from ["990"] to ["999"]
===== slice from ["20"] to ["29"]
===== slice from ["30"] to ["39"]
0.000000 0.000000 0.000000 ( 0.011656)

The dashed line typed at the prompt is only a marker: I then use the terminal's find command to search for -------------------------------------- and jump to the beginning of the execution.
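
Note that in the output above, 3 requested threads produce 4 slices and 6 produce 7, because the slice size is rounded down. If the number of slices must not exceed the thread count, rounding the slice size up is one option. A sketch, with a helper name (slices_for) that I made up for illustration:

```ruby
# Sketch: split `items` into at most `parts` slices of near-equal size.
# Rounding the slice size up (instead of down) avoids the extra trailing slice.
def slices_for(items, parts)
  raise ArgumentError, "parts must be positive" if parts < 1
  return [] if items.empty?

  slice_size = (items.size.to_f / parts).ceil
  items.each_slice(slice_size).to_a
end

ids = (0...1000).to_a
[1, 2, 3, 6].each do |n|
  slices = slices_for(ids, n)
  puts "#{n} requested -> #{slices.size} slices, sizes #{slices.map(&:size).inspect}"
end
```

With this rounding the helper can return fewer slices than requested when there are only a few items, but never more.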


