Read a file in chunks in Ruby
Adapted from the Ruby Cookbook page 204:
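FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024
class File
  # Yield the file's contents in pieces of at most chunk_size bytes.
  def each_chunk(chunk_size = MEGABYTE)
    yield read(chunk_size) until eof?
  end
end
# "rb" opens the file in binary mode, so no newline or encoding conversion is applied.
File.open(FILENAME, "rb") do |f|
  f.each_chunk { |chunk| puts chunk }
end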
Disclaimer: I'm a Ruby newbie and haven't tested this.
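If you would rather not reopen the File class, the same idea can be sketched as a standalone method. This is only a sketch: the each_chunk(path, chunk_size) signature below is my own assumption, not something from the Cookbook. It relies on IO#read(n) returning nil at end of file.

```ruby
MEGABYTE = 1024 * 1024

# Hypothetical helper (not from the Cookbook): yields successive chunks of
# the file at `path` without monkey-patching File.
def each_chunk(path, chunk_size = MEGABYTE)
  # Without a block, hand back an Enumerator over the chunks.
  return to_enum(__method__, path, chunk_size) unless block_given?

  File.open(path, "rb") do |f|
    # IO#read(n) returns nil once the end of the file is reached,
    # which ends the loop.
    while (chunk = f.read(chunk_size))
      yield chunk
    end
  end
end
```

Usage would look like each_chunk("file.bin", 4096) { |chunk| ... }; the last chunk may be shorter than chunk_size.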
Read binary file in chunks of different size with ruby
Just surround:
puts "############### Chunk No. 1 ######################"
chunkheader = z.read(16)
chunksize = z.read(4).unpack('H*')[0].hex
data = z.read(chunksize).unpack('H*')
puts chunkheader.unpack('H*')
puts chunksize
puts data
with a loop:
while chunkheader = z.read(16) do
  puts "############### Chunk ######################"
  chunksize = z.read(4).unpack('H*')[0].hex
  data = z.read(chunksize).unpack('H*')
  puts chunkheader.unpack('H*')
  puts chunksize
  puts data
end
The loop above terminates once no data remains in the file. Please note that the snippet is error-prone in general: it assumes the file is not corrupted, and it will fail if the last chunk header reports a wrong number of bytes. For example, if the file ends right after a header, z.read(4) returns nil and the call to unpack raises NoMethodError.
But in your case it seems to be OK.
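For files you do not fully trust, a more defensive variant could look like the sketch below. It assumes the same layout as the snippet above (a 16-byte header followed by a 4-byte size that is read as a big-endian unsigned integer, which gives the same value as unpack('H*')[0].hex). The method name each_record and the error messages are my own invention.

```ruby
HEADER_SIZE = 16 # bytes, as in the snippet above
SIZE_FIELD  = 4  # bytes, interpreted as a big-endian unsigned integer

# Hypothetical helper: yields [header, data] pairs and raises EOFError
# as soon as the input turns out to be truncated.
def each_record(io)
  return to_enum(__method__, io) unless block_given?

  while (header = io.read(HEADER_SIZE))
    raise EOFError, "truncated chunk header" if header.bytesize < HEADER_SIZE

    size_field = io.read(SIZE_FIELD)
    if size_field.nil? || size_field.bytesize < SIZE_FIELD
      raise EOFError, "truncated size field"
    end

    # 'N' = 32-bit unsigned integer in big-endian byte order.
    chunksize = size_field.unpack1('N')
    data = io.read(chunksize)
    if data.nil? || data.bytesize < chunksize
      raise EOFError, "chunk announced #{chunksize} bytes but fewer remain"
    end

    yield header, data
  end
end
```

With a real file this would be called as File.open(path, "rb") { |z| each_record(z) { |header, data| ... } }.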
Ruby: how to read an mp4 file into chunks
I suppose I would do:
chnk_size = 4 * 1024 * 1024
# fn is the path to the mp4 file; the block form closes the file automatically
File.open(fn, 'rb') do |f|
  until f.eof?
    chnk = f.read(chnk_size)
    # process the chnk
  end
end
Ruby - Read file in batches
There's no universal way.
1) You can read the file in chunks:
File.open('filename','r') do |f|
  chunk = f.read(2048)
  ...
end
Disadvantage: you can miss a substring that straddles two chunks. For example, if you look for "SOME_TEXT", "SOME_" may be the last 5 bytes of the first 2048-byte chunk and "TEXT" the first 4 bytes of the second chunk.
2) You can read the file line by line:
File.open('filename','r') do |f|
  line = f.gets
  ...
end
Disadvantage: this can be 2x to 5x slower than the first method.
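Both approaches can be sketched a bit further. The first helper below works around the boundary problem by carrying the last pattern-length-minus-one bytes over into the next read; the second groups lines into batches with File.foreach and each_slice. The names file_includes? and each_line_batch are hypothetical, and this is a sketch rather than a tuned implementation.

```ruby
# Hypothetical helper: true if `pattern` occurs anywhere in the file,
# including when it straddles two chunks.
def file_includes?(path, pattern, chunk_size = 2048)
  pattern = pattern.b                  # compare raw bytes
  return true if pattern.empty?

  overlap = pattern.bytesize - 1       # bytes carried over between reads
  tail = "".b

  File.open(path, "rb") do |f|
    while (chunk = f.read(chunk_size))
      window = tail + chunk
      return true if window.include?(pattern)

      # Keep the last `overlap` bytes so that a match split across the
      # boundary is visible on the next pass. byteslice returns nil when
      # the window is shorter than `overlap`, so fall back to the window.
      tail = window.byteslice(-overlap, overlap) || window
    end
  end
  false
end

# Hypothetical helper: yields arrays of up to `batch_size` lines
# (line endings removed), reading the file lazily.
def each_line_batch(path, batch_size = 1000, &block)
  return to_enum(__method__, path, batch_size) unless block_given?

  File.foreach(path, chomp: true).each_slice(batch_size, &block)
end
```

The overlap trick costs one small string copy per chunk; the line-based version never splits a line, at the price of scanning for newlines.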
How to read whole file in Ruby?
IO.read("filename")
or
File.read("filename")
How to efficiently split file into arbitrary byteranges in ruby
The easiest way to generate an enumerator for each "byterange" would be to let the part function handle the opening of the file:
def part(filepath, from, to = nil, chunk_size = 4096, &block)
  return to_enum(__method__, filepath, from, to, chunk_size) unless block_given?
  size = File.size(filepath)
  # fall back to the last byte of the file when `to` is missing or out of range
  to = size - 1 unless to and to >= from and to < size
  io = File.open(filepath, "rb")
  io.seek(from, IO::SEEK_SET)
  while (io.pos <= to)
    # read a full chunk, or only the bytes remaining up to and including `to`
    length = (io.pos + chunk_size <= to) ? chunk_size : 1 + to - io.pos
    chunk = io.read(length)
    yield chunk
  end
ensure
  io.close if io
end
Warning: the chunk size calculation may be wrong, I will check it in a while (I have to take care of my child)
Note: You may want to improve this function to ensure that you always read a full physical HDD block (or a multiple of it), as it will greatly speed up the IO. You'll have a misalignment when from is not a multiple of the physical HDD block.
The part function now returns an Enumerator when called without a block:
part("bigFile", 0, 1300, 512)
#=> #<Enumerator: main:part("bigFile", 0, 1300, 512)>
And of course you can call it directly with a block:
part("bigFile", 0, 1300, 512) do |chunk|
  puts "#{chunk.inspect}"
end
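When a single byterange is small enough to hold in memory, chunking may not be needed at all. As a sketch, File.binread accepts a length and an offset; the wrapper name read_range is hypothetical, and it treats to as inclusive to match part.

```ruby
# Hypothetical helper: returns bytes from..to (inclusive) as one binary String.
def read_range(path, from, to)
  # File.binread(name, length, offset) seeks to `offset`
  # and reads up to `length` bytes.
  File.binread(path, to - from + 1, from)
end
```

This trades memory for simplicity, so part remains the better fit for large ranges.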
Ruby: Split and Read part of a file Depending on Thread Count
The file is read into an array, which is split in chunks using Enumerable#each_slice.
require 'benchmark'
require 'csv'
@file_name = 'xxx.txt'
file = File.open(@file_name, 'w')
1000.times do | i |
  file.puts "#{i.to_s}"
end
file.close
@lines = []
CSV.foreach(@file_name) { | line | @lines << line }
FILE_RECORD_COUNT = @lines.size
puts FILE_RECORD_COUNT
def setup(thread_count)
  puts "----- thread_count=#{thread_count}"
  threads = []
  fetches_per_thread = FILE_RECORD_COUNT / thread_count
  puts "----- fetches_per_thread=#{fetches_per_thread}"
  raise 'invalid slice size' if fetches_per_thread < 1
  @lines.each_slice(fetches_per_thread) do | slice |
    threads << Thread.new do
      puts "===== slice from #{slice.first} to #{slice.last}"
      slice.each do | id |
        # puts id
        # response = RestClient.get("https://api.examplerest/names/#{id}",{accept: :json})
        # do some quick validation...
      end # slice.each
    end # Thread.new
  end # @lines.each_slice
  threads.each(&:join)
end # def setup
def run_benchmark
  Benchmark.bm(20) do |bm|
    [1, 2, 3, 5, 6, 10, 15, 30, 100].each do |thread_count|
      bm.report("with #{thread_count} threads") do
        setup(thread_count)
      end
    end
  end
end
run_benchmark
Execution:
$ --------------------------------
-bash: --------------------------------: command not found
$ ruby -w t.rb
1000
user system total real
with 1 threads ----- thread_count=1
----- fetches_per_thread=1000
===== slice from ["0"] to ["999"]
0.000000 0.000000 0.000000 ( 0.000288)
with 2 threads ----- thread_count=2
----- fetches_per_thread=500
===== slice from ["0"] to ["499"]
===== slice from ["500"] to ["999"]
0.000000 0.000000 0.000000 ( 0.000318)
with 3 threads ----- thread_count=3
----- fetches_per_thread=333
===== slice from ["0"] to ["332"]
===== slice from ["666"] to ["998"]
===== slice from ["999"] to ["999"]
===== slice from ["333"] to ["665"]
0.000000 0.000000 0.000000 ( 0.000549)
with 5 threads ----- thread_count=5
----- fetches_per_thread=200
===== slice from ["0"] to ["199"]
===== slice from ["200"] to ["399"]
===== slice from ["400"] to ["599"]
===== slice from ["600"] to ["799"]
===== slice from ["800"] to ["999"]
0.000000 0.000000 0.000000 ( 0.000536)
with 6 threads ----- thread_count=6
----- fetches_per_thread=166
===== slice from ["166"] to ["331"]
===== slice from ["664"] to ["829"]
===== slice from ["830"] to ["995"]
===== slice from ["996"] to ["999"]
===== slice from ["0"] to ["165"]
===== slice from ["332"] to ["497"]
===== slice from ["498"] to ["663"]
0.000000 0.000000 0.000000 ( 0.000735)
with 10 threads ----- thread_count=10
----- fetches_per_thread=100
===== slice from ["900"] to ["999"]
...
===== slice from ["190"] to ["199"]
===== slice from ["200"] to ["209"]
===== slice from ["210"] to ["219"]
===== slice from ["220"] to ["229"]
===== slice from ["230"] to ["239"]
===== slice from ["240"] to ["249"]
...
===== slice from ["970"] to ["979"]
===== slice from ["980"] to ["989"]
===== slice from ["990"] to ["999"]
===== slice from ["20"] to ["29"]
===== slice from ["30"] to ["39"]
0.000000 0.000000 0.000000 ( 0.011656)
The line of dashes typed at the prompt is a deliberate marker: I then use the find command in the terminal to search for -------------------------------------- and jump to the beginning of the execution.
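Note that integer division leaves a remainder slice: in the output above, 3 threads produce four slices because 1000 / 3 = 333 and one record is left over. If you want at most thread_count slices, one option is to round the slice size up instead. This is a sketch; the helper name split_for_threads is hypothetical, and the thread body is only a placeholder for the real per-record work.

```ruby
# Hypothetical helper: splits `items` into at most `group_count` slices by
# rounding the slice size up, so no extra remainder slice appears.
def split_for_threads(items, group_count)
  raise ArgumentError, "group_count must be positive" unless group_count.positive?
  return [] if items.empty?

  # Integer ceiling division: ceil(items.size / group_count).
  slice_size = (items.size + group_count - 1) / group_count
  items.each_slice(slice_size).to_a
end

lines = (0...1000).map(&:to_s)
threads = split_for_threads(lines, 3).map do |slice|
  Thread.new { slice.size } # placeholder for the real work on each record
end
sizes = threads.map(&:value) # Thread#value joins and returns the block's result
puts sizes.inspect
```

With 1000 records and 3 threads this gives slices of 334, 334 and 332 records. When there are fewer records than threads, you simply get one slice per record.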