Ruby: Length of a line of a file in bytes?
IO#gets works the same as if you were capturing input from the command line: each line comes back with its terminator still attached. The numbers can still fail to match up, though: in text mode on Windows the \r\n pair is translated to a single \n, and String#length counts characters rather than bytes.
See the relevant Pickaxe section
May I enquire why you're so concerned about the line lengths summing to the file size? You may be solving a harder problem than is necessary...
Aha. I think I get it now.
Lacking a handy iPod (or any other sort, for that matter), I don't know if you want exactly 4K chunks, in which case IO#read(4000) would be your friend (4000 or 4096?) or if you're happier to break by line, in which case something like this ought to work:
class Chunkifier
  # Splits the file at path into chunks of at most 4,000 characters,
  # breaking only at line boundaries.
  def self.to_chunks(path)
    chunks = [""]
    File.readlines(path).each do |line|
      line.chomp! # strips off \n, \r or \r\n depending on OS
      if chunks.last.size + line.size >= 4_000 # 4096?
        chunks.last.chomp! # remove last line terminator
        chunks << ""
      end
      chunks.last << line + "\n" # or whatever terminator you need
    end
    chunks
  end
end

if __FILE__ == $0
  require 'test/unit'

  # PATH must be defined as the path of the file under test before this runs.
  class TestFile < Test::Unit::TestCase
    def test_chunking
      chs = Chunkifier.to_chunks(PATH)
      chs.each do |chunk|
        assert 4_000 >= chunk.size, "chunk is #{chunk.size} bytes long"
      end
    end
  end
end
Note the use of IO#readlines to get all the text in one slurp: #each or #each_line would do as well. I used String#chomp! to ensure that, whatever the OS is doing, the bytes at the end are removed, so that \n or whatever can be forced into the output.
I would suggest using File#write, rather than #print or #puts for the output, as the latter have a tendency to deliver OS-specific newline sequences.
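If exact fixed-size chunks are what you want, the IO#read approach mentioned above could look roughly like this. It is only a sketch: the method name each_byte_chunk and the 4096 default are my own choices (the 4000 vs 4096 question is still open), and the file is opened in binary mode so that no newline translation changes the byte counts.

```ruby
require 'tempfile'

CHUNK_SIZE = 4096 # assumption: could just as well be 4000

# Yields successive chunks of at most chunk_size bytes from the file at path.
def each_byte_chunk(path, chunk_size = CHUNK_SIZE)
  File.open(path, 'rb') do |file|          # 'rb' disables newline translation
    while (chunk = file.read(chunk_size))  # read(n) returns nil at end of file
      yield chunk
    end
  end
end

# Demo on a throwaway file with Windows-style line endings.
Tempfile.create('chunks') do |tmp|
  tmp.binmode
  tmp.write("line of text\r\n" * 1000)
  tmp.flush

  sizes = []
  each_byte_chunk(tmp.path) { |chunk| sizes << chunk.bytesize }
  puts sizes.inspect
  puts sizes.sum == File.size(tmp.path) # chunk sizes add up to the file size
end
```

Unlike the line-based Chunkifier, this will happily cut a line (or a multi-byte character) in half, so it only suits consumers that reassemble the chunks.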
If you're really concerned about multi-byte characters, consider taking the each_byte or unpack("C*") options and monkey-patching String, something like this:
class String
  # Number of bytes in the string, as opposed to the number of characters.
  def size_in_bytes
    self.unpack("C*").size
  end
end
The unpack version is about 8 times faster than the each_byte one on my machine, btw.
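For what it's worth, Ruby 1.9 and later also ship String#bytesize, which gives the same number without a monkey-patch. A small comparison, where the SAMPLE constant is just an illustrative string of my own:

```ruby
# "café" written with an escape so the example does not depend on source encoding.
SAMPLE = "caf\u00e9"

puts SAMPLE.length            # number of characters
puts SAMPLE.bytesize          # number of bytes in the string's encoding (UTF-8 here)
puts SAMPLE.unpack("C*").size # same byte count, via the unpack approach above
```

The accented character takes two bytes in UTF-8, so the character count and the byte count differ by one.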
Count the number of lines in a file without reading entire file into memory?
If you are in a Unix environment, you can just let wc -l do the work.
It will not load the whole file into memory, and since it is optimized for streaming a file and counting words and lines, it will generally perform better than streaming the file yourself in Ruby.
SSCCE:
filename = 'a_file/somewhere.txt'
line_count = `wc -l "#{filename}"`.strip.split(' ')[0].to_i
p line_count
Or if you want a collection of files passed on the command line:
wc_output = `wc -l "#{ARGV.join('" "')}"`
line_count = wc_output.match(/^ *([0-9]+) +total$/).captures[0].to_i
p line_count
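If wc is not available (on Windows, say), a rough pure-Ruby equivalent is to read the file in fixed-size blocks and count newline bytes. This is a sketch, not a benchmarked replacement: count_lines and the 64 KiB block size are names and values I picked for illustration.

```ruby
require 'tempfile'

# Counts "\n" bytes block by block, so only one block is in memory at a time.
def count_lines(path, block_size = 64 * 1024)
  count = 0
  File.open(path, 'rb') do |file|
    while (block = file.read(block_size)) # nil at end of file
      count += block.count("\n")
    end
  end
  count
end

Tempfile.create('lines') do |tmp|
  tmp.write("first\nsecond\nthird\n")
  tmp.flush
  p count_lines(tmp.path)
end
```

Like wc -l, this counts newline characters, so a final line with no trailing newline is not counted.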
Ruby Reads Different File Sizes for Line Reads
There are special characters stored in the file that delineate the lines:
- CR LF (0x0D 0x0A) (\r\n) on Windows/DOS and
- 0x0A (\n) on UNIX systems.
Ruby's gets uses the UNIX method. So, if you read a Windows file in text mode, you lose 1 byte for every line you read, as the \r\n bytes are converted to \n.
Also, String#length is not a good measure of the size of the string in bytes. If the String is not ASCII, one character may be represented by more than one byte (Unicode). That is, it returns the number of characters in the String, not the number of bytes.
To get the size of a file, use File.size(file_name).
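To see the two effects side by side, here is a small sketch that opens a file in binary mode and sums String#bytesize per line, which does add up to File.size. The helper name total_line_bytes and the sample content are my own, not from the question.

```ruby
require 'tempfile'

# Sums the byte sizes of all lines, reading in binary mode so "\r\n" is kept intact.
def total_line_bytes(path)
  File.open(path, 'rb') { |file| file.each_line.sum(&:bytesize) }
end

Tempfile.create('crlf') do |tmp|
  tmp.binmode
  tmp.write("caf\u00e9\r\ntwo\r\n".b) # Windows line endings plus one 2-byte character
  tmp.flush

  puts File.size(tmp.path)
  puts total_line_bytes(tmp.path)

  # Character counts of the chomped lines, for comparison with the byte totals.
  chars = File.open(tmp.path, 'r:UTF-8') { |f| f.each_line.sum { |l| l.chomp.length } }
  puts chars
end
```

The first two numbers agree; the third is smaller because it drops the line terminators and counts the accented character once.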
Read lines based on their length
This has a number of problems:
len = 3
d = Array.new
t = File.open('d.txt').read
t.each_line do |x|
  #+2 accounting for \n\r
  if x.length == (len + 2)
    d.push(x)
  end
end
First, the entire file is read into memory because of File.open('d.txt').read, then split into lines using each_line, and finally lines that are the desired length are captured. If the file consisted of 1,000,000 lines and only one was three characters long, there'd be a lot of wasted memory and CPU time.
Instead, write it like this:
len = 3
d = []
File.foreach('d.txt') do |x|
  d << x if (x.chomp.length == len)
end
foreach reads each line, maintaining the line breaks. chomp removes the line break so you can compare the actual line, without its line ending, to len. Then, if the length matches, the line gets appended to the array. At no time is the entire file in memory, unless every line is the desired length. This saves memory and will run extremely fast, maybe even faster than the original that used read to slurp the entire file, because that process can take a while if the file is sufficiently big.
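On Ruby 2.4 or later, foreach also accepts chomp: true, which moves the chomp into the read itself. A sketch of that variant follows; lines_of_length is a hypothetical helper name, and note that, unlike the version above, it collects the lines without their line breaks.

```ruby
require 'tempfile'

# Returns the lines of the file whose length (excluding the line break) equals len.
# File.foreach without a block returns an Enumerator, so lines are still read one at a time.
def lines_of_length(path, len)
  File.foreach(path, chomp: true).select { |line| line.length == len }
end

Tempfile.create('d') do |tmp|
  tmp.write("abc\nde\nxyz\nlonger\n")
  tmp.flush
  p lines_of_length(tmp.path, 3)
end
```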
Pretty file size in Ruby?
How about the Filesize gem? It seems to be able to convert from bytes (and other formats) into pretty-printed values.
Example:
Filesize.from("12502343 B").pretty # => "11.92 MiB"
http://rubygems.org/gems/filesize
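If you'd rather not add a dependency, a few lines of plain Ruby get you similar output. This is a minimal sketch and not the gem's implementation: pretty_size and SIZE_UNITS are names I made up, and it only handles binary (1024-based) units.

```ruby
SIZE_UNITS = %w[B KiB MiB GiB TiB].freeze

# Formats a byte count using binary units, with two decimals above 1 KiB.
def pretty_size(bytes)
  value = bytes.to_f
  unit = 0
  # Divide by 1024 until the value fits the current unit or we run out of units.
  while value >= 1024 && unit < SIZE_UNITS.size - 1
    value /= 1024
    unit += 1
  end
  unit.zero? ? "#{bytes} B" : format('%.2f %s', value, SIZE_UNITS[unit])
end

puts pretty_size(512)
puts pretty_size(12_502_343)
```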
Getting accurate file size in megabytes?
Try:
compressed_file_size = File.size("Compressed/#{project}.tar.bz2").to_f / 2**20
formatted_file_size = '%.2f' % compressed_file_size
One-liner:
compressed_file_size = '%.2f' % (File.size("Compressed/#{project}.tar.bz2").to_f / 2**20)
or:
compressed_file_size = (File.size("Compressed/#{project}.tar.bz2").to_f / 2**20).round(2)
Further information on the % operator of String:
http://ruby-doc.org/core-1.9/classes/String.html#M000207
BTW: I prefer "MiB" instead of "MB" if I use base2 calculations (see: http://en.wikipedia.org/wiki/Mebibyte)
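The difference between the two units is easy to see with a round number. In this sketch, to_mib and to_mb are hypothetical helper names:

```ruby
# Mebibytes: base 2, 1 MiB = 2**20 bytes.
def to_mib(bytes)
  bytes.to_f / 2**20
end

# Megabytes: base 10, 1 MB = 10**6 bytes.
def to_mb(bytes)
  bytes.to_f / 10**6
end

bytes = 5_000_000
puts '%.2f MiB' % to_mib(bytes)
puts '%.2f MB' % to_mb(bytes)
```

The same file comes out roughly 5% smaller when expressed in MiB, which is why labelling a 2**20 division as "MB" can confuse people.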
What does the .size on number mean in ruby?
I always suggest checking a method you're not sure about using the following scheme:
1. Check where the method comes from (using Object#method):
   number.method(:size)
   #=> #<Method: Fixnum#size>
2. Open the docs and learn what Fixnum#size does and how it works.
   2.1 If you're using IRB, you can run help 'Fixnum#size' to get the docs right in your console.
   2.2 If you're using pry, you can go with show-doc Fixnum#size (install the pry-doc gem first).
In Ruby 2.1.8 the method was defined in Fixnum#size. Starting from Ruby 2.4 it's defined in Integer#size:
Returns the number of bytes in the machine representation of int.
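A quick way to try this out yourself, assuming Ruby 2.4 or later (on older versions the owner would be Fixnum or Bignum instead). The exact numbers depend on your platform, so none are shown here:

```ruby
number = 42

# Where does #size come from?
puts number.method(:size).owner

# Bytes in the machine representation of a small integer (the size of a native long).
puts number.size

# Large integers need more bytes, so #size grows with the value.
puts((2**100).size)
```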