Ruby: Length of a Line of a File in Bytes

Ruby: Length of a line of a file in bytes?

IO#gets returns each line with its terminator still attached, whether it is called on standard input, a File, or any other subclass of IO. However, when a file is opened in text mode on Windows, each \r\n is translated to a single \n, and String#length counts characters rather than bytes, so the line lengths need not sum to the file size.

See the relevant Pickaxe section

May I enquire why you're so concerned about the line lengths summing to the file size? You may be solving a harder problem than is necessary...

Aha. I think I get it now.

Lacking a handy iPod (or any other sort, for that matter), I don't know if you want exactly 4K chunks, in which case IO#read(4000) would be your friend (4000 or 4096?) or if you're happier to break by line, in which case something like this ought to work:

class Chunkifier
  def Chunkifier.to_chunks(path)
    chunks = [""]
    File.readlines(path).each do |line|
      line.chomp! # strips off \n, \r or \r\n, whichever is present
      if chunks.last.size + line.size >= 4_000 # 4096?
        chunks.last.chomp! # remove last line terminator
        chunks << ""
      end
      chunks.last << line + "\n" # or whatever terminator you need
    end
    chunks
  end
end

if __FILE__ == $0
  require 'test/unit'
  # PATH must be defined as the path of the file under test
  class TestFile < Test::Unit::TestCase
    def test_chunking
      chs = Chunkifier.to_chunks(PATH)
      chs.each do |chunk|
        assert 4_000 >= chunk.size, "chunk is #{chunk.size} bytes long"
      end
    end
  end
end

Note the use of File.readlines to get all the text in one slurp: File.foreach or IO#each_line would do as well. I used String#chomp! to ensure that, whatever the OS is doing, the bytes at the end are removed, so that \n or whatever can be forced into the output.

I would suggest using File#write, rather than #print or #puts for the output, as the latter have a tendency to deliver OS-specific newline sequences.
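For the exact-size alternative (IO#read with a byte count), a minimal sketch might look like the following. The helper name each_fixed_chunk, the 4096 default and the sample file are my own assumptions; swap in 4000 if that is the real limit.

```ruby
require 'tmpdir'

# Yields successive chunks of at most chunk_size bytes (hypothetical helper).
def each_fixed_chunk(path, chunk_size = 4096)
  return enum_for(:each_fixed_chunk, path, chunk_size) unless block_given?

  File.open(path, 'rb') do |file| # binary mode: no newline translation
    # IO#read(n) returns nil at end of file, which ends the loop.
    while (chunk = file.read(chunk_size))
      yield chunk
    end
  end
end

# Usage with a throwaway 10,000-byte file
Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample.txt')
  File.binwrite(path, 'x' * 10_000)
  p each_fixed_chunk(path).map(&:bytesize) # => [4096, 4096, 1808]
end
```

Bear in mind that a fixed byte boundary can fall in the middle of a line, or in the middle of a multi-byte character, which is exactly what the line-based version avoids.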

If you're really concerned about multi-byte characters, consider taking the each_byte or unpack("C*") options and monkey-patching String, something like this:

class String
  def size_in_bytes
    self.unpack("C*").size
  end
end

The unpack version is about 8 times faster than the each_byte one on my machine, btw.
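On Ruby 1.9 and later the monkey-patch is probably unnecessary, since String#bytesize is built in. A small sketch of the difference between characters and bytes (the sample strings are just illustrations):

```ruby
ascii    = "hello"
accented = "h\u00E9llo" # "é" is one character but two bytes in UTF-8

p ascii.length               # => 5
p ascii.bytesize             # => 5
p accented.length            # => 5
p accented.bytesize          # => 6
p accented.unpack("C*").size # => 6
```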

Count the number of lines in a file without reading entire file into memory?

If you are in a Unix environment, you can just let wc -l do the work.

It will not load the whole file into memory. Because wc is optimized for streaming files and counting words and lines, it is usually fast enough that you don't need to stream the file yourself in Ruby.

SSCCE:

filename = 'a_file/somewhere.txt'
line_count = `wc -l "#{filename}"`.strip.split(' ')[0].to_i
p line_count

Or if you want a collection of files passed on the command line:

# wc prints a "total" line only when given two or more files
wc_output = `wc -l "#{ARGV.join('" "')}"`
line_count = wc_output.match(/^ *([0-9]+) +total$/).captures[0].to_i
p line_count
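If wc isn't available (on Windows, say), a pure-Ruby sketch that still avoids slurping the file could look like this. The helper name count_lines and the sample file are my own, purely for illustration.

```ruby
require 'tmpdir'

# Counts lines by streaming: File.foreach hands over one line at a time,
# so only the current line is held in memory.
def count_lines(path)
  File.foreach(path).count
end

# Usage with a throwaway file
Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample.txt')
  File.write(path, "a\nb\nc\n")
  p count_lines(path) # => 3
end
```

One difference to keep in mind: wc -l counts newline characters, so a final line with no trailing newline is counted by this sketch but not by wc.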

Ruby Reads Different File Sizes for Line Reads

There are special characters stored in the file that delineate the lines:

  • CR LF (0x0D 0x0A) (\r\n) on Windows/DOS and
  • 0x0A (\n) on UNIX systems.

Ruby's gets splits lines on \n, the UNIX convention. On Windows, a file opened in text mode also has each \r\n converted to \n, so you would lose 1 byte for every line you read.

Also, String#length is not a good measure of the size of the string in bytes. If the String is not ASCII, one character may be represented by more than one byte (Unicode). That is, it returns the number of characters in the String, not the number of bytes; String#bytesize returns the number of bytes.

To get the size of a file, use File.size(file_name).
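To see the per-line byte counts add up to File.size, one approach is to open the file in binary mode (so no \r\n translation happens) and sum bytesize rather than length. A rough sketch, with a made-up helper name and sample data:

```ruby
require 'tmpdir'

# Sums the byte length of every line, terminators included.
# 'rb' disables newline translation, so \r\n stays as two bytes.
def line_bytes_total(path)
  total = 0
  File.open(path, 'rb') do |file|
    file.each_line { |line| total += line.bytesize }
  end
  total
end

# Usage: a small CRLF file containing one multi-byte character
Dir.mktmpdir do |dir|
  path = File.join(dir, 'crlf.txt')
  File.binwrite(path, "one\r\ntwo\r\nthr\u00E9e\r\n")
  p line_bytes_total(path) # => 18
  p File.size(path)        # => 18
end
```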

Read lines based on their length

This has a number of problems:

len = 3
d = Array.new
t = File.open('d.txt').read
t.each_line do |x|
  #+2 accounting for \n\r
  if x.length == (len + 2)
    d.push(x)
  end
end

First, the entire file is read into memory because of File.open('d.txt').read, then split into lines using each_line, and finally lines that are the desired length are captured. If the file consisted of 1,000,000 lines and only one was three characters long, there'd be a lot of wasted memory and CPU time.

Instead, write it like this:

len = 3
d = []
File.foreach('d.txt') do |x|
  d << x if x.chomp.length == len
end

foreach reads one line at a time, keeping the line break. chomp removes the line break so you can compare the actual line content to len. Then, if the length matches, the line is appended to the array. At no time is the entire file in memory, unless every line is the desired length. This saves memory and should be at least as fast as the original, which used read to slurp the entire file; slurping can take a while if the file is sufficiently big.
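Wrapped up as a reusable method, the same idea might be sketched like this. The name lines_of_length is hypothetical, and unlike the snippet above it stores the chomped lines rather than the raw ones.

```ruby
require 'tmpdir'

# Collects the lines whose content (terminator removed) is exactly len
# characters long, reading the file one line at a time.
def lines_of_length(path, len)
  File.foreach(path).each_with_object([]) do |line, matches|
    stripped = line.chomp # drops a trailing \n, \r\n or \r
    matches << stripped if stripped.length == len
  end
end

# Usage with a throwaway file that mixes \n and \r\n endings
Dir.mktmpdir do |dir|
  path = File.join(dir, 'd.txt')
  File.binwrite(path, "abc\nde\r\nxyz\nlonger\n")
  p lines_of_length(path, 3) # => ["abc", "xyz"]
end
```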

Pretty file size in Ruby?

How about the Filesize gem? It seems to be able to convert from bytes (and other formats) into pretty-printed values:

Example:

Filesize.from("12502343 B").pretty      # => "11.92 MiB"

http://rubygems.org/gems/filesize

Getting accurate file size in megabytes?

Try:

compressed_file_size = File.size("Compressed/#{project}.tar.bz2").to_f / 2**20
formatted_file_size = '%.2f' % compressed_file_size

One-liner:

compressed_file_size = '%.2f' % (File.size("Compressed/#{project}.tar.bz2").to_f / 2**20)

or:

compressed_file_size = (File.size("Compressed/#{project}.tar.bz2").to_f / 2**20).round(2)

Further information on %-operator of String:
http://ruby-doc.org/core-1.9/classes/String.html#M000207


BTW: I prefer "MiB" to "MB" when I use base-2 calculations (see: http://en.wikipedia.org/wiki/Mebibyte)
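If you'd rather not pull in a gem, the division by powers of 1024 generalizes fairly easily. Here is a rough sketch; pretty_size and its unit list are my own invention, not part of any library.

```ruby
# Formats a byte count using binary (base-2) units.
# Hypothetical helper, shown only to illustrate the arithmetic.
def pretty_size(bytes)
  units = %w[B KiB MiB GiB TiB]
  size = bytes.to_f
  unit = units.shift
  # Divide by 1024 until the value fits the current unit or units run out.
  while size >= 1024 && !units.empty?
    size /= 1024
    unit = units.shift
  end
  unit == 'B' ? "#{bytes} B" : format('%.2f %s', size, unit)
end

p pretty_size(512)        # => "512 B"
p pretty_size(2**20)      # => "1.00 MiB"
p pretty_size(12_502_343) # => "11.92 MiB"
```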

What does the .size on number mean in ruby?

I always suggest checking a method you're not sure about using the following scheme:

  1. Check where the method comes from (using Object#method):

    number.method(:size)
    #=> #<Method: Fixnum#size>
  2. Open the docs for Fixnum#size and learn what it does and how it works.

    2.1 If you're using IRB, you can run help 'Fixnum#size' to get the docs right in your console
    2.2 If you're using pry, you can go with show-doc Fixnum#size (install pry-doc gem first)


In Ruby 2.1.8 the method was defined as Fixnum#size.

Starting from Ruby 2.4 it's defined as Integer#size:

Returns the number of bytes in the machine representation of int.
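A quick way to see this for yourself (the exact numbers depend on your platform and Ruby build, so treat the comments as typical rather than guaranteed):

```ruby
small = 1
big   = 2**100 # too large for a machine word

p small.size                # commonly 8 on 64-bit builds
p big.size                  # larger; the exact value is platform-dependent
p small.method(:size).owner # Integer on Ruby 2.4 and later
```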


