File.open, open and IO.foreach in Ruby, What Is the Difference

File.open, open and IO.foreach in Ruby, what is the difference?

There are important differences between these three choices.

File.open("file").each_line { |line| puts line }

  • File.open opens a local file and returns a file object
  • the file stays open until you call IO#close on it
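A small self-contained sketch of this (using a Tempfile, so no fixed file name is assumed):

```ruby
require "tempfile"

# Create a throwaway file so the example is self-contained.
tmp = Tempfile.new("demo")
tmp.write("first\nsecond\n")
tmp.flush

f = File.open(tmp.path)          # returns a File object
lines = f.each_line.map(&:chomp) # iterate without closing
still_open = !f.closed?          # true -- nothing has closed it yet
f.close                          # we are responsible for closing it
```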

open("file").each_line { |line| puts line }

Kernel.open looks at the string to decide what to do with it.

require "open-uri" # needed for Kernel#open to recognize URLs (behavior shown is pre-Ruby 3.0)

open(".irbrc").class # => File
open("http://google.com/").class # => StringIO
File.open("http://google.com/") # => Errno::ENOENT: No such file or directory - http://google.com/

In the second case the StringIO object returned by Kernel#open actually holds the content of http://google.com/. If Kernel#open returns a File object, it stays open until you call IO#close on it. Note that opening URLs this way requires open-uri, and since Ruby 3.0 open-uri no longer overrides Kernel#open at all; use URI.open for URLs instead.
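One more thing Kernel#open does with strings is worth knowing: a string starting with | is run as a subprocess, which is a good reason to prefer File.open for untrusted input. A self-contained sketch (a Tempfile stands in for a local file):

```ruby
require "tempfile"

tmp = Tempfile.new("demo")
tmp.write("hello\n")
tmp.flush

# A plain path gives back a File, just like File.open.
f = open(tmp.path)
kind = f.class     # File
content = f.read
f.close

# A leading pipe makes Kernel#open spawn a subprocess instead.
cmd = open("|echo from-a-subprocess")
cmd_output = cmd.read
cmd.close
```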

IO.foreach("file") { |line| puts line }

  • IO.foreach opens a file, calls the given block for each line it reads, and closes the file afterwards.
  • You don't have to worry about closing the file.
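A minimal sketch, again with a Tempfile standing in for "file":

```ruby
require "tempfile"

tmp = Tempfile.new("demo")
tmp.write("a\nb\nc\n")
tmp.flush

lines = []
IO.foreach(tmp.path) { |line| lines << line.chomp }
# The file was opened, iterated, and closed for us -- no handle to manage.
```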

File.read("file").each_line { |line| puts line }

You didn't mention this choice, but this is the one I would use in most cases.

  • File.read reads a file completely and returns it as a string.
  • You don't have to worry about closing the file.
  • In comparison to IO.foreach, this makes it clear that you are dealing with a file.
  • The memory complexity for this is O(n). If you know you are dealing with a small file, this is no drawback. But if it can be a big file and you know your memory complexity can be smaller than O(n), don't use this choice.

It fails in this situation:

s = File.read("/dev/zero") # => never terminates
s.each_line …

ri

ri is a tool that shows you the Ruby documentation. You use it like this in your shell:

ri File.open
ri open
ri IO.foreach
ri File#each_line

With this you can find almost everything I wrote here and much more.

In ruby, file.readlines.each not faster than file.open.each_line, why?

Both readlines and open.each_line read the file only once. Ruby buffers IO objects, reading a block (e.g. 64 KB) of data from disk at a time to minimize the cost of disk reads, so there should be little difference between the two in the disk-read step.

When you call readlines, Ruby constructs an empty array and repeatedly reads a line of the file and pushes it onto the array. Finally, it returns the array containing all lines of the file.

When you call each_line, Ruby reads a line of the file and yields it to your logic. When you have finished processing that line, Ruby reads another. It repeats until there is no more content in the file.

The difference between the two methods is that readlines has to append each line to an array. When the file is large, Ruby may have to reallocate the underlying (C-level) array one or more times to enlarge it.

Digging into the source, readlines is implemented by io_s_readlines, which calls rb_io_readlines. rb_io_readlines calls rb_io_getline_1 to fetch a line and rb_ary_push to push the result onto the returned array.

each_line is implemented by rb_io_each_line, which calls rb_io_getline_1 to fetch a line just like readlines does, and yields the line to your logic with rb_yield.

So each_line has no need to store results in a growing array: no array resizing, no copying.
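The contrast can be sketched side by side; both paths see identical content, but readlines materializes all of it before you get anything (a Tempfile keeps the example self-contained):

```ruby
require "tempfile"

tmp = Tempfile.new("demo")
tmp.write("one\ntwo\nthree\n")
tmp.flush

# readlines materializes every line in an array up front...
all_at_once = File.readlines(tmp.path)

# ...while each_line hands lines over one at a time; we collect them
# here only to show both paths see identical content.
one_at_a_time = []
File.open(tmp.path) { |f| f.each_line { |line| one_at_a_time << line } }
```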

What are all the common ways to read a file in Ruby?

File.open("my/file/path", "r") do |f|
  f.each_line do |line|
    puts line
  end
end
# File is closed automatically at end of block

It is also possible to open the file without a block and close it explicitly afterwards (passing a block to open, as above, closes it for you):

f = File.open("my/file/path", "r")
f.each_line do |line|
  puts line
end
f.close

File.open with block vs without

DarkDust already said that these methods are different. I'll explain blocks a little more, as I think I can guess why you asked this question ;-)

A block in Ruby is just a parameter passed to a method. It is not merely different syntax.

Methods that accept (optional) blocks usually test whether they were called with a block or without one.

Consider this very simplified example (the real File.open is similar, but it also ensures the file is closed even if your block raises an error):

def open(fname)
  self.do_open(fname)
  if block_given?
    yield(self) # This will 'run' the block with the given parameter
    self.close
  else
    return self # This will just return some value
  end
end

In general, any method may behave differently with a block than without one. This should always be stated in the method's documentation.
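Here is a runnable sketch of that pattern, using a hypothetical Resource class (not a real Ruby API) in place of a file:

```ruby
# Hypothetical Resource class: stands in for anything that must be closed.
class Resource
  attr_reader :closed

  def initialize
    @closed = false
  end

  def close
    @closed = true
  end

  # Behaves like the simplified open above: yields and auto-closes when
  # given a block, otherwise returns the open resource.
  def self.acquire
    r = new
    if block_given?
      begin
        yield r
      ensure
        r.close # closed even if the block raises
      end
    else
      r # caller is now responsible for closing it
    end
  end
end

with_block = nil
Resource.acquire { |r| with_block = r } # auto-closed after the block
without_block = Resource.acquire        # stays open until closed
```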

Rails - file.readlines vs file.gets

What do the docs say? (Note that File is a subclass of IO. The methods #readlines and #gets are defined on IO.)

IO#readlines:

Reads all of the lines […], and returns them in an array.

IO#gets:

Reads the next “line” from the I/O stream.

Thus, I expect the latter to be better in terms of memory usage as it doesn't load the entire file into memory.
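A sketch of the gets approach, reading one line per call until gets returns nil at end of file (a Tempfile keeps it self-contained):

```ruby
require "tempfile"

tmp = Tempfile.new("demo")
tmp.write("alpha\nbeta\n")
tmp.flush

# gets returns one line per call, then nil at end of file,
# so memory use stays bounded by the longest line.
read_lines = []
File.open(tmp.path) do |f|
  while (line = f.gets)
    read_lines << line.chomp
  end
end
```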

How do I read the nth line of a file efficiently in Ruby?

What about IO.foreach?

IO.foreach('filename') { |line| p line; break }

That should read the first line, print it, and then stop. It does not read the entire file; it reads one line at a time.
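Building on that, a hypothetical nth_line helper (not a standard method) can stop reading as soon as the wanted line is reached, by combining foreach with a lazy enumerator:

```ruby
require "tempfile"

tmp = Tempfile.new("demo")
tmp.write("l1\nl2\nl3\nl4\n")
tmp.flush

# Hypothetical helper: returns only the nth line (1-based); the lazy
# enumerator stops reading the file once that line is found.
def nth_line(path, n)
  File.foreach(path).lazy.drop(n - 1).first
end

third = nth_line(tmp.path, 3)
```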

Write to a file in Ruby with every iteration of a loop

Let's use IO::write to create two input files.

FNameIn1 = 'in1'
File.write(FNameIn1, "cow\npig\ngoat\nhen\n")
#=> 17

We can use IO::read to confirm what was written.

puts File.read(FNameIn1)
cow
pig
goat
hen

FNameIn2 = 'in2'
File.write(FNameIn2, "12\n34\n56\n78\n")
#=> 12
puts File.read(FNameIn2)
12
34
56
78

Next, use File::open to open the two input files for reading, obtaining a file handle for each.

f1 = File.open(FNameIn1)
#=> #<File:in1>
f2 = File.open(FNameIn2)
#=> #<File:in2>

Now open a file for writing.

FNameOut = 'out'
f = File.open(FNameOut, "w")
#=> #<File:out>

Assuming the two input files have the same number of lines, in a loop read the next lines from each, combine them in some way, and then write the resulting line to the output file.

until f1.eof
  line11 = f1.gets.chomp
  line12 = f1.gets.chomp
  line21 = f2.gets.chomp
  line22 = f2.gets.chomp
  f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
end

See IO#eof, IO#gets and IO#puts.

Lastly, use IO#close to close the files.

f1.close
f2.close
f.close

Let's see what FNameOut looks like.

puts File.read(FNameOut)
cow 12, pig 34
goat 56, hen 78

We can have Ruby close the files by using a block for each File::open:

File.open(FNameIn1) do |f1|
  File.open(FNameIn2) do |f2|
    File.open(FNameOut, "w") do |f|
      until f1.eof
        line11 = f1.gets.chomp
        line12 = f1.gets.chomp
        line21 = f2.gets.chomp
        line22 = f2.gets.chomp
        f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
      end
    end
  end
end
puts File.read FNameOut
cow 12, pig 34
goat 56, hen 78

This is in fact how it's normally done in Ruby, in part to avoid the possibility of forgetting to close files.

Here's another way, using IO::foreach, which, without a block, returns an enumerator, allowing the use of Enumerable#each_slice, as referenced in the question.

e1 = File.foreach(FNameIn1).each_slice(2)
#=> #<Enumerator: #<Enumerator: File:foreach("in1")>:each_slice(2)>
e2 = File.foreach(FNameIn2).each_slice(2)
#=> #<Enumerator: #<Enumerator: File:foreach("in2")>:each_slice(2)>

File.open(FNameOut, "w") do |f|
  loop do
    line11, line12 = e1.next.map(&:chomp)
    line21, line22 = e2.next.map(&:chomp)
    f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
  end
end
puts File.read(FNameOut)
cow 12, pig 34
goat 56, hen 78

We may observe the values generated by the enumerator

e1 = File.foreach(FNameIn1).each_slice(2)

by repeatedly executing Enumerator#next:

e1.next
#=> ["cow\n", "pig\n"]
e1.next
#=> ["goat\n", "hen\n"]
e1.next
#=> StopIteration (iteration reached an end)

The StopIteration exception, when raised, is handled by Kernel#loop by breaking out of the loop (which is one reason why loop is so useful).
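A minimal sketch of loop swallowing StopIteration while an enumerator is drained with next:

```ruby
# Kernel#loop silently rescues StopIteration, so draining an
# enumerator with next needs no explicit rescue.
e = [10, 20, 30].each # an Enumerator over three values
seen = []
loop do
  seen << e.next # raises StopIteration after the last value
end
# Control resumes here once the enumerator is exhausted.
```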

Ruby I/O programming about a unless loop

unless is the negative equivalent of if. It does not loop.

The keyword you were looking for, presumably, is until - which is the inverse of while, so does indeed perform a loop.
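A minimal sketch of until driving a read loop, with StringIO standing in for a file:

```ruby
require "stringio"

# until loops while its condition is false -- here, until EOF.
io = StringIO.new("x\ny\n")
collected = []
until io.eof?
  collected << io.gets.chomp
end
```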

What is the most performant way of processing this large text file?

You need to run a benchmark test, using Ruby's built-in Benchmark to figure out what is your fastest choice.

However, from experience I've found that "slurping" the file, i.e. reading it all in at once, is not any faster than looping with IO.foreach or File.foreach. This is because Ruby and the underlying OS buffer the file as reads occur, so your loop runs against memory, not directly against disk. Note that foreach does not strip the line terminators for you the way split would, so add a chomp or chomp! if you want to remove them from each line read in:

File.foreach('/path/to/file') do |li|
  puts li.chomp
end

or

File.foreach('/path/to/file') do |li|
  li.chomp!
  puts li
end

Also, slurping is not scalable: you could end up trying to read a file bigger than memory, bringing your machine to its knees, while reading line by line will never do that.


Here's some performance numbers:

#!/usr/bin/env ruby

require 'benchmark'
require 'fileutils'

FILENAME = 'test.txt'
LOOPS = 1

puts "Ruby Version: #{RUBY_VERSION}"
puts "Filesize being read: #{File.size(FILENAME)}"
puts "Lines in file: #{`wc -l #{FILENAME}`.split.first}"

Benchmark.bm(20) do |x|
  x.report('read.split')           { LOOPS.times { File.read(FILENAME).split("\n") }}
  x.report('read.lines.chomp')     { LOOPS.times { File.read(FILENAME).lines.map(&:chomp) }}
  x.report('readlines.map.chomp1') { LOOPS.times { File.readlines(FILENAME).map(&:chomp) }}
  x.report('readlines.map.chomp2') { LOOPS.times { File.readlines(FILENAME).map{ |s| s.chomp } }}
  x.report('foreach.map.chomp1')   { LOOPS.times { File.foreach(FILENAME).map(&:chomp) }}
  x.report('foreach.map.chomp2')   { LOOPS.times { File.foreach(FILENAME).map{ |s| s.chomp } }}
end

And the results:

Ruby Version: 1.9.3
Filesize being read: 42026131
Lines in file: 465440
                           user     system      total        real
read.split             0.150000   0.060000   0.210000 (  0.213365)
read.lines.chomp       0.470000   0.070000   0.540000 (  0.541266)
readlines.map.chomp1   0.450000   0.090000   0.540000 (  0.535465)
readlines.map.chomp2   0.550000   0.060000   0.610000 (  0.616674)
foreach.map.chomp1     0.580000   0.060000   0.640000 (  0.641563)
foreach.map.chomp2     0.620000   0.050000   0.670000 (  0.662912)

On today's machines a 42 MB file can be read into RAM pretty safely, but I have seen files a lot bigger than that which won't fit into the memory of some of our production hosts. While foreach is slower, it also won't bring a machine to its knees by consuming all available memory when there isn't enough.

On Ruby 1.9.3, using the map(&:chomp) method, instead of the older form of map { |s| s.chomp }, is a lot faster. That wasn't true with older versions of Ruby, so caveat emptor.

Also, note that all of the above processed the data in less than one second on my several-years-old Mac Pro. All in all, I'd say that worrying about load speed is premature optimization; the real problem will be what is done after the data is loaded.


