File.open, open and IO.foreach in Ruby, what is the difference?
There are important differences between these three choices.
File.open("file").each_line { |line| puts line }
File.open opens a local file and returns a File object. The file stays open until you call IO#close on it.
open("file").each_line { |line| puts line }
Kernel.open looks at the string to decide what to do with it.
open(".irbrc").class # => File
open("http://google.com/").class # => StringIO
File.open("http://google.com/") # => Errno::ENOENT: No such file or directory - http://google.com/
In the second case the StringIO object returned by Kernel#open actually holds the content of http://google.com/. If Kernel#open returns a File object, it stays open until you call IO#close on it.
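Note that the URL behaviour shown above depends on open-uri: plain Kernel#open only handles local files and pipes, and it is open-uri's wrapper around open that knows how to fetch URLs (for large responses it may return a Tempfile rather than a StringIO). A minimal sketch of the required setup:
require 'open-uri'
open("http://google.com/").class # => StringIO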
IO.foreach("file") { |line| puts line }
IO.foreach opens a file, calls the given block for each line it reads, and closes the file afterwards. You don't have to worry about closing the file.
File.read("file").each { |line| puts line }
You didn't mention this choice, but it is the one I would use in most cases.
File.read reads a file completely and returns it as a string.
- You don't have to worry about closing the file.
- In comparison to IO.foreach, it makes clear that you are dealing with a file.
- The memory complexity is O(n). If you know you are dealing with a small file, this is no drawback. But if it can be a big file and you know your memory complexity can be smaller than O(n), don't use this choice.
It fails in this situation:
s = File.read("/dev/zero") # => never terminates
s.each_line …
ri
ri is a tool which shows you the Ruby documentation. You use it like this in your shell:
ri File.open
ri open
ri IO.foreach
ri File#each_line
With this you can find almost everything I wrote here and much more.
In Ruby, file.readlines.each not faster than file.open.each_line, why?
Both readlines and open.each_line read the file only once. And Ruby buffers IO objects, reading a block (e.g. 64 KB) of data from disk at a time to minimize the cost of disk reads, so there should be little difference between the two in the disk-read step.
When you call readlines, Ruby constructs an empty array [], then repeatedly reads a line of the file's contents and pushes it onto the array. At the end it returns the array containing all lines of the file.
When you call each_line, Ruby reads a line of the file's contents and yields it to your logic. When you have finished processing this line, Ruby reads another one. It keeps reading lines until there is no more content in the file.
The difference between the two methods is that readlines has to append the lines to an array. When the file is large, Ruby may have to duplicate the underlying (C-level) array to enlarge it one or more times.
Digging into the source, readlines is implemented by io_s_readlines, which calls rb_io_readlines. rb_io_readlines calls rb_io_getline_1 to fetch a line and rb_ary_push to push the result onto the returned array.
each_line is implemented by rb_io_each_line, which calls rb_io_getline_1 to fetch a line, just like readlines, and then yields the line to your logic with rb_yield.
So with each_line there is no need to store the line results in a growing array: no array resizing and no copying.
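To make the contrast concrete, here is a minimal sketch of the two styles; the file name and the process method are placeholders, not from the question:
File.readlines("large.txt").each { |line| process(line) } # builds the whole array first
File.open("large.txt") { |f| f.each_line { |line| process(line) } } # one line at a time, no growing array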
What are all the common ways to read a file in Ruby?
File.open("my/file/path", "r") do |f|
  f.each_line do |line|
    puts line
  end
end
# File is closed automatically at end of block
It is also possible to open the file and explicitly close it afterwards (passing a block to open, as above, closes it for you):
f = File.open("my/file/path", "r")
f.each_line do |line|
  puts line
end
f.close
File.open with block vs without
DarkDust already said that these methods are different. I'll explain blocks a little more, as I suppose I can guess why you asked this question ;-)
A block in Ruby is just a parameter for a method; it's not merely different syntax. Methods which accept (optional) blocks usually contain a condition to test whether they have been called with a block or without one.
Consider this very simplified example (the real File.open is similar, but it also ensures the file is closed even if your block raises an error):
def open(fname)
  self.do_open(fname)
  if block_given?
    yield(self) # This will 'run' the block with the given parameter
    self.close
  else
    return self # This will just return some value
  end
end
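Since the remark about errors matters in practice, here is a slightly closer, still simplified, sketch of how a block-accepting open can guarantee the close with ensure (do_open remains a hypothetical helper):
def open(fname)
  file = do_open(fname)
  return file unless block_given?
  begin
    yield(file)
  ensure
    file.close # runs even if the block raises
  end
end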
In general, every method may work differently with a block and without a block. This should always be stated in the method's documentation.
Rails - file.readlines vs file.gets
What do the docs say? (Note that File is a subclass of IO. The methods #readlines and #gets are defined on IO.)
IO#readlines:
Reads all of the lines […], and returns them in an array.
IO#gets:
Reads the next “line” from the I/O stream.
Thus, I expect the latter to be better in terms of memory usage as it doesn't load the entire file into memory.
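For line-by-line processing with gets, the usual pattern looks something like this (the file name is a placeholder):
File.open("large_file.txt") do |f|
  while (line = f.gets) # gets returns nil at end of file
    puts line # only one line is held in memory at a time
  end
end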
How do I read the nth line of a file efficiently in Ruby?
What about IO.foreach?
IO.foreach('filename') { |line| p line; break }
That should read the first line, print it, and then stop. It does not read the entire file; it reads one line at a time.
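Extending the same idea to an arbitrary nth line, a sketch assuming 1-based numbering (nth_line is a made-up name):
def nth_line(path, n)
  File.foreach(path).with_index(1) do |line, i|
    return line if i == n # stop reading as soon as line n is reached
  end
  nil # the file has fewer than n lines
end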
Write to a file in Ruby with every iteration of a loop
Let's use IO::write to create two input files.
FNameIn1 = 'in1'
File.write(FNameIn1, "cow\npig\ngoat\nhen\n")
#=> 17
We can use IO::read to confirm what was written.
puts File.read(FNameIn1)
cow
pig
goat
hen
FNameIn2 = 'in2'
File.write(FNameIn2, "12\n34\n56\n78\n")
#=> 12
puts File.read(FNameIn2)
12
34
56
78
Next, use File::open to open the two input files for reading, obtaining a file handle for each.
f1 = File.open(FNameIn1)
#=> #<File:in1>
f2 = File.open(FNameIn2)
#=> #<File:in2>
Now open a file for writing.
FNameOut = 'out'
f = File.open(FNameOut, "w")
#=> #<File:out>
Assuming the two input files have the same number of lines, in an until loop read the next lines from each file, combine them in some way and then write the resulting line to the output file.
until f1.eof
  line11 = f1.gets.chomp
  line12 = f1.gets.chomp
  line21 = f2.gets.chomp
  line22 = f2.gets.chomp
  f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
end
See IO#eof, IO#gets and IO#puts.
Lastly, use IO#close to close the files.
f1.close
f2.close
f.close
Let's see what FNameOut looks like.
puts File.read(FNameOut)
cow 12, pig 34
goat 56, hen 78
We can have Ruby close the files by using a block for each File::open:
File.open(FNameIn1) do |f1|
  File.open(FNameIn2) do |f2|
    File.open(FNameOut, "w") do |f|
      until f1.eof
        line11 = f1.gets.chomp
        line12 = f1.gets.chomp
        line21 = f2.gets.chomp
        line22 = f2.gets.chomp
        f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
      end
    end
  end
end
puts File.read FNameOut
cow 12, pig 34
goat 56, hen 78
This is in fact how it's normally done in Ruby, in part to avoid the possibility of forgetting to close files.
Here's another way, using IO::foreach, which, without a block, returns an enumerator, allowing the use of Enumerable#each_slice, as referenced in the question.
e1 = File.foreach(FNameIn1).each_slice(2)
#=> #<Enumerator: #<Enumerator: File:foreach("in1")>:each_slice(2)>
e2 = File.foreach(FNameIn2).each_slice(2)
#=> #<Enumerator: #<Enumerator: File:foreach("in2")>:each_slice(2)>
File.open(FNameOut, "w") do |f|
  loop do
    line11, line12 = e1.next.map(&:chomp)
    line21, line22 = e2.next.map(&:chomp)
    f.puts "%s %s, %s %s" % [line11, line21, line12, line22]
  end
end
puts File.read(FNameOut)
cow 12, pig 34
goat 56, hen 78
We may observe the values generated by the enumerator
e1 = File.foreach(FNameIn1).each_slice(2)
by repeatedly executing Enumerator#next:
e1.next
#=> ["cow\n", "pig\n"]
e1.next
#=> ["goat\n", "hen\n"]
e1.next
#=> StopIteration (iteration reached an end)
The StopIteration exception, when raised, is handled by Kernel#loop by breaking out of the loop (which is one reason why loop is so useful).
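A tiny illustration of that behaviour, independent of files:
e = [1, 2].each
loop { puts e.next } # prints 1 and 2; the third next raises StopIteration, which loop rescues, ending cleanly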
Ruby I/O programming about an unless loop
unless is the negative equivalent of if. It does not loop.
The keyword you were looking for, presumably, is until, which is the inverse of while and so does indeed perform a loop.
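To see the difference side by side (a sketch; data.txt is a hypothetical file):
io = File.open("data.txt")
puts "file has content" unless io.eof? # evaluated once, never loops
puts io.gets until io.eof? # re-evaluated before each pass, so it loops
io.close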
What is the most performant way of processing this large text file?
You need to run a benchmark test, using Ruby's built-in Benchmark, to figure out which choice is fastest for you.
However, from experience I've found that "slurping" the file, i.e., reading it all in at once, is not any faster than using a loop with IO.foreach or File.foreach. This is because Ruby and the underlying OS buffer the file as reads occur, letting your loop work from memory rather than directly from disk. foreach will not strip the line terminators for you, like split would, so you'll need to add a chomp or chomp! if you want to mutate the line read in:
File.foreach('/path/to/file') do |li|
  puts li.chomp
end
or
File.foreach('/path/to/file') do |li|
  li.chomp!
  puts li
end
Also, slurping has the problem of not being scalable: you could end up trying to read a file bigger than memory, bringing your machine to its knees, while reading line by line will never do that.
Here's some performance numbers:
#!/usr/bin/env ruby

require 'benchmark'
require 'fileutils'

FILENAME = 'test.txt'
LOOPS = 1

puts "Ruby Version: #{RUBY_VERSION}"
puts "Filesize being read: #{File.size(FILENAME)}"
puts "Lines in file: #{`wc -l #{FILENAME}`.split.first}"

Benchmark.bm(20) do |x|
  x.report('read.split')           { LOOPS.times { File.read(FILENAME).split("\n") } }
  x.report('read.lines.chomp')     { LOOPS.times { File.read(FILENAME).lines.map(&:chomp) } }
  x.report('readlines.map.chomp1') { LOOPS.times { File.readlines(FILENAME).map(&:chomp) } }
  x.report('readlines.map.chomp2') { LOOPS.times { File.readlines(FILENAME).map { |s| s.chomp } } }
  x.report('foreach.map.chomp1')   { LOOPS.times { File.foreach(FILENAME).map(&:chomp) } }
  x.report('foreach.map.chomp2')   { LOOPS.times { File.foreach(FILENAME).map { |s| s.chomp } } }
end
And the results:
Ruby Version: 1.9.3
Filesize being read: 42026131
Lines in file: 465440
user system total real
read.split 0.150000 0.060000 0.210000 ( 0.213365)
read.lines.chomp 0.470000 0.070000 0.540000 ( 0.541266)
readlines.map.chomp1 0.450000 0.090000 0.540000 ( 0.535465)
readlines.map.chomp2 0.550000 0.060000 0.610000 ( 0.616674)
foreach.map.chomp1 0.580000 0.060000 0.640000 ( 0.641563)
foreach.map.chomp2 0.620000 0.050000 0.670000 ( 0.662912)
On today's machines a 42 MB file can be read into RAM pretty safely, but I have seen files a lot bigger than that which won't fit into the memory of some of our production hosts. While foreach is slower, it also won't bring a machine to its knees by sucking up all the available memory.
On Ruby 1.9.3, using the map(&:chomp) form, instead of the older map { |s| s.chomp }, is a lot faster. That wasn't true with older versions of Ruby, so caveat emptor.
Also, note that all of the above processed the data in less than one second on my several-years-old Mac Pro. All in all, I'd say that worrying about the load speed is premature optimization; the real problem will be what is done after the data is loaded.