Why Is "Slurping" a File Not a Good Practice

Why is slurping a file not a good practice?

Again and again we see questions about reading a text file to process it line by line that use variations of read or readlines, which pull the entire file into memory in one action.

The documentation for read says:

Opens the file, optionally seeks to the given offset, then returns length bytes (defaulting to the rest of the file). [...]

The documentation for readlines says:

Reads the entire file specified by name as individual lines, and returns those lines in an array. [...]

Pulling in a small file is no big deal, but there comes a point where memory has to be shuffled around as the incoming data's buffer grows, and that eats CPU time. In addition, if the data consumes too much space, the OS has to get involved just to keep the script running and starts spooling to disk, which will bring a program to its knees. On an HTTPd (web host) or anything else needing fast response, it'll cripple the entire application.

Slurping is usually based on a misunderstanding of the speed of file I/O, or on thinking that it's better to read the whole file and then split the buffer than it is to read it a single line at a time.
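In code, the two patterns being compared typically look something like this ("data.txt" is just a placeholder name):

# Slurp: pull the whole file into one string, then split it into lines in memory.
File.read("data.txt").split("\n").each do |line|
  # process line
end

# Read line by line: only one line is held in memory at a time.
File.foreach("data.txt") do |line|
  # process line
end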

Here's some test code to demonstrate the problem caused by "slurping".

Save this as "test.sh":

echo Building test files...

yes "abcdefghijklmnopqrstuvwxyz 123456890" | head -c 1000 > kb.txt
yes "abcdefghijklmnopqrstuvwxyz 123456890" | head -c 1000000 > mb.txt
yes "abcdefghijklmnopqrstuvwxyz 123456890" | head -c 1000000000 > gb1.txt
cat gb1.txt gb1.txt > gb2.txt
cat gb1.txt gb2.txt > gb3.txt

echo Testing...

ruby -v

echo
for i in kb.txt mb.txt gb1.txt gb2.txt gb3.txt
do
  echo
  echo "Running: time ruby readlines.rb $i"
  time ruby readlines.rb $i
  echo '---------------------------------------'
  echo "Running: time ruby foreach.rb $i"
  time ruby foreach.rb $i
  echo
done

rm [km]b.txt gb[123].txt

It creates five files of increasing sizes. 1K files are easily processed, and are very common. It used to be that 1MB files were considered big, but they're common now. 1GB is common in my environment, and files beyond 10GB are encountered periodically, so knowing what happens at 1GB and beyond is very important.

Save this as "readlines.rb". It doesn't do anything but read the entire file line-by-line internally, and append it to an array that is then returned, and seems like it'd be fast since it's all written in C:

lines = File.readlines(ARGV.shift).size
puts "#{ lines } lines read"

Save this as "foreach.rb":

lines = 0
File.foreach(ARGV.shift) { |l| lines += 1 }
puts "#{ lines } lines read"

Running sh ./test.sh on my laptop I get:

Building test files...
Testing...
ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-darwin13.0]

Reading the 1K file:

Running: time ruby readlines.rb kb.txt
28 lines read

real 0m0.998s
user 0m0.386s
sys 0m0.594s
---------------------------------------
Running: time ruby foreach.rb kb.txt
28 lines read

real 0m1.019s
user 0m0.395s
sys 0m0.616s

Reading the 1MB file:

Running: time ruby readlines.rb mb.txt
27028 lines read

real 0m1.021s
user 0m0.398s
sys 0m0.611s
---------------------------------------
Running: time ruby foreach.rb mb.txt
27028 lines read

real 0m0.990s
user 0m0.391s
sys 0m0.591s

Reading the 1GB file:

Running: time ruby readlines.rb gb1.txt
27027028 lines read

real 0m19.407s
user 0m17.134s
sys 0m2.262s
---------------------------------------
Running: time ruby foreach.rb gb1.txt
27027028 lines read

real 0m10.378s
user 0m9.472s
sys 0m0.898s

Reading the 2GB file:

Running: time ruby readlines.rb gb2.txt
54054055 lines read

real 0m58.904s
user 0m54.718s
sys 0m4.029s
---------------------------------------
Running: time ruby foreach.rb gb2.txt
54054055 lines read

real 0m19.992s
user 0m18.765s
sys 0m1.194s

Reading the 3GB file:

Running: time ruby readlines.rb gb3.txt
81081082 lines read

real 2m7.260s
user 1m57.410s
sys 0m7.007s
---------------------------------------
Running: time ruby foreach.rb gb3.txt
81081082 lines read

real 0m33.116s
user 0m30.790s
sys 0m2.134s

Notice how readlines slows down faster than linearly as the file size increases, while foreach slows linearly. At 1GB, we can clearly see something affecting the "slurping" I/O that doesn't affect reading line-by-line. And, because 1GB files are increasingly common these days, it's easy to see they'll slow the processing of files over the lifetime of a program if we don't think ahead. A couple of seconds here or there isn't much when it happens once, but if it happens multiple times a minute it adds up to a serious performance impact by the end of a year.

I ran into this problem years ago when processing large data files. The Perl code I was using would periodically stop as it reallocated memory while loading the file. Rewriting the code to read and process the data file line-by-line instead of slurping it gave a huge speed improvement, cutting the run time from over five minutes to less than one, and taught me a big lesson.

"slurping" a file is sometimes useful, especially if you have to do something across line boundaries, however, it's worth spending some time thinking about alternate ways of reading a file if you have to do that. For instance, consider maintaining a small buffer built from the last "n" lines and scan it. That will avoid memory management issues caused by trying to read and hold the entire file. This is discussed in a Perl-related blog "Perl Slurp-Eaze" which covers the "whens" and "whys" to justify using full file-reads, and applies well to Ruby.

For other excellent reasons not to "slurp" your files, read "How to search file text for a pattern and replace it with a given value".

What are all the common ways to read a file in Ruby?

File.open("my/file/path", "r") do |f|
f.each_line do |line|
puts line
end
end
# File is closed automatically at end of block

It is also possible to open the file without a block and close it explicitly afterwards (passing a block to open, as above, closes it for you):

f = File.open("my/file/path", "r")
f.each_line do |line|
  puts line
end
f.close

I don't understand why this step is necessary to count the number of chars in a text document

When you do this

lines = File.readlines("new text document.txt")

you have an array of strings, i.e.:

lines #=> [
  "The surgeon lead over ...\n", # <- There's a newline at the end of each string
  "The medical gentleman ...\n",
]

There are as many entries in the array as there are lines in your text file. That's why you count the number of lines by doing:

lines_count = lines.size

When you call lines.join, you are essentially concatenating all the strings together, one after another:

text = lines.join
text # => "The surgeon ... dress the infant"

And to calculate the number of characters in the string, you just call length on it.
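For example, continuing with the lines_count and text values from above, a minimal sketch of the final counts might look like:

chars_count = text.length
puts "#{lines_count} lines, #{chars_count} characters"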

The reason they look similar on the console is that they get represented in an identical way when printed. To highlight the difference, you can call inspect on each of them:

puts lines.inspect
puts text.inspect

Rails - file.readlines vs file.gets

What do the docs say? (Note that File is a subclass of IO. The methods #readlines and #gets are defined on IO.)

IO#readlines:

Reads all of the lines […], and returns them in an array.

IO#gets:

Reads the next “line” from the I/O stream.

Thus, I expect the latter to be better in terms of memory usage as it doesn't load the entire file into memory.
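For example, a minimal sketch of reading line by line with gets ("large_file.txt" is just a placeholder name):

File.open("large_file.txt", "r") do |f|
  while (line = f.gets)   # gets returns nil at end of file
    puts line             # only the current line is held in memory
  end
end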

Is this good practice? Storing objects accessed by a constant...

I recommend against using constants to store changing data. Using a constant not only tells the interpreter that your data won't change, it also tells other programmers who read your code. If you need somewhere to store data over time, use a @@class_variable, an @instance_variable, or a $global instead.

This is not so much an OOP convention as a Ruby convention.
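For instance, here's a minimal sketch of the difference (the Cache class and its names are made up for illustration):

class Cache
  DEFAULTS = { ttl: 60 }.freeze   # a constant: fixed configuration that never changes

  def initialize
    @entries = {}                 # an instance variable: state that changes over time
  end

  def store(key, value)
    @entries[key] = value
  end
end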

is it good programming practice to put code statements in braces?

It's always a good idea to strive for simplicity wherever possible, and to that end it's best to state things in a straightforward manner. Declarations like that make it hard to determine where variables originate, because they're embedded rather thoroughly inside the statement.

Declaring scoped variables within brackets is usually considered acceptable:

if (found = MyModel.find_by_pigeon_id(params[:pigeon_id]))
  # Variable 'found' used only within this block
end

# Ruby variables will persist here, but in many languages they are out of scope

A more verbose version actually carries different implications:

found = MyModel.find_by_pigeon_id(params[:pigeon_id])
if (found)
  # Variable 'found' can be used here
end

# Implies 'found' may be used here for whatever reason

It's always nice to be able to scan up through the program and see, quite clearly, all the variables as they're declared. Hiding things serves no purpose other than to frustrate people.

Ruby is a lot more relaxed than many other languages in terms of how much you can get away with. Some languages will punish you severely for complicating things, because a tiny mistake in declaration or casting can have enormous ramifications. That doesn't mean you should take full advantage of Ruby's leniency at every opportunity.

Here's how I'd advocate implementing your first example:

# Ensure that @select_file is defined
@select_file ||= Qt::FileDialog.new

@pushButton.connect(SIGNAL(:clicked)) do
  # Block is split out into multiple lines for clarity
  @select_file.show
end

The second:

# Simple declaration, variable name inherited from class name, not truncated
timer = Qt::Timer.new

timer.connect(SIGNAL(:timeout)) do
  # Long parameter list is broken out into separate lines to make it clear
  # what the ordering is. Useful for identifying accidentally missing parameters.
  @label.text = Qt::Application.translate(
    "MainWindow",
    "The time right now is #{Time.now}",
    nil,
    Qt::Application::UnicodeUTF8
  )
end

timer.start(1000)

I've found that the most complicated programs often look the simplest, as they're written by people with lots of experience who know how to express things in a straightforward manner.

Interestingly enough, some of the simplest programs often look the most complicated, as they're written by novices who are either grandstanding and showing off or are digging themselves into a deep ditch and keep throwing code at the problem in the hopes of fixing it.

What's the best way to search for a string in a file?

File.open(filename).grep(/string/)

This loads the whole file into memory (slurps the file). You should avoid file slurping when dealing with large files. That means loading one line at a time, instead of the whole file.

File.foreach(filename).grep(/string/)

It's good practice to clean up after yourself rather than letting the garbage collector handle it at some point. This is more important if your program is long-lived and not just some quick script. Using a code block ensures that the File object is closed when the block terminates.

File.open(filename) do |file|
  file.grep(/string/)
end
# The file is closed automatically when the block exits.

How to read a whole file in Ruby?

IO.read("filename")

or

File.read("filename")

