Python Readlines() Usage and Efficient Practice for Reading

The short version is: The efficient way to use readlines() is to not use it. Ever.


I read some doc notes on readlines(), where people have claimed that readlines() reads the whole file content into memory and hence generally consumes more memory compared to readline() or read().

The documentation for readlines() explicitly guarantees that it reads the whole file into memory, parses it into lines, and builds a list of strings out of those lines.

But the documentation for read() likewise guarantees that it reads the whole file into memory, and builds a string, so that doesn't help.


On top of using more memory, this also means you can't do any work until the whole thing is read. If you alternate reading and processing in even the most naive way, you will benefit from at least some pipelining (thanks to the OS disk cache, DMA, CPU pipeline, etc.), so you will be working on one batch while the next batch is being read. But if you force the computer to read the whole file in, then parse the whole file, then run your code, you only get one region of overlapping work for the entire file, instead of one region of overlapping work per read.


You can work around this in three ways:

  1. Write a loop around readlines(sizehint), read(size), or readline().
  2. Just use the file as a lazy iterator without calling any of these.
  3. mmap the file, which allows you to treat it as a giant string without first reading it in (sketched after the examples below).

For example, this has to read all of foo at once:

with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass

But this only reads about 8K at a time:

with open('foo') as f:
    while True:
        lines = f.readlines(8192)
        if not lines:
            break
        for line in lines:
            pass

And this only reads one line at a time—although Python is allowed to (and will) pick a nice buffer size to make things faster.

with open('foo') as f:
    while True:
        line = f.readline()
        if not line:
            break
        pass

And this will do the exact same thing as the previous:

with open('foo') as f:
    for line in f:
        pass
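
And as a sketch of option 3 from the list above: mmap lets you treat the file as one big bytes-like object that the OS pages in lazily, so you can scan it line by line (or run a regex over it) without an up-front read. This assumes a non-empty file named 'foo', opened in binary mode, since the mapping works on bytes:

import mmap

with open('foo', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # mm behaves like one giant bytes object; readline() walks it lazily
        for line in iter(mm.readline, b''):
            pass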

Meanwhile:

But shouldn't the garbage collector automatically clear that loaded content from memory at the end of my loop, so that at any instant my memory holds only the contents of the file I'm currently processing?

Python doesn't make any such guarantees about garbage collection.

The CPython implementation happens to use refcounting for GC, which means that in your code, as soon as file_content gets rebound or goes away, the giant list of strings, and all of the strings within it, will be freed to the freelist, meaning the same memory can be reused again for your next pass.

However, all those allocations, copies, and deallocations aren't free—it's much faster to not do them than to do them.

On top of that, having your strings scattered across a large swath of memory instead of reusing the same small chunk of memory over and over hurts your cache behavior.

Plus, while the memory usage may be constant (or, rather, linear in the size of your largest file, rather than in the sum of your file sizes), that rush of mallocs to expand it the first time will be one of the slowest things you do (which also makes it much harder to do performance comparisons).


Putting it all together, here's how I'd write your program:

import os
import gzip

for filename in os.listdir(input_dir):
    with open(os.path.join(input_dir, filename), 'rb') as f:
        if filename.endswith(".gz"):
            # gzip.GzipFile wraps an already-open file object
            f = gzip.GzipFile(fileobj=f)
        words = (line.split(delimiter) for line in f)
        ... my logic ...

Or, maybe:

import contextlib

for filename in os.listdir(input_dir):
    path = os.path.join(input_dir, filename)
    if filename.endswith(".gz"):
        f = gzip.open(path, 'rb')
    else:
        f = open(path, 'rb')
    with contextlib.closing(f):
        words = (line.split(delimiter) for line in f)
        ... my logic ...

When should I ever use file.read() or file.readlines()?

The short answer to your question is that each of these three methods of reading bits of a file has different use cases. As noted above, f.read() reads the file as an individual string, and so allows relatively easy file-wide manipulations, such as a file-wide regex search or substitution.
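
For instance, here's a minimal sketch of a file-wide substitution; the filename and pattern are made up for illustration:

import re

with open('notes.txt') as f:   # hypothetical file
    text = f.read()            # the whole file as one string

# collapse runs of three or more newlines into a single blank line, file-wide
cleaned = re.sub(r'\n{3,}', '\n\n', text)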

f.readline() reads a single line of the file, allowing the user to parse a single line without necessarily reading the entire file. Using f.readline() also makes it easier to apply custom logic while reading than a straight line-by-line iteration does, for example when a file changes format partway through.
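
For instance, a minimal sketch that reads a header line with readline() and then iterates over the remaining records; the filename and comma-separated layout are assumptions for illustration:

with open('data.csv') as f:                           # hypothetical file
    header = f.readline().rstrip('\n').split(',')     # first line: column names
    for line in f:                                     # remaining lines: records
        fields = line.rstrip('\n').split(',')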

Using the syntax for line in f: allows the user to iterate over the file line by line as noted in the question.

As noted in the other answer, this documentation is a very good read:

https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects

Note:
It was previously claimed that f.readline() could be used to skip a line during a for loop iteration. However, this doesn't work in Python 2.7, and is perhaps a questionable practice, so this claim has been removed.

Reading multiple opened files using readlines() results in empty array

This has nothing to do with opening multiple files.

When you open a file in append mode, you're initially positioned at the end of the file, so there's nothing to read. You need to seek to the beginning to read the contents.

with open('in.txt', 'r') as i, open('out.txt', 'a+') as o:
    in_data = i.readlines()
    o.seek(0)
    out_data = o.readlines()

print(in_data)
print(out_data)

Does it take RAM to save a readlines array?

If I understood your issue correctly, you just want to combine (i.e. concatenate) files.

If memory is an issue normally for line in f is the way to go.

I tried benchmarking using a 1.9GB csv file. One possible alternative is to read in large chunks of the data which fit in memory.

Code:

# read in large chunks - fastest in my test
chunksize = 2**16
with open(fn, 'r') as f:
    chunk = f.read(chunksize)
    while chunk:
        chunk = f.read(chunksize)
# 1 loop, best of 3: 4.48 s per loop

# read whole file in one go - slowest in my test
with open(fn, 'r') as f:
    chunk = f.read()
# 1 loop, best of 3: 11.7 s per loop

# read file using iterator over each line - most practical for most cases
with open(fn, 'r') as f:
    for line in f:
        s = line
# 1 loop, best of 3: 6.74 s per loop

Knowing this you could write something like:

with open(outputfile, 'w') as fo:
    for inputfile in inputfiles:  # assuming inputfiles is a list of filepaths
        with open(inputfile, 'r') as fi:
            for chunk in iter(lambda: fi.read(chunksize), ''):
                fo.write(chunk)  # write the chunk itself, not another read
            fo.write('\n')  # newline between each file (might not be necessary)

Differences between file.read(), file.readline() and iterating over the file object

read(x) will read up to x bytes from the file. If you don't supply a size, the entire file is read.

readline(x) will read up to x bytes or a newline, whichever comes first. If you don't supply a size, it will read all data until it hits a newline.

When using for line in f, it will call the next() method under the hood, which really just does something very similar to readline (although I see references that it may do some buffering more efficiently, since iterating usually means you are planning to read the entire file).

There is also readlines() which reads all lines into memory.
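
A rough side-by-side sketch of those access patterns, assuming a text file named 'foo':

with open('foo') as f:
    first_16 = f.read(16)           # up to 16 characters from the current position
    rest_of_line = f.readline()     # the remainder of that line ('' at EOF)
    remaining = f.readlines()       # everything left, as a list of lines

with open('foo') as f:
    for line in f:                  # lazy, line-at-a-time iteration
        pass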

Why is readlines() reading much more than the sizehint?

The buffer the readlines documentation mentions isn't related to the buffering that the third argument of the open call controls. The buffer is this buffer in file_readlines:

static PyObject *
file_readlines(PyFileObject *f, PyObject *args)
{
    long sizehint = 0;
    PyObject *list = NULL;
    PyObject *line;
    char small_buffer[SMALLCHUNK];

where SMALLCHUNK is defined earlier:

#if BUFSIZ < 8192
#define SMALLCHUNK 8192
#else
#define SMALLCHUNK BUFSIZ
#endif

BUFSIZ is the platform's standard I/O buffer size from stdio.h, but it looks like you're getting the #define SMALLCHUNK 8192 case either way. In any case, readlines will never use a buffer smaller than 8 KiB, so you should probably make your chunks bigger than that.
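
A quick way to see that rounding in action under CPython 2's file object (Python 3's io layer honours the hint more closely); this assumes 'foo' is a text file larger than 8 KiB:

with open('foo') as f:
    lines = f.readlines(100)                     # ask for roughly 100 bytes
    print(sum(len(line) for line in lines))      # comes back as ~8192 or more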

If RAM isn't a concern, is reading line by line faster or reading everything into RAM and access it? - Python

I used cProfile on a ~1MB dictionary words file. I read the same file 3 times. The first reads the whole file in, just to level the playing field in terms of it being stored in the cache. Here is the simple code:

import codecs
import cProfile

# 'file' holds the path to the ~1MB words file mentioned above
def first_read():
    codecs.open(file, 'r', 'utf8').readlines()

def line_by_line():
    for i in codecs.open(file, 'r', 'utf8'):
        pass

def at_once():
    for i in codecs.open(file, 'r', 'utf8').readlines():
        pass

first_read()
cProfile.run('line_by_line()')
cProfile.run('at_once()')

And here are the results:

Line by line:

         366959 function calls in 1.762 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.762 1.762 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 codecs.py:322(__init__)
1 0.000 0.000 0.000 0.000 codecs.py:395(__init__)
14093 0.087 0.000 0.131 0.000 codecs.py:424(read)
57448 0.285 0.000 0.566 0.000 codecs.py:503(readline)
57448 0.444 0.000 1.010 0.000 codecs.py:612(next)
1 0.000 0.000 0.000 0.000 codecs.py:651(__init__)
57448 0.381 0.000 1.390 0.000 codecs.py:681(next)
1 0.000 0.000 0.000 0.000 codecs.py:686(__iter__)
1 0.000 0.000 0.000 0.000 codecs.py:841(open)
1 0.372 0.372 1.762 1.762 test.py:9(line_by_line)
13316 0.011 0.000 0.023 0.000 utf_8.py:15(decode)
1 0.000 0.000 0.000 0.000 {_codecs.lookup}
27385 0.027 0.000 0.027 0.000 {_codecs.utf_8_decode}
98895 0.011 0.000 0.011 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
13316 0.099 0.000 0.122 0.000 {method 'endswith' of 'unicode' objects}
27 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
14069 0.027 0.000 0.027 0.000 {method 'read' of 'file' objects}
13504 0.020 0.000 0.020 0.000 {method 'splitlines' of 'unicode' objects}
1 0.000 0.000 0.000 0.000 {open}

All at once:

         15 function calls in 0.023 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.023 0.023 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 codecs.py:322(__init__)
1 0.000 0.000 0.000 0.000 codecs.py:395(__init__)
1 0.000 0.000 0.003 0.003 codecs.py:424(read)
1 0.000 0.000 0.014 0.014 codecs.py:576(readlines)
1 0.000 0.000 0.000 0.000 codecs.py:651(__init__)
1 0.000 0.000 0.014 0.014 codecs.py:677(readlines)
1 0.000 0.000 0.000 0.000 codecs.py:841(open)
1 0.009 0.009 0.023 0.023 test.py:13(at_once)
1 0.000 0.000 0.000 0.000 {_codecs.lookup}
1 0.003 0.003 0.003 0.003 {_codecs.utf_8_decode}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.001 0.001 0.001 0.001 {method 'read' of 'file' objects}
1 0.010 0.010 0.010 0.010 {method 'splitlines' of 'unicode' objects}
1 0.000 0.000 0.000 0.000 {open}

As you can see from the results, reading the whole file in at once is much faster, but you run the risk of a MemoryError being thrown if the file is too large.

How can I read large text files line by line, without loading them into memory?

Use a for loop on a file object to read it line-by-line. Use with open(...) to let a context manager ensure that the file is closed after reading:

with open("log.txt") as infile:
for line in infile:
print(line)

