Lazy Method for Reading a Big File in Python

Lazy Method for Reading Big File in Python?

To write a lazy function, just use yield:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)

Another option would be to use iter and a helper function:

f = open('really_big_file.dat')
def read1k():
    return f.read(1024)

# In text mode, read() returns '' at end of file, so '' is the sentinel for iter().
for piece in iter(read1k, ''):
    process_data(piece)

If the file is line-based, the file object is already a lazy generator of lines:

for line in open('really_big_file.dat'):
    process_data(line)
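The same loop can be wrapped in a with block so the file is closed deterministically; this is a minor variant of the snippet above, not a different technique:

with open('really_big_file.dat') as f:
    for line in f:  # the file object yields one line at a time, lazily
        process_data(line)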

Using a Python generator to process large text files

Instead of playing with offsets in the file, try to build and yield lists of 10000 elements from a loop:

def read_large_file(file_handler, block_size=10000):
    block = []
    for line in file_handler:
        block.append(line)
        if len(block) == block_size:
            yield block
            block = []

    # don't forget to yield the last block
    if block:
        yield block

with open(path) as file_handler:
    for block in read_large_file(file_handler):
        print(block)

Joining big files in Python

You are reading from infile in two different places: inside read_in_chunks, and directly in the call to outfile_bl.write. This causes you to skip writing the data that was just read into the variable piece, so you only copy roughly half the file.

You've already read data into piece; just write that to your file.

with open(path + "/" + str(sys.argv[2]) + "_BL.265", "wb") as outfile_bl:
    for fname in dirs:
        with open(path + "/" + fname, 'rb') as infile:
            for piece in read_in_chunks(infile):
                outfile_bl.write(piece)

As an aside, you don't really need to define read_in_chunks, or at least its definition can be simplified greatly by using iter:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    # The sentinel must match the file's mode: b'' for a file opened in
    # binary mode (as infile is here), '' for a file opened in text mode.
    yield from iter(lambda: file_object.read(chunk_size), b'')

    # Or
    # from functools import partial
    # yield from iter(partial(file_object.read, chunk_size), b'')

How to read file chunk by chunk?

You can use the two-argument form of iter, which takes a callable and a sentinel value. It will call the callable repeatedly until it receives the sentinel value.

>>> from functools import partial
>>> with open('test.txt') as f:
...     for chunk in iter(partial(f.read, 4), ''):
...         print(repr(chunk), len(chunk))
...
'"hi ' 4
'ther' 4
'e 1,' 4
' 3, ' 4
'4, 5' 4
'"\n\n' 3

How to efficiently read the last line of very big gzipped log file?

If you have no control over the generation of the gzip file, then there is no way to read the last line of the uncompressed data without decoding all of the lines. The time it takes will be O(n), where n is the size of the file. There is no way to make it O(1).

If you do have control over the compression end, then you can create a gzip file that facilitates random access, and you can also keep track of random-access entry points to enable jumping to the end of the file.
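A minimal sketch of the O(n) approach described above, using the standard gzip module to stream through the decompressed data line by line and keep only the most recent line (last_line is a hypothetical helper name, not from the original question):

import gzip

def last_line(path):
    """Return the last line of a gzipped text file by scanning the whole file (O(n))."""
    last = None
    with gzip.open(path, 'rt') as f:  # 'rt' decompresses and decodes to text lazily
        for line in f:
            last = line
    return last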


