Lazy Method for Reading Big File in Python?
To write a lazy function, just use yield:
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
Another option would be to use iter and a helper function:
f = open('really_big_file.dat')

def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)
If the file is line-based, the file object is already a lazy generator of lines:
for line in open('really_big_file.dat'):
    process_data(line)
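One caveat worth adding to the answers above: the sentinel passed to iter must match the type that read() returns. In text mode that is '', but in binary mode read() returns bytes, so the sentinel must be b''. A minimal sketch, using a small demo file of my own as a stand-in for a really big one:

```python
from functools import partial

# Create a small demo file (a stand-in for a really big one)
with open('demo.dat', 'wb') as f:
    f.write(b'x' * 2500)

chunk_sizes = []
with open('demo.dat', 'rb') as f:
    # In binary mode, read() returns bytes, so the sentinel must be b'', not ''
    for piece in iter(partial(f.read, 1024), b''):
        chunk_sizes.append(len(piece))

print(chunk_sizes)  # two full 1024-byte chunks plus a 452-byte remainder
```

With the wrong sentinel ('' in binary mode) the loop would never terminate, because read() at EOF returns b'', which never equals ''.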
using a python generator to process large text files
Instead of playing with offsets in the file, build and yield lists of 10000 lines from a loop:
def read_large_file(file_handler, block_size=10000):
    block = []
    for line in file_handler:
        block.append(line)
        if len(block) == block_size:
            yield block
            block = []
    # don't forget to yield the last block
    if block:
        yield block

with open(path) as file_handler:
    for block in read_large_file(file_handler):
        print(block)
joining big files in Python
You are reading from infile in two different places: inside read_in_chunks, and directly in the call that writes to outfile_bl. This causes you to skip writing the data just read into the variable piece, so you only copy roughly half the file.
You've already read the data into piece; just write that to your file.
with open(path + "/" + str(sys.argv[2]) + "_BL.265", "wb") as outfile_bl:
    for fname in dirs:
        with open(path + "/" + fname, 'rb') as infile:
            for piece in read_in_chunks(infile):
                outfile_bl.write(piece)
As an aside, you don't really need to define read_in_chunks at all, or at least its definition can be simplified greatly by using iter:
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    # The file is opened in binary mode, so the sentinel is b'', not ''
    yield from iter(lambda: file_object.read(chunk_size), b'')
    # Or
    # from functools import partial
    # yield from iter(partial(file_object.read, chunk_size), b'')
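As a quick sanity check (my own sketch, using a throwaway demo file), the simplified generator yields the same sequence of chunks as the original while-loop version:

```python
from functools import partial

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a binary file piece by piece."""
    yield from iter(partial(file_object.read, chunk_size), b'')

# Write a 3000-byte demo file, then read it back in chunks
with open('demo.bin', 'wb') as f:
    f.write(b'abc' * 1000)

with open('demo.bin', 'rb') as f:
    sizes = [len(c) for c in read_in_chunks(f)]
print(sizes)  # two full chunks plus the remainder
```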
How to read file chunk by chunk?
You can use the two-argument form of iter, which takes a callable and a sentinel value. It will call the callable repeatedly until it receives the sentinel value.
>>> from functools import partial
>>> with open('test.txt') as f:
...     for chunk in iter(partial(f.read, 4), ''):
...         print(repr(chunk), len(chunk))
...
'"hi ' 4
'ther' 4
'e 1,' 4
' 3, ' 4
'4, 5' 4
'"\n\n' 3
How to efficiently read the last line of very big gzipped log file?
If you have no control over the generation of the gzip file, then there is no way to read the last line of the uncompressed data without decoding all of the lines. The time it takes will be O(n), where n is the size of the file. There is no way to make it O(1).
If you do have control over the compression end, then you can create a gzip file that facilitates random access, and you can also keep track of random-access entry points to enable jumping to the end of the file.
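When you don't control the compression, the O(n) scan is as good as it gets. A minimal sketch of that scan, using a small gzipped demo log of my own: stream-decompress line by line and keep only the most recent line, so memory stays constant even though the whole file is decoded.

```python
import gzip

# Build a small gzipped demo log (hypothetical file name)
with gzip.open('demo.log.gz', 'wt') as f:
    for i in range(1000):
        f.write(f'line {i}\n')

# O(n) scan: stream-decompress, keeping only the most recent line
last = None
with gzip.open('demo.log.gz', 'rt') as f:
    for line in f:
        last = line
print(last.rstrip())  # the final line of the log
```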