Faster Way to Find Large Files with Python

Fastest way to process a large file?

It sounds like your code is I/O-bound. That means multiprocessing isn't going to help: if you spend 90% of your time reading from disk, having seven extra processes waiting on the next read won't help anything.

And, while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.

Still, it's worth checking that you really are I/O-bound instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% of a core. Do what Amadan suggested in a comment and run your program with just pass for the processing, and see whether that cuts off 5% of the time or 70%. You may even want to compare against a loop over os.open and os.read(1024*1024) or something and see if that's any faster.
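For example, a minimal sketch of that raw-read baseline might look like this (raw_read_time and the 1 MB buffer are just illustrative; tune bufsize and compare the result against your full run):

import os
import time

def raw_read_time(path, bufsize=1024 * 1024):
    # Time a bare os.open/os.read loop: roughly the floor for pure I/O,
    # with no line splitting or per-line processing at all.
    start = time.time()
    fd = os.open(path, os.O_RDONLY)
    try:
        while os.read(fd, bufsize):
            pass
    finally:
        os.close(fd)
    return time.time() - start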


Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) for some large bufsize. (You can try different numbers and measure to see where the peak is. In my experience, anything from 64KB to 8MB is usually about the same, but depending on your system that may be different, especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency that swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)

So, for example:

bufsize = 65536
with open(path) as infile:
    while True:
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)

Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be better, depending on your system. For example:

import mmap

with open(path) as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)

A Python mmap is sort of a weird object: it acts like a str and like a file at the same time, so you can, e.g., manually iterate scanning for newlines, or you can call readline on it as if it were a file. Both of those take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python, although maybe you can get around that with re, or with a simple Cython extension), but the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
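For instance, a rough sketch of scanning the mapping for newlines by hand, reusing m from the snippet above (process stands in for whatever you do per line; on Python 3 you'd search for b'\n'):

pos = 0
while True:
    nl = m.find('\n', pos)
    if nl == -1:
        # last line may have no trailing newline
        if pos < len(m):
            process(m[pos:])
        break
    process(m[pos:nl + 1])
    pos = nl + 1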

Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of making the kernel guess, or forcing transparent huge pages)—but you can actually ctypes the function out of libc.
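If you can use a newer Python (3.8+), though, the mmap object itself grew a madvise method, which avoids the ctypes dance entirely. A minimal sketch, reusing path from above and guarded for platforms that define the constant:

import mmap

with open(path, 'rb') as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    # hint to the kernel that we'll read the mapping sequentially
    # (Python 3.8+ only, and only where MADV_SEQUENTIAL exists)
    if hasattr(mmap, 'MADV_SEQUENTIAL'):
        m.madvise(mmap.MADV_SEQUENTIAL)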

Fastest way to process large files in Python

Will starting multiple Python processes to run the script take advantage of the other cores?

Yes, it will, if the task is CPU-bound. This is probably the easiest option. However, don't spawn a single process per file or per directory; consider using a tool such as parallel(1) and let it spawn something like two processes per core.

Another option is to use something like Gearman or Beanstalk to farm out the work to other machines.

That might work. Also, have a look at the Python binding for ZeroMQ; it makes distributed processing pretty easy.
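As a rough sketch of that idea with pyzmq's PUSH/PULL sockets (the port, host name, example paths, and the process function are all placeholders):

import zmq

# Distributor: push file paths to whatever workers connect.
context = zmq.Context()
sender = context.socket(zmq.PUSH)
sender.bind("tcp://*:5557")  # arbitrary port
for path in ["/data/img_0001.jpg", "/data/img_0002.jpg"]:  # placeholder paths
    sender.send_string(path)

# Worker (run one or more copies, on this machine or others):
#   receiver = zmq.Context().socket(zmq.PULL)
#   receiver.connect("tcp://distributor-host:5557")
#   while True:
#       process(receiver.recv_string())  # process() is your metadata extractor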

I've taken a look at the multiprocessing library but not sure how I can utilize it.

Define a function, say process, that reads the images in a single directory, connects to the database and stores the metadata. Let it return a boolean indicating success or failure. Let directories be the list of directories to process. Then

import multiprocessing
pool = multiprocessing.Pool(multiprocessing.cpu_count())
success = all(pool.imap_unordered(process, directories))

will process all the directories in parallel. You can also do the parallelism at the file level if you want; that needs just a bit more tinkering.
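For example, a hypothetical file-level variant might flatten the directories into one list of paths first (process_file is a placeholder for a per-file version of process; chunksize keeps the inter-process overhead down when there are many small tasks):

import os

files = [os.path.join(d, name) for d in directories for name in os.listdir(d)]
success = all(pool.imap_unordered(process_file, files, chunksize=100))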

Note that this will stop at the first failure; making it fault-tolerant takes a bit more work.
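For illustration, a hypothetical process might look like the skeleton below; extract_metadata and store_in_database are placeholders for your image-reading and database code, and the try/except is what makes one bad directory return False instead of killing the worker:

import logging
import os

def process(directory):
    """Read the images in one directory and store their metadata in the database."""
    try:
        for name in os.listdir(directory):
            metadata = extract_metadata(os.path.join(directory, name))  # placeholder
            store_in_database(metadata)                                 # placeholder
        return True
    except Exception:
        logging.exception("failed to process %s", directory)
        return False

Collecting the results into a list instead of feeding them to all() also lets you see every failed directory rather than stopping at the first False.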

Fastest way to grep big files

You may also consider using memory mapping (the mmap module) like this:

import contextlib
import mmap
import os
import re

def method_5(filename):
    """Method 5: use memory mapping and regex"""
    regex = re.compile(br'\(TEMPS CP :[ ]*.*S\)')
    # map only (roughly) the last 15000 bytes; the offset passed to mmap
    # must be a multiple of mmap.ALLOCATIONGRANULARITY
    offset = max(0, os.stat(filename).st_size - 15000)
    offset -= offset % mmap.ALLOCATIONGRANULARITY
    with open(filename, 'rb') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY, offset=offset)) as txt:
            match = regex.search(txt)
            if match:
                print(match.group())

Also, a couple of side notes:

  • if you go the shell-command route, ag can in some cases be orders of magnitude faster than grep (although with only 200 lines of greppable text the difference probably vanishes compared to the overhead of starting a shell)
  • compiling your regex once at the beginning of the function may make some difference

What's the fastest way to recursively search for files in python?

Maybe not the answer you were hoping for, but I think these timings are useful. They were run on a directory tree containing 15,424 directories and 102,799 files in total (3,059 of which are .py files).

Python 3.6:

import os
import glob

def walk():
    pys = []
    for p, d, f in os.walk('.'):
        for file in f:
            if file.endswith('.py'):
                pys.append(file)
    return pys

def iglob():
    pys = []
    for file in glob.iglob('**/*', recursive=True):
        if file.endswith('.py'):
            pys.append(file)
    return pys

def iglob2():
    pys = []
    for file in glob.iglob('**/*.py', recursive=True):
        pys.append(file)
    return pys

# I also tried pathlib.Path.glob but it was slow and error prone, sadly

%timeit walk()
3.95 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit iglob()
5.01 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit iglob2()
4.36 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using GNU find (4.6.0) on Cygwin (4.6.0-1):

Edit: the timings below are on Windows; on Linux I found find to be about 25% faster.

$ time find . -name '*.py' > /dev/null

real 0m8.827s
user 0m1.482s
sys 0m7.284s

It seems like os.walk is about as good as you can get on Windows.

Fastest Way to Read Large Binary Files (More than 500 MB)?

You've got a few different options. Your main problem is that, with the small size of your chunks (6 bytes), there's a lot of overhead spent in fetching the chunk and garbage collecting.

There are two main ways to get around that, plus a third that combines them:

  1. Load the entire file into memory, THEN separate it into chunks. This is the fastest method, but the larger your file is, the more likely you are to start running into MemoryErrors.

  2. Load one chunk at a time into memory, process it, then move on to the next chunk. This is no faster overall, but saves time up front since you don't need to wait for the entire file to be loaded to start processing.

  3. Experiment with combinations of 1. and 2. (buffering the file in large chunks and separating it into smaller chunks, loading the file in multiples of your chunk size, etc.). This is left as an exercise for the reader, as it takes a fair amount of experimentation to reach code that works quickly and correctly; a rough sketch of the idea follows the timing code below.

Some code, with time comparisons:

import timeit

def read_original(filename):
    with open(filename, "rb") as infile:
        data_arr = []
        while True:
            data = infile.read(6)
            if not data:
                break
            data_arr.append(data)
    return data_arr

# the bigger the file, the more likely this is to cause python to crash
def read_better(filename):
    with open(filename, "rb") as infile:
        # read everything into memory at once
        data = infile.read()
    # separate string into 6-byte chunks
    data_arr = [data[i:i+6] for i in range(0, len(data), 6)]
    return data_arr

# no faster than the original, but allows you to work on each piece without loading the whole into memory
def read_iter(filename):
    with open(filename, "rb") as infile:
        data = infile.read(6)
        while data:
            yield data
            data = infile.read(6)

def main():
    # 93.8688215 s
    tm = timeit.timeit(stmt="read_original('test/oraociei12.dll')", setup="from __main__ import read_original", number=10)
    print(tm)
    # 85.69337399999999 s
    tm = timeit.timeit(stmt="read_better('test/oraociei12.dll')", setup="from __main__ import read_better", number=10)
    print(tm)
    # 103.0508528 s
    tm = timeit.timeit(stmt="[x for x in read_iter('test/oraociei12.dll')]", setup="from __main__ import read_iter", number=10)
    print(tm)

if __name__ == '__main__':
    main()
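And a rough sketch of option 3, reading the file in large buffers and slicing each buffer into 6-byte chunks (read_chunked is illustrative, and the buffer size is an arbitrary multiple of 6 to experiment with):

def read_chunked(filename, bufsize=6 * 1024 * 128):
    # bufsize is a multiple of 6, so chunks never straddle two buffers
    with open(filename, "rb") as infile:
        while True:
            buf = infile.read(bufsize)
            if not buf:
                break
            for i in range(0, len(buf), 6):
                yield buf[i:i + 6]

Consuming it looks the same as read_iter: for chunk in read_chunked(filename): ...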

Fastest way to read a large binary file with Python

You can use array.array('d')'s fromfile method:

import array
import os

def ReadBinary():
    fileName = r'C:\File_Data\LargeDataFile.bin'

    fileContent = array.array('d')
    with open(fileName, mode='rb') as file:
        # fromfile needs an item count; read as many 8-byte doubles as the file holds
        n_items = os.path.getsize(fileName) // fileContent.itemsize
        fileContent.fromfile(file, n_items)
    return fileContent

That's a C-level read as raw machine values. mmap.mmap could also work by creating a memoryview of the mmap object and casting it.
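For example, a hedged sketch of the mmap route, assuming the file contains nothing but raw 8-byte doubles (the path is the same placeholder as above):

import contextlib
import mmap

def ReadBinaryMmap(fileName=r'C:\File_Data\LargeDataFile.bin'):
    with open(fileName, mode='rb') as file:
        with contextlib.closing(mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)) as m:
            # view the mapped bytes as C doubles without copying
            # (requires the file size to be a multiple of 8)
            view = memoryview(m).cast('d')
            try:
                return view.tolist()  # tolist() copies the values out
            finally:
                view.release()        # drop the export so the mapping can close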


