How to Release Memory Used by a Pandas Dataframe

Delete and release memory of a single pandas dataframe

From the original link you included, you have to add the variable to the list, delete the variable, and then delete the list. If you only add the dataframe to the list, deleting the list won't delete the original dataframe.

import pandas as pd
import psutil
import gc
psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
>> 68.44267845153809

df = pd.read_csv('pythonSRC/bigFile.txt',sep='|')
len(df)
>> 20082056

psutil.virtual_memory().available * 100 / psutil.virtual_memory().total

>> 56.380510330200195

lst = [df]
del lst

psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
>> 56.22601509094238

lst = [df]
del df
del lst

psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
>> 76.77617073059082

gc.collect()

>> 0

I also tried just deleting the dataframe and calling gc.collect(), with the same result!

del df
gc.collect()
psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
>> 76.59363746643066

However, adding the dataframe to a list and deleting both the list and the variable is a bit faster than calling gc.collect(). I used time.time() to measure the difference, and gc.collect() was almost a full second slower!
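
A minimal sketch of how that timing comparison could look (reusing the file from above; the exact numbers will vary by machine):

import gc
import time
import pandas as pd

df = pd.read_csv('pythonSRC/bigFile.txt', sep='|')

# Approach 1: wrap the dataframe in a list, then delete both names.
start = time.time()
lst = [df]
del df
del lst
print('list + del:', time.time() - start)

df = pd.read_csv('pythonSRC/bigFile.txt', sep='|')

# Approach 2: delete the name and force a full garbage-collection pass.
start = time.time()
del df
gc.collect()
print('del + gc.collect():', time.time() - start)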

How to clear Dataframe memory in pandas?

I had the same problem as you using https://stackoverflow.com/a/49144260/2799214
I found a solution using gc.collect() by splitting my code into different methods within a class. For example:

class A:
    def __init__(self):
        # your code

    def first_part_of_my_code(self):
        # your code
        # I want to clear my dataframe
        del my_dataframe
        gc.collect()
        my_dataframe = pd.DataFrame()  # not sure whether this line really helps
        return my_new_light_dataframe

    def second_part_of_my_code(self):
        # my code
        # same principle

So when the program calls the methods, the garbage collector clears the memory once the program leaves the method.
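
The same idea, sketched with a plain function instead of a class (the file path is reused from the first answer, and build_and_summarise is just a hypothetical name): the big dataframe is local to the call, so once the function returns and gc.collect() runs, its memory can be reclaimed.

import gc
import psutil
import pandas as pd

def build_and_summarise(path):
    big_df = pd.read_csv(path, sep='|')  # the big dataframe only exists inside this call
    summary = big_df.describe()          # keep only the small result
    del big_df
    gc.collect()
    return summary

print(psutil.virtual_memory().available * 100 / psutil.virtual_memory().total)
summary = build_and_summarise('pythonSRC/bigFile.txt')
print(psutil.virtual_memory().available * 100 / psutil.virtual_memory().total)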

How to destroy Python objects and free up memory

Now, it could be that something in the 50,000th chunk is very large and that's causing the OOM, so to test this I'd first try:

file_list_chunks = list(divide_chunks(file_list_1,20000))[30000:]

If it fails at 10,000, this will confirm that 20k is too big a chunk size; if it fails at 50,000 again, there is an issue with the code...


Okay, onto the code...

Firstly, you don't need the explicit list constructor; it's much better in Python to iterate rather than generate the entire list in memory.

file_list_chunks = list(divide_chunks(file_list_1,20000))
# becomes
file_list_chunks = divide_chunks(file_list_1,20000)
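
divide_chunks isn't shown here; a typical generator version (an assumption about its signature, matching how it's called above) would be:

def divide_chunks(lst, n):
    # Yield successive n-sized chunks lazily instead of building them all up front.
    for i in range(0, len(lst), n):
        yield lst[i:i + n]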

I think you might be misusing ThreadPool here:

Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.

This reads like close might have some things still running; although I guess this is safe, it feels a little un-pythonic. It's better to use the context manager for ThreadPool:

with ThreadPool(64) as pool:
    results = pool.map(get_image_features, f)
    # etc.

The explicit dels in Python aren't actually guaranteed to free memory.

You should collect after the join/after the with:

with ThreadPool(..) as pool:
    ...
    pool.join()
gc.collect()

You could also try chunking this into smaller pieces, e.g. 10,000 or even smaller!


Hammer 1

One thing I would consider doing here, instead of using pandas DataFrames and large lists, is to use a SQL database; you can do this locally with sqlite3:

import sqlite3
conn = sqlite3.connect(':memory:', check_same_thread=False) # or, use a file e.g. 'image-features.db'

and use a context manager:

with conn:
    conn.execute('''CREATE TABLE images
                    (filename text, features text)''')

with conn:
    # Insert a row of data
    conn.execute("INSERT INTO images VALUES ('my-image.png','feature1,feature2')")

That way, we won't have to handle the large list objects or DataFrame.

You can pass the connection to each of the threads... you might have to do something a little weird like:

results = pool.map(get_image_features, zip(itertools.repeat(conn), f))
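
A sketch of what the worker could look like with that calling convention (it unpacks each (conn, filename) tuple; extract_features is a hypothetical stand-in for whatever feature code you already have):

def get_image_features(args):
    conn, filename = args  # each item from pool.map is a (connection, filename) tuple
    features = extract_features(filename)  # hypothetical: your existing feature extraction
    with conn:
        # Parameterised insert; note that all threads share one connection here,
        # so for heavy write loads you may want a lock around this block.
        conn.execute("INSERT INTO images VALUES (?, ?)",
                     (filename, ','.join(map(str, features))))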

Then, after the calculation is complete, you can select all from the database into whichever format you like, e.g. using pd.read_sql.
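
For example, pulling everything back into a DataFrame at the end could look like this (a sketch reusing the conn and table from above):

import pandas as pd

# read_sql accepts a DBAPI connection for sqlite3, so this works with conn directly
features_df = pd.read_sql('SELECT filename, features FROM images', conn)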


Hammer 2

Use a subprocess here: rather than running this in the same instance of Python, "shell out" to another.

Since you can pass start and end to Python as sys.argv arguments, you can slice these:

# main.py
import subprocess

# a for loop to iterate over this
subprocess.check_call(["python", "chunk.py", "0", "20000"])

# chunk.py a b
import sys

for count, f in enumerate(file_list_chunks):
    if count < int(sys.argv[1]) or count > int(sys.argv[2]):
        continue  # skip chunks outside the [a, b] range
    # do stuff

That way, the subprocess will properly clean up after Python (there's no way there'll be memory leaks, since the process will be terminated).
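
A sketch of what the driver loop in main.py could look like under that scheme (the chunk size matches the snippet above; the total count is a hypothetical placeholder):

# main.py
import subprocess

CHUNK = 20000
TOTAL_CHUNKS = 100000  # hypothetical: however many chunks you actually have

for start in range(0, TOTAL_CHUNKS, CHUNK):
    # Each call runs chunk.py in a fresh interpreter, so all of its memory is
    # returned to the OS as soon as the subprocess exits.
    subprocess.check_call(["python", "chunk.py", str(start), str(start + CHUNK)])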


My bet is that Hammer 1 is the way to go; it feels like you're gluing together a lot of data and reading it into Python lists unnecessarily, and using sqlite3 (or some other database) avoids that completely.


