Pandas To_Csv() Slow Saving Large Dataframe

Optimizing time in a large loop Pandas to_csv

It would have been helpful to include a sample of your data so the answer could be tested beforehand. As it is, I can only hope it works without errors ;)

You should be able to use groupby with a custom function that gets applied to each group like this:

def custom_to_csv(temp_df, output_folder):
    date, tick = temp_df.name
    # Saving files
    if output_folder in [None, ""]:
        temp_df.to_csv("%s_%s.txt" % (date, tick))
    else:
        temp_df.to_csv("%s\\%s_%s.txt" % (output_folder, date, tick))

df.groupby(['Date', '#RIC']).apply(custom_to_csv, (output_folder))

EDIT: Changed df to temp_df and (output_folder,) to (output_folder).
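For what it's worth, here is a tiny made-up frame to exercise the snippet above; the column values (and the empty output_folder) are my assumptions, not the asker's real data:

import pandas as pd

# toy data -- only the 'Date' and '#RIC' grouping columns matter here
df = pd.DataFrame({
    'Date': ['2015-01-01', '2015-01-01', '2015-01-02'],
    '#RIC': ['AAA', 'BBB', 'AAA'],
    'Price': [1.0, 2.0, 3.0],
})

output_folder = ""   # "" -> files are written to the working directory
df.groupby(['Date', '#RIC']).apply(custom_to_csv, (output_folder))
# -> 2015-01-01_AAA.txt, 2015-01-01_BBB.txt, 2015-01-02_AAA.txt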

Performance: Python pandas DataFrame.to_csv append becomes gradually slower

In this kind of situation you should profile your code (to see which function calls are taking the most time); that way you can check empirically that it really is slow in to_csv rather than elsewhere...

From looking at your code: Firstly, there's a lot of copying here and a lot of looping (not enough vectorization)... every time you see looping, look for a way to remove it. Secondly, when you use things like zfill, I wonder whether you want to_fwf (fixed-width format) rather than to_csv?
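For the zfill point in particular, a vectorised string method avoids the per-row Python loop entirely; a minimal sketch with made-up column names:

import pandas as pd

df = pd.DataFrame({'FileID': [1, 2], 'ID': [7, 42]})
df['ConcatIndex'] = 100000 * df.FileID + df.ID
# one vectorised call instead of looping over rows
df['Concatenated String Index'] = df['ConcatIndex'].astype(str).str.zfill(10)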

Some sanity testing: Are some files significantly bigger than others (which could lead to you hitting swap)? Are you sure the largest files are only 1200 rows? Have you checked this, e.g. using wc -l?
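If you'd rather check this from Python than with wc -l, a quick sketch (assuming the same path and pickleFiles variables as in your script):

import os
import pandas as pd

for picklefile in pickleFiles:
    full = os.path.join(path, picklefile)
    n_rows = len(pd.read_pickle(full))
    print(picklefile, os.path.getsize(full), 'bytes,', n_rows, 'rows')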

IMO it's unlikely to be garbage collection (as was suggested in the other answer).


Here are a few improvements to your code which should reduce the runtime.

Since the columns are fixed, I would extract the column calculations and vectorize the real, child and other normalizations. Use apply rather than iterating (for zfill).

# assumed context from the original script: head, exclude, pickleFiles,
# path, t1 and finalnormCSVFile are defined there
import pandas as pd
from datetime import datetime

columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
remaining_cols = set(head) - set(exclude)
real_cols = [r for r in remaining_cols if 'Real ' in r]
real_cols_suffix = [r.replace('Real ', '') for r in real_cols]  # 'Real X' -> 'X'
remaining_cols = remaining_cols - set(real_cols)
child_cols = [r for r in remaining_cols if 'child' in r]
child_cols_desc = [r.strip('child' + 'desc') for r in child_cols]  # check this gives the intended 'desc' column names
remaining_cols = remaining_cols - set(child_cols)

for count, picklefile in enumerate(pickleFiles):
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    # DataFrame manipulation:
    df = pd.read_pickle(path + picklefile)

    df['ConcatIndex'] = 100000 * df.FileID + df.ID
    # use apply here rather than iterating
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex

    # DataFrame normalization:
    dftemp = df.very_deep_copy()  # don't *think* you need this

    # drop all excludes
    dftemp.drop(list(columns_to_drop), axis=1, inplace=True)

    # normalize real cols
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m

    # normalize child cols
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m

    # normalize remaining
    remaining = list(remaining_cols)  # real and child cols were already removed above
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

    # if this case is important then discard the rows of m where .max() is 0
    # if max != 0:
    #     dftemp[string] = dftemp[string]/max

    # this is dropped earlier; if you need it, then subtract ['ConcatIndex'] from columns_to_drop
    # dftemp.drop('ConcatIndex', axis=1, inplace=True)

    # Saving DataFrame to CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

As a point of style I would probably choose to wrap each of these parts into functions; this will also mean more things can be gc'd, if that really was the issue...
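For example, something along these lines (just a sketch, reusing the column lists computed above); once normalize() returns, its intermediates go out of scope and become collectable:

def normalize(df):
    # drop excludes, then scale each group of columns by the max of its partner columns
    dftemp = df.drop(list(columns_to_drop), axis=1)
    for cols, by in ((real_cols, real_cols_suffix), (child_cols, child_cols_desc)):
        m = dftemp[by].max()
        m.index = cols
        dftemp[cols] = dftemp[cols] / m
    rest = list(remaining_cols)
    dftemp[rest] = dftemp[rest] / dftemp[rest].max()
    return dftemp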


Another option, which would be faster, is to use PyTables (HDFStore) if you don't need the resulting output to be csv (but I expect you do)...
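A rough sketch of what that would look like (the 'normalized.h5' filename and 'normalized' key are made up; requires PyTables to be installed):

store = pd.HDFStore('normalized.h5')
for picklefile in pickleFiles:
    dftemp = pd.read_pickle(path + picklefile)
    # ... normalize dftemp as above ...
    store.append('normalized', dftemp)   # appends rows to one on-disk table
store.close()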

The best thing to do by far is to profile your code, e.g. with %prun in IPython (see http://pynash.org/2013/03/06/timing-and-profiling.html). Then you can see whether it definitely is to_csv and specifically where (which line of your code and which lines of pandas code).
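For instance, assuming you wrap the whole loop in a main() function (my assumption, not your current layout):

# in IPython:
#   %prun -l 10 main()        # show only the top 10 entries
# plain-Python equivalent:
import cProfile
cProfile.run('main()', sort='cumulative')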


Ah ha, I'd missed that you are appending all of these to a single csv file. And your prun shows most of the time is spent in close, so let's keep the file open:

# outside of the for loop (so the file is opened and closed only once)
f = open(finalnormCSVFile, 'w')

...
for picklefile in ...

    if picklefile == '0000.p':
        dftemp.to_csv(f)
    else:
        dftemp.to_csv(f, mode='a', header=False)
    ...

f.close()

Each time the file is opened, before it can be appended to it needs to seek to the end before writing; it could be that this is the expensive part (I don't see why it should be that bad, but keeping the file open removes the need to do it).


