Why Does Concatenation of Dataframes Get Exponentially Slower

Why does concatenation of DataFrames get exponentially slower?

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

pd.concat returns a new DataFrame. Space has to be allocated for the new
DataFrame, and data from the old DataFrames have to be copied into the new
DataFrame. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):

super_x = pd.concat([super_x, x], axis=0)

| iteration | size of old super_x | size of x | copying required |
| --- | --- | --- | --- |
| 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 2 |
| 2 | 2 | 1 | 3 |
| ... | ... | ... | ... |
| N-1 | N-1 | 1 | N |

1 + 2 + 3 + ... + N = N(N+1)/2, so O(N**2) copies are required to
complete the loop.

Now consider

super_x = []
for i, df_chunk in enumerate(df_list):
    x, y = preprocess_data(df_chunk)
    super_x.append(x)
super_x = pd.concat(super_x, axis=0)

Appending to a list is an O(1) operation and does not require copying. Now
there is a single call to pd.concat after the loop is done. This call to
pd.concat requires N copies to be made, since super_x contains N
DataFrames of size 1. So when constructed this way, super_x requires O(N)
copies.
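To make the difference concrete, here is a minimal timing sketch of both patterns; the chunk sizes and iteration count are illustrative, not taken from the original question:

import time
import numpy as np
import pandas as pd

# illustrative data: 2000 small chunks of 10 rows each
chunks = [pd.DataFrame(np.random.randn(10, 4)) for _ in range(2000)]

# anti-pattern: concat inside the loop -> quadratic copying
t0 = time.time()
slow = pd.DataFrame()
for chunk in chunks:
    slow = pd.concat([slow, chunk], axis=0)
print('concat inside loop:', time.time() - t0)

# recommended: collect chunks in a list, concat once -> linear copying
t0 = time.time()
parts = []
for chunk in chunks:
    parts.append(chunk)
fast = pd.concat(parts, axis=0)
print('single concat after loop:', time.time() - t0)

The gap between the two timings widens as the number of chunks grows, which is exactly the quadratic-vs-linear behaviour described above.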

Concatenating dataframes in a loop is very slow

You need to append DataFrames to your list, not the raw data as a list.

Try:

interimdf.append(pd.DataFrame(listdf))

Then outside of your loop,

pd.concat(interimdf) 
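Putting the two pieces together, the loop ends up looking roughly like this; interimdf and listdf are the names used in the question, while the per-iteration row data here is purely illustrative:

import pandas as pd

interimdf = []  # collects DataFrames, not raw rows

for i in range(5):
    # listdf stands in for whatever per-iteration row data your loop produces
    listdf = [{'chunk': i, 'value': i * 10}]
    interimdf.append(pd.DataFrame(listdf))

result = pd.concat(interimdf, ignore_index=True)
print(result)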

Merging 1300 data frames into a single frame becomes really slow

The reason your loop slows down is that at each .append(), the entire DataFrame has to be copied in order to allocate more memory, as described above.

If it all fits in memory, you could first fill a list of fixed size (1300) with all the data frames, and then use df = pd.concat(list_of_dataframes), which would avoid the issue you are having right now. Your code could be adjusted as follows:

import os
import pandas as pd

lst = [None for _ in range(1300)]  # creates an empty list of fixed size

for i, filename in enumerate(os.listdir(filepath)):
    file_path = os.path.join(filepath, filename)
    df = pd.read_csv(file_path, index_col=0)
    # reshape each file to two columns: the data and its source label
    df = pd.concat([df[[col]].assign(Source=f'{filename[:-4]}-{col}').rename(columns={col: 'Data'})
                    for col in df])
    lst[i] = df

frame = pd.concat(lst)

How to concat thousands of pandas dataframes generated by a for loop efficiently?

You can create a list of DataFrames and then call pd.concat only once:

dfs = []

for i in range(1, 1000):  # demo only
    df = generate_df()    # df is created here
    dfs.append(df)

combined = pd.concat(dfs)


Merging 1300 data frames into a single frame column-wise becomes really slow

The reason your code is slowing down is the same issue as in your linked question: quadratic copying. In each loop iteration, you are copying the entire existing dataframe plus some new data. The solution is to store all of the individual dataframes in a list, and then concatenate once all files have been read.

frame = []

for filename in os.listdir(filepath):  # filepath has the rest of the files
    file_path = os.path.join(filepath, filename)
    df = (pd.read_csv(file_path, index_col=0)
            .groupby(['Date']).first()
            .add_prefix(f"{filename}-"))
    frame.append(df)

frame = pd.concat(frame, axis=1)

To illustrate the concept with some sample data:

df1 = pd.DataFrame(
    {'Date': ['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03'],
     'A': [4, 5, 55, 6],
     'B': [7, 8, 85, 9]}
).set_index('Date')
df2 = pd.DataFrame(
    {'Date': ['2020-01-02', '2020-01-03', '2020-01-04'],
     'A': [40, 50, 60],
     'C': [70, 80, 90]}
).set_index('Date')

frame = []

for n, df in enumerate([df1, df2]):
    df = (df.groupby(level=['Date']).first()
            .add_prefix(f"{n}-"))
    frame.append(df)

frame = pd.concat(frame, axis=1)

>>> frame
             0-A  0-B   1-A   1-C
2020-01-01   4.0  7.0   NaN   NaN
2020-01-02   5.0  8.0  40.0  70.0
2020-01-03   6.0  9.0  50.0  80.0
2020-01-04   NaN  NaN  60.0  90.0


Fast concatenation of a large number of homogeneous dataframes

Here is an example that uses pd.HDFStore to append many tables together.

import pandas as pd
import numpy as np
from time import time

# your tables
# =========================================
columns = ['col{}'.format(i) for i in range(100)]
data = np.random.randn(100000).reshape(1000, 100)
df = pd.DataFrame(data, columns=columns)

# many tables, generator
def get_generator(df, n=1000):
    for x in range(n):
        yield df

table_reader = get_generator(df, n=1000)


# processing
# =========================================
# create a hdf5 storage, compression level 5, (1-9, 9 is extreme)
h5_file = pd.HDFStore('/home/Jian/Downloads/my_hdf5_file.h5', complevel=5, complib='blosc')

Out[2]:
<class 'pandas.io.pytables.HDFStore'>
File path: /home/Jian/Downloads/my_hdf5_file.h5
Empty


t0 = time()

# loop over your df
counter = 1
for frame in table_reader:
    print('Appending Table {}'.format(counter))
    h5_file.append('big_table', frame, complevel=5, complib='blosc')
    counter += 1

t1 = time()

# Appending Table 1
# Appending Table 2
# ...
# Appending Table 999
# Appending Table 1000


print(t1-t0)

Out[3]: 41.6630880833

# check our hdf5_file
h5_file

Out[7]:
<class 'pandas.io.pytables.HDFStore'>
File path: /home/Jian/Downloads/my_hdf5_file.h5
/big_table frame_table (typ->appendable,nrows->1000000,ncols->100,indexers->[index])

# close hdf5
h5_file.close()

# very fast to retrieve your data in any future IPython session

h5_file = pd.HDFStore('/home/Jian/Downloads/my_hdf5_file.h5')

%time my_big_table = h5_file['big_table']

CPU times: user 217 ms, sys: 1.11 s, total: 1.33 s
Wall time: 1.89 s

Speed up concatenation of excel files with Pandas

I don't see anything wrong with your pandas code; a 300 MB Excel file might just be a lot for Pandas to handle! Here are some approaches I'd take:

Tactic 1. Investigate

If I were you, my next step in debugging this would be to throw some print(datetime.now()) statements into the loop, to see whether it's the reading, the concatenating, or the .to_excel that's taking time. That way you may be able to narrow down the problem. Also take a look at your memory usage using appropriate tools for whatever OS you're in.
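For example, a minimal sketch of that kind of instrumentation; the file names and the read/concat/write calls are placeholders for whatever your actual loop does:

from datetime import datetime
import pandas as pd

excel_paths = ['one.xlsx', 'two.xlsx']  # placeholder filenames

frames = []
for path in excel_paths:
    print(datetime.now(), 'reading', path)
    frames.append(pd.read_excel(path))

print(datetime.now(), 'concatenating')
combined = pd.concat(frames, ignore_index=True)

print(datetime.now(), 'writing')
combined.to_excel('combined.xlsx', index=False)  # placeholder output name
print(datetime.now(), 'done')

Comparing the timestamps tells you which stage dominates, which is usually enough to decide whether to change the reading, the concatenation, or the writing step.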

Tactic 2. Try a different tool

Pandas is optimized for scientific computing, and it probably spends quite a bit of time organizing the data for querying and such; ETL isn't its primary purpose. If you only need to concatenate a few sheets, then (as much as it pains me to suggest doing something manually!) manual work in Excel itself will likely be the quickest way to go - highly-paid engineers at Microsoft have been tasked with optimizing that. If you need a programmatic approach, it could be worth trying out petl or one of the other ETL tools, which may take a simpler/more efficient approach than pandas.

Some example petl code that might do the trick:

import petl

tables = [petl.io.fromxlsx(file)
          for file in ['your.xlsx', 'excel.xlsx', 'files.xlsx']]

petl.cat(*tables).progress().toxlsx('combined.xlsx')  # toxlsx needs an output path

