Split a Large Pandas Dataframe

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation. The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B': ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C': np.random.randn(8),
   ...:                    'D': np.random.randn(8)})

In [4]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [5]: np.array_split(df, 3)
Out[5]:
[     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837,
      A      B         C         D
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861,
      A      B         C         D
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468]
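
As the quoted docstring says, np.split would refuse this input because 8 rows do not divide evenly into 3 parts, while np.array_split just makes the leading chunks one row longer. A minimal sketch of the difference:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(8)})

# np.split insists on an equal division and raises for 8 rows / 3 parts
try:
    np.split(df, 3)
except ValueError as err:
    print(err)  # array split does not result in an equal division

# np.array_split tolerates the remainder: chunk sizes 3, 3, 2
print([len(part) for part in np.array_split(df, 3)])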

Pandas - Slice large dataframe into chunks

You can use a list comprehension to split your dataframe into smaller dataframes held in a list.

n = 200000  # chunk row size
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

Or use numpy's array_split. Note that its second argument is the number of chunks, not the chunk size, so divide first:

import math
list_df = np.array_split(df, math.ceil(df.shape[0] / n))

You can access the chunks with:

list_df[0]
list_df[1]
etc...

Then you can assemble the chunks back into a single dataframe using pd.concat.
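
A minimal round-trip sketch; slicing keeps the original index, so pd.concat restores the original frame exactly:

import pandas as pd

df = pd.DataFrame({'a': range(10)})
n = 3  # chunk row size
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

reassembled = pd.concat(list_df)
assert reassembled.equals(df)  # the chunks concatenate back to the original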

By AcctName

list_df = []

for n, g in df.groupby('AcctName'):
    list_df.append(g)
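
Equivalently, as a one-liner over the same (hypothetical) AcctName column:

list_df = [g for _, g in df.groupby('AcctName')]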

Splitting a Large Dataframe into multiple df's with no more than 'x' number of rows in Python

If the order does not matter, you can use a slice with a stepsize:

import pandas as pd
import numpy as np
import math

data = np.random.rand(1_000_000, 2)
df = pd.DataFrame(data)

# how many rows we allow per dataframe
max_rows = 150_000

# how many subsets we need
stepsize = math.ceil(len(df) / max_rows)

# create a list of subsets
dfs = [df.iloc[offset::stepsize] for offset in range(stepsize)]

for small_df in dfs:
    print(len(small_df))
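
Note that df.iloc[offset::stepsize] takes every stepsize-th row, so each subset interleaves rows drawn from the whole frame rather than taking a contiguous block, which is why this only applies when order does not matter. A small sketch of the effect:

import pandas as pd

df = pd.DataFrame({'a': range(6)})

# stepsize 2 yields two interleaved subsets, not two contiguous halves
print(df.iloc[0::2]['a'].tolist())  # [0, 2, 4]
print(df.iloc[1::2]['a'].tolist())  # [1, 3, 5]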

You can also use a generator instead of materializing the whole list of subsets at once. Here's a variation that preserves order:

def small_dfs(df, max_rows):
    n = math.ceil(len(df) / max_rows)
    for ix in range(n):
        yield df.iloc[ix * max_rows : (ix + 1) * max_rows, :]

for small_df in small_dfs(df, 150_000):
    print(len(small_df))
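
One common use of the generator form is streaming each chunk straight to disk, so only one chunk is alive at a time (the file names here are illustrative):

for i, small_df in enumerate(small_dfs(df, 150_000)):
    small_df.to_csv(f"part_{i}.csv", index=False)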

Split dataframe into 3 equally sized new dataframes - Pandas

Try using numpy.array_split:

import numpy as np
df1, df2, df3 = np.array_split(df_seen, 3)

To save each DataFrame to a separate file, you could do:

for i, df in enumerate(np.array_split(df_seen, 3)):
    df.to_csv(f"data{i+1}.csv", index=False)

Split large Dataframe into smaller equal dataframes

I don't know from your description whether you are aware that np.array_split returns n objects. If it's only a few, you can assign them manually, for example:

df1, df2, df3 = np.array_split(df, 3)

This assigns each sub-frame to one of the variables, in order.
Otherwise you can assign the whole list of sub-frames to a single variable:

split_df = np.array_split(df, 3)
len(split_df)
# 3

then loop over this one variable and run your analysis per sub-frame. I would personally choose the latter.

for chunk in split_df:
    print(type(chunk))

This prints <class 'pandas.core.frame.DataFrame'> three times.
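
For instance, a small sketch of per-chunk analysis (mean() is just a stand-in for whatever you actually compute, assuming array_split returned DataFrames as shown above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(9)})
split_df = np.array_split(df, 3)

# the same computation applied to every chunk
results = [chunk['x'].mean() for chunk in split_df]
print(results)  # [1.0, 4.0, 7.0]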

Split a large pandas dataframe into smaller ones as per a time series

Thanks for the hint @Alollz

group = []
for k, g in df.groupby(df.time.eq(1).shift().fillna(0).cumsum()):
    group.append(g)

You can access the group you need with

group[0], group[1], group[2], ...

Details

Starting dataframe

row     var1   var2   time
row1    x1     y1     0
row2    x2     y2     0
row3    x3     y3     0
row4    x4     y4     0
row5    x5     y5     0
row6    x6     y6     0
row7    x7     y7     0
row8    x8     y8     1
row9    x9     y9     0
row10   x10    y10    0
row11   x11    y11    0
row12   x12    y12    0
row13   x13    y13    0
row14   x14    y14    1
row15   x15    y15    0
row16   x16    y16    0
row17   x17    y17    0
row18   x18    y18    0

With df.time.eq(1).shift().fillna(0).cumsum() we are essentially creating a key column to group by, shown here as column s:

row     var1   var2   time   s
row1    x1     y1     0      0
row2    x2     y2     0      0
row3    x3     y3     0      0
row4    x4     y4     0      0
row5    x5     y5     0      0
row6    x6     y6     0      0
row7    x7     y7     0      0
row8    x8     y8     1      0
row9    x9     y9     0      1
row10   x10    y10    0      1
row11   x11    y11    0      1
row12   x12    y12    0      1
row13   x13    y13    0      1
row14   x14    y14    1      1
row15   x15    y15    0      2
row16   x16    y16    0      2
row17   x17    y17    0      2
row18   x18    y18    0      2

Then we group by this key, in effect column s, even though column s is never actually added to the dataframe. Since each group is itself a dataframe, you end up with separate dataframes.
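
To watch the key being built, here is a toy sketch with a shorter time column that follows the same pattern:

import pandas as pd

df = pd.DataFrame({'time': [0, 0, 1, 0, 0, 1, 0]})
s = df.time.eq(1).shift().fillna(0).cumsum()
print(s.tolist())                     # [0, 0, 0, 1, 1, 1, 2]
print(df.groupby(s).size().tolist())  # [3, 3, 1] -> three separate groups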

If we use df.time.eq(1).fillna(0).cumsum() instead (no shift; the fillna(0) is redundant here because eq(1) never produces NaN), the row where time changes to 1 lands in the next dataframe. The grouping key for this variant is shown in column s2:

    row     var1   var2   time   s   s2
0   row1    x1     y1     0      0   0
1   row2    x2     y2     0      0   0
2   row3    x3     y3     0      0   0
3   row4    x4     y4     0      0   0
4   row5    x5     y5     0      0   0
5   row6    x6     y6     0      0   0
6   row7    x7     y7     0      0   0
7   row8    x8     y8     1      0   1
8   row9    x9     y9     0      1   1
9   row10   x10    y10    0      1   1
10  row11   x11    y11    0      1   1
11  row12   x12    y12    0      1   1
12  row13   x13    y13    0      1   1
13  row14   x14    y14    1      1   2
14  row15   x15    y15    0      2   2
15  row16   x16    y16    0      2   2
16  row17   x17    y17    0      2   2
17  row18   x18    y18    0      2   2

