Pandas - Slice large dataframe into chunks
You can use list comprehension to split your dataframe into smaller dataframes contained in a list.
n = 200000 #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]
Or use numpy's array_split. Note that its second argument is the number of chunks, not the chunk size, so divide first:
import numpy as np
import math
list_df = np.array_split(df, math.ceil(len(df) / n))
You can access the chunks with:
list_df[0]
list_df[1]
etc...
Then you can assemble it back into a one dataframe using pd.concat.
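For instance, a minimal round trip on a small frame (splitting with the list comprehension above, then reassembling with pd.concat) could look like this:

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

n = 4  # chunk row size
list_df = [df[i:i + n] for i in range(0, df.shape[0], n)]

print([len(chunk) for chunk in list_df])  # three chunks: 4, 4 and 2 rows

# reassemble into a single dataframe
reassembled = pd.concat(list_df)
print(reassembled.equals(df))  # True
```

The slices keep the original row order, so concatenation restores the frame exactly.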
By AcctName
list_df = []
for n, g in df.groupby('AcctName'):
    list_df.append(g)
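A runnable sketch of the groupby split, on a made-up frame with an 'AcctName' column (the column name is just whatever key you want to split on):

```python
import pandas as pd

df = pd.DataFrame({'AcctName': ['A', 'B', 'A', 'B', 'A'],
                   'Value': [1, 2, 3, 4, 5]})

# one sub-dataframe per account name
list_df = []
for name, g in df.groupby('AcctName'):
    list_df.append(g)

print([len(g) for g in list_df])  # [3, 2] - group sizes, sorted by key
```

Unlike the positional slicing above, this yields one chunk per distinct key value, so chunk sizes follow the data rather than a fixed row count.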
Split a large pandas dataframe
Use np.array_split:
Docstring:
Split an array into multiple sub-arrays.
Please refer to the ``split`` documentation. The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.
In [1]: import pandas as pd
In [2]: from numpy.random import randn
In [3]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C' : randn(8), 'D' : randn(8)})
In [4]: print(df)
A B C D
0 foo one -0.174067 -0.608579
1 bar one -0.860386 -1.210518
2 foo two 0.614102 1.689837
3 bar three -0.284792 -1.071160
4 foo two 0.843610 0.803712
5 bar two -1.514722 0.870861
6 foo one 0.131529 -0.968151
7 foo three -1.002946 -0.257468
In [5]: import numpy as np
In [6]: np.array_split(df, 3)
Out[6]:
[ A B C D
0 foo one -0.174067 -0.608579
1 bar one -0.860386 -1.210518
2 foo two 0.614102 1.689837,
A B C D
3 bar three -0.284792 -1.071160
4 foo two 0.843610 0.803712
5 bar two -1.514722 0.870861,
A B C D
6 foo one 0.131529 -0.968151
7 foo three -1.002946 -0.257468]
Python divide dataframe into chunks
The range function is enough here:
for start in range(0, len(df), 2500):
    process_operation(df[start:start+2500])
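For example, with a stand-in for process_operation that just records chunk sizes (2500 is an arbitrary chunk size):

```python
import pandas as pd

df = pd.DataFrame({'x': range(7000)})

sizes = []
for start in range(0, len(df), 2500):
    chunk = df[start:start + 2500]
    sizes.append(len(chunk))

print(sizes)  # [2500, 2500, 2000] - the last chunk is smaller
```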
Splitting a Large Dataframe into multiple df's with no more than 'x' number of rows in Python
If the order does not matter, you can use a slice with a stepsize:
import pandas as pd
import numpy as np
import math
data = np.random.rand(1_000_000, 2)
df = pd.DataFrame(data)
# how many rows we allow per dataframe
max_rows = 150_000
# how many subsets we need
stepsize = math.ceil(len(df) / max_rows)
# create a list of subsets
dfs = [df.iloc[offset::stepsize] for offset in range(stepsize)]
for small_df in dfs:
    print(len(small_df))
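Note that slicing with a step interleaves the rows rather than keeping them contiguous, which is why this only works when order does not matter. A small example to illustrate:

```python
import math
import pandas as pd

df = pd.DataFrame({'x': range(10)})

max_rows = 4
stepsize = math.ceil(len(df) / max_rows)  # 3 subsets

dfs = [df.iloc[offset::stepsize] for offset in range(stepsize)]

print([len(d) for d in dfs])  # [4, 3, 3] - no subset exceeds max_rows
print(dfs[0]['x'].tolist())   # [0, 3, 6, 9] - every third row, not a contiguous block
```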
You can also use a generator to prevent holding the list of subsets in memory. Here's a variation which uses a generator and preserves order:
def small_dfs(df, max_rows):
    n = math.ceil(len(df) / max_rows)
    for ix in range(n):
        yield df.iloc[ix * max_rows : (ix + 1) * max_rows, :]

for small_df in small_dfs(df, 150_000):
    print(len(small_df))
Split dataframe into relatively even chunks according to length
You can floor-divide an integer range of the dataframe's length by the chunk size, and group by the result to split the dataframe into (almost) equally sized chunks:
n = 400
for g, df in test.groupby(np.arange(len(test)) // n):
    print(df.shape)
# (400, 2)
# (400, 2)
# (311, 2)
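The same grouping collected into a list, to confirm the chunk sizes and that concatenation restores the original frame:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame(np.random.rand(1111, 2))

n = 400
chunks = [g for _, g in test.groupby(np.arange(len(test)) // n)]

print([len(c) for c in chunks])        # [400, 400, 311]
print(pd.concat(chunks).equals(test))  # True - order is preserved
```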
Split pandas dataframe into chunks of N
You can try this:
def rolling(df, window, step):
    count = 0
    df_length = len(df)
    while count < df_length:  # using df_length - window here would drop the final partial chunk
        yield count, df[count:count + window]
        count += step
Usage:
for offset, window in rolling(df, 100, 100):
    # offset: the current offset index
    # window: the current chunk (second argument: rows per chunk,
    #         third argument: rows to step at a time)
    # your code here
    pass
There is also this simpler idea:
def chunk(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
Usage:
for df_chunk in chunk(df, 100):
    # df_chunk holds up to 100 rows (the chunk size)
    # your code here
    pass
BTW. All this can be found on SO, with a search.
How can I split a pandas DataFrame into multiple dataframes?
You can use np.array_split to split the dataframe:
import numpy as np
dfs = np.array_split(df, 161) # split the dataframe into 161 separate tables
Edit (to assign a new column based on each dataframe's sequential position in dfs):
dfs = [df.assign(new_col=i) for i, df in enumerate(dfs, 1)]
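A small self-contained version of that edit (10 rows split into 3 pieces instead of the asker's 161, with new_col tagging each piece 1, 2, 3):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(10)})

dfs = np.array_split(df, 3)  # pieces of 4, 3 and 3 rows
dfs = [d.assign(new_col=i) for i, d in enumerate(dfs, 1)]

print([len(d) for d in dfs])                # [4, 3, 3]
print([d['new_col'].iloc[0] for d in dfs])  # [1, 2, 3]
```

Starting enumerate at 1 gives 1-based tags; drop the second argument for 0-based ones.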
pandas chunksize how to slice chunk and directly jump into target chunk
IIUC, use the skiprows parameter to omit the first 500 chunks; to avoid removing the header, build the skip list with np.arange starting at 1, so that row 0 (the header) is kept:
n = 100000
for i, df_ia in enumerate(pd.read_csv("/path/to/file/file.TXT",
                                      chunksize=n,
                                      skiprows=np.arange(1, 500 * n + 1),
                                      iterator=True,
                                      low_memory=False)):
    if i == 0:
        # do logic
        pass
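A self-contained sketch of the same idea on an in-memory CSV (10 data rows, chunksize 3, skipping the first 2 chunks; the file path and the 500-chunk offset above are just the asker's values):

```python
import io

import numpy as np
import pandas as pd

# synthetic CSV: one header line plus 10 data rows (0..9)
csv = "a\n" + "\n".join(str(i) for i in range(10))

n = 3     # chunksize
skip = 2  # number of leading chunks to jump over

reader = pd.read_csv(io.StringIO(csv),
                     chunksize=n,
                     skiprows=np.arange(1, skip * n + 1))  # start at 1 so row 0 (the header) survives

first_chunk = next(iter(reader))
print(first_chunk['a'].tolist())  # [6, 7, 8] - reading resumes at the third chunk
```

The skipped rows are never parsed, which is the point: you jump straight to the target chunk instead of iterating through the earlier ones.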