Pandas - Slice Large Dataframe into Chunks

Pandas - Slice large dataframe into chunks

You can use a list comprehension to split your dataframe into smaller dataframes contained in a list.

n = 200000  # chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

Or use numpy's array_split. Note a discrepancy with the slicing approach above: here the second argument is the number of chunks you want, not the chunk size:

list_df = np.array_split(df, n)
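
If you want chunks of at most n rows (matching the list comprehension above) rather than n chunks, compute the chunk count first; a minimal sketch reusing the df and n defined above:

import math
import numpy as np

num_chunks = math.ceil(len(df) / n)  # chunks needed so each has at most n rows
list_df = np.array_split(df, num_chunks)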

You can access the chunks with:

list_df[0]
list_df[1]
etc...

Then you can assemble it back into one dataframe using pd.concat.
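
For example:

import pandas as pd

df_reassembled = pd.concat(list_df)  # stack the chunks back into one dataframe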

By AcctName

list_df = []

for n, g in df.groupby('AcctName'):
    list_df.append(g)
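
If you also want to keep track of which account each piece belongs to, a small variation (using the same AcctName column) is to build a dict keyed by the group name:

dict_df = {name: g for name, g in df.groupby('AcctName')}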

Split a large pandas dataframe

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation. The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B': ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C': np.random.randn(8), 'D': np.random.randn(8)})

In [4]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [5]: np.array_split(df, 3)
Out[5]:
[     A      B         C         D
 0  foo    one -0.174067 -0.608579
 1  bar    one -0.860386 -1.210518
 2  foo    two  0.614102  1.689837,
      A      B         C         D
 3  bar  three -0.284792 -1.071160
 4  foo    two  0.843610  0.803712
 5  bar    two -1.514722  0.870861,
      A      B         C         D
 6  foo    one  0.131529 -0.968151
 7  foo  three -1.002946 -0.257468]

Python divide dataframe into chunks

The range function is enough here:

for start in range(0, len(df), 2500):
    process_operation(df[start:start + 2500])
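
Here is a self-contained sketch of the same loop, with process_operation as a hypothetical placeholder that just reports each chunk's shape:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 3))

def process_operation(chunk):
    # placeholder: replace with your real per-chunk work
    print(chunk.shape)

for start in range(0, len(df), 2500):
    process_operation(df[start:start + 2500])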

Splitting a Large Dataframe into multiple df's with no more than 'x' number of rows in Python

If the order does not matter, you can use a slice with a stepsize:

import pandas as pd
import numpy as np
import math

data = np.random.rand(1_000_000, 2)
df = pd.DataFrame(data)

# how many rows we allow per dataframe
max_rows = 150_000

# how many subsets we need
stepsize = math.ceil(len(df) / max_rows)

# create a list of subsets
dfs = [df.iloc[offset::stepsize] for offset in range(stepsize)]

for small_df in dfs:
    print(len(small_df))
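
Note that because the slices are strided, each subset holds every stepsize-th row, so rows are interleaved across the subsets. If you need the original order back afterwards, concatenate and sort the index:

restored = pd.concat(dfs).sort_index()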

You can also use a generator to prevent holding the list of subsets in memory. Here's a variation which uses a generator and preserves order:

def small_dfs(df, max_rows):
    n = math.ceil(len(df) / max_rows)
    for ix in range(n):
        yield df.iloc[ix * max_rows : (ix + 1) * max_rows, :]

for small_df in small_dfs(df, 150_000):
    print(len(small_df))

Split dataframe into relatively even chunks according to length

You can take the floor division of a sequence of row positions by the chunk size and use it with groupby, splitting the dataframe into equally sized chunks (the last chunk may be smaller):

n = 400
for g, df in test.groupby(np.arange(len(test)) // n):
    print(df.shape)
# (400, 2)
# (400, 2)
# (311, 2)
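
The same trick collects the chunks into a list if you prefer that over iterating (using the test dataframe and n from above):

chunks = [g for _, g in test.groupby(np.arange(len(test)) // n)]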

Split pandas dataframe into chunks of N

You can try this:

def rolling(df, window, step):
    count = 0
    df_length = len(df)
    # iterate until every row is covered; the final window may be shorter
    while count < df_length:
        yield count, df[count:window + count]
        count += step

Usage:

for offset, window in rolling(df, 100, 100):
    # offset: the current offset index
    # window: the current chunk
    # rolling(df, 100, 100): 100 rows in each chunk, stepping 100 rows at a time
    # your code here
    pass

There is also this simpler idea:

def chunk(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

Usage:

for df_chunk in chunk(df, 100):
    # 100: the chunk size
    # your code here
    pass

BTW, all of this can be found on SO with a search.

How can I split a pandas DataFrame into multiple dataframes?

You can use np.array_split to split the dataframe:

import numpy as np

dfs = np.array_split(df, 161) # split the dataframe into 161 separate tables

Edit (to assign a new column based on the sequential number of each dataframe in dfs):

dfs = [df.assign(new_col=i) for i, df in enumerate(dfs, 1)]
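
As a quick sanity check, concatenating the labelled chunks back together shows how many rows each chunk number received:

df_labelled = pd.concat(dfs)
print(df_labelled['new_col'].value_counts())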

pandas chunksize how to slice chunk and directly jump into target chunk

If I understand correctly, use the skiprows parameter to skip the rows belonging to the first 500 chunks; passing np.arange(1, 500 * n + 1) starts the skipping at row 1, so the header row is kept:

n = 100000
for i, df_ia in enumerate(pd.read_csv("/path/to/file/file.TXT",
                                      chunksize=n,
                                      skiprows=np.arange(1, 500 * n + 1),
                                      iterator=True,
                                      low_memory=False)):
    if i == 0:
        # do logic
        pass
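
The same idea generalizes to any target chunk; a sketch assuming a hypothetical target variable for the 0-based index of the chunk to jump to:

import numpy as np
import pandas as pd

n = 100000    # rows per chunk
target = 500  # hypothetical: 0-based index of the chunk to jump to

reader = pd.read_csv("/path/to/file/file.TXT",
                     chunksize=n,
                     skiprows=np.arange(1, target * n + 1),  # row 0 (header) is kept
                     low_memory=False)
first_chunk = next(reader)  # the target chunk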

