Split Dataframe into Relatively Even Chunks According to Length

You can floor-divide a range running up to the number of rows in the dataframe by the chunk size, and use the result as a groupby key to split the dataframe into equally sized chunks:

import numpy as np

n = 400  # rows per chunk
for g, df in test.groupby(np.arange(len(test)) // n):
    print(df.shape)
# (400, 2)
# (400, 2)
# (311, 2)

Pandas - Slice large dataframe into chunks

You can use a list comprehension to split your dataframe into smaller dataframes held in a list.

n = 200000  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

Or use numpy's array_split; note that its second argument is the number of chunks to produce, not the rows per chunk:

list_df = np.array_split(df, n)
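If you want array_split to respect a maximum chunk size instead of a chunk count, one sketch (assuming the df and the chunk row size n defined above; math.ceil and the chunks name are my own) is:

import math
import numpy as np

# Derive the number of chunks from the desired maximum rows per chunk
chunks = np.array_split(df, math.ceil(len(df) / n))   # each chunk has at most n rows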

You can access the chunks with:

list_df[0]
list_df[1]
etc...

Then you can assemble them back into a single dataframe using pd.concat, as sketched below.
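A minimal sketch of the reassembly, assuming list_df holds the chunks built above; ignore_index=True is optional and simply rebuilds a clean RangeIndex:

import pandas as pd

df_restored = pd.concat(list_df, ignore_index=True)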

By AcctName

list_df = []

for n, g in df.groupby('AcctName'):
    list_df.append(g)

Split a pandas dataframe every 5 rows

Use floor division on the index to create your groups, then use DataFrame.groupby to create the separate dataframes:

grps = df.groupby(df.index // 5)

for _, dfg in grps:
    print(dfg)

  COLUMN_Y
0   value1
1   value2
2   value3
3   value4
4   value5

  COLUMN_Y
5   value6
6   value7
7   value8
8   value9
9  value10

   COLUMN_Y
10  value11
11  value12
12  value13
13  value14
14  value15

   COLUMN_Y
15  value16
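Note that df.index // 5 assumes the default RangeIndex; if the frame has been filtered or reordered, keying on row position is safer. A small sketch (the 16-row frame and the shuffle are purely illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'COLUMN_Y': [f'value{i + 1}' for i in range(16)]})
df = df.sample(frac=1, random_state=0)        # scrambled index to make the point

# Key on row position, not index labels, so every chunk still has 5 rows
for _, dfg in df.groupby(np.arange(len(df)) // 5):
    print(dfg)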

How can I evenly split up a pandas.DataFrame into n-groups?

Use np.array_split to break it up into a list of "evenly" sized DataFrames. You can also shuffle before splitting by sampling the full DataFrame (the commented-out line below).

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(24).reshape(-1,2), columns=['A', 'B'])
N = 5

np.array_split(df, N)
#np.array_split(df.sample(frac=1), N) # Shuffle and split


[   A  B
 0  0  1
 1  2  3
 2  4  5,
     A   B
 3   6   7
 4   8   9
 5  10  11,
     A   B
 6  12  13
 7  14  15,
     A   B
 8  16  17
 9  18  19,
      A   B
 10  20  21
 11  22  23]
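One caveat if you do shuffle: df.sample(frac=1) keeps the original index labels on every chunk. A sketch that also renumbers the rows (same df and N as above; random_state=42 is only for reproducibility):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(-1, 2), columns=['A', 'B'])
N = 5

# Shuffle, reset the index, then split into N roughly even pieces
parts = np.array_split(df.sample(frac=1, random_state=42).reset_index(drop=True), N)
print([len(p) for p in parts])   # [3, 3, 2, 2, 2]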

Split one dataframe to multiple with maximum n rows for each in Python

One way using pandas.DataFrame.groupby:

n = 10
[d for _, d in df.groupby(df.index//n)]

Output:

[          a         b         c
 0  0.897134 -0.356157 -0.396212
 1 -2.357861  2.066570 -0.512687
 2 -0.080665  0.719328  0.604294
 3 -0.639392 -0.912989 -1.029892
 4 -0.550007 -0.633733 -0.748733
 5 -0.712962 -1.612912 -0.248270
 6 -0.571474  1.310807 -0.271137
 7 -0.228068  0.675771  0.433016
 8  0.005606 -0.154633  0.985484
 9  0.691329 -0.837302 -0.607225,
            a         b         c
 10 -0.011909 -0.304162  0.422001
 11  0.127570  0.956831  1.837523
 12 -1.074771  0.379723 -1.889117
 13 -1.449475 -0.799574 -0.878192
 14 -1.029757  0.551023  2.519929
 15 -1.001400  0.838614 -1.006977
 16  0.677216 -0.403859  0.451338
 17  0.221596 -0.323259  0.324158
 18 -0.241935 -2.251687 -0.088494
 19 -0.995426  0.665569 -2.228848,
            a         b         c
 20  1.714709 -0.353391  0.671539
 21  0.155050  1.136433 -0.005721
 22 -0.502412 -0.610901  1.520165
 23 -0.853906  0.648321  1.124464
 24  1.149151 -0.187300 -0.412946
 25  0.329229 -1.690569 -2.746895]

Split dataframe to sub dataframes and fill content according to the relevant dataframe?

You can create a dict from the groups of x produced by .groupby('id'), as follows:

x_df_dict = {a: b for a, b in df.groupby('id')['x']}

Then, you can access the sub-dataframes (more accurately sub-Series) of x by id, as follows:

print(x_df_dict[1])

0    A
1    B
2    C
3    D
4    E
Name: x, dtype: object

print(x_df_dict[2])

5    A
6    D
7    E
8    F
9    Z
Name: x, dtype: object
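If you need the full sub-dataframes rather than just the x column, drop the column selection; a sketch (the frame below is reconstructed from the output above, and sub_df_dict is a name of my choosing):

import pandas as pd

# Reconstructed from the output above: ten rows across two ids
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'x':  ['A', 'B', 'C', 'D', 'E', 'A', 'D', 'E', 'F', 'Z'],
})

# Dict of id -> full sub-dataframe (every column kept, not just x)
sub_df_dict = {key: group for key, group in df.groupby('id')}
print(sub_df_dict[2])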


