Split dataframe into relatively even chunks according to length
Take a sequence running from 0 up to the number of rows in the dataframe, floor-divide it by the chunk size, and pass the result to groupby.
This splits the dataframe into chunks of n rows each, with a possibly shorter final chunk:
n = 400
for g, df in test.groupby(np.arange(len(test)) // n):
    print(df.shape)
# (400, 2)
# (400, 2)
# (311, 2)
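As a self-contained sketch of the same idea, the snippet below builds a hypothetical 1,111-row frame named test and collects the chunk shapes. The row count and the column names a and b are assumptions chosen only to reproduce the shapes printed above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `test`: 1111 rows, 2 columns (assumed, not the original data).
test = pd.DataFrame({"a": np.arange(1111), "b": np.arange(1111) * 2})

n = 400  # maximum rows per chunk

# np.arange(len(test)) // n labels rows 0..399 as 0, rows 400..799 as 1, and so on,
# regardless of what index the frame carries.
group_ids = np.arange(len(test)) // n
shapes = [chunk.shape for _, chunk in test.groupby(group_ids)]
print(shapes)
```

Because the group labels come from row positions, this works even when the frame's own index is not a plain 0..N-1 range.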
Pandas - Slice large dataframe into chunks
You can use a list comprehension to split your dataframe into smaller dataframes held in a list.
n = 200000  # chunk row size
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]
Or use numpy's array_split. Note the discrepancy: its second argument is the number of chunks to produce, not the number of rows per chunk:
list_df = np.array_split(df, n)
You can access the chunks with:
list_df[0]
list_df[1]
etc...
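The two approaches read n differently, which is easy to miss. The sketch below compares them on a hypothetical 10-row frame. Splitting row positions and selecting them with iloc, rather than passing the DataFrame itself to np.array_split, is an assumption made here so that the example does not depend on how a particular NumPy/pandas combination handles DataFrames.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row frame, used only for illustration.
df = pd.DataFrame({"value": range(10)})
n = 4

# Slicing: n is the chunk size, so 10 rows give chunks of 4, 4 and 2 rows.
by_size = [df[i:i + n] for i in range(0, df.shape[0], n)]

# np.array_split: n is the number of chunks, so 10 positions give 3, 3, 2 and 2.
by_count = [df.iloc[pos] for pos in np.array_split(np.arange(len(df)), n)]

size_lengths = [len(c) for c in by_size]
count_lengths = [len(c) for c in by_count]
print(size_lengths)
print(count_lengths)
```

With n = 200000 as in the slicing example, array_split would therefore try to produce 200000 chunks, not chunks of 200000 rows.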
Then you can assemble it back into a single dataframe using pd.concat.
By AcctName
list_df = []
for n,g in df.groupby('AcctName'):
    list_df.append(g)
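A minimal round-trip sketch of splitting by a column and reassembling with pd.concat follows. The data is hypothetical; only the column name AcctName comes from the text above.

```python
import pandas as pd

# Hypothetical data; the amount column and the account values are made up.
df = pd.DataFrame({
    "AcctName": ["x", "y", "x", "z", "y"],
    "amount": [10, 20, 30, 40, 50],
})

# One sub-frame per distinct AcctName, in sorted key order (groupby's default).
list_df = [g for _, g in df.groupby("AcctName")]

# pd.concat stacks the pieces; sort_index restores the original row order.
restored = pd.concat(list_df).sort_index()
group_sizes = [len(g) for g in list_df]
print(group_sizes)
print(restored.equals(df))
```

Each piece keeps its original index labels, which is what lets sort_index put the rows back in order after concatenation.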
Split a pandas dataframe every 5 rows
Use floor division on the index to create your groups, then use DataFrame.groupby to create the separate dataframes:
grps = df.groupby(df.index // 5)
for _, dfg in grps:
    print(dfg)
COLUMN_Y
0 value1
1 value2
2 value3
3 value4
4 value5
COLUMN_Y
5 value6
6 value7
7 value8
8 value9
9 value10
COLUMN_Y
10 value11
11 value12
12 value13
13 value14
14 value15
COLUMN_Y
15 value16
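Because df.index // 5 divides index labels rather than row positions, a frame with gaps in its index is chunked unevenly. A minimal sketch of the difference, assuming a hypothetical frame whose index is non-contiguous, with np.arange(len(df)) used as a position-based alternative:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-contiguous index (as might remain after filtering rows).
df = pd.DataFrame(
    {"COLUMN_Y": [f"value{i}" for i in range(1, 8)]},
    index=[0, 3, 4, 9, 10, 11, 20],
)

# Label-based grouping: labels 0, 3, 4 -> group 0; 9 -> 1; 10, 11 -> 2; 20 -> 4.
label_sizes = [len(g) for _, g in df.groupby(df.index // 5)]

# Position-based grouping: every 5 consecutive rows, whatever the labels are.
position_sizes = [len(g) for _, g in df.groupby(np.arange(len(df)) // 5)]

print(label_sizes)
print(position_sizes)
```

Calling df.reset_index(drop=True) before grouping would be another way to get position-based chunks.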
How can I evenly split up a pandas.DataFrame into n-groups?
Use np.array_split to break it up into a list of "evenly" sized DataFrames. You can also shuffle first by sampling the full DataFrame with df.sample(frac=1):
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(24).reshape(-1,2), columns=['A', 'B'])
N = 5
np.array_split(df, N)
#np.array_split(df.sample(frac=1), N) # Shuffle and split
[ A B
0 0 1
1 2 3
2 4 5,
A B
3 6 7
4 8 9
5 10 11,
A B
6 12 13
7 14 15,
A B
8 16 17
9 18 19,
A B
10 20 21
11 22 23]
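The sizes are only "evenly" distributed because chunks can differ by one row when the row count is not divisible by N: the first len(df) % N chunks receive one extra row. The sketch below checks this rule on the same 12-row frame. It splits row positions and selects them with iloc, which is an assumption of this sketch rather than the call shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(-1, 2), columns=["A", "B"])
N = 5

# Split row positions into N nearly equal groups, then select them with iloc.
chunks = [df.iloc[pos] for pos in np.array_split(np.arange(len(df)), N)]
sizes = [len(c) for c in chunks]
print(sizes)

# Sizes predicted by the rule: the first len(df) % N chunks get one extra row.
base, extra = divmod(len(df), N)
expected = [base + 1] * extra + [base] * (N - extra)
print(expected)
```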
Split one dataframe to multiple with maximum n rows for each in Python
One way using pandas.DataFrame.groupby:
n = 10
[d for _, d in df.groupby(df.index//n)]
Output:
[ a b c
0 0.897134 -0.356157 -0.396212
1 -2.357861 2.066570 -0.512687
2 -0.080665 0.719328 0.604294
3 -0.639392 -0.912989 -1.029892
4 -0.550007 -0.633733 -0.748733
5 -0.712962 -1.612912 -0.248270
6 -0.571474 1.310807 -0.271137
7 -0.228068 0.675771 0.433016
8 0.005606 -0.154633 0.985484
9 0.691329 -0.837302 -0.607225,
a b c
10 -0.011909 -0.304162 0.422001
11 0.127570 0.956831 1.837523
12 -1.074771 0.379723 -1.889117
13 -1.449475 -0.799574 -0.878192
14 -1.029757 0.551023 2.519929
15 -1.001400 0.838614 -1.006977
16 0.677216 -0.403859 0.451338
17 0.221596 -0.323259 0.324158
18 -0.241935 -2.251687 -0.088494
19 -0.995426 0.665569 -2.228848,
a b c
20 1.714709 -0.353391 0.671539
21 0.155050 1.136433 -0.005721
22 -0.502412 -0.610901 1.520165
23 -0.853906 0.648321 1.124464
24 1.149151 -0.187300 -0.412946
25 0.329229 -1.690569 -2.746895]
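When the chunks are consumed one at a time, a generator avoids building the whole list up front. This is a sketch under stated assumptions: iter_chunks is a hypothetical helper name, and the 26-row random frame only mimics the shape of the output above.

```python
import numpy as np
import pandas as pd

def iter_chunks(frame, n):
    """Yield consecutive slices of `frame` with at most `n` rows each (hypothetical helper)."""
    for start in range(0, len(frame), n):
        yield frame.iloc[start:start + n]

# Hypothetical 26-row frame with the same column names as the output above.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((26, 3)), columns=["a", "b", "c"])

sizes = [len(chunk) for chunk in iter_chunks(df, 10)]
print(sizes)
```

Since iloc slices by position, this does not depend on the index being a default integer range, unlike df.index // n.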
Split dataframe to sub dataframes and fill content according to the relevant dataframe?
You can build a dict from .groupby() that maps each id to its group of x values, as follows:
x_df_dict = {a: b for a, b in df.groupby('id')['x']}
Then, you can access the sub-dataframes (more accurately, sub-Series) of x by id, as follows:
print(x_df_dict[1])
0 A
1 B
2 C
3 D
4 E
Name: x, dtype: object
print(x_df_dict[2])
5 A
6 D
7 E
8 F
9 Z
Name: x, dtype: object
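For a runnable version, the sketch below uses a hypothetical input frame reconstructed to match the printed Series above. It also shows that dropping the ['x'] selection gives full sub-DataFrames instead of Series. The names x_series_dict and sub_df_dict are illustrative.

```python
import pandas as pd

# Hypothetical input, chosen to match the printed Series above.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "x": ["A", "B", "C", "D", "E", "A", "D", "E", "F", "Z"],
})

# Selecting ['x'] after groupby yields (key, Series) pairs.
x_series_dict = {key: grp for key, grp in df.groupby("id")["x"]}

# Omitting the column selection yields (key, DataFrame) pairs instead.
sub_df_dict = {key: grp for key, grp in df.groupby("id")}

print(x_series_dict[2].tolist())
print(sub_df_dict[2].shape)
```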