Pandas - Slice large dataframe into chunks
You can use list comprehension to split your dataframe into smaller dataframes contained in a list.
n = 200000 #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]
Or use numpy's array_split. Note that its second argument is the number of chunks, not the chunk size, so divide first:
import numpy as np
import math
list_df = np.array_split(df, math.ceil(len(df) / n))
You can access the chunks with:
list_df[0]
list_df[1]
etc...
Then you can assemble it back into a one dataframe using pd.concat.
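For instance, a minimal round trip on a small frame (splitting with the list comprehension above, then reassembling with pd.concat) could look like this:

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

n = 4  # chunk row size
list_df = [df[i:i + n] for i in range(0, df.shape[0], n)]

print([len(chunk) for chunk in list_df])  # three chunks: 4, 4 and 2 rows

# reassemble into a single dataframe
reassembled = pd.concat(list_df)
print(reassembled.equals(df))  # True
```

The slices keep the original row order, so concatenation restores the frame exactly.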
By AcctName
list_df = []
for n, g in df.groupby('AcctName'):
    list_df.append(g)
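A runnable sketch of the groupby split, on a made-up frame with an 'AcctName' column (the column name is just whatever key you want to split on):

```python
import pandas as pd

df = pd.DataFrame({'AcctName': ['A', 'B', 'A', 'B', 'A'],
                   'Value': [1, 2, 3, 4, 5]})

# one sub-dataframe per account name
list_df = []
for name, g in df.groupby('AcctName'):
    list_df.append(g)

print([len(g) for g in list_df])  # [3, 2] - group sizes, sorted by key
```

Unlike the positional slicing above, this yields one chunk per distinct key value, so chunk sizes follow the data rather than a fixed row count.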
Split a large pandas dataframe
Use np.array_split:
Docstring:
Split an array into multiple sub-arrays.
Please refer to the ``split`` documentation. The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.
In [1]: import pandas as pd
In [2]: from numpy.random import randn
In [3]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C' : randn(8), 'D' : randn(8)})
In [4]: print(df)
A B C D
0 foo one -0.174067 -0.608579
1 bar one -0.860386 -1.210518
2 foo two 0.614102 1.689837
3 bar three -0.284792 -1.071160
4 foo two 0.843610 0.803712
5 bar two -1.514722 0.870861
6 foo one 0.131529 -0.968151
7 foo three -1.002946 -0.257468
In [5]: import numpy as np
In [6]: np.array_split(df, 3)
Out[6]:
[ A B C D
0 foo one -0.174067 -0.608579
1 bar one -0.860386 -1.210518
2 foo two 0.614102 1.689837,
A B C D
3 bar three -0.284792 -1.071160
4 foo two 0.843610 0.803712
5 bar two -1.514722 0.870861,
A B C D
6 foo one 0.131529 -0.968151
7 foo three -1.002946 -0.257468]
Python divide dataframe into chunks
The range function is enough here:
for start in range(0, len(df), 2500):
    process_operation(df[start:start+2500])
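For example, with a stand-in for process_operation that just records chunk sizes (2500 is an arbitrary chunk size):

```python
import pandas as pd

df = pd.DataFrame({'x': range(7000)})

sizes = []
for start in range(0, len(df), 2500):
    chunk = df[start:start + 2500]
    sizes.append(len(chunk))

print(sizes)  # [2500, 2500, 2000] - the last chunk is smaller
```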
Splitting a Large Dataframe into multiple df's with no more than 'x' number of rows in Python
If the order does not matter, you can use a slice with a stepsize:
import pandas as pd
import numpy as np
import math
data = np.random.rand(1_000_000, 2)
df = pd.DataFrame(data)
# how many rows we allow per dataframe
max_rows = 150_000
# how many subsets we need
stepsize = math.ceil(len(df) / max_rows)
# create a list of subsets
dfs = [df.iloc[offset::stepsize] for offset in range(stepsize)]
for small_df in dfs:
    print(len(small_df))
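Note that slicing with a step interleaves the rows rather than keeping them contiguous, which is why this only works when order does not matter. A small example to illustrate:

```python
import math
import pandas as pd

df = pd.DataFrame({'x': range(10)})

max_rows = 4
stepsize = math.ceil(len(df) / max_rows)  # 3 subsets

dfs = [df.iloc[offset::stepsize] for offset in range(stepsize)]

print([len(d) for d in dfs])  # [4, 3, 3] - no subset exceeds max_rows
print(dfs[0]['x'].tolist())   # [0, 3, 6, 9] - every third row, not a contiguous block
```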
You can also use a generator to prevent holding the list of subsets in memory. Here's a variation which uses a generator and preserves order:
def small_dfs(df, max_rows):
    n = math.ceil(len(df) / max_rows)
    for ix in range(n):
        yield df.iloc[ix * max_rows : (ix + 1) * max_rows, :]

for small_df in small_dfs(df, 150_000):
    print(len(small_df))
Split dataframe into relatively even chunks according to length
You can floor-divide an integer range of the dataframe's length by the chunk size, and group by the result to split the dataframe into (almost) equally sized chunks:
n = 400
for g, df in test.groupby(np.arange(len(test)) // n):
    print(df.shape)
# (400, 2)
# (400, 2)
# (311, 2)
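The same grouping collected into a list, to confirm the chunk sizes and that concatenation restores the original frame:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame(np.random.rand(1111, 2))

n = 400
chunks = [g for _, g in test.groupby(np.arange(len(test)) // n)]

print([len(c) for c in chunks])        # [400, 400, 311]
print(pd.concat(chunks).equals(test))  # True - order is preserved
```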
Split pandas dataframe into chunks of N
You can try this:
def rolling(df, window, step):
    count = 0
    df_length = len(df)
    while count < df_length:  # using df_length - window here would drop the final partial chunk
        yield count, df[count:count + window]
        count += step
Usage:
for offset, window in rolling(df, 100, 100):
    # offset: the current offset index
    # window: the current chunk (second argument: rows per chunk,
    #         third argument: rows to step at a time)
    # your code here
    pass
There is also this simpler idea:
def chunk(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
Usage:
for df_chunk in chunk(df, 100):
    # df_chunk holds up to 100 rows (the chunk size)
    # your code here
    pass
BTW. All this can be found on SO, with a search.
How can I split a pandas DataFrame into multiple dataframes?
You can use np.array_split to split the dataframe:
import numpy as np
dfs = np.array_split(df, 161) # split the dataframe into 161 separate tables
Edit (to assign a new column based on each dataframe's sequential position in dfs):
dfs = [df.assign(new_col=i) for i, df in enumerate(dfs, 1)]
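A small self-contained version of that edit (10 rows split into 3 pieces instead of the asker's 161, with new_col tagging each piece 1, 2, 3):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(10)})

dfs = np.array_split(df, 3)  # pieces of 4, 3 and 3 rows
dfs = [d.assign(new_col=i) for i, d in enumerate(dfs, 1)]

print([len(d) for d in dfs])                # [4, 3, 3]
print([d['new_col'].iloc[0] for d in dfs])  # [1, 2, 3]
```

Starting enumerate at 1 gives 1-based tags; drop the second argument for 0-based ones.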
pandas chunksize how to slice chunk and directly jump into target chunk
IIUC, use the skiprows parameter to omit the first 500 chunks; to avoid removing the header, build the skip list with np.arange starting at 1, so that row 0 (the header) is kept:
n = 100000
for i, df_ia in enumerate(pd.read_csv("/path/to/file/file.TXT",
                                      chunksize=n,
                                      skiprows=np.arange(1, 500 * n + 1),
                                      iterator=True,
                                      low_memory=False)):
    if i == 0:
        # do logic
        pass
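A self-contained sketch of the same idea on an in-memory CSV (10 data rows, chunksize 3, skipping the first 2 chunks; the file path and the 500-chunk offset above are just the asker's values):

```python
import io

import numpy as np
import pandas as pd

# synthetic CSV: one header line plus 10 data rows (0..9)
csv = "a\n" + "\n".join(str(i) for i in range(10))

n = 3     # chunksize
skip = 2  # number of leading chunks to jump over

reader = pd.read_csv(io.StringIO(csv),
                     chunksize=n,
                     skiprows=np.arange(1, skip * n + 1))  # start at 1 so row 0 (the header) survives

first_chunk = next(iter(reader))
print(first_chunk['a'].tolist())  # [6, 7, 8] - reading resumes at the third chunk
```

The skipped rows are never parsed, which is the point: you jump straight to the target chunk instead of iterating through the earlier ones.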