Break a dataframe into smaller dataframes and save them
You can use the split function and the cut function to perform the operation:
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
answer<-split(x, cut(x$num, breaks=c(0, 5, 10, 15, 20, 25, 30)))
You can then pass this list to lapply for further processing.
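For readers doing the same thing in pandas, a rough analogue of split plus cut is groupby on the result of pd.cut. This is only a sketch under the assumption that the same toy data is wanted; the names bins and pieces are made up for illustration.

```python
import string

import pandas as pd

# Same toy data as the R example: 26 rows, numbered 1..26.
x = pd.DataFrame({
    "num": range(1, 27),
    "let": list(string.ascii_lowercase),
    "LET": list(string.ascii_uppercase),
})

# pd.cut assigns each row to a right-closed interval: (0, 5], (5, 10], ...
bins = pd.cut(x["num"], bins=[0, 5, 10, 15, 20, 25, 30])

# observed=True skips intervals that contain no rows.
pieces = {str(interval): group for interval, group in x.groupby(bins, observed=True)}

for label, piece in pieces.items():
    print(label, len(piece))
```

Each value in pieces is an ordinary DataFrame, so it can be processed in a loop or written out with to_csv.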
Split a dataframe into smaller dataframes in R using dplyr
We may use gl to create the grouping column in group_split:
library(dplyr)
df1 %>%
group_split(grp = as.integer(gl(n(), 59, n())), .keep = FALSE)
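The same idea, consecutive groups of at most 59 rows, could be sketched in pandas with positional slicing. The helper name split_by_rows and the 130-row frame are hypothetical and only serve the illustration.

```python
import pandas as pd


def split_by_rows(df, chunk_size):
    """Return consecutive slices of df with at most chunk_size rows each."""
    return [df.iloc[start:start + chunk_size] for start in range(0, len(df), chunk_size)]


df1 = pd.DataFrame({"value": range(130)})  # hypothetical data
chunks = split_by_rows(df1, 59)
print([len(c) for c in chunks])  # 59, 59 and 12 rows
```

The last chunk simply holds whatever rows are left over, which mirrors what the gl-based grouping does.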
Split a large pandas dataframe
Use np.array_split:
Docstring:
Split an array into multiple sub-arrays.
Please refer to the ``split`` documentation. The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.
In [1]: import pandas as pd
   ...: from numpy.random import randn
In [2]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : randn(8), 'D' : randn(8)})
In [3]: print(df)
A B C D
0 foo one -0.174067 -0.608579
1 bar one -0.860386 -1.210518
2 foo two 0.614102 1.689837
3 bar three -0.284792 -1.071160
4 foo two 0.843610 0.803712
5 bar two -1.514722 0.870861
6 foo one 0.131529 -0.968151
7 foo three -1.002946 -0.257468
In [4]: import numpy as np
In [5]: np.array_split(df, 3)
Out[5]:
[ A B C D
0 foo one -0.174067 -0.608579
1 bar one -0.860386 -1.210518
2 foo two 0.614102 1.689837,
A B C D
3 bar three -0.284792 -1.071160
4 foo two 0.843610 0.803712
5 bar two -1.514722 0.870861,
A B C D
6 foo one 0.131529 -0.968151
7 foo three -1.002946 -0.257468]
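As far as I know, some recent pandas versions emit a deprecation warning when a DataFrame is passed straight to np.array_split. A variant that should not depend on that behaviour is to split the row positions and then select with iloc. This is a sketch with made-up data, not the method from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list("abcdefgh"), "C": np.arange(8)})  # hypothetical data

# Split the row positions 0..7 into 3 groups instead of the DataFrame itself.
position_groups = np.array_split(np.arange(len(df)), 3)

# Select each group of positions; the original index labels are kept.
parts = [df.iloc[positions] for positions in position_groups]
print([len(p) for p in parts])  # 3, 3 and 2 rows
```

The sizes match the output above: eight rows do not divide evenly by three, so the last part is one row shorter.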
Split dataframe into smaller dataframe by column Names
Assume this is your dataframe:
Name price
0 aal 1
1 aal 2
2 aal 3
3 aal 4
4 aal 5
5 aal 6
6 bll 7
7 bll 8
8 bll 9
9 bll 8
10 dll 7
11 dll 56
12 dll 4
13 dll 3
14 dll 3
15 dll 5
Then do the following:
for name, group in df.groupby('Name'):
    group.to_csv("Price_{}.csv".format(name), sep=";")
That'll save all sub-dataframes as csv.
To view what the code does:
for name, group in df.groupby('Name'):
    print(group)
returns:
Name price
0 aal 1
1 aal 2
2 aal 3
3 aal 4
4 aal 5
5 aal 6
Name price
6 bll 7
7 bll 8
8 bll 9
9 bll 8
Name price
10 dll 7
11 dll 56
12 dll 4
13 dll 3
14 dll 3
15 dll 5
If you need to reset the index in every df, do this:
for name, group in df.groupby('Name'):
    gf = group.reset_index()
    print(gf)
which gives:
index Name price
0 0 aal 1
1 1 aal 2
2 2 aal 3
3 3 aal 4
4 4 aal 5
5 5 aal 6
index Name price
0 6 bll 7
1 7 bll 8
2 8 bll 9
3 9 bll 8
index Name price
0 10 dll 7
1 11 dll 56
2 12 dll 4
3 13 dll 3
4 14 dll 3
5 15 dll 5
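If the extra index column shown above is not wanted, reset_index accepts drop=True. A minimal sketch, assuming a shortened version of the example data and a dictionary named groups (my own name) to hold the results:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["aal", "aal", "bll", "bll", "dll"],
    "price": [1, 2, 7, 8, 3],
})

# drop=True discards the old index instead of adding it as a column.
groups = {name: group.reset_index(drop=True) for name, group in df.groupby("Name")}
print(groups["bll"])
```

Keeping the pieces in a dictionary keyed by Name also makes it easy to fetch one sub-dataframe later without looping again.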
Split dataframe into 3 equally sized new dataframes - Pandas
Try using numpy.array_split:
import numpy as np
df1, df2, df3 = np.array_split(df_seen, 3)
To save each DataFrame to a separate file, you could do:
for i, df in enumerate(np.array_split(df_seen, 3)):
    df.to_csv(f"data{i+1}.csv", index=False)
Split large Dataframe into smaller equal dataframes
I don't know from your description whether you are aware that np.array_split outputs n objects. If it's only a few objects, you could assign them manually, for example:
df1, df2, df3 = np.array_split(df, 3)
This would assign every subarray to these variables in order.
Otherwise, you could assign the series of subarrays to a single variable:
split_df = np.array_split(df, 3)
len(split_df)
# 3
Then loop over this one variable and do your analysis per subarray. I would personally choose the latter.
for part in split_df:
    print(type(part))
This prints <class 'pandas.core.frame.DataFrame'> three times.
Splitting dataframe into multiple dataframes
Firstly, your approach is inefficient: appending to the list row by row is slow, because the list has to be grown periodically when there is insufficient space for a new entry. A list comprehension is usually faster, since it builds the list in a single pass without the overhead of repeated append calls.
However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?
I would sort the dataframe by column 'name', set the index to be this and, if required, not drop the column.
Then generate a list of all the unique entries, which you can use to perform lookups. Crucially, if you are only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.
Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:
# sort the dataframe
df.sort_values(by='name', axis=0, inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False, inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
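Once 'name' is the index, the lookup can also go through the index itself rather than a boolean mask. The following is a sketch on made-up data, and the variable name indexed is mine:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["joe", "ann", "joe", "bob"],
    "score": [1, 2, 3, 4],
})  # hypothetical data

# A stable sort keeps the original relative order of rows with equal names.
indexed = df.sort_values(by="name", kind="stable").set_index("name", drop=False)

# Passing a list of labels always returns a DataFrame, even for a single match.
joe = indexed.loc[["joe"]]
print(joe)
```

This version avoids inplace=True, so the original df stays untouched and the steps can be chained.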
Divide a large dataframe into smaller sub dataframes in order
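Following up on my comment, here is an example. Note that it is probably not the best approach:
import numpy as np
import pandas as pd

# split df2 into 5 parts, preserving row order
dfs = np.array_split(df2, 5)
for index, df in enumerate(dfs):
    # create module-level variables df0 ... df4
    globals()['df%s' % index] = pd.DataFrame(df)
df3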
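A dictionary is a common alternative to writing into globals(): the parts stay addressable by name without creating variables dynamically. A minimal sketch, with hypothetical data standing in for df2 and row positions split so that the order is preserved:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"value": range(10)})  # hypothetical data

# Split the row positions into 5 ordered groups, then key each part by name.
position_groups = np.array_split(np.arange(len(df2)), 5)
frames = {f"df{i}": df2.iloc[pos] for i, pos in enumerate(position_groups)}

print(frames["df3"])
```

Looping over frames.items() then gives every name together with its sub-dataframe, which is awkward to do with separate global variables.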
How to randomly split a DataFrame into several smaller DataFrames?
Use np.array_split:
shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 5)
df.sample(frac=1) shuffles the rows of df. np.array_split then splits the result into parts of equal size, or nearly equal size when the row count is not divisible by the number of parts.
It gives you:
for part in result:
    print(part, '\n')
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
5 6 5 0 0 0 0 0 0 5 0 0 0 10
4 5 3 0 0 0 0 0 0 0 0 0 0 3
7 8 1 0 0 0 4 5 0 0 0 4 0 14
16 17 3 0 0 4 0 0 0 0 0 0 0 7
22 23 4 0 0 0 4 3 0 0 5 0 0 16
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
13 14 5 4 0 0 5 0 0 0 0 0 0 14
14 15 5 0 0 0 3 0 0 0 0 5 5 18
21 22 4 0 0 0 3 5 5 0 5 4 0 26
1 2 3 0 0 3 0 0 0 0 0 0 0 6
20 21 1 0 0 3 3 0 0 0 0 0 0 7
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
10 11 2 0 4 0 0 3 3 0 4 2 0 18
9 10 3 2 0 0 0 4 0 0 0 0 0 9
11 12 5 0 0 0 4 5 0 0 5 2 0 21
8 9 5 0 0 0 4 5 0 0 4 5 0 23
12 13 5 4 0 0 2 0 0 0 3 0 0 14
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
18 19 5 3 0 0 4 0 0 0 0 0 0 12
3 4 3 0 0 0 0 5 0 0 4 0 5 17
0 1 5 4 0 4 4 0 0 0 4 0 0 21
23 24 3 0 0 4 0 0 0 0 0 3 0 10
6 7 4 0 0 0 2 5 3 4 4 0 0 22
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
17 18 4 0 0 0 0 0 0 0 0 0 0 4
2 3 4 0 0 0 0 0 0 0 0 0 0 4
15 16 5 0 0 0 0 0 0 0 4 0 0 9
19 20 4 0 0 0 0 0 0 0 0 0 0 4
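If the random split needs to be repeatable, sample takes a random_state argument. This sketch uses a hypothetical 24-row stand-in for the ratings table and the seed 0, both chosen only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"movie_id": range(1, 25)})  # hypothetical stand-in, 24 rows

# random_state fixes the shuffle, so reruns give the same partition.
shuffled = df.sample(frac=1, random_state=0)

# Split the shuffled row positions into 5 groups and select each one.
position_groups = np.array_split(np.arange(len(shuffled)), 5)
result = [shuffled.iloc[pos] for pos in position_groups]

print([len(part) for part in result])  # 5, 5, 5, 5 and 4 rows
```

As in the output above, 24 rows split five ways leave the last part one row short, and every row ends up in exactly one part.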