Splitting dataframe into multiple dataframes
First, your approach is inefficient: appending to the list row by row is slow, because the list has to be grown periodically whenever there is insufficient space for the new entry. List comprehensions are generally better in this respect, since the list is built in a single optimized pass.
However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?
I would sort the dataframe by column 'name', set the index to be this column and, if required, not drop the column.
Then generate a list of all the unique entries and perform lookups using these entries. Crucially, if you are only querying the data, use the selection criteria to return a 'view' on the dataframe when you need it, instead of paying for a copy of the data for every user up front (whether pandas hands back a true view or a copy depends on the kind of selection).
Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:
# sort the dataframe
df.sort_values(by='name', inplace=True)  # sort rows (default axis=0), not columns
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False, inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
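If separate per-user dataframes really are needed, one common alternative to appending in a loop is to let groupby do the split in a single pass. The following is a minimal sketch, not the approach described above; the toy data and the 'score' column are made up for illustration, and only the 'name' column comes from the question.

```python
import pandas as pd

# Toy data (hypothetical); only the 'name' column mirrors the question.
df = pd.DataFrame({
    "name": ["joe", "ann", "joe", "bob", "ann"],
    "score": [1, 2, 3, 4, 5],
})

# One pass over the frame: each unique name maps to its own sub-frame.
frames = {name: group for name, group in df.groupby("name")}

joe = frames["joe"]  # all rows where name == 'joe'
print(joe)
```

The dictionary keeps the pieces addressable by name, which avoids creating one variable per user.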
How can I split a pandas DataFrame into multiple dataframes?
You can use np.array_split to split the dataframe:
import numpy as np
dfs = np.array_split(df, 161) # split the dataframe into 161 separate tables
Edit (to assign a new column based on the sequential number of each df in dfs):
dfs = [df.assign(new_col=i) for i, df in enumerate(dfs, 1)]
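Depending on the installed pandas and NumPy versions, passing a DataFrame straight to np.array_split may emit deprecation warnings. A hedged variant of the same idea splits the row positions instead and selects each chunk with iloc. The toy frame and the chunk count of 3 below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(10)})  # toy data (hypothetical)
n_chunks = 3                             # illustrative; the answer above uses 161

# Split the integer row positions into nearly equal groups,
# then pick the matching rows for each group.
position_chunks = np.array_split(np.arange(len(df)), n_chunks)
dfs = [
    df.iloc[positions].assign(new_col=i)   # number the chunks starting at 1
    for i, positions in enumerate(position_chunks, 1)
]

print([len(chunk) for chunk in dfs])
```

As with np.array_split on arrays, the chunk sizes differ by at most one row when the length is not evenly divisible.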
How to split dataframe into multiple dataframes based on column-name?
Use wide_to_long to reshape the original DataFrame first and then aggregate with mean:
import pandas as pd

cols = ['total_tracks']
df1 = (pd.wide_to_long(df,
                       stubnames=['t_dur','t_dance'],
                       i=cols,
                       j='tmp')
         .reset_index()
         .drop(columns='tmp')  # positional axis argument was removed in pandas 2.0
         .groupby(cols, as_index=False)
         .mean())
print (df1)
total_tracks t_dur t_dance
0 4 293071.000000 0.563667
1 8 157071.666667 0.886333
2 12 213577.666667 0.663000
3 17 216151.000000 0.766333
4 59 146673.333333 0.283667
Details:
cols = ['total_tracks']
print(pd.wide_to_long(df,
                      stubnames=['t_dur','t_dance'],
                      i=cols,
                      j='tmp'))
t_dur t_dance
total_tracks tmp
4 0 292720.0 0.549
12 0 213760.0 0.871
59 0 157124.0 0.289
8 0 127896.0 0.886
17 0 210320.0 0.724
4 1 293760.0 0.623
12 1 181000.0 0.702
59 1 130446.0 0.328
8 1 176351.0 0.947
17 1 226253.0 0.791
4 2 292733.0 0.519
12 2 245973.0 0.416
59 2 152450.0 0.234
8 2 166968.0 0.826
17 2 211880.0 0.784
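Since the input frame is not shown, here is a self-contained sketch of the same reshape-then-aggregate idea. It assumes the wide columns are named t_dur0, t_dur1, t_dance0, t_dance1 (stub name plus a numeric suffix, which is what wide_to_long expects by default) and uses only a subset of the values printed above.

```python
import pandas as pd

# Assumed wide layout: one row per total_tracks, numbered columns per stub.
df = pd.DataFrame({
    "total_tracks": [4, 12],
    "t_dur0": [292720, 213760],
    "t_dur1": [293760, 181000],
    "t_dance0": [0.549, 0.871],
    "t_dance1": [0.623, 0.702],
})

cols = ["total_tracks"]

# Reshape: the numeric suffix becomes the 'tmp' index level.
long_df = pd.wide_to_long(df, stubnames=["t_dur", "t_dance"], i=cols, j="tmp")

# Drop the suffix and average each stub per total_tracks.
df1 = (long_df.reset_index()
              .drop(columns="tmp")
              .groupby(cols, as_index=False)
              .mean())
print(df1)
```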
Split a dataframe into multiple dataframes based on specific row value in R
You are probably looking for the split function. I made a small example where I split every time the b column is equal to a:
(d<-data.frame(a=1:10, b=sample(letters[1:3], replace = T, size = 10)))
#> a b
#> 1 1 a
#> 2 2 a
#> 3 3 c
#> 4 4 b
#> 5 5 c
#> 6 6 b
#> 7 7 c
#> 8 8 b
#> 9 9 c
#> 10 10 a
d$f<-cumsum(d$b=='a')
lst<-split(d, d$f)
lst
#> $`1`
#> a b f
#> 1 1 a 1
#>
#> $`2`
#> a b f
#> 2 2 a 2
#> 3 3 c 2
#> 4 4 b 2
#> 5 5 c 2
#> 6 6 b 2
#> 7 7 c 2
#> 8 8 b 2
#> 9 9 c 2
#>
#> $`3`
#> a b f
#> 10 10 a 3
Created on 2021-10-05 by the reprex package (v2.0.1)
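For readers working in pandas rather than R, the same cumulative-sum trick carries over. This is a rough equivalent sketch, not part of the R answer; it reuses the b values from the sample output above and keeps the names d, f and lst only for comparison.

```python
import pandas as pd

# Same values as the R example output above.
d = pd.DataFrame({
    "a": range(1, 11),
    "b": ["a", "a", "c", "b", "c", "b", "c", "b", "c", "a"],
})

# Every 'a' starts a new group: the running count of 'a' rows is the group id.
d["f"] = d["b"].eq("a").cumsum()

# One sub-frame per group id, in order.
lst = [group for _, group in d.groupby("f")]
print([len(group) for group in lst])
```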
Group dataframe by ID and then split it into multiple dataframes for each group
Creating the data frame:
ID = c("A", "B", "C", "A", "B", "C", "A", "B", "C")
Date = c("01/01/2022", "01/02/2022", "01/03/2022", "01/01/2022", "01/02/2022", "01/03/2022", "01/01/2022", "01/02/2022", "01/03/2022")
Value = c("45", "24", "33", "65", "24", "87", "51", "32", "72")
df <- data.frame(ID,Date,Value)
Splitting the data:
library(dplyr)  # provides %>% and filter()

df_a <- df %>%
  filter(ID == "A")
df_b <- df %>%
  filter(ID == "B")
df_c <- df %>%
  filter(ID == "C")
Printing the data:
Now just evaluate each of the split data frames to print it:
df_a
df_b
df_c
This will give you the following output:
ID Date Value
1 A 01/01/2022 45
2 A 01/01/2022 65
3 A 01/01/2022 51
ID Date Value
1 B 01/02/2022 24
2 B 01/02/2022 24
3 B 01/02/2022 32
ID Date Value
1 C 01/03/2022 33
2 C 01/03/2022 87
3 C 01/03/2022 72
Split pandas dataframe into multiple dataframes with list of lists as mask
Numpy:
flatnonzero to find where the 'foo.foo' rows are
split to divide the dataframe up accordingly
import numpy as np
np.split(df, np.flatnonzero(df.BB.eq('foo.foo'))[:-1] + 1)
[ A BB
0 1 foo.bar
1 2 foo.bar
2 3 foo.foo,
A BB
3 4 foo.bar
4 5 foo.bar
5 6 foo.foo]
Addressing @mozway's comment
list(filter(
    lambda d: not d.empty,
    np.split(df, np.flatnonzero(df.BB.eq('foo.foo')) + 1)
))
[ A BB
0 1 foo.bar
1 2 foo.bar
2 3 foo.foo,
A BB
3 4 foo.bar
4 5 foo.bar
5 6 foo.foo]
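Calling np.split directly on a DataFrame relies on behavior that newer pandas releases have been deprecating, so a version-independent sketch of the same idea is to compute the cut points with flatnonzero and slice with iloc. The frame below is rebuilt from the output shown above; the variable names cuts, bounds and pieces are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuilt from the printed output above.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "BB": ["foo.bar", "foo.bar", "foo.foo", "foo.bar", "foo.bar", "foo.foo"],
})

# Position just past each 'foo.foo' row: these are the cut points.
cuts = (np.flatnonzero(df["BB"].eq("foo.foo")) + 1).tolist()

# Add the start and end of the frame; the set removes duplicates,
# so a trailing 'foo.foo' row does not produce an empty piece.
bounds = sorted({0, *cuts, len(df)})

pieces = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
print(len(pieces))
```

Deduplicating the boundaries plays the same role as filtering out empty frames in the snippet above.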