Splitting Dataframe into Multiple Dataframes

Splitting dataframe into multiple dataframes

First, your approach is inefficient: appending to a list row by row is slow, because the list periodically has to be regrown when there is insufficient space for the new entry. A list comprehension is generally better in this respect, as it builds the whole list in a single pass without the per-iteration overhead of calling append.

However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?

I would sort the dataframe by the column 'name', set the index to that column and, if required, keep the column rather than dropping it.

Then generate a list of all the unique entries and use these entries to perform lookups. Crucially, if you are only querying the data, use the selection criteria to pull out the rows you need on demand, which can avoid a costly data copy when pandas is able to return a view (this is not guaranteed for every kind of selection).

Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:

# sort the dataframe
df.sort_values(by='name', inplace=True)

# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False, inplace=True)

# get a list of names
names = df['name'].unique().tolist()

# now we can look up the rows for a single name on demand
joe = df.loc[df.name == 'joe']

# now you can query all 'joes'
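
The lookup-by-name idea above can be sketched end to end as follows. This is only an illustration under stated assumptions: the sample data and the 'score' column are hypothetical, since the original dataframe is not shown. It collects one lookup per unique name in a dict comprehension instead of appending row by row.

```python
import pandas as pd

# Hypothetical sample data; the question's real dataframe is not shown.
df = pd.DataFrame({
    "name": ["joe", "amy", "joe", "bob", "amy"],
    "score": [1, 2, 3, 4, 5],
})

# Sort by name and use it as the index, keeping the column as well.
df = df.sort_values(by="name").set_index("name", drop=False)

# Unique names, in sorted order because the frame was sorted first.
names = df["name"].unique().tolist()

# One index lookup per name. Passing a list ([name]) to .loc means the
# result is always a DataFrame, even when only one row matches.
per_name = {name: df.loc[[name]] for name in names}

print(per_name["joe"])
```

If you only need one user at a time, a single df.loc[[name]] call on demand is enough and the dict can be skipped entirely.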

How can I split a pandas DataFrame into multiple dataframes?

You can use np.array_split to split the dataframe into a given number of roughly equal parts:

import numpy as np

dfs = np.array_split(df, 161) # split the dataframe into 161 separate tables

Edit (To assign a new col based on sequential number of df in dfs):

dfs = [df.assign(new_col=i) for i, df in enumerate(dfs, 1)]
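
Recent pandas releases may emit a deprecation warning when a DataFrame is passed directly to np.array_split. A possible workaround, sketched here on a small hypothetical dataframe, is to split the row positions with np.array_split and then slice with iloc. The names n_chunks and index_chunks are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataframe.
df = pd.DataFrame({"x": range(10)})

n_chunks = 3  # the answer above uses 161

# Split the row positions (a plain NumPy array) instead of the DataFrame.
index_chunks = np.array_split(np.arange(len(df)), n_chunks)

# Slice each chunk by position and tag it with its sequential number.
dfs = [
    df.iloc[idx].assign(new_col=i)
    for i, idx in enumerate(index_chunks, 1)
]

for part in dfs:
    print(part.shape)
```

As with np.array_split on the dataframe itself, the chunk sizes differ by at most one row when the length is not evenly divisible.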

How to split dataframe into multiple dataframes based on column-name?

Use wide_to_long to reshape the original DataFrame first, then aggregate with mean:

import pandas as pd

cols = ['total_tracks']
df1 = (pd.wide_to_long(df,
                       stubnames=['t_dur', 't_dance'],
                       i=cols,
                       j='tmp')
         .reset_index()
         # drop the helper column; the positional form drop('tmp', 1) fails in pandas 2.x
         .drop(columns='tmp')
         .groupby(cols, as_index=False)
         .mean())

print(df1)
   total_tracks          t_dur   t_dance
0             4  293071.000000  0.563667
1             8  157071.666667  0.886333
2            12  213577.666667  0.663000
3            17  216151.000000  0.766333
4            59  146673.333333  0.283667

Details:

cols = ['total_tracks']
print(pd.wide_to_long(df,
                      stubnames=['t_dur', 't_dance'],
                      i=cols,
                      j='tmp'))

                     t_dur  t_dance
total_tracks tmp
4            0    292720.0    0.549
12           0    213760.0    0.871
59           0    157124.0    0.289
8            0    127896.0    0.886
17           0    210320.0    0.724
4            1    293760.0    0.623
12           1    181000.0    0.702
59           1    130446.0    0.328
8            1    176351.0    0.947
17           1    226253.0    0.791
4            2    292733.0    0.519
12           2    245973.0    0.416
59           2    152450.0    0.234
8            2    166968.0    0.826
17           2    211880.0    0.784
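
The input dataframe is not shown in the answer. As a self-contained sketch, the values below are taken from the long-format output above, while the wide column names (t_dur0, t_dance0, and so on) are an assumption consistent with wide_to_long's default empty separator and numeric suffix.

```python
import pandas as pd

# Wide input reconstructed from the output above; the column names
# t_dur0..t_dur2 and t_dance0..t_dance2 are assumed, not from the source.
df = pd.DataFrame({
    "total_tracks": [4, 12, 59, 8, 17],
    "t_dur0": [292720, 213760, 157124, 127896, 210320],
    "t_dur1": [293760, 181000, 130446, 176351, 226253],
    "t_dur2": [292733, 245973, 152450, 166968, 211880],
    "t_dance0": [0.549, 0.871, 0.289, 0.886, 0.724],
    "t_dance1": [0.623, 0.702, 0.328, 0.947, 0.791],
    "t_dance2": [0.519, 0.416, 0.234, 0.826, 0.784],
})

cols = ["total_tracks"]

# Reshape: one row per (total_tracks, numeric suffix) pair.
long_df = pd.wide_to_long(df, stubnames=["t_dur", "t_dance"], i=cols, j="tmp")

# Drop the suffix column and average each stub per total_tracks value.
df1 = (
    long_df.reset_index()
    .drop(columns="tmp")
    .groupby(cols, as_index=False)
    .mean()
)

print(df1)
```

Note that i must uniquely identify each row of the wide frame, which holds here because every total_tracks value appears once.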

Split a dataframe into multiple dataframes based on specific row value in R

You are probably looking for the split function. Here is a small example in which a new group starts every time the b column is equal to 'a':

(d<-data.frame(a=1:10, b=sample(letters[1:3], replace = T, size = 10)))
#> a b
#> 1 1 a
#> 2 2 a
#> 3 3 c
#> 4 4 b
#> 5 5 c
#> 6 6 b
#> 7 7 c
#> 8 8 b
#> 9 9 c
#> 10 10 a
d$f<-cumsum(d$b=='a')
lst<-split(d, d$f)
lst
#> $`1`
#> a b f
#> 1 1 a 1
#>
#> $`2`
#> a b f
#> 2 2 a 2
#> 3 3 c 2
#> 4 4 b 2
#> 5 5 c 2
#> 6 6 b 2
#> 7 7 c 2
#> 8 8 b 2
#> 9 9 c 2
#>
#> $`3`
#> a b f
#> 10 10 a 3

Created on 2021-10-05 by the reprex package (v2.0.1)
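
For readers working in pandas rather than R, the same cumulative-sum trick might be sketched as follows. This is a rough counterpart, not a translation the answer itself provides; the data mirrors the R example above and the name group_id is made up for illustration.

```python
import pandas as pd

# Same values as the R example: column b is a, a, c, b, c, b, c, b, c, a.
d = pd.DataFrame({"a": range(1, 11), "b": list("aacbcbcbca")})

# Running count of 'a' rows: every 'a' starts a new group.
group_id = d["b"].eq("a").cumsum()

# One dataframe per group, in order of appearance.
pieces = [group for _, group in d.groupby(group_id)]

for piece in pieces:
    print(piece, end="\n\n")
```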

Group dataframe by ID and then split it into multiple dataframes for each group

Creating the data frame:

ID = c("A", "B", "C", "A", "B", "C", "A", "B", "C")
Date = c("01/01/2022", "01/02/2022", "01/03/2022", "01/01/2022", "01/02/2022", "01/03/2022", "01/01/2022", "01/02/2022", "01/03/2022")
Value = c("45", "24", "33", "65", "24", "87", "51", "32", "72")

df <- data.frame(ID,Date,Value)

Splitting the data (filter and %>% come from dplyr, so load it first):

library(dplyr)

df_a <- df %>%
  filter(ID == "A")
df_b <- df %>%
  filter(ID == "B")
df_c <- df %>%
  filter(ID == "C")

Printing the data:

Now print each of the split data frames:

df_a
df_b
df_c

This will give you the following output:

  ID       Date Value
1  A 01/01/2022    45
2  A 01/01/2022    65
3  A 01/01/2022    51

  ID       Date Value
1  B 01/02/2022    24
2  B 01/02/2022    24
3  B 01/02/2022    32

  ID       Date Value
1  C 01/03/2022    33
2  C 01/03/2022    87
3  C 01/03/2022    72
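
Writing one filter per ID works for three groups but does not scale to many. As a hedged sketch of the same split in pandas, groupby can produce one dataframe per ID without naming each group by hand. The data is copied from the example above; the name frames is made up for illustration.

```python
import pandas as pd

# Same data as the R example above (Value is kept as strings, as in the source).
df = pd.DataFrame({
    "ID": ["A", "B", "C"] * 3,
    "Date": ["01/01/2022", "01/02/2022", "01/03/2022"] * 3,
    "Value": ["45", "24", "33", "65", "24", "87", "51", "32", "72"],
})

# One dataframe per ID, keyed by the ID value; row labels restart at 0.
frames = {
    key: group.reset_index(drop=True)
    for key, group in df.groupby("ID")
}

print(frames["A"])
```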

Split pandas dataframe into multiple dataframes with list of lists as mask

NumPy:

  • flatnonzero to find where the 'foo.foo' rows are
  • split to divide the dataframe up accordingly


import numpy as np

np.split(df, np.flatnonzero(df.BB.eq('foo.foo'))[:-1] + 1)

[   A       BB
 0  1  foo.bar
 1  2  foo.bar
 2  3  foo.foo,
    A       BB
 3  4  foo.bar
 4  5  foo.bar
 5  6  foo.foo]

Addressing @mozway's comment

list(filter(
lambda d: not d.empty,
np.split(df, np.flatnonzero(df.BB.eq('foo.foo')) + 1)
))

[   A       BB
 0  1  foo.bar
 1  2  foo.bar
 2  3  foo.foo,
    A       BB
 3  4  foo.bar
 4  5  foo.bar
 5  6  foo.foo]
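
The same boundaries can also be applied with iloc slices, which avoids passing the DataFrame itself to np.split and never produces empty pieces. The following is a minimal sketch on data reconstructed from the output above; the names ends, bounds and pieces are made up for illustration.

```python
import numpy as np
import pandas as pd

# Data reconstructed from the printed output above.
df = pd.DataFrame({
    "A": range(1, 7),
    "BB": ["foo.bar", "foo.bar", "foo.foo"] * 2,
})

# Position just past each 'foo.foo' row, i.e. where a piece ends.
ends = np.flatnonzero(df["BB"].eq("foo.foo")) + 1

# Add the start and the end of the frame; np.unique sorts and removes the
# duplicate that appears when the last row is itself a 'foo.foo' row.
bounds = np.unique(np.concatenate(([0], ends, [len(df)])))

# Slice between consecutive boundaries.
pieces = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

for piece in pieces:
    print(piece, end="\n\n")
```

Rows that follow the last 'foo.foo' row, if any, end up in a final piece of their own.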
