Split Data.Frame by Value

Pandas split DataFrame by column value

You can use boolean indexing:

import pandas as pd

df = pd.DataFrame({'A':[3,4,7,6,1], 'Sales':[10,20,30,40,50]})
print(df)
   A  Sales
0  3     10
1  4     20
2  7     30
3  6     40
4  1     50

s = 30

df1 = df[df['Sales'] >= s]
print(df1)
   A  Sales
2  7     30
3  6     40
4  1     50

df2 = df[df['Sales'] < s]
print(df2)
   A  Sales
0  3     10
1  4     20

It's also possible to invert the mask with ~:

mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print(df1)
   A  Sales
2  7     30
3  6     40
4  1     50

print(df2)
   A  Sales
0  3     10
1  4     20

print(mask)
0    False
1    False
2     True
3     True
4     True
Name: Sales, dtype: bool

print(~mask)
0     True
1     True
2    False
3    False
4    False
Name: Sales, dtype: bool
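
One detail worth checking when choosing between the two variants: a comparison against a missing value evaluates to False, so a row with NaN in Sales fails both >= and <, while ~mask picks it up. The sketch below uses made-up data (the NaN row is an assumption, not part of the example above) to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical data: same idea as above, but with one missing Sales value.
df = pd.DataFrame({'A': [3, 4, 7, 6], 'Sales': [10, np.nan, 30, 40]})
s = 30

mask = df['Sales'] >= s

high = df[mask]                        # rows where Sales >= s
low_by_compare = df[df['Sales'] < s]   # the NaN row is excluded here
low_by_invert = df[~mask]              # the NaN row is included here

print(len(high), len(low_by_compare), len(low_by_invert))  # 2 1 2
```

If the two parts must always add up to the original frame, the ~mask form is probably the safer choice.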

Pandas Split Dataframe by Unique Column Value

The groupby answers on the other question will work for you too. In your case, something like:

df_list = [d for _, d in df.groupby(['state'])]

This uses a list comprehension to return a list of dataframes, with one dataframe for each state.
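
If you would rather look each piece up by its state than by list position, a dict comprehension is a small variation on the same idea. This is only a sketch: the data and the population column are invented for illustration, and only the state column name comes from the question.

```python
import pandas as pd

# Hypothetical data; only the 'state' column name comes from the answer above.
df = pd.DataFrame({
    'state': ['NY', 'CA', 'NY', 'TX', 'CA'],
    'population': [10, 20, 30, 40, 50],
})

# Grouping by a single column name (a string, not a list) yields scalar keys.
df_by_state = {state: group for state, group in df.groupby('state')}

# The list form from the answer, in sorted key order.
df_list = list(df_by_state.values())

print(list(df_by_state))   # ['CA', 'NY', 'TX']
print(df_by_state['NY'])
```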

How to split a dataframe each time a string value changes in a column?

I would create a column that increments on each change, then group by that column. If you need separate dataframes you can assign them in a loop.

df['group'] = df['label'].ne(df['label'].shift()).cumsum()
# Keep the GroupBy object under its own name so df is not overwritten
grouped = df.groupby('group')
dfs = []
for name, data in grouped:
    dfs.append(data)

dfs will be a list of dataframes like so:

[        time     value label  group
 0 2020-01-01 -0.556014  high      1
 1 2020-01-02  0.185451  high      1,
         time     value   label  group
 2 2020-01-03 -0.401111  medium      2
 3 2020-01-04  0.436111  medium      2,
         time     value label  group
 4 2020-01-05  0.412933  high      3
 5 2020-01-06  0.636421  high      3
 6 2020-01-07  1.168237  high      3
 7 2020-01-08  1.205073  high      3
 8 2020-01-09  0.798674  high      3
 9 2020-01-10  0.174116  high      3]
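
To see why the shift/cumsum trick works, it may help to look at the intermediate values on a tiny example. The labels below are made up; the variable names changed and run_id are just illustrative. Note that the same label appearing again later (high here) gets a new run number, which is what separates this from a plain groupby on label.

```python
import pandas as pd

# Hypothetical labels; the column name 'label' follows the answer above.
df = pd.DataFrame({'label': ['high', 'high', 'medium', 'medium', 'high']})

changed = df['label'].ne(df['label'].shift())  # True where a new run starts
run_id = changed.cumsum()                      # running count of run starts

# Group on the run id directly, without adding a helper column to df.
dfs = [part for _, part in df.groupby(run_id)]

print(run_id.tolist())                           # [1, 1, 2, 2, 3]
print([part['label'].tolist() for part in dfs])  # [['high', 'high'], ['medium', 'medium'], ['high']]
```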

split data.frame in two based on one column values

Here is an idea using zoo::cbind.zoo,

do.call(zoo::cbind.zoo, split(df, df$two))

#  one.a two.a three.a one.b two.b three.b
#1     1     a    1123     4     b     212
#2     2     a      33     5     b       1
#3     3     a    5566     6     b      90
#4  <NA>  <NA>    <NA>     7     b     876
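
For readers doing the same thing in pandas rather than R, a rough analogue (not a translation of zoo's behaviour in every detail) is to split with groupby, renumber each piece, and bind the pieces side by side with concat. The data below is reconstructed from the printed result above; the names parts and wide are illustrative.

```python
import pandas as pd

# Data reconstructed from the printed result above (columns one/two/three).
df = pd.DataFrame({
    'one': [1, 2, 3, 4, 5, 6, 7],
    'two': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'three': [1123, 33, 5566, 212, 1, 90, 876],
})

# Split by 'two', renumber each part from 0, then bind the parts column-wise.
parts = {key: part.reset_index(drop=True) for key, part in df.groupby('two')}
wide = pd.concat(parts, axis=1)

print(wide)
```

The shorter group is padded with missing values, and the group key becomes the outer level of the column index, e.g. wide[('a', 'one')].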

split dataframe based on column value

You can use the code snippet df.loc[df['id'] == item] to extract sub dataframes based on a particular value of a column in the dataframe.

Please refer to the full code below:

import pandas as pd

df_dict = {"id": [1,1,1,2,2,2,3,3,3],
           "value": [12,13,14,22,23,24,32,33,34]}

df = pd.DataFrame(df_dict)
print(df)

# Distinct ids as plain Python ints
id_list = df['id'].unique().tolist()
print(id_list)

# One sub-dataframe per distinct id
for item in id_list:
    sub_df = df.loc[df['id'] == item]
    print(sub_df)
    print("****")

This produces the following output, with one sub-dataframe for each distinct id:

   id  value
0   1     12
1   1     13
2   1     14
3   2     22
4   2     23
5   2     24
6   3     32
7   3     33
8   3     34
[1, 2, 3]
   id  value
0   1     12
1   1     13
2   1     14
****
   id  value
3   2     22
4   2     23
5   2     24
****
   id  value
6   3     32
7   3     33
8   3     34
****

In your code snippet, the problem is that createdataframe() is called only once, and the loop inside it hits a return statement right after building the sub-dataframe for id = 1. The function therefore exits on the first iteration, which is why you only get the sub-dataframe for id = 1.
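
One way to avoid that, sketched here under the assumption that you want all sub-dataframes back from a single call, is to collect them inside the loop and return only once the loop has finished. The function name create_dataframes and its column parameter are hypothetical, not taken from your code.

```python
import pandas as pd


def create_dataframes(df, column='id'):
    """Return a dict with one sub-dataframe per distinct value of `column`."""
    sub_dfs = {}
    for item in df[column].unique():
        # Collect each sub-dataframe instead of returning inside the loop,
        # which would stop after the first value.
        sub_dfs[item] = df.loc[df[column] == item]
    return sub_dfs  # return once, after every value has been processed


df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'value': [12, 13, 14, 22, 23, 24, 32, 33, 34]})
sub_dfs = create_dataframes(df)
print(len(sub_dfs))
print(sub_dfs[2])
```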

Splitting data frame into segments for each factor based on a cutoff value in a column in R

In data.table:

dt[, V1 := paste0("A.", 1+cumsum(V4 >= 0.4))]

In dplyr:

df %>%
  mutate(V1 = paste0("A.", 1+cumsum(V4 >= 0.4)))
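
The same running-count idea can be sketched in pandas, in case that is where your data lives. The V4 values here are invented; only the column names V1/V4 and the 0.4 cutoff come from the R snippets above.

```python
import pandas as pd

# Hypothetical values; only the column names and the cutoff come from above.
df = pd.DataFrame({'V4': [0.1, 0.2, 0.5, 0.3, 0.45, 0.1]})

# Each value at or above the cutoff starts a new segment.
segment = 1 + (df['V4'] >= 0.4).cumsum()
df['V1'] = 'A.' + segment.astype(str)

print(df['V1'].tolist())  # ['A.1', 'A.1', 'A.2', 'A.2', 'A.3', 'A.3']
```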

Splitting data frame into smaller data frames based on unique column values

As suggested, you can use groupby() on your dataframe to split it by the values of one column:

import pandas as pd

cols = ['Quantity', 'Code', 'Value']
data = [[1757, '08951201', 717.0],
        [1100, '08A85800', 0.0],
        [2500, '08A85800', 0.0],
        [323, '08951201', 0.0],
        [800, '08A85800', 0.0]]

df = pd.DataFrame(data, columns=cols)

# A single column name (not a list) keeps the group keys as plain strings
groups = df.groupby('Code')

Then you can recover the indices with groups.indices, which returns a dict with the 'Code' values as keys and the matching row positions as values. Finally, if you want every sub-dataframe, you can call group_list = list(groups). I suggest doing the work in two steps (first group by, then call list), because that way you can still call other methods on the GroupBy object (groups).


EDIT

Then if you want a particular dataframe you could call

df_i = group_list[i][1]

group_list[i] is the i-th element of the list. It is a tuple (group_val, group_df), where group_val is the value associated with this sub-dataframe ('08951201' or '08A85800') and group_df is the sub-dataframe itself.
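
Putting those pieces together, here is a small self-contained sketch using the same data. The names positions and frames are illustrative; get_group is shown as an alternative when you know the key rather than the list position.

```python
import pandas as pd

df = pd.DataFrame({
    'Quantity': [1757, 1100, 2500, 323, 800],
    'Code': ['08951201', '08A85800', '08A85800', '08951201', '08A85800'],
    'Value': [717.0, 0.0, 0.0, 0.0, 0.0],
})

groups = df.groupby('Code')

# indices maps each Code to the row positions that belong to it.
positions = {code: idx.tolist() for code, idx in groups.indices.items()}
print(positions)  # {'08951201': [0, 3], '08A85800': [1, 2, 4]}

# get_group fetches one sub-dataframe by key instead of by list position.
print(groups.get_group('08A85800'))

# Unpack the (group_val, group_df) tuples into a dict keyed by Code.
frames = dict(list(groups))
```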


