Pandas split DataFrame by column value
You can use boolean indexing:
import pandas as pd

df = pd.DataFrame({'A':[3,4,7,6,1], 'Sales':[10,20,30,40,50]})
print (df)
A Sales
0 3 10
1 4 20
2 7 30
3 6 40
4 1 50
s = 30
df1 = df[df['Sales'] >= s]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
df2 = df[df['Sales'] < s]
print (df2)
A Sales
0 3 10
1 4 20
It's also possible to invert the mask with ~:
mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
print (df2)
A Sales
0 3 10
1 4 20
print (mask)
0 False
1 False
2 True
3 True
4 True
Name: Sales, dtype: bool
print (~mask)
0 True
1 True
2 False
3 False
4 False
Name: Sales, dtype: bool
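If you need this split in more than one place, the mask and its inverse can be wrapped in a small helper. This is only a sketch: the function name split_by_threshold and its parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd


def split_by_threshold(df, column, threshold):
    """Return (rows where column >= threshold, all remaining rows)."""
    mask = df[column] >= threshold
    # ~mask selects exactly the rows that mask rejects
    return df[mask], df[~mask]


df = pd.DataFrame({'A': [3, 4, 7, 6, 1], 'Sales': [10, 20, 30, 40, 50]})
at_or_above, below = split_by_threshold(df, 'Sales', 30)
print(at_or_above)
print(below)
```

Note that a row with NaN in the column compares as False, so it ends up in the second dataframe.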
Pandas Split Dataframe by Unique Column Value
The groupby answers on the other question will work for you too. In your case, something like:
df_list = [d for _, d in df.groupby('state')]
This uses a list comprehension to return a list of dataframes, with one dataframe for each state.
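If you also want to look up each piece by its state, a dict comprehension over the same groupby keeps the key. A minimal sketch, assuming a made-up dataframe with a 'state' column and a 'value' column (the real data is not shown in the question):

```python
import pandas as pd

# Hypothetical sample data; only the 'state' column name comes from the answer
df = pd.DataFrame({'state': ['NY', 'CA', 'NY', 'TX', 'CA'],
                   'value': [1, 2, 3, 4, 5]})

# Map each state to the sub-dataframe holding its rows
by_state = {state: part for state, part in df.groupby('state')}

print(by_state['NY'])
```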
How to split a dataframe each time a string value changes in a column?
I would create a column that increments on each change, then group by that column. If you need separate dataframes you can assign them in a loop.
df['group'] = df['label'].ne(df['label'].shift()).cumsum()
grouped = df.groupby('group')
dfs = []
for name, data in grouped:
    dfs.append(data)
dfs will be a list of dataframes like so:
[ time value label group
0 2020-01-01 -0.556014 high 1
1 2020-01-02 0.185451 high 1,
time value label group
2 2020-01-03 -0.401111 medium 2
3 2020-01-04 0.436111 medium 2,
time value label group
4 2020-01-05 0.412933 high 3
5 2020-01-06 0.636421 high 3
6 2020-01-07 1.168237 high 3
7 2020-01-08 1.205073 high 3
8 2020-01-09 0.798674 high 3
9 2020-01-10 0.174116 high 3]
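To see how the run counter works, here is a small self-contained sketch with made-up data (six rows instead of the ten above). It groups by the counter directly instead of storing it as a column, which is a variation on the approach above rather than the exact same code.

```python
import pandas as pd

# Hypothetical sample data with three consecutive runs of labels
df = pd.DataFrame({
    'time': pd.date_range('2020-01-01', periods=6),
    'label': ['high', 'high', 'medium', 'medium', 'high', 'high'],
})

# True wherever the label differs from the previous row; cumsum numbers the runs
run_id = df['label'].ne(df['label'].shift()).cumsum()

# One dataframe per consecutive run of identical labels
dfs = [part for _, part in df.groupby(run_id)]

print(run_id.tolist())
print(len(dfs))
```

The second 'high' run gets its own number, so it is not merged with the first one.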
split data.frame in two based on one column values
Here is an idea using zoo::cbind.zoo:
do.call(zoo::cbind.zoo, split(df, df$two))
# one.a two.a three.a one.b two.b three.b
#1 1 a 1123 4 b 212
#2 2 a 33 5 b 1
#3 3 a 5566 6 b 90
#4 <NA> <NA> <NA> 7 b 876
split dataframe based on column value
You can use the code snippet df.loc[df['id'] == item] to extract sub-dataframes based on a particular value of a column in the dataframe.
Please refer to the full code below:
import pandas as pd

df_dict = {"id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
           "value": [12, 13, 14, 22, 23, 24, 32, 33, 34]}
df = pd.DataFrame(df_dict)
print(df)

# Collect the distinct ids as plain Python ints
id_list = df['id'].unique().tolist()
print(id_list)

# Extract and print one sub-dataframe per id
for item in id_list:
    sub_df = df.loc[df['id'] == item]
    print(sub_df)
    print("****")
This generates the following output, with one sub-dataframe for each distinct id:
id value
0 1 12
1 1 13
2 1 14
3 2 22
4 2 23
5 2 24
6 3 32
7 3 33
8 3 34
[1, 2, 3]
id value
0 1 12
1 1 13
2 1 14
****
id value
3 2 22
4 2 23
5 2 24
****
id value
6 3 32
7 3 33
8 3 34
****
In your code snippet, the issue is that createdataframe() is called only once, and inside it the loop reaches a return statement right after fetching the sub-dataframe for id = 1. The function therefore exits on the first iteration, which is why you only get the sub-dataframe for id = 1.
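One way to avoid the early return is to collect the pieces inside the loop and return them once the loop has finished. The original createdataframe() is not shown here, so the function below (create_dataframes, with its column parameter) is a hypothetical stand-in, not a corrected version of your code:

```python
import pandas as pd


def create_dataframes(df, column):
    """Return a dict mapping each distinct value of `column` to its rows."""
    sub_dfs = {}
    for item in df[column].unique():
        sub_dfs[item] = df.loc[df[column] == item]
    # Return only after the loop has visited every distinct value
    return sub_dfs


df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "value": [12, 13, 14, 22, 23, 24, 32, 33, 34]})
result = create_dataframes(df, "id")
print(result[2])
```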
Splitting data frame into segments for each factor based on a cutoff value in a column in R
In data.table:
dt[, V1 := paste0("A.", 1+cumsum(V4 >= 0.4))]
In dplyr:
df %>%
  mutate(V1 = paste0("A.", 1+cumsum(V4 >= 0.4)))
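For readers working in pandas rather than R, the same cumulative-sum labelling might look roughly like the sketch below. The V4 values are made up for illustration; only the column names V1 and V4, the "A." prefix, and the 0.4 cutoff come from the R code above.

```python
import pandas as pd

# Hypothetical sample values for V4
df = pd.DataFrame({'V4': [0.1, 0.5, 0.2, 0.3, 0.6, 0.1]})

# Each row at or above the cutoff starts a new segment
segment = 1 + (df['V4'] >= 0.4).cumsum()
df['V1'] = 'A.' + segment.astype(str)

print(df)
```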
Splitting data frame into smaller data frames based on unique column values
As suggested, you could use groupby() on your dataframe to split it by the values of one column:
import pandas as pd
cols = ['Quantity', 'Code', 'Value']
data = [[1757, '08951201', 717.0],
        [1100, '08A85800', 0.0],
        [2500, '08A85800', 0.0],
        [323, '08951201', 0.0],
        [800, '08A85800', 0.0]]
df = pd.DataFrame(data, columns=cols)
groups = df.groupby('Code')
Then you can recover the indices with groups.indices, which returns a dict with the 'Code' values as keys and the row positions of each group as values. Finally, if you want every sub-dataframe, you can call group_list = list(groups). I suggest doing the work in two steps (first group by, then call list), because that way you can still call other methods on the grouped object (groups).
EDIT
Then if you want a particular dataframe you could call:
df_i = group_list[i][1]
group_list[i] is the i-th element of the list, but it's a tuple containing (group_val, group_df), where group_val is the value associated with this new dataframe ('08951201' or '08A85800') and group_df is the new dataframe.
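Putting the pieces together, a short sketch of the calls mentioned above on the same sample data. It also uses get_group(), which is not mentioned in the answer but is a standard GroupBy method for fetching one group by its key.

```python
import pandas as pd

cols = ['Quantity', 'Code', 'Value']
data = [[1757, '08951201', 717.0],
        [1100, '08A85800', 0.0],
        [2500, '08A85800', 0.0],
        [323, '08951201', 0.0],
        [800, '08A85800', 0.0]]
df = pd.DataFrame(data, columns=cols)

groups = df.groupby('Code')

# Dict of group key -> array of row positions belonging to that group
print(groups.indices)

# List of (group_val, group_df) tuples, sorted by group key
group_list = list(groups)
group_val, group_df = group_list[0]

# Fetch a single group directly by its key
one_code = groups.get_group('08A85800')
print(one_code)
```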