Filter a data frame according to minimum and maximum values
We can use Map from base R:
Map(function(x, y) x[x >= min(y) & x <= max(y)], df, Neighb)
#$A
#numeric(0)
#$B
#[1] 2 2 1 1 1 1 2
#$C
#[1] 16
#$D
#[1] 5 5 6 5
#$E
#[1] 1.09
#$F
#[1] 2 2 1 1 1 1 2
If we need to filter the dataset using a logical index, i.e. keep only rows that are all TRUE in the comparison with 'Neighb':
df[Reduce(`&`, Map(function(x, y) x >= min(y) & x <= max(y), df, Neighb)), ]
and if we want rows with any TRUE:
df[Reduce(`|`, Map(function(x, y) x >= min(y) & x <= max(y), df, Neighb)),]
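The same all-TRUE / any-TRUE row filtering can be sketched in pandas; the data and bounds below are small hypothetical stand-ins for df and Neighb, not the original values:

```python
import pandas as pd

# Hypothetical two-column data and per-column neighbour values (assumed)
df = pd.DataFrame({"A": [24, 21, 12], "B": [6, 2, 2]})
bounds = {"A": [15, 25], "B": [0, 4]}

# One boolean column per variable: value within [min, max] of its bounds list,
# mirroring Map(function(x, y) x >= min(y) & x <= max(y), df, Neighb)
masks = pd.DataFrame({c: df[c].between(min(v), max(v)) for c, v in bounds.items()})

all_rows = df[masks.all(axis=1)]  # analogue of Reduce(`&`, ...)
any_rows = df[masks.any(axis=1)]  # analogue of Reduce(`|`, ...)
```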
data
df <- structure(list(A = c(24, 21, 12, 39, 51, 24, 48, 44, 14, 34),
B = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2),
C = c(16, 19, 12, 39, 82, 24, 32, 44, 18, 51),
D = c(5, 2, 79, 39, 27, 40, 5, 12, 6, 5),
E = c(1.2, 1.09, 0.86, 1.9, 2.3, 1.6, 1.6, 1.7, 0.88, 2.7),
F = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2)),
.Names = c("A","B", "C", "D", "E", "F"),
class = "data.frame",
row.names = c(NA, -10L))
Neighb <- list(c(15.7, 15.9, 16.0, 16.1, 16.2),
c(0, 1, 2, 3, 4),
c(15.0, 15.3, 16.0, 16.3, 16.5),
c(3, 4, 5, 6, 7),
c(1.08, 1.09, 1.10, 1.11, 1.12),
c(0, 1, 2, 3, 4))
Filter a Pandas dataframe by a condition and a minimum value in a column
Your solution is fine; it is also possible to join both lines of code together with double brackets, so that a one-row DataFrame is returned:
df = df.loc[[df.loc[df.test_a == 'OK', 'metric_e'].idxmin()]]
print (df)
test_a test_b metric_e
2 OK NOK 2
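A self-contained version of the same trick; the sample values are assumed to match the printed output above:

```python
import pandas as pd

df = pd.DataFrame({"test_a": ["OK", "NOK", "OK"],
                   "test_b": ["NOK", "OK", "NOK"],
                   "metric_e": [5, 1, 2]})

# idxmin returns the index label of the smallest metric_e among rows where
# test_a == 'OK'; wrapping that label in a list makes .loc return a one-row
# DataFrame instead of a Series
row = df.loc[[df.loc[df.test_a == "OK", "metric_e"].idxmin()]]
```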
Filter dataframe by maximum values in each group
Here's a simple and fast approach using the data.table package:
library(data.table)
setDT(df)[, .SD[which.max(date)], id]
# id date
# 1: 1 2012
# 2: 3 2014
# 3: 2 2014
Or (could be a bit faster because the data is keyed by id):
setkey(setDT(df), id)[, .SD[which.max(date)], id]
Or, using the OP's idea via the data.table package:
unique(setorder(setDT(df), id, -date), by = "id")
Or
setorder(setDT(df), id, -date)[!duplicated(id)]
Or a base R solution:
with(df, tapply(date, id, function(x) x[which.max(x)]))
## 1 2 3
## 2012 2014 2014
Another way
library(dplyr)
df %>%
group_by(id) %>%
filter(date == max(date)) # Will keep all existing columns but allow multiple rows in case of ties
# Source: local data table [3 x 2]
# Groups: id
#
# id date
# 1 1 2012
# 2 2 2014
# 3 3 2014
Or
df %>%
group_by(id) %>%
slice(which.max(date)) # Will keep all columns but won't return multiple rows in case of ties
Or
df %>%
group_by(id) %>%
summarise(max(date)) # Will remove all other columns and won't return multiple rows in case of ties
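The dplyr filter(date == max(date)) idea translates directly to pandas with a group-wise transform; the toy data below is assumed, matching the id/date shape of the printed output:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2, 3],
                   "date": [2011, 2012, 2013, 2014, 2014]})

# transform("max") broadcasts each group's maximum back to every row, so the
# comparison keeps all rows equal to their group's max (ties included),
# like filter(date == max(date)) in dplyr
latest = df[df["date"] == df.groupby("id")["date"].transform("max")]
```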
Filter DataFrame based on Max value in Column - Pandas
You could groupby Browser and take the max:
In [11]: g = df.groupby('Browser')
In [12]: g['Metric1'].max()
Out[12]:
Browser
Chrome/29 3000
FF 2000
IE 1000
Opera 3000
Name: Metric1, dtype: int64
In [13]: over2000 = g['Metric1'].max() > 2000
In [14]: over2000
Out[14]:
Browser
Chrome/29 True
FF False
IE False
Opera True
Name: Metric1, dtype: bool
To get out the array, use this as a boolean mask:
In [15]: over2000[over2000].index.values
Out[15]: array(['Chrome/29', 'Opera'], dtype=object)
Filter pandas Dataframe based on max values in a column
You can perform a groupby on 'Product ID', then apply idxmax on the 'Sales' column. This creates a series with the index of each group's highest value. We can then use those index values to index into the original dataframe using iloc:
In [201]:
df.iloc[df.groupby('Product ID')['Sales'].agg(pd.Series.idxmax)]
Out[201]:
Product_ID Store Sales
1 1 B 200
3 2 A 400
5 3 A 200
8 4 C 500
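A runnable sketch of the same idxmax approach with assumed sample data. Note that idxmax returns index labels, so .loc is the safer lookup; .iloc as used above only works because the index happens to be the default 0..n-1 range:

```python
import pandas as pd

df = pd.DataFrame({"Product_ID": [1, 1, 2, 3],
                   "Store": ["A", "B", "A", "C"],
                   "Sales": [100, 200, 400, 500]})

# idxmax gives the index label of each group's maximum Sales;
# .loc looks those labels up in the original dataframe
result = df.loc[df.groupby("Product_ID")["Sales"].idxmax()]
```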
How to remove the 'minimum' and 'maximum' values in a data frame and compute the average in 'R'
One solution is to use data.table. A data.table is like a data.frame but with added functionality. You will first need to load the data.table package and convert your data.frame (df) to a data.table:
library(data.table)
setDT(df)
From there, filter out the values at the extremes for each group using by, then take the mean of the remaining values.
# Solution:
df[,
# ID rows where value is min/max
.(Cycle.Time, "drop" = Cycle.Time %in% range(Cycle.Time)), by=Code][
# Filter to those where value is not min/max and get mean per Code
drop==FALSE, mean(Cycle.Time), by=Code]
An alternative is to use dplyr
df %>%
group_by(Code) %>%
filter(!Cycle.Time %in% range(Cycle.Time)) %>%
summarize(mean(Cycle.Time))
And to store that in a data.frame:
df %>%
group_by(Code) %>%
filter(!Cycle.Time %in% range(Cycle.Time)) %>%
summarize(mean(Cycle.Time)) %>%
data.frame -> averages
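The same trimmed mean per group can be sketched in pandas; the Code/Cycle.Time values below are hypothetical, chosen only to exercise the technique:

```python
import pandas as pd

df = pd.DataFrame({"Code": ["a"] * 4 + ["b"] * 4,
                   "Cycle.Time": [1, 2, 3, 10, 5, 6, 7, 8]})

def trimmed_mean(s):
    # Drop every occurrence of the group's min and max, then average the rest,
    # mirroring filter(!Cycle.Time %in% range(Cycle.Time)) in dplyr
    return s[(s != s.min()) & (s != s.max())].mean()

averages = df.groupby("Code")["Cycle.Time"].apply(trimmed_mean).reset_index()
```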
Filter Dataframe based on matched values in a column, and on min/max values timestamp of those values that matched
Based on your input df, you can use GroupBy.agg:
In [1190]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()
In [1191]: res
Out[1191]:
sender type _time
min max count
0 b@aol.com email 2020-14-11 19:45:48.013140 2020-14-11 19:45:48.013140 1
1 c@aol.com email 2020-12-09 19:45:48.013140 2020-13-09 19:45:48.013140 2
2 g@aol.com email 2020-12-09 19:45:48.013140 2020-12-09 19:45:48.013140 1
EDIT: To drop nested columns, do:
In [1206]: res.columns = res.columns.droplevel()
In [1207]: res
Out[1207]:
min max count
0 b@aol.com email 2020-14-11 19:45:48.013140 2020-14-11 19:45:48.013140 1
1 c@aol.com email 2020-12-09 19:45:48.013140 2020-13-09 19:45:48.013140 2
2 g@aol.com email 2020-12-09 19:45:48.013140 2020-12-09 19:45:48.013140 1
EDIT-2: Using df1
also:
In [1246]: df = df1.merge(df, how='left')
In [1254]: df.type = df.type.fillna('email')
In [1259]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()
In [1260]: res.columns = res.columns.droplevel()
In [1261]: res
Out[1261]:
min max count
0 a@aol.com email NaN NaN 0
1 b@aol.com email 2020-14-11 19:45:48.013140 2020-14-11 19:45:48.013140 1
2 c@aol.com email 2020-12-09 19:45:48.013140 2020-13-09 19:45:48.013140 2
3 g@aol.com email 2020-12-09 19:45:48.013140 2020-12-09 19:45:48.013140 1
R output BOTH maximum and minimum value by group in dataframe
You can use range to get the min and max values, and use it in summarise to get two rows for each Name.
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Value = range(Value), .groups = "drop")
# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97
If you have a large dataset, using data.table might be faster.
library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
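A pandas sketch of returning both extremes as separate rows per group, analogous to summarise(Value = range(Value)); the sample data is assumed:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "A", "B", "B"],
                   "Value": [27, 57, 89, 20]})

# Aggregate min and max per Name, then melt to long format so each
# extreme becomes its own row, like range() inside summarise()
res = (df.groupby("Name")["Value"].agg(["min", "max"])
         .reset_index()
         .melt(id_vars="Name", value_name="Value")
         .sort_values(["Name", "Value"]))
```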
dplyr filter: Get rows with minimum of variable, but only the first if multiple minima
Just for completeness: here's the final dplyr solution, derived from the comments of @hadley and @Arun:
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, rank(x, ties.method="first")==1)
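pandas has the same tie-breaking rank: rank(method="first") assigns distinct ranks to tied values by order of appearance, so exactly one row per group gets rank 1. The data below is a hypothetical example:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y", "y"],
                   "x": [1, 1, 3, 2]})

# method="first" breaks ties by position, the pandas analogue of
# rank(x, ties.method="first") == 1 in the dplyr filter above
first_min = df[df.groupby("A")["x"].rank(method="first") == 1]
```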
Python Pandas - filter pandas dataframe to get rows with minimum values in one column for each unique value in another column
Try not to use for loops with pandas; they are extremely slow compared to pandas' vectorized operations.
Solution 1:
Determine the minimum INVYR for every PlotCode, using .groupby():
min_invyr_per_plotcode = df.groupby('PlotCode', as_index=False)['INVYR'].min()
And use pd.merge() to do an inner join between your original df and this minimum you just found:
result_df = pd.merge(
df,
min_invyr_per_plotcode,
how='inner',
on=['PlotCode', 'INVYR'],
)
Solution 2:
Again, determine the minimum per group, but now add it as a column to your dataframe. This minimum per group gets added to every row by using .groupby().transform():
df['min_per_group'] = (df
.groupby('PlotCode')['INVYR']
.transform('min')
)
Now filter your dataframe where INVYR in a row is equal to the minimum of that group:
df[df['INVYR'] == df['min_per_group']]
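Both solutions can be run side by side on a small assumed dataset to confirm they select the same rows:

```python
import pandas as pd

df = pd.DataFrame({"PlotCode": ["p1", "p1", "p2"],
                   "INVYR": [2001, 1999, 2005],
                   "val": [1, 2, 3]})

# Solution 1: inner-join df with the per-plot minimum year
mins = df.groupby("PlotCode", as_index=False)["INVYR"].min()
by_merge = pd.merge(df, mins, how="inner", on=["PlotCode", "INVYR"])

# Solution 2: broadcast the group minimum with transform, then filter
by_transform = df[df["INVYR"] == df.groupby("PlotCode")["INVYR"].transform("min")]
```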