Filter a Data Frame According to Minimum and Maximum Values

We can use Map from base R:

Map(function(x, y) x[x >= min(y) & x <= max(y)], df, Neighb)
#$A
#numeric(0)

#$B
#[1] 2 2 1 1 1 1 2

#$C
#[1] 16

#$D
#[1] 5 5 6 5

#$E
#[1] 1.09

#$F
#[1] 2 2 1 1 1 1 2

If we need to filter the dataset based on the logical index, i.e. rows that are all TRUE in the comparison with 'Neighb':

df[Reduce(`&`, Map(function(x, y) x >= min(y) & x <= max(y), df, Neighb)), ]

and if any is TRUE:

df[Reduce(`|`, Map(function(x, y) x >= min(y) & x <= max(y), df, Neighb)),]

data

df <- structure(list(A = c(24, 21, 12, 39, 51, 24, 48, 44, 14, 34), 
B = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2),
C = c(16, 19, 12, 39, 82, 24, 32, 44, 18, 51),
D = c(5, 2, 79, 39, 27, 40, 5, 12, 6, 5),
E = c(1.2, 1.09, 0.86, 1.9, 2.3, 1.6, 1.6, 1.7, 0.88, 2.7),
F = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2)),
.Names = c("A","B", "C", "D", "E", "F"),
class = "data.frame",
row.names = c(NA, -10L))


Neighb <- list(c(15.7, 15.9, 16.0, 16.1, 16.2),
c(0, 1, 2, 3, 4),
c(15.0, 15.3, 16.0, 16.3, 16.5),
c(3, 4, 5, 6, 7),
c(1.08, 1.09, 1.10, 1.11, 1.12),
c(0, 1, 2, 3, 4))
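For comparison, the same per-column and row-wise filtering can be sketched in pandas, using the df and Neighb values from the R code above (`between` plays the role of the `>= min & <= max` comparison):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [24, 21, 12, 39, 51, 24, 48, 44, 14, 34],
    "B": [6, 2, 2, 7, 1, 9, 1, 1, 1, 2],
    "C": [16, 19, 12, 39, 82, 24, 32, 44, 18, 51],
    "D": [5, 2, 79, 39, 27, 40, 5, 12, 6, 5],
    "E": [1.2, 1.09, 0.86, 1.9, 2.3, 1.6, 1.6, 1.7, 0.88, 2.7],
    "F": [6, 2, 2, 7, 1, 9, 1, 1, 1, 2],
})
neighb = {
    "A": [15.7, 15.9, 16.0, 16.1, 16.2],
    "B": [0, 1, 2, 3, 4],
    "C": [15.0, 15.3, 16.0, 16.3, 16.5],
    "D": [3, 4, 5, 6, 7],
    "E": [1.08, 1.09, 1.10, 1.11, 1.12],
    "F": [0, 1, 2, 3, 4],
}

# Per-column equivalent of Map(): keep values inside [min(y), max(y)]
filtered = {col: df[col][df[col].between(min(y), max(y))].tolist()
            for col, y in neighb.items()}

# Row-wise masks, equivalent of Reduce(`&`, ...) and Reduce(`|`, ...)
mask = pd.concat({c: df[c].between(min(y), max(y))
                  for c, y in neighb.items()}, axis=1)
all_rows = df[mask.all(axis=1)]   # every column within its range
any_rows = df[mask.any(axis=1)]   # at least one column within range
```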

Filter a Pandas dataframe by a condition and a minimum value in a column

Your solution is nice; it is also possible to join both lines of code together, using double brackets [] so that a one-row DataFrame is returned:

df = df.loc[[df.loc[df.test_a == 'OK', 'metric_e'].idxmin()]]
print (df)
test_a test_b metric_e
2 OK NOK 2
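A self-contained sketch of the same idea, with hypothetical data chosen to match the output shown above:

```python
import pandas as pd

# Hypothetical data matching the printed result
df = pd.DataFrame({
    "test_a":   ["NOK", "OK", "OK"],
    "test_b":   ["OK", "OK", "NOK"],
    "metric_e": [1, 5, 2],
})

# Double brackets around idxmin keep the result a one-row DataFrame
result = df.loc[[df.loc[df.test_a == "OK", "metric_e"].idxmin()]]
```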

Filter dataframe by maximum values in each group

Here's a simple and fast approach using the data.table package:

library(data.table)
setDT(df)[, .SD[which.max(date)], id]
# id date
# 1: 1 2012
# 2: 3 2014
# 3: 2 2014

Or (this could be a bit faster because the data is keyed by id):

setkey(setDT(df), id)[, .SD[which.max(date)], id]

Or, using the OP's idea, via the data.table package:

unique(setorder(setDT(df), id, -date), by = "id")

Or

setorder(setDT(df), id, -date)[!duplicated(id)]

Or a base R solution:

with(df, tapply(date, id, function(x) x[which.max(x)]))
## 1 2 3
## 2012 2014 2014

Another way

library(dplyr)
df %>%
group_by(id) %>%
filter(date == max(date)) # Will keep all existing columns but allow multiple rows in case of ties
# Source: local data table [3 x 2]
# Groups: id
#
# id date
# 1 1 2012
# 2 2 2014
# 3 3 2014

Or

df %>%
group_by(id) %>%
slice(which.max(date)) # Will keep all columns but won't return multiple rows in case of ties

Or

df %>%
group_by(id) %>%
summarise(max(date)) # Will remove all other columns and won't return multiple rows in case of ties
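The sort-then-deduplicate idea from the data.table answers translates directly to pandas as well (a sketch, assuming the same id/date columns with hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({"id":   [1, 1, 2, 2, 3, 3],
                   "date": [2011, 2012, 2013, 2014, 2013, 2014]})

# Sort descending by date, then keep the first row per id --
# the same idea as setorder(setDT(df), id, -date)[!duplicated(id)]
latest = (df.sort_values("date", ascending=False)
            .drop_duplicates("id")
            .sort_values("id"))
```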

Filter DataFrame based on Max value in Column - Pandas

You could groupby Browser and take the max:

In [11]: g = df.groupby('Browser')

In [12]: g['Metric1'].max()
Out[12]:
Browser
Chrome/29 3000
FF 2000
IE 1000
Opera 3000
Name: Metric1, dtype: int64

In [13]: over2000 = g['Metric1'].max() > 2000

In [14]: over2000
Out[14]:
Browser
Chrome/29 True
FF False
IE False
Opera True
Name: Metric1, dtype: bool

To extract the array of browser names, use this as a boolean mask:

In [15]: over2000[over2000].index.values
Out[15]: array(['Chrome/29', 'Opera'], dtype=object)
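To filter the original DataFrame down to the rows for those browsers, the mask's index can be fed to isin (a runnable sketch with assumed data):

```python
import pandas as pd

df = pd.DataFrame({
    "Browser": ["Chrome/29", "FF", "IE", "Opera", "Chrome/29"],
    "Metric1": [3000, 2000, 1000, 3000, 500],
})

over2000 = df.groupby("Browser")["Metric1"].max() > 2000
winners = over2000[over2000].index

# Keep every row whose browser's max Metric1 exceeds 2000
filtered = df[df["Browser"].isin(winners)]
```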

Filter pandas Dataframe based on max values in a column

You can perform a groupby on 'Product ID' and then apply idxmax to the 'Sales' column.
This creates a Series holding the index of each group's highest value,
which we can then use to index into the original DataFrame with iloc:

In [201]:

df.iloc[df.groupby('Product ID')['Sales'].agg(pd.Series.idxmax)]
Out[201]:
Product ID Store Sales
1 1 B 200
3 2 A 400
5 3 A 200
8 4 C 500
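Note that idxmax returns index *labels*, so loc is the safer lookup when the index is not a default 0..n-1 RangeIndex. A runnable sketch with data chosen to match the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "Product ID": [1, 1, 2, 2, 3, 3, 4, 4, 4],
    "Store":      ["A", "B", "B", "A", "C", "A", "A", "B", "C"],
    "Sales":      [100, 200, 300, 400, 100, 200, 300, 400, 500],
})

# idxmax yields the label of each group's max row; loc looks those labels up
best = df.loc[df.groupby("Product ID")["Sales"].idxmax()]
```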

How to remove the 'minimum' and 'maximum' values in a data frame and compute the average in 'R'

One solution is to use data.table. A data.table is like a data.frame but with added functionality. First load the data.table package and convert your data.frame (df) to a data.table:

library(data.table)
setDT(df)

From there, filter out the values at the extremes for each group using by, then get the mean of the remaining values.

# Solution: 
df[,
# ID rows where value is min/max
.(Cycle.Time, "drop" = Cycle.Time %in% range(Cycle.Time)), by=Code][
# Filter to those where value is not min/max and get mean per Code
drop==FALSE, mean(Cycle.Time), by=Code]

An alternative is to use dplyr

df %>% 
group_by(Code) %>%
filter(!Cycle.Time %in% range(Cycle.Time)) %>%
summarize(mean(Cycle.Time))

And to store that in a data.frame:

df %>% 
group_by(Code) %>%
filter(!Cycle.Time %in% range(Cycle.Time)) %>%
summarize(mean(Cycle.Time)) %>%
data.frame -> averages
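The same trim-the-extremes-then-average logic can be sketched in pandas (column names Code and Cycle.Time taken from the R code above, with hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({
    "Code":       ["x", "x", "x", "x", "y", "y", "y", "y"],
    "Cycle.Time": [1, 2, 3, 10, 4, 5, 6, 20],
})

def trimmed_mean(s):
    # Drop every value equal to the group's min or max, then average the rest
    # (same effect as !Cycle.Time %in% range(Cycle.Time) in the R code)
    return s[(s != s.min()) & (s != s.max())].mean()

averages = df.groupby("Code")["Cycle.Time"].apply(trimmed_mean)
```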

Filter Dataframe based on matched values in a column, and on min/max values timestamp of those values that matched

Based on your input df, you can use GroupBy.agg:

In [1190]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()

In [1191]: res
Out[1191]:
sender type _time
min max count
0 b@aol.com email 2020-14-11 19:45:48.013140 2020-14-11 19:45:48.013140 1
1 c@aol.com email 2020-12-09 19:45:48.013140 2020-13-09 19:45:48.013140 2
2 g@aol.com email 2020-12-09 19:45:48.013140 2020-12-09 19:45:48.013140 1

EDIT: To drop nested columns, do:

In [1206]: res.columns = res.columns.droplevel()

In [1207]: res
Out[1207]:
min max count
0 b@aol.com email 2020-14-11 19:45:48.013140 2020-14-11 19:45:48.013140 1
1 c@aol.com email 2020-12-09 19:45:48.013140 2020-13-09 19:45:48.013140 2
2 g@aol.com email 2020-12-09 19:45:48.013140 2020-12-09 19:45:48.013140 1

EDIT-2: Using df1 also:

In [1246]: df = df1.merge(df, how='left')
In [1254]: df.type = df.type.fillna('email')

In [1259]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()

In [1260]: res.columns = res.columns.droplevel()

In [1261]: res
Out[1261]:
min max count
0 a@aol.com email NaN NaN 0
1 b@aol.com email 2020-14-11 19:45:48.013140 2020-14-11 19:45:48.013140 1
2 c@aol.com email 2020-12-09 19:45:48.013140 2020-13-09 19:45:48.013140 2
3 g@aol.com email 2020-12-09 19:45:48.013140 2020-12-09 19:45:48.013140 1
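An alternative that avoids the nested columns entirely is pandas named aggregation, which produces flat column names directly (a sketch with hypothetical sender/type/_time data):

```python
import pandas as pd

df = pd.DataFrame({
    "sender": ["b@aol.com", "c@aol.com", "c@aol.com"],
    "type":   ["email", "email", "email"],
    "_time":  ["2020-14-11 19:45", "2020-12-09 19:45", "2020-13-09 19:45"],
})

# Named aggregation: keyword = (column, function) gives flat output columns,
# so no droplevel() step is needed afterwards
res = (df.groupby(["sender", "type"])
         .agg(min=("_time", "min"), max=("_time", "max"), count=("_time", "count"))
         .reset_index())
```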

R output BOTH maximum and minimum value by group in dataframe

You can use range to get the min and max values and use it in summarise to get two rows for each Name.

library(dplyr)

df %>%
group_by(Name) %>%
summarise(Value = range(Value), .groups = "drop")

# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97

If you have a large dataset, using data.table might be faster:

library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
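A pandas sketch of the same output shape, stacking min and max into separate rows per group (values taken from the output shown above):

```python
import pandas as pd

df = pd.DataFrame({"Name":  ["A", "A", "B", "B", "C", "C"],
                   "Value": [27, 57, 20, 89, 58, 97]})

# agg gives one row per Name with min/max columns; stack() turns
# those columns into separate rows, like range() inside summarise()
out = (df.groupby("Name")["Value"].agg(["min", "max"])
         .stack().reset_index(name="Value"))
```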

dplyr filter: Get rows with minimum of variable, but only the first if multiple minima

Just for completeness: Here's the final dplyr solution, derived from the comments of @hadley and @Arun:

library(dplyr)
df.g <- group_by(df, A)
filter(df.g, rank(x, ties.method="first")==1)
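The rank-with-ties trick has a direct pandas counterpart, rank(method="first") (a sketch assuming the same column names A and x):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2],
                   "x": [5, 5, 7, 6]})

# method="first" breaks ties by position, like ties.method="first" in R,
# so exactly one minimum row per group survives
first_min = df[df.groupby("A")["x"].rank(method="first") == 1]
```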

Python Pandas - filter pandas dataframe to get rows with minimum values in one column for each unique value in another column

Try not to use for loops when using pandas; they are extremely slow compared to pandas' vectorized operations.

Solution 1:

Determine the minimum INVYR for every PlotCode, using .groupby():

min_invyr_per_plotcode = df.groupby('PlotCode', as_index=False)['INVYR'].min()

Then use pd.merge() to do an inner join between your original df and the minimum you just found:

result_df = pd.merge(
df,
min_invyr_per_plotcode,
how='inner',
on=['PlotCode', 'INVYR'],
)

Solution 2:

Again, determine the minimum per group, but now add it as a column to your dataframe; the per-group minimum is broadcast to every row by .groupby().transform():

df['min_per_group'] = (df
.groupby('PlotCode')['INVYR']
.transform('min')
)

Now filter your dataframe where INVYR in a row is equal to the minimum of that group:

df[df['INVYR'] == df['min_per_group']]
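A quick check that both solutions pick the same rows, with hypothetical PlotCode/INVYR data (note both keep all rows that tie for the minimum):

```python
import pandas as pd

df = pd.DataFrame({
    "PlotCode": ["p1", "p1", "p2", "p2", "p2"],
    "INVYR":    [2001, 1999, 2005, 2003, 2003],
})

# Solution 1: inner join on the per-group minimum
mins = df.groupby("PlotCode", as_index=False)["INVYR"].min()
via_merge = pd.merge(df, mins, how="inner", on=["PlotCode", "INVYR"])

# Solution 2: transform broadcasts the minimum back to every row
via_transform = df[df["INVYR"] == df.groupby("PlotCode")["INVYR"].transform("min")]
```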

