Filter Dataframe by Maximum Values in Each Group

Get the row(s) which have the max value in groups using groupby

In [1]: df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1     3
S3     8
S4    10
S2     7
Name: count
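
Because sort=False was passed, the groups appear in their order of first occurrence (S1, S3, S4, S2) rather than in sorted order.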

To get the indices of the original DF you can do:

In [3]: idx = df.groupby(['Mt'])['count'].transform('max') == df['count']

In [4]: df[idx]
Out[4]:
    Sp  Mt Value  count
0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7

Note that if you have multiple max values per group, all will be returned.
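
For instance, a minimal sketch with made-up data (assuming pandas is imported as pd): two rows of group S1 tie at the max, and both survive the filter.

import pandas as pd

# hypothetical data: group S1 has two rows tied at the max count of 3
tied = pd.DataFrame({'Mt': ['S1', 'S1', 'S2'], 'count': [3, 3, 1]})
idx = tied.groupby('Mt')['count'].transform('max') == tied['count']
print(tied[idx])  # keeps both S1 rows; the single S2 row is its own group max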

Update

On a hail mary chance that this is what the OP is requesting:

In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform('max')

In [6]: df
Out[6]:
    Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          8
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7
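
A natural follow-up (a sketch, assuming the goal is still to filter): with the helper column in place, the filter reduces to a plain comparison.

df[df['count'] == df['count_max']]  # same rows as the transform-based filter above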

Groupby and filter by max value in pandas

You can do this:

latest = (df.query('Value == 1')
            .groupby("ID", as_index=False)
            .max()
            .assign(Latest="Latest"))
pd.merge(df, latest, how="outer")

   Value  ID  Date  Latest
0      1   5  2012     NaN
1      1   5  2013  Latest
2      0  12  2017     NaN
3      0  12  2022     NaN
4      1  27  2005     NaN
5      1  27  2011  Latest

Filter data using max. categorical value of a group in pandas

If all the values other than region are the same for each customer, you can use df.groupby('customer').max(); it's the ['region'] part that restricts the columns to just region. (Also, you can just use customer rather than a list containing customer.) Note that max returns the alphabetically last element; if you want the value from the last row, you'll need something different.
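
A minimal sketch of that difference, with made-up data (the customer and region values are hypothetical):

import pandas as pd

df = pd.DataFrame({'customer': [1, 1, 2, 2],
                   'region': ['B', 'A', 'X', 'Y']})

df.groupby('customer')['region'].max()   # alphabetical max per group: B, Y
df.groupby('customer')['region'].last()  # value from each group's last row: A, Y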

How to select rows with max values in categories?

The only solution that comes to my mind is to:

  • Get the highest day for each ID (using groupBy)
  • Append the value of the highest day to each row (with matching ID) using join
  • Then filter on the rows where the two values match

# select the max value of day for each ID
maxDayForIDs = df.groupBy("ID").max("day").withColumnRenamed("max(day)", "maxDay")

# now add the max day value to each row (with matching ID)
df = df.join(maxDayForIDs, "ID")

# keep only the rows where day equals maxDay
df = df.filter(df.day == df.maxDay)
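
A join-free alternative using a window function is shown in the PySpark section further down.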

How to filter rows and columns based on the maximum value in a Python DataFrame

Try sorting by "Value" and keeping the last row for each country:

>>> df.sort_values("Value").drop_duplicates("country", keep="last")
    Year country  Value
2   2003     USA   7000
6   2002   India   9000
10  2001   Japan  10000

Alternatively, you could use groupby:

>>> df[df["Value"].eq(df.groupby("country")["Value"].transform('max'))]
    Year country  Value
2   2003     USA   7000
6   2002   India   9000
10  2001   Japan  10000
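
One difference worth noting: drop_duplicates keeps exactly one row per country even if several rows tie for the max Value, whereas the transform-based filter keeps every tied row.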

Filter dataframe by maximum values in each group

Here's a simple and fast approach using the data.table package:

library(data.table)
setDT(df)[, .SD[which.max(date)], id]
#    id date
# 1:  1 2012
# 2:  3 2014
# 3:  2 2014

Or (this could be a bit faster because the data.table is keyed by id):

setkey(setDT(df), id)[, .SD[which.max(date)], id]

Or using the OP's idea via the data.table package:

unique(setorder(setDT(df), id, -date), by = "id")

Or

setorder(setDT(df), id, -date)[!duplicated(id)]

Or a base R solution:

with(df, tapply(date, id, function(x) x[which.max(x)]))
##    1    2    3
## 2012 2014 2014

Another way

library(dplyr)
df %>%
  group_by(id) %>%
  filter(date == max(date)) # Will keep all existing columns but allow multiple rows in case of ties
# Source: local data table [3 x 2]
# Groups: id
#
#   id date
# 1  1 2012
# 2  2 2014
# 3  3 2014

Or

df %>%
  group_by(id) %>%
  slice(which.max(date)) # Will keep all columns but won't return multiple rows in case of ties

Or

df %>%
  group_by(id) %>%
  summarise(max(date)) # Will remove all other columns and won't return multiple rows in case of ties

Select the row with the maximum value in each group based on multiple columns in R dplyr

We may get the rowwise max of the 'count' columns with pmax, then, grouped by 'col1', filter the rows where 'Max' equals its group maximum, and finally drop the helper column.

library(dplyr)
df1 %>%
  mutate(Max = pmax(count_col1, count_col2)) %>%
  group_by(col1) %>%
  filter(Max == max(Max)) %>%
  ungroup %>%
  select(-Max)

Output:

# A tibble: 3 × 4
  col1   col2   count_col1 count_col2
  <chr>  <chr>       <dbl>      <dbl>
1 apple  aple            1          4
2 banana banan           4          1
3 banana bananb          4          1

We may also use slice_max:

library(purrr)
df1 %>%
  group_by(col1) %>%
  slice_max(invoke(pmax, across(starts_with("count")))) %>%
  ungroup
# A tibble: 3 × 4
  col1   col2   count_col1 count_col2
  <chr>  <chr>       <dbl>      <dbl>
1 apple  aple            1          4
2 banana banan           4          1
3 banana bananb          4          1

Get rows with largest value in grouping

Use DataFrameGroupBy.idxmax if you need to select only one max value per group:

df = df.loc[df.groupby('id')['value'].idxmax()]
print(df)
    id other_value  value
2    1           b      5
5    2           d      6
7    3           f      4
10   4           e      7

If there are multiple max values per group and you want to select all of them:

df = pd.DataFrame({'id' : [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4],
                   'other_value' : ['a', 'e', 'b', 'b', 'a', 'd', 'b', 'f', 'a', 'c', 'e', 'f'],
                   'value' : [1, 3, 5, 2, 5, 6, 2, 4, 6, 1, 7, 7]})

print(df)
    id other_value  value
0    1           a      1
1    1           e      3
2    1           b      5
3    2           b      2
4    2           a      5
5    2           d      6
6    3           b      2
7    3           f      4
8    4           a      6
9    4           c      1
10   4           e      7
11   4           f      7

df = df[df.groupby('id')['value'].transform('max') == df['value']]
print(df)
    id other_value  value
2    1           b      5
5    2           d      6
7    3           f      4
10   4           e      7
11   4           f      7

GroupBy column and filter rows with maximum value in Pyspark

You can do this without a udf using a Window.

Consider the following example:

import pyspark.sql.functions as f
data = [
    ('a', 5),
    ('a', 8),
    ('a', 7),
    ('b', 1),
    ('b', 3)
]
df = sqlCtx.createDataFrame(data, ["A", "B"])
df.show()
#+---+---+
#|  A|  B|
#+---+---+
#|  a|  5|
#|  a|  8|
#|  a|  7|
#|  b|  1|
#|  b|  3|
#+---+---+

Create a Window to partition by column A and use it to compute the maximum of each group. Then keep only the rows where the value in column B equals the group max.

from pyspark.sql import Window
w = Window.partitionBy('A')
df.withColumn('maxB', f.max('B').over(w))\
    .where(f.col('B') == f.col('maxB'))\
    .drop('maxB')\
    .show()
#+---+---+
#|  A|  B|
#+---+---+
#|  a|  8|
#|  b|  3|
#+---+---+

Or equivalently using pyspark-sql:

df.registerTempTable('table')
q = "SELECT A, B FROM (SELECT *, MAX(B) OVER (PARTITION BY A) AS maxB FROM table) M WHERE B = maxB"
sqlCtx.sql(q).show()
#+---+---+
#|  A|  B|
#+---+---+
#|  b|  3|
#|  a|  8|
#+---+---+
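
If you prefer ranking to computing a max, a window ranking function gives the same result (a sketch; dense_rank keeps every tied row, just like the max-based filter):

from pyspark.sql import Window
import pyspark.sql.functions as f

# rank rows within each A-partition by descending B, then keep rank 1
w = Window.partitionBy('A').orderBy(f.col('B').desc())
df.withColumn('rk', f.dense_rank().over(w))\
    .where(f.col('rk') == 1)\
    .drop('rk')\
    .show()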

