Filter Dataframe by Maximum Values in Each Group

Get the row(s) which have the max value in groups using groupby

In [1]: df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1     3
S3     8
S4    10
S2     7
Name: count
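
Because sort=False was passed, the groups appear in their order of first occurrence (S1, S3, S4, S2) rather than in sorted order.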

To get the indices of the original DF you can do:

In [3]: idx = df.groupby(['Mt'])['count'].transform('max') == df['count']

In [4]: df[idx]
Out[4]:
    Sp  Mt Value  count
0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7

Note that if you have multiple max values per group, all will be returned.
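
For instance, a minimal sketch with made-up data (assuming pandas is imported as pd): two rows of group S1 tie at the max, and both survive the filter.

import pandas as pd

# hypothetical data: group S1 has two rows tied at the max count of 3
tied = pd.DataFrame({'Mt': ['S1', 'S1', 'S2'], 'count': [3, 3, 1]})
idx = tied.groupby('Mt')['count'].transform('max') == tied['count']
print(tied[idx])  # keeps both S1 rows; the single S2 row is its own group max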

Update

On a hail mary chance that this is what the OP is requesting:

In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform('max')

In [6]: df
Out[6]:
    Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          8
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7
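
A natural follow-up (a sketch, assuming the goal is still to filter): with the helper column in place, the filter reduces to a plain comparison.

df[df['count'] == df['count_max']]  # same rows as the transform-based filter above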

Groupby and filter by max value in pandas

You can do this:

latest = (df.query('Value == 1')
            .groupby("ID", as_index=False)
            .max()
            .assign(Latest="Latest"))
pd.merge(df, latest, how="outer")

   Value  ID  Date  Latest
0      1   5  2012     NaN
1      1   5  2013  Latest
2      0  12  2017     NaN
3      0  12  2022     NaN
4      1  27  2005     NaN
5      1  27  2011  Latest

Filter data using max. categorical value of a group in pandas

If all the values other than region are the same for each customer, you can use df.groupby('customer').max(); it's the ['region'] part that restricts the columns to just region. (Also, you can just use customer rather than a list containing customer.) Note that max returns the alphabetically last element; if you want the value from the last row, you'll need something different.
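
A minimal sketch of that difference, with made-up data (the customer and region values are hypothetical):

import pandas as pd

df = pd.DataFrame({'customer': [1, 1, 2, 2],
                   'region': ['B', 'A', 'X', 'Y']})

df.groupby('customer')['region'].max()   # alphabetical max per group: B, Y
df.groupby('customer')['region'].last()  # value from each group's last row: A, Y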

How to select rows with max values in categories?

The only solution that comes to my mind is to:

  • Get the highest day for each ID (using groupBy)
  • Append the value of the highest day to each row (with matching ID) using join
  • Then filter on the rows where the two values match

# select the max value of day for each ID
maxDayForIDs = df.groupBy("ID").max("day").withColumnRenamed("max(day)", "maxDay")

# now add the max day value to each row (with matching ID)
df = df.join(maxDayForIDs, "ID")

# keep only the rows where day equals maxDay
df = df.filter(df.day == df.maxDay)
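
A join-free alternative using a window function is shown in the PySpark section further down.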

How to filter rows and columns based on the maximum value in a Python DataFrame

Try sorting by "Value" and keeping the last row for each country:

>>> df.sort_values("Value").drop_duplicates("country", keep="last")
    Year country  Value
2   2003     USA   7000
6   2002   India   9000
10  2001   Japan  10000

Alternatively, you could use groupby:

>>> df[df["Value"].eq(df.groupby("country")["Value"].transform('max'))]
    Year country  Value
2   2003     USA   7000
6   2002   India   9000
10  2001   Japan  10000
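
One difference worth noting: drop_duplicates keeps exactly one row per country even if several rows tie for the max Value, whereas the transform-based filter keeps every tied row.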

Filter dataframe by maximum values in each group

Here's a simple and fast approach using the data.table package:

library(data.table)
setDT(df)[, .SD[which.max(date)], id]
#    id date
# 1:  1 2012
# 2:  3 2014
# 3:  2 2014

Or (this could be a bit faster because the data.table is keyed by id):

setkey(setDT(df), id)[, .SD[which.max(date)], id]

Or using the OP's idea via the data.table package:

unique(setorder(setDT(df), id, -date), by = "id")

Or

setorder(setDT(df), id, -date)[!duplicated(id)]

Or a base R solution:

with(df, tapply(date, id, function(x) x[which.max(x)]))
##    1    2    3
## 2012 2014 2014

Another way

library(dplyr)
df %>%
  group_by(id) %>%
  filter(date == max(date)) # Will keep all existing columns but allow multiple rows in case of ties
# Source: local data table [3 x 2]
# Groups: id
#
#   id date
# 1  1 2012
# 2  2 2014
# 3  3 2014

Or

df %>%
  group_by(id) %>%
  slice(which.max(date)) # Will keep all columns but won't return multiple rows in case of ties

Or

df %>%
  group_by(id) %>%
  summarise(max(date)) # Will remove all other columns and won't return multiple rows in case of ties

Select the row with the maximum value in each group based on multiple columns in R dplyr

We may get the rowwise max of the 'count' columns with pmax, then, grouped by 'col1', filter the rows where 'Max' equals its group maximum, and finally drop the helper column.

library(dplyr)
df1 %>%
  mutate(Max = pmax(count_col1, count_col2)) %>%
  group_by(col1) %>%
  filter(Max == max(Max)) %>%
  ungroup %>%
  select(-Max)

Output:

# A tibble: 3 × 4
  col1   col2   count_col1 count_col2
  <chr>  <chr>       <dbl>      <dbl>
1 apple  aple            1          4
2 banana banan           4          1
3 banana bananb          4          1

We may also use slice_max:

library(purrr)
df1 %>%
  group_by(col1) %>%
  slice_max(invoke(pmax, across(starts_with("count")))) %>%
  ungroup
# A tibble: 3 × 4
  col1   col2   count_col1 count_col2
  <chr>  <chr>       <dbl>      <dbl>
1 apple  aple            1          4
2 banana banan           4          1
3 banana bananb          4          1

Get rows with largest value in grouping

Use DataFrameGroupBy.idxmax if you need to select only one max value per group:

df = df.loc[df.groupby('id')['value'].idxmax()]
print(df)
    id other_value  value
2    1           b      5
5    2           d      6
7    3           f      4
10   4           e      7

If there are multiple max values per group and you want to select all of them:

df = pd.DataFrame({'id' : [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4],
                   'other_value' : ['a', 'e', 'b', 'b', 'a', 'd', 'b', 'f', 'a', 'c', 'e', 'f'],
                   'value' : [1, 3, 5, 2, 5, 6, 2, 4, 6, 1, 7, 7]})

print(df)
    id other_value  value
0    1           a      1
1    1           e      3
2    1           b      5
3    2           b      2
4    2           a      5
5    2           d      6
6    3           b      2
7    3           f      4
8    4           a      6
9    4           c      1
10   4           e      7
11   4           f      7

df = df[df.groupby('id')['value'].transform('max') == df['value']]
print(df)
    id other_value  value
2    1           b      5
5    2           d      6
7    3           f      4
10   4           e      7
11   4           f      7

GroupBy column and filter rows with maximum value in Pyspark

You can do this without a udf using a Window.

Consider the following example:

import pyspark.sql.functions as f
data = [
    ('a', 5),
    ('a', 8),
    ('a', 7),
    ('b', 1),
    ('b', 3)
]
df = sqlCtx.createDataFrame(data, ["A", "B"])
df.show()
#+---+---+
#|  A|  B|
#+---+---+
#|  a|  5|
#|  a|  8|
#|  a|  7|
#|  b|  1|
#|  b|  3|
#+---+---+

Create a Window to partition by column A and use it to compute the maximum of each group. Then keep only the rows where the value in column B equals the group max.

from pyspark.sql import Window
w = Window.partitionBy('A')
df.withColumn('maxB', f.max('B').over(w))\
    .where(f.col('B') == f.col('maxB'))\
    .drop('maxB')\
    .show()
#+---+---+
#|  A|  B|
#+---+---+
#|  a|  8|
#|  b|  3|
#+---+---+

Or equivalently using pyspark-sql:

df.registerTempTable('table')
q = "SELECT A, B FROM (SELECT *, MAX(B) OVER (PARTITION BY A) AS maxB FROM table) M WHERE B = maxB"
sqlCtx.sql(q).show()
#+---+---+
#|  A|  B|
#+---+---+
#|  b|  3|
#|  a|  8|
#+---+---+
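
If you prefer ranking to computing a max, a window ranking function gives the same result (a sketch; dense_rank keeps every tied row, just like the max-based filter):

from pyspark.sql import Window
import pyspark.sql.functions as f

# rank rows within each A-partition by descending B, then keep rank 1
w = Window.partitionBy('A').orderBy(f.col('B').desc())
df.withColumn('rk', f.dense_rank().over(w))\
    .where(f.col('rk') == 1)\
    .drop('rk')\
    .show()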

