Get rows of unique values by group
data.table is a bit different in how duplicated is used. Here's the approach I've seen around here somewhere before:
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
setkey(dt, "y", "x")
key(dt)
# [1] "y" "x"
!duplicated(dt)
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
dt[!duplicated(dt)]
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 1 6
# 4: b 2 5
# 5: b 3 4
How to group rows into a unique row for unique column values?
We could group by 'a' and 'c', then summarise the unique elements of 'b' into a string:
library(dplyr)
df %>%
group_by(a, c) %>%
summarise(b = sprintf('[%s]', toString(unique(b))), .groups = 'drop') %>%
select(names(df))
Output:
# A tibble: 3 x 3
# a b c
# <chr> <chr> <dbl>
#1 A1 [a, b, c] 1
#2 A2 [d, e] 1
#3 A3 [f] 1
Or if the 'c' values are also changing, use across
df %>%
group_by(a) %>%
summarise(across(everything(), ~ sprintf('[%s]', toString(unique(.)))), .groups = 'drop')
Or if we need a list
df %>%
group_by(a) %>%
summarise(across(everything(), ~ list(unique(.))), .groups = 'drop')
Or using glue
df %>%
group_by(a, c) %>%
summarise(b = glue::glue('[{toString(unique(b))}]'), .groups = 'drop')
Output:
# A tibble: 3 x 3
# a c b
#* <chr> <dbl> <glue>
#1 A1 1 [a, b, c]
#2 A2 1 [d, e]
#3 A3 1 [f]
Get the first row of each group of unique values in another column
Use groupby + first:
firsts = df.groupby('col_B', as_index=False).first()
Output:
>>> firsts
col_B col_A
0 x 1
1 xx 2
2 y 4
If the order of the columns is important, reindex the result with the original column order:
firsts = df.groupby('col_B', as_index=False).first()[df.columns]
Output:
>>> firsts
col_A col_B
0 1 x
1 2 xx
2 4 y
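As a runnable sketch: the original df isn't shown, so the frame below is reconstructed from the output above.

```python
import pandas as pd

# Reconstructed frame -- the original df is not shown, so this is
# inferred from the outputs above
df = pd.DataFrame({'col_A': [1, 2, 3, 4],
                   'col_B': ['x', 'xx', 'xx', 'y']})

# First row of each col_B group; as_index=False keeps col_B as a column
firsts = df.groupby('col_B', as_index=False).first()
print(firsts)
#   col_B  col_A
# 0     x      1
# 1    xx      2
# 2     y      4

# Restore the original column order if it matters
firsts = firsts[df.columns]
```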
Get rows based on distinct values from one column
Use drop_duplicates, specifying column COL2 to check for duplicates:
df = df.drop_duplicates('COL2')
#same as
#df = df.drop_duplicates('COL2', keep='first')
print (df)
COL1 COL2
0 a.com 22
1 b.com 45
2 c.com 34
4 f.com 56
You can also keep only last values:
df = df.drop_duplicates('COL2', keep='last')
print (df)
COL1 COL2
2 c.com 34
4 f.com 56
5 g.com 22
6 h.com 45
Or remove all duplicates:
df = df.drop_duplicates('COL2', keep=False)
print (df)
COL1 COL2
2 c.com 34
4 f.com 56
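As a self-contained sketch, here is a frame consistent with the three outputs above (the original df isn't shown, and the COL2 value for 'd.com' is inferred; any duplicate of an earlier value would fit):

```python
import pandas as pd

# Reconstructed from the outputs above; the COL2 value for 'd.com'
# is inferred (any duplicate of an earlier value would fit)
df = pd.DataFrame({'COL1': ['a.com', 'b.com', 'c.com', 'd.com',
                            'f.com', 'g.com', 'h.com'],
                   'COL2': [22, 45, 34, 22, 56, 22, 45]})

first = df.drop_duplicates('COL2')               # keep first occurrence
last = df.drop_duplicates('COL2', keep='last')   # keep last occurrence
none = df.drop_duplicates('COL2', keep=False)    # drop all duplicated values
```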
pandas: how to select unique rows in group
Using .unique()
grouped_df['column_1'].unique()
Or without unique you could do something like:
grouped_df['column_1'].apply(list).apply(set)
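A minimal sketch, assuming grouped_df is an ordinary groupby object (the data below is made up):

```python
import pandas as pd

# Made-up data; the original grouped_df is not shown
df = pd.DataFrame({'group': ['g1', 'g1', 'g2', 'g2'],
                   'column_1': ['a', 'a', 'b', 'c']})
grouped_df = df.groupby('group')

# .unique() returns an array of the distinct values in each group
uniques = grouped_df['column_1'].unique()

# The apply(list).apply(set) variant yields a set per group instead
sets = grouped_df['column_1'].apply(list).apply(set)
```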
How to get unique values from multiple columns in a pandas groupby
You can do it with apply:
import numpy as np
g = df.groupby('c')[['l1', 'l2']].apply(lambda x: list(np.unique(x)))
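For example, with made-up data (note the double brackets when selecting l1 and l2: list selection with single brackets was removed in recent pandas):

```python
import numpy as np
import pandas as pd

# Made-up frame; the original df is not shown
df = pd.DataFrame({'c':  [1, 1, 2],
                   'l1': ['a', 'b', 'c'],
                   'l2': ['b', 'b', 'd']})

# Pooled unique values across l1 and l2, one sorted list per group of c
g = df.groupby('c')[['l1', 'l2']].apply(lambda x: list(np.unique(x)))
print(g)
```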
Count distinct values depending on group
You would use count(distinct):
select "group", count(distinct id)
from t
group by "group";
Note that group is a very poor name for a column because it is a SQL keyword. Hopefully the real column name is something more reasonable.
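The same pattern can be demonstrated with sqlite3 from the Python standard library (the table and its data are made up):

```python
import sqlite3

# Made-up table; "group" must stay quoted because it is a SQL keyword
con = sqlite3.connect(':memory:')
con.execute('create table t ("group" text, id integer)')
con.executemany('insert into t values (?, ?)',
                [('g1', 1), ('g1', 1), ('g1', 2), ('g2', 3)])

# Count distinct ids per group: repeated ids are counted once
rows = con.execute(
    'select "group", count(distinct id) from t group by "group"'
).fetchall()
# g1 has two distinct ids (1, 2); g2 has one (3)
```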
SQL - Select unique rows from a group of results
You want a subquery (derived table), which not all SQL dialects support. In T-SQL you'd have something like:
select r.registration, r.recent, t.id, t.unittype
from (
    select registration, max([date]) as recent
    from @tmp
    group by registration
) r
left outer join @tmp t
    on r.recent = t.[date]
    and r.registration = t.registration
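A sketch of the same derived-table pattern using sqlite3 (the table name, columns, and data are invented; @tmp becomes an ordinary table and the [date] bracket quoting is not needed in SQLite):

```python
import sqlite3

# Invented data: registration ABC has two rows, XYZ has one
con = sqlite3.connect(':memory:')
con.execute('create table tmp (id integer, registration text,'
            ' date text, unittype text)')
con.executemany('insert into tmp values (?, ?, ?, ?)', [
    (1, 'ABC', '2020-01-01', 'car'),
    (2, 'ABC', '2020-06-01', 'van'),
    (3, 'XYZ', '2020-03-01', 'truck'),
])

# Derived table r picks the most recent date per registration,
# then joins back to fetch the rest of that row
rows = con.execute('''
    select r.registration, r.recent, t.id, t.unittype
    from (select registration, max(date) as recent
          from tmp
          group by registration) r
    left outer join tmp t
      on r.recent = t.date and r.registration = t.registration
''').fetchall()
```

Note the usual caveat with this pattern: if two rows of the same registration share the max date, both are returned.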