Remove Duplicates Keeping Entry with Largest Absolute Value

Remove duplicates keeping entry with largest absolute value

First. Sort in the order putting the less desired items last within id groups

 aa <- a[order(a$id, -abs(a$value) ), ] #sort by id and reverse of abs(value)

Then: Remove items after the first within id groups

 aa[ !duplicated(aa$id), ]              # take the first row within each id
id value
2 1 2
4 2 -4
5 3 -5
6 4 6

Remove duplicates, keeping most frequent row

Use table and which.max to extract the mode:

df %>% 
group_by(id, var) %>%
summarise(val = {t <- table(val); names(t)[which.max(t)] })

# A tibble: 2 x 3
# Groups: id [?]
# id var val
# <fct> <fct> <chr>
#1 a b d
#2 a c f

Another way to do this in base R: Create a three way contingency table directly, and then find the max column along the third axis:

apply(table(df), c(1, 2), function(v) names(v)[which.max(v)])

# var
#id b c
# a "d" "f"

Convert this to a data frame:

as.data.frame.table(
apply(table(df), c(1, 2), function(v) names(v)[which.max(v)])
)

# id var Freq
#1 a b d
#2 a c f

Choosing duplicates according to maximum (for multiple columns) in R

You can group by ID, Gender and Race and summarise the Value variables to get their max.

library(dplyr)

df %>%
group_by(ID, Gender, Race) %>%
summarise(across(starts_with('Value'), max, na.rm = TRUE), .groups = "drop")

# ID Gender Race Value1 Value2 Value3
# <int> <chr> <chr> <int> <int> <int>
#1 1 M B 45 76 95
#2 2 F W 67 100 92
#3 3 F A 34 95 90
#4 4 M W 78 37 13
#5 5 M B 80 105 45

Conditionally removing duplicates in Excel based on largest value in a Column

If you don not want to use VBA you can try this:

Select the table and from HOME -> Sort & Filter select Custom Sort.

First sort by word with order : A to Z then Add Level and sort by value with order: Smallest to Largest.

Create a column Key with the following formula (i am assuming you data have headers and start from cell A1. Then in cell C2 put the formula =IF(B2<>B3,0,1) and copy down.

Then copy and paste special as values column C and sort on the column Key Descending or filter on 1

Finally, copy the filtered table and paste it elsewhere then select it all and Data -> Remove Duplicates and your done. Or you can do this last part first and then run the steps above.

Remove duplicates from dataset based on criteria

There are many ways to get the expected output. One option with dplyr would be to group by 'ID', 'Sub' columns, get the top Score observation with top_n, and if there are duplicate rows, use distinct.

library(dplyr)
df1 %>%
group_by(ID, Sub) %>%
top_n(1) %>%
distinct()
ID Sub Score
# (int) (chr) (int)
#1 2 Mat 34
#2 3 Mat 67
#3 4 Mat 22
#4 5 Sci 78
#5 6 Mat 32
#6 1 Mat 56
#7 1 Sci 40

Or with data.table, we convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'ID', 'Sub', we order the 'Score' in descending and subset the first row of each group combination (.SD[1L] or head(.SD, 1) can be used).

library(data.table) 
setDT(df1)[order(-Score), .SD[1L] ,.(ID, Sub)]

Or another option is unique after we order by the columns so that it will only select the first observation for each duplicate.

unique(setDT(df1)[order(ID, Sub,-Score)], by = c('ID', 'Sub'))

Or with base R, we order the columns, and use duplicated to remove the rows that are duplicates for the first two columns.

df2 <- df1[with(df1, order(ID, Sub, -Score)),]
df2[!duplicated(df2[1:2]),]

how to remove unique entry and keep duplicates in R

Another option in base R Using duplicated

dx[dx$ID %in% dx$ID[duplicated(dx$ID)],]

# ID Cat1 Cat2 Cat3 Cat4
# 1 A0001 358 11.2500 37428 0
# 2 A0001 279 14.6875 38605 0
# 5 A0020 367 8.8750 37797 0
# 6 A0020 339 9.6250 39324 0

data.table using duplicated

using duplicated and fromLast version you get :

library(data.table)
setkey(setDT(dx),ID) # or with data.table 1.9.5+: setDT(dx,key="ID")
dx[duplicated(dx) |duplicated(dx,fromLast=T)]

# ID Cat1 Cat2 Cat3 Cat4
# 1: A0001 358 11.2500 37428 0
# 2: A0001 279 14.6875 38605 0
# 3: A0020 367 8.8750 37797 0
# 4: A0020 339 9.6250 39324 0

This can be applied to base R also but I prefer data.table here for syntax sugar.



Related Topics



Leave a reply



Submit