Replacing the Duplicate Values Except 1 Row in R Dataframe

Replacing the duplicate values except 1 row in R dataframe

Option 1: A base R method is to use ave() replace the duplicated date values with NA for each group in id.

x$date <- ave(
x$date,
x$id,
FUN = function(a) replace(a, duplicated(a), NA_integer_)
)

which gives the updated x data

  id date
1 p1 d1
2 p1 <NA>
3 p1 <NA>
4 p2 d2
5 p2 <NA>

The method above will work for multiple values in date, replacing duplicates with NA. If it's only the first group value you're after, you could use the code above or the following, which may be faster.

ave(
x$date,
x$id,
FUN = function(a) c(a[1], a[-1][NA])
)

This code takes the first value in each group and replaces all the rest with NA. It's not clear which one you want, since your example data only has one value per id group.

Option 2: An alternative using the data.table package. Since NA is logical, date[NA] simply turns values into NA without changing the type of the data.

library(data.table)
setDT(x)[duplicated(date), date := date[NA], by = id]

which gives

   id date
1: p1 d1
2: p1 NA
3: p1 NA
4: p2 d2
5: p2 NA

replace the duplicate data in a row with NA except the first

Using apply with MARGIN = 1 (row-wise) we can replace duplicated values in a row to NA.

t(apply(df, 1, function(x) replace(x, duplicated(x), NA)))

# [,1] [,2] [,3]
#[1,] 2 4 NA
#[2,] 1 5 NA
#[3,] 3 NA 4

How do I identify duplicates except for one column, and replace that column with max

This should do what you want

iris %>%
group_by(Sepal.Length,
Sepal.Width,
Petal.Length,
Species) %>%
filter(Petal.Width == max(Petal.Width)) %>%
filter(row_number() == 1) %>%
ungroup()

The second filtering is to get rid of duplicates if the Petal.Width is also identical for two entries. Does this work for you?

Remove duplicate values across a few columns but keep rows

Base R way using apply :

cols <- grep('z_\\d+', names(dat))
dat[cols] <- t(apply(dat[cols], 1, function(x) replace(x, duplicated(x), 0)))

# id z_1 z_2 z_3 z_4 z_5 z_6
#1 1 100 20 0 0 23 0
#2 2 290 0 0 0 0 0
#3 3 38 0 0 0 25 0
#4 4 129 0 0 127 0 0
#5 5 0 0 0 38 0 0
#6 6 290 0 98 78 0 9

tidyverse way without reshaping can be done using pmap :

library(tidyverse)

dat %>%
mutate(result = pmap(select(., matches('z_\\d+')), ~{
x <- c(...)
replace(x, duplicated(x), 0)
})) %>%
select(id, result) %>%
unnest_wider(result)

Since tests performed by @thelatemail suggests reshaping is a better option than handling the data rowwise you might want to consider it.

dat %>%
pivot_longer(cols = matches('z_\\d+')) %>%
group_by(id) %>%
mutate(value = replace(value, duplicated(value), 0)) %>%
pivot_wider()

Replace duplicate elements in a column in R

Here is another option. Group by columnA and if we see the first occurrence of A then use A else use B.

library(tidyverse)

data <- tibble(columnA = c(1,2,3,1.1,2.2,3.3,1,2),
columnB =c("a","b","c","d","e","f","g","h"))

data %>%
group_by(columnA) %>%
mutate(columnC = ifelse(row_number() == 1, as.character(columnA), columnB))
#> # A tibble: 8 x 3
#> # Groups: columnA [6]
#> columnA columnB columnC
#> <dbl> <chr> <chr>
#> 1 1 a 1
#> 2 2 b 2
#> 3 3 c 3
#> 4 1.1 d 1.1
#> 5 2.2 e 2.2
#> 6 3.3 f 3.3
#> 7 1 g g
#> 8 2 h h

In R: Duplicate rows except for the first row based on condition

dt[, is.in := if(shifts_missed[1] %% 2 == 0) T else c(F, rep(T, .N-1))
, by = .(names, shift)]
rbind(dt, dt[is.in & shifts_missed != 0])

Adding the extra column part should be obvious.

Remove duplicate rows with certain value in specific column

dplyr with some base R should work for this:

 library(dplyr) 
a <- c(1,2,3,3,3,7)
b <- c(4,5,6,6,6,8)
c <- c("red","green","brown","excluded","orange","excluded")
d <- data.frame(a,b,c)

d <- filter(d, !duplicated(d[,1:2]) | c!="excluded")

Result:
a b c
1 1 4 red
2 2 5 green
3 3 6 brown
4 3 6 orange
5 7 8 excluded

The filter will get rid of anything that should be excluded and not duplicated. I added an example of a none unique exclude to your example('brown') to test as well.

Duplicate values in a single row in dataframe

Using duplicated with apply

apply(df,1,duplicated)
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE FALSE

And replace it with NA

df[t(apply(df,1,duplicated))]=NA
df
label val val1
1 a x z
2 b <NA> <NA>
3 c <NA> d


Related Topics



Leave a reply



Submit