Replacing the duplicate values except 1 row in R dataframe
Option 1: A base R method is to use ave()
replace the duplicated date
values with NA
for each group in id
.
x$date <- ave(
x$date,
x$id,
FUN = function(a) replace(a, duplicated(a), NA_integer_)
)
which gives the updated x
data
id date
1 p1 d1
2 p1 <NA>
3 p1 <NA>
4 p2 d2
5 p2 <NA>
The method above will work for multiple values in date
, replacing duplicates with NA
. If it's only the first group value you're after, you could use the code above or the following, which may be faster.
ave(
x$date,
x$id,
FUN = function(a) c(a[1], a[-1][NA])
)
This code takes the first value in each group and replaces all the rest with NA
. It's not clear which one you want, since your example data only has one value per id
group.
Option 2: An alternative using the data.table package. Since NA
is logical, date[NA]
simply turns values into NA
without changing the type of the data.
library(data.table)
setDT(x)[duplicated(date), date := date[NA], by = id]
which gives
id date
1: p1 d1
2: p1 NA
3: p1 NA
4: p2 d2
5: p2 NA
replace the duplicate data in a row with NA except the first
Using apply
with MARGIN = 1
(row-wise) we can replace
duplicated
values in a row to NA
.
t(apply(df, 1, function(x) replace(x, duplicated(x), NA)))
# [,1] [,2] [,3]
#[1,] 2 4 NA
#[2,] 1 5 NA
#[3,] 3 NA 4
How do I identify duplicates except for one column, and replace that column with max
This should do what you want
iris %>%
group_by(Sepal.Length,
Sepal.Width,
Petal.Length,
Species) %>%
filter(Petal.Width == max(Petal.Width)) %>%
filter(row_number() == 1) %>%
ungroup()
The second filtering is to get rid of duplicates if the Petal.Width
is also identical for two entries. Does this work for you?
Remove duplicate values across a few columns but keep rows
Base R way using apply
:
cols <- grep('z_\\d+', names(dat))
dat[cols] <- t(apply(dat[cols], 1, function(x) replace(x, duplicated(x), 0)))
# id z_1 z_2 z_3 z_4 z_5 z_6
#1 1 100 20 0 0 23 0
#2 2 290 0 0 0 0 0
#3 3 38 0 0 0 25 0
#4 4 129 0 0 127 0 0
#5 5 0 0 0 38 0 0
#6 6 290 0 98 78 0 9
tidyverse
way without reshaping can be done using pmap
:
library(tidyverse)
dat %>%
mutate(result = pmap(select(., matches('z_\\d+')), ~{
x <- c(...)
replace(x, duplicated(x), 0)
})) %>%
select(id, result) %>%
unnest_wider(result)
Since tests performed by @thelatemail suggests reshaping is a better option than handling the data rowwise you might want to consider it.
dat %>%
pivot_longer(cols = matches('z_\\d+')) %>%
group_by(id) %>%
mutate(value = replace(value, duplicated(value), 0)) %>%
pivot_wider()
Replace duplicate elements in a column in R
Here is another option. Group by columnA and if we see the first occurrence of A then use A else use B.
library(tidyverse)
data <- tibble(columnA = c(1,2,3,1.1,2.2,3.3,1,2),
columnB =c("a","b","c","d","e","f","g","h"))
data %>%
group_by(columnA) %>%
mutate(columnC = ifelse(row_number() == 1, as.character(columnA), columnB))
#> # A tibble: 8 x 3
#> # Groups: columnA [6]
#> columnA columnB columnC
#> <dbl> <chr> <chr>
#> 1 1 a 1
#> 2 2 b 2
#> 3 3 c 3
#> 4 1.1 d 1.1
#> 5 2.2 e 2.2
#> 6 3.3 f 3.3
#> 7 1 g g
#> 8 2 h h
In R: Duplicate rows except for the first row based on condition
dt[, is.in := if(shifts_missed[1] %% 2 == 0) T else c(F, rep(T, .N-1))
, by = .(names, shift)]
rbind(dt, dt[is.in & shifts_missed != 0])
Adding the extra column part should be obvious.
Remove duplicate rows with certain value in specific column
dplyr with some base R should work for this:
library(dplyr)
a <- c(1,2,3,3,3,7)
b <- c(4,5,6,6,6,8)
c <- c("red","green","brown","excluded","orange","excluded")
d <- data.frame(a,b,c)
d <- filter(d, !duplicated(d[,1:2]) | c!="excluded")
Result:
a b c
1 1 4 red
2 2 5 green
3 3 6 brown
4 3 6 orange
5 7 8 excluded
The filter will get rid of anything that should be excluded and not duplicated. I added an example of a none unique exclude to your example('brown') to test as well.
Duplicate values in a single row in dataframe
Using duplicated
with apply
apply(df,1,duplicated)
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE FALSE
And replace it with NA
df[t(apply(df,1,duplicated))]=NA
df
label val val1
1 a x z
2 b <NA> <NA>
3 c <NA> d
Related Topics
Multiple Graphs of Each Time Series
Options for Deploying R Models in Production
How to Get Rstudio to Automatically Compile R Markdown Vignettes
Street Address to Geolocation Lat/Long
Odds Ratios Instead of Logits in Stargazer() Latex Output
How to Select Rows from Data.Frame with 2 Conditions
Meaning of Band Width in Ggplot Geom_Smooth Lm
Creating Vector of Results of Repeated Function Calls in R
How to Recreate Same Documenttermmatrix with New (Test) Data
Faster Reading of Time Series from Netcdf
Subset Dataframe Such That All Values in Each Row Are Less Than a Certain Value
Xgboost in R: How Does Xgb.Cv Pass the Optimal Parameters into Xgb.Train
When to Use 'With' Function and Why Is It Good
Methods for Doing Heatmaps, Level/Contour Plots, and Hexagonal Binning
Convert and Save Distance Matrix to a Specific Format
Control Alignment of Two Side-By-Side Plots in Knitr
Time Difference in Years with Lubridate
How to Optimize Read and Write to Subsections of a Matrix in R (Possibly Using Data.Table)