Set Certain Values to Na with Dplyr

Set certain values to NA with dplyr

You can use replace which is a bit faster than ifelse:

dat <-  dat %>% mutate(x = replace(x, x<0, NA))

You can speed it up a bit more by supplying an index to replace using which:

dat <- dat %>% mutate(x = replace(x, which(x<0L), NA))

On my machine, this cut the time to a third, see below.

Here's a little comparison of the different answers, which is only indicative of course:

set.seed(24)
dat <- data.frame(x=rnorm(1e6))
system.time(dat %>% mutate(x = replace(x, x<0, NA)))
User System elapsed
0.03 0.00 0.03
system.time(dat %>% mutate(x=ifelse(x<0,NA,x)))
User System elapsed
0.30 0.00 0.29
system.time(setDT(dat)[x<0,x:=NA])
User System elapsed
0.01 0.00 0.02
system.time(dat$x[dat$x<0] <- NA)
User System elapsed
0.03 0.00 0.03
system.time(dat %>% mutate(x = "is.na<-"(x, x < 0)))
User System elapsed
0.05 0.00 0.05
system.time(dat %>% mutate(x = NA ^ (x < 0) * x))
User System elapsed
0.01 0.00 0.02
system.time(dat %>% mutate(x = replace(x, which(x<0), NA)))
User System elapsed
0.01 0.00 0.01

(I'm using dplyr_0.3.0.2 and data.table_1.9.4)


Since we're always very interested in benchmarking, especially in the course of data.table-vs-dplyr discussions I provide another benchmark of 3 of the answers using microbenchmark and the data by akrun. Note that I modified dplyr1 to be the updated version of my answer:

set.seed(285)
dat1 <- dat <- data.frame(x=sample(-5:5, 1e8, replace=TRUE), y=rnorm(1e8))
dtbl1 <- function() {setDT(dat)[x<0,x:=NA]}
dplr1 <- function() {dat1 %>% mutate(x = replace(x, which(x<0L), NA))}
dplr2 <- function() {dat1 %>% mutate(x = NA ^ (x < 0) * x)}
microbenchmark(dtbl1(), dplr1(), dplr2(), unit='relative', times=20L)
#Unit: relative
# expr min lq median uq max neval
# dtbl1() 1.091208 4.319863 4.194086 4.162326 4.252482 20
# dplr1() 1.000000 1.000000 1.000000 1.000000 1.000000 20
# dplr2() 6.251354 5.529948 5.344294 5.311595 5.190192 20

Setting multiple values to NA with dplyr

We can create a vector of values, then use mutate/across (from dplyr 1.0.0), and replace the values in each of the columns (everything() - to select all column) where it matches the 'vec' (%in%) to NA)

library(dplyr)
vec <- c(96:99, 99999)
dataset %>%
mutate(across(everything(), ~ replace(., . %in% vec, NA)))

Set values to NA per row after specific value in R

The following answer is within dplyr as OP requested, also it is not dependent in the number of X columns (i.e., you can have as many columns as you like, the answer still holds).

library(tidyverse)

df %>%
pivot_longer(cols = c(-ID, -number)) %>%
mutate(is_pos_inconcl = if_else(value %in% c("pos", "inconcl"), 1, 0)) %>%
group_by(ID, number) %>%
mutate(should_na = cumsum(cumsum(is_pos_inconcl))) %>%
mutate(value = if_else(should_na > 1, NA_character_, value)) %>%
select(ID, number, name, value) %>%
pivot_wider(names_from = "name",
values_from = "value")

The trick is to reshape the data to a long form (pivot_longer), then make sure that the condition holds, and use cumsum (twice, since a first positive is ok).
Finally, restructure the data back using pivot_wider.

Conditionally replace values with NA in R

It is case where the column is factor. Convert to character and it should work

library(dplyr)
have %>%
mutate(gender = as.character(gender),
gender = replace(gender, gender == "I Do Not Wish to Disclose", NA))

The change in values in gender is when it gets coerced to its integer storage values

as.integer(factor(c("Male", "Female", "Male")))

set na all values that starts with certain string in dplyr environment is.na(), na_if(), startsWith(), regex

If you're able to do it for one column using mutate, you should be able to do it for multiple columns using mutate_at() or mutate_all(), explained here: https://dplyr.tidyverse.org/reference/mutate_all.html

Without knowing what your data looks like, I think you'd want mutate_all() to modify all columns which have data which matches your condition.

In this example using the iris dataset, we replace all instances of 5 with the word five:

iris %>%
tibble %>%
mutate_all(function(x) str_replace(x, '5', 'five'))

# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<chr> <chr> <chr> <chr> <chr>
1 five.1 3.five 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.five 0.2 setosa
5 five 3.6 1.4 0.2 setosa
6 five.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 five 3.4 1.five 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.five 0.1 setosa

Or like your condition, we can do this only when the string starts with 5, using ^5 regex language (^ indicates the start of the string, and 5 means a 5 at the beginning of the string).

iris %>%
tibble %>%
mutate_all(function(x) str_replace(x, '^5', 'five'))

# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<chr> <chr> <chr> <chr> <chr>
1 five.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 five 3.6 1.4 0.2 setosa
6 five.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 five 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa

Update To change the entire value, if it has a 5 at the start, you just need to change the str_replace function to a function which can change the entire value. In this case, we use an ifelse statement

iris %>%
tibble %>%
mutate_all(function(x) ifelse(str_detect(x, '^5'), 'had_five', x))

# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<chr> <dbl> <chr> <dbl> <int>
1 had_five 3.5 1.4 0.2 1
2 4.9 3 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1
4 4.6 3.1 1.5 0.2 1
5 had_five 3.6 1.4 0.2 1
6 had_five 3.9 1.7 0.4 1
7 4.6 3.4 1.4 0.3 1
8 had_five 3.4 1.5 0.2 1
9 4.4 2.9 1.4 0.2 1
10 4.9 3.1 1.5 0.1 1

Another update From your comments, it sounds like you want to apply the function to only character columns. To do this, you can substitute mutate_all(your_fun) for mutate_if(is.character, your_fun) - as described in the help documentation at the start of this answer (the same info page describes mutate_all, mutate_if and mutate_at).

Using your sample data as an example, we can set anything beginning with '0' to NA. I am confused by your example though - do you want to look for '0' or '0\n(' at the start of the string? Either way, this is how to do it:

# sample data
string <- c("asff", "1\n(", '0asfd', '0\n(asdf)')
num <- c(0,1,2,3)
df <- data.frame(string, num)


# for only a 0 at the start of the string
df %>%
mutate_if(is.character, function(x) ifelse(str_detect(x, '^0'), NA, x))

string num
1 asff 0
2 1\n( 1
3 <NA> 2
4 <NA> 3


# for '0\n(' at the start of the string
df %>%
mutate_if(is.character, function(x) ifelse(str_detect(x, '^0\\n\\('), NA, x))

string num
1 asff 0
2 1\n( 1
3 0asfd 2
4 <NA> 3

How to replace certain values in a specific rows and columns with NA in R?

Since your data structure is 2 dimensional, you can find the indices of the rows containing a specific value first and then use this information.

which(DF$Fruits == "Pineapple")
[1] 3
DF$Weight[which(DF$Fruits == "Pineapple")] <- NA

You should be aware of that which will return a vector, so if you have multiple fruits called "Pineapple" then the previous command will return all indices of them.

replace values with NA in several columns

We may do this in two steps - loop across the columns that have 'VAR' followed by digits (\\d+) in column names, replace the values where the first two characters are not AA or DD to NA, then replace the corresponding DATE column to NA based on the NA in the 'VAR1', 'VAR2' columns

library(dplyr)
library(stringr)
DF %>%
mutate(across(matches("^VAR\\d+$"),
~ replace(., !substr(., 1, 2) %in% c("AA", "DD"), NA)),
across(ends_with("DATE"),
~ replace(., is.na(get(str_remove(cur_column(), "DATE"))), NA)))

-output

# A tibble: 5 × 5
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 <NA> <NA>
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 <NA> <NA> DDCC 2001-01-03
4 4 DDAA 2001-01-04 <NA> <NA>
5 5 <NA> <NA> <NA> <NA>

Is there a way to list values in na_if function in dplyr?

You can use:

df %>%
mutate(
across(
where(is.character),
~if_else(. %in% c("97", "98", "99"), NA_character_, .)
)
)
# A tibble: 5 × 4
a1 a2 a3 a4
<chr> <chr> <chr> <dbl>
1 one R statistics 80
2 three Python Data 80
3 NA NA Programming 76
4 twenty Java NA 83
5 NA NA Science 78

The reason na_if doesn't work here is because ~na_if(., c("97", "98", "99")) is basically equivalent to if_else(. == c("97", "98", "99"), NA_character_, .). In other words, it only compares the vectors in an pairwise fashion. You can see why this is an issue:

> if_else(df$a1 == c("97", "98", "99"), NA_character_, df$a1)
[1] "one" "three" "97" "twenty" NA
Warning message:
In df$a1 == c("97", "98", "99") :
longer object length is not a multiple of shorter object length


Related Topics



Leave a reply



Submit