Set certain values to NA with dplyr
You can use replace
which is a bit faster than ifelse
:
dat <- dat %>% mutate(x = replace(x, x<0, NA))
You can speed it up a bit more by supplying an index to replace
using which
:
dat <- dat %>% mutate(x = replace(x, which(x<0L), NA))
On my machine, this cut the time to a third, see below.
Here's a little comparison of the different answers, which is only indicative of course:
set.seed(24)
dat <- data.frame(x=rnorm(1e6))
system.time(dat %>% mutate(x = replace(x, x<0, NA)))
User System elapsed
0.03 0.00 0.03
system.time(dat %>% mutate(x=ifelse(x<0,NA,x)))
User System elapsed
0.30 0.00 0.29
system.time(setDT(dat)[x<0,x:=NA])
User System elapsed
0.01 0.00 0.02
system.time(dat$x[dat$x<0] <- NA)
User System elapsed
0.03 0.00 0.03
system.time(dat %>% mutate(x = "is.na<-"(x, x < 0)))
User System elapsed
0.05 0.00 0.05
system.time(dat %>% mutate(x = NA ^ (x < 0) * x))
User System elapsed
0.01 0.00 0.02
system.time(dat %>% mutate(x = replace(x, which(x<0), NA)))
User System elapsed
0.01 0.00 0.01
(I'm using dplyr_0.3.0.2 and data.table_1.9.4)
Since we're always very interested in benchmarking, especially in the course of data.table-vs-dplyr discussions I provide another benchmark of 3 of the answers using microbenchmark and the data by akrun. Note that I modified dplyr1
to be the updated version of my answer:
set.seed(285)
dat1 <- dat <- data.frame(x=sample(-5:5, 1e8, replace=TRUE), y=rnorm(1e8))
dtbl1 <- function() {setDT(dat)[x<0,x:=NA]}
dplr1 <- function() {dat1 %>% mutate(x = replace(x, which(x<0L), NA))}
dplr2 <- function() {dat1 %>% mutate(x = NA ^ (x < 0) * x)}
microbenchmark(dtbl1(), dplr1(), dplr2(), unit='relative', times=20L)
#Unit: relative
# expr min lq median uq max neval
# dtbl1() 1.091208 4.319863 4.194086 4.162326 4.252482 20
# dplr1() 1.000000 1.000000 1.000000 1.000000 1.000000 20
# dplr2() 6.251354 5.529948 5.344294 5.311595 5.190192 20
Setting multiple values to NA with dplyr
We can create a vector
of values, then use mutate/across
(from dplyr
1.0.0), and replace the values in each of the columns (everything()
- to select all column) where it matches the 'vec' (%in%
) to NA
)
library(dplyr)
vec <- c(96:99, 99999)
dataset %>%
mutate(across(everything(), ~ replace(., . %in% vec, NA)))
Set values to NA per row after specific value in R
The following answer is within dplyr
as OP requested, also it is not dependent in the number of X
columns (i.e., you can have as many columns as you like, the answer still holds).
library(tidyverse)
df %>%
pivot_longer(cols = c(-ID, -number)) %>%
mutate(is_pos_inconcl = if_else(value %in% c("pos", "inconcl"), 1, 0)) %>%
group_by(ID, number) %>%
mutate(should_na = cumsum(cumsum(is_pos_inconcl))) %>%
mutate(value = if_else(should_na > 1, NA_character_, value)) %>%
select(ID, number, name, value) %>%
pivot_wider(names_from = "name",
values_from = "value")
The trick is to reshape the data to a long form (pivot_longer
), then make sure that the condition holds, and use cumsum
(twice, since a first positive is ok).
Finally, restructure the data back using pivot_wider
.
Conditionally replace values with NA in R
It is case where the column is factor
. Convert to character
and it should work
library(dplyr)
have %>%
mutate(gender = as.character(gender),
gender = replace(gender, gender == "I Do Not Wish to Disclose", NA))
The change in values in gender
is when it gets coerced to its integer storage values
as.integer(factor(c("Male", "Female", "Male")))
set na all values that starts with certain string in dplyr environment is.na(), na_if(), startsWith(), regex
If you're able to do it for one column using mutate
, you should be able to do it for multiple columns using mutate_at()
or mutate_all()
, explained here: https://dplyr.tidyverse.org/reference/mutate_all.html
Without knowing what your data looks like, I think you'd want mutate_all()
to modify all columns which have data which matches your condition.
In this example using the iris
dataset, we replace all instances of 5
with the word five
:
iris %>%
tibble %>%
mutate_all(function(x) str_replace(x, '5', 'five'))
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<chr> <chr> <chr> <chr> <chr>
1 five.1 3.five 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.five 0.2 setosa
5 five 3.6 1.4 0.2 setosa
6 five.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 five 3.4 1.five 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.five 0.1 setosa
Or like your condition, we can do this only when the string starts with 5
, using ^5
regex language (^
indicates the start of the string, and 5
means a 5
at the beginning of the string).
iris %>%
tibble %>%
mutate_all(function(x) str_replace(x, '^5', 'five'))
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<chr> <chr> <chr> <chr> <chr>
1 five.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 five 3.6 1.4 0.2 setosa
6 five.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 five 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Update To change the entire value, if it has a 5
at the start, you just need to change the str_replace
function to a function which can change the entire value. In this case, we use an ifelse
statement
iris %>%
tibble %>%
mutate_all(function(x) ifelse(str_detect(x, '^5'), 'had_five', x))
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<chr> <dbl> <chr> <dbl> <int>
1 had_five 3.5 1.4 0.2 1
2 4.9 3 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1
4 4.6 3.1 1.5 0.2 1
5 had_five 3.6 1.4 0.2 1
6 had_five 3.9 1.7 0.4 1
7 4.6 3.4 1.4 0.3 1
8 had_five 3.4 1.5 0.2 1
9 4.4 2.9 1.4 0.2 1
10 4.9 3.1 1.5 0.1 1
Another update From your comments, it sounds like you want to apply the function to only character columns. To do this, you can substitute mutate_all(your_fun)
for mutate_if(is.character, your_fun)
- as described in the help documentation at the start of this answer (the same info page describes mutate_all
, mutate_if
and mutate_at
).
Using your sample data as an example, we can set anything beginning with '0'
to NA. I am confused by your example though - do you want to look for '0'
or '0\n('
at the start of the string? Either way, this is how to do it:
# sample data
string <- c("asff", "1\n(", '0asfd', '0\n(asdf)')
num <- c(0,1,2,3)
df <- data.frame(string, num)
# for only a 0 at the start of the string
df %>%
mutate_if(is.character, function(x) ifelse(str_detect(x, '^0'), NA, x))
string num
1 asff 0
2 1\n( 1
3 <NA> 2
4 <NA> 3
# for '0\n(' at the start of the string
df %>%
mutate_if(is.character, function(x) ifelse(str_detect(x, '^0\\n\\('), NA, x))
string num
1 asff 0
2 1\n( 1
3 0asfd 2
4 <NA> 3
How to replace certain values in a specific rows and columns with NA in R?
Since your data structure is 2 dimensional, you can find the indices of the rows containing a specific value first and then use this information.
which(DF$Fruits == "Pineapple")
[1] 3
DF$Weight[which(DF$Fruits == "Pineapple")] <- NA
You should be aware of that which
will return a vector, so if you have multiple fruits called "Pineapple" then the previous command will return all indices of them.
replace values with NA in several columns
We may do this in two steps - loop across
the columns that have 'VAR' followed by digits (\\d+
) in column names, replace
the values where the first two characters are not AA
or DD
to NA
, then replace
the corresponding DATE
column to NA
based on the NA
in the 'VAR1', 'VAR2' columns
library(dplyr)
library(stringr)
DF %>%
mutate(across(matches("^VAR\\d+$"),
~ replace(., !substr(., 1, 2) %in% c("AA", "DD"), NA)),
across(ends_with("DATE"),
~ replace(., is.na(get(str_remove(cur_column(), "DATE"))), NA)))
-output
# A tibble: 5 × 5
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 <NA> <NA>
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 <NA> <NA> DDCC 2001-01-03
4 4 DDAA 2001-01-04 <NA> <NA>
5 5 <NA> <NA> <NA> <NA>
Is there a way to list values in na_if function in dplyr?
You can use:
df %>%
mutate(
across(
where(is.character),
~if_else(. %in% c("97", "98", "99"), NA_character_, .)
)
)
# A tibble: 5 × 4
a1 a2 a3 a4
<chr> <chr> <chr> <dbl>
1 one R statistics 80
2 three Python Data 80
3 NA NA Programming 76
4 twenty Java NA 83
5 NA NA Science 78
The reason na_if
doesn't work here is because ~na_if(., c("97", "98", "99"))
is basically equivalent to if_else(. == c("97", "98", "99"), NA_character_, .)
. In other words, it only compares the vectors in an pairwise fashion. You can see why this is an issue:
> if_else(df$a1 == c("97", "98", "99"), NA_character_, df$a1)
[1] "one" "three" "97" "twenty" NA
Warning message:
In df$a1 == c("97", "98", "99") :
longer object length is not a multiple of shorter object length
Related Topics
R Function Not Returning Values
How to Randomize (Or Permute) a Dataframe Rowwise and Columnwise
How to Change the Color Value of Just One Value in Ggplot2's Scale_Fill_Brewer
Speed Up Plot() Function for Large Dataset
Format for Ordinal Dates (Day of Month with Suffixes -St, -Nd, -Rd, -Th)
Extract Prediction Band from Lme Fit
Adding Labels to Ggplot Bar Chart
Rgdal Installation Failed on Ubuntu 16.04
How to Delete Everything After Nth Delimiter in R
How to Create Two Independent Drill Down Plot Using Highcharter
Linear Regression with a Known Fixed Intercept in R
Same Function Over Multiple Data Frames in R
Using Stargazer with Rstudio and Knitr
Forward and Backward Fill Data Frame in R