Fast Replacing Values in Dataframe in R

Fast replacing values in dataframe in R

Try transforming your df to a matrix.

df <- data.frame(a=rnorm(1000),b=rnorm(1000))
m <- as.matrix(df)
m[m<0] <- 0
df <- as.data.frame(m)

Fastest way to replace multiple values in large data.frame

I looked at some sed commands and I figured I should post what I found just in case someone has a similar issue.

The sed commands that I found to work in terminal are (This creates a new file, but you don't have to create new files)

sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g R.test.txt > R.test.edit.txt

or this works as well in R

system(paste(sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g R.test.txt > R.test.edit.txt))

You can also use the data.table::fread method mentioned by IceCreamToucan

df <- fread("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' /R/R.test.txt")

It interesting to note that typically the sed command you use is

sed 's/old text/new text/g' file > new.file

but since what I needed to replace had a forward slash already / I had to use the + plus sign so sed doesn't get confused.

I am going to do a performance test using my two older methods (posted above), the new sed method, and F. Prive's method that he posted as an answer. I am going to make a smaller subset of the full dataset because it would take too long to test the four methods.

EDIT

So I tested the four different methods out to see which one was fastest. I created a smaller file to test the four methods out. The file I created had 1000000 rows and 340 columns.

METHOD 1

lookup_table <- c("0/0" = 0, "0/1" = 1, "1/1" = 2, "./." = 0.1)
df[-(1:2)] <- lapply(df[-(1:2)], function(x) lookup_table[x])

Runtime - 8 minutes

METHOD 2

replacement<-function(x){
x=replace(x,which(x=='./.'),0.01)
x=replace(x,which(x=='0/0'),0)
x=replace(x,which(x=='0/1'),1)
x=replace(x,which(x=='1/1'),2)
}
df=apply(df,2,replacement)
df <- as.data.frame(df)

Runtime - 46 seconds

METHOD 3

df <- df %>% mutate_at(
vars(- CHROM, - POS),
funs(case_when(
. == "0/0" ~ 0,
. == "0/1" ~ 1,
. == "1/1" ~ 2,
. == "./." ~ 0.01
))
)

Runtime - 42 seconds

METHOD 4

df <- fread("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' /R/R.test.txt")

Runtime - 2 min 34 seconds, which was surprising

Conclusion - I wasted my time

Is there a quick way to replace column values in R?

You can use sprintf:

# create the example used by the OP
dat <- data.frame(Temperature = 31:33,
Height = c(157, 159, 139))

# use sprintf along with seq_len
dat$Height <- sprintf("pic_%05d", seq_len(NROW(dat)))

# show the result
dat
#R> Temperature Height
#R> 1 31 pic_00001
#R> 2 32 pic_00002
#R> 3 33 pic_00003

You can change the 05d if you want more leading zeros. E.g. 07d will give a seven digit sequence. The manual page for sprintf have further details.

faster way to replace values in R data.table

Here is one possible way to solve your problem. Note that values not specified in cases (like .|., etc.) will be become NA)

cases = c(REF = "0/0", REF = "0|0", 
HET = "0/1", HET = "0|1", HET = "1/0", HET = "1|0",
ALT = "1/1", ALT = "1|1")

cols = c("Sample1", "Sample2", "Sample3") # names of the columns from 6 to 8

rawdata[, (cols) := lapply(.SD, function(x) names(cases)[chmatch(x, cases)]), .SDcols=cols]

Replace all particular values in a data frame

Like this:

> df[df==""]<-NA
> df
A B
1 <NA> 12
2 xyz <NA>
3 jkl 100

Replace values in R dataframe based on conditions

Yes! There's a command called replace:

df$Age <- with(df, replace(Age, Age > 90 | Age < 16, NA)) 

R - Very slow loop on data frame to replace values

There are functions available which are built for this particular purpose which is known as last observation carried forward. One of the functions is na.locf()from the zoo package:

With that, the complete issue becomes a one-liner (I'm using data.table here because I'm more fluent in and it's usually faster with larger data.tables):

library(data.table)
setDT(df_2)[order(day_event, day_measure), measure := zoo::na.locf(measure), by = day_event]

Here, the rows are ordered by event date and subsequently be measure date. Then, missing elements are filled by last observation carried forward. In addition, the whole operation is grouped by event date to make sure that no false values are being carried forward if the first measurement in each group already is NA.

This is even faster than the OP's own answer which can be demonstrated by a benchmark (using the microbenchmark package)

Benchmark results

#Unit: milliseconds
# expr min lq mean median uq max neval cld
# loop 20.867890 22.037188 23.052667 22.665122 23.510681 27.535109 100 c
# apply 9.011630 9.498314 9.834324 9.752323 9.994688 12.862594 100 b
# na.locf 1.971389 2.132780 2.211467 2.226080 2.290762 2.656973 100 a

Benchmark code

As all 3 methods change the data in place we need to keep a copy of the original data.

library(data.table)
df_0 <- copy(df_2)
library(tidyr)

microbenchmark::microbenchmark(
loop = {
df_2 <- copy(df_0)
for (i in 1:length(df_2$measure)){
row <- df_2[i,]
if (row$day_event +7 < row$day_measure & length(df_2[df_2$day_event == row$day_event & df_2$day_measure == row$day_event + 7,]$measure)>0){
row$measure<-df_2[df_2$day_event == row$day_event & df_2$day_measure == row$day_event + 7,]$measure
df_2[i,]<-row
}
}
},
apply = {
df_2 <- copy(df_0)
df_temp <- df_2 %>%
dplyr::filter(day_event < day_measure - 7)

df_temp$measure <- apply(X = df_temp
, MARGIN = 1
, FUN = function(x) {
(df_2 %>% dplyr::filter(
day_event == x[[1]] & day_measure == (as.Date(x[[1]], format = "%Y-%m-%d") + 7)
))$measure

})

df_2 <- rbind(df_2 %>% dplyr::filter(day_event >= day_measure - 7)
, df_temp
)

},
na.locf = {
df_2 <- copy(df_0)
df_2[order(day_event, day_measure), measure := zoo::na.locf(measure), by = day_event]
})

Replacing values in data frame in R

With R's capacity for logical indexing using the [<- function, this is really quite easy:

> x$x3[ grepl("bob", x$x1) & x$x2 == 1] <- 1
> x
x1 x2 x3
1 bob 1 1
2 jane 1 22
3 bob 1 1
4 bobby 1 1
5 bob 1 1
6 jane 2 23
7 bobby 2 53
8 bob 2 42
9 jane 2 13
10 bob 2 35

To read the code you should see it as: "for every line of x where column 'x1' has "bob' and column 'x2' is equal to 1 ,... you assign the value 1 to column 'x3'." If you wanted to have a new object with that value, you could make a copy of x with y <- x and working on that instead.

Fast replacing of values in a large vectors

Use:

my.vec[my.vec < -5 | my.vec > 5] <- NA

Or:

my.vec[my.vec < -5 | my.vec > 5] = NA

Or use ifelse:

my.vec = ifelse((my.vec < -5 | my.vec > 5), NA, my.vec)

Replacing values in a dataframe in R

If you want to replace the ? in all the columns, either

 data_clean <- function(data){
data[] <- lapply(data, function(x) replace(x, x=='?',0))
data}
data_clean(df1)
#col_1 col_2
#1 1 <blank>
#2 2 5
#3 3 6
#4 4 0

Or

data_clean2 <- function(data){
data[data=="?"] <- 0
data}
data_clean2(df1)
# col_1 col_2
#1 1 <blank>
#2 2 5
#3 3 6
#4 4 0

data

df1 <-structure(list(col_1 = 1:4, col_2 = c("<blank>", "5", "6", "?"
)), .Names = c("col_1", "col_2"), class = "data.frame",
row.names = c(NA, -4L))


Related Topics



Leave a reply



Submit