Fast replacing values in dataframe in R
Try transforming your df to a matrix.
df <- data.frame(a=rnorm(1000),b=rnorm(1000))
m <- as.matrix(df)
m[m<0] <- 0
df <- as.data.frame(m)
Fastest way to replace multiple values in large data.frame
I looked at some sed
commands and I figured I should post what I found just in case someone has a similar issue.
The sed
commands that I found to work in terminal are (This creates a new file, but you don't have to create new files)
sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g R.test.txt > R.test.edit.txt
or this works as well in R
system(paste(sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g R.test.txt > R.test.edit.txt))
You can also use the data.table::fread
method mentioned by IceCreamToucan
df <- fread("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' /R/R.test.txt")
It interesting to note that typically the sed
command you use is
sed 's/old text/new text/g' file > new.file
but since what I needed to replace had a forward slash already /
I had to use the +
plus sign so sed
doesn't get confused.
I am going to do a performance test using my two older methods (posted above), the new sed
method, and F. Prive's method that he posted as an answer. I am going to make a smaller subset of the full dataset because it would take too long to test the four methods.
EDIT
So I tested the four different methods out to see which one was fastest. I created a smaller file to test the four methods out. The file I created had 1000000 rows and 340 columns.
METHOD 1
lookup_table <- c("0/0" = 0, "0/1" = 1, "1/1" = 2, "./." = 0.1)
df[-(1:2)] <- lapply(df[-(1:2)], function(x) lookup_table[x])
Runtime - 8 minutes
METHOD 2
replacement<-function(x){
x=replace(x,which(x=='./.'),0.01)
x=replace(x,which(x=='0/0'),0)
x=replace(x,which(x=='0/1'),1)
x=replace(x,which(x=='1/1'),2)
}
df=apply(df,2,replacement)
df <- as.data.frame(df)
Runtime - 46 seconds
METHOD 3
df <- df %>% mutate_at(
vars(- CHROM, - POS),
funs(case_when(
. == "0/0" ~ 0,
. == "0/1" ~ 1,
. == "1/1" ~ 2,
. == "./." ~ 0.01
))
)
Runtime - 42 seconds
METHOD 4
df <- fread("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' /R/R.test.txt")
Runtime - 2 min 34 seconds, which was surprising
Conclusion - I wasted my time
Is there a quick way to replace column values in R?
You can use sprintf
:
# create the example used by the OP
dat <- data.frame(Temperature = 31:33,
Height = c(157, 159, 139))
# use sprintf along with seq_len
dat$Height <- sprintf("pic_%05d", seq_len(NROW(dat)))
# show the result
dat
#R> Temperature Height
#R> 1 31 pic_00001
#R> 2 32 pic_00002
#R> 3 33 pic_00003
You can change the 05d
if you want more leading zeros. E.g. 07d
will give a seven digit sequence. The manual page for sprintf
have further details.
faster way to replace values in R data.table
Here is one possible way to solve your problem. Note that values not specified in cases
(like .|.
, etc.) will be become NA
)
cases = c(REF = "0/0", REF = "0|0",
HET = "0/1", HET = "0|1", HET = "1/0", HET = "1|0",
ALT = "1/1", ALT = "1|1")
cols = c("Sample1", "Sample2", "Sample3") # names of the columns from 6 to 8
rawdata[, (cols) := lapply(.SD, function(x) names(cases)[chmatch(x, cases)]), .SDcols=cols]
Replace all particular values in a data frame
Like this:
> df[df==""]<-NA
> df
A B
1 <NA> 12
2 xyz <NA>
3 jkl 100
Replace values in R dataframe based on conditions
Yes! There's a command called replace
:
df$Age <- with(df, replace(Age, Age > 90 | Age < 16, NA))
R - Very slow loop on data frame to replace values
There are functions available which are built for this particular purpose which is known as last observation carried forward. One of the functions is na.locf()
from the zoo
package:
With that, the complete issue becomes a one-liner (I'm using data.table
here because I'm more fluent in and it's usually faster with larger data.tables):
library(data.table)
setDT(df_2)[order(day_event, day_measure), measure := zoo::na.locf(measure), by = day_event]
Here, the rows are ordered by event date and subsequently be measure date. Then, missing elements are filled by last observation carried forward. In addition, the whole operation is grouped by event date to make sure that no false values are being carried forward if the first measurement in each group already is NA
.
This is even faster than the OP's own answer which can be demonstrated by a benchmark (using the microbenchmark
package)
Benchmark results
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# loop 20.867890 22.037188 23.052667 22.665122 23.510681 27.535109 100 c
# apply 9.011630 9.498314 9.834324 9.752323 9.994688 12.862594 100 b
# na.locf 1.971389 2.132780 2.211467 2.226080 2.290762 2.656973 100 a
Benchmark code
As all 3 methods change the data in place we need to keep a copy of the original data.
library(data.table)
df_0 <- copy(df_2)
library(tidyr)
microbenchmark::microbenchmark(
loop = {
df_2 <- copy(df_0)
for (i in 1:length(df_2$measure)){
row <- df_2[i,]
if (row$day_event +7 < row$day_measure & length(df_2[df_2$day_event == row$day_event & df_2$day_measure == row$day_event + 7,]$measure)>0){
row$measure<-df_2[df_2$day_event == row$day_event & df_2$day_measure == row$day_event + 7,]$measure
df_2[i,]<-row
}
}
},
apply = {
df_2 <- copy(df_0)
df_temp <- df_2 %>%
dplyr::filter(day_event < day_measure - 7)
df_temp$measure <- apply(X = df_temp
, MARGIN = 1
, FUN = function(x) {
(df_2 %>% dplyr::filter(
day_event == x[[1]] & day_measure == (as.Date(x[[1]], format = "%Y-%m-%d") + 7)
))$measure
})
df_2 <- rbind(df_2 %>% dplyr::filter(day_event >= day_measure - 7)
, df_temp
)
},
na.locf = {
df_2 <- copy(df_0)
df_2[order(day_event, day_measure), measure := zoo::na.locf(measure), by = day_event]
})
Replacing values in data frame in R
With R's capacity for logical indexing using the [<-
function, this is really quite easy:
> x$x3[ grepl("bob", x$x1) & x$x2 == 1] <- 1
> x
x1 x2 x3
1 bob 1 1
2 jane 1 22
3 bob 1 1
4 bobby 1 1
5 bob 1 1
6 jane 2 23
7 bobby 2 53
8 bob 2 42
9 jane 2 13
10 bob 2 35
To read the code you should see it as: "for every line of x
where column 'x1' has "bob' and column 'x2' is equal to 1 ,... you assign the value 1 to column 'x3'." If you wanted to have a new object with that value, you could make a copy of x with y <- x
and working on that instead.
Fast replacing of values in a large vectors
Use:
my.vec[my.vec < -5 | my.vec > 5] <- NA
Or:
my.vec[my.vec < -5 | my.vec > 5] = NA
Or use ifelse
:
my.vec = ifelse((my.vec < -5 | my.vec > 5), NA, my.vec)
Replacing values in a dataframe in R
If you want to replace the ?
in all the columns, either
data_clean <- function(data){
data[] <- lapply(data, function(x) replace(x, x=='?',0))
data}
data_clean(df1)
#col_1 col_2
#1 1 <blank>
#2 2 5
#3 3 6
#4 4 0
Or
data_clean2 <- function(data){
data[data=="?"] <- 0
data}
data_clean2(df1)
# col_1 col_2
#1 1 <blank>
#2 2 5
#3 3 6
#4 4 0
data
df1 <-structure(list(col_1 = 1:4, col_2 = c("<blank>", "5", "6", "?"
)), .Names = c("col_1", "col_2"), class = "data.frame",
row.names = c(NA, -4L))
Related Topics
Avoiding Type Conflicts with Dplyr::Case_When
Get Filename and Path of 'Source'D File
How to Leave the R Browser() Mode in the Console Window
Linear Regression and Storing Results in Data Frame
Rollmean with Dplyr and Magrittr
Combining Elements of List of Lists by Index
Match and Replace Multiple Strings in a Vector of Text Without Looping in R
Using Filter_ in Dplyr Where Both Field and Value Are in Variables
Adding Text to Ggplot Geom_Jitter Points That Match a Condition
How to Group by All But One Columns
Using Dplyr to Conditionally Replace Values in a Column
Fast Linear Regression by Group
Adding Time to Posixct Object in R
Visualise Distances Between Texts
Convert a Matrix with Dimnames into a Long Format Data.Frame