Randomly insert NAs into dataframe proportionally
df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
head(df)
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 15 25
## 6 6 16 26
as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 NA 25
## 6 6 16 26
## 7 NA 17 27
## 8 8 18 28
## 9 9 19 29
## 10 10 20 30
It's a random process, so it might not give 15% every time.
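If every column should get exactly the same share of NAs rather than 15% on average, one option is to draw a fixed number of row positions per column. This is only a sketch: the helper name insert_exact_nas and its prop argument are made up for illustration and are not part of the answer above.

```r
# Hypothetical helper (not from the original answer): blank out exactly
# round(prop * n) entries in every column, at random row positions.
insert_exact_nas <- function(df, prop = 0.15) {
  n <- nrow(df)
  k <- round(prop * n)                 # number of NAs per column
  df[] <- lapply(df, function(col) {
    col[sample.int(n, k)] <- NA        # k distinct row positions
    col
  })
  df
}

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
df_na <- insert_exact_nas(df, prop = 0.2)
colSums(is.na(df_na))  # 2 NAs in each column
```

Because the positions are drawn separately per column, the NAs still land in different rows from column to column.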
add exact proportion of random missing values to data.frame
This is the way that I do it for my paper on library(imputeMulti), which is currently in review at JSS. It inserts NAs into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact proportion, because the count is rounded down whenever n * p * pctNA is not a whole number (n * p * pctNA %% 1 != 0).
createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  # Logical mask over all n * p cells; TRUE marks a cell that becomes NA
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}
Obviously you should use a random seed for reproducibility, which can be specified before the function call.
This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.
Edit: I do assume that x is complete, so I'm not sure how it would handle existing missing data. You could certainly modify the code if you want, though that would probably increase the runtime by at least O(n*p).
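One way such a modification might look is to sample only from cells that are currently observed. This is a sketch, not the author's code: the name createNAs_observed is hypothetical, and pctNA is interpreted here as a share of the observed cells rather than of all cells.

```r
# Hypothetical variant: only currently observed cells are eligible, so
# existing NAs are left alone and never counted as newly inserted ones.
createNAs_observed <- function(x, pctNA = 0.1) {
  observed <- which(!is.na(x))              # linear indices, column-major
  k <- floor(length(observed) * pctNA)      # number of new NAs to insert
  new_na <- rep(FALSE, nrow(x) * ncol(x))
  new_na[observed[sample.int(length(observed), k)]] <- TRUE
  x[matrix(new_na, nrow = nrow(x), ncol = ncol(x))] <- NA
  x
}

set.seed(123)  # arbitrary seed, set before the call for reproducibility
x <- data.frame(a = c(1, NA, 3, 4, 5), b = c(6, 7, 8, NA, 10))
out <- createNAs_observed(x, pctNA = 0.25)
sum(is.na(out))  # 4: the 2 existing NAs plus floor(8 * 0.25) = 2 new ones
```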
Assigning categorical values to NAs randomly or proportionally
We can use ifelse and is.na to find the NA entries, and then use sample to randomly select female or male.
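# Draw one value per row; a single draw (size 1) would be recycled into every NA
df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)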
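For the "proportionally" half of the question, one reading is that the replacement values should follow the frequencies already observed in the column. A sketch under that assumption, using a made-up gender column and drawing one value per missing entry:

```r
set.seed(2024)  # arbitrary seed, only for reproducibility
df <- data.frame(
  gender = c("female", "female", "female", "male", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

missing <- is.na(df$gender)
freq <- prop.table(table(df$gender))   # observed shares; table() drops NA by default

# Fill each NA with a draw weighted by the observed shares (0.75 / 0.25 here)
df$gender[missing] <- sample(names(freq), size = sum(missing),
                             replace = TRUE, prob = as.numeric(freq))
table(df$gender, useNA = "ifany")
```

With only a handful of missing entries the filled-in shares will wander around the observed ones; they match only on average.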
Shrinking dataframe randomly in R
This solved my problem:
df[sample(nrow(df), 10000), ]
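Note that sample(nrow(df), 10000) draws 10000 distinct row numbers, so it fails when the data frame has fewer rows than that. A small sketch with a guard; the helper name shrink_rows and the example data are hypothetical:

```r
# Hypothetical helper: keep at most n randomly chosen rows.
shrink_rows <- function(df, n) {
  keep <- sample(nrow(df), min(n, nrow(df)))  # row numbers, without replacement
  df[keep, , drop = FALSE]                    # drop = FALSE keeps a data frame
}

set.seed(7)  # arbitrary seed
big <- data.frame(id = 1:50000, value = rnorm(50000))
small <- shrink_rows(big, 10000)
nrow(small)  # 10000
```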
Generate random missing values in a dataframe using R
Where your error comes from:
The sapply call is trying to apply the function insert_nas to each element of incomplete.data (in this context, the elements of a dataframe are its columns). The function dim applied to an atomic vector yields NULL; multiplying by a constant gives a numeric vector of length 0; applying floor doesn't change this; and finally trying to generate a sequence bounded by an empty vector gives an error.
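The chain of empty values can be reproduced step by step on a plain vector. The vector x, the constant 0.1 and the use of the : operator are assumptions for illustration, since the original insert_nas is not shown here:

```r
x <- c(1.5, 2.5, 3.5)         # an atomic vector, like a single data frame column

dim(x)                         # NULL: plain vectors have no dim attribute
n_na <- floor(dim(x)[1] * 0.1)
length(n_na)                   # 0: the arithmetic carried the empty value along

# Building a sequence up to an empty value is the step that finally fails.
res <- tryCatch(1:n_na, error = function(e) conditionMessage(e))
res                            # the error message, captured as a string
```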
How to eliminate the error:
Presumably by dim(x)[1] you were intending to get the number of rows in the dataframe (which is what you get when x is the dataframe rather than one of its columns). Try replacing it with length(x).
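A per-column function with that fix might look like the following. This is a hypothetical reconstruction, since the asker's insert_nas is not shown here; the prop argument and the example data are assumptions.

```r
# Hypothetical reconstruction: only illustrates the length(x) fix
# on a function that receives one column at a time.
insert_nas <- function(x, prop = 0.1) {
  n_na <- floor(length(x) * prop)        # length(), not dim(x)[1]
  x[sample.int(length(x), n_na)] <- NA
  x
}

set.seed(10)  # arbitrary seed
incomplete.data <- data.frame(a = 1:20, b = 21:40)
incomplete.data[] <- lapply(incomplete.data, insert_nas)
colSums(is.na(incomplete.data))  # 2 per column
```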
For arbitrarily distributed selection of NAs:
To change some proportion p of values to NA, distributing without regard to column location, it seems most straightforward to just use a random sample of the appropriate size (p times the number of cells) over the whole dataframe to choose the elements to set to NA:
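n <- nrow(df)
sel <- sample(n * ncol(df), size = floor(p * n * ncol(df)))
for (t in seq_along(sel)) {
  # Map the 1-based linear index to (column, row); without the "- 1",
  # index n * ncol(df) would point one column past the end
  is.na(df[[(sel[t] - 1) %/% n + 1]]) <- (sel[t] - 1) %% n + 1
}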
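The same selection can also be done without a loop. A sketch, assuming a small made-up df and p: arrayInd() converts linear cell indices into (row, column) pairs, and a two-column numeric matrix can be used to index a data frame in an assignment.

```r
set.seed(99)  # arbitrary seed
df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
p <- 0.2       # target proportion of NAs (illustrative value)

n_cells <- nrow(df) * ncol(df)
sel <- sample.int(n_cells, size = floor(p * n_cells))  # linear cell indices
idx <- arrayInd(sel, dim(df))   # two-column matrix of (row, column) pairs
df[idx] <- NA
sum(is.na(df))  # 6
```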
randomize values within multiple columns of a data.frame
You probably want this. Use lapply, which applies sample to each column.
set.seed(42) # for sake of reproducibility
as.data.frame(lapply(myDF, sample))
# V1 V2 V3 V4 V5
# 1 26.805098 21.45579 19.35567 25.61212 24.837689
# 2 20.779622 14.11364 25.62038 10.52022 9.249468
# 3 9.883752 24.51835 10.37063 18.44686 18.402290
# 4 15.816900 11.81731 14.66842 16.12071 15.298724
Edit
Let's give myDF row names
rownames(myDF) <- letters[1:4]
we could buffer them
nm <- rownames(myDF)
and give them back together with the command above.
set.seed(42)
myDF <- `rownames<-`(as.data.frame(lapply(myDF, sample)), nm)
myDF
# V1 V2 V3 V4 V5
# a 26.805098 21.45579 19.35567 25.61212 24.837689
# b 20.779622 14.11364 25.62038 10.52022 9.249468
# c 9.883752 24.51835 10.37063 18.44686 18.402290
# d 15.816900 11.81731 14.66842 16.12071 15.298724
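An alternative that avoids buffering the row names is to assign into the existing data frame with empty brackets, which replaces the columns and leaves the other attributes in place. A sketch on a small made-up data frame (toy is not the myDF from this answer):

```r
set.seed(42)  # arbitrary seed
toy <- data.frame(V1 = c(1, 2, 3, 4), V2 = c(10, 20, 30, 40),
                  row.names = letters[1:4])

toy[] <- lapply(toy, sample)   # shuffle each column independently, in place
rownames(toy)                  # still "a" "b" "c" "d"
```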
Data
myDF <- structure(list(V1 = c(9.883752193648, 15.8168998395206, 20.7796219245553,
26.8050975188108), V2 = c(11.8173120437042, 14.1136424787568,
21.4557850824769, 24.5183526363054), V3 = c(10.370627864258,
14.6684224100574, 19.3556715707687, 25.6203798012984), V4 = c(10.520216457555,
16.1207126516696, 18.4468625947703, 25.6121234926508), V5 = c(9.24946800549767,
15.2987236992673, 18.4022904833037, 24.8376890230819)), class = "data.frame", row.names = c(NA,
-4L))