Randomly Insert NAs into Dataframe Proportionally

Randomly insert NAs into dataframe proportionally


df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
head(df)
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 15 25
## 6 6 16 26

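# Index each column with a random TRUE/NA vector: TRUE keeps the value, NA yields NA
as.data.frame(lapply(df, function(cc) {
  keep <- sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE)
  cc[keep]
}))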
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 NA 25
## 6 6 16 26
## 7 NA 17 27
## 8 8 18 28
## 9 9 19 29
## 10 10 20 30

Each cell is set to NA independently with probability 0.15, so the realized share of NAs varies from run to run and will not be exactly 15% every time.
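If every column should lose a fixed number of values rather than a random number, one option is to sample row positions instead of drawing TRUE/NA per cell. The following is a minimal sketch and not part of the answer above; the helper name insert_na_exact and the choice of round() for the count are assumptions made for illustration.

```r
# Hypothetical helper: set exactly round(prop * n) entries of each column to NA.
insert_na_exact <- function(df, prop = 0.15) {
  n <- nrow(df)
  k <- round(prop * n)                 # NAs per column (rounding is an assumption)
  df[] <- lapply(df, function(col) {
    col[sample.int(n, k)] <- NA        # k distinct row positions, drawn per column
    col
  })
  df
}

set.seed(1)
df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
out <- insert_na_exact(df, prop = 0.2)
colSums(is.na(out))                    # each column has exactly 2 NAs
```

The positions still differ between columns and between runs; only the count per column is fixed.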

Add exact proportion of random missing values to data.frame

This is how I do it for my paper on library(imputeMulti), which is currently in review at JSS. It inserts NAs into a random selection of cells across the whole dataset and scales well. It does not guarantee the exact proportion when (n * p * pctNA) %% 1 != 0, because the number of NAs is rounded down with floor().

createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}

For reproducibility, set a random seed with set.seed() before calling the function.
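As a small illustration of the seeding advice, the sketch below repeats createNAs from the answer so that it runs on its own, then calls it twice with the same seed. The seed value and the example data frame are arbitrary choices for illustration.

```r
# createNAs as given in the answer, repeated here so the example is self-contained.
createNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  x
}

df <- data.frame(A = 1:10, B = 11:20, c = 21:30)

set.seed(123)                          # arbitrary seed, for illustration only
first <- createNAs(df, pctNA = 0.15)
set.seed(123)                          # same seed again
second <- createNAs(df, pctNA = 0.15)

identical(first, second)               # TRUE: same seed, same NA positions
sum(is.na(first))                      # 4, because floor(10 * 3 * 0.15) = floor(4.5)
```

The second result also shows the rounding: 15% of 30 cells would be 4.5, so 4 cells are set to NA.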

This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.

Edit: I assume that x is complete, so I'm not sure how the function would handle existing missing data. You could certainly modify the code to handle that, though it would probably increase the runtime by at least O(n*p).
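One possible modification for data that already contains NAs is to sample only among the observed cells. This is a sketch under that assumption, not the author's code; the name createNAsObserved and the decision to take pctNA as a share of the observed cells are hypothetical.

```r
# Hypothetical variant: only cells that are currently observed can become NA.
createNAsObserved <- function(x, pctNA = 0.1) {
  observed <- which(!is.na(x))                 # linear indices of non-missing cells
  nNew <- floor(length(observed) * pctNA)      # new NAs as a share of observed cells
  NAloc <- matrix(FALSE, nrow = nrow(x), ncol = ncol(x))
  NAloc[observed[sample.int(length(observed), nNew)]] <- TRUE
  x[NAloc] <- NA
  x
}

set.seed(7)
x <- data.frame(A = c(1, NA, 3, 4, 5), B = c(NA, 2, 3, 4, 5))
y <- createNAsObserved(x, pctNA = 0.25)
sum(is.na(y)) - sum(is.na(x))                  # 2 new NAs: floor(8 * 0.25)
```

The extra is.na() pass over the data is the O(n*p) cost mentioned above.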

Assigning categorical values to NAs randomly or proportionally

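We can use ifelse and is.na to find the missing values, and then use sample to randomly pick female or male for each of them. Drawing one value per row matters here: sample(..., 1) would produce a single draw that ifelse recycles into every NA.

# Assumes df$gender is a character vector (convert with as.character() if it is a factor)
df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)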
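To fill the NAs in proportion to the categories already observed, rather than with equal probability, the prob argument of sample can carry the observed shares. The sketch below uses made-up example data; the variable names na_idx and obs_prop are introduced only for illustration.

```r
set.seed(10)
# Made-up example data: 6 observed "female", 2 observed "male", 4 missing.
df <- data.frame(
  gender = c(rep("female", 6), rep("male", 2), rep(NA, 4)),
  stringsAsFactors = FALSE
)

na_idx <- is.na(df$gender)
obs_prop <- prop.table(table(df$gender))   # table() drops NA by default

df$gender[na_idx] <- sample(
  names(obs_prop),                         # the observed categories
  size = sum(na_idx),
  replace = TRUE,
  prob = as.numeric(obs_prop)              # draw with the observed shares
)
```

With these data each missing value becomes "female" with probability 0.75 and "male" with probability 0.25.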

Shrinking dataframe randomly in R

This solved my problem: draw 10,000 row indices at random, without replacement, and keep only those rows.

df[sample(nrow(df), 10000), ]

Generate random missing values in a dataframe using R

Where your error comes from:

The sapply call is trying to apply the function insert_nas to each element of incomplete.data (in this context, the elements of a dataframe are its columns). The function dim applied to an atomic vector yields NULL; multiplying by a constant gives a numeric vector of length 0; applying floor doesn't change this; and finally trying to generate a sequence bounded by an empty vector gives an error.

How to eliminate the error:

Presumably by dim(x)[1] you were intending to get the number of rows in the dataframe (which is what you get when x is the dataframe rather than one of its columns). Try replacing it with length(x).
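The chain of events described above can be reproduced on a plain vector. The original insert_nas is not shown here, so the function below is a hypothetical column-wise version written only to illustrate the length(x) fix; its prop argument is an assumption.

```r
col <- 1:10                        # stands in for one column of the data frame

dim(col)                           # NULL: atomic vectors have no dim attribute
floor(dim(col)[1] * 0.1)           # numeric(0)
# 1:floor(dim(col)[1] * 0.1)       # would fail with "argument of length 0"
length(col)                        # 10

# Hypothetical column-wise function using length(x) instead of dim(x)[1].
insert_nas <- function(x, prop = 0.1) {
  n_na <- floor(length(x) * prop)
  x[sample.int(length(x), n_na)] <- NA
  x
}

set.seed(3)
incomplete.data <- data.frame(a = 1:10, b = 11:20)
result <- as.data.frame(sapply(incomplete.data, insert_nas))
colSums(is.na(result))             # one NA per column
```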

For arbitrarily distributed selection of NAs:

To change some proportion p of values to NA, distributing without regard to column location, it seems most straightforward to just use a random sample of the appropriate size (p*df-size) over the whole dataframe to choose the elements to set to NA:

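n <- nrow(df)
sel <- sample(n * ncol(df), size = floor(p * n * ncol(df)))
for (i in seq_along(sel)) {
  # Map linear index sel[i] to column (sel[i] - 1) %/% n + 1 and row (sel[i] - 1) %% n + 1
  is.na(df[[(sel[i] - 1) %/% n + 1]]) <- (sel[i] - 1) %% n + 1
}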

Randomize values within multiple columns of a data.frame

You probably want lapply, which applies sample to each column separately and so shuffles the values within every column independently.

set.seed(42)  # for sake of reproducibility
as.data.frame(lapply(myDF, sample))
# V1 V2 V3 V4 V5
# 1 26.805098 21.45579 19.35567 25.61212 24.837689
# 2 20.779622 14.11364 25.62038 10.52022 9.249468
# 3 9.883752 24.51835 10.37063 18.44686 18.402290
# 4 15.816900 11.81731 14.66842 16.12071 15.298724

Edit

Let's give myDF row names

rownames(myDF) <- letters[1:4]

we could save them in a variable

nm <- rownames(myDF)

and reassign them after shuffling with the command above.

set.seed(42)
myDF <- `rownames<-`(as.data.frame(lapply(myDF, sample)), nm)
myDF
# V1 V2 V3 V4 V5
# a 26.805098 21.45579 19.35567 25.61212 24.837689
# b 20.779622 14.11364 25.62038 10.52022 9.249468
# c 9.883752 24.51835 10.37063 18.44686 18.402290
# d 15.816900 11.81731 14.66842 16.12071 15.298724

Data

myDF <- structure(list(V1 = c(9.883752193648, 15.8168998395206, 20.7796219245553, 
26.8050975188108), V2 = c(11.8173120437042, 14.1136424787568,
21.4557850824769, 24.5183526363054), V3 = c(10.370627864258,
14.6684224100574, 19.3556715707687, 25.6203798012984), V4 = c(10.520216457555,
16.1207126516696, 18.4468625947703, 25.6121234926508), V5 = c(9.24946800549767,
15.2987236992673, 18.4022904833037, 24.8376890230819)), class = "data.frame", row.names = c(NA,
-4L))

