How to Add Random NAs into a Data Frame

How do I add random `NA`s into a data frame

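Return x at the end of your function; otherwise each call returns the value of the assignment (NA) rather than the modified column. Here n is the number of rows, e.g. nrow(df). Note that apply() returns a matrix, so mixed column types are coerced to character:

> df <- apply(df, 2, function(x) {x[sample(seq_len(n), floor(n/10))] <- NA; x})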
> tail(df)
id age sex
[45,] "45" "41" NA
[46,] "46" NA "f"
[47,] "47" "38" "f"
[48,] "48" "32" "f"
[49,] "49" "53" NA
[50,] "50" "74" "f"

Randomly insert NAs into dataframe proportionally

df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
head(df)
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 15 25
## 6 6 16 26

as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 NA 25
## 6 6 16 26
## 7 NA 17 27
## 8 8 18 28
## 9 9 19 29
## 10 10 20 30

Each cell is set to NA independently with probability 0.15, so the realized share of NAs varies from run to run and will rarely be exactly 15%.
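To get a feel for how far the realized count can drift, here is a small simulation in plain Python (not R). It assumes the same setup as the example, 10 rows by 3 columns with a per-cell probability of 0.15; the seed and the number of repetitions are arbitrary choices of mine.

```python
import math
import random

cells, p = 30, 0.15                   # 10 rows x 3 columns, per-cell NA probability
expected = cells * p                  # mean number of NAs per run
sd = math.sqrt(cells * p * (1 - p))   # binomial standard deviation

random.seed(0)  # arbitrary seed, only so the run is repeatable
# Repeat the "flip a biased coin for every cell" experiment 1000 times.
counts = [sum(random.random() < p for _ in range(cells)) for _ in range(1000)]

print("expected NAs per run:", expected)
print("standard deviation:", round(sd, 2))
print("fewest / most NAs seen:", min(counts), "/", max(counts))
```

With 30 cells the expected count is 4.5, which is not a whole number, so no single run can hit 15% exactly in this example.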

Randomly insert NA values in a pandas dataframe

Here's a way to clear exactly 10% of cells (or rather, as close to 10% as can be achieved with the existing data frame's size).

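import random
import numpy as np
# all (row, col) positions in the frame
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
# pick 10% of the positions without replacement and blank each one
for row, col in random.sample(ix, int(round(.1*len(ix)))):
    df.iat[row, col] = np.nan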

Here's a way to clear cells independently with a per-cell probability of 10%.

df = df.mask(np.random.random(df.shape) < .1)
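If the frame is large, the exact-count variant can also be done without a Python loop. The following is only a sketch: the frame `demo`, the seed, and the variable names are made up for illustration. It builds a boolean array with exactly k True entries and passes it to `mask`, next to the per-cell version for comparison.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
demo = pd.DataFrame(np.arange(200, dtype=float).reshape(20, 10))  # toy data

# Exact count: mark k distinct flat positions, then reshape to the frame's shape.
k = int(round(0.1 * demo.size))
hit = np.zeros(demo.size, dtype=bool)
hit[rng.choice(demo.size, size=k, replace=False)] = True
exact = demo.mask(hit.reshape(demo.shape))

# Per-cell probability: every cell is blanked independently with probability 0.1.
per_cell = demo.mask(rng.random(demo.shape) < 0.1)

print("exact:", int(exact.isna().sum().sum()), "of", demo.size)
print("per-cell:", int(per_cell.isna().sum().sum()), "of", demo.size)
```

The first count is always k; the second changes with the seed.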

add random noise and random NA in pandas dataframe

Why don't you try what is suggested here: Adding gaussian noise to a dataset of floating points and save it (python)

  1. Load the data into a pandas dataframe: clean_signal = pd.read_csv("data_file_name")
  2. Use numpy to generate Gaussian noise with the same dimension as the dataset.
  3. Add Gaussian noise to the clean signal with signal = clean_signal + noise
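The steps above can be sketched as follows, with a random-NaN step added at the end. Everything concrete here is an assumption: the toy frame stands in for the CSV file, and the noise scale `sigma` and the 10% missing rate are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed

# Stand-in for pd.read_csv("data_file_name"); the values are made up.
clean_signal = pd.DataFrame({"x": np.linspace(0.0, 1.0, 50),
                             "y": np.linspace(5.0, 6.0, 50)})

# Gaussian noise with the same shape as the data (sigma is an arbitrary choice).
sigma = 0.05
noise = rng.normal(loc=0.0, scale=sigma, size=clean_signal.shape)
signal = clean_signal + noise

# Blank out each cell independently with probability 0.1.
signal_na = signal.mask(rng.random(signal.shape) < 0.1)

print(signal_na.head())
```

Adding the noise before inserting the NaNs keeps the two steps independent, so the surviving cells of `signal_na` are exactly the noisy values.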

add exact proportion of random missing values to data.frame

This is the way that I do it for my paper on library(imputeMulti), which is currently in review at JSS. It inserts NAs into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact proportion, because floor(n * p * pctNA) rounds down whenever n * p * pctNA %% 1 != 0.

createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}

You should set a random seed for reproducibility, e.g. with set.seed() before the function call.

This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.

Edit: I do assume that x is complete, so I'm not sure how it would handle existing missing data. You could certainly modify the code if you want, though that would probably increase the runtime by at least O(n*p).
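One possible way to cope with data that already has missing values is to sample only among the cells that are still observed. The sketch below is in Python with pandas rather than R, and it is not the imputeMulti implementation; the function name `create_nas` and its `rng` parameter are hypothetical.

```python
import math
import numpy as np
import pandas as pd

def create_nas(x, pct_na=0.1, rng=None):
    """Hypothetical pandas variant of createNAs that skips cells already missing."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = x.shape
    target = math.floor(n * p * pct_na)        # same count rule as the R version
    # Flat positions of cells that currently hold a value.
    observed = np.flatnonzero(x.notna().to_numpy().ravel())
    k = min(target, observed.size)             # cannot blank more cells than exist
    hit = np.zeros(n * p, dtype=bool)
    hit[rng.choice(observed, size=k, replace=False)] = True
    return x.mask(hit.reshape(n, p))           # new NAs are added to existing ones

demo = pd.DataFrame(np.arange(100, dtype=float).reshape(20, 5))  # toy data
demo.iloc[0, 0] = np.nan                                         # one NA up front
out = create_nas(demo, 0.1, np.random.default_rng(42))
print(int(out.isna().sum().sum()))
```

Finding the observed cells is one extra pass over the data, which matches the O(n*p) overhead mentioned above.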

R: Randomly Replace Values with NA

You could replace random elements column by column using lapply (the \(x) lambda shorthand requires R 4.1 or later).

set.seed(42)
r1 <- as.data.frame(lapply(dat, \(x) replace(x, sample(length(x), .1*length(x)), NA)))

r1
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 NA 7 NA 10 3 11 4 4 NA 7
# 2 6 6 8 8 4 11 NA 8 10 9
# 3 1 12 4 5 12 3 10 3 11 1
# 4 3 10 6 2 11 NA 3 11 2 11
# 5 8 NA 10 12 5 7 2 9 4 10
# 6 12 4 9 12 9 2 7 9 8 8
# 7 7 5 9 4 2 12 12 3 4 4
# 8 12 5 3 1 6 1 4 7 6 NA
# 9 4 6 12 NA 5 8 4 4 6 7
# 10 3 2 11 3 NA 5 4 NA 2 4

mean(is.na(r1))
# [1] 0.1

However, this replaces exactly .1 of the values in each column with NA. If we want each cell to be replaced with NA independently with probability .1, we could use apply over both margins (MARGIN=1:2). The realized share of NAs then varies from run to run.

set.seed(42)
p <- .1
r2 <- as.data.frame(apply(dat, 1:2, \(x) sample(c(x, NA), 1, prob=c((1 - p), p))))

r2
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 NA 7 NA 10 3 11 4 4 12 7
# 2 NA 6 8 8 4 11 NA 8 10 9
# 3 1 NA NA 5 12 3 10 3 11 1
# 4 3 10 NA 2 NA 9 3 11 2 NA
# 5 8 12 10 12 5 7 2 9 4 NA
# 6 12 NA 9 12 NA 2 7 9 8 8
# 7 7 NA 9 4 2 12 12 3 4 4
# 8 12 5 NA 1 6 1 4 7 6 12
# 9 4 6 12 NA NA 8 4 4 6 7
# 10 3 2 11 3 3 5 4 8 2 4
mean(is.na(r2))
# [1] 0.16

If the data can be coerced with as.matrix, you can index the matrix like a vector and replace exactly .1 of all cells:

set.seed(42)
m <- as.matrix(dat)
m[sample(seq_along(m), .1*length(m))] <- NA

m
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# [1,] 6 7 1 10 3 11 4 NA 12 7
# [2,] 6 6 8 8 4 11 10 8 10 9
# [3,] 1 12 4 5 12 3 10 3 11 1
# [4,] 3 10 NA 2 11 9 3 NA 2 11
# [5,] 8 12 NA 12 5 7 NA 9 4 10
# [6,] 12 4 9 12 9 2 7 9 8 8
# [7,] 7 5 9 4 NA 12 12 3 4 4
# [8,] 12 NA 3 1 6 1 4 7 6 12
# [9,] 4 6 12 3 NA 8 4 4 NA 7
# [10,] 3 2 11 3 3 5 4 8 2 NA

mean(is.na(m))
# [1] 0.1

Then coerce back to a data.frame:

dat_na <- as.data.frame(m) |> type.convert(as.is=TRUE)

The type.convert takes care of getting back classes like "numeric" and "character", since matrices can only have one mode. Note that you may lose attributes in the process.


Data:

dat <- structure(list(X1 = c(6L, 6L, 1L, 3L, 8L, 12L, 7L, 12L, 4L, 3L
), X2 = c(7L, 6L, 12L, 10L, 12L, 4L, 5L, 5L, 6L, 2L), X3 = c(1L,
8L, 4L, 6L, 10L, 9L, 9L, 3L, 12L, 11L), X4 = c(10L, 8L, 5L, 2L,
12L, 12L, 4L, 1L, 3L, 3L), X5 = c(3L, 4L, 12L, 11L, 5L, 9L, 2L,
6L, 5L, 3L), X6 = c(11L, 11L, 3L, 9L, 7L, 2L, 12L, 1L, 8L, 5L
), X7 = c(4L, 10L, 10L, 3L, 2L, 7L, 12L, 4L, 4L, 4L), X8 = c(4L,
8L, 3L, 11L, 9L, 9L, 3L, 7L, 4L, 8L), X9 = c(12L, 10L, 11L, 2L,
4L, 8L, 4L, 6L, 6L, 2L), X10 = c(7L, 9L, 1L, 11L, 10L, 8L, 4L,
12L, 7L, 4L)), class = "data.frame", row.names = c(NA, -10L))

replace NA in a dataframe with random numbers within a range

Use runif instead of sample:

cars[is.na(cars)] <-  runif(sum(is.na(cars)), min = 0.9, max = 1)
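For pandas users, a rough equivalent might look like the sketch below. The `cars` frame and its values are made up here, and the range 0.9 to 1 simply mirrors the runif call above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # arbitrary seed
cars = pd.DataFrame({"speed": [4.0, np.nan, 7.0, np.nan],
                     "dist": [2.0, 10.0, np.nan, 22.0]})  # made-up values

# Draw a candidate for every cell, then use it only where cars is missing.
fill = pd.DataFrame(rng.uniform(0.9, 1.0, size=cars.shape),
                    index=cars.index, columns=cars.columns)
filled = cars.fillna(fill)

print(filled)
```

This draws more random numbers than strictly needed, but it keeps the code to a single fillna call and leaves the observed values untouched.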



