How to Create Example Data Set from Private Data (Replacing Variable Names and Levels With Uninformative Place Holders)

How to create example data set from private data (replacing variable names and levels with uninformative place holders)?

I don't know whether there was a function to automate this, but now there is ;)

## A function to anonymise columns in 'colIDs'
## colIDs can be either column names or integer indices
anonymiseColumns <- function(df, colIDs) {
    ## Convert column names to integer indices so both input types work
    ids <- if (is.character(colIDs)) match(colIDs, names(df)) else colIDs
    for (id in ids) {
        prefix <- sample(LETTERS, 1)
        suffix <- as.character(as.numeric(as.factor(df[[id]])))
        df[[id]] <- paste(prefix, suffix, sep = "")
        ## Rename inside the loop so every anonymised column is renamed
        names(df)[id] <- paste("V", id, sep = "")
    }
    df
}

## A data.frame containing sensitive information
df <- data.frame(
    name = rep(readLines(file.path(R.home("doc"), "AUTHORS"))[9:13], each = 2),
    hiscore = runif(10, 99, 100),
    passwd = replicate(10, paste(sample(c(LETTERS, letters), 9), collapse = "")))

## Anonymise it
df2 <- anonymiseColumns(df, c(1, 3))

## Check that it worked (values are random, so yours will differ)
> head(df, 3)
           name  hiscore    passwd
1 Douglas Bates 99.96714 ROELIAncz
2 Douglas Bates 99.07243 gDOLNMyVe
3 John Chambers 99.55322 xIVPHDuEW

> head(df2, 3)
  V1  hiscore V3
1 Q1 99.96714 V8
2 Q1 99.07243 V2
3 Q2 99.55322 V9
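
If you need to map the placeholders back to the real values later, one option is to keep a lookup key next to the anonymised data. The following is only a sketch: the helper name anonymise_with_key, the "L" prefix, and the toy data frame are all made up for illustration and are not part of the answer above.

```r
## Hypothetical helper: anonymise columns and return a key that maps
## each original value to its placeholder.
anonymise_with_key <- function(df, cols) {
    ids <- if (is.character(cols)) match(cols, names(df)) else cols
    key <- list()
    for (id in ids) {
        f <- as.factor(df[[id]])
        codes <- paste0("L", seq_along(levels(f)))
        ## Store the mapping under the original column name
        key[[names(df)[id]]] <- data.frame(original = levels(f),
                                           placeholder = codes,
                                           stringsAsFactors = FALSE)
        df[[id]] <- codes[as.integer(f)]
        names(df)[id] <- paste0("V", id)
    }
    list(data = df, key = key)
}

## Toy data, assumed for illustration only
toy <- data.frame(name = c("Ann", "Ann", "Bob"),
                  score = c(1.5, 2.5, 3.5),
                  stringsAsFactors = FALSE)
res <- anonymise_with_key(toy, "name")
res$data       # share this
res$key$name   # keep this private
```

Only res$data should be shared; the key stays with whoever owns the private data.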

How to pass on variable names (i.e. var_x) OR transformation of variables (i.e. as.factor(var_x)) in same function?

Try any of these.

my_fun1 <- function(data, x) eval(substitute(x), data)

my_fun2 <- function(data, ...) with(data, ...)

my_fun3 <- with

e.g. (and similarly for the others)

my_fun1(BOD, Time) # returns numeric object
## [1] 1 2 3 4 5 7

my_fun1(BOD, factor(Time)) # returns factor object
## [1] 1 2 3 4 5 7
## Levels: 1 2 3 4 5 7
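
To see why my_fun1 accepts both a bare name and a transformation, it can help to look at what substitute() captures before it is evaluated. This is a small sketch; show_expr is a hypothetical name used only here.

```r
## Hypothetical helper: return both the captured expression and its value.
show_expr <- function(data, x) {
    expr <- substitute(x)   # the argument as written, not yet evaluated
    list(expr = expr, value = eval(expr, data, parent.frame()))
}

a <- show_expr(BOD, Time)          # bare variable name
b <- show_expr(BOD, factor(Time))  # transformation of a variable

class(a$expr)    # "name"
class(b$expr)    # "call"
class(b$value)   # "factor"
```

In both cases the expression is evaluated with the data frame's columns in scope, which is why the same function handles either form.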

Anonymize data for each distinct row in R

Create a function that does the job:

anon <- function(x) {
    rl <- rle(x)$lengths
    ans <- paste("Value", rep(seq_along(rl), rl))
    return(ans)
}

Call the function:

anon(final_df$Value)

Result:

# [1] "Value 1" "Value 1" "Value 1" "Value 2" "Value 3" "Value 3" "Value 3"

Generalization to every column of a data frame:

df1 <- mtcars
df1[] <- lapply(df1, anon)
names(df1) <- paste0("V", seq_along(names(df1)))
rownames(df1) <- NULL

df1
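
Note that rle() labels runs of consecutive equal values, so a value that reappears later gets a new label. If the data are not sorted and you want the same value to always map to the same label, a variant based on match() and unique() may be closer to what you need. This is a sketch; anon_by_value is a made-up name and not part of the answer above.

```r
## Hypothetical variant: label by distinct value, not by run
anon_by_value <- function(x) paste("Value", match(x, unique(x)))

x <- c("a", "a", "b", "a")

by_value <- anon_by_value(x)
## [1] "Value 1" "Value 1" "Value 2" "Value 1"

## The run-based approach, for comparison
rl <- rle(x)$lengths
by_run <- paste("Value", rep(seq_along(rl), rl))
## [1] "Value 1" "Value 1" "Value 2" "Value 3"
```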

How to replace all columns in matrix with NA based on variable in another dataframe?

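We can extract the 'missing' column from df, use it as a logical row index into m1, and assign NA to the selected rows:

m1[df$missing,] <- NA

Output:

> m1
     [,1] [,2] [,3]
[1,]    0    0    1
[2,]    1    0    1
[3,]   NA   NA   NA

Or we can rely on NA^FALSE being 1 and NA^TRUE being NA, and multiply (this returns a new matrix rather than modifying m1):

> NA^(df$missing) * m1
     [,1] [,2] [,3]
[1,]    0    0    1
[2,]    1    0    1
[3,]   NA   NA   NA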
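
The question's data are not shown here, so the following is a self-contained sketch with assumed inputs: the third row of m1 and the contents of df$missing are invented to match the printed output.

```r
## Assumed example data (the original m1 and df are not shown)
m1 <- matrix(c(0, 1, 1,
               0, 0, 0,
               1, 1, 1), nrow = 3)   # filled column by column
df <- data.frame(missing = c(FALSE, FALSE, TRUE))

## Approach 1: logical row index, modifies a copy in place
m2 <- m1
m2[df$missing, ] <- NA

## Approach 2: NA^FALSE is 1 and NA^TRUE is NA; the length-3 vector is
## recycled down each column, so whole rows are blanked
m3 <- NA^(df$missing) * m1

all.equal(m2, m3)
```

Both approaches assume 'missing' is logical. If it is stored as 0/1, wrap it in as.logical() first, because a numeric index is treated as row positions rather than as a mask.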

Rename all variables that contain a particular string and add a sequential number

Update

From dplyr 1.0.0 you can use rename_with.

You can select columns to rename by position

library(dplyr)
ds %>% rename_with(~paste0("var", seq_along(.), sub("nameverybig_*", "_", .)), -1)

Or by name

ds %>% rename_with(~paste0("var", seq_along(.), sub("nameverybig_*", "_", .)), 
starts_with('nameverybig'))

Both return:

#   identification var1_do_you_like_cookies var2_have_you_been_in_europe var3_whats_your_gender
#1               1                        1                            1                      1
#2               2                        2                            2                      2
#3               3                        3                            3                      3
#4               4                        4                            4                      4
#5               5                        5                            5                      5
#6               6                        6                            6                      6
#7               7                        7                            7                      7
#8               8                        8                            8                      8
#9               9                        9                            9                      9
#10             10                       10                           10                     10

Old Answer

You could use paste0 with sub:

ds %>% rename_all(~paste0("var", seq_along(.), sub("nameverybig_*", "_", .)))

To rename only specific variables, we can use rename_at:

ds %>% rename_at(vars(starts_with("nameverybig")),
                 ~paste0("var", seq_along(.), sub("nameverybig_*", "_", .)))
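
If dplyr is not available, the same renaming can be sketched in base R with grep() and sub(). This is an alternative technique, not the approach from the answer above, and the ds below is an assumed stand-in whose column names are inferred from the printed output.

```r
## Assumed stand-in for the question's data
ds <- data.frame(identification = 1:3,
                 nameverybig_do_you_like_cookies = 1:3,
                 nameverybig_have_you_been_in_europe = 1:3,
                 nameverybig_whats_your_gender = 1:3)

## Positions of the columns whose names start with the long prefix
idx <- grep("^nameverybig", names(ds))

## Replace the prefix with "var<n>_", numbering only the matched columns
names(ds)[idx] <- paste0("var", seq_along(idx),
                         sub("^nameverybig_*", "_", names(ds)[idx]))

names(ds)
## [1] "identification"               "var1_do_you_like_cookies"
## [3] "var2_have_you_been_in_europe" "var3_whats_your_gender"
```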

