How to create an example data set from private data (replacing variable names and levels with uninformative placeholders)?
I don't know whether there was a function to automate this, but now there is ;)
## A function to anonymise columns in 'colIDs'
## colIDs can be either column names or integer indices
anonymiseColumns <- function(df, colIDs) {
    ## Convert column names to integer positions so that renaming works either way
    ids <- if (is.character(colIDs)) match(colIDs, names(df)) else colIDs
    for (id in ids) {
        prefix <- sample(LETTERS, 1)
        suffix <- as.character(as.numeric(as.factor(df[[id]])))
        df[[id]] <- paste(prefix, suffix, sep = "")
    }
    ## Rename every anonymised column, not just the last one
    names(df)[ids] <- paste("V", ids, sep = "")
    df
}
## A data.frame containing sensitive information
df <- data.frame(
    name = rep(readLines(file.path(R.home("doc"), "AUTHORS"))[9:13], each = 2),
    hiscore = runif(10, 99, 100),
    passwd = replicate(10, paste(sample(c(LETTERS, letters), 9), collapse = "")))
## Anonymise it
df2 <- anonymiseColumns(df, c(1, 3))
## Check that it worked
> head(df, 3)
           name  hiscore    passwd
1 Douglas Bates 99.96714 ROELIAncz
2 Douglas Bates 99.07243 gDOLNMyVe
3 John Chambers 99.55322 xIVPHDuEW
> head(df2, 3)
  V1  hiscore V3
1 Q1 99.96714 V8
2 Q1 99.07243 V2
3 Q2 99.55322 V9
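One way to check that anonymisation keeps the grouping structure intact is to compare how the original and the anonymised values partition the rows. The following is only a minimal sketch: anonymise_vec is a hypothetical single-vector helper with a fixed prefix (not the function above), and the names are made-up example data.

```r
# Hypothetical helper: anonymise one vector using a fixed prefix
anonymise_vec <- function(x, prefix = "A") {
  paste0(prefix, as.integer(factor(x)))
}

# Made-up example values
x <- c("Douglas Bates", "Douglas Bates", "John Chambers", "Peter Dalgaard")
y <- anonymise_vec(x)

# Both vectors induce the same partition of rows if the index of the
# first occurrence of each value agrees position by position
same_groups <- identical(match(x, unique(x)), match(y, unique(y)))
same_groups
```

If same_groups is TRUE, rows that shared a value before anonymisation still share one afterwards, and no two distinct values were merged.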
How to pass either variable names (e.g. var_x) or transformations of variables (e.g. as.factor(var_x)) to the same function?
Try any of these.
my_fun1 <- function(data, x) eval(substitute(x), data)
my_fun2 <- function(data, ...) with(data, ...)
my_fun3 <- with
For example (and similarly for the others):
my_fun1(BOD, Time) # returns numeric object
## [1] 1 2 3 4 5 7
my_fun1(BOD, factor(Time)) # returns factor object
## [1] 1 2 3 4 5 7
## Levels: 1 2 3 4 5 7
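A small variation, offered as a sketch rather than a drop-in replacement: passing parent.frame() as the enclosing environment lets the expression also refer to variables that are local to the calling function. The names my_fun4, wrapper and k are illustrative and not part of the answer above.

```r
# Evaluate x inside data; anything not found there is looked up
# in the environment of whoever called my_fun4
my_fun4 <- function(data, x) {
  eval(substitute(x), data, parent.frame())
}

wrapper <- function() {
  k <- 2                    # local to wrapper, not visible globally
  my_fun4(BOD, Time * k)    # Time comes from BOD, k from wrapper
}

res <- wrapper()
res
```

This mirrors how with() resolves names for data frames: columns first, then the caller's environment.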
Anonymize data for each distinct row in R
Create a function that does the job:
anon <- function(x) {
    rl <- rle(x)$lengths
    ans <- paste("Value", rep(seq_along(rl), rl))
    return(ans)
}
Call the function:
anon(final_df$Value)
Result:
# [1] "Value 1" "Value 1" "Value 1" "Value 2" "Value 3" "Value 3" "Value 3"
Generalization to a whole data frame:
df1 <- mtcars
df1[] <- lapply(df1, anon)
names(df1) <- paste0("V", seq_along(names(df1)))
rownames(df1) <- NULL
df1
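Note that anon labels runs of consecutive equal values, so a value that reappears later in the vector gets a new label. If each distinct value should instead map to one label wherever it occurs, a sketch using match() and unique() could look like this (anon_distinct is a hypothetical name, and the input vector is made up):

```r
# Label each distinct value by the order of its first appearance
anon_distinct <- function(x) {
  paste("Value", match(x, unique(x)))
}

x <- c("a", "a", "b", "a", "c")
labels_distinct <- anon_distinct(x)
labels_distinct
```

Which of the two behaviours is appropriate depends on whether the data are sorted and on whether repeated values should be recognisable as the same entity.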
How to replace all columns in matrix with NA based on variable in another dataframe?
We can extract the missing column, use it as the row index in m1, and assign NA to the selected rows:
m1[df$missing,] <- NA
Output:
> m1
     [,1] [,2] [,3]
[1,]    0    0    1
[2,]    1    0    1
[3,]   NA   NA   NA
Or we may do:
> NA^(df$missing) * m1
     [,1] [,2] [,3]
[1,]    0    0    1
[2,]    1    0    1
[3,]   NA   NA   NA
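The question's data are not shown here, so the following self-contained sketch uses made-up inputs: a 3 x 3 matrix m1 and a data frame df whose missing column is assumed to be logical. It applies both approaches and checks that they agree.

```r
# Made-up inputs; 'missing' is assumed to be a logical column
m1 <- rbind(c(0, 0, 1),
            c(1, 0, 1),
            c(1, 1, 0))
df <- data.frame(id = 1:3, missing = c(FALSE, FALSE, TRUE))

# Approach 1: logical row indexing
m2 <- m1
m2[df$missing, ] <- NA

# Approach 2: NA^FALSE is 1 and NA^TRUE is NA; the vector is recycled
# down the columns, so element i multiplies every entry of row i
m3 <- NA^(df$missing) * m1

m2
```

The second form relies on length(df$missing) matching nrow(m1), so that recycling lines up with rows.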
Rename all variables that contain a particular string and add a sequential number
Update
From dplyr 1.0.0 you can use rename_with.
You can select columns to rename by position:
library(dplyr)
ds %>% rename_with(~paste0("var", seq_along(.), sub("nameverybig_*", "_", .)), -1)
Or by name:
ds %>% rename_with(~paste0("var", seq_along(.), sub("nameverybig_*", "_", .)),
starts_with('nameverybig'))
Both of which return:
#   identification var1_do_you_like_cookies var2_have_you_been_in_europe var3_whats_your_gender
#1               1                        1                            1                      1
#2               2                        2                            2                      2
#3               3                        3                            3                      3
#4               4                        4                            4                      4
#5               5                        5                            5                      5
#6               6                        6                            6                      6
#7               7                        7                            7                      7
#8               8                        8                            8                      8
#9               9                        9                            9                      9
#10             10                       10                           10                     10
Old Answer
You could use paste0 with sub:
ds %>% rename_all(~paste0("var", seq_along(.), sub("nameverybig_*", "_", .)))
To rename only specific variables, we can use rename_at:
ds %>% rename_at(vars(starts_with("nameverybig")),
~paste0("var", seq_along(.), sub("nameverybig_*", "_", .)))
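For comparison, the same renaming can be sketched in base R without dplyr. The data frame below is hypothetical; its column names are inferred from the output shown earlier, not taken from the original question.

```r
# Hypothetical data with the long column prefix
ds <- data.frame(
  identification = 1:3,
  nameverybig_do_you_like_cookies = 1:3,
  nameverybig_have_you_been_in_europe = 1:3,
  nameverybig_whats_your_gender = 1:3
)

# Pick the columns to rename, then build "var<n>_<rest of name>"
target <- startsWith(names(ds), "nameverybig")
old <- names(ds)[target]
names(ds)[target] <- paste0("var", seq_along(old),
                            sub("nameverybig_*", "_", old))

names(ds)
```

Because the counter is built from the selected columns only, the numbering starts at 1 regardless of where those columns sit in the data frame.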