Apply a Function to All Variables Starting with Specific Pattern in R

To do exactly what the OP asked for, mapply(c, test1, test2, ..., testn), use:

do.call(mapply, c(FUN = c, mget(paste0("test", 1:n))))

If you don't know how many (n) lists you have and want to find them using a pattern:

do.call(mapply, c(FUN = c, mget(ls(pattern = "^test\\d+$"))))

Like the other answers so far, the ls() approach does not return the objects in numeric order once there are more than nine of them, because ls() sorts names alphabetically (test10 comes before test2). A longer but more robust version sorts on the numeric suffix:

test.lists    <- ls(pattern = "^test\\d+$")
ordered.lists <- test.lists[order(as.integer(sub("test", "", test.lists)))]
do.call(mapply, c(FUN = c, mget(ordered.lists)))
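
To see the ordering problem concretely, here is a small self-contained sketch. The environment env and the eleven two-element lists are made up for illustration; they are not the OP's data.

```r
# Hypothetical setup: eleven small lists named test1 ... test11, kept in their
# own environment so the example does not touch the global workspace.
env <- new.env()
for (i in 1:11) {
  assign(paste0("test", i), list(a = i, b = i * 10), envir = env)
}

nms <- ls(envir = env, pattern = "^test\\d+$")
head(nms, 3)   # alphabetical order: "test1" "test10" "test11"

# Reorder by the numeric suffix before combining
ordered <- nms[order(as.integer(sub("^test", "", nms)))]
combined <- do.call(mapply, c(list(FUN = c), mget(ordered, envir = env)))

# One column per list element; rows follow test1, test2, ..., test11
combined[, "a"]
```

Because mget() is given the reordered names, the rows of combined follow the numeric order rather than the alphabetical one.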

apply function to all variables with string in name

Normally one would group such variables in a list, but if that is not an option we can do this:

for(nm in ls(pattern = "^VAR")) .GlobalEnv[[nm]] <- as.character(.GlobalEnv[[nm]])

Environment that is not the global environment

If you have these in an environment that is not the global environment then modify this as follows. The first line of the function body defines the test data, the next line puts the current environment in a variable e for convenience and the line after that performs the transformations. Finally we check what the variables have been transformed to.

f <- function() {
  VAR1 <- 1; VAR2 <- 2; VAR3 <- 3      # test data
  e <- environment()                   # current environment
  for (nm in ls(pattern = "^VAR")) e[[nm]] <- as.character(e[[nm]])
  str(VAR1); str(VAR2); str(VAR3)      # check results
}
f()
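
The same loop can be wrapped in a small reusable helper. This is only a sketch: apply_to_matching is a hypothetical name, not a base R function, and it overwrites the matching variables in place.

```r
# Hypothetical helper: apply `fun` to every variable in `envir` whose name
# matches `pattern`, overwriting each variable with the result.
apply_to_matching <- function(pattern, fun, envir = parent.frame()) {
  for (nm in ls(envir = envir, pattern = pattern)) {
    assign(nm, fun(get(nm, envir = envir, inherits = FALSE)), envir = envir)
  }
  invisible(envir)
}

# Usage on a throwaway environment, so the global workspace is untouched
e <- new.env()
assign("VAR1", 1, envir = e)
assign("VAR2", 2, envir = e)
assign("other", 3, envir = e)

apply_to_matching("^VAR", as.character, envir = e)
class(e$VAR1)   # "character"
class(e$other)  # "numeric"
```

Defaulting envir to parent.frame() means that calling the helper at top level acts on the global environment, which mirrors the one-line loop above.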

List

If you can arrange that these are in a list instead then:

L <- list(VAR1 = 1, VAR2 = 2, VAR3 = 3) # test data
L <- lapply(L, as.character)

or if there are some elements that are not to be processed:

L2 <- list(VAR1 = 1, VAR2 = 2, VAR3 = 3, other = 4) # test data
ix <- grep("^VAR", names(L2))
L2[ix] <- lapply(L2[ix], as.character)

If you don't want to overwrite L and L2 -- overwriting tends to make debugging more difficult -- then use Lnew <- lapply(L, as.character) and L2new <- replace(L2, ix, lapply(L2[ix], as.character)) instead.
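
A minimal sketch of the non-overwriting variant, reusing the test data from above. sapply(..., class) is only there to inspect the result.

```r
L2 <- list(VAR1 = 1, VAR2 = 2, VAR3 = 3, other = 4)   # test data
ix <- grep("^VAR", names(L2))

# replace() returns a modified copy; L2 itself is left unchanged
L2new <- replace(L2, ix, lapply(L2[ix], as.character))

sapply(L2, class)      # all "numeric"
sapply(L2new, class)   # VAR1..VAR3 "character", other "numeric"
```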

How to get all variables with pattern in name into a list while inside function

You can create an environment and then create variables inside it. Calling ls() with that environment and a suitable pattern then returns the names of the variables in the environment that match the pattern.

test_function <- function(x) {
  myenv <- new.env()
  myenv$hello1 <- "hello1"
  myenv$hello2 <- "hello2"
  myenv$cello2 <- "hello2"
  mylist <- ls(name = myenv, pattern = "hello")
  print(mylist)
}
test_function(1)
# [1] "hello1" "hello2"

You can use mget to extract values for a list of variables inside an environment.

test_function <- function(x, y, z, pattern) {
  myenv <- new.env()
  ls_vars <- list(hello1 = x,
                  hello2 = y,
                  cello2 = z)
  list2env(ls_vars, myenv)                         # add list of variables to myenv environment
  newvar <- "hello3"
  assign(newvar, value = "dfsfsf", envir = myenv)  # assign new variable
  mylist <- ls(name = myenv, pattern = pattern)
  return(mget(mylist, envir = myenv))
}
test_function(x = "hello1", y = "hello2", z = "sdfsd", pattern = "hello")
# $hello1
# [1] "hello1"
#
# $hello2
# [1] "hello2"
#
# $hello3
# [1] "dfsfsf"

test_function(x = "hello1", y = "hello2", z = "sdfsd", pattern = "cello")
# $cello2
# [1] "sdfsd"
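
If the variables are ordinary locals of the function rather than members of a separate environment, the same ls()/mget() pair can be pointed at the function's own environment. A minimal sketch, with a hypothetical function name and made-up values:

```r
collect_hello <- function() {
  hello1 <- "a"
  hello2 <- "b"
  cello2 <- "c"
  e <- environment()   # the function's own environment
  # Names are matched first, then their values are fetched as a named list
  mget(ls(envir = e, pattern = "^hello"), envir = e)
}

res <- collect_hello()
names(res)   # "hello1" "hello2"
```

Anchoring the pattern with ^ keeps a variable such as cello2, or one that merely contains "hello" later in its name, out of the result.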

Apply function to several variables with same name pattern

Use grepl to match the names of the columns you want to operate on. It returns a logical vector that you can use inside the [ operator to subset the data frame. Because log10 is vectorised, you can apply it to the whole subset at once:

df[ , grepl( "htotal_" , names( df ) ) ] <-  -log10( df[ , grepl( "htotal_" , names( df ) ) ] )

Vectorised example

# Set up the data (sample() is random, so your values will differ)
df <- data.frame( matrix( sample( c(1,10,1000) , 16 , replace = TRUE ) , 4 , 4 ) )
names( df ) <- c("htotal_1" , "htotal_2" , "not1" , "not2" )
#   htotal_1 htotal_2 not1 not2
# 1       10       10   10 1000
# 2       10       10    1   10
# 3     1000        1    1 1000
# 4       10     1000   10 1000

df[ , grepl( "htotal_" , names( df ) ) ] <- -log10( df[ , grepl( "htotal_" , names( df ) ) ] )

#   htotal_1 htotal_2 not1 not2
# 1       -1       -1   10 1000
# 2       -1       -1    1   10
# 3       -3        0    1 1000
# 4       -1       -3   10 1000
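
When the function is not vectorised over a whole data frame, or when the pattern should only match at the start of the name, a column-by-column variant with lapply does the same job. This is a sketch on a small fixed data frame (the values are invented so the result is reproducible):

```r
df <- data.frame(htotal_1 = c(10, 1000),
                 htotal_2 = c(1, 100),
                 not1     = c(5, 6))

# "^htotal_" is anchored, so a column such as "x_htotal_1" would not match
cols <- grep("^htotal_", names(df))

# Apply the function to each selected column and assign the results back
df[cols] <- lapply(df[cols], function(x) -log10(x))
df
```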

Apply a function to every specified column in a data.table and update by reference

This seems to work:

dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]

The result is

    a  b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3

There are a few tricks here:

  • Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
  • .SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
  • lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).

EDIT: Here's another way that is probably faster, as @Arun mentioned:

for (j in cols) set(dt, j = j, value = -dt[[j]])
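
For a reproducible run, the snippet below supplies a dt and cols that are consistent with the printed result above. They are assumptions, since the question's data is not shown here, and the anonymous function is just a more explicit spelling of the "*", -1 call.

```r
library(data.table)

# Assumed inputs, chosen to match the output shown above
dt   <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

# Update the selected columns by reference; column d is left alone
dt[, (cols) := lapply(.SD, function(x) -x), .SDcols = cols]
dt
```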

Apply filter criteria to variables that contain/start with certain string in R

In base R you can use lapply or sapply:

d[Reduce(`|`, lapply(d[-1], grepl, pattern = 'd')), ]
#d[rowSums(sapply(d[-1], grepl, pattern = 'd')) > 0, ]

#   ID test1 test2 test3 test4
# 2  b     b     b     c     d
# 4  d     d     a     c     a
# 5  e     a     s     d     f
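
The d[-1] above relies on ID being the first column. A variant that picks the columns by name pattern instead is sketched below. The data frame is hypothetical: rows 2, 4 and 5 follow the output shown above, while rows 1 and 3 are invented.

```r
d <- data.frame(
  ID    = c("a", "b", "c", "d", "e"),
  test1 = c("a", "b", "c", "d", "a"),
  test2 = c("b", "b", "a", "a", "s"),
  test3 = c("c", "c", "b", "c", "d"),
  test4 = c("a", "d", "c", "a", "f"),
  stringsAsFactors = FALSE
)

# Select the columns whose names start with "test"
test_cols <- grep("^test", names(d))

# One logical vector per column, combined with OR across columns
keep <- Reduce(`|`, lapply(d[test_cols], grepl, pattern = "d"))
d[keep, ]
```

Row 4 is kept because of the "d" in test1, not because of its ID, since ID is not among the selected columns.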

If you are interested in a dplyr solution, you can use any of the methods below:

library(dplyr)
library(stringr)

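# 1. filter_at() with any_vars() (superseded in recent dplyr releases, but still works)
d %>%
  filter_at(vars(starts_with('test')), any_vars(str_detect(., 'd')))

# 2. Row-wise check across the selected columns
d %>%
  rowwise() %>%
  filter(any(str_detect(c_across(starts_with('test')), 'd')))

# 3. Combine one logical vector per column with Reduce()
d %>%
  filter(Reduce(`|`, across(starts_with('test'), str_detect, 'd')))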

How to apply the same function to several variables in R?

Here is an option

library(dplyr)
library(stringr)
library(purrr)
map(actorlist, ~ df %>%
      select(.x) %>%
      filter(!str_detect(!! rlang::sym(.x), "^s\\d+$")) %>%
      pull(1))
#[[1]]
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"

#[[2]]
#[1] "nons2" "nons6" "nons1" "nons4"

It can be wrapped in a function as well. Note that the input is a string, so instead of enquo, use sym to convert it to a symbol and then unquote it with !!:

f1 <- function(dat, colNm) {
  dat %>%
    select(colNm) %>%
    filter(!str_detect(!! rlang::sym(colNm), "^s\\d+$")) %>%
    pull(1) %>%
    unique()
}

map(actorlist, f1, dat = df)

NOTE: This can be done more easily, but here we are using similar code from the OP's post


Another option in base R is to use split with grepl, which returns a list holding both the 'nons' and the 's' values after removing the NAs:

lapply(df[2:3], function(x) {
  x1 <- x[!is.na(x)]
  split(x1, grepl("nons", x1))
})
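
To see what the split step returns for a single column, here is a small sketch on a made-up character vector (the values are illustrative, not the OP's data):

```r
x <- c("s1", "nons1", NA, "s22", "nons2")   # hypothetical column

x1 <- x[!is.na(x)]                          # drop the NAs first
parts <- split(x1, grepl("^nons", x1))      # groups are named "FALSE" and "TRUE"

parts[["TRUE"]]    # values starting with "nons"
parts[["FALSE"]]   # everything else
```

The list elements are named after the logical grouping values, so the "nons" entries are found under "TRUE".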

Set all values that start with a certain string to NA in dplyr: is.na(), na_if(), startsWith(), regex

If you're able to do it for one column using mutate, you should be able to do it for multiple columns using mutate_at() or mutate_all(), explained here: https://dplyr.tidyverse.org/reference/mutate_all.html

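Without knowing what your data looks like, I think you'd want mutate_all(), which applies the function to every column and so changes any value that matches your condition.

In this example using the iris dataset, we replace the first 5 in each value with the word five (str_replace() only changes the first match; str_replace_all() would change every match):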

iris %>%
  tibble %>%
  mutate_all(function(x) str_replace(x, '5', 'five'))

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <chr>        <chr>       <chr>        <chr>       <chr>
 1 five.1       3.five      1.4          0.2         setosa
 2 4.9          3           1.4          0.2         setosa
 3 4.7          3.2         1.3          0.2         setosa
 4 4.6          3.1         1.five       0.2         setosa
 5 five         3.6         1.4          0.2         setosa
 6 five.4       3.9         1.7          0.4         setosa
 7 4.6          3.4         1.4          0.3         setosa
 8 five         3.4         1.five       0.2         setosa
 9 4.4          2.9         1.4          0.2         setosa
10 4.9          3.1         1.five       0.1         setosa

Or, closer to your condition, we can do this only when the value starts with 5, using the regular expression ^5 (^ anchors the match to the start of the string, so ^5 matches a 5 only at the beginning):

iris %>%
  tibble %>%
  mutate_all(function(x) str_replace(x, '^5', 'five'))

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <chr>        <chr>       <chr>        <chr>       <chr>
 1 five.1       3.5         1.4          0.2         setosa
 2 4.9          3           1.4          0.2         setosa
 3 4.7          3.2         1.3          0.2         setosa
 4 4.6          3.1         1.5          0.2         setosa
 5 five         3.6         1.4          0.2         setosa
 6 five.4       3.9         1.7          0.4         setosa
 7 4.6          3.4         1.4          0.3         setosa
 8 five         3.4         1.5          0.2         setosa
 9 4.4          2.9         1.4          0.2         setosa
10 4.9          3.1         1.5          0.1         setosa

Update: To change the entire value when it starts with a 5, swap str_replace for a function that returns a whole new value. In this case, we use ifelse(). Note that ifelse() drops the factor attributes of Species, which is why that column prints as integer codes below:

iris %>%
  tibble %>%
  mutate_all(function(x) ifelse(str_detect(x, '^5'), 'had_five', x))

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <chr>              <dbl> <chr>              <dbl>   <int>
 1 had_five             3.5 1.4                  0.2       1
 2 4.9                  3   1.4                  0.2       1
 3 4.7                  3.2 1.3                  0.2       1
 4 4.6                  3.1 1.5                  0.2       1
 5 had_five             3.6 1.4                  0.2       1
 6 had_five             3.9 1.7                  0.4       1
 7 4.6                  3.4 1.4                  0.3       1
 8 had_five             3.4 1.5                  0.2       1
 9 4.4                  2.9 1.4                  0.2       1
10 4.9                  3.1 1.5                  0.1       1

Another update: From your comments, it sounds like you want to apply the function to character columns only. To do this, replace mutate_all(your_fun) with mutate_if(is.character, your_fun), as described in the help page linked at the start of this answer (the same page covers mutate_all, mutate_if and mutate_at).

Using your sample data as an example, we can set anything beginning with '0' to NA. I am confused by your example though - do you want to look for '0' or '0\n(' at the start of the string? Either way, this is how to do it:

# sample data
string <- c("asff", "1\n(", '0asfd', '0\n(asdf)')
num <- c(0,1,2,3)
df <- data.frame(string, num)

# for only a 0 at the start of the string
df %>%
  mutate_if(is.character, function(x) ifelse(str_detect(x, '^0'), NA, x))

  string num
1   asff   0
2   1\n(   1
3   <NA>   2
4   <NA>   3

# for '0\n(' at the start of the string
df %>%
  mutate_if(is.character, function(x) ifelse(str_detect(x, '^0\\n\\('), NA, x))

  string num
1   asff   0
2   1\n(   1
3  0asfd   2
4   <NA>   3
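
For a fixed prefix, the same result can be had in base R with startsWith(), which needs no regular expression and therefore no escaping. A minimal sketch on the sample data from above (stringsAsFactors = FALSE is set explicitly so the column is character in any R version):

```r
df <- data.frame(string = c("asff", "1\n(", "0asfd", "0\n(asdf)"),
                 num    = c(0, 1, 2, 3),
                 stringsAsFactors = FALSE)

# Logical index of the character columns
chr_cols <- vapply(df, is.character, logical(1))

# In each character column, set values that begin with "0" to NA
df[chr_cols] <- lapply(df[chr_cols], function(x) {
  replace(x, startsWith(x, "0"), NA)
})
df
```

Using "0\n(" as the prefix instead would leave "0asfd" untouched and only blank out the last value.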

Function to look for different patterns in specific positions in a string in R

base R

rowSums(outer(strings, seq_len(nrow(mutations)),
              function(st, i) {
                substr(st, mutations$position[i], mutations$position[i]) == mutations$AA[i]
              }))
# [1] 2 1 1

Walk-through:

  • outer does not call the function once per pair. It calls it once, with two equal-length vectors that together enumerate the Cartesian product of the two arguments. If we insert a browser() as the first line of the inner anonymous function, we'd see

    data.frame(st, i)
    #                          st i
    # 1           EVQLVESGGGLAKPG 1
    # 2          VQLVESGGGLAKPGGS 1
    # 3 EVQLVESGGALAKPGGSLRLSCAAS 1
    # 4           EVQLVESGGGLAKPG 2
    # 5          VQLVESGGGLAKPGGS 2
    # 6 EVQLVESGGALAKPGGSLRLSCAAS 2

    (Shown as a frame only for a columnar presentation. Both st and i are simple vectors.)

    From here, knowing that substr is vectorized across all arguments, a single call to substr extracts the character at position mutations$position[i] from each of the strings.

  • The result of the substr is a vector of letters. Continuing the same browser() session from above,

    substr(st, mutations$position[i], mutations$position[i])
    # [1] "G" "G" "G" "G" "L" "A"
    mutations$AA[i]
    # [1] "G" "G" "G" "G" "G" "G"
    substr(st, mutations$position[i], mutations$position[i]) == mutations$AA[i]
    # [1] TRUE TRUE TRUE TRUE FALSE FALSE

    The mutations$AA[i] shows us what we're looking for. A nice property of the vectorized method here is that mutations$AA[i] is always the same length as, and in the same order as, the letters retrieved by substr(.).

  • The outer itself returns a matrix, with length(X) rows and length(Y) columns (X and Y are the first and second args to outer, respectively).

    outer(strings, seq_len(nrow(mutations)),
          function(st, i) {
            substr(st, mutations$position[i], mutations$position[i]) == mutations$AA[i]
          })
    #      [,1]  [,2]
    # [1,] TRUE  TRUE
    # [2,] TRUE FALSE
    # [3,] TRUE FALSE
    The number of correct mutations found in each string is just a sum of each row. (Ergo rowSums.)


If you're concerned about memory because of a large number of mutations and strings, you can replace the outer and iterate over each row of mutations instead:

rowSums(sapply(seq_len(nrow(mutations)), function(i) {
  substr(strings, mutations$position[i], mutations$position[i]) == mutations$AA[i]
}))
# [1] 2 1 1

This calls substr once for each mutations row, so if the outer-explosion is too much, this might reduce the memory footprint.
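
The question's inputs are not reproduced here, so the sketch below uses the three strings from the walk-through together with a hypothetical mutations table (positions 9 and 10, both expecting "G") that is consistent with the outputs quoted above. It runs both variants and compares them.

```r
# Strings taken from the walk-through; the mutations table is an assumption
strings <- c("EVQLVESGGGLAKPG", "VQLVESGGGLAKPGGS", "EVQLVESGGALAKPGGSLRLSCAAS")
mutations <- data.frame(position = c(9, 10), AA = c("G", "G"),
                        stringsAsFactors = FALSE)

# Variant 1: one substr call per mutation row
hits <- sapply(seq_len(nrow(mutations)), function(i) {
  substr(strings, mutations$position[i], mutations$position[i]) == mutations$AA[i]
})
counts <- rowSums(hits)

# Variant 2: a single vectorised call via outer()
counts_outer <- rowSums(outer(strings, seq_len(nrow(mutations)), function(st, i) {
  substr(st, mutations$position[i], mutations$position[i]) == mutations$AA[i]
}))

counts         # 2 1 1
counts_outer   # same result
```

With these inputs, all three strings have a G at position 9, but only the first has a G at position 10, which gives the 2 1 1 shown earlier.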


