Calculate Row Means Based on (Partial) Matching Column Names

R: Calculating row mean based on column name partial match

We can split and get the rowMeans

sapply(split.default(df1, sub("\\..*", "", names(df1))), rowMeans)
# as er op rt
#a 34.66667 3.5 87 4
#b 22.66667 4.5 9 5
#c 5.00000 7.5 6 9
#d 6.00000 0.5 6 3
#e 3.00000 8.0 7 89

Calculate row means based on (partial) matching column names

I'm sure it can be done more elegantly, but this is one possibility that seems to work.

# declare the column names
colnames = c("V1", "V2", "V3", "V4")

# calculate the means
means = lapply(colnames, function(name) { apply(DF[,grep(name, names(DF))], 1, mean) })

# build the result
result = do.call(cbind, means)
result = as.data.frame(t(result))
rownames(result) = DF$Date

I should also describe, what I did.

First, I declared the column names to be partially matched.

Then, using the grep command to partially select the columns in your data frame (that matched the particular substring). The apply command calculates the means and lapply does it for all columns partially matched by the substring.

Using do.call and cbind (as suggested by DWin), we concatenate individual columns.
Finally, we set the column names from the Date column of the original data frame.

The problem can be solved more elgantly and efficiently, see solutions by DWin and Maiasaura.

Calculate column means based on (partial) matching row names

Here's an alternative approach using tidyverse

X<-c(1,2,3,4,5,6,7,8,9)
Y<-c(2,4,6,8,10,12,14,16,18)
Z<-c(1,3,5,7,10,13,16,19,22)

Name<-c("AA.11", "AA.12", "AA.13", "AB.11", "AB.12", "AB.13", "AC.11", "AC.12", "AC.13")

DF<-data.frame(Name,X,Y,Z)

library(tidyverse)

df_res = tibble(rownames = c("AA", "AB", "AC", ".11", ".12", ".13")) %>% # save rownames of interest as a tibble
mutate(d = map(rownames, ~DF[grep(.x, DF$Name),]), # get the corresponding subsets
c_means = map(d, ~colMeans(.x[-1]))) %>% # get the corresponding column means
print() # print result

# # A tibble: 6 x 3
# rownames d c_means
# <chr> <list> <list>
# 1 AA <data.frame [3 x 4]> <dbl [3]>
# 2 AB <data.frame [3 x 4]> <dbl [3]>
# 3 AC <data.frame [3 x 4]> <dbl [3]>
# 4 .11 <data.frame [3 x 4]> <dbl [3]>
# 5 .12 <data.frame [3 x 4]> <dbl [3]>
# 6 .13 <data.frame [3 x 4]> <dbl [3]>

So far you have an output that has stored the subsets (d) the process used to get the column means and the column means themselves (c_means), for each rowname you provided.

You can access those stored variables like this:

# see the subset used for AC
df_res$d[df_res$rownames == "AC"]

# [[1]]
# Name X Y Z
# 7 AC.11 7 14 16
# 8 AC.12 8 16 19
# 9 AC.13 9 18 22

# see the column means for AC
df_res$c_means[df_res$rownames == "AC"]

# [[1]]
# X Y Z
# 8 16 19

# see all column means
df_res$c_means

# [[1]]
# X Y Z
# 2 4 3
#
# [[2]]
# X Y Z
# 5 10 10
#
# [[3]]
# X Y Z
# 8 16 19
#
# [[4]]
# X Y Z
# 4 8 8
#
# [[5]]
# X Y Z
# 5.00000 10.00000 10.66667
#
# [[6]]
# X Y Z
# 6.00000 12.00000 13.33333

How to grep columns matching a pattern and calculate the row means of those columns and add the mean values as a new column to the data frame in r?

An option is to remove the digits at the end (\\d+$) with sub, use that to split the dataset into a list of data.frames, get the rowMeans and assign it to new columns in the dataset

nm1 <- sub("\\d+$", "", names(df))
df[paste0(unique(nm1), "_mean")] <- sapply(split.default(df, nm1), rowMeans)

Match column names by sequential numeric pattern, and calculate rowMeans for all matches

a data.table approach using .SDcols with a regex-patterns for column selection

library(data.table)
setDT(DF)
DF[, Redbax := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^Red.*bax$")]
DF[, Redstr := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^Red.*str$")]
DF[, N34 := rowMeans(.SD, na.rm = TRUE), .SDcols = patterns("^N34.*str$")]
# A2 B3 st ba N34_303str Red1str Red2str Red7bax Red2bax Red9str Red1bax N34_22str Redbax Redstr N34
# 1: 1 3 1 3 2 3 2 3 2 3 3 3 2.666667 2.666667 2.5
# 2: 2 4 2 4 4 4 3 4 3 4 4 4 3.666667 3.666667 4.0
# 3: 2 3 2 3 2 5 4 5 4 5 5 2 4.666667 4.666667 2.0
# 4: 3 2 3 2 6 6 5 6 NA 6 6 6 6.000000 5.666667 6.0

R: Sorting columns based on partial match of column names with row names

We can loop through the rownames and grep to find the index of the column names that match, unlist and use that to arrange the columns

df1[unlist(lapply(gsub("\\d+", "", row.names(df1)), function(x) grep(x, names(df1))))]
#W3_FA_22 RR_BI_12 YU_BI_21 T2_KL_21 PL_EW_12 A1_LC_11 RT_LC_22
#FA 3 4 7 1 5 2 6
#BI 3 4 7 1 5 2 6
#KL 3 4 7 1 5 2 6
#EW 3 4 7 1 5 2 6
#LC 3 4 7 1 5 2 6

Subset data based on partial match of column names

You mentioned you may be looking for symbols, so for this particular example we can use [[:punct:]] as our regular expression. This will find all the strings with punctuation symbols in the column names.

d <- data.frame(1:3, 3:1, 11:13, 13:11, rep(1, 3))
names(d) <- c("FullColName1", "FullColName2", "FullColName3",
"PartString1()","PartString2()")

d[grepl("[[:punct:]]", names(d))]
# PartString1() PartString2()
# 1 13 1
# 2 12 1
# 3 11 1

This last part just illustrates another way to do this with other string processing functions from stringr

library(stringr)
d[str_detect(names(d), "[[:punct:]]")]
# PartString1() PartString2()
# 1 13 1
# 2 12 1
# 3 11 1

ADD per OPs comment

d[grepl("ring[12()]", names(d))]

to get either of the substrings ring1() or ring2() from the names vector

How do you identify column numbers based on partial string match in a way that can be used with lapply?

I'm not sure what your expected output is, but it seems that I can't reproduce the problem described above. Your approach seems to work here, even with non-syntactical names:

cols <- c(grep("support", colnames(df)))

df[cols] <- lapply(df[cols], factor,
levels = c("Strongly Oppose", "Oppose", "Neither oppose nor support", "Support", "Strongly Support"),
labels = c("Strongly Oppose", "Oppose", "Neither oppose nor support", "Support", "Strongly Support"))

str(df)

#> Classes 'tbl_df', 'tbl' and 'data.frame': 25 obs. of 7 variables:
#> $ mission support : Factor w/ 5 levels "Strongly Oppose",..: 5 5 4 4 5 4 4 5 4 5 ...
#> $ mission opposition : chr NA NA NA NA ...
#> $ ed support : Factor w/ 5 levels "Strongly Oppose",..: 5 5 4 4 5 4 4 5 4 4 ...
#> $ ed mission opposition: chr NA NA NA NA ...
#> $ non-agency engagement: chr "Yes" "No" "No" "No" ...
#> $ agency knowledge : chr "Yes" "Yes" "Yes" "Yes" ...
#> $ agency engagement : chr "Yes" "Yes" "Yes" "Yes" ...

Created on 2021-12-28 by the reprex package (v0.3.0)


Another option is to use dplyr::across instead of lapply:

library(dplyr)

df %>%
mutate(across(contains("support"),
factor,
levels = c("Strongly Oppose", "Oppose", "Neither oppose nor support", "Support", "Strongly Support"),
labels = c("Strongly Oppose", "Oppose", "Neither oppose nor support", "Support", "Strongly Support"))
) %>%
glimpse

#> Rows: 25
#> Columns: 7
#> $ `mission support` <fct> Strongly Support, Strongly Support, Support, S…
#> $ `mission opposition` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ `ed support` <fct> Strongly Support, Strongly Support, Support, S…
#> $ `ed mission opposition` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ `non-agency engagement` <chr> "Yes", "No", "No", "No", "Maybe", "No", "No", …
#> $ `agency knowledge` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No",…
#> $ `agency engagement` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", NA, NA, "Ye…

Created on 2021-12-28 by the reprex package (v0.3.0)



Related Topics



Leave a reply



Submit