Subset Data Based on Partial Match of Column Names

You mentioned you may be looking for symbols, so for this particular example we can use [[:punct:]] as the regular expression. It matches every column name that contains a punctuation character.

d <- data.frame(1:3, 3:1, 11:13, 13:11, rep(1, 3))
names(d) <- c("FullColName1", "FullColName2", "FullColName3",
"PartString1()","PartString2()")

d[grepl("[[:punct:]]", names(d))]
# PartString1() PartString2()
# 1 13 1
# 2 12 1
# 3 11 1

This last part illustrates another way to do the same thing, using str_detect() from the stringr package:

library(stringr)
d[str_detect(names(d), "[[:punct:]]")]
# PartString1() PartString2()
# 1 13 1
# 2 12 1
# 3 11 1

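Addition, per the OP's comment:

d[grepl("ring[12()]", names(d))]

This selects the columns whose names contain ring1 or ring2. Note that [12()] is a character class matching a single character, so the pattern does not require the literal parentheses of ring1() or ring2() to be present.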

R subset data.frame by column names using partial string match from another list

# Specify `interesting.list` items manually
df[,grep("P3170|C453", x=names(df))]
#> P3170.Tp2 C453.Tn7 P3170.Tn10
#> 1 1 3 5

# Use paste to create pattern from lots of items in `interesting.list`
il <- c("P3170", "C453")
df[,grep(paste(il, collapse = "|"), x=names(df))]
#> P3170.Tp2 C453.Tn7 P3170.Tn10
#> 1 1 3 5

Example data:

n <- c("P3170.Tp2" , "P3189.Tn10" ,"C453.Tn7" ,"F678.Tc23" ,"P3170.Tn10")
df <- data.frame(1,2,3,4,5)
names(df) <- n
Created on 2021-10-20 by the reprex package (v2.0.1)

Subset Columns based on partial matching of column names in the same data frame

You could try:

v <- unique(substr(names(eatable), 1, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])

Or using map() + select() (select_() is deprecated in current dplyr):

library(tidyverse)
map(v, ~select(eatable, matches(.x)))

Which gives:

#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581

Should you want to make it into a function:

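checkExpression <- function(df, l = 5) {
  # Unique prefixes of length l taken from the column names
  v <- unique(substr(names(df), 1, l))
  # One data frame per prefix, holding the columns that match it
  lapply(v, function(x) df[grepl(x, names(df))])
}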

Then simply use:

checkExpression(eatable, 5)

subset columns based on partial match and group level in python

This may not be exactly what you are looking for, but here it is anyway.

An open question is what to do with unmatched columns. The answer depends on what you plan to do after matching.

Plain python solution

This is simple wrangling with collections, though there may be a simpler way.

from collections import defaultdict

groups = defaultdict(list)
idsr = ids.to_records(index=False)
for col in df.columns:
    for prefix, group in idsr:
        if col.startswith(prefix):
            groups[group].append(col)
            break
    # The following 'else' clause is optional; it creates a group for unmatched columns
    else:  # for ... else: runs only if the inner loop did not break
        groups['UNGROUPED'].append(col)

groups =

{'sub': ['a23pz', 'c56-6u'], 'test': ['a19-76', 'b887', 'b59lp']}
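The question's `ids` and `df` are not shown here, so the following is only a minimal sketch of the same for ... else grouping on plain Python lists. The prefixes in `id_rows`, the extra column `zz9`, and the helper name `group_columns` are assumptions made up for illustration; they are not taken from the original data.

```python
from collections import defaultdict

# Hypothetical (prefix, group) pairs standing in for the rows of `ids`
id_rows = [("a2", "sub"), ("c5", "sub"), ("a1", "test"), ("b", "test")]
# Hypothetical column names standing in for `df.columns`
columns = ["a23pz", "a19-76", "b887", "b59lp", "c56-6u", "zz9"]


def group_columns(columns, id_rows, unmatched_label="UNGROUPED"):
    """Map each group to the columns whose name starts with one of its prefixes."""
    groups = defaultdict(list)
    for col in columns:
        for prefix, group in id_rows:
            if col.startswith(prefix):
                groups[group].append(col)
                break  # the first matching prefix wins
        else:
            # Reached only when the inner loop ended without `break`,
            # i.e. no prefix matched this column.
            groups[unmatched_label].append(col)
    return dict(groups)


groups = group_columns(columns, id_rows)
print(groups)
```

With these assumed inputs, `zz9` matches no prefix and ends up under the `UNGROUPED` key, while the other columns are grouped as in the output above.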

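Then reorder the data columns to match the sorted labels before relabeling them. Assigning to df.columns only renames columns by position; it does not move the data.

pairs = sorted((k, col) for k, cols in groups.items() for col in cols)
df = df[[col for _, col in pairs]]
df.columns = pd.MultiIndex.from_tuples(pairs)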

df =

     sub          test
   a23pz c56-6u a19-76 b59lp b887
0      0      0      0     0    0
1      1      1      1     1    1
2      2      2      2     2    2

pandas solution

  • Convert the column names to a dataframe
  • Take the cross product of the two dataframes (a join on a constant key)
  • Filter the resulting dataframe

There is surely a better way.

df1 = ids.copy()

df2 = df.columns.to_frame(index=False)
df2.columns = ['col']

# Untested enhancement:
# with pandas >= 1.2, the four lines below may be replaced by a single one:
# dfm = df1.merge(df2, how='cross')

df1['join'] = 1
df2['join'] = 1
dfm = df1.merge(df2, on='join').drop('join', axis=1)
df1.drop('join', axis=1, inplace = True)

dfm['match'] = dfm.apply(lambda x: x.col.find(x.id), axis=1).ge(0)
dfm = dfm[dfm.match][['group', 'col']].sort_values(by=['group', 'col'], axis=0)

dfm =

   group     col
6    sub   a23pz
24   sub  c56-6u
0   test  a19-76
18  test   b59lp
12  test    b887

# Note 1: The index can be dropped
# Note 2: Unmatched columns are not taken into account

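Then select the matched columns in the same order as dfm before relabeling them, so that the labels stay aligned with the data:

df = df[dfm['col'].tolist()]
df.columns = pd.MultiIndex.from_frame(dfm)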

df =

group    sub          test
col    a23pz c56-6u a19-76 b59lp b887
0          0      0      0     0    0
1          1      1      1     1    1
2          2      2      2     2    2

subset pandas df columns with partial string match OR match before ? using lists of names

You can build a regex dynamically for each of the df lists:

df_lists = [df1_lst, df2_lst, df3_lst]

result = [df.filter(regex=fr"\b({'|'.join(names)})\??") for names in df_lists]

For example, for the first list the regex is \b(ab|cd)\??, i.e. look for either ab or cd, which must start at a word boundary on the left (\b) and may be followed by an optional literal ?.

The desired entries are in the result list e.g.

>>> result[1]

efab? cba efab? 1 efab? 2
0 husband son None
1 wife grandparent son
2 husband son None
3 None None None
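The behaviour of this pattern can be checked without pandas, since DataFrame.filter(regex=...) keeps the labels for which re.search finds a match. The sketch below is illustrative only: the column labels are made up, `build_pattern` is a hypothetical helper, and the call to re.escape is an addition not present in the answer's one-liner (it guards against names that contain regex metacharacters).

```python
import re


def build_pattern(names):
    """Compile: word boundary, one of the (escaped) names, optional literal '?'."""
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(rf"\b({alternatives})\??")


# Hypothetical names and column labels, chosen only for illustration
pattern = build_pattern(["ab", "cd"])
columns = ["ab", "ab?", "x cd 1", "xab", "efab? 2"]

# Mimic DataFrame.filter(regex=...): keep labels where re.search matches
matched = [col for col in columns if pattern.search(col)]
print(matched)
```

Because only the left side is anchored, a label such as abc would also match this pattern. If that is not wanted, a right-hand boundary or lookahead would have to be added to the pattern.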

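how to subset a dataframe with partial match column names

Maybe you can try the following. The alternatives are wrapped in a group so that the ^ anchor applies to every prefix, not only the first one:

 dat[,grep(paste0("^(", paste(lst, collapse="|"), ")"), colnames(dat))]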

data

set.seed(42)
dat <- as.data.frame(matrix(sample(1:25,14*10, replace=TRUE), ncol=14))

colnames(dat) <- c("LC10096.2", "LD08.1593.s1", "LD08.1593.s2","LD08.1692.1",
"LD08.1692.2","LD09.10917.s1","LD09.10917.s2","LD10.10226-s1",
"LD10.10226-s2","LEC.12.6056.70","LEC.12.6113.02","M05.353086",
"Thore_t1","Thore_t5")

lst <- c("LD08.1593","LD09.10917","LD10.10226","M05.353086","Thore")

Subset data to contain only columns whose names match a condition

Try grepl on the names of your data.frame. grepl matches a regular expression to a target and returns TRUE if a match is found and FALSE otherwise. The function is vectorised so you can pass a vector of strings to match and you will get a vector of boolean values returned.

Example

#  Data
df <- data.frame( ABC_1 = runif(3),
ABC_2 = runif(3),
XYZ_1 = runif(3),
XYZ_2 = runif(3) )

# ABC_1 ABC_2 XYZ_1 XYZ_2
#1 0.3792645 0.3614199 0.9793573 0.7139381
#2 0.1313246 0.9746691 0.7276705 0.0126057
#3 0.7282680 0.6518444 0.9531389 0.9673290

# Use grepl
df[ , grepl( "ABC" , names( df ) ) ]
# ABC_1 ABC_2
#1 0.3792645 0.3614199
#2 0.1313246 0.9746691
#3 0.7282680 0.6518444

# grepl returns logical vector like this which is what we use to subset columns
grepl( "ABC" , names( df ) )
#[1] TRUE TRUE FALSE FALSE

To answer the second part, I'd make the subset data.frame and then make a vector that indexes the rows to keep (a logical vector) like this...

set.seed(1)
df <- data.frame( ABC_1 = sample(0:1,3,replace = TRUE),
                  ABC_2 = sample(0:1,3,replace = TRUE),
                  XYZ_1 = sample(0:1,3,replace = TRUE),
                  XYZ_2 = sample(0:1,3,replace = TRUE) )

# We will want to discard the second row because 'all' ABC values are 0:
# ABC_1 ABC_2 XYZ_1 XYZ_2
#1 0 1 1 0
#2 0 0 1 0
#3 1 1 1 0

df1 <- df[ , grepl( "ABC" , names( df ) ) ]

ind <- apply( df1 , 1 , function(x) any( x > 0 ) )

df1[ ind , ]
# ABC_1 ABC_2
#1 0 1
#3 1 1

R subsetting dataframe on partial string match or separating or split

Using base R functions, you could do:

# fixed = TRUE treats the dots literally instead of as regex wildcards
subset(df, grepl('2.16.840.1.113883.10.20.22.4.19', cOL2, fixed = TRUE))
col1 cOL2
1 1 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.1.47
4 4 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.1.47
5 5 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.1.47

How to subset dataframe using list that includes partial strings of another variable

You were on the right track: grepl is your friend. To use the vector of countries with it, paste them together into a single pattern, collapsing on the regex "or" operator |.

Then, using subset:

EU_p <- paste(EU, collapse='|')

subset(df, grepl(EU_p, a))
# a b
# 2 Croatia USA 2
# 4 Switzerland Hungary 4
# 5 Lithuania Indonesia 5

or as you indicated using brackets

df[grepl(EU_p, df$a), ]
# a b
# 2 Croatia USA 2
# 4 Switzerland Hungary 4
# 5 Lithuania Indonesia 5

The result is any row of df containing at least one country of the EU vector, since the pattern as is doesn't distinguish the position.


Data:

df <- structure(list(a = c("Albania Canada", "Croatia USA", "Mexico Egypt", 
"Switzerland Hungary", "Lithuania Indonesia"), b = c(1, 2, 3,
4, 5)), class = "data.frame", row.names = c(NA, -5L))

