Subset data based on partial match of column names
You mentioned you may be looking for symbols, so for this particular example we can use [[:punct:]]
as our regular expression. This selects every column whose name contains at least one punctuation character.
d <- data.frame(1:3, 3:1, 11:13, 13:11, rep(1, 3))
names(d) <- c("FullColName1", "FullColName2", "FullColName3",
"PartString1()","PartString2()")
d[grepl("[[:punct:]]", names(d))]
#   PartString1() PartString2()
# 1            13             1
# 2            12             1
# 3            11             1
This last part just illustrates another way to do the same thing, using str_detect() from the stringr package:
library(stringr)
d[str_detect(names(d), "[[:punct:]]")]
#   PartString1() PartString2()
# 1            13             1
# 2            12             1
# 3            11             1
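Addition, per the OP's comment:
d[grepl("ring[12()]", names(d))]
This selects the columns whose names contain ring followed by one of the characters 1, 2, ( or ), which here picks out the names containing ring1() or ring2(). To require the literal parentheses as well, use the pattern "ring[12]\\(\\)".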
R subset data.frame by column names using partial string match from another list
# Specify `interesting.list` items manually
df[,grep("P3170|C453", x=names(df))]
#>   P3170.Tp2 C453.Tn7 P3170.Tn10
#> 1         1        3          5
# Use paste to create pattern from lots of items in `interesting.list`
il <- c("P3170", "C453")
df[,grep(paste(il, collapse = "|"), x=names(df))]
#>   P3170.Tp2 C453.Tn7 P3170.Tn10
#> 1         1        3          5
Example data:
n <- c("P3170.Tp2" , "P3189.Tn10" ,"C453.Tn7" ,"F678.Tc23" ,"P3170.Tn10")
df <- data.frame(1,2,3,4,5)
names(df) <- n
Created on 2021-10-20 by the reprex package (v2.0.1)
Subset Columns based on partial matching of column names in the same data frame
You could try:
v <- unique(substr(names(eatable), 1, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select() (select_() is deprecated in current versions of dplyr):
library(tidyverse)
map(v, ~ select(eatable, matches(.x)))
Which gives:
#[[1]]
#  fruits_area fruits_production
#1          12               100
#2          33               250
#3         660               510
#
#[[2]]
#  vegetables_area vegetable_production
#1              26                  324
#2              40                  580
#3              43                  581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
  # Group columns by the first l characters of their names
  v <- unique(substr(names(df), 1, l))
  lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
subset columns based on partial match and group level in python
This may not be exactly what you are looking for, but here are two approaches.
An open question is what to do with unmatched columns; the answer depends on what you plan to do after matching.
Plain python solution
Simple collections wrangling, but there may be a simpler way.
from collections import defaultdict

groups = defaultdict(list)
idsr = ids.to_records(index=False)
for col in df.columns:
    for prefix, group in idsr:
        if col.startswith(prefix):
            groups[group].append(col)
            break
    # the following 'else' clause is optional; it creates a group for unmatched columns
    else:  # for ... else: runs only when the inner loop did not break
        groups['UNGROUPED'].append(col)
groups =
{'sub': ['a23pz', 'c56-6u'], 'test': ['a19-76', 'b887', 'b59lp']}
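Then, after reordering the columns so the data stays aligned with the new labels:
pairs = sorted((group, col) for group, cols in groups.items() for col in cols)
df = df[[col for _, col in pairs]]  # reorder the data to match the sorted labels
df.columns = pd.MultiIndex.from_tuples(pairs)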
df =
     sub          test
   a23pz c56-6u a19-76 b59lp b887
0      0      0      0     0    0
1      1      1      1     1    1
2      2      2      2     2    2
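The for ... else grouping above can be tried without pandas. The following is a minimal sketch, assuming a plain list of column names and a list of (id, group) pairs; the prefixes in id_rows, the extra column zz9 and the helper name group_columns are hypothetical and only illustrate the technique.

```python
from collections import defaultdict

# Hypothetical stand-ins for df.columns and for the rows of ids
columns = ["a23pz", "a19-76", "b887", "c56-6u", "b59lp", "zz9"]
id_rows = [("a19", "test"), ("a23", "sub"), ("b", "test"), ("c56", "sub")]


def group_columns(columns, id_rows, unmatched="UNGROUPED"):
    """Assign each column to the group of the first id it starts with."""
    groups = defaultdict(list)
    for col in columns:
        for prefix, group in id_rows:
            if col.startswith(prefix):
                groups[group].append(col)
                break
        else:
            # Reached only when no prefix matched (the inner loop never hit break)
            groups[unmatched].append(col)
    return dict(groups)


groups = group_columns(columns, id_rows)
```

Here zz9 matches no prefix, so it ends up under UNGROUPED; dropping the else clause would silently discard it instead.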
pandas solution
- Convert the column names to a dataframe
- Take the cross product of the two dataframes (join)
- Filter the resulting dataframe
There is surely a better way.
df1 = ids.copy()
df2 = df.columns.to_frame(index=False)
df2.columns = ['col']
# Not tested enhancement:
# with pandas version >= 1.2, the four following lines may be replaced by a single one :
# dfm = df1.merge(df2, how='cross')
df1['join'] = 1
df2['join'] = 1
dfm = df1.merge(df2, on='join').drop('join', axis=1)
df1.drop('join', axis=1, inplace = True)
dfm['match'] = dfm.apply(lambda x: x.col.find(x.id), axis=1).ge(0)
dfm = dfm[dfm.match][['group', 'col']].sort_values(by=['group', 'col'], axis=0)
dfm =
   group     col
6    sub   a23pz
24   sub  c56-6u
0   test  a19-76
18  test   b59lp
12  test    b887
# Note 1: The index can be removed
# Note 2: Unmatched columns are not taken into account
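Then, after reordering the columns to match dfm:
df = df[dfm['col'].tolist()]  # keep the data aligned with the sorted labels
df.columns = pd.MultiIndex.from_frame(dfm)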
df =
group    sub          test
col    a23pz c56-6u a19-76 b59lp b887
0          0      0      0     0    0
1          1      1      1     1    1
2          2      2      2     2    2
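The cross join followed by a filter can also be sketched with the standard library alone, which may make the logic easier to follow. This assumes the same kind of (id, group) rows and column names as above; the values in id_rows and columns are hypothetical.

```python
from itertools import product

# Hypothetical stand-ins for the rows of ids and for df.columns
id_rows = [("a19", "test"), ("a23", "sub"), ("b", "test"), ("c56", "sub")]
columns = ["a23pz", "a19-76", "b887", "c56-6u", "b59lp"]

# Cross product of every (id, group) row with every column name,
# keeping only the pairs where the id occurs somewhere in the column name
matches = sorted(
    (group, col)
    for (prefix, group), col in product(id_rows, columns)
    if col.find(prefix) >= 0
)
```

Note that find() matches the id anywhere in the column name, whereas the plain Python solution uses startswith(); which one is appropriate depends on how the ids relate to the column names.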
subset pandas df columns with partial string match OR match before ? using lists of names
You can build a regex dynamically for each list of names:
df_lists = [df1_lst, df2_lst, df3_lst]
result = [df.filter(regex=fr"\b({'|'.join(names)})\??") for names in df_lists]
E.g., for the first list, the regex is \b(ab|cd)\??, i.e. look for either ab or cd, but they should be standalone on the left side (\b), and there might be an optional ? afterwards.
The desired entries are in the result list, e.g.
>>> result[1]
efab? cba efab? 1 efab? 2
0 husband son None
1 wife grandparent son
2 husband son None
3 None None None
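To see what such a pattern keeps and drops, it can be tried on a few labels with the re module, since DataFrame.filter(regex=...) keeps the labels for which re.search finds a match. This is a small sketch; the names list and the column labels are made up for illustration.

```python
import re

# Hypothetical list of names and column labels
names = ["ab", "cd"]
columns = ["ab?", "xab", "cd 1", "efab? 2", "ab_total"]

# Same construction as above: word boundary on the left, optional "?" on the right
pattern = re.compile(rf"\b({'|'.join(names)})\??")

# Keep the labels where the pattern is found anywhere in the label
selected = [c for c in columns if pattern.search(c)]
```

With these labels, xab and efab? 2 are dropped because ab is preceded by a word character, while ab_total is kept because nothing constrains the right-hand side. If the names could contain regex metacharacters, passing each one through re.escape before joining would be a sensible precaution.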
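how to subset a dataframe with partial match column names
Maybe you can try the following. The alternatives are wrapped in parentheses so that ^ anchors every prefix, not just the first one:
dat[, grep(paste0("^(", paste(lst, collapse="|"), ")"), colnames(dat))]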
data
set.seed(42)
dat <- as.data.frame(matrix(sample(1:25,14*10, replace=TRUE), ncol=14))
colnames(dat) <- c("LC10096.2", "LD08.1593.s1", "LD08.1593.s2","LD08.1692.1",
"LD08.1692.2","LD09.10917.s1","LD09.10917.s2","LD10.10226-s1",
"LD10.10226-s2","LEC.12.6056.70","LEC.12.6113.02","M05.353086",
"Thore_t1","Thore_t5")
lst <- c("LD08.1593","LD09.10917","LD10.10226","M05.353086","Thore")
Subset data to contain only columns whose names match a condition
Try grepl on the names of your data.frame. grepl matches a regular expression to a target and returns TRUE if a match is found and FALSE otherwise. The function is vectorised, so you can pass a vector of strings to match and you will get a vector of boolean values returned.
Example
# Data
df <- data.frame( ABC_1 = runif(3),
ABC_2 = runif(3),
XYZ_1 = runif(3),
XYZ_2 = runif(3) )
# ABC_1 ABC_2 XYZ_1 XYZ_2
#1 0.3792645 0.3614199 0.9793573 0.7139381
#2 0.1313246 0.9746691 0.7276705 0.0126057
#3 0.7282680 0.6518444 0.9531389 0.9673290
# Use grepl
df[ , grepl( "ABC" , names( df ) ) ]
# ABC_1 ABC_2
#1 0.3792645 0.3614199
#2 0.1313246 0.9746691
#3 0.7282680 0.6518444
# grepl returns logical vector like this which is what we use to subset columns
grepl( "ABC" , names( df ) )
#[1] TRUE TRUE FALSE FALSE
To answer the second part, I'd make the subset data.frame and then make a vector that indexes the rows to keep (a logical vector) like this...
set.seed(1)
df <- data.frame( ABC_1 = sample(0:1, 3, replace = TRUE),
                  ABC_2 = sample(0:1, 3, replace = TRUE),
                  XYZ_1 = sample(0:1, 3, replace = TRUE),
                  XYZ_2 = sample(0:1, 3, replace = TRUE) )
# We will want to discard the second row because 'all' ABC values are 0:
# ABC_1 ABC_2 XYZ_1 XYZ_2
#1 0 1 1 0
#2 0 0 1 0
#3 1 1 1 0
df1 <- df[ , grepl( "ABC" , names( df ) ) ]
ind <- apply( df1 , 1 , function(x) any( x > 0 ) )
df1[ ind , ]
# ABC_1 ABC_2
#1 0 1
#3 1 1
R subsetting dataframe on partial string match or separating or split
Using base R functions, you could do:
# fixed = TRUE treats the dots literally rather than as regex wildcards
subset(df, grepl('2.16.840.1.113883.10.20.22.4.19', cOL2, fixed = TRUE))
col1 cOL2
1 1 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.1.47
4 4 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.1.47
5 5 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.22.4.19 | 2.16.840.1.113883.10.20.1.47
How to subset dataframe using list that includes partial strings of another variable
You were on the right track; grepl is your friend. To use the countries with it, paste them together into a single pattern, collapsing on the regex "or" operator |.
Then, using subset:
EU_p <- paste(EU, collapse='|')
subset(df, grepl(EU_p, a))
# a b
# 2 Croatia USA 2
# 4 Switzerland Hungary 4
# 5 Lithuania Indonesia 5
or, as you indicated, using brackets:
df[grepl(EU_p, df$a), ]
# a b
# 2 Croatia USA 2
# 4 Switzerland Hungary 4
# 5 Lithuania Indonesia 5
The result is any row of df containing at least one country of the EU vector, since the pattern as written does not distinguish the position of the match within the string.
Data:
df <- structure(list(a = c("Albania Canada", "Croatia USA", "Mexico Egypt",
"Switzerland Hungary", "Lithuania Indonesia"), b = c(1, 2, 3,
4, 5)), class = "data.frame", row.names = c(NA, -5L))