Add Column to Data Frame Which Returns 1 If String Matches a Certain Pattern

How about

iris$check <- as.numeric(grepl(".*(sa)", iris$Species))

grepl returns a logical vector (TRUE/FALSE), which as.numeric converts to 1/0.

Also possible:

iris$check <- grepl(".*(sa)", iris$Species) + 0L
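
For readers working in Python, the same boolean-to-integer idea can be sketched with the standard re module. The species list is a hypothetical stand-in for iris$Species, and the shorter pattern "sa" is assumed to behave like ".*(sa)" for a plain match test:

```python
import re

# Hypothetical stand-in for iris$Species.
species = ["setosa", "versicolor", "virginica", "setosa"]

# re.search returns a match object or None; bool() gives True/False,
# and int() turns that into 1/0, mirroring as.numeric(grepl(...)).
check = [int(bool(re.search("sa", s))) for s in species]
print(check)  # [1, 0, 0, 1]
```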

Create new column in dataframe based on partial string matching other column

Since you have only two conditions, you can use a nested ifelse:

# random data; it wasn't easy to copy-paste yours
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
  c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
    "asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))

# Spell out TRUE: the shorthand T is an ordinary variable that can be reassigned.
DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = TRUE), "Materials",
           ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))

DF
#    GL         GLDESC      KIND
# 1   8        e gas l Materials
# 2   1  c payroll12 y   Payroll
# 3  10      m GaSer v Materials
# 4   6       t asdf n     Other
# 5   2      w qweaa t     Other
# 6   4 r PayROll-12 q   Payroll
# 7   9      n asdfg a     Other
# 8   5     d GAS--2 w Materials
# 9   7     s fghfgh e     Other
# 10  3      g qweee k     Other

EDIT 10/3/2016 (..after receiving more attention than expected)

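A possible solution for handling more patterns is to iterate over all of them and, whenever there is a match, progressively reduce the number of comparisons. Each pattern is tested only against the elements that no earlier pattern has matched, so the order of the patterns determines which label wins: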

ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
    stopifnot(length(patterns) == length(replacements))

    ans = rep_len(as.character(fill), length(x))
    empty = seq_along(x)

    for(i in seq_along(patterns)) {
        greps = grepl(patterns[[i]], x[empty], ...)
        ans[empty[greps]] = replacements[[i]]
        empty = empty[!greps]
    }

    return(ans)
}

ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll" "Materials" "Other" "Other" "Payroll" "Other" "Materials" "Other" "Other"

ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat1a|pat1b", "pat2", "pat3"),
c("1", "2", "3"), fill = "empty")
#[1] "1" "1" "3" "empty"

ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat2", "pat1a|pat1b", "pat3"),
c("2", "1", "3"), fill = "empty")
#[1] "2" "1" "3" "empty"
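
The same first-match-wins loop can be sketched in Python with the standard re module. This is a rough port rather than the author's code; the function name first_match_label and its flags parameter are hypothetical:

```python
import re

def first_match_label(values, patterns, replacements=None, fill=None, flags=0):
    """Label each value with the replacement of the first pattern that matches it."""
    if replacements is None:
        replacements = patterns
    if len(patterns) != len(replacements):
        raise ValueError("patterns and replacements must have the same length")

    labels = [fill] * len(values)
    remaining = list(range(len(values)))  # positions not labelled yet

    for pattern, replacement in zip(patterns, replacements):
        regex = re.compile(pattern, flags)
        still_unmatched = []
        for i in remaining:
            if regex.search(values[i]):
                labels[i] = replacement
            else:
                still_unmatched.append(i)
        remaining = still_unmatched  # later patterns only see unmatched values

    return labels

texts = ["pat1a pat2", "pat1a pat1b", "pat3", "pat4"]
result_a = first_match_label(texts, ["pat1a|pat1b", "pat2", "pat3"],
                             ["1", "2", "3"], fill="empty")
result_b = first_match_label(texts, ["pat2", "pat1a|pat1b", "pat3"],
                             ["2", "1", "3"], fill="empty")
print(result_a)  # ['1', '1', '3', 'empty']
print(result_b)  # ['2', '1', '3', 'empty']
```

Passing flags=re.IGNORECASE plays the role of ignore.case = TRUE in the R version.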

Search for string pattern in dataframe column, return each occurrence and join to another dataframe

The pattern piece is a good start, but then you have to merge / join it with the original dataframe:

import re

# name the index so that join can align on it.
df.index.name = "inx"
pattern = re.compile(r'(\[[\w ]+\]\.\[[\w ]+\])')

# extract the attributes.
extracts = df.MDX_TEXT.str.extractall(pattern).rename(columns={0:"attrname"})

# join the result with the original dataframe.
res = df.join(extracts).reset_index()[["ID", "USER", "attrname"]].drop_duplicates()

# take just the last part of each attribute name.
res["attrname"] = res["attrname"].str.split(".", expand = True).iloc[:, -1]

The result is:

   ID USER attrname
0   1  JOE  [ATTR1]
1   1  JOE  [ATTR2]
2   1  JOE  [ATTR3]
3   2  JAY  [ATTR1]
4   2  JAY  [ATTR3]
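
To see what the pattern captures without pandas, here is a minimal sketch using only the re module. The rows are hypothetical stand-ins for the ID, USER and MDX_TEXT columns (the original input is not shown), and duplicates are dropped after trimming the attribute name rather than before:

```python
import re

# Same pattern as above: two bracketed names joined by a dot.
pattern = re.compile(r"(\[[\w ]+\]\.\[[\w ]+\])")

# Hypothetical rows: (ID, USER, MDX_TEXT).
rows = [
    (1, "JOE", "SELECT [DIM1].[ATTR1] ON 0, [DIM1].[ATTR2] ON 1"),
    (1, "JOE", "SELECT [DIM1].[ATTR1] ON 0, [DIM2].[ATTR3] ON 1"),
    (2, "JAY", "SELECT [DIM1].[ATTR1] ON 0, [DIM2].[ATTR3] ON 1"),
]

result = []
for row_id, user, text in rows:
    for attr in pattern.findall(text):
        # Keep only the part after the dot, e.g. "[ATTR1]".
        record = (row_id, user, attr.split(".")[-1])
        if record not in result:  # drop duplicates, keep first-seen order
            result.append(record)

for record in result:
    print(record)
```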

Create new column if DataFrame contains specific string

You could use pandas.Series.str.extract to achieve the desired output:

import pandas as pd

df = pd.DataFrame({
    "Name": ["name first RB LA a", "name LB second", "RB name third", "name LB fourth"]
})
# str.extract returns one column per capture group; [0] selects the first.
df["Example"] = df["Name"].str.extract("(LB|RB)")[0] + " category"

                 Name      Example
0  name first RB LA a  RB category
1      name LB second  LB category
2       RB name third  RB category
3      name LB fourth  LB category

Edit

To change the category names within the Example column, use .str.replace:

df["Example"] = (df["Example"]
    .str.replace("RB", "Round Blade")
    .str.replace("LB", "Long Biased")
)
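
As an alternative to chaining replacements, the extracted code can be looked up in a mapping. The sketch below uses plain Python and the re module rather than pandas; the long_names dictionary is a hypothetical helper, and None stands in for the missing value a non-matching name would produce:

```python
import re

names = ["name first RB LA a", "name LB second", "RB name third", "name LB fourth"]

# Map each code to its long name in one lookup instead of chained replaces.
long_names = {"RB": "Round Blade", "LB": "Long Biased"}

examples = []
for name in names:
    match = re.search(r"(LB|RB)", name)
    # None plays the role of a missing value for names with neither code.
    examples.append(f"{long_names[match.group(1)]} category" if match else None)

print(examples)
```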

How to find a pattern in a string and extract it as a new column of data frame

You can try the following:

library(tidyverse)

df %>%
  extract(col, c('First', 'cut-off', 'Second'),
          '(\\d+.*?)% 1ST\\s*\\$(\\d+).*?(\\d+.*?)%.*?', remove = FALSE) %>%
  mutate(Bonus = str_extract(col, '\\d+(?=\\sBONUS)')) %>%
  select(-col)

# First cut-off Second Bonus
#1 3.2 100000 1.1 <NA>
#2 3.3 100000 1.2 3000
#3 <NA> <NA> <NA> <NA>
#4 3.3 100000 1.2 <NA>
#5 3.3 100000 1.2 <NA>
#6 3.2 100000 1.1 <NA>
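
To check how the two regular expressions behave on individual strings, here is a small Python sketch with the re module. It uses four of the sample strings from this answer; the trailing lazy .*? of the main pattern is omitted on the assumption that it never consumes anything at the end of a pattern:

```python
import re

rows = [
    "3.2% 1ST $100000 AND 1.1% BALANCE",
    "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
    "$4000",
    "3.2% 1ST $100000 1.1% BALANCE",
]

# Three groups: first rate, cut-off amount, second rate.
main = re.compile(r"(\d+.*?)% 1ST\s*\$(\d+).*?(\d+.*?)%")
# Digits that are immediately followed by whitespace and the word BONUS.
bonus = re.compile(r"\d+(?=\sBONUS)")

parsed = []
for text in rows:
    m = main.search(text)
    b = bonus.search(text)
    first, cutoff, second = m.groups() if m else (None, None, None)
    parsed.append((first, cutoff, second, b.group(0) if b else None))

for record in parsed:
    print(record)
```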

data

df <- data.frame(col = c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE",
"3.2% 1ST $100000 1.1% BALANCE"))

Create column based on presence of string pattern and ifelse

To check whether a string contains a certain substring, you can't use ==, because it performs exact matching (i.e. it returns TRUE only if the string is exactly "non").

You could use, for example, the grepl function (part of the grep family of functions), which performs pattern matching:

df$loc01 <- ifelse(grepl("non",df$loc_01),'outside','inside')

Result:

> df
     loc_01 loc01_land   loc01
1      apis  165730500  inside
2      indu   62101800  inside
3      isro  540687600  inside
4      miss  161140500  inside
5  non_apis 1694590200 outside
6  non_indu 1459707300 outside
7  non_isro 1025051400 outside
8  non_miss 1419866100 outside
9  non_piro 2037064500 outside
10 non_sacn 2204629200 outside
11 non_slbe 1918840500 outside
12 non_voya  886299300 outside
13     piro  264726000  inside
14     sacn  321003900  inside
15     slbe  241292700  inside
16     voya  530532000  inside
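
The difference between exact and substring matching can be illustrated in a few lines of Python, using four of the loc_01 values shown above. The in operator is a literal substring test, which is assumed here to be close enough to grepl with a plain-text pattern:

```python
locations = ["apis", "non_apis", "piro", "non_piro"]

# == only succeeds on an exact match, so nothing here equals "non".
exact = [loc == "non" for loc in locations]

# The in operator tests for a substring anywhere in the string.
labels = ["outside" if "non" in loc else "inside" for loc in locations]

print(exact)   # [False, False, False, False]
print(labels)  # ['inside', 'outside', 'inside', 'outside']
```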

Pandas dataframe: Check if regex contained in a column matches a string in another column in the same row

You can't use a pandas builtin method directly. You will need to apply a re.search per row:

import re

mask = df.apply(lambda r: bool(re.search(r['patterns'], r['strings'])), axis=1)
df2 = df[mask]

or using a (faster) list comprehension:

mask = [bool(re.search(p,s)) for p,s in zip(df['patterns'], df['strings'])]

output:

  strings patterns  group
0   apple      \ba      1
3   train      n\b      2
4     tan      n\b      2
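
A self-contained version of the list-comprehension approach, without pandas, might look like the sketch below. Rows 0, 3 and 4 follow the output shown above; rows 1 and 2 are hypothetical, since the original input is not given:

```python
import re

# (strings, patterns, group); rows 1 and 2 are made-up non-matching examples.
rows = [
    ("apple", r"\ba", 1),
    ("banana", r"\ba", 1),
    ("cat", r"n\b", 2),
    ("train", r"n\b", 2),
    ("tan", r"n\b", 2),
]

# Each row is tested against its own pattern.
mask = [bool(re.search(pattern, string)) for string, pattern, _ in rows]

# Keep the original positions of the rows whose pattern matched.
kept = [(i, row) for i, (row, keep) in enumerate(zip(rows, mask)) if keep]

print(mask)
for i, row in kept:
    print(i, row)
```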

R: Add new column by specific patterns in another column of the dataframe

dfA <- data.frame(group=seq(1,4), pattern=c("Black & White", "Black OR Pink", "Red", "Pink"), stringsAsFactors=FALSE)
dfB <- data.frame(color=c("Pink", "Red", "Black", "White"), value=c(2,4,84,100), stringsAsFactors=FALSE)

getVal2return <- function(i, dfA, dfB){

  # split the pattern on each operator; a length > 1 means the operator is present
  andv <- unlist(strsplit(dfA$pattern[i], split=" & "))
  orv <- unlist(strsplit(dfA$pattern[i], split=" OR "))
  if (length(andv) > 1) {
    # "A & B": sum the values of all listed colors
    val <- sum(dfB$value[match(andv, dfB$color)])
  } else if (length(orv) > 1) {
    # "A OR B": take the largest value among the listed colors
    val <- max(dfB$value[match(orv, dfB$color)])
  } else {
    # single color: look it up directly
    val <- dfB$value[match(dfA$pattern[i], dfB$color)]
  }
  return(val)
}

dfA$newVal <- sapply(seq_len(nrow(dfA)), function(x) { getVal2return(x, dfA, dfB) })

> dfA
  group       pattern newVal
1     1 Black & White    184
2     2 Black OR Pink     84
3     3           Red      4
4     4          Pink      2
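
The same rule (sum for " & ", maximum for " OR ", direct lookup otherwise) can be sketched in Python. The function name pattern_value is hypothetical, and unlike match() in R, which yields NA for an unknown color, a dictionary lookup raises KeyError:

```python
# Values from dfB, keyed by color.
values = {"Pink": 2, "Red": 4, "Black": 84, "White": 100}
patterns = ["Black & White", "Black OR Pink", "Red", "Pink"]

def pattern_value(pattern, values):
    """Resolve one pattern to a number using the color values."""
    if " & " in pattern:
        # "A & B": add up the values of all listed colors.
        return sum(values[color] for color in pattern.split(" & "))
    if " OR " in pattern:
        # "A OR B": take the largest value among the listed colors.
        return max(values[color] for color in pattern.split(" OR "))
    # Single color: plain lookup.
    return values[pattern]

new_vals = [pattern_value(p, values) for p in patterns]
print(new_vals)  # [184, 84, 4, 2]
```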

Based on Partial string Match fill one data frame column from another dataframe

I would do something like this:

  1. Create a new column, indexes, that holds, for every Equipment in df2, the list of df1 index values whose TagName contains that Equipment.

  2. Flatten indexes into one row per item using stack() and reset_index().

  3. Join the flattened df2 with df1 to get all the information you want.

from io import StringIO
import numpy as np
import pandas as pd
df1=StringIO("""Line;TagName;CLASS
187877;PT_WOA;.ZS01_LA120_T05.SB.S2384_LesSwL;10
187878;PT_WOA;.ZS01_RB2202_T05.SB.S2385_FLOK;10
187879;PT_WOA;.ZS01_LA120_T05.SB._CBAbsHy;10
187880;PT_WOA;.ZS01_LA120_T05.SB.S3110_CBAPV;10
187881;PT_WOA;.ZS01_LARB2204.SB.S3111_CBRelHy;10""")
df2=StringIO("""EquipmentNo;EquipmentDescription;Equipment
1311256;Lifting table;LA120
1311257;Roller bed;RB2200
1311258;Lifting table;LT2202
1311259;Roller bed;RB2202
1311260;Roller bed;RB2204""")
df1=pd.read_csv(df1,sep=";")
df2=pd.read_csv(df2,sep=";")

df2['indexes'] = df2['Equipment'].apply(lambda x: df1.index[df1.TagName.str.contains(str(x)).tolist()].tolist())
indexes = df2.apply(lambda x: pd.Series(x['indexes']),axis=1).stack().reset_index(level=1, drop=True)
indexes.name = 'indexes'
df2 = df2.drop('indexes', axis=1).join(indexes).dropna()
df2.index = df2['indexes']
matches = df2.join(df1, how='inner')
print(matches[['Line','TagName','EquipmentDescription','EquipmentNo']])

OUTPUT:

          Line                          TagName  EquipmentDescription  EquipmentNo
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL         Lifting table      1311256
187879  PT_WOA      .ZS01_LA120_T05.SB._CBAbsHy         Lifting table      1311256
187880  PT_WOA   .ZS01_LA120_T05.SB.S3110_CBAPV         Lifting table      1311256
187878  PT_WOA   .ZS01_RB2202_T05.SB.S2385_FLOK            Roller bed      1311259
187881  PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy            Roller bed      1311260
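
The matching step itself is a nested loop: every Equipment code is tested against every TagName. The sketch below shows that logic in plain Python with the sample data above, as an illustration rather than a replacement for the pandas code. It uses a literal substring test, whereas str.contains treats its argument as a regular expression by default:

```python
# Index -> TagName, taken from df1 above.
tags = {
    187877: ".ZS01_LA120_T05.SB.S2384_LesSwL",
    187878: ".ZS01_RB2202_T05.SB.S2385_FLOK",
    187879: ".ZS01_LA120_T05.SB._CBAbsHy",
    187880: ".ZS01_LA120_T05.SB.S3110_CBAPV",
    187881: ".ZS01_LARB2204.SB.S3111_CBRelHy",
}

# (EquipmentNo, EquipmentDescription, Equipment), taken from df2 above.
equipment = [
    (1311256, "Lifting table", "LA120"),
    (1311257, "Roller bed", "RB2200"),
    (1311258, "Lifting table", "LT2202"),
    (1311259, "Roller bed", "RB2202"),
    (1311260, "Roller bed", "RB2204"),
]

# One output row per (equipment, tag) pair where the code occurs in the tag.
matches = [
    (index, tag, description, number)
    for number, description, code in equipment
    for index, tag in tags.items()
    if code in tag
]

for row in matches:
    print(row)
```

Equipment with no matching tag (RB2200 and LT2202 here) simply produces no rows, which corresponds to the inner join in the pandas version.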

