Add column to data frame which returns 1 if string matches a certain pattern
How about
iris$check <- as.numeric(grepl(".*(sa)", iris$Species))
grepl returns a logical vector (TRUE/FALSE), which can easily be converted to 1/0 using as.numeric.
Also possible:
iris$check <- grepl(".*(sa)", iris$Species) + 0L
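For comparison, the same boolean-to-integer idea can be sketched in plain Python. The species list below is made-up sample data standing in for iris$Species, not the real data set.

```python
import re

# Hypothetical sample values standing in for iris$Species.
species = ["setosa", "versicolor", "virginica", "setosa"]

# re.search returns a match object or None; bool() turns that into
# True/False and int() turns it into 1/0.
check = [int(bool(re.search(r"sa", s))) for s in species]
print(check)  # [1, 0, 0, 1]
```

Note that a plain "sa" is enough here: a search-style match finds the pattern anywhere in the string, so the leading ".*" is not needed.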
Create new column in dataframe based on partial string matching other column
Since you have only two conditions, you can use a nested ifelse:
#random data; it wasn't easy to copy-paste yours
DF <- data.frame(GL = sample(10),
                 GLDESC = paste(sample(letters, 10),
                                c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
                                  "asdfg", "GAS--2", "fghfgh", "qweee"),
                                sample(letters, 10), sep = " "))
DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = TRUE), "Materials",
                  ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))
DF
# GL GLDESC KIND
#1 8 e gas l Materials
#2 1 c payroll12 y Payroll
#3 10 m GaSer v Materials
#4 6 t asdf n Other
#5 2 w qweaa t Other
#6 4 r PayROll-12 q Payroll
#7 9 n asdfg a Other
#8 5 d GAS--2 w Materials
#9 7 s fghfgh e Other
#10 3 g qweee k Other
EDIT 10/3/2016 (..after receiving more attention than expected)
A possible solution to deal with more patterns could be to iterate over all patterns and, whenever there is a match, progressively reduce the number of comparisons:
ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
    stopifnot(length(patterns) == length(replacements))
    ans = rep_len(as.character(fill), length(x))
    empty = seq_along(x)
    for(i in seq_along(patterns)) {
        greps = grepl(patterns[[i]], x[empty], ...)
        ans[empty[greps]] = replacements[[i]]
        empty = empty[!greps]
    }
    return(ans)
}
ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll" "Materials" "Other" "Other" "Payroll" "Other" "Materials" "Other" "Other"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat1a|pat1b", "pat2", "pat3"),
c("1", "2", "3"), fill = "empty")
#[1] "1" "1" "3" "empty"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat2", "pat1a|pat1b", "pat3"),
c("2", "1", "3"), fill = "empty")
#[1] "2" "1" "3" "empty"
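The first-match-wins loop above can also be sketched in plain Python. The function name first_match_label and its parameter names are chosen here for illustration; this is a rough translation of the idea, not the author's code.

```python
import re

def first_match_label(values, patterns, replacements=None, fill=None, flags=0):
    """Label each value with the replacement of the first pattern that matches."""
    if replacements is None:
        replacements = patterns
    if len(patterns) != len(replacements):
        raise ValueError("patterns and replacements must have the same length")
    result = [fill] * len(values)
    remaining = list(range(len(values)))  # indices that are still unlabeled
    for pattern, replacement in zip(patterns, replacements):
        regex = re.compile(pattern, flags)
        still_unmatched = []
        for i in remaining:
            if regex.search(values[i]):
                result[i] = replacement
            else:
                still_unmatched.append(i)
        remaining = still_unmatched  # later patterns only see unmatched values
    return result

texts = ["pat1a pat2", "pat1a pat1b", "pat3", "pat4"]
labels = first_match_label(texts, ["pat1a|pat1b", "pat2", "pat3"],
                           ["1", "2", "3"], fill="empty")
print(labels)  # ['1', '1', '3', 'empty']
```

As in the R version, the order of the patterns decides which label wins when a value matches more than one of them.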
Search for string pattern in dataframe column, return each occurrence and join to another dataframe
(edited).
The pattern piece is a good start, but then you have to merge / join it with the original dataframe:
import re

# name the index so the extracted matches can be joined back on it.
df.index.name = "inx"
pattern = re.compile(r'(\[[\w ]+\]\.\[[\w ]+\])')
# extract the attributes.
extracts = df.MDX_TEXT.str.extractall(pattern).rename(columns={0:"attrname"})
# join the result with the original dataframe.
res = df.join(extracts).reset_index()[["ID", "USER", "attrname"]].drop_duplicates()
# take just the last part of each attribute name.
res["attrname"] = res["attrname"].str.split(".", expand = True).iloc[:, -1]
The result is:
ID USER attrname
0 1 JOE [ATTR1]
1 1 JOE [ATTR2]
2 1 JOE [ATTR3]
3 2 JAY [ATTR1]
4 2 JAY [ATTR3]
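To make the extract-then-join logic easier to follow, here is a minimal sketch without pandas. The rows and their MDX_TEXT values are invented for illustration, and the pattern is a small variation that captures only the last bracketed part directly instead of splitting afterwards.

```python
import re

# Hypothetical rows; the MDX_TEXT values are invented for illustration.
rows = [
    {"ID": 1, "USER": "JOE", "MDX_TEXT": "SELECT [DIM1].[ATTR1], [DIM1].[ATTR2]"},
    {"ID": 2, "USER": "JAY", "MDX_TEXT": "SELECT [DIM2].[ATTR1], [DIM2].[ATTR1]"},
]

# One capture group, so findall returns just the second bracketed name.
pattern = re.compile(r"\[[\w ]+\]\.(\[[\w ]+\])")

result = []
for row in rows:
    for attr in pattern.findall(row["MDX_TEXT"]):
        record = (row["ID"], row["USER"], attr)
        if record not in result:  # drop duplicates, keep first-seen order
            result.append(record)

for record in result:
    print(record)
```

Each match becomes its own record that carries the ID and USER of the row it came from, which is what the join achieves in the pandas version.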
Create new column if DataFrame contains specific string
You could use pandas.Series.str.extract to achieve the desired output:
import pandas as pd
df = pd.DataFrame({
"Name": ["name first RB LA a", "name LB second", "RB name third", "name LB fourth"]
})
df["Example"] = df["Name"].str.extract("(LB|RB)")[0] + " category"
Name Example
0 name first RB LA a RB category
1 name LB second LB category
2 RB name third RB category
3 name LB fourth LB category
Edit
To change the category names within the Example column, use .str.replace:
df["Example"] = (df["Example"]
    .str.replace("RB", "Round Blade")
    .str.replace("LB", "Long Biased")
)
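The extract and rename steps can also be combined with a lookup table. The following is a sketch in plain Python (no pandas) under that assumption; the names category_names and categorize are illustrative and not part of the original answer.

```python
import re

names = ["name first RB LA a", "name LB second", "RB name third", "name LB fourth"]

# Lookup table from the short code to the long category name.
category_names = {"RB": "Round Blade", "LB": "Long Biased"}

def categorize(name):
    """Return '<long name> category' for the first LB/RB code found, else None."""
    match = re.search(r"LB|RB", name)
    if match is None:
        return None
    return category_names[match.group(0)] + " category"

example = [categorize(n) for n in names]
print(example)
```

Keeping the codes in a dictionary means a new category only needs a new dictionary entry plus an extra alternative in the pattern.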
How to find a pattern in a string and extract it as a new column of data frame
You can try the following:
library(tidyverse)
df %>%
  extract(col, c('First', 'cut-off', 'Second'),
          '(\\d+.*?)% 1ST\\s*\\$(\\d+).*?(\\d+.*?)%.*?', remove = FALSE) %>%
  mutate(Bonus = str_extract(col, '\\d+(?=\\sBONUS)')) %>%
  select(-col)
# First cut-off Second Bonus
#1 3.2 100000 1.1 <NA>
#2 3.3 100000 1.2 3000
#3 <NA> <NA> <NA> <NA>
#4 3.3 100000 1.2 <NA>
#5 3.3 100000 1.2 <NA>
#6 3.2 100000 1.1 <NA>
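The same extraction can be sketched with Python's re module. The pattern below is a variation, not the exact regex above: it matches the decimal part explicitly. The helper name parse_terms is made up for this example, and the test strings are copied from the sample data of this answer.

```python
import re

# Variation on the pattern above: decimals are matched explicitly.
RATE_RE = re.compile(r"(\d+(?:\.\d+)?)% 1ST\s*\$(\d+).*?(\d+(?:\.\d+)?)%")
# Digits that are directly followed by whitespace and the word BONUS.
BONUS_RE = re.compile(r"\d+(?=\sBONUS)")

def parse_terms(text):
    """Return (first, cut_off, second, bonus); parts that are missing are None."""
    first = cut_off = second = bonus = None
    rate = RATE_RE.search(text)
    if rate:
        first, cut_off, second = rate.groups()
    bonus_match = BONUS_RE.search(text)
    if bonus_match:
        bonus = bonus_match.group(0)
    return first, cut_off, second, bonus

with_bonus = parse_terms(
    "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY")
no_match = parse_terms("$4000")
print(with_bonus)  # ('3.3', '100000', '1.2', '3000')
print(no_match)    # (None, None, None, None)
```

All extracted values are strings; converting them to numbers would be a separate step.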
data
df <- data.frame(col = c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE",
"3.2% 1ST $100000 1.1% BALANCE"))
Create column based on presence of string pattern and ifelse
To check whether a string contains a certain substring, you can't use == because it performs an exact match (i.e. it returns TRUE only if the string is exactly "non").
You could use, for example, the grepl function (part of the grep family of functions), which performs pattern matching:
df$loc01 <- ifelse(grepl("non",df$loc_01),'outside','inside')
Result:
> df
loc_01 loc01_land loc01
1 apis 165730500 inside
2 indu 62101800 inside
3 isro 540687600 inside
4 miss 161140500 inside
5 non_apis 1694590200 outside
6 non_indu 1459707300 outside
7 non_isro 1025051400 outside
8 non_miss 1419866100 outside
9 non_piro 2037064500 outside
10 non_sacn 2204629200 outside
11 non_slbe 1918840500 outside
12 non_voya 886299300 outside
13 piro 264726000 inside
14 sacn 321003900 inside
15 slbe 241292700 inside
16 voya 530532000 inside
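The difference between exact comparison and containment is easy to see in a small Python sketch. The values "apis", "non_apis" and "piro" come from the table above; the bare "non" entry is added here only to show a case where the exact comparison succeeds.

```python
# "non" is a hypothetical extra value; the others appear in the result above.
locations = ["apis", "non_apis", "non", "piro"]

# Exact comparison is True only when the whole string equals "non".
exact = [loc == "non" for loc in locations]

# Containment is True whenever "non" appears anywhere in the string.
contains = ["non" in loc for loc in locations]

labels = ["outside" if "non" in loc else "inside" for loc in locations]
print(exact)     # [False, False, True, False]
print(contains)  # [False, True, True, False]
print(labels)    # ['inside', 'outside', 'outside', 'inside']
```

Python's in operator does a plain substring test rather than a regular expression match, which is sufficient for a fixed string like "non".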
Pandas dataframe: Check if regex contained in a column matches a string in another column in the same row
You can't use a pandas builtin method directly. You will need to apply re.search per row:
import re
mask = df.apply(lambda r: bool(re.search(r['patterns'], r['strings'])), axis=1)
df2 = df[mask]
or using a (faster) list comprehension:
mask = [bool(re.search(p,s)) for p,s in zip(df['patterns'], df['strings'])]
output:
strings patterns group
0 apple \ba 1
3 train n\b 2
4 tan n\b 2
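Since the input frame is not shown, here is a self-contained sketch of the row-wise matching on plain lists. The strings "banana" and "cat" and the pairing of strings with patterns are assumptions made for illustration; only "apple", "train", "tan" and the two patterns appear in the output above.

```python
import re

# Hypothetical input: one regex per row, applied to the string in the same row.
strings = ["apple", "banana", "cat", "train", "tan"]
patterns = [r"\ba", r"\ba", r"n\b", r"n\b", r"n\b"]

# Pair each pattern with its own string, exactly one search per row.
mask = [bool(re.search(p, s)) for p, s in zip(patterns, strings)]

# Keep the positions (and strings) of the rows whose pattern matched.
kept = [(i, s) for i, (s, m) in enumerate(zip(strings, mask)) if m]
print(mask)  # [True, False, False, True, True]
print(kept)  # [(0, 'apple'), (3, 'train'), (4, 'tan')]
```

The re module caches recently compiled patterns internally, so repeating the same pattern string across many rows does not recompile it each time.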
R: Add new column by specific patterns in another column of the dataframe
dfA <- data.frame(group=seq(1,4), pattern=c("Black & White", "Black OR Pink", "Red", "Pink"), stringsAsFactors=F)
dfB <- data.frame(color=c("Pink", "Red", "Black", "White"), value=c(2,4,84,100), stringsAsFactors=F)
getVal2return <- function(i, dfA, dfB){
  andv <- unlist(strsplit(dfA$pattern[i], split=" & "))
  orv <- unlist(strsplit(dfA$pattern[i], split=" OR "))
  if (length(andv) > 1) {
    val <- sum(dfB$value[match(andv, dfB$color)])
  } else if (length(orv) > 1) {
    val <- max(dfB$value[match(orv, dfB$color)])
  } else {
    val <- dfB$value[match(dfA$pattern[i], dfB$color)]
  }
  return(val)
}
dfA$newVal <- sapply(1:nrow(dfA), function(x) { getVal2return(x, dfA, dfB) })
> dfA
group pattern newVal
1 1 Black & White 184
2 2 Black OR Pink 84
3 3 Red 4
4 4 Pink 2
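The rule implemented above (sum the values for " & " patterns, take the maximum for " OR " patterns, otherwise look the color up directly) can be sketched in Python with a dictionary. The function name pattern_value is made up for this example.

```python
# Color values from dfB, stored as a lookup table.
values = {"Pink": 2, "Red": 4, "Black": 84, "White": 100}

def pattern_value(pattern, values):
    """Sum for 'A & B', maximum for 'A OR B', plain lookup otherwise."""
    if " & " in pattern:
        return sum(values[color] for color in pattern.split(" & "))
    if " OR " in pattern:
        return max(values[color] for color in pattern.split(" OR "))
    return values[pattern]

patterns = ["Black & White", "Black OR Pink", "Red", "Pink"]
new_vals = [pattern_value(p, values) for p in patterns]
print(new_vals)  # [184, 84, 4, 2]
```

One difference from the R code: an unknown color raises a KeyError here, whereas match() in R would yield NA.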
Based on Partial string Match fill one data frame column from another dataframe
I would do something like this:
- Create a new column indexes where, for every Equipment in df2, you find the list of indexes in df1 where df1.TagName contains the Equipment.
- Flatten indexes by creating one row for each item, using stack() and reset_index().
- Join the flattened df2 with df1 to get all the information you want.
from io import StringIO
import numpy as np
import pandas as pd
df1=StringIO("""Line;TagName;CLASS
187877;PT_WOA;.ZS01_LA120_T05.SB.S2384_LesSwL;10
187878;PT_WOA;.ZS01_RB2202_T05.SB.S2385_FLOK;10
187879;PT_WOA;.ZS01_LA120_T05.SB._CBAbsHy;10
187880;PT_WOA;.ZS01_LA120_T05.SB.S3110_CBAPV;10
187881;PT_WOA;.ZS01_LARB2204.SB.S3111_CBRelHy;10""")
df2=StringIO("""EquipmentNo;EquipmentDescription;Equipment
1311256;Lifting table;LA120
1311257;Roller bed;RB2200
1311258;Lifting table;LT2202
1311259;Roller bed;RB2202
1311260;Roller bed;RB2204""")
df1=pd.read_csv(df1,sep=";")
df2=pd.read_csv(df2,sep=";")
df2['indexes'] = df2['Equipment'].apply(lambda x: df1.index[df1.TagName.str.contains(str(x)).tolist()].tolist())
indexes = df2.apply(lambda x: pd.Series(x['indexes']),axis=1).stack().reset_index(level=1, drop=True)
indexes.name = 'indexes'
df2 = df2.drop('indexes', axis=1).join(indexes).dropna()
df2.index = df2['indexes']
matches = df2.join(df1, how='inner')
print(matches[['Line','TagName','EquipmentDescription','EquipmentNo']])
OUTPUT:
Line TagName EquipmentDescription EquipmentNo
187877 PT_WOA .ZS01_LA120_T05.SB.S2384_LesSwL Lifting table 1311256
187879 PT_WOA .ZS01_LA120_T05.SB._CBAbsHy Lifting table 1311256
187880 PT_WOA .ZS01_LA120_T05.SB.S3110_CBAPV Lifting table 1311256
187878 PT_WOA .ZS01_RB2202_T05.SB.S2385_FLOK Roller bed 1311259
187881 PT_WOA .ZS01_LARB2204.SB.S3111_CBRelHy Roller bed 1311260
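The matching step itself is a substring test of every equipment code against every tag name. The following sketch shows that logic on plain tuples, using a subset of the sample rows above. It is a simplified illustration rather than a replacement for the pandas join, and it compares every pair, so it scales with the product of both table sizes.

```python
# Subset of the sample data above: (line number, tag name).
tags = [
    (187877, ".ZS01_LA120_T05.SB.S2384_LesSwL"),
    (187878, ".ZS01_RB2202_T05.SB.S2385_FLOK"),
    (187881, ".ZS01_LARB2204.SB.S3111_CBRelHy"),
]
# (equipment number, description, equipment code).
equipment = [
    (1311256, "Lifting table", "LA120"),
    (1311257, "Roller bed", "RB2200"),
    (1311259, "Roller bed", "RB2202"),
    (1311260, "Roller bed", "RB2204"),
]

# Keep every (tag, equipment) pair where the code occurs inside the tag name.
matches = [
    (line, tag, desc, eq_no)
    for eq_no, desc, code in equipment
    for line, tag in tags
    if code in tag
]
for match in matches:
    print(match)
```

Equipment without any matching tag (RB2200 here) simply produces no row, which corresponds to the inner join in the pandas version.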