Evaluate Different Logical Conditions from String for Each Row

I'm not entirely sure this is what you are looking for, but you can also use lazy_eval() from lazyeval:

df %>%
rowwise() %>%
mutate(res = lazy_eval(sub("value", value, condition)))

value condition res
<dbl> <chr> <lgl>
1 0.46 value > 0.5 FALSE
2 0.96 value == 0.79 FALSE
3 0.45 value <= 0.65 TRUE
4 0.68 value == 0.88 FALSE
5 0.570 value < 0.9 TRUE
6 0.1 value > 0.01 TRUE
7 0.9 value >= 0.6 TRUE
8 0.25 value < 0.91 TRUE
9 0.04 value > 0.2 FALSE

Another possibility, even though it is very close to eval(parse(...)), is parse_expr() from rlang:

df %>%
rowwise() %>%
mutate(res = eval(rlang::parse_expr(condition)))
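
Both R snippets hand each row's condition string to a general-purpose evaluator. For comparison only, here is a minimal Python sketch (not part of the original answer) that evaluates the same kind of string but accepts nothing except a single comparison between the name value and a numeric literal. The helper name eval_condition and the operator whitelist are assumptions made for illustration.

```python
import ast
import operator

# Whitelisted comparison operators; anything else is rejected.
_OPS = {
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def eval_condition(condition, value):
    """Evaluate a string such as 'value > 0.5' for one row's value."""
    node = ast.parse(condition, mode="eval").body
    # Only a single, non-chained comparison is accepted.
    if not (isinstance(node, ast.Compare) and len(node.ops) == 1):
        raise ValueError(f"unsupported condition: {condition!r}")
    op = _OPS.get(type(node.ops[0]))
    if op is None:
        raise ValueError(f"unsupported operator in {condition!r}")

    def operand(n):
        # The only name allowed is 'value'; the only literals are numbers.
        if isinstance(n, ast.Name) and n.id == "value":
            return value
        if isinstance(n, ast.Constant) and isinstance(n.value, (int, float)):
            return n.value
        raise ValueError(f"unsupported operand in {condition!r}")

    return op(operand(node.left), operand(node.comparators[0]))


rows = [(0.46, "value > 0.5"), (0.45, "value <= 0.65"), (0.96, "value == 0.79")]
results = [eval_condition(cond, val) for val, cond in rows]
print(results)
```

The trade-off is flexibility: negative literals, arithmetic, and function calls are all rejected by this sketch, whereas the eval-based approaches accept any expression.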

Populate value from a different row when conditions are matched

A couple of joins should work:

db %>% left_join(
db %>% inner_join(db,
by=c("Person.Local"="Person.Travel",
"Person.Travel"="Person.Local",
"Date"="Date"), suffix=c("",".y")) %>%
rename(Value.Travel=Value.Local.y),
by=c("Person.Local"="Person.Local",
"Person.Travel"="Person.Travel",
"Date"="Date", "Value.Local"))

# A tibble: 13 x 5
Person.Local Date Person.Travel Value.Local Value.Travel
<chr> <chr> <chr> <dbl> <dbl>
1 A 2019-10-31 C 1 7
2 A 2019-10-14 J 4 NA
3 A 2019-10-13 K 5 NA
4 A 2019-10-12 B 5 3
5 A 2019-10-12 I 7 NA
6 B 2019-10-18 C 9 NA
7 B 2019-10-21 K 7 NA
8 B 2019-10-22 V 5 NA
9 B 2019-10-12 A 3 5
10 B 2019-10-29 P 8 NA
11 C 2019-10-31 A 7 1
12 C 2019-04-04 Z 4 NA
13 C 2019-10-31 H 5 NA
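
The same idea (join the table to a copy of itself with the two person columns swapped) can be sketched in pandas. This is only an illustrative translation, not part of the original R answer; the three-row db below is made up, and swapped is a helper name I introduce here.

```python
import pandas as pd

# Tiny made-up sample: A and C meet on 2019-10-31, A and J never match.
db = pd.DataFrame({
    "Person.Local":  ["A", "A", "C"],
    "Date":          ["2019-10-31", "2019-10-14", "2019-10-31"],
    "Person.Travel": ["C", "J", "A"],
    "Value.Local":   [1, 4, 7],
})

# Swap the roles of the two people and relabel the value column,
# so that each row describes the meeting from the other side.
swapped = db.rename(columns={
    "Person.Local": "Person.Travel",
    "Person.Travel": "Person.Local",
    "Value.Local": "Value.Travel",
})

# A left join keeps every original row; unmatched rows get NaN.
result = db.merge(swapped, on=["Person.Local", "Person.Travel", "Date"], how="left")
print(result)
```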

By the way, you should create data like this to prevent warnings about factor levels:

db <- tibble(Person.Local, Date, Person.Travel, Value.Local)

Edit: Thanks to Josedv for reminding me about the Date they met. ^_^

resolving logical operations - AND, OR, looping conditions dynamically

Here's a complete solution that does not rely on third-party libraries such as ANTLR or JavaCC. Note that while it's extensible, its capabilities are still limited. If you want to support much more complex expressions, you'd be better off using a parser generator.

First, let's write a tokenizer that splits the input string into tokens. Here are the token types:

private static enum TokenType {
WHITESPACE, AND, OR, EQUALS, LEFT_PAREN, RIGHT_PAREN, IDENTIFIER, LITERAL, EOF
}

The token class itself:

private static class Token {
final TokenType type;
final int start; // start position in input (for error reporting)
final String data; // payload

public Token(TokenType type, int start, String data) {
this.type = type;
this.start = start;
this.data = data;
}

@Override
public String toString() {
return type + "[" + data + "]";
}
}

To simplify tokenization, let's create a regexp that reads the next token from the input string:

// The word boundary after AND/OR keeps an identifier such as ORDER from being split into OR + DER
private static final Pattern TOKENS =
Pattern.compile("(\\s+)|(AND\\b)|(OR\\b)|(=)|(\\()|(\\))|(\\w+)|'([^']+)'");

Note that it has one capturing group per TokenType, in the same order (first WHITESPACE, then AND, and so on); only EOF has no group, because that token is added by hand. Finally, the tokenizer method:

private static TokenStream tokenize(String input) throws ParseException {
Matcher matcher = TOKENS.matcher(input);
List<Token> tokens = new ArrayList<>();
int offset = 0;
TokenType[] types = TokenType.values();
while (offset != input.length()) {
if (!matcher.find() || matcher.start() != offset) {
throw new ParseException("Unexpected token at " + offset, offset);
}
for (int i = 0; i < types.length; i++) {
if (matcher.group(i + 1) != null) {
if (types[i] != TokenType.WHITESPACE)
tokens.add(new Token(types[i], offset, matcher.group(i + 1)));
break;
}
}
offset = matcher.end();
}
tokens.add(new Token(TokenType.EOF, input.length(), ""));
return new TokenStream(tokens);
}

I'm using java.text.ParseException. Here we apply the regex Matcher until the end of the input. If it doesn't match at the current position, we throw an exception. Otherwise we find the group that matched and create a token from it, ignoring WHITESPACE tokens. Finally we add an EOF token, which marks the end of the input.

The result is returned as a special TokenStream object. Here's the TokenStream class, which will help us do the parsing:

private static class TokenStream {
final List<Token> tokens;
int offset = 0;

public TokenStream(List<Token> tokens) {
this.tokens = tokens;
}

// consume next token of given type (throw exception if type differs)
public Token consume(TokenType type) throws ParseException {
Token token = tokens.get(offset++);
if (token.type != type) {
throw new ParseException("Unexpected token at " + token.start
+ ": " + token + " (was looking for " + type + ")",
token.start);
}
return token;
}

// consume token of given type (return null and don't advance if type differs)
public Token consumeIf(TokenType type) {
Token token = tokens.get(offset);
if (token.type == type) {
offset++;
return token;
}
return null;
}

@Override
public String toString() {
return tokens.toString();
}
}

So we have a tokenizer, hoorah. You can test it right now using System.out.println(tokenize("Acct1 = 'Y' AND (Acct2 = 'N' OR Acct3 = 'N')"));
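
If you'd like to experiment with the same idea outside Java, here's a rough Python sketch of the tokenizer. It is my own transliteration, not part of the solution itself: it uses named groups instead of relying on group order, and represents tokens as plain (type, start, data) tuples. The \b after AND/OR keeps an identifier such as ORDER from being split into OR + DER.

```python
import re

# One named group per token type; alternatives are tried in this order.
TOKEN_RE = re.compile(r"""
    (?P<WHITESPACE>\s+)
  | (?P<AND>AND\b)
  | (?P<OR>OR\b)
  | (?P<EQUALS>=)
  | (?P<LEFT_PAREN>\()
  | (?P<RIGHT_PAREN>\))
  | (?P<IDENTIFIER>\w+)
  | '(?P<LITERAL>[^']+)'
""", re.VERBOSE)


def tokenize(text):
    """Split text into (type, start, data) tuples, ending with an EOF token."""
    tokens = []
    offset = 0
    while offset != len(text):
        # match() anchors at offset, so a gap in the input is an error.
        m = TOKEN_RE.match(text, offset)
        if m is None:
            raise ValueError(f"Unexpected token at {offset}")
        kind = m.lastgroup  # name of the group that matched
        if kind != "WHITESPACE":
            tokens.append((kind, offset, m.group(kind)))
        offset = m.end()
    tokens.append(("EOF", len(text), ""))
    return tokens


for token in tokenize("Acct1 = 'Y' AND (Acct2 = 'N' OR Acct3 = 'N')"):
    print(token)
```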

Now let's write the parser which will create the tree-like representation of our expression. First the interface Expr for all the tree nodes:

public interface Expr {
public boolean evaluate(Map<String, String> data);
}

Its only method evaluates the expression against a given data set and returns true if the data set matches.

The most basic expression is EqualsExpr, which matches something like Acct1 = 'Y' or 'Y' = Acct1:

private static class EqualsExpr implements Expr {
private final String identifier, literal;

public EqualsExpr(TokenStream stream) throws ParseException {
Token token = stream.consumeIf(TokenType.IDENTIFIER);
if(token != null) {
this.identifier = token.data;
stream.consume(TokenType.EQUALS);
this.literal = stream.consume(TokenType.LITERAL).data;
} else {
this.literal = stream.consume(TokenType.LITERAL).data;
stream.consume(TokenType.EQUALS);
this.identifier = stream.consume(TokenType.IDENTIFIER).data;
}
}

@Override
public String toString() {
return identifier+"='"+literal+"'";
}

@Override
public boolean evaluate(Map<String, String> data) {
return literal.equals(data.get(identifier));
}
}

The toString() method is just for information; you can remove it.

Next we define the SubExpr class, which is either an EqualsExpr or something more complex in parentheses (if we see an opening parenthesis):

private static class SubExpr implements Expr {
private final Expr child;

public SubExpr(TokenStream stream) throws ParseException {
if(stream.consumeIf(TokenType.LEFT_PAREN) != null) {
child = new OrExpr(stream);
stream.consume(TokenType.RIGHT_PAREN);
} else {
child = new EqualsExpr(stream);
}
}

@Override
public String toString() {
return "("+child+")";
}

@Override
public boolean evaluate(Map<String, String> data) {
return child.evaluate(data);
}
}

Next is AndExpr, which is a list of SubExpr expressions joined by the AND operator:

private static class AndExpr implements Expr {
private final List<Expr> children = new ArrayList<>();

public AndExpr(TokenStream stream) throws ParseException {
do {
children.add(new SubExpr(stream));
} while(stream.consumeIf(TokenType.AND) != null);
}

@Override
public String toString() {
return children.stream().map(Object::toString).collect(Collectors.joining(" AND "));
}

@Override
public boolean evaluate(Map<String, String> data) {
for(Expr child : children) {
if(!child.evaluate(data))
return false;
}
return true;
}
}

I use the Java 8 Stream API in toString for brevity. If you cannot use Java 8, you can rewrite it with a for loop or remove toString completely.

Finally we define OrExpr which is a set of AndExpr joined by OR (usually OR has lower priority than AND). It's very similar to AndExpr:

private static class OrExpr implements Expr {
private final List<Expr> children = new ArrayList<>();

public OrExpr(TokenStream stream) throws ParseException {
do {
children.add(new AndExpr(stream));
} while(stream.consumeIf(TokenType.OR) != null);
}

@Override
public String toString() {
return children.stream().map(Object::toString).collect(Collectors.joining(" OR "));
}

@Override
public boolean evaluate(Map<String, String> data) {
for(Expr child : children) {
if(child.evaluate(data))
return true;
}
return false;
}
}

And the final parse method:

public static Expr parse(TokenStream stream) throws ParseException {
OrExpr expr = new OrExpr(stream);
stream.consume(TokenType.EOF); // ensure that we parsed the whole input
return expr;
}

So you can parse your expressions to get Expr objects, then evaluate them against the rows of your CSV file. I assume that you can parse a CSV row into a Map<String, String>. Here's a usage example:

Map<String, String> data = new HashMap<>();
data.put("Acct1", "Y");
data.put("Acct2", "N");
data.put("Acct3", "Y");
data.put("Acct4", "N");

Expr expr = parse(tokenize("Acct1 = 'Y' AND (Acct2 = 'Y' OR Acct3 = 'Y')"));
System.out.println(expr.evaluate(data)); // true
expr = parse(tokenize("Acct1 = 'N' OR 'Y' = Acct2 AND Acct3 = 'Y'"));
System.out.println(expr.evaluate(data)); // false
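
To see the whole grammar in one place (OR binds looser than AND, and parentheses start again at the OR level), here's a compact Python sketch of the same recursive-descent approach. It is a loose transliteration for illustration, not the Java code above: the names Parser, accept and expect are my own, the tokenizer is condensed into a single regex, and tree nodes are plain closures instead of Expr classes.

```python
import re

# Operators/parens, identifiers, or quoted literals, with leading whitespace skipped.
TOKEN_RE = re.compile(r"\s*(?:(AND\b|OR\b|[=()])|(\w+)|'([^']+)')")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"Unexpected token at {pos}")
        op, ident, lit = m.groups()
        if op is not None:
            tokens.append((op, op))          # the operator is its own kind
        elif ident is not None:
            tokens.append(("IDENT", ident))
        else:
            tokens.append(("LIT", lit))
        pos = m.end()
    tokens.append(("EOF", ""))
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def accept(self, kind):
        """Consume and return the next token's value if it has this kind, else None."""
        if self.tokens[self.pos][0] == kind:
            self.pos += 1
            return self.tokens[self.pos - 1][1]
        return None

    def expect(self, kind):
        value = self.accept(kind)
        if value is None:
            raise ValueError(f"expected {kind}, got {self.tokens[self.pos][0]}")
        return value

    def parse_or(self):       # or := and ("OR" and)*
        children = [self.parse_and()]
        while self.accept("OR") is not None:
            children.append(self.parse_and())
        return lambda row: any(child(row) for child in children)

    def parse_and(self):      # and := sub ("AND" sub)*
        children = [self.parse_sub()]
        while self.accept("AND") is not None:
            children.append(self.parse_sub())
        return lambda row: all(child(row) for child in children)

    def parse_sub(self):      # sub := "(" or ")" | equals
        if self.accept("(") is not None:
            inner = self.parse_or()
            self.expect(")")
            return inner
        return self.parse_equals()

    def parse_equals(self):   # equals := IDENT "=" LIT | LIT "=" IDENT
        ident = self.accept("IDENT")
        if ident is not None:
            self.expect("=")
            lit = self.expect("LIT")
        else:
            lit = self.expect("LIT")
            self.expect("=")
            ident = self.expect("IDENT")
        return lambda row: row.get(ident) == lit


def parse(text):
    parser = Parser(tokenize(text))
    expr = parser.parse_or()
    parser.expect("EOF")      # ensure the whole input was consumed
    return expr


data = {"Acct1": "Y", "Acct2": "N", "Acct3": "Y", "Acct4": "N"}
print(parse("Acct1 = 'Y' AND (Acct2 = 'Y' OR Acct3 = 'Y')")(data))
print(parse("Acct1 = 'N' OR 'Y' = Acct2 AND Acct3 = 'Y'")(data))
```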

How to represent multiple conditions in a shell if statement?

Classic technique (escape metacharacters):

if [ \( "$g" -eq 1 -a "$c" = "123" \) -o \( "$g" -eq 2 -a "$c" = "456" \) ]
then echo abc
else echo efg
fi

I've enclosed the references to $g in double quotes; that's good practice, in general. Strictly, the parentheses aren't needed because the precedence of -a and -o makes it correct even without them.

Note that the -a and -o operators are part of the POSIX specification for test, aka [, mainly for backwards compatibility (since they were a part of test in 7th Edition UNIX, for example), but they are explicitly marked as 'obsolescent' by POSIX. Bash (see conditional expressions) seems to preempt the classic and POSIX meanings for -a and -o with its own alternative operators that take arguments.


With some care, you can use the more modern [[ operator, but be aware that the versions in Bash and Korn Shell (for example) need not be identical.

for g in 1 2 3
do
for c in 123 456 789
do
if [[ ( "$g" -eq 1 && "$c" = "123" ) || ( "$g" -eq 2 && "$c" = "456" ) ]]
then echo "g = $g; c = $c; true"
else echo "g = $g; c = $c; false"
fi
done
done

Example run, using Bash 3.2.57 on Mac OS X:

g = 1; c = 123; true
g = 1; c = 456; false
g = 1; c = 789; false
g = 2; c = 123; false
g = 2; c = 456; true
g = 2; c = 789; false
g = 3; c = 123; false
g = 3; c = 456; false
g = 3; c = 789; false

You don't need to quote the variables in [[ as you do with [ because it is not a separate command in the same way that [ is.


Isn't it a classic question?

I would have thought so. However, there is another alternative, namely:

if [ "$g" -eq 1 -a "$c" = "123" ] || [ "$g" -eq 2 -a "$c" = "456" ]
then echo abc
else echo efg
fi

Indeed, if you read the 'portable shell' guidelines for the autoconf tool or related packages, this notation — using '||' and '&&' — is what they recommend. I suppose you could even go so far as:

if [ "$g" -eq 1 ] && [ "$c" = "123" ]
then echo abc
elif [ "$g" -eq 2 ] && [ "$c" = "456" ]
then echo abc
else echo efg
fi

Where the actions are as trivial as echoing, this isn't bad. When the action block to be repeated is multiple lines, the repetition is too painful and one of the earlier versions is preferable — or you need to wrap the actions into a function that is invoked in the different then blocks.

How do I check to see if there are certain characters in a column in a dataframe for use in an if statement?

Let me see if I understand: if the error column contains the word "Error" in at least one of its rows, you want to run a clause (the condition evaluates to TRUE). This would be my solution:

library(stringr) # Useful for string manipulation
# str_detect() tests each string for the presence of a pattern
findingError <- str_detect(dataframe$errorcolumn, "Error")

# any() evaluates to TRUE when at least one element of a logical vector is TRUE
if (any(findingError)) {
  # Do something
}

Styling multi-line conditions in 'if' statements?

You don't need to use 4 spaces on your second conditional line. Maybe use:

if (cond1 == 'val1' and cond2 == 'val2' and
       cond3 == 'val3' and cond4 == 'val4'):
    do_something

Also, don't forget the whitespace is more flexible than you might think:

if (
       cond1 == 'val1' and cond2 == 'val2' and
       cond3 == 'val3' and cond4 == 'val4'
   ):
    do_something

if    (cond1 == 'val1' and cond2 == 'val2' and
       cond3 == 'val3' and cond4 == 'val4'):
    do_something

Both of those are fairly ugly though.

Maybe lose the brackets (the Style Guide discourages this though)?

if cond1 == 'val1' and cond2 == 'val2' and \
   cond3 == 'val3' and cond4 == 'val4':
    do_something

This at least gives you some differentiation.

Or even:

if cond1 == 'val1' and cond2 == 'val2' and \
                       cond3 == 'val3' and \
                       cond4 == 'val4':
    do_something

I think I prefer:

if cond1 == 'val1' and \
   cond2 == 'val2' and \
   cond3 == 'val3' and \
   cond4 == 'val4':
    do_something

Here's the Style Guide, which (since 2010) recommends using brackets.

Dynamically evaluate an expression from a formula in Pandas

You can use 1) pd.eval(), 2) df.query(), or 3) df.eval(). Their various features and functionality are discussed below.

Examples will involve these dataframes (unless otherwise specified).

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df3 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df4 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))


1) pandas.eval

This is the "Missing Manual" that the pandas docs should contain.
Note: of the three functions being discussed, pd.eval is the most important. df.eval and df.query call pd.eval under the hood. Behaviour and usage are more or less consistent across the three functions, with some minor semantic variations which will be highlighted later. This section introduces functionality that is common to all three functions, including (but not limited to) allowed syntax, precedence rules, and keyword arguments.

pd.eval can evaluate arithmetic expressions which can consist of variables and/or literals. These expressions must be passed as strings. So, to answer the question as stated, you can do

x = 5
pd.eval("df1.A + (df1.B * x)")

Some things to note here:

  1. The entire expression is a string
  2. df1 and x refer to variables in the global namespace; these are picked up by eval when parsing the expression
  3. Specific columns are accessed using attribute access. You can also use "df1['A'] + (df1['B'] * x)" to the same effect.

I will address the specific issue of reassignment in the section explaining the target=... argument below. But for now, here are some simpler examples of valid operations with pd.eval:

pd.eval("df1.A + df2.A")   # Valid, returns a pd.Series object
pd.eval("abs(df1) ** .5") # Valid, returns a pd.DataFrame object

...and so on. Conditional expressions are also supported in the same way. The statements below are all valid expressions and will be evaluated by the engine.

pd.eval("df1 > df2")
pd.eval("df1 > 5")
pd.eval("df1 < df2 and df3 < df4")
pd.eval("df1 in [1, 2, 3]")
pd.eval("1 < 2 < 3")

A list detailing all the supported features and syntax can be found in the documentation. In summary,

  • Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
  • Comparison operations, including chained comparisons, e.g., 2 < df < df2
  • Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool
  • list and tuple literals, e.g., [1, 2] or (1, 2)
  • Attribute access, e.g., df.a
  • Subscript expressions, e.g., df[0]
  • Simple variable evaluation, e.g., pd.eval('df') (this is not very useful)
  • Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.

This section of the documentation also specifies syntax that is not supported, including set/dict literals, if-else statements, loops, comprehensions, and generator expressions.

From the list, it is obvious you can also pass expressions involving the index, such as

pd.eval('df1.A * (df1.index > 1)')

1a) Parser Selection: The parser=... argument

pd.eval supports two different parser options when parsing the expression string to generate the syntax tree: pandas and python. The main difference between the two is highlighted by slightly differing precedence rules.

Using the default parser pandas, the overloaded bitwise operators & and | which implement vectorized AND and OR operations with pandas objects will have the same operator precedence as and and or. So,

pd.eval("(df1 > df2) & (df3 < df4)")

Will be the same as

pd.eval("df1 > df2 & df3 < df4")
# pd.eval("df1 > df2 & df3 < df4", parser='pandas')

And also the same as

pd.eval("df1 > df2 and df3 < df4")

Outside pd.eval, in plain Python, the parentheses are necessary to override the higher precedence of the bitwise operators:

(df1 > df2) & (df3 < df4)

Without that, we end up with

df1 > df2 & df3 < df4

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Use parser='python' if you want to maintain consistency with python's actual operator precedence rules while evaluating the string.

pd.eval("(df1 > df2) & (df3 < df4)", parser='python')
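
A quick way to convince yourself of the precedence claim is to compare the results directly. The sketch below is illustrative only: the random frames are my own (not the ones defined at the top of this answer) and it assumes the default 'pandas' parser.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.integers(0, 10, (5, 4)), columns=list("ABCD"))
df2 = pd.DataFrame(rng.integers(0, 10, (5, 4)), columns=list("ABCD"))
df3 = pd.DataFrame(rng.integers(0, 10, (5, 4)), columns=list("ABCD"))
df4 = pd.DataFrame(rng.integers(0, 10, (5, 4)), columns=list("ABCD"))

# Reference result computed in plain Python, where parentheses are required.
expected = (df1 > df2) & (df3 < df4)

# Default 'pandas' parser: & is given the same precedence as `and`.
no_parens = pd.eval("df1 > df2 & df3 < df4")
with_and = pd.eval("df1 > df2 and df3 < df4")

# 'python' parser: ordinary Python precedence, so keep the parentheses.
python_parser = pd.eval("(df1 > df2) & (df3 < df4)", parser="python")

print((no_parens == expected).all().all())
print((with_and == expected).all().all())
print((python_parser == expected).all().all())
```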

The other difference between the two parsers is the semantics of the == and != operators with list and tuple nodes, which behave like in and not in respectively when using the 'pandas' parser. For example,

pd.eval("df1 == [1, 2, 3]")

Is valid, and will run with the same semantics as

pd.eval("df1 in [1, 2, 3]")

OTOH, pd.eval("df1 == [1, 2, 3]", parser='python') will throw a NotImplementedError error.
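
Here's a small sketch of that == versus in equivalence on a single column. It goes through df.query (which forwards to pd.eval with the default 'pandas' parser) because that form is easy to check against isin; the toy frame and the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": list("vwxyz")})

# With the 'pandas' parser, comparing a column to a list acts like membership.
by_eq = df.query("A == [1, 2, 3]")
by_in = df.query("A in [1, 2, 3]")

# The conventional spelling of the same filter.
by_isin = df[df["A"].isin([1, 2, 3])]

print(by_eq)
```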

1b) Backend Selection: The engine=... argument

There are two options: numexpr (the default, provided the numexpr package is installed) and python. The numexpr option uses the numexpr backend, which is optimized for performance.

With the python backend, your expression is evaluated much as if you had passed it to Python's eval function. This gives you the flexibility to do more inside expressions, such as string operations.

df = pd.DataFrame({'A': ['abc', 'def', 'abacus']})
pd.eval('df.A.str.contains("ab")', engine='python')

0 True
1 False
2 True
Name: A, dtype: bool

Unfortunately, this method offers no performance benefits over the numexpr engine, and there are very few security measures to ensure that dangerous expressions are not evaluated, so use at your own risk! It is generally not recommended to change this option to 'python' unless you know what you're doing.

1c) local_dict and global_dict arguments

Sometimes, it is useful to supply values for variables that are used inside expressions but not currently defined in your namespace. You can pass a dictionary to local_dict.

For example:

pd.eval("df1 > thresh")

UndefinedVariableError: name 'thresh' is not defined

This fails because thresh is not defined. However, this works:

pd.eval("df1 > thresh", local_dict={'thresh': 10})

This is useful when you have variables to supply from a dictionary. Alternatively, with the Python engine, you could simply do this:

mydict = {'thresh': 5}
# Dictionary values with *string* keys cannot be accessed without
# using the 'python' engine.
pd.eval('df1 > mydict["thresh"]', engine='python')

But this may be much slower than using the 'numexpr' engine and passing a dictionary to local_dict or global_dict. Hopefully, this makes a convincing argument for the use of these parameters.

1d) The target (+ inplace) argument, and Assignment Expressions

This is not often a requirement because there are usually simpler ways of doing this, but you can assign the result of pd.eval to an object that supports item assignment (__setitem__), such as dicts and (you guessed it) DataFrames.

Consider the example in the question

x = 5
df2['D'] = df1['A'] + (df1['B'] * x)

To assign a column "D" to df2, we do

pd.eval('D = df1.A + (df1.B * x)', target=df2)

A B C D
0 5 9 8 5
1 4 3 0 52
2 5 0 2 22
3 8 1 3 48
4 3 7 0 42

This is not an in-place modification of df2 (but it can be... read on). Consider another example:

pd.eval('df1.A + df2.A')

0 10
1 11
2 7
3 16
4 10
dtype: int32

If you wanted to (for example) assign this back to a DataFrame, you could use the target argument as follows:

df = pd.DataFrame(columns=list('FBGH'), index=df1.index)
df
F B G H
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN

df = pd.eval('B = df1.A + df2.A', target=df)
# Similar to
# df = df.assign(B=pd.eval('df1.A + df2.A'))

df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN

If you wanted to perform an in-place mutation on df, set inplace=True.

pd.eval('B = df1.A + df2.A', target=df, inplace=True)
# Similar to
# df['B'] = pd.eval('df1.A + df2.A')

df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN

If inplace is set without a target, a ValueError is raised.

While the target argument is fun to play around with, you will seldom need to use it.

If you wanted to do this with df.eval, you would use an expression involving an assignment:

df = df.eval("B = @df1.A + @df2.A")
# df.eval("B = @df1.A + @df2.A", inplace=True)
df

F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN

Note

One of pd.eval's unintended uses is parsing literal strings in a manner very similar to ast.literal_eval:

pd.eval("[1, 2, 3]")
array([1, 2, 3], dtype=object)

It can also parse nested lists with the 'python' engine:

pd.eval("[[1, 2, 3], [4, 5], [10]]", engine='python')
[[1, 2, 3], [4, 5], [10]]

And lists of strings:

pd.eval(["[1, 2, 3]", "[4, 5]", "[10]"], engine='python')
[[1, 2, 3], [4, 5], [10]]

The problem, however, is for lists with length larger than 100:

pd.eval(["[1]"] * 100, engine='python') # Works
pd.eval(["[1]"] * 101, engine='python')

AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'

More information on this error, its causes, fixes, and workarounds can be found here.



2) DataFrame.eval:

As mentioned above, df.eval calls pd.eval under the hood, with a bit of juxtaposition of arguments. The v0.23 source code shows this:

def eval(self, expr, inplace=False, **kwargs):

    from pandas.core.computation.eval import eval as _eval

    inplace = validate_bool_kwarg(inplace, 'inplace')
    resolvers = kwargs.pop('resolvers', None)
    kwargs['level'] = kwargs.pop('level', 0) + 1
    if resolvers is None:
        index_resolvers = self._get_index_resolvers()
        resolvers = dict(self.iteritems()), index_resolvers
    if 'target' not in kwargs:
        kwargs['target'] = self
    kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
    return _eval(expr, inplace=inplace, **kwargs)

eval creates arguments, does a little validation, and passes the arguments on to pd.eval.

For more, you can read on: When to use DataFrame.eval() versus pandas.eval() or Python eval()



2a) Usage Differences

2a1) Expressions with DataFrames vs. Series Expressions

For dynamic queries associated with entire DataFrames, you should prefer pd.eval. For example, there is no simple way to specify the equivalent of pd.eval("df1 + df2") when you call df1.eval or df2.eval.

2a2) Specifying Column Names

Another major difference is how columns are accessed. For example, to add two columns "A" and "B" in df1, you would call pd.eval with the following expression:

pd.eval("df1.A + df1.B")

With df.eval, you need only supply the column names:

df1.eval("A + B")

Since, within the context of df1, it is clear that "A" and "B" refer to column names.

You can also refer to the index using index (unless the index is named, in which case you would use the name).

df1.eval("A + index")
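
To make the comparison concrete, here's a small side-by-side sketch. The three-row df1 and the variable x are made up for illustration (they are not the frames defined at the top of this answer); the @ prefix is how df.eval refers to a local Python variable.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
x = 5

# pd.eval needs the frame name; df.eval resolves bare column names.
via_pd_eval = pd.eval("df1.A + df1.B")
via_df_eval = df1.eval("A + B")

# Local variables are referenced with the @ prefix inside df.eval.
with_local = df1.eval("A + B * @x")

# The (unnamed) index is available under the name `index`.
with_index = df1.eval("A + index")

print(via_df_eval.tolist(), with_local.tolist(), with_index.tolist())
```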

