Logical Operators For Boolean Indexing in Pandas

When you say

(a['x']==1) and (a['y']==10)

you are implicitly asking Python to convert each of the operands (a['x']==1) and (a['y']==10) to a single Boolean value.

NumPy arrays (of length greater than 1) and Pandas objects such as Series do not have a Boolean value -- in other words, they raise

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

when used as a Boolean value. That's because it's unclear when it should be True or False. Some users might assume they are True if they have non-zero length, like a Python list. Others might want it to be True only if all of its elements are True. Still others might want it to be True if any of its elements are True.

Because there are so many conflicting expectations, the designers of NumPy and Pandas refuse to guess, and instead raise a ValueError.

Instead, you must be explicit, using the .empty attribute or the .all() or .any() method to indicate which behavior you desire.
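As a quick sketch of those explicit alternatives (the Series here is hypothetical example data):

```python
import pandas as pd

s = pd.Series([True, False, True])

# State explicitly which Boolean reduction you want:
print(s.any())   # True  -- at least one element is True
print(s.all())   # False -- not every element is True
print(s.empty)   # False -- the Series has elements (note: an attribute, not a method)
```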

In this case, however, it looks like you do not want Boolean evaluation, you want element-wise logical-and. That is what the & binary operator performs:

(a['x']==1) & (a['y']==10)

returns a boolean Series.
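As a sketch, with a small hypothetical DataFrame a in the spirit of the question:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 20, 10]})

mask = (a['x'] == 1) & (a['y'] == 10)   # element-wise AND, a boolean Series
print(mask.tolist())                    # [True, False, False]
print(a[mask])                          # only the rows where both conditions hold
```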


By the way, as alexpmil notes,
the parentheses are mandatory since & has a higher operator precedence than ==.

Without the parentheses, a['x']==1 & a['y']==10 would be evaluated as a['x'] == (1 & a['y']) == 10 which would in turn be equivalent to the chained comparison (a['x'] == (1 & a['y'])) and ((1 & a['y']) == 10). That is an expression of the form Series and Series.
The use of and with two Series would again trigger the same ValueError as above. That's why the parentheses are mandatory.
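A small demonstration of the precedence pitfall, using hypothetical data:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 20, 10]})

# With parentheses: works as intended.
ok = (a['x'] == 1) & (a['y'] == 10)

# Without parentheses, & binds tighter than ==, so the expression is parsed
# as the chained comparison a['x'] == (1 & a['y']) == 10, which implicitly
# applies `and` to two Series and raises the ambiguous-truth-value error.
try:
    a['x'] == 1 & a['y'] == 10
except ValueError as e:
    print("raises:", e)
```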

difference between & and and in pandas

The expressions len(df_temp > 0) and len(df_temp4 > 0) probably don't do what you expect. Comparison operators on pandas DataFrames return element-wise results: they create a boolean DataFrame in which each value indicates whether the corresponding value in the original DataFrame is greater than zero:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [-1,0,1], 'b': [-1,0,1]})
>>> df
   a  b
0 -1 -1
1  0  0
2  1  1
>>> df > 0
       a      b
0  False  False
1  False  False
2   True   True

So the len of df is the same as the len of df > 0:

>>> len(df)
3
>>> len(df > 0)
3

difference between "&" and "and"

They mean different things:

  • & is bitwise and
  • and is logical and (and short-circuiting)

Since you asked specifically about pandas (assuming at least one operand is a NumPy array, pandas Series, or pandas DataFrame):

  • & also refers to the element-wise "bitwise and".
  • The element-wise "logical and" for pandas isn't and; you have to use a function instead, namely numpy.logical_and.

For more explanation you can refer to "Difference between 'and' (boolean) vs. '&' (bitwise) in python. Why difference in behavior with lists vs numpy arrays?"
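As a brief sketch of the distinction, using hypothetical boolean Series:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([True, True, False])
s2 = pd.Series([True, False, False])

# Element-wise "logical and": & and np.logical_and agree on boolean Series.
print((s1 & s2).tolist())               # [True, False, False]
print(np.logical_and(s1, s2).tolist())  # [True, False, False]

# On plain Python bools, `and` short-circuits and returns one of its operands:
print(True and False)                   # False
```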

not sure what would cause this statement to fail all of a sudden.

You did not describe the failure or the expected behavior, so unfortunately I cannot help you there.

Element-wise logical OR in Pandas

The corresponding operator is |:

 df[(df < 3) | (df == 5)]

checks element-wise whether each value is less than 3 or equal to 5.


If you need a function to do this, we have np.logical_or. For two conditions, you can use

df[np.logical_or(df<3, df==5)]

Or, for multiple conditions, use logical_or.reduce:

df[np.logical_or.reduce([df<3, df==5])]

Since the conditions are specified as individual arguments, parentheses grouping is not needed.

More information on logical operations can be found in the pandas documentation on boolean indexing.

What happens when I pass a boolean dataframe to the indexing operator for another dataframe in pandas?

test_df[..] calls an indexing method __getitem__(). From the source code:

def __getitem__(self, key):
    ...
    # Do we have a (boolean) DataFrame?
    if isinstance(key, DataFrame):
        return self.where(key)

    # Do we have a (boolean) 1d indexer?
    if com.is_bool_indexer(key):
        return self._getitem_bool_array(key)

As you can see, if the key is a boolean DataFrame, it will call pandas.DataFrame.where(). The function of where() is to replace values where the condition is False with NaN by default.

# print(test_df.isnull())
       0      1      2      3
0  False  False  False  False
1  False  False  False   True
2  False  False   True   True

# print(test_df)
   0  1    2    3
0  1  2  3.0  4.0
1  3  4  5.0  NaN
2  4  5  NaN  NaN

test_df.where(test_df.isnull()) therefore replaces the non-null values with NaN.
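A runnable sketch of that equivalence, rebuilding the test_df shown above:

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame([[1, 2, 3.0, 4.0],
                        [3, 4, 5.0, np.nan],
                        [4, 5, np.nan, np.nan]])

# Indexing with a boolean DataFrame is equivalent to calling .where():
masked = test_df[test_df.isnull()]
print(masked.equals(test_df.where(test_df.isnull())))  # True

# Every originally non-null value has been replaced, so nothing remains:
print(masked.isnull().all().all())                     # True
```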

Using boolean indexing or groupby to sum customers and identify products

You can do everything with boolean indexing:

sub = df[
(df["Customer Cat"] == "Disloyal")
& (df["Satisfaction"] == "Dissatisfied")
& df["Age"].between(30, 40)
]

Then you run your analysis like:

sub[(sub["Prod A Rank"] <= 2) & (sub["Prod B Rank"] <= 2)].shape[0]
# Given your example this outputs 1

Alternatively, if you want to know whether your subset of customers was dissatisfied with either one of your products, you can use the logical operator | (OR):

sub[(sub["Prod A Rank"] <= 2) | (sub["Prod B Rank"] <= 2)].shape[0]

If you want to study the products, you can try this:

(sub.melt(value_vars=[c for c in sub.columns if c.startswith("Prod")])
    .groupby("variable")
    .value_counts()
    .to_frame()
    .reset_index()
    .rename(columns={0: "count"}))

This outputs:

      variable  value  count
0  Prod A Rank      0      1
1  Prod B Rank      2      1

Applying an IF condition in multiple columns with pandas

If you have multiple columns that start with val to process in one step, you can use .filter() to select them and store them in a list cols. Then use .loc to set the selected columns, as follows:

# put all columns that start with `val` into a list
cols = df.filter(regex='^val').columns

# set 0 all the variables val*
df.loc[(df['Lon'] == 22) & (df['Lat'] == 38), cols] = 0
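A self-contained sketch of the same pattern, with hypothetical data (the val1/val2 columns and coordinate values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Lon': [22, 22, 30],
                   'Lat': [38, 40, 38],
                   'val1': [5, 6, 7],
                   'val2': [8, 9, 10]})

# Select every column whose name starts with 'val':
cols = df.filter(regex='^val').columns

# Zero out those columns only on the rows matching both conditions:
df.loc[(df['Lon'] == 22) & (df['Lat'] == 38), cols] = 0

print(df)  # only the first row has val1/val2 set to 0
```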

