Pandas Dataframe Check If Column Value Exists in a Group of Columns

You can use the underlying numpy arrays for performance:

Setup

a = df.v0.values
b = df.iloc[:, 2:].values

df.assign(out=(a[:, None]==b).any(1).astype(int))
   id  v0  v1  v2   v3   v4  out
0   1  10   5  10   22   50    1
1   2  22  23  55   60   50    0
2   3   8   2  40   80  110    0
3   4  15  15  25  100  101    1

This solution leverages broadcasting to allow for pairwise comparison:

First, we broadcast a:

>>> a[:, None]
array([[10],
       [22],
       [ 8],
       [15]], dtype=int64)

Which allows for pairwise comparison with b:

>>> a[:, None] == b
array([[False,  True, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]])

We then check for any True value within each row (axis 1, not the first axis) and convert the Boolean result to integers.
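For a self-contained version of the steps above, the sketch below rebuilds the sample frame from the printed output and names the intermediate arrays. The original construction of df is not shown, so the constructor call here is an assumption.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the printed output above (assumed layout).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v0": [10, 22, 8, 15],
    "v1": [5, 23, 2, 15],
    "v2": [10, 55, 40, 25],
    "v3": [22, 60, 80, 100],
    "v4": [50, 50, 110, 101],
})

a = df["v0"].to_numpy()          # shape (4,)
b = df.iloc[:, 2:].to_numpy()    # shape (4, 4): columns v1..v4

# a[:, None] has shape (4, 1); broadcasting against (4, 4) compares
# each row's v0 with every other value in the same row.
matches = a[:, None] == b

# Reduce across columns: 1 if any value in the row equals v0, else 0.
out = matches.any(axis=1).astype(int)
result = df.assign(out=out)
print(result)
```

Note that the intermediate Boolean array has one entry per (row, column) pair, so memory use grows with the number of compared columns.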


Performance


Functions

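def user_chris(df):
    a = df.v0.values
    b = df.iloc[:, 2:].values
    return (a[:, None] == b).any(axis=1).astype(int)

def rahlf23(df):
    df = df.set_index('id')
    # drop(columns=...) and any(axis=1): positional axis arguments no longer work in pandas 2.x
    return df.drop(columns='v0').isin(df['v0']).any(axis=1).astype(int)

def chris_a(df):
    return df.loc[:, "v1":].eq(df['v0'], axis=0).any(axis=1).astype(int)

def chris(df):
    return df.apply(lambda x: int(x['v0'] in x.values[2:]), axis=1)

def anton_vbr(df):
    # set_index returns a new frame, so the caller's df is not modified
    # and the function can be timed repeatedly on the same input
    df = df.set_index('id')
    return df.isin(df.pop('v0')).any(axis=1).astype(int)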

Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from timeit import timeit

res = pd.DataFrame(
    index=['user_chris', 'rahlf23', 'chris_a', 'chris', 'anton_vbr'],
    columns=[10, 50, 100, 500, 1000, 5000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        vals = np.random.randint(1, 100, (c, c))
        vals = np.column_stack((np.arange(vals.shape[0]), vals))
        df = pd.DataFrame(vals, columns=['id'] + [f'v{i}' for i in range(0, vals.shape[0])])
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

plt.show()

Output

[Figure: log-log plot of relative time against N for each function]
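Before comparing timings, it can be worth confirming that the variants return the same answer. The sketch below checks the NumPy broadcasting version against the eq version on random data shaped like the benchmark input. The seed and the size n are arbitrary choices, not values from the original benchmark.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
n = 50                           # arbitrary size for the check
vals = rng.integers(1, 100, size=(n, n))
vals = np.column_stack((np.arange(n), vals))
df = pd.DataFrame(vals, columns=["id"] + [f"v{i}" for i in range(n)])

# NumPy broadcasting variant.
a = df["v0"].to_numpy()
b = df.iloc[:, 2:].to_numpy()
via_numpy = (a[:, None] == b).any(axis=1).astype(int)

# pandas eq variant, comparing v0 against each column row by row.
via_eq = df.loc[:, "v1":].eq(df["v0"], axis=0).any(axis=1).astype(int)

agree = np.array_equal(via_numpy, via_eq.to_numpy())
print(agree)
```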

Check if a value exists using multiple conditions within group in pandas

Group by the Group column, then use transform with a lambda to test whether each group contains a Value1 equal to 7 and a Value2 equal to 9:

g = df.groupby('Group')
df['Expected'] = (g['Value1'].transform(lambda x: x.eq(7).any()))&(g['Value2'].transform(lambda x: x.eq(9).any()))

Alternatively, use groupby with apply and merge the per-group result back onto the frame with how='left':

df.merge(df.groupby('Group').apply(lambda x: x['Value1'].eq(7).any()&x['Value2'].eq(9).any()).reset_index(),how='left').rename(columns={0:'Expected_Output'})

Or use groupby with apply and map the per-group result onto the Group column:

df['Expected_Output'] = df['Group'].map(df.groupby('Group').apply(lambda x: x['Value1'].eq(7).any()&x['Value2'].eq(9).any()))

print(df)
   Group  Value1  Value2  Expected_Output
0      1       3       9             True
1      1       7       6             True
2      1       9       7             True
3      2       3       8            False
4      2       8       5            False
5      2       7       6            False
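A variant that avoids the lambda is sketched below: build the row-level Boolean masks first, then reduce them per group with the built-in "any" aggregation. This is a different technique from the answers above, not a restatement of them, and the input frame is reconstructed from the printed output.

```python
import pandas as pd

# Data reconstructed from the printed output above.
df = pd.DataFrame({
    "Group":  [1, 1, 1, 2, 2, 2],
    "Value1": [3, 7, 9, 3, 8, 7],
    "Value2": [9, 6, 7, 8, 5, 6],
})

# Row-level masks, then a per-group "is any value True?" broadcast back to rows.
has_7 = df["Value1"].eq(7).groupby(df["Group"]).transform("any")
has_9 = df["Value2"].eq(9).groupby(df["Group"]).transform("any")

df["Expected_Output"] = has_7 & has_9
print(df)
```

Group 2 contains a 7 in Value1 but no 9 in Value2, so all of its rows come out False.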

How to check if a value in a dataframe's column is in another dataframe's column with group by

Option 1: a left self-merge:

m = x.merge(x, left_on=['x','y','z'],
            right_on=['x','y','s'],
            how='left', suffixes=['','_']
            )

You would see:

   x  y  z  s   z_   s_
0  1  4  a  a    a    a
1  1  4  a  a    b    a
2  1  4  b  a    c    b
3  1  4  c  b  NaN  NaN
4  1  5  a  a    a    a
5  1  5  a  a    b    a
6  1  5  a  a    c    a
7  1  5  b  a  NaN  NaN
8  1  5  c  a  NaN  NaN
9  2  4  a  b  NaN  NaN

The rows you want are those where s_ is NaN, that is, rows whose z value found no matching s within the same (x, y) group:

m.loc[m['s_'].isna(), x.columns]

Output:

   x  y  z  s
3  1  4  c  b
7  1  5  b  a
8  1  5  c  a
9  2  4  a  b

Option 2: group by x and y, then filter each group with isin:

(x.groupby(['x','y'])
  .apply(lambda d: d[~d['z'].isin(d['s'])])
  .reset_index(level=['x','y'], drop=True)
)

Output:

   x  y  z  s
2  1  4  c  b
4  1  5  b  a
5  1  5  c  a
6  2  4  a  b
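To try both ideas end to end, the sketch below uses an input frame inferred from the outputs shown above (the original x is not given, so this reconstruction is an assumption). It runs the left self-merge and then cross-checks it with a MultiIndex membership test, a swapped-in alternative to groupby/apply that keeps the original row labels.

```python
import pandas as pd

# Input inferred from the outputs shown above (an assumption).
x = pd.DataFrame({
    "x": [1, 1, 1, 1, 1, 1, 2],
    "y": [4, 4, 4, 5, 5, 5, 4],
    "z": ["a", "b", "c", "a", "b", "c", "a"],
    "s": ["a", "a", "b", "a", "a", "a", "b"],
})

# Left self-merge: rows with no match have NaN in the right-hand columns.
m = x.merge(x, left_on=["x", "y", "z"], right_on=["x", "y", "s"],
            how="left", suffixes=("", "_"))
from_merge = m.loc[m["s_"].isna(), x.columns]

# Cross-check: is each (x, y, z) triple absent from the (x, y, s) triples?
z_keys = pd.MultiIndex.from_frame(x[["x", "y", "z"]])
s_keys = pd.MultiIndex.from_frame(x[["x", "y", "s"]])
from_isin = x[~z_keys.isin(s_keys)]

print(from_isin)
```

The two results hold the same rows; only the index differs, because the merge renumbers rows while the membership test keeps the labels of x.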

Pandas dataframe check if a value exists in multiple columns for one row

To get the first row that meets the criteria:

df.index[df.sum(axis=1).gt(1)][0]

Output:

Out[14]: 1

Because several rows can match, drop the [0] to get every row that meets the criteria.
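As an illustration, the sketch below applies the same expression to a small 0/1 indicator frame. The data and the column names a, b, c are hypothetical, since the original frame is not shown, and the guard for the no-match case is an addition.

```python
import pandas as pd

# Hypothetical 0/1 indicator frame; the original data is not shown.
df = pd.DataFrame({
    "a": [1, 0, 1],
    "b": [0, 1, 1],
    "c": [0, 1, 0],
})

# True for rows whose values sum to more than 1.
mask = df.sum(axis=1).gt(1)

all_rows = df.index[mask]                             # every matching row label
first_row = all_rows[0] if len(all_rows) else None    # avoids IndexError when nothing matches
print(first_row, list(all_rows))
```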

Pandas - check if a value exists in multiple columns for each row

Using NumPy to count the occurrences of 'Y' in each row should do it:

df['multi'] = ['Y' if x > 1 else 'N' for x in np.sum(df.values == 'Y', 1)]

output:

      Name ID1   ID2   ID3 multi
Index
1        A   Y     Y     Y     Y
2        B   Y     Y  None     Y
3        B   Y  None  None     N
4        C   Y  None  None     N
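A runnable variant is sketched below, with the frame reconstructed from the printed output. It differs from the one-liner in two ways that are assumptions on my part: it counts only the ID columns, so a Name of 'Y' would not be counted, and it uses np.where instead of a list comprehension.

```python
import numpy as np
import pandas as pd

# Data reconstructed from the printed output above.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", "Y"],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, None],
    },
    index=pd.Index([1, 2, 3, 4], name="Index"),
)

id_cols = ["ID1", "ID2", "ID3"]   # assumed: only the ID columns should be counted

# Count 'Y' per row, then map the counts to 'Y'/'N' in one vectorized step.
counts = df[id_cols].eq("Y").sum(axis=1)
df["multi"] = np.where(counts > 1, "Y", "N")
print(df)
```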

Pandas checking if values in multiple column exists in other columns

Assuming that the indices of your Actual and estimate DataFrames are the same, one approach is to apply an isin check row by row (axis=1), testing each row of Actual against the matching row of estimate.

Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')

Here we use the name attribute as the glue between the two DataFrames.

Demo

>>> Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')

   Actual1  Actual2  Actual3  Actual4  Actual5
0        0        1        1        0        1
1        0        1        0        1        1
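The original Actual and estimate frames are not shown, so the sketch below uses small hypothetical ones (the estimate column names Est1 and Est2 are invented) to show the row-wise check. It also includes a NumPy broadcasting equivalent for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical frames sharing the same index; the original data is not shown.
Actual = pd.DataFrame({"Actual1": [1, 7], "Actual2": [2, 8]})
estimate = pd.DataFrame({"Est1": [2, 5], "Est2": [3, 7]})

# Row-wise check: x.name is the row label, used to fetch the matching estimate row.
by_apply = Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype("int")

# Broadcasting form: compare every Actual value in a row
# with every estimate value in the same row.
A = Actual.to_numpy()[:, :, None]      # shape (rows, actual_cols, 1)
E = estimate.to_numpy()[:, None, :]    # shape (rows, 1, estimate_cols)
by_numpy = (A == E).any(axis=2).astype(int)

print(by_apply)
```

The broadcasting form relies on row position rather than index labels, so it only matches the apply version when both frames are in the same row order.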

