Pandas DataFrame check if column value exists in a group of columns
You can use the underlying numpy
arrays for performance:
Setup
a = df.v0.values
b = df.iloc[:, 2:].values
df.assign(out=(a[:, None]==b).any(1).astype(int))
id v0 v1 v2 v3 v4 out
0 1 10 5 10 22 50 1
1 2 22 23 55 60 50 0
2 3 8 2 40 80 110 0
3 4 15 15 25 100 101 1
This solution leverages broadcasting to allow for pairwise comparison:
First, we broadcast a
:
>>> a[:, None]
array([[10],
[22],
[ 8],
[15]], dtype=int64)
Which allows for pairwise comparison with b
:
>>> a[:, None] == b
array([[False, True, False, False],
[False, False, False, False],
[False, False, False, False],
[ True, False, False, False]])
We then simply check for any True
results along the first axis, and convert to integer.
Performance
Functions
def user_chris(df):
a = df.v0.values
b = df.iloc[:, 2:].values
return (a[:, None]==b).any(1).astype(int)
def rahlf23(df):
df = df.set_index('id')
return df.drop('v0', 1).isin(df['v0']).any(1).astype(int)
def chris_a(df):
return df.loc[:, "v1":].eq(df['v0'], 0).any(1).astype(int)
def chris(df):
return df.apply(lambda x: int(x['v0'] in x.values[2:]), axis=1)
def anton_vbr(df):
df.set_index('id', inplace=True)
return df.isin(df.pop('v0')).any(1).astype(int)
Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from timeit import timeit
res = pd.DataFrame(
index=['user_chris', 'rahlf23', 'chris_a', 'chris', 'anton_vbr'],
columns=[10, 50, 100, 500, 1000, 5000],
dtype=float
)
for f in res.index:
for c in res.columns:
vals = np.random.randint(1, 100, (c, c))
vals = np.column_stack((np.arange(vals.shape[0]), vals))
df = pd.DataFrame(vals, columns=['id'] + [f'v{i}' for i in range(0, vals.shape[0])])
stmt = '{}(df)'.format(f)
setp = 'from __main__ import df, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=50)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");
plt.show()
Output
Check if a value exists using multiple conditions within group in pandas
Use groupby
on Group
column and then use transform
and lambda function
as:
g = df.groupby('Group')
df['Expected'] = (g['Value1'].transform(lambda x: x.eq(7).any()))&(g['Value2'].transform(lambda x: x.eq(9).any()))
Or using groupby
, apply
and merge
using parameter how='left'
as:
df.merge(df.groupby('Group').apply(lambda x: x['Value1'].eq(7).any()&x['Value2'].eq(9).any()).reset_index(),how='left').rename(columns={0:'Expected_Output'})
Or using groupby
, apply
and map
as:
df['Expected_Output'] = df['Group'].map(df.groupby('Group').apply(lambda x: x['Value1'].eq(7).any()&x['Value2'].eq(9).any()))
print(df)
Group Value1 Value2 Expected_Output
0 1 3 9 True
1 1 7 6 True
2 1 9 7 True
3 2 3 8 False
4 2 8 5 False
5 2 7 6 False
How to check if a value in a dataframe's column is in another dataframe's column with group by
Left merge:
m = x.merge(x, left_on=['x','y','z'],
right_on=['x','y','s'],
how='left', suffixes=['','_']
)
You would see:
x y z s z_ s_
0 1 4 a a a a
1 1 4 a a b a
2 1 4 b a c b
3 1 4 c b NaN NaN
4 1 5 a a a a
5 1 5 a a b a
6 1 5 a a c a
7 1 5 b a NaN NaN
8 1 5 c a NaN NaN
9 2 4 a b NaN NaN
Then your data is where s_
is NaN
, so
m.loc[m['s_'].isna(), x.columns]
Output:
x y z s
3 1 4 c b
7 1 5 b a
8 1 5 c a
9 2 4 a b
Option 2: do an apply
with isin
on groupby
:
(x.groupby(['x','y'])
.apply(lambda d: d[~d['z'].isin(d['s'])])
.reset_index(level=['x','y'], drop=True)
)
Output:
x y z s
2 1 4 c b
4 1 5 b a
5 1 5 c a
6 2 4 a b
Pandas dataframe check if a value exists in multiple columns for one row
To get the first row that meets the criteria:
df.index[df.sum(axis=1).gt(1)][0]
Output:
Out[14]: 1
Since you can get multiple matches, you can exclude the [0]
to get all the rows that meet your criteria
Pandas - check if a value exists in multiple columns for each row
using numpy to sum by row to occurrences of Y should do it:
df['multi'] = ['Y' if x > 1 else 'N' for x in np.sum(df.values == 'Y', 1)]
output:
Name ID1 ID2 ID3 multi
Index
1 A Y Y Y Y
2 B Y Y None Y
3 B Y None None N
4 C Y None None N
Pandas checking if values in multiple column exists in other columns
Assuming that the indices in your Actual
and estimate
DataFrames are the same, one approach would be to just apply
a check along the columns with isin
.
Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
Here we use the name
attribute as the glue between the two DataFrames.
Demo
>>> Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
Actual1 Actual2 Actual3 Actual4 Actual5
0 0 1 1 0 1
1 0 1 0 1 1
Related Topics
How to Create a for Loop That Goes Through All Diagonal Possibilities of a List
How to Track the Number of Times a Function Is Called
Type Conversion in Python Attributeerror: 'Str' Object Has No Attribute 'Astype'
Iterating Over Every Two Elements in a List
Typeerror: Image Data Can Not Convert to Float
How to Append Dataframes Inside a for Loop in Python
Fillna in Multiple Columns in Place in Python Pandas
Getting the Id of the Last Record Inserted for Postgresql Serial Key With Python
Centering Text in Ipython Notebook Markdown/Heading Cells
Matching Text Between a Pair of Single Quotes
Python: How to Read and Load an Excel File from Aws S3
Generate List of Quarters Betweeen Given Dates
Cannot Convert the Series to <Class 'Int''>
Counting the Number of Duplicates in a List
Pandas Get Frequency of Item Occurrences in a Column as Percentage
Sum Numbers of Each Row of a Matrix Python