How to Find Rows of One Dataframe in Another Dataframe

pandas get rows which are NOT in other dataframe

One method would be to store the result of an inner merge from both dfs, then select the rows of df1 whose column values are not in this common set:

In [119]:

common = df1.merge(df2, on=['col1', 'col2'])
print(common)
df1[(~df1.col1.isin(common.col1)) & (~df1.col2.isin(common.col2))]

   col1  col2
0     1    10
1     2    11
2     3    12

Out[119]:
   col1  col2
3     4    13
4     5    14
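For reference, a self-contained version of the above; the input frames are inferred from the printed output, so treat them as a reconstruction:

```python
import pandas as pd

# Inputs inferred from the frames printed above
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

# Inner merge gives the rows common to both frames
common = df1.merge(df2, on=['col1', 'col2'])

# Keep the df1 rows whose col1 and col2 values are both absent from common
result = df1[(~df1.col1.isin(common.col1)) & (~df1.col2.isin(common.col2))]
print(result)
```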

EDIT

Another method, as you've found, is to use isin, which will produce NaN rows that you can then drop:

In [138]:

df1[~df1.isin(df2)].dropna()

Out[138]:
   col1  col2
3     4    13
4     5    14

However, if df2's rows do not line up with df1's in the same positions, this won't work, because isin compares element-wise after aligning on the index:

df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})

it will return the entire df:

In [140]:

df1[~df1.isin(df2)].dropna()

Out[140]:
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
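One alignment-independent alternative (a sketch, not from the original answer) is to compare whole rows as tuples, so index positions no longer matter:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [2, 3, 4], 'col2': [11, 12, 13]})

# Turn each row into a tuple, then test set membership row-by-row;
# this ignores the index entirely, unlike DataFrame.isin(DataFrame)
mask = df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))
print(df1[~mask])
```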

How to find rows of one dataframe in another dataframe?

Use this:

mask = df1[['COL1', 'COL2']].isin(df2[['COL1', 'COL2']]).all(axis=1)
df1[mask]

     COL1  COL2  COL3
1     Bob    12     1
2  Clarke    13     4

selected_rows = list(df1[mask].index)
# [1, 2]
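A minimal runnable sketch of this pattern; the frames here are made up, since the question's data isn't shown:

```python
import pandas as pd

# Hypothetical frames; only rows 1 and 2 match on both columns
df1 = pd.DataFrame({'COL1': ['Ann', 'Bob', 'Clarke'],
                    'COL2': [11, 12, 13],
                    'COL3': [0, 1, 4]})
df2 = pd.DataFrame({'COL1': ['Zoe', 'Bob', 'Clarke'],
                    'COL2': [99, 12, 13]})

# DataFrame.isin(DataFrame) compares element-wise, aligning on both
# index and column labels; all(axis=1) keeps rows where every compared
# column matched
mask = df1[['COL1', 'COL2']].isin(df2[['COL1', 'COL2']]).all(axis=1)
selected_rows = list(df1[mask].index)
```

Note that because this variant aligns on the index, it only finds rows that match at the same position in both frames; for position-independent matching, prefer a merge.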

Find rows with similar values in another dataframe

This is a perfect use case for melt as a starting point before merging your two dataframes. melt flattens your value columns (FeatureX) into long format. After merging, you have two columns, value_x (features from df1) and value_y (features from df2), that you need to compare.

Now, with query, keep the rows where these two columns are equal. Then use value_counts on the (Fruit, Order) columns and reshape the result with rename and reset_index. Finally, drop_duplicates on the Fruit column keeps only the first count per fruit, which is the highest one because the Matches column is already sorted in descending order.

You can execute this one-line step by step to see the transformation of the dataframe:

out = (pd.merge(df1.melt(['Fruit', 'Site']),
                df2.melt(['Order', 'Site']),
                on=['Site', 'variable'])
         .query('value_x == value_y')
         .value_counts(['Fruit', 'Order'])
         .rename('Matches')
         .reset_index()
         .drop_duplicates('Fruit'))

Final output:

>>> out
     Fruit Order  Matches
0    Apple    XY        3
1   Banana    XY        3
6   Cherry    XY        2
7   Durian    YY        2
12   Grape    ZZ        1

Note: check my result carefully, because it does not match your expected output exactly.
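To see what the melt step does in isolation, here is a tiny sketch with made-up data:

```python
import pandas as pd

# Hypothetical wide frame: one feature column per feature
df1 = pd.DataFrame({'Fruit': ['Apple', 'Banana'],
                    'Site': ['A', 'B'],
                    'Feature1': [1, 2],
                    'Feature2': [3, 4]})

# melt keeps the id columns and stacks the remaining feature columns
# into (variable, value) pairs: one long row per (row, feature)
long = df1.melt(['Fruit', 'Site'])
print(long)
```

With 2 rows and 2 feature columns, the long frame has 4 rows, which is what makes the subsequent merge on ['Site', 'variable'] compare features pairwise.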

Pandas: Find rows which don't exist in another DataFrame by multiple columns

Since pandas 0.17.0 there is an indicator parameter you can pass to merge, which tells you whether each row is present only in the left frame, only in the right frame, or in both:

In [5]:
merged = df.merge(other, how='left', indicator=True)
merged

Out[5]:
   col1 col2  extra_col     _merge
0     0    a       this  left_only
1     1    b         is       both
2     1    c       just  left_only
3     2    b  something  left_only

In [6]:
merged[merged['_merge']=='left_only']

Out[6]:
   col1 col2  extra_col     _merge
0     0    a       this  left_only
2     1    c       just  left_only
3     2    b  something  left_only

So you can now filter the merged df by selecting only the 'left_only' rows.
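A runnable reconstruction; the input frames are inferred from the output above, so treat other in particular as an assumption:

```python
import pandas as pd

# Inferred from the Out[5] table: only the (1, 'b') row appears in both
df = pd.DataFrame({'col1': [0, 1, 1, 2],
                   'col2': ['a', 'b', 'c', 'b'],
                   'extra_col': ['this', 'is', 'just', 'something']})
other = pd.DataFrame({'col1': [1], 'col2': ['b']})

# Left merge on the shared columns; indicator=True adds a _merge column
merged = df.merge(other, how='left', indicator=True)

# Keep the rows that exist only in df, then drop the helper column
left_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(left_only)
```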

How do I select rows from a DataFrame based on column values?

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses

df['column_name'] >= A & df['column_name'] <= B

is parsed as

df['column_name'] >= (A & df['column_name']) <= B

which raises a "truth value of a Series is ambiguous" error.
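A short demonstration of the precedence trap and its fix:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
A, B = 1, 3

# Without parentheses: & binds first, giving A & df['x'], and the
# resulting chained comparison tries to coerce a Series to bool
try:
    df['x'] >= A & df['x'] <= B
except ValueError:
    pass  # "The truth value of a Series is ambiguous ..."

# With parentheses, each comparison yields a boolean Series and & works
ok = df.loc[(df['x'] >= A) & (df['x'] <= B)]
```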


To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple values you want to include, put them in a
list (or more generally, any iterable) and use isin:

print(df.loc[df['B'].isin(['one','three'])])

yields

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

Note, however, that if you wish to do this many times, it is more efficient to
make an index first, and then use df.loc:

df = df.set_index(['B'])
print(df.loc['one'])

yields

       A  C   D
B
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index use df.index.isin:

df.loc[df.index.isin(['one','two'])]

yields

       A  C   D
B
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12

How to match rows in DataFrame on another DataFrame with multiple conditions

Use merge with how='cross', then boolean masks to select the right rows:

out = pd.merge(df_1, df_2, how='cross', suffixes=('_df1', '_df2'))
m1 = out['num_df1'] != out['num_df2']
m2 = abs(out['time_df2'] - out['time_df1']) <= 10
out = out[m1 & m2]

Output:

>>> out
   num_df1  time_df1  num_df2  time_df2
1        1       100        2       104
5        2       200        3       200
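A self-contained reconstruction (df_1 and df_2 are inferred from the output above, so treat them as assumptions); note that how='cross' requires pandas 1.2+:

```python
import pandas as pd

# Frames inferred from the output above
df_1 = pd.DataFrame({'num': [1, 2], 'time': [100, 200]})
df_2 = pd.DataFrame({'num': [1, 2, 3], 'time': [100, 104, 200]})

# Cross join produces every (df_1 row, df_2 row) pair: 2 x 3 = 6 rows
out = pd.merge(df_1, df_2, how='cross', suffixes=('_df1', '_df2'))

m1 = out['num_df1'] != out['num_df2']                  # different num
m2 = (out['time_df2'] - out['time_df1']).abs() <= 10   # times within 10
out = out[m1 & m2]
print(out)
```

The cross join can get large quickly (len(df_1) * len(df_2) rows), so for big frames consider filtering earlier or using an interval-based join instead.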

