Deleting Dataframe Row in Pandas If a Combination of Column Values Equals a Tuple in a List

Say you have

removal_list = [(item1,store1),(item2,store1),(item2,store2)]

Then

df[['column_1', 'column_2']].apply(tuple, axis=1)

should create a Series of tuples, and so

df[['column_1', 'column_2']].apply(tuple, axis=1).isin(removal_list)

is the boolean condition you're after. Removal works the same way as before. This approach works for any number of columns.
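Putting the pieces together, a minimal runnable sketch (with hypothetical column names column_1/column_2 and sample data, since the original frame isn't shown):

```python
import pandas as pd

# Hypothetical data: two columns whose value pairs may appear in removal_list
df = pd.DataFrame({'column_1': ['item1', 'item2', 'item3'],
                   'column_2': ['store1', 'store2', 'store1']})
removal_list = [('item1', 'store1'), ('item2', 'store2')]

# Build a Series of row tuples, mark the rows to drop, keep the rest
mask = df[['column_1', 'column_2']].apply(tuple, axis=1).isin(removal_list)
df = df[~mask]
print(df)  # only the ('item3', 'store1') row remains
```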

Example

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df[['a', 'b']].apply(tuple, axis=1).isin([(1, 3), (30, 40)])
0     True
1    False
dtype: bool

Compare list of tuples against column in dataframe

You can make a DataFrame out of the list of tuples:

df = pd.DataFrame(list_of_tuples, columns=['id', 'pattern_id'])

and then join it with the main DataFrame:

joined = main_df.merge(df, on='id', how='inner')

The pattern_id column is included in joined for rows with a matching id.
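As a self-contained sketch (main_df and list_of_tuples are hypothetical sample data, not from the original question):

```python
import pandas as pd

# Hypothetical main DataFrame and list of (id, pattern_id) tuples
main_df = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
list_of_tuples = [(1, 'p1'), (3, 'p3')]

lookup = pd.DataFrame(list_of_tuples, columns=['id', 'pattern_id'])
joined = main_df.merge(lookup, on='id', how='inner')
print(joined)  # rows with id 1 and 3, each carrying its pattern_id
```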

Check whether tuple column in pandas contains some value from a list

You can use set.intersection with Series.map, then convert with astype(bool):

code = set(codes)
df.b.map(code.intersection).astype(bool)

0 True
1 False
Name: b, dtype: bool
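A runnable version of the snippet above, with hypothetical df and codes filled in (the originals aren't shown); a non-empty intersection is truthy, so astype(bool) yields the mask:

```python
import pandas as pd

# Hypothetical setup: column 'b' holds tuples, codes is the list to match
df = pd.DataFrame({'b': [(1, 5, 9), (2, 4, 6)]})
codes = [1, 9, 30]

code = set(codes)
# Each intersection is a set; empty sets become False, non-empty True
result = df.b.map(code.intersection).astype(bool)
print(result)  # 0 True, 1 False
```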

Timeit analysis

# setup
import numpy as np
import pandas as pd

codes = [1, 9, 30]  # hypothetical; the original 'codes' list is not shown
o = [np.random.randint(0, 10, (3,)) for _ in range(10_000)]
len(o)
# 10000

s = pd.Series(o)
s
0 [6, 2, 5]
1 [7, 4, 0]
2 [1, 8, 2]
3 [4, 8, 9]
4 [7, 3, 4]
...
9995 [3, 9, 4]
9996 [6, 2, 9]
9997 [2, 0, 5]
9998 [5, 0, 7]
9999 [7, 4, 2]
Length: 10000, dtype: object


# Adam's answer
In [38]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
19.1 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#komatiraju's answer
In [39]: %timeit s.apply(lambda x: any(val in x for val in codes))
83.8 ms ± 974 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

#My answer
In [42]: %%timeit
...: code = set(codes)
...: s.map(code.intersection).astype(bool)
...:
...:
15.5 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#wwnde's answer
In [74]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
19.5 ms ± 372 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

For a Series of size 1 million

bigger_o = np.repeat(o,100,axis=0)
bigger_o.shape
# (1000000, 3)
s = pd.Series(list(bigger_o))
s
0 [6, 2, 5]
1 [6, 2, 5]
2 [6, 2, 5]
3 [6, 2, 5]
4 [6, 2, 5]
...
999995 [7, 4, 2]
999996 [7, 4, 2]
999997 [7, 4, 2]
999998 [7, 4, 2]
999999 [7, 4, 2]
Length: 1000000, dtype: object


In [54]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
1.89 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: %timeit s.apply(lambda x: any(val in x for val in codes))
8.9 s ± 652 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [56]: %%timeit
...: code = set(codes)
...: s.map(code.intersection).astype(bool)
...:
...:
1.54 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [79]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
1.95 s ± 88.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How do you filter a Pandas dataframe by a multi-column set?

You can use Series.isin, but first it is necessary to create tuples from the first 3 columns:

print (df[df[['a','b','c']].apply(tuple, axis=1).isin(value_set)])

Or convert columns to index and use Index.isin:

print (df[df.set_index(['a','b','c']).index.isin(value_set)])

   a  b  c         d
0  1  2  3       not
1  1  2  3  relevant

Another idea is to use an inner join with DataFrame.merge and a helper DataFrame built from the same 3 column names; the on parameter can then be omitted, because the join uses the intersection of the column names of both DataFrames:

print (df.merge(pd.DataFrame(value_set, columns=['a','b','c'])))
   a  b  c         d
0  1  2  3       not
1  1  2  3  relevant
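A minimal runnable sketch of the index-based approach (df and value_set are hypothetical sample data):

```python
import pandas as pd

# Hypothetical frame and set of (a, b, c) combinations to keep
df = pd.DataFrame({'a': [1, 9], 'b': [2, 9], 'c': [3, 9], 'd': ['not', 'x']})
value_set = {(1, 2, 3)}

# Move the key columns into a MultiIndex and test membership against the set
kept = df[df.set_index(['a', 'b', 'c']).index.isin(value_set)]
print(kept)  # only the (1, 2, 3) row survives
```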

Filter Pandas dataframe based on combination of two columns

Use -

df[df[['a', 'b']].apply(tuple, axis=1).isin([(1,2), (4,3)])]

Output

   a  b
0  1  2
3  4  3

Explanation

df[['a', 'b']].apply(tuple, axis=1) gives a series of tuples -

0    (1, 2)
1    (2, 3)
2    (4, 2)
3    (4, 3)
dtype: object

.isin([(1,2), (4,3)]) searches for the desired tuples and gives a boolean Series.

Remove rows with empty lists from pandas data frame

You could try comparing against the string representation, as though the data frame held strings instead of lists:

import pandas as pd

df = pd.DataFrame({
    'donation_orgs': [[], ['the research of Dr.']],
    'donation_context': [[], ['In lieu of flowers , memorial donations']]})

df[df.astype(str)['donation_orgs'] != '[]']

Out[9]:
                            donation_context          donation_orgs
1  [In lieu of flowers , memorial donations]  [the research of Dr.]
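An alternative sketch that avoids the string round-trip is to rely on list truthiness, since empty lists are falsy (this variant is not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'donation_orgs': [[], ['the research of Dr.']],
    'donation_context': [[], ['In lieu of flowers , memorial donations']]})

# bool([]) is False, so mapping bool keeps only rows with non-empty lists
filtered = df[df['donation_orgs'].map(bool)]
print(filtered)
```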

Function to select pandas dataframe rows based on list of tuples of columns and cutoffs?

Dynamic Query function

Since you want all of the conditions to hold, they combine with AND, so we can apply the filters one by one.

import pandas as pd

def sub_df(dx, cuts):
    for cx in cuts:
        col = cx[0]
        minval = cx[1]
        maxval = cx[2]
        dx = dx[(dx[col] >= minval) & (dx[col] <= maxval)]
        # or equivalently:
        # dx = dx[dx[col].between(minval, maxval)]
    return dx


df = pd.DataFrame( {"A": [100, 200, 300, 400],"B": [10,20,30,40],
"C": [200, 400, 600, 800],"D": [20,40,60,80],
"E": [150, 300, 450, 600],"F": [15,30,45,60],
"G": [500, 600, 700, 800],"H": [50,60,70,80]})

print (df)

cutoffs = [('A',150, 350),('G',650, 750)]
df1 = sub_df(df,cutoffs)
print (df1)

cutoffs = [('B',10, 30),('C',50, 350),('F',10, 50)]
df1 = sub_df(df,cutoffs)
print (df1)

cutoffs = [('B',10, 30),('D',50, 100),('H',10, 50)]
df1 = sub_df(df,cutoffs)
print (df1)

Outputs for these are as follows:

Original DataFrame:

     A   B    C   D    E   F    G   H
0  100  10  200  20  150  15  500  50
1  200  20  400  40  300  30  600  60
2  300  30  600  60  450  45  700  70
3  400  40  800  80  600  60  800  80

Results for condition 1: [('A',150, 350),('G',650, 750)]

     A   B    C   D    E   F    G   H
2  300  30  600  60  450  45  700  70

Results for condition 2: [('B',10, 30),('C',50, 350),('F',10, 50)]

     A   B    C   D    E   F    G   H
0  100  10  200  20  150  15  500  50

Results for condition 3: [('B',10, 30),('D',50, 100),('H',10, 50)]

Empty DataFrame
Columns: [A, B, C, D, E, F, G, H]
Index: []
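The same filtering can also be sketched without the loop by building one mask per cutoff and AND-ing them all at once with np.logical_and.reduce (a variant under the same assumptions, not from the original answer):

```python
import numpy as np
import pandas as pd

def sub_df_vec(dx, cuts):
    # One boolean between-mask per (column, min, max) tuple, combined with AND
    masks = [dx[col].between(lo, hi) for col, lo, hi in cuts]
    return dx[np.logical_and.reduce(masks)]

df = pd.DataFrame({"A": [100, 200, 300, 400], "G": [500, 600, 700, 800]})
print(sub_df_vec(df, [("A", 150, 350), ("G", 650, 750)]))  # row 2 only
```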

Prev Answer

I think you are looking for this:

import pandas as pd

def sub_df(dx, tup_vals):
    return dx[(dx[tup_vals[0]] >= tup_vals[1]) & (dx[tup_vals[0]] <= tup_vals[2])]

Here dx is the DataFrame passed to the function, and tup_vals is a (colname, min, max) tuple.

Example of usage of this function:

df = pd.DataFrame( {"A": [200, 400, 600, 800],"B": [10,20,30,40]})

print (df)

tups = ('A',300, 700)
df1 = sub_df(df,tups)
print (df1)

Output of this will be:

Original DF:

     A   B
0  200  10
1  400  20
2  600  30
3  800  40

Returned DF: (values in col A between 300 and 700)

     A   B
1  400  20
2  600  30

