Deleting a DataFrame row in Pandas if a combination of column values equals a tuple in a list
Say you have
removal_list = [(item1,store1),(item2,store1),(item2,store2)]
Then
df[['column_1', 'column_2']].apply(tuple, axis=1)
should create a Series of tuples, and so
df[['column_1', 'column_2']].apply(tuple, axis=1).isin(removal_list)
is the binary condition you're after. Removal is the same as you did before. This should work for any number of columns.
Example
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df[['a', 'b']].apply(tuple, axis=1).isin([(1, 3), (30, 40)])
0     True
1    False
dtype: bool
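Putting it together, here is a minimal sketch of the full removal step (column and item names are illustrative, not from the original question):

```python
import pandas as pd

df = pd.DataFrame({'column_1': ['item1', 'item2', 'item2', 'item3'],
                   'column_2': ['store1', 'store1', 'store2', 'store1']})
removal_list = [('item1', 'store1'), ('item2', 'store1'), ('item2', 'store2')]

# Build the boolean mask of rows to remove, then keep the complement
mask = df[['column_1', 'column_2']].apply(tuple, axis=1).isin(removal_list)
result = df[~mask]
print(result)  # only the ('item3', 'store1') row remains
```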
Compare list of tuples against column in dataframe
You can make a DataFrame out of the list of tuples, like this: df = pd.DataFrame(list_of_tuples, columns=['id', 'pattern_id'])
and then join it with the main dataframe, like this: joined = main_df.merge(df, on='id', how='inner')
The pattern_id column is included in joined for rows with a matching id.
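A small runnable sketch of this merge approach (the sample data is illustrative, only the id and pattern_id column names come from the answer above):

```python
import pandas as pd

main_df = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
list_of_tuples = [(1, 'p1'), (3, 'p3')]

# Turn the tuple list into a DataFrame, then inner-join on the shared key
df = pd.DataFrame(list_of_tuples, columns=['id', 'pattern_id'])
joined = main_df.merge(df, on='id', how='inner')
print(joined)  # rows with id 1 and 3, each carrying its pattern_id
```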
Check whether tuple column in pandas contains some value from a list
You can use set.intersection with Series.map, then cast the result with astype(bool) (a non-empty intersection is truthy, an empty one is falsy):
code = set(codes)
df.b.map(code.intersection).astype(bool)
0 True
1 False
Name: b, dtype: bool
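A self-contained version of this trick, with sample data chosen to reproduce the True/False output above (the actual codes and column contents in the question differ):

```python
import pandas as pd

codes = [6, 9]
df = pd.DataFrame({'b': [(1, 6, 3), (2, 4, 5)]})

code = set(codes)
# map each tuple to its intersection with `code`; a non-empty set casts to True
mask = df.b.map(code.intersection).astype(bool)
print(mask)
```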
Timeit analysis
# setup (assumes `import numpy as np`, `import pandas as pd`,
# and a `codes` collection defined as in the question)
o = [np.random.randint(0, 10, (3,)) for _ in range(10_000)]
len(o)
# 10000
s = pd.Series(o)
s
0 [6, 2, 5]
1 [7, 4, 0]
2 [1, 8, 2]
3 [4, 8, 9]
4 [7, 3, 4]
...
9995 [3, 9, 4]
9996 [6, 2, 9]
9997 [2, 0, 5]
9998 [5, 0, 7]
9999 [7, 4, 2]
Length: 10000, dtype: object
# Adam's answer
In [38]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
19.1 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#komatiraju's answer
In [39]: %timeit s.apply(lambda x: any(val in x for val in codes))
83.8 ms ± 974 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#My answer
In [42]: %%timeit
...: code = set(codes)
...: s.map(code.intersection).astype(bool)
...:
...:
15.5 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#wwnde's answer
In [74]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
19.5 ms ± 372 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For Series
of size 1 million
bigger_o = np.repeat(o,100,axis=0)
bigger_o.shape
# (1000000, 3)
s = pd.Series((list(bigger_o)))
s
0 [6, 2, 5]
1 [6, 2, 5]
2 [6, 2, 5]
3 [6, 2, 5]
4 [6, 2, 5]
...
999995 [7, 4, 2]
999996 [7, 4, 2]
999997 [7, 4, 2]
999998 [7, 4, 2]
999999 [7, 4, 2]
Length: 1000000, dtype: object
In [54]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
1.89 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [55]: %timeit s.apply(lambda x: any(val in x for val in codes))
8.9 s ± 652 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [56]: %%timeit
...: code = set(codes)
...: s.map(code.intersection).astype(bool)
...:
...:
1.54 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [79]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
1.95 s ± 88.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
how do you filter a Pandas dataframe by a multi-column set?
You can use Series.isin
, but first it is necessary to create tuples from the first 3 columns:
print (df[df[['a','b','c']].apply(tuple, axis=1).isin(value_set)])
Or convert columns to index and use Index.isin
:
print (df[df.set_index(['a','b','c']).index.isin(value_set)])
a b c d
0 1 2 3 not
1 1 2 3 relevant
Another idea is to use an inner join of DataFrame.merge
with a helper DataFrame
that has the same 3 column names; the on
parameter can then be omitted, because the join defaults to the intersection of the column names of both DataFrames:
print (df.merge(pd.DataFrame(value_set, columns=['a','b','c'])))
a b c d
0 1 2 3 not
1 1 2 3 relevant
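For completeness, a runnable sketch of the Index.isin variant (the sample data is a guess matching the output shown above):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1], 'b': [2, 2], 'c': [3, 3],
                   'd': ['not', 'relevant']})
value_set = {(1, 2, 3)}

# Build a MultiIndex from the three key columns, then test membership
out = df[df.set_index(['a', 'b', 'c']).index.isin(value_set)]
print(out)
```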
Filter Pandas dataframe based on combination of two columns
Use -
df[df[['a', 'b']].apply(tuple, axis=1).isin([(1,2), (4,3)])]
Output
a b
0 1 2
3 4 3
Explanation
df[['a', 'b']].apply(tuple, axis=1)
gives a series of tuples -
0 (1, 2)
1 (2, 3)
2 (4, 2)
3 (4, 3)
.isin([(1,2), (4,3)])
searches for the desired tuples and gives a boolean series
Remove rows with empty lists from pandas data frame
You could convert the column to its string representation and compare against the literal '[]':
import pandas as pd
df = pd.DataFrame({
'donation_orgs' : [[], ['the research of Dr.']],
'donation_context': [[], ['In lieu of flowers , memorial donations']]})
df[df.astype(str)['donation_orgs'] != '[]']
Out[9]:
donation_context donation_orgs
1 [In lieu of flowers , memorial donations] [the research of Dr.]
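An alternative that avoids the string conversion: empty lists are falsy in Python, so mapping bool over the column gives the mask directly (a sketch using the same sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'donation_orgs': [[], ['the research of Dr.']],
    'donation_context': [[], ['In lieu of flowers , memorial donations']]})

# bool([]) is False, bool(non-empty list) is True
out = df[df['donation_orgs'].map(bool)]
print(out)  # only row 1 survives
```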
Function to select pandas dataframe rows based on list of tuples of columns and cutoffs?
Dynamic Query function
Since you want all the conditions to hold, they are combined with AND, so we can apply them as filters one by one.
import pandas as pd
def sub_df(dx, cuts):
    for cx in cuts:
        col, minval, maxval = cx
        dx = dx[(dx[col] >= minval) & (dx[col] <= maxval)]
        # or equivalently:
        # dx = dx[dx[col].between(minval, maxval)]
    return dx
df = pd.DataFrame( {"A": [100, 200, 300, 400],"B": [10,20,30,40],
"C": [200, 400, 600, 800],"D": [20,40,60,80],
"E": [150, 300, 450, 600],"F": [15,30,45,60],
"G": [500, 600, 700, 800],"H": [50,60,70,80]})
print (df)
cutoffs = [('A',150, 350),('G',650, 750)]
df1 = sub_df(df,cutoffs)
print (df1)
cutoffs = [('B',10, 30),('C',50, 350),('F',10, 50)]
df1 = sub_df(df,cutoffs)
print (df1)
cutoffs = [('B',10, 30),('D',50, 100),('H',10, 50)]
df1 = sub_df(df,cutoffs)
print (df1)
Outputs for these are as follows:
Original DataFrame:
A B C D E F G H
0 100 10 200 20 150 15 500 50
1 200 20 400 40 300 30 600 60
2 300 30 600 60 450 45 700 70
3 400 40 800 80 600 60 800 80
Results for condition 1: [('A',150, 350),('G',650, 750)]
A B C D E F G H
2 300 30 600 60 450 45 700 70
Results for condition 2: [('B',10, 30),('C',50, 350),('F',10, 50)]
A B C D E F G H
0 100 10 200 20 150 15 500 50
Results for condition 3: [('B',10, 30),('D',50, 100),('H',10, 50)]
Empty DataFrame
Columns: [A, B, C, D, E, F, G, H]
Index: []
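An equivalent, loop-free variant combines the per-column conditions with numpy's logical_and.reduce; this is a sketch under the same (column, min, max) cutoff format, not code from the original answer:

```python
import numpy as np
import pandas as pd

def sub_df(dx, cuts):
    # Build one between-mask per cutoff tuple and AND them all together
    masks = [dx[col].between(lo, hi) for col, lo, hi in cuts]
    return dx[np.logical_and.reduce(masks)]

df = pd.DataFrame({"A": [100, 200, 300, 400],
                   "G": [500, 600, 700, 800]})
print(sub_df(df, [('A', 150, 350), ('G', 650, 750)]))  # only row 2
```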
Prev Answer
I think you are looking for this:
import pandas as pd
def sub_df(dx, tup_vals):
    return dx[(dx[tup_vals[0]] >= tup_vals[1]) & (dx[tup_vals[0]] <= tup_vals[2])]
Here dx
is the dataframe passed to the function, and tup_vals
is a (colname, min, max) tuple.
Example of usage of this function:
df = pd.DataFrame( {"A": [200, 400, 600, 800],"B": [10,20,30,40]})
print (df)
tups = ('A',300, 700)
df1 = sub_df(df,tups)
print (df1)
Output of this will be:
Original DF:
A B
0 200 10
1 400 20
2 600 30
3 800 40
Returned DF: (values in col A between 300 and 700)
A B
1 400 20
2 600 30