How to Filter Pandas Dataframes by Multiple Columns

How do you filter pandas dataframes by multiple columns

Use the & operator, and don't forget to wrap each sub-condition in parentheses, since & binds more tightly than comparison operators like ==:

males = df[(df['Gender'] == 'Male') & (df['Year'] == 2014)]
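
A minimal runnable sketch (the DataFrame here is hypothetical, just to make the example self-contained):

import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male'],
                   'Year': [2014, 2014, 2013]})
males = df[(df['Gender'] == 'Male') & (df['Year'] == 2014)]
print(males)  # only the first row satisfies both conditions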

To store the filtered DataFrames in a dict of dicts, use a for loop:

from collections import defaultdict

dic = defaultdict(dict)
for g in ['male', 'female']:
    for y in [2013, 2014]:
        # store each filtered DataFrame in a dict of dicts
        dic[g][y] = df[(df['Gender'] == g) & (df['Year'] == y)]

EDIT:

A demo for your getDF:

def getDF(dic, gender, year):
    return dic[gender][year]

print(getDF(dic, 'male', 2014))

Filter data based on multiple columns in second dataframe python

You could use multiple isin checks and chain them with the & operator. Since final_gps can be either gps1 or gps2, the two checks are combined with the | operator inside parentheses:

out = (df1[df1['date'].isin(df2['date']) &
           df1['agent_id'].isin(df2['agent_id']) &
           (df1['final_gps'].isin(df2['gps1']) | df1['final_gps'].isin(df2['gps2']))]
       .reset_index(drop=True))

Output:

         date agent_id final_gps  …
0  14-02-2020    12abc    (1, 2)  …
1  14-02-2020    12abc    (7, 6)  …
2  14-02-2020    12abc    (3, 4)  …
3  14-02-2020    33bcd    (6, 7)  …
4  14-02-2020    33bcd    (8, 9)  …
5  20-02-2020    12abc    (3, 5)  …
6  20-02-2020    12abc    (3, 1)  …
7  20-02-2020    44hgf    (1, 6)  …
8  20-02-2020    44hgf    (3, 7)  …
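
For a self-contained illustration, here is the same chain on two small hypothetical frames (the column values are made up):

import pandas as pd

df1 = pd.DataFrame({'date': ['14-02-2020', '14-02-2020', '15-02-2020'],
                    'agent_id': ['12abc', '99zzz', '12abc'],
                    'final_gps': [(1, 2), (3, 4), (5, 6)]})
df2 = pd.DataFrame({'date': ['14-02-2020'],
                    'agent_id': ['12abc'],
                    'gps1': [(1, 2)],
                    'gps2': [(7, 6)]})

out = (df1[df1['date'].isin(df2['date']) &
           df1['agent_id'].isin(df2['agent_id']) &
           (df1['final_gps'].isin(df2['gps1']) | df1['final_gps'].isin(df2['gps2']))]
       .reset_index(drop=True))
# only the first row of df1 matches on date, agent_id and final_gps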

How to properly filter multiple columns in Pandas?

You can build the mask over all columns at once with apply: check each value for being non-zero, and keep a row only if that holds for every column:

result = df.drop(["Outcome"], axis=1).apply(lambda x: x != 0, axis=0).all(axis=1)
df[result]

Alternative solution without using apply:

# determine for each cell whether it is zero
matches = df.drop(["Outcome"], axis=1) == 0

# build row sums: they count the number of zero values per row.
# if there are no zero values in a row, the row sum is 0.
# find all rows with a row sum of 0
relevant_rows = matches.sum(axis=1) == 0

# subset just those rows with rowsum == 0
df.loc[relevant_rows, :]
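
Both variants reduce to the same boolean test, so it can also be written in one line (under the same assumption that "Outcome" is the only column to exclude):

df[(df.drop(columns=["Outcome"]) != 0).all(axis=1)]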

Query or filter pandas dataframe on multiple columns and cell values

Here is a more concise approach:

  • Filter the Neighbor-like columns with DataFrame.filter
  • Create a boolean mask with DataFrame.isin to check whether each element is contained in the state column of not_treated; the column is passed as a list because DataFrame.isin aligns on the index when given a Series
  • Reduce the boolean mask along the columns axis with any

cols = treated_states.filter(like='Neigh')
mask = cols.isin(not_treated['state'].tolist()).any(axis=1)


>>> treated_states[mask]

   year state  BC_law  n_ipo Neighbor1 Neighbor2 Neighbor3  Treated
0  1980    AZ    1999    100        CA        AK        WV        1
1  1999    AZ    1999     50        CA        AK        WV        1

How to filter pandas dataframes by multiple columns and conditions

Using the comparison methods .gt(), .le(), .ne(), .eq(), etc. can save a lot of headache when it comes to getting all your parentheses correct. Formatting the expression across multiple lines also helps clarity:

mask = (df['Left In Stock'].gt(10)
        & df['Price (£)'].ge(10)
        & df['Weight(g)'].le(700)
        & ~(df['Title'].str.contains('the raven|on the')
            & df['Colour'].eq('White')))

out = df.loc[mask]
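
For comparison, the same mask written with bare operators needs parentheses around every term, because & and ~ bind more tightly than the comparison operators:

mask = ((df['Left In Stock'] > 10)
        & (df['Price (£)'] >= 10)
        & (df['Weight(g)'] <= 700)
        & ~(df['Title'].str.contains('the raven|on the')
            & (df['Colour'] == 'White')))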

Filter pandas dataframe by multiple columns, using tuple from list of tuples

TL;DR: use df[df[["A","B"]].apply(tuple, 1) == AB_col[0]].


I think you might be overthinking the matter. Let's dissect the code a bit:

df[["A","B"]].apply(tuple, 1) # or: df[["A","B"]].apply(tuple, axis=1)
# meaning: create tuples for each row

0     (0, 230)
1    (20, 192)
2     (50, 90)
dtype: object

So this just gets us A and B as tuples. Next, .isin checks whether each of these tuples exists inside your list AB_col. It evaluates them one by one:

(0, 230) in AB_col # True
(20, 192) in AB_col # True
(50, 90) in AB_col # False

# no different than:
1 in [1,2,3] # True
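
Spelled out as the full call (assuming AB_col is the list of tuples from the question):

mask = df[["A", "B"]].apply(tuple, axis=1).isin(AB_col)

0     True
1     True
2    False
dtype: bool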

The resulting Series with booleans is then used to select from the df. Hence, the result (first two rows, not the third):

     account   A    B    C
0  Jones LLC   0  230  140
1   Alpha Co  20  192  215

All you want to do is return the rows that match the first element of the list AB_col:

I want to only get one record back, that matches the first tuple in the list of tuples.

So, that's easy enough. We simply need ==:

first_elem = AB_col[0]

df[df[["A","B"]].apply(tuple, 1) == first_elem]

     account  A    B    C
0  Jones LLC  0  230  140

Filter dataframe based on 2 columns

Try:

>>> df[df["city"].ne("Vienna")|df["Flow"]]
city Flow
0 Berlin False
1 Berlin True
3 Vienna True
5 Frankfurt True
6 Frankfurt False
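
By De Morgan's laws, the mask keeps everything except rows that are Vienna and have Flow False, so an equivalent (and perhaps more readable) spelling is:

df[~(df["city"].eq("Vienna") & ~df["Flow"])]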

Filter rows based on multiple columns entries

Alternative 1: pd.DataFrame.query()

You could work with query (the pandas user guide has illustrative examples):

expr = "Chr=={chr} & Start=={pos} & Alt=='{alt}'"
ret = df.query(expr.format(chr=chrom, pos=int(position), alt=allele))

In my experiments, this alone already led to a considerable speedup.

Optimizing this further requires additional information about the data types involved. There are several things you could try:

Alternative 2: Query sorted data

If you can afford to sort your DataFrame prior to querying, you can use pd.Series.searchsorted(). Here is a possible approach:

def query_sorted(df, chrom, position, allele):
    """Return the index of the matches."""
    assert df["Start"].is_monotonic_increasing
    i_min, i_max = df["Start"].searchsorted([position, position + 1])
    df = df.iloc[i_min:i_max]
    return df[(df["Chr"] == chrom) & (df["Alt"] == allele)].index

# Usage: first sort df by column "Start", then query:
df = df.sort_values("Start")
ret_index = query_sorted(df, chrom, position, allele)
print(len(ret_index))

Alternative 3: Use hashes

Another idea would be to use hashes. Again, this requires some calculations up front, but it speeds up the query considerably. Here is an example based on pd.util.hash_pandas_object():

def query_hash(df, chrom, position, allele):
    """Return the index of the matches (expects a precomputed "hash" column)."""
    assert "hash" in df
    dummy = pd.DataFrame([[chrom, position, allele]])
    query_hash = pd.util.hash_pandas_object(dummy, index=False).squeeze()
    return df[df["hash"] == query_hash].index

# Usage: first compute hashes over the columns of interest, then query
df["hash"] = pd.util.hash_pandas_object(df[["Chr", "Start", "Alt"]],
index=False)
ret_index = query_hash(df, chrom, position, allele)
print(len(ret_index))

Alternative 4: Use a multi-index

Pandas also operates with hashes when accessing rows via the index. Thus, instead of calculating hashes explicitly, as in the previous alternative, one could simply set the index of the DataFrame prior to querying. (Since setting all columns as index would result in an empty DataFrame, I first create a dummy column. For a real DataFrame with additional columns this will probably not be necessary.)

df["dummy"] = None
df = df.set_index(["Chr", "Start", "Alt"])
df = df.sort_index() # Improves performance
print(len(df.loc[(chrom, position, allele)])
# Interestingly, chaining .loc[] is about twice as fast
print(len(df.loc[chrom].loc[position].loc[allele]))

Note that using an index where one index value maps to many records is not always a good idea. Also, this approach is slower than alternative 3, indicating that Pandas does some extra work here.

There are certainly many more ways to improve this; which alternative works best will depend on your specific needs.

Results

I tested with n=10M samples on a MacBook Pro (Mid 2015), running Python 3.8, pandas 1.2.4 and IPython 7.24.1. Note that the results depend on the problem size, so the relative ranking of the methods can change for other problem sizes.

# original (sum(s)):  1642.0 ms ± 19.1 ms
# original (s.sum()):  639.0 ms ± 21.9 ms
# query():             175.0 ms ±  1.1 ms
# query_sorted():       17.5 ms ± 60.4 µs
# query_hash():         10.6 ms ± 62.5 µs
# multi-index:          71.5 ms ±  0.7 ms
# multi-index (seq.):   36.5 ms ±  0.6 ms


Implementation

This is how I constructed the data and compared the different approaches.

import numpy as np 
import pandas as pd

# Create test data
n = int(10*1e6)
df = pd.DataFrame({"Chr": np.random.randint(1, 23 + 1, n),
                   "Start": np.random.randint(100, 999, n),
                   "Alt": np.random.choice(list("ACTG"), n)})

# Query point
chrom, position, allele = 1, 142, "A"


# Measure performance in IPython
print("original (sum(s)):")
%timeit sum((df["Chr"] == chrom) & \
(df["Start"] == int(position)) & \
(df["Alt"] == allele))

print("original (s.sum()):")
%timeit ((df["Chr"] == chrom) & \
(df["Start"] == int(position)) & \
(df["Alt"] == allele)).sum()

print("query():")
%timeit len(df.query(expr.format(chr=chrom, \
pos=position, \
alt=allele)))

print("query_sorted():")
df_sorted = df.sort_values("Start")
%timeit query_sorted(df_sorted, chrom, position, allele)

print("query-hash():")
df_hash = df.copy()
df_hash["hash"] = pd.util.hash_pandas_object(df_hash[["Chr", "Start", "Alt"]],
index=False)
%timeit query_hash(df_hash, chrom, position, allele)

print("multi-index:")
df_multi = df.copy()
df_multi["dummy"] = None
df_multi = df_multi.set_index(["Chr", "Start", "Alt"]).sort_index()
%timeit df_multi.loc[(chrom, position, allele)]
print("multi-index (seq.):")
%timeit len(df_multi.loc[chrom].loc[position].loc[allele])

