How do you filter pandas dataframes by multiple columns
Use the & operator, and don't forget to wrap each sub-condition in parentheses:
males = df[(df['Gender']=='Male') & (df['Year']==2014)]
To store the filtered DataFrames in a dict using a for loop:
dic = {}
for g in ['male', 'female']:
    dic[g] = {}
    for y in [2013, 2014]:
        # store the filtered DataFrames in a dict of dicts, keyed by gender and year
        dic[g][y] = df[(df['Gender']==g) & (df['Year']==y)]
EDIT:
A demo for your getDF:
def getDF(dic, gender, year):
    return dic[gender][year]

print(getDF(dic, 'male', 2014))
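To check the whole thing end to end, here is a hypothetical toy df (column names and values invented for illustration), together with a dict-comprehension equivalent of the loop above:
import pandas as pd

# toy data with the assumed 'Gender' and 'Year' columns
df = pd.DataFrame({'Gender': ['male', 'female', 'male', 'female'],
                   'Year':   [2013, 2014, 2014, 2013],
                   'Score':  [10, 20, 30, 40]})

# dict-comprehension form of the loop above
dic = {g: {y: df[(df['Gender'] == g) & (df['Year'] == y)]
           for y in [2013, 2014]}
       for g in ['male', 'female']}

print(getDF(dic, 'male', 2014))   # the single row with Gender='male', Year=2014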
Filter data based on multiple columns in second dataframe python
You could use multiple isin calls and chain them with the & operator. Since final_gps can be either gps1 or gps2, those two checks are combined with the | operator inside parentheses:
out = (df1[df1['date'].isin(df2['date']) &
           df1['agent_id'].isin(df2['agent_id']) &
           (df1['final_gps'].isin(df2['gps1']) | df1['final_gps'].isin(df2['gps2']))]
       .reset_index(drop=True))
Output:
date agent_id final_gps ….
0 14-02-2020 12abc (1, 2) …
1 14-02-2020 12abc (7, 6) …
2 14-02-2020 12abc (3, 4) …
3 14-02-2020 33bcd (6, 7) …
4 14-02-2020 33bcd (8, 9) …
5 20-02-2020 12abc (3, 5) …
6 20-02-2020 12abc (3, 1) …
7 20-02-2020 44hgf (1, 6) …
8 20-02-2020 44hgf (3, 7) …
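For reference, here is an invented miniature version of the two inputs that the expression above can be run against (the df2 columns are assumptions inferred from the answer):
import pandas as pd

df1 = pd.DataFrame({'date':      ['14-02-2020', '14-02-2020', '15-02-2020'],
                    'agent_id':  ['12abc', '33bcd', '12abc'],
                    'final_gps': [(1, 2), (6, 7), (9, 9)]})

df2 = pd.DataFrame({'date':     ['14-02-2020', '14-02-2020'],
                    'agent_id': ['12abc', '33bcd'],
                    'gps1':     [(1, 2), (6, 7)],
                    'gps2':     [(7, 6), (8, 9)]})
# the filter keeps df1 rows 0 and 1; row 2's date never occurs in df2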
How to properly filter multiple columns in Pandas?
You can use apply to check all columns at once: test each value against 0, and keep a row only if all of its values are non-zero.
result = df.drop(["Outcome"], axis=1).apply(lambda x: x != 0, axis=0).all(axis=1)
df[result]
Alternative solution without using apply:
# determine for each cell whether it is zero
matches = df.drop(["Outcome"], axis=1) == 0
# the row sum counts the zero values in each row;
# rows with no zero values therefore sum to 0
relevant_rows = matches.sum(axis=1) == 0
# subset just those rows with a row sum of 0
df.loc[relevant_rows, :]
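Both variants can also be collapsed into a single line; an equivalent sketch, keeping the assumed "Outcome" column name:
# keep rows where every column except "Outcome" is non-zero
df.loc[(df.drop(columns=["Outcome"]) != 0).all(axis=1)]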
Query or filter pandas dataframe on multiple columns and cell values
Here is a more concise approach:
- Filter the Neighbor-like columns with DataFrame.filter
- Create a boolean mask with DataFrame.isin to check whether each element is contained in the state column of not_treated
- Reduce the boolean mask along the columns axis with any
cols = treated_states.filter(like='Neigh')
# pass a list: isin() with a raw Series would align on the index instead of testing membership
mask = cols.isin(not_treated.state.tolist()).any(axis=1)
>>> treated_states[mask]
year state BC_law n_ipo Neighbor1 Neighbor2 Neighbor3 Treated
0 1980 AZ 1999 100 CA AK WV 1
1 1999 AZ 1999 50 CA AK WV 1
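As a runnable illustration, here are invented miniature inputs matching the schema above:
import pandas as pd

treated_states = pd.DataFrame({
    'year':      [1980, 1999, 2005],
    'state':     ['AZ', 'AZ', 'TX'],
    'BC_law':    [1999, 1999, 2005],
    'n_ipo':     [100, 50, 75],
    'Neighbor1': ['CA', 'CA', 'OK'],
    'Neighbor2': ['AK', 'AK', 'LA'],
    'Neighbor3': ['WV', 'WV', 'NM'],
    'Treated':   [1, 1, 1],
})
not_treated = pd.DataFrame({'state': ['CA', 'WV', 'NY']})

cols = treated_states.filter(like='Neigh')
mask = cols.isin(not_treated.state.tolist()).any(axis=1)
print(treated_states[mask])   # rows 0 and 1: their neighbours include CA/WV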
How to filter pandas dataframes by multiple columns and conditions
Using .gt(), .ge(), .le(), .eq(), etc. can save a lot of headache when it comes to getting all your parentheses correct, as can formatting your code across multiple lines for clarity:
mask = (df['Left In Stock'].gt(10)
        & df['Price (£)'].ge(10)
        & df['Weight(g)'].le(700)
        & ~(df['Title'].str.contains('the raven|on the')
            & df['Colour'].eq('White')))
out = df.loc[mask]
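For comparison, the same mask built from raw comparison operators needs an extra pair of parentheses around every term, which is exactly the headache the method calls avoid:
mask = ((df['Left In Stock'] > 10)
        & (df['Price (£)'] >= 10)
        & (df['Weight(g)'] <= 700)
        & ~(df['Title'].str.contains('the raven|on the')
            & (df['Colour'] == 'White')))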
Filter pandas dataframe by multiple columns, using tuple from list of tuples
TL;DR: use df[df[["A","B"]].apply(tuple, axis=1) == AB_col[0]].
I think you might be overthinking the matter. Let's dissect the code a bit:
df[["A","B"]].apply(tuple, 1) # or: df[["A","B"]].apply(tuple, axis=1)
# meaning: create tuples for each row
0 (0, 230)
1 (20, 192)
2 (50, 90)
dtype: object
So this just gets us A and B as tuples. Next, applying isin is a way to check whether each of these tuples exists inside your list AB_col. It's a consecutive evaluation:
(0, 230) in AB_col # True
(20, 192) in AB_col # True
(50, 90) in AB_col # False
# no different than:
1 in [1,2,3] # True
The resulting boolean Series is then used to select rows from the df. Hence the result (the first two rows, not the third):
account A B C
0 Jones LLC 0 230 140
1 Alpha Co 20 192 215
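For completeness, the membership test described above written out in full: it keeps every row whose (A, B) pair occurs anywhere in AB_col:
df[df[["A", "B"]].apply(tuple, axis=1).isin(AB_col)]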
All you want to do is return the rows that match the first element of the list AB_col:
I want to only get one record back, that matches the first tuple in the list of tuples.
So, that's easy enough. We simply need ==:
first_elem = AB_col[0]
df[df[["A","B"]].apply(tuple, axis=1) == first_elem]
account A B C
0 Jones LLC 0 230 140
Filter dataframe based on 2 columns
Try keeping a row if its city is not Vienna or its Flow is True:
>>> df[df["city"].ne("Vienna") | df["Flow"]]
city Flow
0 Berlin False
1 Berlin True
3 Vienna True
5 Frankfurt True
6 Frankfurt False
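The mask keeps a row when its city is not Vienna or its Flow is True. By De Morgan's law that is the same as dropping rows that are both Vienna and False; a sketch with the input reconstructed from the output above:
import pandas as pd

df = pd.DataFrame({'city': ['Berlin', 'Berlin', 'Vienna', 'Vienna',
                            'Vienna', 'Frankfurt', 'Frankfurt'],
                   'Flow': [False, True, False, True, False, True, False]})

# equivalent filter: drop the Vienna rows whose Flow is False (rows 2 and 4)
out = df[~(df["city"].eq("Vienna") & ~df["Flow"])]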
Filter rows based on multiple columns entries
Alternative 1: pd.DataFrame.query()
You could work with query() (the pandas documentation has illustrative examples):
expr = "Chr=={chr} & Start=={pos} & Alt=='{alt}'"
ret = df.query(expr.format(chr=chrom, pos=int(position), alt=allele))
In my experiments, this led already to a considerable speedup.
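With the query point used later (chrom=1, position=142, allele='A'), the formatted expression becomes "Chr==1 & Start==142 & Alt=='A'". If you prefer to avoid string formatting, query() can also reference local variables directly via the @ prefix:
ret = df.query("Chr == @chrom & Start == @position & Alt == @allele")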
Optimizing this further requires additional information about the data types involved. There are several things you could try:
Alternative 2: Query sorted data
If you can afford to sort your DataFrame prior to querying, you can use pd.Series.searchsorted()
. Here is a possible approach:
def query_sorted(df, chrom, position, allele):
    """
    Returns index of the matches.
    """
    assert df["Start"].is_monotonic_increasing
    i_min, i_max = df["Start"].searchsorted([position, position + 1])
    df = df.iloc[i_min:i_max]
    return df[(df["Chr"] == chrom) & (df["Alt"] == allele)].index
# Usage: first sort df by column "Start", then query:
df = df.sort_values("Start")
ret_index = query_sorted(df, chrom, position, allele)
print(len(ret_index))
Alternative 3: Use hashes
Another idea would be to use hashes. Again, this requires some calculations up front, but it speeds up the query considerably. Here is an example based on pd.util.hash_pandas_object()
:
def query_hash(df, chrom, position, allele):
    """
    Returns index of the matches.
    """
    assert "hash" in df
    dummy = pd.DataFrame([[chrom, position, allele]])
    target_hash = pd.util.hash_pandas_object(dummy, index=False).squeeze()
    return df[df["hash"] == target_hash].index
# Usage: first compute hashes over the columns of interest, then query
df["hash"] = pd.util.hash_pandas_object(df[["Chr", "Start", "Alt"]],
index=False)
ret_index = query_hash(df, chrom, position, allele)
print(len(ret_index))
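Distinct rows can in principle collide on the same hash, so a cautious variant re-checks the candidates exactly; a defensive sketch built on query_hash above (the function name is hypothetical):
def query_hash_safe(df, chrom, position, allele):
    """
    Hash lookup followed by an exact re-check of the candidate rows.
    """
    candidates = df.loc[query_hash(df, chrom, position, allele)]
    exact = candidates[(candidates["Chr"] == chrom)
                       & (candidates["Start"] == position)
                       & (candidates["Alt"] == allele)]
    return exact.index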
Alternative 4: Use a multi-index
Pandas also operates with hashes when accessing rows via the index. Thus, instead of calculating hashes explicitly, as in the previous alternative, one could simply set the index of the DataFrame prior to querying. (Since setting all columns as index would result in an empty DataFrame, I first create a dummy column. For a real DataFrame with additional columns this will probably not be necessary.)
df["dummy"] = None
df = df.set_index(["Chr", "Start", "Alt"])
df = df.sort_index() # Improves performance
print(len(df.loc[(chrom, position, allele)]))
# Interestingly, chaining .loc[] is about twice as fast
print(len(df.loc[chrom].loc[position].loc[allele]))
Note that using an index where one index value maps to many records is not always a good idea. Also, this approach is slower than alternative 3, indicating that Pandas does some extra work here.
There are certainly many more ways to improve this, though the alternative approaches will depend on your specific needs.
Results
I tested with n=10M samples on a MacBook Pro (Mid 2015), running Python 3.8, Pandas 1.2.4 and IPython 7.24.1. Note that the performance evaluation depends on the problem size. The relative assessment of the methods therefore will change for different problem sizes.
# original (sum(s)): 1642.0 ms ± 19.1 ms
# original (s.sum()): 639.0 ms ± 21.9 ms
# query(): 175.0 ms ± 1.1 ms
# query_sorted(): 17.5 ms ± 60.4 µs
# query-hash(): 10.6 ms ± 62.5 µs
# multi-index: 71.5 ms ± 0.7 ms
# multi-index (seq.): 36.5 ms ± 0.6 ms
Implementation
This is how I constructed the data and compared the different approaches.
import numpy as np
import pandas as pd
# Create test data
n = int(10*1e6)
df = pd.DataFrame({"Chr": np.random.randint(1,23+1,n),
"Start": np.random.randint(100,999, n),
"Alt": np.random.choice(list("ACTG"), n)})
# Query point
chrom, position, allele = 1, 142, "A"
# Measure performance in IPython
print("original (sum(s)):")
%timeit sum((df["Chr"] == chrom) & \
(df["Start"] == int(position)) & \
(df["Alt"] == allele))
print("original (s.sum()):")
%timeit ((df["Chr"] == chrom) & \
(df["Start"] == int(position)) & \
(df["Alt"] == allele)).sum()
print("query():")
%timeit len(df.query(expr.format(chr=chrom, \
pos=position, \
alt=allele)))
print("query_sorted():")
df_sorted = df.sort_values("Start")
%timeit query_sorted(df_sorted, chrom, position, allele)
print("query-hash():")
df_hash = df.copy()
df_hash["hash"] = pd.util.hash_pandas_object(df_hash[["Chr", "Start", "Alt"]],
index=False)
%timeit query_hash(df_hash, chrom, position, allele)
print("multi-index:")
df_multi = df.copy()
df_multi["dummy"] = None
df_multi = df_multi.set_index(["Chr", "Start", "Alt"]).sort_index()
%timeit df_multi.loc[(chrom, position, allele)]
print("multi-index (seq.):")
%timeit len(df_multi.loc[chrom].loc[position].loc[allele])