pandas - filter dataframe by another dataframe by row elements
You can do this efficiently using isin on a MultiIndex constructed from the desired columns:
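import pandas as pd
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})
# build an index from the key columns of each frame
keys = list(df2.columns.values)
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index
# keep the rows of df1 whose key combination does not appear in df2
df1[~i1.isin(i2)]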
I think this improves on @IanS's similar solution because it doesn't assume any column type (i.e. it will work with numbers as well as strings).
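To illustrate the point about column types, here is a minimal sketch of the same idea wrapped in a function and applied to numeric keys. The function name anti_filter and the sample frames are made up for this example; they are not from the question.

```python
import pandas as pd

def anti_filter(left, right, keys):
    """Return the rows of `left` whose key combination is absent from `right`.

    Hypothetical helper: it only wraps the set_index/isin steps shown above.
    """
    left_keys = left.set_index(keys).index
    right_keys = right.set_index(keys).index
    return left[~left_keys.isin(right_keys)]

# Made-up sample data with integer key columns
left = pd.DataFrame({"a": [1, 1, 2, 3],
                     "b": [10, 20, 10, 30],
                     "v": ["w", "x", "y", "z"]})
right = pd.DataFrame({"a": [1, 3], "b": [20, 30]})

kept = anti_filter(left, right, ["a", "b"])
print(kept)
```

Dropping the ~ inside the function would give the opposite filter, keeping only the rows that do appear in the other frame.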
(Above answer is an edit. Following was my initial answer)
Interesting! This is something I haven't come across before... I would probably solve it by merging the two dataframes, then dropping the rows where df2 has a match. Here is an example, which makes use of a temporary dataframe:
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})
# create a column marking df2 values
df2['marker'] = 1
# join the two, keeping all of df1's indices
joined = pd.merge(df1, df2, on=['c', 'l'], how='left')
joined
# extract desired columns where marker is NaN
joined[pd.isnull(joined['marker'])][df1.columns]
There may be a way to do this without the temporary dataframe, but I can't think of one. As long as your data isn't huge, the above method should be fast enough.
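One variation that avoids adding a marker column to df2 is to let merge record where each row came from, via its indicator parameter. This is only a sketch on the same sample data; it still builds an intermediate merged frame, and the name result is chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both' for every row of the result
merged = df1.merge(df2, on=['c', 'l'], how='left', indicator=True)

# rows that exist only in df1, restricted to df1's original columns
result = merged.loc[merged['_merge'] == 'left_only', df1.columns]
print(result)
```

With this approach df2 is left unmodified, which may matter if it is reused later.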
R: Filter a dataframe based on another dataframe
If you only want to keep the rows of e whose rownames occur in pf (or, to keep those that don't occur, use !rownames(e) %in% rownames(pf)), then you can just filter on the rownames:
library(tidyverse)
e %>%
filter(rownames(e) %in% rownames(pf))
Another possibility is to create a rownames column for both dataframes. Then, we can do the semi_join on the rownames (i.e., rn). Then, convert the rn column back to the rownames.
library(tidyverse)
list(e, pf) %>%
  map(~ .x %>%
        as.data.frame %>%
        rownames_to_column('rn')) %>%
  reduce(semi_join, by = 'rn') %>%
  column_to_rownames('rn')
Output
JHU_113_2.CEL JHU_144.CEL JHU_173.CEL JHU_176R.CEL JHU_182.CEL JHU_186.CEL JHU_187.CEL JHU_188.CEL JHU_203.CEL
2315374 6.28274 6.79161 6.11265 6.13997 6.68056 6.48156 6.45415 6.04542 5.99176
2315376 5.81678 5.71165 6.02794 5.37082 5.95527 5.75999 5.87863 5.54830 6.35571
2315587 8.88557 8.95699 8.36898 8.28993 8.41361 8.64980 8.74305 8.31915 8.43548
2315588 6.28650 6.66750 6.07503 6.76625 6.19819 6.84260 6.13916 6.40219 6.45059
2315591 6.97515 6.61705 6.51994 6.74982 6.60917 6.55182 6.62240 6.44394 5.76592
2315595 5.94179 5.39178 5.09497 4.96199 2.96431 4.95204 5.00979 4.06493 5.38048
2315598 4.99420 5.56888 5.57912 5.43960 5.19249 5.87991 5.60540 5.09513 5.43618
2315603 7.67845 7.90005 7.47594 6.75087 7.62805 8.00069 7.34296 6.81338 7.52014
2315604 6.20952 6.59687 6.14608 5.70518 6.49572 6.12622 6.23690 6.39569 6.70869
2315640 5.85307 6.07303 6.41875 6.07282 6.28283 6.13699 6.16377 6.48616 6.34162
How do I filter out rows based on another data frame in Python?
Here is a plug-and-play script. If this doesn't work on your real data, check that the two columns being compared have the same dtype.
import pandas as pd
df1 = pd.DataFrame(
    {"system": ["AIII", "CIII", "LV"], "Code": [423, 123, 142]}
)
df2 = pd.DataFrame(
    {"StatusMessage": [123], "Event": ["Gearbox warm up"]}
)
### This is what you need
df1 = df1[df1.Code.isin(df2.StatusMessage.unique())]
print(df1)
Filtering dataframe based on another dataframe
You can use .isin() to filter to the list of tickers available in df2.
df1_filtered = df1[df1['ticker'].isin(df2['ticker'].tolist())]
Filter record from one data frame based on column values in second data frame in python
IIUC, you're looking for a chained isin:
out = df1[df1['date'].isin(df2['date']) & df1['id'].isin(df2['id']) & (df1['log'].isin(df2['log1']) | df1['log'].isin(df2['log2']))]
Output:
date id log name col1 col2
0 1 uu1q (2,4) xyz 1123 qqq
1 1 uu1q (3,5) aas 2132 wew
2 1 uu1q (7,6) wqas 2567 uuo
3 5 u25a (4,7) enj 666 ttt
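One thing to keep in mind with a chained isin is that each column is tested on its own, so a row passes when its values appear anywhere in the other frame, not necessarily together in the same row. The sketch below contrasts that with a row-wise match done through an inner merge. The two small frames are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Made-up data: (1, 'y') never occurs as a pair in df2
df1 = pd.DataFrame({"date": [1, 1, 5], "id": ["x", "y", "y"]})
df2 = pd.DataFrame({"date": [1, 5], "id": ["x", "y"]})

# Column-by-column test: 1 is a known date and 'y' is a known id,
# so the row (1, 'y') passes even though that pair is not in df2
independent = df1[df1["date"].isin(df2["date"]) & df1["id"].isin(df2["id"])]

# Row-wise test: only pairs that occur together in df2 survive
rowwise = df1.merge(df2.drop_duplicates(), on=["date", "id"], how="inner")

print(len(independent), len(rowwise))
```

Which of the two behaviours is wanted depends on the question being asked of the data.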
Filtering the dataframe based on the column value of another dataframe
Update
You could use a simple mask:
m = df2.SKU.isin(df1.SKU)
df2 = df2[m]
You are looking for an inner join. Try this:
df3 = df1.merge(df2, on=['SKU','Sales'], how='inner')
# SKU Sales
#0 A 100
#1 B 200
#2 C 300
Or this:
df3 = df1.merge(df2, on='SKU', how='inner')
# SKU Sales_x Sales_y
#0 A 100 100
#1 B 200 200
#2 C 300 300
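The mask and the inner join are not always interchangeable: the mask only filters rows, while a merge can repeat rows when a key occurs more than once. A small sketch, assuming made-up data in which SKU A appears twice in df1:

```python
import pandas as pd

# Made-up data: 'A' is duplicated in df1, 'C' is missing from df1
df1 = pd.DataFrame({"SKU": ["A", "A", "B"], "Sales": [100, 150, 200]})
df2 = pd.DataFrame({"SKU": ["A", "B", "C"], "Sales": [100, 200, 300]})

# Mask: each row of df2 is kept at most once
masked = df2[df2["SKU"].isin(df1["SKU"])]

# Inner join: one output row per matching pair, so 'A' shows up twice
merged = df1.merge(df2, on="SKU", how="inner")

print(len(masked), len(merged))
```

If the goal is purely to filter df2, the mask is usually the safer choice; the merge is useful when columns from both frames are needed side by side.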
Filter one data frame based on other data frame in pandas
For anyone who is interested, I figured out a way to do it...
df3 = []
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1["Name"] == row2["Name"]:
            x = range(row1["start"], row1["stop"])
            x = set(x)
            y = range(row2["start"], row2["stop"])
            if len(x.intersection(y)) > 0:
                df3.append(row1)
df3 = pd.DataFrame(df3).reset_index(drop=True)
print(df3)
Name start stop
0 B 124 200
1 C 159 200
2 D 12 24
3 D 26 30
4 E 110 160
Gets the job done albeit a bit clumsy.
Would be interested if anyone can suggest a less messy way!
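One possibly less messy way is to merge on Name and compare the interval bounds directly instead of building sets. This is only a sketch on made-up data (the original frames are not shown here). It assumes start < stop in every row and treats the ranges as half-open, like range(start, stop). Unlike the loop, it keeps each df1 row once even if it overlaps several df2 rows.

```python
import pandas as pd

# Made-up sample data, not the frames from the question
df1 = pd.DataFrame({"Name": ["A", "B", "C"],
                    "start": [1, 124, 159],
                    "stop": [10, 200, 200]})
df2 = pd.DataFrame({"Name": ["A", "B", "C"],
                    "start": [20, 150, 100],
                    "stop": [30, 160, 170]})

# Pair every df1 row with the df2 rows sharing its Name;
# reset_index keeps df1's row labels in a column called 'index'
pairs = df1.reset_index().merge(df2, on="Name", suffixes=("", "_other"))

# Two half-open ranges overlap when each one starts before the other stops
overlaps = pairs[(pairs["start"] < pairs["stop_other"])
                 & (pairs["start_other"] < pairs["stop"])]

# Select the overlapping df1 rows, each at most once
df3 = df1.loc[overlaps["index"].unique()].reset_index(drop=True)
print(df3)
```

The merge creates one row per pair of same-named rows, so memory use can grow quickly when a Name is repeated many times in both frames.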
Remove rows in one dataframe if they are present in another dataframe
In Base R
df[-match(df2$ASV, df$ASV),]
or even
dplyr::anti_join(df, df2)
How to filter a dataframe based on values in another dataframe in R
You need to group them by A and then we can use a double inner_join.
Data:
df1 <- data.frame(A=c(1,5,1,5,1))
df2 <- data.frame(A=c(1,5,1,5,1), B=c(0.92,0.02,0.18,0.87,0.46))
Solution:
df1 %>%
  inner_join(df2 %>%
               filter(A == df1$A & B > 0.5) %>%
               group_by(A) %>%
               summarize(count = n())) %>%
  inner_join(df2 %>%
               filter(A == df1$A) %>%
               group_by(A) %>%
               summarize(A_count = n())) %>%
  mutate(C = count / A_count) %>%
  select(A, C) -> df1
Output:
A C
1 1 0.3333333
2 5 0.5000000
3 1 0.3333333
4 5 0.5000000
5 1 0.3333333