Subsetting a Data Frame to the Rows Not Appearing in Another Data Frame

Subset of dataframe based on values in another dataframe

As noted in the comments, the values contained stray whitespace, which is why they did not match. Use trimws to strip the whitespace on both sides before subsetting. If df1 is a vector:

df2[trimws(df2$relevantcolumn) %in% trimws(df1), ]

Or, if df1 is a data frame:

df2[trimws(df2$relevantcolumn) %in% trimws(df1$relevant_column), ]
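For readers working in pandas rather than R, the same idea can be sketched roughly as follows. The data here are made up for illustration, and the column names are simply carried over from the R snippet above; this is a minimal sketch, not the original poster's code.

```python
import pandas as pd

# Hypothetical data: some keys carry stray leading/trailing whitespace.
df1 = pd.DataFrame({"relevant_column": ["a", "b ", " c"]})
df2 = pd.DataFrame({"relevantcolumn": [" a", "b", "d "], "value": [1, 2, 3]})

# Strip whitespace on both sides before testing membership.
keys = df1["relevant_column"].str.strip()
mask = df2["relevantcolumn"].str.strip().isin(keys)

matched = df2[mask]      # rows of df2 whose key appears in df1
unmatched = df2[~mask]   # rows of df2 whose key does not appear in df1
```

Negating the mask with ~ gives the rows that do not appear in the other data frame.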

pandas get rows which are NOT in other dataframe

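One method is to store the result of an inner merge of both DataFrames, then select the rows of df1 whose column values do not appear in that common set. Note that this tests each column separately, so a row of df1 is also dropped if it shares only one value with a common row: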

In [119]:

common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
   col1  col2
0     1    10
1     2    11
2     3    12
Out[119]:
   col1  col2
3     4    13
4     5    14

EDIT

Another method, as you've found, is to use isin, which masks the matching rows as NaN so that you can drop them:

In [138]:

df1[~df1.isin(df2)].dropna()
Out[138]:
   col1  col2
3     4    13
4     5    14

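However, isin with a DataFrame argument matches on index and column labels as well as values, so this won't work if the matching rows of df2 sit at different index positions than in df1: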

df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})

will produce the entire df:

In [140]:

df1[~df1.isin(df2)].dropna()
Out[140]:
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
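One way to avoid the dependence on index alignment is an anti-join built from a left merge with indicator=True. The sketch below assumes df1 has the five rows shown in the outputs above and reuses the misaligned df2; the drop_duplicates call is an added precaution so that repeated rows in df2 do not multiply rows of df1.

```python
import pandas as pd

# df1 reconstructed from the outputs shown above (an assumption).
df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [10, 11, 12, 13, 14]})
# The misaligned df2: its rows match df1's rows 1-3, not 0-2.
df2 = pd.DataFrame({"col1": [2, 3, 4], "col2": [11, 12, 13]})

# Left merge on the shared columns; the `_merge` column records
# whether each row was found in both frames or only in df1.
merged = df1.merge(
    df2.drop_duplicates(), on=["col1", "col2"], how="left", indicator=True
)

# Keep rows found only in df1, then drop the helper column.
only_in_df1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df1)
```

Because the comparison is made on whole (col1, col2) pairs through the merge keys, it does not matter where the matching rows sit in df2.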

subset a column in data frame based on another data frame/list

We can use %in% to get a logical vector and subset the rows of 'table1' with it.

subset(table1, gene_ID %in% accessions40$V1)

For large tables, data.table may be a faster option; %chin% is its version of %in% for character vectors:

library(data.table)
setDT(table1)[gene_ID %chin% accessions40$V1]

Or use filter from dplyr

library(dplyr)
table1 %>%
  filter(gene_ID %in% accessions40$V1)

Subsetting a dataframe based on values in another dataframe

d <- read.table(text="hh_id   trans_type  transaction_value
hh1 food 4
hh1 water 5
hh1 transport 4
hh2 water 3
hh3 transport 1
hh3 food 10
hh4 food 5
hh4 transport 15
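hh4 water 10", header = TRUE)

# IDs of households with at least one water transaction
dw <- as.character(with(d, hh_id[trans_type == "water"]))
# Keep every row belonging to those households
ds <- d[d$hh_id %in% dw, ]
ds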
# hh_id trans_type transaction_value
# 1 hh1 food 4
# 2 hh1 water 5
# 3 hh1 transport 4
# 4 hh2 water 3
# 7 hh4 food 5
# 8 hh4 transport 15
# 9 hh4 water 10
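The same two-step logic (collect the IDs that satisfy a condition, then keep every row with one of those IDs) might look like this in pandas. The data are copied from the R example above; the variable names mirror the R ones. Note that pandas indexes rows from 0, so the row labels differ by one from the R output.

```python
import pandas as pd

d = pd.DataFrame({
    "hh_id": ["hh1", "hh1", "hh1", "hh2", "hh3", "hh3", "hh4", "hh4", "hh4"],
    "trans_type": ["food", "water", "transport", "water", "transport",
                   "food", "food", "transport", "water"],
    "transaction_value": [4, 5, 4, 3, 1, 10, 5, 15, 10],
})

# IDs of households with at least one water transaction.
dw = d.loc[d["trans_type"] == "water", "hh_id"].unique()

# Keep every transaction belonging to those households.
ds = d[d["hh_id"].isin(dw)]
print(ds)
```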

Subset a dataframe using start and stop points from another dataframe?

Using dplyr, we can left_join dat and df, then keep only the rows whose dat_frame lies between first and last for their respective id.

library(dplyr)

left_join(dat, df, by = c("dat_id" = "id")) %>%
  filter(between(dat_frame, first, last)) %>%
  select(-first, -last)

Or using the same logic in base R

subset(merge(dat, df, by.x = "dat_id", by.y = "id", all.x = TRUE),
       dat_frame >= first & dat_frame <= last)
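A pandas version of the same join-then-filter logic could be sketched as below. The contents of dat and df are not shown in the original question, so the values here are invented for illustration; only the column names are taken from the R code.

```python
import pandas as pd

# Hypothetical inputs; only the column names come from the R code above.
dat = pd.DataFrame({
    "dat_id": [1, 1, 1, 1, 2, 2, 2],
    "dat_frame": [1, 2, 3, 4, 1, 2, 3],
})
df = pd.DataFrame({"id": [1, 2], "first": [2, 1], "last": [3, 2]})

# Attach each id's start and stop points to its rows in dat.
merged = dat.merge(df, left_on="dat_id", right_on="id", how="left")

# Keep rows whose frame lies within [first, last] (inclusive on both ends).
in_range = merged["dat_frame"].between(merged["first"], merged["last"])
result = merged.loc[in_range, ["dat_id", "dat_frame"]]
print(result)
```

Bracket access (merged["first"]) is used deliberately, since first and last are also names of DataFrame methods.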

