Filter Each Column of a Data.Frame Based on a Specific Value

How do I select rows from a DataFrame based on column values?

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=, so the parentheses in the last example are necessary. Without them,

df['column_name'] >= A & df['column_name'] <= B

is parsed as

df['column_name'] >= (A & df['column_name']) <= B

which raises a "Truth value of a Series is ambiguous" error.
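
For instance, here is a minimal sketch that reproduces the error (the DataFrame and the bounds A and B are made up for illustration):

import pandas as pd

df = pd.DataFrame({'column_name': [1, 5, 9]})
A, B = 2, 8

print(df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)])  # keeps the row with 5

# Omitting the parentheses evaluates A & df['column_name'] first, then hits the
# ambiguous chained comparison and raises ValueError:
# df.loc[df['column_name'] >= A & df['column_name'] <= B]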


To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
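
For plain row selection, boolean indexing without .loc returns the same rows; .loc matters once you assign into the selection:

print(df[df['A'] == 'foo'])   # same rows as df.loc[df['A'] == 'foo']
# .loc is required for assignment, e.g. df.loc[df['A'] == 'foo', 'D'] = 0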

If you have multiple values you want to include, put them in a
list (or more generally, any iterable) and use isin:

print(df.loc[df['B'].isin(['one','three'])])

yields

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14
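
The same selection can also be written with df.query, which some find more readable:

print(df.query('B in ["one", "three"]'))   # equivalent to the isin call above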

Note, however, that if you wish to do this many times, it is more efficient to
make an index first, and then use df.loc:

df = df.set_index(['B'])
print(df.loc['one'])

yields

       A  C   D
B
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index use df.index.isin:

df.loc[df.index.isin(['one','two'])]

yields

       A  C   D
B
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12
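
When the index-based lookups are done, reset_index turns B back into an ordinary column:

df = df.reset_index()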

Filter records from one data frame based on column values in a second data frame in Python

IIUC, you're looking for a chained isin:

out = df1[df1['date'].isin(df2['date']) &
          df1['id'].isin(df2['id']) &
          (df1['log'].isin(df2['log1']) | df1['log'].isin(df2['log2']))]

Output:

   date    id    log  name  col1 col2
0     1  uu1q  (2,4)   xyz  1123  qqq
1     1  uu1q  (3,5)   aas  2132  wew
2     1  uu1q  (7,6)  wqas  2567  uuo
3     5  u25a  (4,7)   enj   666  ttt
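
A self-contained sketch of the same pattern (the frames below are invented to mirror the question's shape, so the column contents are illustrative only):

import pandas as pd

df1 = pd.DataFrame({'date': [1, 1, 2],
                    'id':   ['uu1q', 'uu1q', 'u25a'],
                    'log':  ['(2,4)', '(3,5)', '(4,7)']})
df2 = pd.DataFrame({'date': [1, 1],
                    'id':   ['uu1q', 'uu1q'],
                    'log1': ['(2,4)', '(9,9)'],
                    'log2': ['(8,8)', '(3,5)']})

out = df1[df1['date'].isin(df2['date']) &
          df1['id'].isin(df2['id']) &
          (df1['log'].isin(df2['log1']) | df1['log'].isin(df2['log2']))]
print(out)   # keeps the first two rows; date 2 never appears in df2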

Filter a dataframe based on condition in columns selected by name pattern

You can apply a filter across multiple columns at once using if_all:

library(dplyr)

df %>%
  filter(if_all(matches("_qvalue"), ~ . < 0.05))

In this case I use the filtering condition . < 0.05 on all columns whose names match _qvalue.

Your second approach can also work if you group by ID first and then use all inside filter:

library(tidyr)  # for gather()

df_ID = df %>% mutate(ID = 1:n())

df_ID %>%
  select(contains("qval"), ID) %>%
  gather(variable, value, -ID) %>%
  group_by(ID) %>%
  filter(all(value < 0.05)) %>%
  semi_join(df_ID, ., by = "ID")  # keep the original (wide) rows whose ID passed the filter

Filter each column of a data.frame based on a specific value

Here's another option with slice, which can be used similarly to filter in this case. The main difference is that you supply an integer vector to slice, whereas filter takes a logical vector.

df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))

What I like about this approach is that, because we use select inside rowSums, you can make use of all the special helper functions that select supplies, such as matches.


Let's see how it compares to the other answers:

library(microbenchmark)

df <- data.frame(replicate(5, sample(1:10, 10e6, rep = TRUE)))

mbm <- microbenchmark(
  Marat = df %>% filter(!rowSums(.[, !colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
  dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
  times = 50L,
  unit = "relative"
)

# Unit: relative
#      expr      min       lq   median       uq      max neval
#     Marat 1.304216 1.290695 1.290127 1.288473 1.290609    50
#   Richard 1.139796 1.146942 1.124295 1.159715 1.160689    50
#  dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50


Edit note: updated with a more reliable benchmark of 50 repetitions (times = 50L).


Following a comment that base R would be as fast as the slice approach (without specifying which base R approach was meant exactly), I decided to update my answer with a comparison to base R using almost the same logic. For base R I used:

base = df[!rowSums(df[-5L] < 2L), ]
base_which = df[which(!rowSums(df[-5L] < 2L)), ]

Benchmark:

df <- data.frame(replicate(5, sample(1:10, 10e6, rep = TRUE)))

mbm <- microbenchmark(
  Marat = df %>% filter(!rowSums(.[, !colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
  dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
  base = df[!rowSums(df[-5L] < 2L), ],
  base_which = df[which(!rowSums(df[-5L] < 2L)), ],
  times = 50L,
  unit = "relative"
)

# Unit: relative
#        expr      min       lq   median       uq      max neval
#       Marat 1.265692 1.279057 1.298513 1.279167 1.203794    50
#     Richard 1.124045 1.160075 1.163240 1.169573 1.076267    50
#    dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50
#        base 2.784058 2.769062 2.710305 2.669699 2.576825    50
#  base_which 1.458339 1.477679 1.451617 1.419686 1.412090    50


Neither base R approach performs better than, or even comparably to, the slice approach.

Edit note #2: added benchmarks for the base R options.

Filter a DataFrame on a column if a list value is contained in the column value (Pandas)

Here you go:

import pandas as pd

df = pd.DataFrame({'column': ['abc', 'def', 'ghi', 'abc, def', 'ghi, jkl', 'abc']})
filter_list = ['abc', 'jkl']  # assumed example list; reproduces the output below
contains_filter = '|'.join(filter_list)
df = df[pd.notna(df.column) & df.column.str.contains(contains_filter)]

Output:

     column
0       abc
3  abc, def
4  ghi, jkl
5       abc
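
One caveat: str.contains interprets its pattern as a regular expression, so if the values in filter_list can contain regex metacharacters, escape them first:

import re
contains_filter = '|'.join(map(re.escape, filter_list))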

Filter data based on multiple columns in second dataframe python

You could use multiple isin calls and chain them with the & operator. Since final_gps can match either gps1 or gps2, those two membership checks are combined with the | operator inside parentheses:

out = (df1[df1['date'].isin(df2['date']) &
           df1['agent_id'].isin(df2['agent_id']) &
           (df1['final_gps'].isin(df2['gps1']) | df1['final_gps'].isin(df2['gps2']))]
       .reset_index(drop=True))

Output:

         date  agent_id  final_gps  …
0  14-02-2020     12abc     (1, 2)  …
1  14-02-2020     12abc     (7, 6)  …
2  14-02-2020     12abc     (3, 4)  …
3  14-02-2020     33bcd     (6, 7)  …
4  14-02-2020     33bcd     (8, 9)  …
5  20-02-2020     12abc     (3, 5)  …
6  20-02-2020     12abc     (3, 1)  …
7  20-02-2020     44hgf     (1, 6)  …
8  20-02-2020     44hgf     (3, 7)  …
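
Note that chained isin checks each column independently: a row of df1 is kept if its date, agent_id, and final_gps each appear somewhere in df2, not necessarily in the same row of df2. If the values must match row-wise, a merge is the usual tool; here is a sketch under that assumption, reusing the column names above:

# keep df1 rows whose (date, agent_id, final_gps) triple appears as
# (date, agent_id, gps1) or (date, agent_id, gps2) in a single row of df2
m1 = df1.merge(df2[['date', 'agent_id', 'gps1']],
               left_on=['date', 'agent_id', 'final_gps'],
               right_on=['date', 'agent_id', 'gps1']).drop(columns='gps1')
m2 = df1.merge(df2[['date', 'agent_id', 'gps2']],
               left_on=['date', 'agent_id', 'final_gps'],
               right_on=['date', 'agent_id', 'gps2']).drop(columns='gps2')
out = pd.concat([m1, m2]).drop_duplicates().reset_index(drop=True)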

Filtering Dataframe by keeping numeric values of a specific column only in R

You could use a regular expression to filter the relevant rows of your dataframe.
The regular expression ^\\d+(\\.\\d+)?$ matches character strings that consist only of digits, optionally with . as a decimal separator (e.g. 2 or 2.3). You can then convert the Cost column to numeric using as.numeric() if needed.

See the example below:

Group = c("A", "A", "A", "B", "B", "C", "C", "C")
Cost = c(21, 22, "closed", 12, 11, "ended", "closing", 13)
Year = c(2017,2016,2015,2017,2016,2017,2016,2015)
df = data.frame(Group, Cost, Year)

df[grep(pattern = "^\\d+(\\.\\d+)?$", df[,"Cost"]), ]
#>   Group Cost Year
#> 1     A   21 2017
#> 2     A   22 2016
#> 4     B   12 2017
#> 5     B   11 2016
#> 8     C   13 2015

Note that this technique works even if your Cost column is of factor class, whereas df[!is.na(as.numeric(df$Cost)), ] does not; for the latter you need to add as.character() first: df[!is.na(as.numeric(as.character(df$Cost))), ]. Both techniques keep the factor levels.


