How do I select rows from a DataFrame based on column values?
To select rows whose column value equals a scalar, some_value, use ==:
df.loc[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &:
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses,
df['column_name'] >= A & df['column_name'] <= B
is parsed as
df['column_name'] >= (A & df['column_name']) <= B
which results in a "Truth value of a Series is ambiguous" error.
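A minimal sketch (with toy data) showing both the correct form and the error the missing parentheses trigger:

```python
import pandas as pd

df = pd.DataFrame({'column_name': [1, 5, 10]})
A, B = 2, 8

# Correct: each comparison is parenthesized before combining with &
print(df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)])

# Incorrect: & binds tighter than >=, so this is parsed as
# df['column_name'] >= (A & df['column_name']) <= B; the chained
# comparison needs bool() of a Series and raises ValueError
try:
    df.loc[df['column_name'] >= A & df['column_name'] <= B]
except ValueError as e:
    print(e)  # "The truth value of a Series is ambiguous. ..."
```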
To select rows whose column value does not equal some_value, use !=:
df.loc[df['column_name'] != some_value]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:
df.loc[~df['column_name'].isin(some_values)]
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
yields
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
If you have multiple values you want to include, put them in a list (or, more generally, any iterable) and use isin:
print(df.loc[df['B'].isin(['one','three'])])
yields
A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:
df = df.set_index(['B'])
print(df.loc['one'])
yields
A C D
B
one foo 0 0
one bar 1 2
one foo 6 12
or, to include multiple values from the index, use df.index.isin:
df.loc[df.index.isin(['one','two'])]
yields
A C D
B
one foo 0 0
one bar 1 2
two foo 2 4
two foo 4 8
two bar 5 10
one foo 6 12
Filter records from one data frame based on column values in a second data frame in Python
IIUC, you're looking for a chained isin:
out = df1[df1['date'].isin(df2['date']) &
          df1['id'].isin(df2['id']) &
          (df1['log'].isin(df2['log1']) | df1['log'].isin(df2['log2']))]
Output:
date id log name col1 col2
0 1 uu1q (2,4) xyz 1123 qqq
1 1 uu1q (3,5) aas 2132 wew
2 1 uu1q (7,6) wqas 2567 uuo
3 5 u25a (4,7) enj 666 ttt
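A runnable sketch of the pattern with made-up frames (the column values and the extra frames below are illustrative, not from the original question):

```python
import pandas as pd

# Hypothetical inputs mirroring the shape described in the question
df1 = pd.DataFrame({'date': [1, 1, 5, 6],
                    'id': ['uu1q', 'uu1q', 'u25a', 'zz9x'],
                    'log': ['(2,4)', '(3,5)', '(4,7)', '(9,9)']})
df2 = pd.DataFrame({'date': [1, 5],
                    'id': ['uu1q', 'u25a'],
                    'log1': ['(2,4)', '(4,7)'],
                    'log2': ['(3,5)', '(8,8)']})

# Keep df1 rows whose date and id appear in df2, and whose log
# appears in either of df2's log columns
out = df1[df1['date'].isin(df2['date']) &
          df1['id'].isin(df2['id']) &
          (df1['log'].isin(df2['log1']) | df1['log'].isin(df2['log2']))]
print(out)
```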
Filter a dataframe based on condition in columns selected by name pattern
You can filter multiple columns at once using if_all:
library(dplyr)
df %>%
filter(if_all(matches("_qvalue"), ~ . < 0.05))
In this case I apply the filtering condition . < 0.05 to all columns whose name matches _qvalue.
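For readers coming from pandas, a rough analogue of this if_all filter (with a hypothetical frame) keeps the rows where every column whose name contains _qvalue is below 0.05:

```python
import pandas as pd

# Hypothetical data: two q-value columns plus an unrelated one
df = pd.DataFrame({'a_qvalue': [0.01, 0.20, 0.03],
                   'b_qvalue': [0.04, 0.01, 0.10],
                   'effect':   [1.5, 2.0, 0.7]})

# filter(like='_qvalue') selects the matching columns;
# .all(axis=1) requires the condition to hold in every one of them
mask = (df.filter(like='_qvalue') < 0.05).all(axis=1)
print(df[mask])
```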
Your second approach can also work if you group by ID first and then use all inside filter:
df_ID = df %>% mutate(ID = 1:n())
df_ID %>%
select(contains("qval"), ID) %>%
gather(variable, value, -ID) %>%
group_by(ID) %>%
filter(all(value < 0.05)) %>%
semi_join(df_ID, by = "ID")
Filter each column of a data.frame based on a specific value
Here's another option with slice, which can be used similarly to filter in this case. The main difference is that you supply an integer vector to slice, whereas filter takes a logical vector.
df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))
What I like about this approach is that because we use select inside rowSums, you can make use of all the special functions that select supplies, like matches, for example.
Let's see how it compares to the other answers:
df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))
mbm <- microbenchmark(
Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
times = 50L,
unit = "relative"
)
#Unit: relative
# expr min lq median uq max neval
# Marat 1.304216 1.290695 1.290127 1.288473 1.290609 50
# Richard 1.139796 1.146942 1.124295 1.159715 1.160689 50
# dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000 50
Edit note: updated with more reliable benchmark with 50 repetitions (times = 50L).
Following a comment that base R would have the same speed as the slice approach (without specification of which base R approach is meant exactly), I decided to update my answer with a comparison to base R using almost the same approach as in my answer. For base R I used:
base = df[!rowSums(df[-5L] < 2L), ],
base_which = df[which(!rowSums(df[-5L] < 2L)), ]
Benchmark:
df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))
mbm <- microbenchmark(
Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
base = df[!rowSums(df[-5L] < 2L), ],
base_which = df[which(!rowSums(df[-5L] < 2L)), ],
times = 50L,
unit = "relative"
)
#Unit: relative
# expr min lq median uq max neval
# Marat 1.265692 1.279057 1.298513 1.279167 1.203794 50
# Richard 1.124045 1.160075 1.163240 1.169573 1.076267 50
# dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000 50
# base 2.784058 2.769062 2.710305 2.669699 2.576825 50
# base_which 1.458339 1.477679 1.451617 1.419686 1.412090 50
Neither of these two base R approaches performs better than, or even comparably to, the slice approach.
Edit note #2: added benchmark with base R options.
Filter a DataFrame on a column if a list value is contained in the column value (Pandas)
Here you go:
import pandas as pd

df = pd.DataFrame({'column': ['abc', 'def', 'ghi', 'abc, def', 'ghi, jkl', 'abc']})
filter_list = ['abc', 'jkl']  # example list; the question defines filter_list elsewhere
contains_filter = '|'.join(filter_list)
df = df[pd.notna(df.column) & df.column.str.contains(contains_filter)]
Output:
column
0 abc
3 abc, def
4 ghi, jkl
5 abc
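One caveat worth noting: str.contains treats the joined string as a regular expression, so if the filter values may contain regex metacharacters, escape them first (a sketch with made-up values):

```python
import re
import pandas as pd

df = pd.DataFrame({'column': ['a.c', 'abc', 'a(b)']})
filter_list = ['a.c', 'a(b)']  # values containing regex metacharacters

# Without re.escape, 'a.c' would also match 'abc' and the bare '('
# would break the pattern; escaping makes each value match literally
pattern = '|'.join(map(re.escape, filter_list))
print(df[df['column'].str.contains(pattern)])
```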
Filter data based on multiple columns in a second dataframe in Python
You could use multiple isin calls and chain them with the & operator. Since final_gps can be either gps1 or gps2, we combine those two checks with the | operator in brackets:
out = (df1[df1['date'].isin(df2['date']) &
df1['agent_id'].isin(df2['agent_id']) &
(df1['final_gps'].isin(df2['gps1']) | df1['final_gps'].isin(df2['gps2']))]
.reset_index(drop=True))
Output:
date agent_id final_gps ….
0 14-02-2020 12abc (1, 2) …
1 14-02-2020 12abc (7, 6) …
2 14-02-2020 12abc (3, 4) …
3 14-02-2020 33bcd (6, 7) …
4 14-02-2020 33bcd (8, 9) …
5 20-02-2020 12abc (3, 5) …
6 20-02-2020 12abc (3, 1) …
7 20-02-2020 44hgf (1, 6) …
8 20-02-2020 44hgf (3, 7) …
Filtering Dataframe by keeping numeric values of a specific column only in R
You could use a regular expression to filter the relevant rows of your dataframe.
The regular expression ^\\d+(\\.\\d+)?$ matches character values that contain only digits, possibly with . as a decimal separator (e.g. 2, 2.3). You could then convert the Cost column to numeric using as.numeric() if needed.
See the example below:
Group = c("A", "A", "A", "B", "B", "C", "C", "C")
Cost = c(21,22,"closed", 12, 11,"ended", "closing", 13)
Year = c(2017,2016,2015,2017,2016,2017,2016,2015)
df = data.frame(Group, Cost, Year)
df[grep(pattern = "^\\d+(\\.\\d+)?$", df[,"Cost"]), ]
#> Group Cost Year
#> 1 A 21 2017
#> 2 A 22 2016
#> 4 B 12 2017
#> 5 B 11 2016
#> 8 C 13 2015
Note that this technique works even if your Cost column is of factor class, while using df[!is.na(as.numeric(df$Cost)), ] does not. For the latter you need to add as.character() first: df[!is.na(as.numeric(as.character(df$Cost))), ]. Both techniques keep factor levels.
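In pandas, the equivalent filter (a sketch with assumed data mirroring the R example) is usually done with pd.to_numeric(errors='coerce'), which turns non-numeric strings into NaN:

```python
import pandas as pd

# Hypothetical frame mirroring the R example
df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B'],
                   'Cost': ['21', '22', 'closed', '12'],
                   'Year': [2017, 2016, 2015, 2017]})

# Keep rows where Cost parses as a number; 'closed' becomes NaN
numeric_mask = pd.to_numeric(df['Cost'], errors='coerce').notna()
print(df[numeric_mask])
```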