Filtering a Data Frame by Values in a Column

How do I select rows from a DataFrame based on column values?

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. By Python's operator precedence rules, & binds more tightly than <= and >=, so the parentheses in the last example are necessary. Without the parentheses,

df['column_name'] >= A & df['column_name'] <= B

is parsed as

df['column_name'] >= (A & df['column_name']) <= B

which raises a "The truth value of a Series is ambiguous" ValueError.
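
A minimal sketch of that failure mode (the frame and bounds here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'column_name': [1, 5, 10]})
A, B = 2, 8

try:
    # no parentheses: parsed as df['column_name'] >= (A & df['column_name']) <= B
    df['column_name'] >= A & df['column_name'] <= B
except ValueError as err:
    print(err)  # The truth value of a Series is ambiguous. ...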


To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple values you want to include, put them in a
list (or more generally, any iterable) and use isin:

print(df.loc[df['B'].isin(['one','three'])])

yields

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14
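
As an aside, df.query gives an equivalent spelling; the @ prefix references a Python variable (the list name here is just for illustration):

some_values = ['one', 'three']
print(df.query('B in @some_values'))  # same rows as the isin version above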

Note, however, that if you wish to do this many times, it is more efficient to
make an index first and then use df.loc (a rough timing sketch follows the examples below):

df = df.set_index(['B'])
print(df.loc['one'])

yields

       A  C   D
B
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index, use df.index.isin:

df.loc[df.index.isin(['one','two'])]

yields

       A  C   D
B
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12
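
As a rough sketch of the efficiency claim above (timings are illustrative; on a frame this small the difference is noise, the gain shows up on large frames):

import timeit

df_flat = df.reset_index()  # version with 'B' back as a regular column

mask_time = timeit.timeit(lambda: df_flat.loc[df_flat['B'] == 'one'], number=1000)
index_time = timeit.timeit(lambda: df.loc['one'], number=1000)
print(mask_time, index_time)  # index lookups typically win on large frames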

Filter a pandas DataFrame on a column where the column value contains any value from a list

Here you go:

df = pd.DataFrame({'column': ['abc', 'def', 'ghi', 'abc, def', 'ghi, jkl', 'abc']})
filter_list = ['abc', 'jkl']  # example values; this list reproduces the output below
contains_filter = '|'.join(filter_list)
df = df[pd.notna(df.column) & df.column.str.contains(contains_filter)]

Output:

     column
0       abc
3  abc, def
4  ghi, jkl
5       abc
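
One caveat: str.contains treats its pattern as a regular expression, so if the values in filter_list may contain regex metacharacters, it is safer to escape them first:

import re

contains_filter = '|'.join(map(re.escape, filter_list))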

Filtering a large data frame based on column values using R

We can reshape to 'long' format with pivot_longer and then filter on a logical vector built from the first character of each code (extracted with substr):

library(dplyr)
library(tidyr)

df1 %>%
  pivot_longer(cols = starts_with("code"),
               values_to = 'code', names_to = NULL) %>%
  filter(substr(code, 1, 1) == "E")

Output:

# A tibble: 3 × 2
    IDs code
  <int> <chr>
1     1 E109
2     1 E341
3     3 E131

If the data is really big, we can filter before the pivot_longer to keep only the rows that have at least one code starting with 'E':

df1 %>%
  filter(if_any(starts_with('code'), ~ substr(., 1, 1) == 'E')) %>%
  pivot_longer(cols = starts_with("code"),
               values_to = 'code', names_to = NULL) %>%
  filter(substr(code, 1, 1) == "E")

If the data is very big, another option is data.table. Convert the data.frame to a data.table (setDT), loop across the columns of interest (.SDcols) with lapply, replace the elements that do not start with "E" with NA, and then use fcoalesce via do.call to get the first non-NA element in each row:

library(data.table)

na.omit(setDT(df1)[, .(IDs, code = do.call(fcoalesce,
  lapply(.SD, function(x) replace(x, substr(x, 1, 1) != "E", NA)))),
  .SDcols = patterns("code")])

Output:

   IDs code
1:   1 E109
2:   1 E341
3:   3 E131

data

df1 <- structure(list(IDs = c(1L, 2L, 1L, 3L),
                      code1 = c("C443", "AX31", "E341", "E131"),
                      code2 = c("E109", "M223", "QWE1", "M223")),
                 class = "data.frame", row.names = c(NA, -4L))

How can I filter a single column in a DataFrame on multiple values?

Put all 61 MRNs into a list:

mrnList = [val1, val2, ..., val61]

Then filter on these MRNs:

df_filtered = df[df['MRN'].isin(mrnList)]

Keep your MRN values' datatype in mind while building mrnList; isin compares values exactly, so strings will not match integers.
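
A small sketch of why the datatype matters (values are illustrative):

import pandas as pd

df = pd.DataFrame({'MRN': [1001, 1002, 1003]})
print(df[df['MRN'].isin(['1001'])])  # empty: the string '1001' != the integer 1001
print(df[df['MRN'].isin([1001])])    # matches the first row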

Filter pandas dataframe based on values in multiple columns

UPDATE:

You can replace the empty strings with NaN and then use isin with all(axis=1) to drop the rows in which every dxpoa column is '7', 'N', or missing:

In [196]: df[~df[cols].replace('',np.nan).isin(['7','N', np.nan]).all(axis=1)]
Out[196]:
a b c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0 0 A X W N X
2 7 W N W W 1 Z
4 Y 0 W N X 1
5 N X 1 E 1 Z 7
6 1 X 7 0 A W A
7 X X Z X N A 1
8 7 1 A N X Z N
10 A N Z 7 0 A E
11 E N A Z N N 1
12 E A 1 Z E E W
13 N W Z E X A 0
14 Y 1 A W A E X

OLD answer:

Show the rows where any of the columns contains 7 or N:

In [197]: df.loc[df[cols].isin(['7','N']).any(axis=1)]
Out[197]:
a b c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0 0 A X W N X
1 Z W 2 7 7
3 1 7 E N N N N
4 Y 0 W N X 1
5 N X 1 E 1 Z 7
7 X X Z X N A 1
8 7 1 A N X Z N
9 N A Z N N N
10 A N Z 7 0 A E
11 E N A Z N N 1

Remove the rows where any of the columns contains 7 or N:

In [198]: df.loc[~df[cols].isin(['7','N']).any(axis=1)]
Out[198]:
a b c dxpoa1 dxpoa2 dxpoa3 dxpoa4
2 7 W N W W 1 Z
6 1 X 7 0 A W A
12 E A 1 Z E E W
13 N W Z E X A 0
14 Y 1 A W A E X

Replace any with all if you want to keep (or exclude) only the rows where all of the columns contain either 7 or N.
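
For example, to keep only the rows where every dxpoa column contains 7 or N (a sketch against the setup below):

df.loc[df[cols].isin(['7', 'N']).all(axis=1)]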

Setup:

import numpy as np
import pandas as pd

rows = 15
s = [''] + list('YWE17N0AZX')
# note: no random seed was set, so the tables above will not reproduce exactly
df = pd.DataFrame(np.random.choice(s, size=(rows, 7)),
                  columns=list('abc') + ['dxpoa1', 'dxpoa2', 'dxpoa3', 'dxpoa4'])

cols = df.filter(like='dxpoa').columns

Filtering Dataframe by keeping numeric values of a specific column only in R

You can use a regular expression to filter the relevant rows of your dataframe.
The regular expression ^\\d+(\\.\\d+)?$ matches strings that consist only of digits, optionally with . as a decimal separator (e.g. 2 or 2.3). You can then convert the Cost column to numeric with as.numeric() if needed.

See the example below:

Group = c("A", "A", "A", "B", "B", "C", "C", "C")
Cost = c(21,22,"closed", 12, 11,"ended", "closing", 13)
Year = c(2017,2016,2015,2017,2016,2017,2016,2015)
df = data.frame(Group, Cost, Year)


df[grep(pattern = "^\\d+(\\.\\d+)?$", df[, "Cost"]), ]
#>   Group Cost Year
#> 1     A   21 2017
#> 2     A   22 2016
#> 4     B   12 2017
#> 5     B   11 2016
#> 8     C   13 2015

Note that this technique works even when the Cost column is of factor class, whereas df[!is.na(as.numeric(df$Cost)), ] does not; for the latter you need to add as.character() first: df[!is.na(as.numeric(as.character(df$Cost))), ]. Both techniques keep the factor levels.

Filter a dataframe based on condition in columns selected by name pattern

You can filter multiple columns at once using if_all:

library(dplyr)

df %>%
  filter(if_all(matches("_qvalue"), ~ . < 0.05))

Here the filtering condition . < 0.05 is applied to every column whose name matches _qvalue.

Your second approach can also work if you group by ID first and then use all inside filter:

library(tidyr)  # for gather()

df_ID = df %>% mutate(ID = 1:n())

qval_ok = df_ID %>%
  select(contains("qval"), ID) %>%
  gather(variable, value, -ID) %>%
  group_by(ID) %>%
  filter(all(value < 0.05))

# semi_join keeps the original (wide) rows whose ID passed the filter
df_ID %>% semi_join(qval_ok, by = "ID")

