Filter data.frame rows by a logical condition
To select rows with a single 'cell_type' (e.g. 'hesc'), use ==:
expr[expr$cell_type == "hesc", ]
To select rows matching any of two or more 'cell_type' values (e.g. either 'hesc' or 'bj fibroblast'), use %in%:
expr[expr$cell_type %in% c("hesc", "bj fibroblast"), ]
How to filter a data frame
You are missing a comma in your statement.
Try this:
data[data[, "Var1"]>10, ]
Or:
data[data$Var1>10, ]
Or:
subset(data, Var1>10)
As an example, try it on the built-in dataset mtcars:
data(mtcars)
mtcars[mtcars[, "mpg"]>25, ]
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
mtcars[mtcars$mpg>25, ]
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
subset(mtcars, mpg>25)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Filtering Pandas Dataframe using OR statement
From the docs:
Another common operation is the use of boolean vectors to filter the
data. The operators are: | for or, & for and, and ~ for not. These
must be grouped by using parentheses.
http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#boolean-indexing
Try:
alldata_balance = alldata[(alldata[IBRD] !=0) | (alldata[IMF] !=0)]
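To make the parenthesised form concrete, here is a minimal sketch on made-up data. It assumes the column labels are the strings "IBRD" and "IMF"; in the line above they appear as bare names, which only works if IBRD and IMF are variables holding the column labels.

```python
import pandas as pd

# Hypothetical data; the column labels are assumed to be the strings "IBRD" and "IMF".
alldata = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "IBRD": [0, 5, 0, 2],
        "IMF": [0, 0, 3, 1],
    }
)

# Each comparison is wrapped in parentheses because | binds more tightly than !=.
mask = (alldata["IBRD"] != 0) | (alldata["IMF"] != 0)
alldata_balance = alldata[mask]
print(alldata_balance)
```

Only the row where both columns are zero is dropped. Without the parentheses, Python would try to evaluate `0 | alldata["IMF"]` first and the expression would fail.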
Filter data.frame with all columns NA but keep when some are NA
We can use base R:
teste[rowSums(!is.na(teste)) >0,]
#   a  b c
#1  1 NA 1
#3  3  3 3
#4 NA  4 4
Or using apply and any:
teste[apply(!is.na(teste), 1, any),]
The rowSums condition can also be used within filter:
teste %>%
  filter(rowSums(!is.na(.)) > 0)
Or, using c_across from dplyr, we can directly remove the rows in which every value is NA:
library(dplyr)
teste %>%
  rowwise %>%
  filter(!all(is.na(c_across(everything()))))
# A tibble: 3 x 3
# Rowwise:
#      a     b     c
#  <dbl> <dbl> <dbl>
#1     1    NA     1
#2     3     3     3
#3    NA     4     4
NOTE: filter_all has been superseded in dplyr (in favour of if_any and if_all), so it is best avoided in new code.
Filter one data frame based on other data frame in pandas
For anyone who is interested, I figured out a way to do it...
import pandas as pd

df3 = []
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        # Only compare rows that share the same Name
        if row1["Name"] == row2["Name"]:
            x = range(row1["start"], row1["stop"])
            x = set(x)
            y = range(row2["start"], row2["stop"])
            # Keep the df1 row if the two ranges share at least one value
            if len(x.intersection(y)) > 0:
                df3.append(row1)
df3 = pd.DataFrame(df3).reset_index(drop=True)
print(df3)
  Name  start  stop
0    B    124   200
1    C    159   200
2    D     12    24
3    D     26    30
4    E    110   160
Gets the job done albeit a bit clumsy.
Would be interested if anyone can suggest a less messy way!
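One possibly less messy route is to replace the nested loops with a merge on "Name" followed by a vectorised overlap test. The sketch below uses made-up inputs (the original df1 and df2 are not shown) and assumes start < stop in every row. Unlike the loop, it returns each df1 row at most once even when it overlaps several df2 rows.

```python
import pandas as pd

# Hypothetical inputs with the same column names as above.
df1 = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "start": [10, 124, 159, 12], "stop": [20, 200, 200, 24]}
)
df2 = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "start": [30, 150, 100, 20], "stop": [40, 160, 170, 22]}
)

# Keep df1's original row position so each row can be reported once.
left = df1.reset_index().rename(columns={"index": "row_id"})

# Pair every df1 row with every df2 row that shares its Name.
pairs = left.merge(df2, on="Name", suffixes=("", "_other"))

# Half-open intervals [start, stop) overlap when each starts before the other stops.
overlaps = (pairs["start"] < pairs["stop_other"]) & (pairs["start_other"] < pairs["stop"])

keep = pairs.loc[overlaps, "row_id"].unique()
df3 = df1.loc[df1.index.isin(keep)].reset_index(drop=True)
print(df3)
```

The merge still builds every same-Name pair, so memory grows with the number of matches per name, but the comparison itself avoids building Python sets row by row.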
How to filter data frame column with list elements
If order doesn't matter, you can use set operations.
You have several options depending on whether you want an exact match, all items present, or at least one item:
S = set(element_list)
data['equal'] = data['column'].apply(lambda x: S==set(x))
data['subset'] = data['column'].apply(S.issubset)
data['superset'] = data['column'].apply(S.issuperset)
Output:
            column  equal  subset  superset
0     [a, b, c, d]  False   False     False
1     [e, f, g, h]   True    True      True
2        [i, j, k]  False   False     False
3           [m, n]  False   False     False
4           [q, r]  False   False     False
5              [s]  False   False     False
6              [e]  False   False      True
7     [e, g, h, f]   True    True      True
8  [e, g, h, f, i]  False    True     False
You can use the boolean series to subset the dataframe:
data[data['column'].apply(lambda x: S==set(x))]
Output:
         column
1  [e, f, g, h]
7  [e, g, h, f]
Performance
If performance is important, you can use list comprehensions instead of apply:
data['equal'] = [S==set(x) for x in data['column']]
data['subset'] = [S.issubset(x) for x in data['column']]
data['superset'] = [S.issuperset(x) for x in data['column']]
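The "at least one item" case mentioned earlier is not covered by the three columns above. A minimal sketch of one way to get it is set.isdisjoint, shown here on made-up data; the column name any_common is a hypothetical label chosen for illustration.

```python
import pandas as pd

# Hypothetical data; element_list plays the role of the list being searched for.
element_list = ["e", "f", "g", "h"]
data = pd.DataFrame({"column": [["a", "b"], ["e", "f", "g", "h"], ["e"], ["h", "i"]]})

S = set(element_list)

# True when the row's list shares at least one item with S.
data["any_common"] = [not S.isdisjoint(x) for x in data["column"]]
print(data)
```

isdisjoint accepts any iterable, so the lists do not need to be converted to sets first.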
Filter data frame based on index value in Python
Try this:
Filter_df = df[df.index.isin(my_list)]
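As a small worked example of the line above (the frame and the contents of my_list are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with string index labels.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["w", "x", "y", "z"])

# Labels to keep; one of them does not exist in the index.
my_list = ["x", "z", "not_present"]

# isin builds a boolean mask over the index, so unknown labels are simply ignored.
Filter_df = df[df.index.isin(my_list)]
print(Filter_df)
```

This differs from df.loc[my_list], which raises a KeyError when a requested label is missing from the index.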
Filtering a large data frame based on column values using R
We can reshape to 'long' format with pivot_longer and filter by creating a logical vector from the first character extracted (with substr):
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = starts_with("code"),
               values_to = 'code', names_to = NULL) %>%
  filter(substr(code, 1, 1) == "E")
-output
# A tibble: 3 × 2
    IDs code
  <int> <chr>
1     1 E109
2     1 E341
3     3 E131
If the data is really big, we can filter before pivot_longer to keep only the rows in which at least one code column starts with 'E':
df1 %>%
  filter(if_any(starts_with('code'), ~ substr(., 1, 1) == 'E')) %>%
  pivot_longer(cols = starts_with("code"),
               values_to = 'code', names_to = NULL) %>%
  filter(substr(code, 1, 1) == "E")
If the data is very big, another option is data.table. Convert the data.frame to a 'data.table' (setDT), loop across the columns of interest (.SDcols) with lapply, replace the elements that do not start with "E" with NA, then use fcoalesce (called via do.call) to get the first non-NA element for each row:
library(data.table)
na.omit(setDT(df1)[, .(IDs, code = do.call(fcoalesce,
    lapply(.SD, function(x) replace(x, substr(x, 1, 1) != "E", NA)))),
    .SDcols = patterns("code")])
-output
   IDs code
1:   1 E109
2:   1 E341
3:   3 E131
data
df1 <- structure(list(IDs = c(1L, 2L, 1L, 3L), code1 = c("C443", "AX31",
"E341", "E131"), code2 = c("E109", "M223", "QWE1", "M223")),
class = "data.frame", row.names = c(NA,
-4L))
How can I filter a pandas DataFrame based on a dynamic operator passed as a variable by the user from the front end
You could take advantage of Python's built-in eval function and f-strings like this:
import pandas as pd
# Toy dataframe
df = pd.DataFrame(
    {
        "term": ["Apple", "Banana", "Orange", "Pear"],
        "amount": [100, 50, 200, 25],
    }
)
# Examples of values passed from front-end
operator_var1 = ">="
amount_threshold = 75
# Filter dataframe
df = df[eval(f"df['amount'] {operator_var1} {amount_threshold}")]
print(df)
# Outputs
     term  amount
0   Apple     100
2  Orange     200
Be aware that eval executes whatever code the string contains, so passing it values that come from a front end is risky unless they are validated first.
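If eval is a concern, one alternative is to look the operator string up in a fixed table of functions from Python's standard operator module. This is a sketch under assumptions, not the approach from the answer above: the set of allowed symbols and the helper name filter_by_amount are hypothetical choices for illustration.

```python
import operator

import pandas as pd

# Whitelist of operator strings the front end may send (an assumed set).
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def filter_by_amount(frame, op_symbol, threshold):
    """Hypothetical helper: filter on 'amount' using a whitelisted operator."""
    try:
        compare = OPERATORS[op_symbol]
    except KeyError:
        raise ValueError(f"Unsupported operator: {op_symbol!r}") from None
    # The operator functions work element-wise on a Series, like the symbols do.
    return frame[compare(frame["amount"], threshold)]


# Same toy dataframe as above.
df = pd.DataFrame(
    {
        "term": ["Apple", "Banana", "Orange", "Pear"],
        "amount": [100, 50, 200, 25],
    }
)

result = filter_by_amount(df, ">=", 75)
print(result)
```

Any string that is not a key in the table is rejected with a ValueError, so no user-supplied text is ever executed.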
R: Filter a dataframe based on another dataframe
If you only want to keep the rows of e whose rownames occur in pf (or, to keep those that don't occur, negate the condition with !), then you can just filter on the rownames:
library(tidyverse)
e %>%
  filter(rownames(e) %in% rownames(pf))
Another possibility is to create a rownames column for both dataframes. Then we can do a semi_join on that column (i.e., rn) and convert the rn column back to rownames.
library(tidyverse)
list(e, pf) %>%
  map(~ .x %>%
        as.data.frame %>%
        rownames_to_column('rn')) %>%
  # semi_join keeps only the rows of e that have a match in pf
  reduce(semi_join, by = 'rn') %>%
  column_to_rownames('rn')
Output
        JHU_113_2.CEL JHU_144.CEL JHU_173.CEL JHU_176R.CEL JHU_182.CEL JHU_186.CEL JHU_187.CEL JHU_188.CEL JHU_203.CEL
2315374       6.28274     6.79161     6.11265      6.13997     6.68056     6.48156     6.45415     6.04542     5.99176
2315376       5.81678     5.71165     6.02794      5.37082     5.95527     5.75999     5.87863     5.54830     6.35571
2315587       8.88557     8.95699     8.36898      8.28993     8.41361     8.64980     8.74305     8.31915     8.43548
2315588       6.28650     6.66750     6.07503      6.76625     6.19819     6.84260     6.13916     6.40219     6.45059
2315591       6.97515     6.61705     6.51994      6.74982     6.60917     6.55182     6.62240     6.44394     5.76592
2315595       5.94179     5.39178     5.09497      4.96199     2.96431     4.95204     5.00979     4.06493     5.38048
2315598       4.99420     5.56888     5.57912      5.43960     5.19249     5.87991     5.60540     5.09513     5.43618
2315603       7.67845     7.90005     7.47594      6.75087     7.62805     8.00069     7.34296     6.81338     7.52014
2315604       6.20952     6.59687     6.14608      5.70518     6.49572     6.12622     6.23690     6.39569     6.70869
2315640       5.85307     6.07303     6.41875      6.07282     6.28283     6.13699     6.16377     6.48616     6.34162