How to Filter a Data Frame

Filter data.frame rows by a logical condition

To select rows matching a single 'cell_type' value (e.g. 'hesc'), use ==:

expr[expr$cell_type == "hesc", ]

To select rows matching two or more 'cell_type' values (e.g. either 'hesc' or 'bj fibroblast'), use %in%:

expr[expr$cell_type %in% c("hesc", "bj fibroblast"), ]

How to filter a data frame

You are missing a comma in your statement.

Try this:

data[data[, "Var1"] > 10, ]

Or:

data[data$Var1 > 10, ]

Or:

subset(data, Var1 > 10)

As an example, try it on the built-in dataset mtcars:

data(mtcars)

mtcars[mtcars[, "mpg"] > 25, ]

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2


mtcars[mtcars$mpg > 25, ]

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

subset(mtcars, mpg > 25)

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

Filtering Pandas Dataframe using OR statement

From the docs:

Another common operation is the use of boolean vectors to filter the
data. The operators are: | for or, & for and, and ~ for not. These
must be grouped by using parentheses.

http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#boolean-indexing

Try:

alldata_balance = alldata[(alldata["IBRD"] != 0) | (alldata["IMF"] != 0)]
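As a self-contained sketch of the same idea (the IBRD and IMF column names come from the question; the data is invented):

```python
import pandas as pd

# Invented stand-in for the question's `alldata` frame.
alldata = pd.DataFrame({"IBRD": [0, 5, 0, 3], "IMF": [0, 0, 2, 0]})

# Keep rows where either column is non-zero. The parentheses around each
# comparison are required because | binds more tightly than != in Python.
alldata_balance = alldata[(alldata["IBRD"] != 0) | (alldata["IMF"] != 0)]
print(alldata_balance)
```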

Filter out data.frame rows with all columns NA but keep rows where only some are NA

We can use base R

teste[rowSums(!is.na(teste)) > 0, ]
#    a  b c
#1   1 NA 1
#3   3  3 3
#4  NA  4 4

Or using apply and any

teste[apply(!is.na(teste), 1, any), ]

which can be also used within filter

teste %>%
  filter(rowSums(!is.na(.)) > 0)

Or using c_across from dplyr, we can directly remove the rows with all NA

library(dplyr)
teste %>%
  rowwise %>%
  filter(!all(is.na(c_across(everything()))))
# A tibble: 3 x 3
# Rowwise:
#       a     b     c
#   <dbl> <dbl> <dbl>
# 1     1    NA     1
# 2     3     3     3
# 3    NA     4     4

NOTE: filter_all is superseded in recent versions of dplyr.

Filter one data frame based on other data frame in pandas

For anyone who is interested, I figured out a way to do it...

import pandas as pd

df3 = []
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1["Name"] == row2["Name"]:
            x = set(range(row1["start"], row1["stop"]))
            y = range(row2["start"], row2["stop"])
            if len(x.intersection(y)) > 0:
                df3.append(row1)
df3 = pd.DataFrame(df3).reset_index(drop=True)
print(df3)

  Name  start  stop
0    B    124   200
1    C    159   200
2    D     12    24
3    D     26    30
4    E    110   160

Gets the job done albeit a bit clumsy.

Would be interested if anyone can suggest a less messy way!
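One less messy option is to avoid iterrows entirely: merge on Name so every row of df1 is paired with each matching-Name row of df2, then test the range overlap as a vectorized condition. Sketched here on invented data with the question's Name/start/stop layout:

```python
import pandas as pd

# Invented data with the question's Name/start/stop columns.
df1 = pd.DataFrame({"Name": ["A", "B"], "start": [0, 124], "stop": [10, 200]})
df2 = pd.DataFrame({"Name": ["B", "B"], "start": [150, 500], "stop": [300, 600]})

# Pair every df1 row with every matching-Name df2 row, then keep df1 rows
# whose half-open range [start, stop) overlaps some df2 range.
merged = df1.merge(df2, on="Name", suffixes=("", "_2"))
overlap = (merged["start"] < merged["stop_2"]) & (merged["start_2"] < merged["stop"])
df3 = (merged.loc[overlap, ["Name", "start", "stop"]]
             .drop_duplicates()
             .reset_index(drop=True))
print(df3)
```

Two half-open ranges overlap exactly when each starts before the other ends, which is the same test the set intersection performs without materializing the ranges.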

How to filter data frame column with list elements

If order doesn't matter, you can use set operations. You have several options depending on whether you want an exact match, all items present, or at least one item present:

S = set(element_list)

data['equal'] = data['column'].apply(lambda x: S==set(x))

data['subset'] = data['column'].apply(S.issubset)

data['superset'] = data['column'].apply(S.issuperset)

Output:

             column  equal  subset  superset
0      [a, b, c, d]  False   False     False
1      [e, f, g, h]   True    True      True
2         [i, j, k]  False   False     False
3            [m, n]  False   False     False
4            [q, r]  False   False     False
5               [s]  False   False     False
6               [e]  False   False      True
7      [e, g, h, f]   True    True      True
8   [e, g, h, f, i]  False    True     False

You can use the boolean series to subset the dataframe:

data[data['column'].apply(lambda x: S==set(x))]

Output:

         column
1  [e, f, g, h]
7  [e, g, h, f]

Performance

If performance is important, you can use list comprehensions instead of apply:

data['equal'] = [S==set(x) for x in data['column']]

data['subset'] = [S.issubset(x) for x in data['column']]

data['superset'] = [S.issuperset(x) for x in data['column']]
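A runnable sketch of the list-comprehension variant, on a smaller invented frame:

```python
import pandas as pd

# Invented example frame; each cell holds a list of items.
data = pd.DataFrame({"column": [["a", "b"], ["e", "f", "g", "h"],
                                ["e"], ["e", "g", "h", "f", "i"]]})
S = {"e", "f", "g", "h"}

data["equal"] = [S == set(x) for x in data["column"]]          # exact match
data["subset"] = [S.issubset(x) for x in data["column"]]       # all of S present
data["superset"] = [S.issuperset(x) for x in data["column"]]   # only items of S
print(data)
```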

Filter data frame based on index value in Python

Try this:

Filter_df  = df[df.index.isin(my_list)]
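For example, on an invented frame with string labels as the index:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])
my_list = ["a", "c"]

# Keep only the rows whose index label appears in my_list.
Filter_df = df[df.index.isin(my_list)]
print(Filter_df)
```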

Filtering a large data frame based on column values using R

We can reshape to 'long' format with pivot_longer, then filter on a logical condition built from the first character extracted with substr:

library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = starts_with("code"),
               values_to = 'code', names_to = NULL) %>%
  filter(substr(code, 1, 1) == "E")

-output

# A tibble: 3 × 2
    IDs code
  <int> <chr>
1     1 E109
2     1 E341
3     3 E131

If the data is really big, we can filter before the pivot_longer to keep only rows that have at least one code starting with 'E':

df1 %>%
  filter(if_any(starts_with('code'), ~ substr(., 1, 1) == 'E')) %>%
  pivot_longer(cols = starts_with("code"),
               values_to = 'code', names_to = NULL) %>%
  filter(substr(code, 1, 1) == "E")

If it is a very big data set, another option is data.table. Convert the data.frame to a data.table (setDT), loop across the columns of interest (.SDcols) with lapply, replace the elements that do not start with "E" with NA, then use fcoalesce via do.call to get the first non-NA element in each row.

library(data.table)
na.omit(setDT(df1)[, .(IDs, code = do.call(fcoalesce,
  lapply(.SD, function(x) replace(x, substr(x, 1, 1) != "E", NA)))),
  .SDcols = patterns("code")])

-output

   IDs code
1:   1 E109
2:   1 E341
3:   3 E131

data

df1 <- structure(list(IDs = c(1L, 2L, 1L, 3L),
  code1 = c("C443", "AX31", "E341", "E131"),
  code2 = c("E109", "M223", "QWE1", "M223")),
  class = "data.frame", row.names = c(NA, -4L))

How can I filter panda data-frame based on dynamic operator passed as a variable by user from front end

You could take advantage of Python built-in function eval and f-strings like this:

import pandas as pd

# Toy dataframe
df = pd.DataFrame(
    {
        "term": ["Apple", "Banana", "Orange", "Pear"],
        "amount": [100, 50, 200, 25],
    }
)

# Examples of values passed from the front end
operator_var1 = ">="
amount_threshold = 75

# Filter dataframe
df = df[eval(f"df['amount'] {operator_var1} {amount_threshold}")]

print(df)
# Output:
#      term  amount
# 0   Apple     100
# 2  Orange     200

Be aware that running eval on user-supplied input is a security risk; validate the operator against an allowed list before evaluating it.
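A safer pattern is to map the user-supplied symbol to a function from the standard-library operator module, so only whitelisted comparisons can ever run. A sketch with the same toy data (the OPS mapping and the filter_by helper are hypothetical names, not part of pandas):

```python
import operator
import pandas as pd

df = pd.DataFrame({"term": ["Apple", "Banana", "Orange", "Pear"],
                   "amount": [100, 50, 200, 25]})

# Hypothetical whitelist: symbol -> comparison function.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}

def filter_by(frame, column, op_symbol, threshold):
    op = OPS[op_symbol]  # raises KeyError for anything not whitelisted
    return frame[op(frame[column], threshold)]

result = filter_by(df, "amount", ">=", 75)
print(result)
```

Unlike eval, an unknown symbol fails loudly here instead of executing arbitrary code.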

R: Filter a dataframe based on another dataframe

If you only want to keep the rows of e whose rownames occur in pf (to keep the ones that don't, negate the condition with !), you can just filter on the rownames:

library(tidyverse)

e %>%
  filter(rownames(e) %in% rownames(pf))

Another possibility is to create a rownames column for both dataframes, join on that column (i.e., rn), and then convert the rn column back to the rownames.

library(tidyverse)

list(e, pf) %>%
  map(~ .x %>%
        as.data.frame %>%
        rownames_to_column('rn')) %>%
  reduce(full_join, by = 'rn') %>%
  column_to_rownames('rn')

Output

        JHU_113_2.CEL JHU_144.CEL JHU_173.CEL JHU_176R.CEL JHU_182.CEL JHU_186.CEL JHU_187.CEL JHU_188.CEL JHU_203.CEL
2315374       6.28274     6.79161     6.11265      6.13997     6.68056     6.48156     6.45415     6.04542     5.99176
2315376       5.81678     5.71165     6.02794      5.37082     5.95527     5.75999     5.87863     5.54830     6.35571
2315587       8.88557     8.95699     8.36898      8.28993     8.41361     8.64980     8.74305     8.31915     8.43548
2315588       6.28650     6.66750     6.07503      6.76625     6.19819     6.84260     6.13916     6.40219     6.45059
2315591       6.97515     6.61705     6.51994      6.74982     6.60917     6.55182     6.62240     6.44394     5.76592
2315595       5.94179     5.39178     5.09497      4.96199     2.96431     4.95204     5.00979     4.06493     5.38048
2315598       4.99420     5.56888     5.57912      5.43960     5.19249     5.87991     5.60540     5.09513     5.43618
2315603       7.67845     7.90005     7.47594      6.75087     7.62805     8.00069     7.34296     6.81338     7.52014
2315604       6.20952     6.59687     6.14608      5.70518     6.49572     6.12622     6.23690     6.39569     6.70869
2315640       5.85307     6.07303     6.41875      6.07282     6.28283     6.13699     6.16377     6.48616     6.34162

