Delete Columns/Rows with More Than X% Missing

To remove columns with more than a given percentage of NA values, you can use colMeans(is.na(...)); the analogous rowMeans call handles rows.

## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)

## Remove columns with at least 50% NA
dat[, which(colMeans(!is.na(dat)) > 0.5)]

## Remove rows with at least 50% NA
dat[which(rowMeans(!is.na(dat)) > 0.5), ]

## Remove both columns and rows with at least 50% NA
dat[which(rowMeans(!is.na(dat)) > 0.5), which(colMeans(!is.na(dat)) > 0.5)]

How to remove rows in a dataframe with more than x number of Null values?

If I understand correctly, you need to remove a row only if the total number of NaNs in that row is more than 7:

df = df[df.isnull().sum(axis=1) <= 7]

This keeps only the rows with at most 7 NaNs and removes every row that has more than 7.
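
As a quick, self-contained check (toy data, not from the question), a row with 8 NaNs is dropped while rows with up to 7 survive:

import numpy as np
import pandas as pd

# Toy frame: 3 rows x 10 columns holding 2, 7 and 8 NaNs respectively
df = pd.DataFrame(np.ones((3, 10)))
df.iloc[0, :2] = np.nan
df.iloc[1, :7] = np.nan
df.iloc[2, :8] = np.nan

df = df[df.isnull().sum(axis=1) <= 7]
print(df.index.tolist())  # [0, 1] -- the 8-NaN row is gone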

How to drop entire record if more than 90% of features have missing value in pandas

You can use df.dropna() and set the thresh parameter, the minimum number of non-NA values a row must contain to be kept, to the value that corresponds to 10% of your columns (here 50, which assumes 500 columns):

df.dropna(axis=0, thresh=50, inplace=True)
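
Rather than hardcoding 50, thresh can be derived from the column count. A minimal sketch (the toy data and the 10% figure are assumptions for illustration):

import numpy as np
import pandas as pd

# Toy data: 5 rows x 10 columns, with roughly 70% of the values blanked out
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 10)))
df = df.mask(rng.random(df.shape) < 0.7)

# Keep a row only if at least 10% of its features are non-NA,
# i.e. drop it when more than 90% are missing
min_non_na = int(np.ceil(0.1 * df.shape[1]))
df = df.dropna(axis=0, thresh=min_non_na)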

Remove % of Items in Columns

I would use dplyr here.

If you want to use select() with a logical condition, you are probably looking for dplyr's where() selection helper, which is used as select(where(condition)).

I used an 80% threshold because 90% would keep all the columns and therefore would not illustrate the solution as well.

library(dplyr)

df %>% select(where(~ mean(is.na(.)) < 0.8))

It can also be done with base R and colMeans:

df[, c(TRUE, colMeans(is.na(df[-1])) < 0.8)]

or with purrr:

library(purrr)

df %>% keep(~ mean(is.na(.)) < 0.8)

Output:

  gene cell1 cell3
1    a   0.4    NA
2    b    NA   0.1
3    c   0.4   0.5
4    d    NA   0.5
5    e   0.5   0.6
6    f   0.6    NA

Data

df <- data.frame(gene = letters[1:6],
                 cell1 = c(0.4, NA, 0.4, NA, 0.5, 0.6),
                 cell2 = c(0.1, rep(NA, 5)),
                 cell3 = c(NA, 0.1, 0.5, 0.5, 0.6, NA))

R: deleting columns where certain percentage of values is missing


## Keep columns where less than 10% of the values are NA
x <- sample.df[sapply(sample.df, function(col) mean(is.na(col))) < 0.1]

Remove rows with missing data conditionally

You could use rowMeans:

df = read.table(text='     V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 6 7 2 1 2 3 4 1
2 5 5 4 3 2 1 3 7 6 7
3 6 6 NA NA NA NA NA NA NA NA
4 5 2 2 1 7 NA NA NA NA NA
5 7 NA NA NA NA NA NA NA NA NA')

df[rowMeans(is.na(df)) < 0.8, ]

Output:

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1  4  3  6  7  2  1  2  3  4   1
2  5  5  4  3  2  1  3  7  6   7
4  5  2  2  1  7 NA NA NA NA  NA

Hope this helps!

Remove columns with more than x negative values

Using colSums:

x <- 2
## Keep columns with at most x negative values
## (assumes no NAs; use colSums(df < 0, na.rm = TRUE) otherwise)
df <- df[, colSums(df < 0) <= x, drop = FALSE]

Remove rows with more than percentage of missing data for majority class samples only

In my opinion, concatenating the two DataFrames is not such a bad idea, but if you do not like it, here's my suggestion.

import numpy as np

mask_majority = df.eval("y == 'No'")
mask_missing = df.isna().sum(axis="columns") >= x

mask_drop = np.logical_and(mask_majority, mask_missing)
mask_keep = np.logical_not(mask_drop)
dfWithDroppedRows = df.loc[mask_keep, :]

Basically, I create a mask for the majority class and a mask for all rows with at least x missing values.
Then I combine the two masks into a mask of the rows that must be dropped, invert it, and use .loc to keep only the remaining rows.
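
The same combination also works with pandas' own boolean operators (& for and, ~ for not), so the numpy calls are optional. A self-contained toy example (the data and x are made up for illustration):

import pandas as pd

# Hypothetical data: y is the class label, a and b are features
df = pd.DataFrame({"y": ["No", "No", "Yes"],
                   "a": [None, 1.0, None],
                   "b": [None, 2.0, None]})
x = 2  # drop majority-class rows with at least this many missing values

mask_majority = df["y"].eq("No")
mask_missing = df.isna().sum(axis="columns") >= x

# ~ inverts the drop mask; .loc keeps everything else
dfWithDroppedRows = df.loc[~(mask_majority & mask_missing)]
print(dfWithDroppedRows)  # the first "No" row (2 NaNs) is dropped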

By the way, if you decide to use the initial solution of concatenating the two DataFrames, I would use the query method instead; it is more idiomatic. Note that dropna's thresh argument is the minimum number of non-NA values a row needs in order to survive, so dropping rows with x or more missing values means thresh=df.shape[1] - x + 1:

import pandas as pd

df_majority_droppedRows = df.query("y == 'No'").dropna(thresh=df.shape[1] - x + 1)
df_minority = df.query("y == 'Yes'")
dfWithDroppedRows = pd.concat([df_majority_droppedRows, df_minority])

