Delete columns/rows with more than x% missing
To remove columns with more than a given share of NA values, you can use colMeans(is.na(...)):
## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)
## Remove columns with more than 50% NA
dat[, which(colMeans(!is.na(dat)) > 0.5)]
## Remove rows with more than 50% NA
dat[which(rowMeans(!is.na(dat)) > 0.5), ]
## Remove columns and rows with more than 50% NA
dat[which(rowMeans(!is.na(dat)) > 0.5), which(colMeans(!is.na(dat)) > 0.5)]
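For readers working in pandas rather than R, the same keep-if-mostly-non-missing logic can be sketched like this (a hypothetical DataFrame mirroring the R sample data above):

```python
import numpy as np
import pandas as pd

# Sample data mirroring the R example: a 10x10 frame with 50 values set to NaN
arr = np.arange(1, 101, dtype=float).reshape(10, 10)
rng = np.random.default_rng(0)
arr.ravel()[rng.choice(100, size=50, replace=False)] = np.nan
dat = pd.DataFrame(arr)

# Keep columns with more than 50% non-NA values
cols_kept = dat.loc[:, dat.notna().mean() > 0.5]
# Keep rows with more than 50% non-NA values
rows_kept = dat.loc[dat.notna().mean(axis=1) > 0.5]
```

`dat.notna().mean()` plays the role of R's `colMeans(!is.na(dat))`; with `axis=1` it plays the role of `rowMeans`.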
How to remove rows in a dataframe with more than x number of Null values?
If I understand correctly, you need to remove rows only if the total number of NaNs in a row is 7 or more:
df = df[df.isnull().sum(axis=1) < 7]
This keeps only the rows with fewer than 7 NaNs and removes every row with 7 or more.
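A minimal check of this filter, using a hypothetical 10-column frame where one row has 8 NaNs and another has 2:

```python
import numpy as np
import pandas as pd

# Row 0 has 8 NaNs, row 1 has 2
df = pd.DataFrame([[np.nan] * 8 + [1.0, 2.0],
                   [np.nan, np.nan] + [1.0] * 8])
# Keep rows with fewer than 7 NaNs: row 0 is dropped, row 1 survives
filtered = df[df.isnull().sum(axis=1) < 7]
```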
How to drop entire record if more than 90% of features have missing value in pandas
You can use df.dropna() and set the thresh parameter to the minimum number of non-NA values a row must have to be kept, i.e. 10% of your columns (thresh=50 here assumes a 500-column frame):
df.dropna(axis=0, thresh=50, inplace=True)
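To avoid hard-coding the threshold, it can be derived from the column count. A small sketch with a hypothetical 10-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 10)))
df.iloc[0, :] = np.nan   # 100% missing -> dropped
df.iloc[1, 2:] = np.nan  # 80% missing  -> kept (2 non-NA values)

# thresh is the minimum number of non-NA values a row needs to survive:
# 10% of 10 columns -> 1
thresh = int(np.ceil(0.10 * df.shape[1]))
cleaned = df.dropna(axis=0, thresh=thresh)
```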
Remove % of Items in Columns
I would use dplyr here. If you want to use select() with logical conditions, you are probably looking for the where() selection helper in dplyr. It can be used like this: select(where(condition)).
I used an 80% threshold because 90% would keep all columns and therefore would not illustrate the solution as well.
library(dplyr)
df %>% select(where(~mean(is.na(.))<0.8))
It can also be done with base R and colMeans:
df[, c(TRUE, colMeans(is.na(df[-1]))<0.8)]
or with purrr:
library(purrr)
df %>% keep(~mean(is.na(.))<0.8)
Output:
gene cell1 cell3
1 a 0.4 NA
2 b NA 0.1
3 c 0.4 0.5
4 d NA 0.5
5 e 0.5 0.6
6 f 0.6 NA
Data
df<-data.frame(gene=letters[1:6],
cell1=c(0.4, NA, 0.4, NA, 0.5, 0.6),
cell2=c(0.1, rep(NA, 5)),
cell3=c(NA, 0.1, 0.5, 0.5, 0.6, NA))
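For comparison, the same column filter can be sketched in pandas on the same data (rebuilt here as a hypothetical DataFrame):

```python
import numpy as np
import pandas as pd

# The same gene/cell data as above, rebuilt in pandas
df = pd.DataFrame({
    "gene": list("abcdef"),
    "cell1": [0.4, np.nan, 0.4, np.nan, 0.5, 0.6],
    "cell2": [0.1] + [np.nan] * 5,
    "cell3": [np.nan, 0.1, 0.5, 0.5, 0.6, np.nan],
})
# Keep columns where less than 80% of values are missing
# (drops cell2, which is 5/6 missing)
kept = df.loc[:, df.isna().mean() < 0.8]
```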
R: deleting columns where certain percentage of values is missing
x <- sample.df[sapply(sample.df, function(x) sum(is.na(x)) / length(x)) < 0.1]
Remove rows with missing data conditionally
You could use rowMeans():
df = read.table(text=' V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 6 7 2 1 2 3 4 1
2 5 5 4 3 2 1 3 7 6 7
3 6 6 NA NA NA NA NA NA NA NA
4 5 2 2 1 7 NA NA NA NA NA
5 7 NA NA NA NA NA NA NA NA NA')
df[rowMeans(is.na(df))<.8,]
Output:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 6 7 2 1 2 3 4 1
2 5 5 4 3 2 1 3 7 6 7
4 5 2 2 1 7 NA NA NA NA NA
Hope this helps!
Remove columns with more than x negative values
Using colSums
x=2
df = df[,colSums(df<0)<=x]
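The pandas counterpart of this colSums filter would count negatives per column and keep only columns at or under the threshold (a sketch with hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, -2, -3], "b": [1, 2, -3], "c": [-1, -2, -3]})
x = 2
# (df < 0).sum() counts negatives per column; "c" has 3 and is dropped
kept = df.loc[:, (df < 0).sum() <= x]
```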
Remove rows with more than percentage of missing data for majority class samples only
In my opinion, concatenating the two DataFrames is not such a bad idea, but if you do not like it, here's my suggestion.
import numpy as np

mask_majority = df.eval("y == 'No'")
mask_missing = df.isna().sum(axis="columns") >= x
mask_drop = np.logical_and(mask_majority, mask_missing)
mask_keep = np.logical_not(mask_drop)
dfWithDroppedRows = df.loc[mask_keep, :]
Basically, I create a mask for the majority class and a mask for all rows with x or more missing values. Then I combine the two masks to get a mask of all the rows that must not be dropped, and select the DataFrame containing only those rows with .loc.
By the way, if you decide to use the initial solution of concatenating the two DataFrames, I would use the query method instead; it is more idiomatic (note that thresh is the minimum number of non-NA values required to keep a row, not a count of missing values):
df_majority_droppedRows = df.query("y == 'No'").dropna(thresh=x)
df_minority = df.query("y == 'Yes'")
dfWithDroppedRows = pd.concat([df_majority_droppedRows, df_minority])
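A small end-to-end check of the mask approach, on hypothetical data where "No" is the majority class and x = 2:

```python
import numpy as np
import pandas as pd

# Row 1 ("No" with 2 missing values) should be dropped;
# row 2 ("Yes" with 2 missing values) belongs to the minority class and is kept
df = pd.DataFrame({
    "f1": [1.0, np.nan, np.nan, np.nan],
    "f2": [1.0, np.nan, np.nan, 2.0],
    "y":  ["No", "No", "Yes", "Yes"],
})
x = 2
mask_majority = df["y"] == "No"
mask_missing = df.isna().sum(axis="columns") >= x
kept = df.loc[~(mask_majority & mask_missing)]
```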