Delete columns/rows with more than x% missing
To remove columns above a given fraction of NA, you can use colMeans(is.na(...)):
## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)
## Keep columns with more than 50% non-NA (i.e. drop those that are at least 50% NA)
dat[, which(colMeans(!is.na(dat)) > 0.5)]
## Keep rows with more than 50% non-NA
dat[which(rowMeans(!is.na(dat)) > 0.5), ]
## Apply both filters at once
dat[which(rowMeans(!is.na(dat)) > 0.5), which(colMeans(!is.na(dat)) > 0.5)]
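The same keep-if-mostly-non-missing masks translate directly to pandas. A minimal sketch with made-up data (the column names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Small frame with NAs scattered in (hypothetical data)
df = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],       # 50% NA
    "b": [1, 2, 3, np.nan],            # 25% NA
    "c": [np.nan, np.nan, np.nan, 4],  # 75% NA
})

# Keep columns where more than 50% of values are non-NA
cols_kept = df.loc[:, df.notna().mean() > 0.5]

# Keep rows where more than 50% of values are non-NA
rows_kept = df.loc[df.notna().mean(axis=1) > 0.5]
```

As in the R version, a column that is exactly 50% NA (like "a" above) fails the strict `> 0.5` test and is dropped.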
R: deleting columns where certain percentage of values is missing
x <- sample.df[sapply(sample.df, function(x) mean(is.na(x))) < 0.1]
Remove all factor variables with more than 50% NA
Try:
dff[colMeans(is.na(dff)) <= 0.5]
Should get:
Num2 Fact2
23 BBxv
456 BBxz
3 BBxx
100 BBxy
NA <NA>
Edit:
If you're looking to remove columns with more than 50% of zeros in the same process, give the following a try:
dff[colMeans(is.na(dff)) <= 0.5 & colMeans((dff == 0), na.rm = T) <= 0.5]
I hope this helps.
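The combined NA-and-zeros filter has a close pandas analogue. A sketch with invented data; dividing the zero count by `df.count()` mimics the `na.rm = TRUE` behaviour of the R version (NAs are excluded from the zero fraction's denominator):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_na":   [np.nan, np.nan, np.nan, 1.0],  # 75% NA -> drop
    "mostly_zero": [0, 0, 0, 5],                   # 75% zeros -> drop
    "ok":          [1, 2, np.nan, 4],              # passes both checks
})

# Keep columns with at most 50% NA and at most 50% zeros
# (zero fraction computed over non-NA values only, like na.rm = TRUE)
mask = (df.isna().mean() <= 0.5) & ((df == 0).sum() / df.count() <= 0.5)
filtered = df.loc[:, mask]
```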
How to delete columns with at least 20% missing values
You can use boolean indexing on the columns, keeping those where the count of notnull values is larger than 80% of the length:
df.loc[:, pd.notnull(df).sum()>len(df)*.8]
This is useful for many cases; e.g., keeping only the columns where more than 80% of the values are larger than 1 would be:
df.loc[:, (df > 1).sum() > len(df) * .8]
Alternatively, for the .dropna() case, you can also specify the thresh keyword of .dropna(), as illustrated by @EdChum:
df.dropna(thresh=0.8*len(df), axis=1)
The latter will be slightly faster:
df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
for col in df:
df.loc[np.random.choice(list(range(100)), np.random.randint(10, 30)), col] = np.nan
%timeit df.loc[:, pd.notnull(df).sum()>len(df)*.8]
1000 loops, best of 3: 716 µs per loop
%timeit df.dropna(thresh=0.8*len(df), axis=1)
1000 loops, best of 3: 537 µs per loop
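To make the thresh behaviour concrete, here is a small sketch with invented data. Note that recent pandas versions require thresh to be an integer, so it is safer to cast the cutoff explicitly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],        # complete -> kept
    "B": [1.0, np.nan, np.nan, 4.0, 5.0],  # 60% non-NA -> dropped
    "C": [1.0, np.nan, 3.0, 4.0, 5.0],     # 80% non-NA -> kept
})

# Keep columns with at least 80% non-NA values; cast thresh to int,
# which recent pandas versions require
kept = df.dropna(thresh=int(0.8 * len(df)), axis=1)
```

Because thresh counts the minimum number of non-NA values required, a column sitting exactly at the 80% mark (like "C") is kept.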
Drop Columns with more than 60 Percent of empty Values in Pandas
Use DataFrame.isin to flag all of the "empty" formats, then take the mean per column to get the fraction of empty values as the threshold, and filter by boolean indexing with loc:
print (df.isin([' ','NULL',0]))
c1 c2 c3 c4
0 False True True True
1 False False True False
2 True True True False
3 False True True False
4 True True True False
5 False True False False
6 False True False False
7 False True True False
8 False True True False
print (df.isin([' ','NULL',0]).mean())
c1 0.222222
c2 0.888889
c3 0.777778
c4 0.111111
dtype: float64
df = df.loc[:, df.isin([' ','NULL',0]).mean() < .6]
print (df)
c1 c4
0 Test1 NULL
1 Test2 Test2
2 NULL Test1
3 Test3 Test1
4 Test2
5 Test4 Test2
6 Test4 Test1
7 Test1 Test1
8 Test3 Test2
For loop over columns in R
You are much better off using vectorized calculations rather than the more literal for loop.
na50 <- sapply(df_r, function(x) sum(is.na(x))) / nrow(df_r)
df_r[na50 > 0.5] <- NULL
# or
df_r <- df_r[na50 <= 0.5]
Deleting rows with missing data: how to omit rows from a data frame with missing values in either column
There are a few good ways of doing this, which have been well described elsewhere on SO. However, to use your example:
I think na.omit is probably the simplest option for your purpose:
na.omit(DF)
# rater.1 rater.2
# 1 1 1
# 4 3 2
# 5 2 3
There's also complete.cases, which is a bit longer but allows you to restrict the NA search to specific columns. While this wasn't required in this question, for completeness it might help to know. For example, if you only wanted to remove rows with NA in rater.1:
DF[complete.cases(DF$rater.1),]
# rater.1 rater.2
# 1 1 1
# 2 4 NA
# 4 3 2
# 5 2 3
Also, tidyr has drop_na, which might be the easiest if you're already operating in the tidyverse, and it has the same benefit as using complete.cases:
library(tidyverse)
DF %>% tidyr::drop_na(rater.1)
# rater.1 rater.2
# 1 1 1
# 2 4 NA
# 3 3 2
# 4 2 3
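For pandas users, both behaviours map onto .dropna(): with no arguments it acts like na.omit, and with subset= it restricts the NA search to given columns, like complete.cases on a single column. A sketch mirroring the rater data above (column names adapted to valid Python identifiers):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rater1": [1.0, 4.0, np.nan, 3.0, 2.0],
    "rater2": [1.0, np.nan, 2.0, 2.0, 3.0],
})

# Drop rows with NA in any column (like na.omit)
all_complete = df.dropna()

# Drop rows with NA only in rater1 (like complete.cases on one column)
rater1_complete = df.dropna(subset=["rater1"])
```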
Removal of constant columns in R
The problem here is that your column variance is equal to zero. You can check which columns of a data frame are constant this way, for example:
df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"
So if you want to exclude these columns, you can use :
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]
EDIT: In fact it is simpler to use apply instead. Something like this:
df[,apply(df, 2, var, na.rm=TRUE) != 0]
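A pandas sketch of the same idea: rather than testing for zero variance, counting distinct values with nunique also catches constant non-numeric columns (the data below is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 1, 1, 1, 1],  # constant -> drop
})

# A column is constant when it has a single distinct value; nunique avoids
# relying on variance, so it also works for string/categorical columns
nonconstant = df.loc[:, df.nunique(dropna=False) > 1]
```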