Deleting Columns from a data.frame Where NA Is More Than 15% of the Column Length

Delete columns/rows with more than x% missing

To remove columns with more than a given share of NA values, you can use colMeans(is.na(...)), which returns the fraction of missing values in each column.

## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)

## Drop columns that are 50% or more NA
dat[, which(colMeans(!is.na(dat)) > 0.5)]

## Drop rows that are 50% or more NA
dat[which(rowMeans(!is.na(dat)) > 0.5), ]

## Drop both rows and columns that are 50% or more NA
dat[which(rowMeans(!is.na(dat)) > 0.5), which(colMeans(!is.na(dat)) > 0.5)]
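The question title mentions a 15% cutoff rather than 50%; the same pattern applies, only the threshold changes. A minimal sketch, reusing the dat created above (with this half-missing example, most or all columns will be dropped):

## Drop columns where more than 15% of the values are NA
dat[, colMeans(is.na(dat)) <= 0.15]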

R: deleting columns where certain percentage of values is missing


x <- sample.df[lapply(sample.df, function(x) sum(is.na(x)) / length(x)) < 0.1]
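The sample.df used in that one-liner comes from the original question and is not shown here; a small hypothetical stand-in makes it runnable:

## Hypothetical data: a has 0% NA, b has 10% NA, c has 20% NA
sample.df <- data.frame(
  a = 1:10,
  b = c(NA, 2:10),
  c = c(NA, NA, 3:10)
)

x <- sample.df[lapply(sample.df, function(x) sum(is.na(x)) / length(x)) < 0.1]
names(x)
# [1] "a"

The comparison works because the list returned by lapply() is coerced to an atomic vector when compared against a scalar; sapply() would make that coercion explicit.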

Remove all factor variables with more than 50% NA

Try:

dff[colMeans(is.na(dff)) <= 0.5]

Should get:

Num2 Fact2
  23  BBxv
 456  BBxz
   3  BBxx
 100  BBxy
  NA  <NA>
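For reference, dff itself is not reproduced above. A hypothetical frame consistent with that output (the names and contents of the dropped columns are guesses):

## Hypothetical reconstruction, not the asker's original data
dff <- data.frame(
  Num1  = c(NA, NA, NA, 7, NA),
  Fact1 = factor(c(NA, "A", NA, NA, NA)),
  Num2  = c(23, 456, 3, 100, NA),
  Fact2 = factor(c("BBxv", "BBxz", "BBxx", "BBxy", NA))
)

colMeans(is.na(dff))
#  Num1 Fact1  Num2 Fact2
#   0.8   0.8   0.2   0.2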

Edit:

If you also want to remove columns where more than 50% of the (non-missing) values are zero in the same step, give the following a try:

dff[colMeans(is.na(dff)) <= 0.5 & colMeans((dff == 0), na.rm = T) <= 0.5]

I hope this helps.

How to delete columns with at least 20% missing values

You can use boolean indexing on the columns, keeping those where the count of non-null values is larger than 80% of the number of rows:

df.loc[:, pd.notnull(df).sum()>len(df)*.8]

This pattern is useful for other conditions as well; e.g., keeping only the columns where more than 80% of the values are larger than 1 would be:

df.loc[:, (df > 1).sum() > len(df) * .8]

Alternatively, you can specify the thresh keyword of .dropna(), as illustrated by @EdChum:

df.dropna(thresh=0.8*len(df), axis=1)

The latter will be slightly faster:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
for col in df:
    df.loc[np.random.choice(list(range(100)), np.random.randint(10, 30)), col] = np.nan

%timeit df.loc[:, pd.notnull(df).sum()>len(df)*.8]
1000 loops, best of 3: 716 µs per loop

%timeit df.dropna(thresh=0.8*len(df), axis=1)
1000 loops, best of 3: 537 µs per loop

Drop Columns with More than 60 Percent Empty Values in Pandas

Use DataFrame.isin to check all the "empty" formats at once, then take the column means for the threshold and filter by boolean indexing with loc:

print (df.isin([' ','NULL',0]))
      c1     c2     c3     c4
0  False   True   True   True
1  False  False   True  False
2   True   True   True  False
3  False   True   True  False
4   True   True   True  False
5  False   True  False  False
6  False   True  False  False
7  False   True   True  False
8  False   True   True  False

print (df.isin([' ','NULL',0]).mean())
c1    0.222222
c2    0.888889
c3    0.777778
c4    0.111111
dtype: float64

df = df.loc[:, df.isin([' ','NULL',0]).mean() < .6]
print (df)
      c1     c4
0  Test1   NULL
1  Test2  Test2
2   NULL  Test1
3  Test3  Test1
4         Test2
5  Test4  Test2
6  Test4  Test1
7  Test1  Test1
8  Test3  Test2

For loop over columns in R

You are much better off using vectorized calculations instead of a literal for loop.

na50 <- sapply(df_r, function(x) sum(is.na(x))) / nrow(df_r)
df_r[na50 > 0.5] <- NULL
# or
df_r <- df_r[na50 <= 0.5]
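df_r comes from the original question and is not shown here; a quick hypothetical frame to see the filter in action:

## Hypothetical data: b is 60% NA and should be dropped
df_r <- data.frame(
  a = 1:10,
  b = c(rep(NA, 6), 7:10),
  c = c(NA, 2:10)
)

na50 <- sapply(df_r, function(x) sum(is.na(x))) / nrow(df_r)
df_r <- df_r[na50 <= 0.5]
names(df_r)
# [1] "a" "c"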

Deleting rows with missing data. How to omit rows from a data frame with missing values in either column

There are a few good ways of doing this, which have been well described elsewhere on SO. To use them with your example here:
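The DF from the question is not reproduced in this excerpt; a data frame consistent with the printed results below would be:

## Reconstructed to match the outputs shown below, not the asker's original object
DF <- data.frame(
  rater.1 = c(1, 4, NA, 3, 2),
  rater.2 = c(1, NA, 5, 2, 3)  # the row-3 value is a guess; that row is dropped either way
)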

I think na.omit is probably the simplest option for your purpose:

na.omit(DF)

#   rater.1 rater.2
# 1       1       1
# 4       3       2
# 5       2       3

There's also complete.cases, which is a bit longer but lets you restrict the NA search to specific columns. It wasn't required in this question, but it's worth knowing. For example, if you only wanted to remove rows with NA in rater.1:

DF[complete.cases(DF$rater.1),]

#   rater.1 rater.2
# 1       1       1
# 2       4      NA
# 4       3       2
# 5       2       3

tidyr also has drop_na, which might be the easiest option if you're already working in the tidyverse; it offers the same column-restriction benefit as complete.cases:

library(tidyverse)
DF %>% tidyr::drop_na(rater.1)

#   rater.1 rater.2
# 1       1       1
# 2       4      NA
# 3       3       2
# 4       2       3

Removal of constant columns in R

The problem here is that your column variance is equal to zero. You can check which columns of a data frame are constant this way, for example:

df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1

# Get the names of the columns that have zero variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"

So if you want to exclude these columns, you can use:

df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]

EDIT: In fact, it is simpler to use apply instead. Something like this:

df[,apply(df, 2, var, na.rm=TRUE) != 0]
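One detail worth noting: with this example only one column survives, and single-bracket indexing with a comma then drops the result to a plain vector. A short sketch that keeps the data-frame shape; note also that apply() coerces the frame to a matrix first, so the sapply() form above is the safer choice when non-numeric columns are present:

## Keep the result as a data frame even if only one column remains
df[, apply(df, 2, var, na.rm = TRUE) != 0, drop = FALSE]
#   x
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5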

