Delete columns/rows with more than x% missing
To remove columns above a given fraction of NA, you can use colMeans(is.na(...)):
## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)
## Keep columns with more than 50% non-NA (i.e. drop those that are at least 50% NA)
dat[, which(colMeans(!is.na(dat)) > 0.5)]
## Keep rows with more than 50% non-NA
dat[which(rowMeans(!is.na(dat)) > 0.5), ]
## Apply both filters at once
dat[which(rowMeans(!is.na(dat)) > 0.5), which(colMeans(!is.na(dat)) > 0.5)]
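The same keep-if-mostly-non-missing masks translate directly to pandas. A minimal sketch with made-up data (the column names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Small frame with NAs scattered in (hypothetical data)
df = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],       # 50% NA
    "b": [1, 2, 3, np.nan],            # 25% NA
    "c": [np.nan, np.nan, np.nan, 4],  # 75% NA
})

# Keep columns where more than 50% of values are non-NA
cols_kept = df.loc[:, df.notna().mean() > 0.5]

# Keep rows where more than 50% of values are non-NA
rows_kept = df.loc[df.notna().mean(axis=1) > 0.5]
```

As in the R version, a column that is exactly 50% NA (like "a" above) fails the strict `> 0.5` test and is dropped.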
R: deleting columns where certain percentage of values is missing
x <- sample.df[sapply(sample.df, function(x) mean(is.na(x))) < 0.1]
Remove all factor variables with more than 50% NA
Try:
dff[colMeans(is.na(dff)) <= 0.5]
Should get:
Num2 Fact2
23 BBxv
456 BBxz
3 BBxx
100 BBxy
NA <NA>
Edit:
If you're looking to remove columns with more than 50% of zeros in the same process, give the following a try:
dff[colMeans(is.na(dff)) <= 0.5 & colMeans((dff == 0), na.rm = T) <= 0.5]
I hope this helps.
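The combined NA-and-zeros filter has a close pandas analogue. A sketch with invented data; dividing the zero count by `df.count()` mimics the `na.rm = TRUE` behaviour of the R version (NAs are excluded from the zero fraction's denominator):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_na":   [np.nan, np.nan, np.nan, 1.0],  # 75% NA -> drop
    "mostly_zero": [0, 0, 0, 5],                   # 75% zeros -> drop
    "ok":          [1, 2, np.nan, 4],              # passes both checks
})

# Keep columns with at most 50% NA and at most 50% zeros
# (zero fraction computed over non-NA values only, like na.rm = TRUE)
mask = (df.isna().mean() <= 0.5) & ((df == 0).sum() / df.count() <= 0.5)
filtered = df.loc[:, mask]
```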
How to delete columns with at least 20% missing values
You can use boolean indexing on the columns, keeping those where the count of notnull values is larger than 80% of the length:
df.loc[:, pd.notnull(df).sum()>len(df)*.8]
This is useful for many cases; e.g., keeping only the columns where more than 80% of the values are larger than 1 would be:
df.loc[:, (df > 1).sum() > len(df) * .8]
Alternatively, for the .dropna() case, you can also specify the thresh keyword of .dropna(), as illustrated by @EdChum:
df.dropna(thresh=0.8*len(df), axis=1)
The latter will be slightly faster:
df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
for col in df:
df.loc[np.random.choice(list(range(100)), np.random.randint(10, 30)), col] = np.nan
%timeit df.loc[:, pd.notnull(df).sum()>len(df)*.8]
1000 loops, best of 3: 716 µs per loop
%timeit df.dropna(thresh=0.8*len(df), axis=1)
1000 loops, best of 3: 537 µs per loop
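To make the thresh behaviour concrete, here is a small sketch with invented data. Note that recent pandas versions require thresh to be an integer, so it is safer to cast the cutoff explicitly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],        # complete -> kept
    "B": [1.0, np.nan, np.nan, 4.0, 5.0],  # 60% non-NA -> dropped
    "C": [1.0, np.nan, 3.0, 4.0, 5.0],     # 80% non-NA -> kept
})

# Keep columns with at least 80% non-NA values; cast thresh to int,
# which recent pandas versions require
kept = df.dropna(thresh=int(0.8 * len(df)), axis=1)
```

Because thresh counts the minimum number of non-NA values required, a column sitting exactly at the 80% mark (like "C") is kept.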
Drop Columns with more than 60 Percent of empty Values in Pandas
Use DataFrame.isin to flag all of the "empty" formats, then take the mean per column to get the fraction of empty values as the threshold, and filter by boolean indexing with loc:
print (df.isin([' ','NULL',0]))
c1 c2 c3 c4
0 False True True True
1 False False True False
2 True True True False
3 False True True False
4 True True True False
5 False True False False
6 False True False False
7 False True True False
8 False True True False
print (df.isin([' ','NULL',0]).mean())
c1 0.222222
c2 0.888889
c3 0.777778
c4 0.111111
dtype: float64
df = df.loc[:, df.isin([' ','NULL',0]).mean() < .6]
print (df)
c1 c4
0 Test1 NULL
1 Test2 Test2
2 NULL Test1
3 Test3 Test1
4 Test2
5 Test4 Test2
6 Test4 Test1
7 Test1 Test1
8 Test3 Test2
For loop over columns in R
You are much better off using vectorized calculations rather than the more literal for loop.
na50 <- sapply(df_r, function(x) sum(is.na(x))) / nrow(df_r)
df_r[na50 > 0.5] <- NULL
# or
df_r <- df_r[na50 <= 0.5]
Deleting rows with missing data: how to omit rows from a data frame with missing values in either column
There are a few good ways of doing this, which have been well described elsewhere on SO. However, to use your example:
I think na.omit is probably the simplest option for your purpose:
na.omit(DF)
# rater.1 rater.2
# 1 1 1
# 4 3 2
# 5 2 3
There's also complete.cases, which is a bit longer but allows you to restrict the NA search to specific columns. While this wasn't required in this question, for completeness it might help to know. For example, if you only wanted to remove rows with NA in rater.1:
DF[complete.cases(DF$rater.1),]
# rater.1 rater.2
# 1 1 1
# 2 4 NA
# 4 3 2
# 5 2 3
Also, tidyr has drop_na, which might be the easiest if you're already operating in the tidyverse, and it has the same benefit as using complete.cases:
library(tidyverse)
DF %>% tidyr::drop_na(rater.1)
# rater.1 rater.2
# 1 1 1
# 2 4 NA
# 3 3 2
# 4 2 3
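For pandas users, both behaviours map onto .dropna(): with no arguments it acts like na.omit, and with subset= it restricts the NA search to given columns, like complete.cases on a single column. A sketch mirroring the rater data above (column names adapted to valid Python identifiers):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rater1": [1.0, 4.0, np.nan, 3.0, 2.0],
    "rater2": [1.0, np.nan, 2.0, 2.0, 3.0],
})

# Drop rows with NA in any column (like na.omit)
all_complete = df.dropna()

# Drop rows with NA only in rater1 (like complete.cases on one column)
rater1_complete = df.dropna(subset=["rater1"])
```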
Removal of constant columns in R
The problem here is that your column variance is equal to zero. You can check which columns of a data frame are constant this way, for example:
df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"
So if you want to exclude these columns, you can use :
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]
EDIT: In fact it is simpler to use apply instead. Something like this:
df[,apply(df, 2, var, na.rm=TRUE) != 0]
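A pandas sketch of the same idea: rather than testing for zero variance, counting distinct values with nunique also catches constant non-numeric columns (the data below is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 1, 1, 1, 1],  # constant -> drop
})

# A column is constant when it has a single distinct value; nunique avoids
# relying on variance, so it also works for string/categorical columns
nonconstant = df.loc[:, df.nunique(dropna=False) > 1]
```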