Remove Columns of Dataframe Based on Conditions in R

I feel like this is all over-complicated. Condition 2 already covers the other conditions: if a column has at least two non-NA values, it obviously isn't all NA, and if it has at least two consecutive values, it obviously contains more than one value. So instead of three conditions, this all boils down to a single one. (I prefer not to run many functions per column; rather, run diff on each column and then vectorize the rest.)

cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1

This works because diff returns NA for every pair of neighbours in which either value is NA, so a column with no two consecutive non-NA values produces nrow(df) - 1 NAs.

Then just subset:

df[, cond, drop = FALSE]
# A E
# 1 0.018 NA
# 2 0.017 NA
# 3 0.019 NA
# 4 0.018 NA
# 5 0.018 NA
# 6 0.015 0.037
# 7 0.016 0.031
# 8 0.019 0.025
# 9 0.016 0.035
# 10 0.018 0.035
# 11 0.017 0.043
# 12 0.023 0.040
# 13 0.022 0.042
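
To see why a single diff-based test is enough, here is a minimal sketch on made-up data (df_toy and its columns are hypothetical, not the asker's data):

```r
# Hypothetical toy data, not the data from the question
df_toy <- data.frame(
  A = c(1, 2, 3, 4),      # no NAs at all
  B = c(NA, 5, NA, 6),    # two non-NA values, but never adjacent
  C = rep(NA_real_, 4),   # all NA
  D = c(NA, 7, 8, NA)     # exactly one adjacent non-NA pair
)

# diff() yields NA whenever either neighbour is NA
na_diffs <- colSums(is.na(sapply(df_toy, diff)))

# Keep a column only if at least one of its diffs is non-NA
cond_toy <- na_diffs < nrow(df_toy) - 1
kept <- df_toy[, cond_toy, drop = FALSE]
names(kept)
```

Only A and D survive here. Note that the sketch relies on sapply returning a matrix, which needs at least three rows; with two rows sapply(df, diff) simplifies to a plain vector and colSums would fail.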

Per your edit, it seems you have a data.table object that also has a Date column, so the code needs some modifications:

cond <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1] 
df[, c(TRUE, cond), with = FALSE]

Some explanations:

  • We want to ignore the first column in our calculations, so we specify .SDcols = -1 when operating on .SD (short for Subset of Data in data.table)
  • .N is just the row count (similar to nrow(df))
  • The next step is to subset by the condition. We must not forget to keep the first column too, so we start with c(TRUE, ...)
  • Finally, data.table uses non-standard evaluation in j by default, so if you want to select columns as you would in a data.frame you need to specify with = FALSE

A better way, though, would be to remove the columns by reference using := NULL:

cond <- c(FALSE, df[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
df[, which(cond) := NULL]
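
A small end-to-end sketch of both variants, assuming the data.table package is installed. dt_toy is hypothetical, and I use sapply instead of lapply here so that the comparison runs on a plain named vector:

```r
library(data.table)

# Hypothetical toy table: a Date column followed by value columns
dt_toy <- data.table(
  Date = as.Date("2020-01-01") + 0:3,
  A = c(1, 2, 3, 4),
  B = c(NA, 5, NA, 6),   # no two consecutive non-NA values
  D = c(NA, 7, 8, NA)
)

# TRUE for value columns whose diffs are all NA (Date is skipped)
all_na_diff <- dt_toy[, sapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1,
                      .SDcols = -1]

# Variant 1: return a copy without those columns
kept <- dt_toy[, c(TRUE, !all_na_diff), with = FALSE]

# Variant 2: drop them from dt_toy itself, by reference
dt_toy[, which(c(FALSE, all_na_diff)) := NULL]
names(dt_toy)
```

Both variants end up with Date, A and D. The by-reference variant avoids copying the table, which matters mostly for large data.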

R: delete columns from data.frame if condition fulfilled

That should be quite easily accomplished with the following command:

df[colMeans(df)==1] <- NULL
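
For illustration, assuming every column is a numeric 0/1 indicator (the question's data is not shown here, so df_bin is made up), a column mean of exactly 1 means every value in that column is 1. For other kinds of data a mean of 1 would not imply that.

```r
# Hypothetical 0/1 data
df_bin <- data.frame(
  x = c(1, 1, 1),   # all ones, so the mean is 1
  y = c(0, 1, 1),
  z = c(1, 0, 1)
)

# Assigning NULL to the selected columns deletes them
df_bin[colMeans(df_bin) == 1] <- NULL
names(df_bin)
```

This leaves y and z. colMeans needs numeric columns, and an NA in a column makes its mean NA, so this sketch assumes complete numeric data.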

Removing columns from a data.table in R based on conditions

library(data.table)

dt = data.table("col1" = "a", "col2" = "b", "col3" = "c",
                "col4" = 'd', "col5" = "e", "col6" = 9, "col7" = 0, "col8" = 7,
                "col9" = 0, "col10" = 99)

not0 = function(x) is.numeric(x) && !anyNA(x) && all(x!=0)
dt[, .(
  ## your categorical columns
  col1, col2, col3, col4, col5,
  ## new column pasted from non-0 numeric columns
  new = as.numeric(paste0(unlist(.SD), collapse=""))
),
## this filters columns to be provided in .SD column subset
.SDcols = not0,
## we group by each row so it will handle input of multiple rows
by = .(row=seq_len(nrow(dt)))
][, row:=NULL ## this removes extra grouping column
][] ## this prints
# col1 col2 col3 col4 col5 new
#1: a b c d e 9799

Alternatively, if you want to update the existing table in place:

is0 = function(x) is.numeric(x) && !anyNA(x) && all(x==0)
## remove numeric columns that contain only 0
dt[, which(sapply(dt, is0)) := NULL]

## add new column
dt[, new := as.numeric(
  paste0(unlist(.SD), collapse="")
), .SDcols=is.numeric, by=.(row=seq_len(nrow(dt)))
][]
# col1 col2 col3 col4 col5 col6 col8 col10 new
#1: a b c d e 9 7 99 9799

Remove a column in dataframe if a particular value meets a condition in R

You probably want an if clause.

df1 <- if (df1[nrow(df1), 2] < 4) {
  df1[, -2, drop=FALSE]
} else {
  df1
}
df1
# V1
# 1 1
# 2 2

Using column names:

n <- 'V2'
df1 <- if (df1[nrow(df1), n] < 4) {
  df1[, setdiff(names(df1), n), drop=FALSE]
} else {
  df1
}
df1
# V1
# 1 1
# 2 2
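
The same logic can be wrapped in a small helper. This is only a sketch: drop_col_if_last_below and its arguments are names I made up, and the df1 below is a guess at the shape of the data (two columns, last value of V2 below 4):

```r
# Hypothetical helper: drop `col` when its last value is below `threshold`
drop_col_if_last_below <- function(df, col, threshold) {
  if (df[nrow(df), col] < threshold) {
    df[, setdiff(names(df), col), drop = FALSE]
  } else {
    df
  }
}

df1 <- data.frame(V1 = 1:2, V2 = c(5, 3))   # assumed example data

dropped   <- drop_col_if_last_below(df1, "V2", 4)  # last value 3 < 4: V2 removed
unchanged <- drop_col_if_last_below(df1, "V2", 2)  # 3 >= 2: df1 returned as is
```

If the last value can be NA, the if condition will fail with an error, so you would need to handle that case first (for example with isTRUE()).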

Drop multiple columns based on a condition

We can use colSums and keep the columns that have at least 2 values greater than 0. We use [-1] here to skip the Date column and check the greater-than-0 condition on the remaining columns.

cbind(df[1], df[-1][colSums(df[-1] > 0) >= 2])

# Date Item2 Item3
#1 10/10/12 1 1
#2 10/11/12 5 2
#3 10/12/12 3 0
#4 10/13/12 2 0
#5 10/14/12 2 0

Item1 and Item4 columns are removed since both of them have only one observation greater than 0.
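
A self-contained version to try this out. Item2 and Item3 match the output above; the Item1 and Item4 values are invented (the original input isn't shown), chosen so that each has a single value greater than 0:

```r
df_items <- data.frame(
  Date  = c("10/10/12", "10/11/12", "10/12/12", "10/13/12", "10/14/12"),
  Item1 = c(0, 0, 1, 0, 0),   # invented values
  Item2 = c(1, 5, 3, 2, 2),
  Item3 = c(1, 2, 0, 0, 0),
  Item4 = c(0, 0, 0, 0, 2)    # invented values
)

# Number of values > 0 in each non-Date column
n_pos <- colSums(df_items[-1] > 0)

# Keep Date plus every column with at least two positive values
res <- cbind(df_items[1], df_items[-1][n_pos >= 2])
names(res)
```

The counts come out as 1, 5, 2 and 1, so only Item2 and Item3 are kept next to Date.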


Another option is select_if from dplyr using the same logic

library(dplyr)
bind_cols(df[1], df[-1] %>% select_if(~ sum(. > 0) >= 2))

Remove multiple columns and replace values of columns of dataframe based on condition in R

Here's a similar approach (perhaps more vectorized?)

is.na(df[-1]) <- df[-1] < 1 # Convert all values < 1 to NAs.
df[colSums(is.na(df)) != nrow(df)] # Select only the columns that have values.
# Date A C
# 1 01/01/2000 NA NA
# 2 02/01/2000 NA NA
# 3 03/01/2000 NA NA
# 4 04/01/2000 NA NA
# 5 05/01/2000 5 NA
# 6 06/01/2000 6 1
# 7 07/01/2000 7 1
# 8 08/01/2000 8 NA
# 9 09/01/2000 9 NA

Alternatively, the second step could be:

df[c(TRUE, colSums(df[-1], na.rm = TRUE) > 0)]
## OR
## df[c(TRUE, sapply(df[-1], sum, na.rm = TRUE) > 0)] # as already suggested
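
Here is a minimal sketch of both variants on made-up data (df_toy is hypothetical; the real data presumably has more rows and columns):

```r
# Hypothetical toy data with a leading Date column
df_toy <- data.frame(
  Date = c("01/01/2000", "02/01/2000", "03/01/2000"),
  A = c(0, 5, 6),
  B = c(0, 0, 0.5),   # every value is < 1
  C = c(0, 1, 0)
)

# Step 1: turn every value below 1 into NA (Date is left alone)
is.na(df_toy[-1]) <- df_toy[-1] < 1

# Step 2, first variant: keep columns that are not entirely NA
kept1 <- df_toy[colSums(is.na(df_toy)) != nrow(df_toy)]

# Step 2, second variant: keep Date plus columns with a positive sum
kept2 <- df_toy[c(TRUE, colSums(df_toy[-1], na.rm = TRUE) > 0)]
```

Both variants keep Date, A and C. The second one is safe here because after step 1 every remaining value is at least 1, so a positive sum means the column has at least one non-NA value.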

