Remove columns of dataframe based on conditions in R
I feel like this is all over-complicated. Condition 2 already covers the other conditions: if there are at least two consecutive non-NA values in a column, then obviously the whole column isn't NA, and obviously the column contains more than one value. So instead of 3 conditions, this all sums up to a single condition (I prefer not to run many functions per column but rather, after running diff per column, to vectorize the whole thing):
cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1
This works because if a column has no two consecutive non-NA values, every element of its diff is NA.
Then, just
df[, cond, drop = FALSE]
# A E
# 1 0.018 NA
# 2 0.017 NA
# 3 0.019 NA
# 4 0.018 NA
# 5 0.018 NA
# 6 0.015 0.037
# 7 0.016 0.031
# 8 0.019 0.025
# 9 0.016 0.035
# 10 0.018 0.035
# 11 0.017 0.043
# 12 0.023 0.040
# 13 0.022 0.042
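To see why the single condition is enough, here is a minimal sketch on a made-up data frame (the name toy and its values are illustrative assumptions, not the asker's data). Only column A has two adjacent non-NA values, so only A should survive.

```r
# Hypothetical toy data, chosen to hit each of the three original conditions
toy <- data.frame(
  A = c(1, 2, NA, 4),    # 1 and 2 are adjacent non-NA values -> keep
  B = c(1, NA, 3, NA),   # two non-NA values, but never adjacent -> drop
  C = rep(NA_real_, 4),  # all NA -> drop
  D = c(NA, 5, NA, NA)   # a single value -> drop
)

# diff() returns NA wherever either neighbour is NA,
# so a column without adjacent values yields an all-NA diff
d <- sapply(toy, diff)

# Keep a column if at least one of its nrow - 1 differences is not NA
keep <- colSums(is.na(d)) < nrow(toy) - 1
toy[, keep, drop = FALSE]
```

Column B is the interesting case: it has two non-NA values, yet it is dropped because they are never next to each other.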
Per your edit, it seems like you have a data.table object and you also have a Date column, so the code needs some modifications.
cond <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1]
df[, c(TRUE, cond), with = FALSE]
Some explanations:
- We want to ignore the first column in our calculations, so we specify .SDcols = -1 when operating on .SD (which stands for Subset of Data in data.table)
- .N is just the row count (similar to nrow(df))
- Next step is to subset by condition. We must not forget to grab the first column too, so we start with c(TRUE, ...
- Finally, data.table works with non-standard evaluation by default; hence, if you want to select columns as you would in a data.frame, you need to specify with = FALSE
A better way, though, would be to just remove the columns by reference using := NULL:
cond <- c(FALSE, df[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
df[, which(cond) := NULL]
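The by-reference removal can be tried on a small table. This is only a sketch: toy, value_cols and no_run are made-up names, and it collects column names with vapply rather than the positional condition used above, which is a variant and not the code from the answer.

```r
library(data.table)

# Hypothetical toy table: a Date column followed by value columns
toy <- data.table(
  Date = as.Date("2020-01-01") + 0:3,
  A = c(1, 2, NA, 4),   # has two adjacent non-NA values -> kept
  B = c(1, NA, 3, NA)   # no adjacent non-NA values -> removed
)

# Every column except Date takes part in the check
value_cols <- setdiff(names(toy), "Date")

# TRUE for columns whose diff() is entirely NA
no_run <- vapply(
  value_cols,
  function(nm) all(is.na(diff(toy[[nm]]))),
  logical(1)
)

# Remove by reference; the parentheses make data.table
# evaluate the expression as a vector of column names
toy[, (value_cols[no_run]) := NULL]

names(toy)
```

Working with names instead of positions avoids having to pad the condition with a leading FALSE for the Date column.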
R: delete columns from data.frame if condition fulfilled
That should be quite easily accomplished with the following command:
df[colMeans(df)==1] <- NULL
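One caveat worth checking on your own data: a column mean of 1 does not by itself mean that every value is 1. The sketch below uses made-up columns and assumes the intent is to drop columns consisting only of 1s, which the original question may or may not have meant.

```r
# Hypothetical toy data
toy <- data.frame(
  x = c(1, 1, 1),  # all ones
  y = c(0, 1, 2),  # mean is 1, but not all ones
  z = c(1, 0, 1)   # mean below 1
)

mean_is_one <- colMeans(toy) == 1
all_ones <- vapply(toy, function(col) all(col == 1), logical(1))

toy[!mean_is_one]  # keeps z only
toy[!all_ones]     # keeps y and z
```

For strictly 0/1 indicator columns the two checks agree, so the colMeans version is fine there.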
Removing columns from a data.table in R based on conditions
dt = data.table("col1" = "a", "col2" = "b", "col3" = "c",
"col4" = 'd', "col5" = "e", "col6" = 9, "col7" = 0, "col8" = 7,
"col9" = 0, "col10" = 99)
not0 = function(x) is.numeric(x) && !anyNA(x) && all(x!=0)
dt[, .(
## your categorical columns
col1, col2, col3, col4, col5,
## new column pasted from non-0 numeric columns
new = as.numeric(paste0(unlist(.SD), collapse=""))
),
## this filters columns to be provided in .SD column subset
.SDcols = not0,
## we group by each row so it will handle input of multiple rows
by = .(row=seq_len(nrow(dt)))
][, row:=NULL ## this removes extra grouping column
][] ## this prints
# col1 col2 col3 col4 col5 new
#1: a b c d e 9799
Alternatively, if you want to update the existing table in place:
is0 = function(x) is.numeric(x) && !anyNA(x) && all(x==0)
## remove numeric columns that are all 0
dt[, which(sapply(dt, is0)) := NULL]
## add new column
dt[, new := as.numeric(
paste0(unlist(.SD), collapse="")
), .SDcols=is.numeric, by=.(row=seq_len(nrow(dt)))
][]
# col1 col2 col3 col4 col5 col6 col8 col10 new
#1: a b c d e 9 7 99 9799
Remove a column in dataframe if a particular value meets a condition in R
You probably want an if clause.
df1 <- if (df1[nrow(df1), 2] < 4) {
df1[, -2, drop=FALSE]
} else {
df1
}
df1
# V1
# 1 1
# 2 2
Using column names:
n <- 'V2'
df1 <- if (df1[nrow(df1), n] < 4) {
df1[, setdiff(names(df1), n), drop=FALSE]
} else {
df1
}
df1
# V1
# 1 1
# 2 2
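If the same check is needed in several places, the if clause can be wrapped in a small helper. This is a sketch only: drop_col_if_last_below and the toy data frame are hypothetical names, not part of the original answer.

```r
# Hypothetical helper: drop `col` when its last value is below `threshold`,
# otherwise return the data frame unchanged
drop_col_if_last_below <- function(data, col, threshold) {
  if (data[nrow(data), col] < threshold) {
    data[, setdiff(names(data), col), drop = FALSE]
  } else {
    data
  }
}

# Made-up data: the last value of V2 is 3
toy <- data.frame(V1 = 1:2, V2 = c(5, 3))

dropped <- drop_col_if_last_below(toy, "V2", 4)  # 3 < 4, so V2 is removed
kept    <- drop_col_if_last_below(toy, "V2", 2)  # 3 >= 2, so nothing changes
```

Keeping drop = FALSE matters here: without it, a one-column result would collapse to a plain vector.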
Drop multiple columns based on a condition
We can use colSums and keep the columns that have at least 2 values greater than 0. We use [-1] here to ignore the Date column and check the greater-than-0 condition for the remaining columns.
cbind(df[1], df[-1][colSums(df[-1] > 0) >= 2])
# Date Item2 Item3
#1 10/10/12 1 1
#2 10/11/12 5 2
#3 10/12/12 3 0
#4 10/13/12 2 0
#5 10/14/12 2 0
The Item1 and Item4 columns are removed since both of them have only one observation greater than 0.
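The counting step can be inspected on its own. A minimal sketch with made-up data (toy and counts are illustrative names, and the values are not the asker's):

```r
# Hypothetical toy data with a leading Date column
toy <- data.frame(
  Date  = c("10/10/12", "10/11/12", "10/12/12"),
  Item1 = c(0, 0, 4),  # one value > 0
  Item2 = c(1, 5, 3),  # three values > 0
  Item3 = c(1, 2, 0)   # two values > 0
)

# Number of values greater than 0 in each non-Date column
counts <- colSums(toy[-1] > 0)

# Re-attach Date and keep only columns with at least 2 positive values
result <- cbind(toy[1], toy[-1][counts >= 2])
names(result)
```

If the columns can contain NA, colSums(toy[-1] > 0, na.rm = TRUE) keeps the counts from turning into NA.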
Another option is select_if from dplyr, using the same logic:
library(dplyr)
# funs() is deprecated in current dplyr; a formula lambda does the same job
bind_cols(df[1], df[-1] %>% select_if(~ sum(. > 0) >= 2))
Remove multiple columns and replace values of columns of dataframe based on condition in R
Here's a similar approach (perhaps more vectorized?)
is.na(df[-1]) <- df[-1] < 1 # Convert all values < 1 to NAs.
df[colSums(is.na(df)) != nrow(df)] # Select only the columns that have values.
# Date A C
# 1 01/01/2000 NA NA
# 2 02/01/2000 NA NA
# 3 03/01/2000 NA NA
# 4 04/01/2000 NA NA
# 5 05/01/2000 5 NA
# 6 06/01/2000 6 1
# 7 07/01/2000 7 1
# 8 08/01/2000 8 NA
# 9 09/01/2000 9 NA
Or alternatively, second step could be
df[c(TRUE, colSums(df[-1], na.rm = TRUE) > 0)]
## OR
## df[c(TRUE, sapply(df[-1], sum, na.rm = TRUE) > 0)] # as already suggested
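The replacement form of is.na() may be unfamiliar, so here is a minimal sketch of both steps on made-up data (toy and its values are assumptions for illustration):

```r
# Hypothetical toy data with a leading Date column
toy <- data.frame(
  Date = c("01/01/2000", "02/01/2000", "03/01/2000"),
  A = c(0, 5, 6),    # one value below 1
  B = c(0, 0, 0),    # all values below 1
  C = c(0.5, 1, 0)   # two values below 1
)

# Step 1: every cell flagged TRUE in the logical mask becomes NA
is.na(toy[-1]) <- toy[-1] < 1

# Step 2: keep the columns that are not entirely NA
result <- toy[colSums(is.na(toy)) != nrow(toy)]
names(result)
```

Date survives the second step because it contains no NA at all, so it needs no special handling there.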