Quickly Remove Zero Variance Variables from a Data.Frame

Don't use table(); it is very slow for this. One option is length(unique(x)):

foo <- function(dat) {
    out <- lapply(dat, function(x) length(unique(x)))  # distinct values per column
    want <- which(!out > 1)   # !out > 1 parses as !(out > 1): columns with one value
    unlist(want)              # named indices of the zero-variance columns
}

Which is an order of magnitude faster than the original zeroVar() on the example data set, whilst giving similar output:

> system.time(replicate(1000, zeroVar(dat)))
   user  system elapsed
  3.334   0.000   3.335
> system.time(replicate(1000, foo(dat)))
   user  system elapsed
  0.324   0.000   0.324

Simon's solution here is similarly quick on this example:

> system.time(replicate(1000, which(!unlist(lapply(dat,
+     function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
   user  system elapsed
  0.392   0.000   0.395

but you'll have to see if they scale similarly to real problem sizes.

Remove variable with zero variance

Removing features with low variance

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

There are 3 boolean features here, each with 6 instances. Suppose we wish to remove those that are constant in at least 80% of the instances. A boolean feature is a Bernoulli variable with variance p(1 - p), so any such feature has variance no greater than 0.8 * (1 - 0.8) = 0.16. Consequently, we can use that value as the threshold for scikit-learn's VarianceThreshold:

from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)

Output will be:

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
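To see why this drops the first column, note that each boolean column is Bernoulli with variance p(1 - p), where p is the fraction of ones. A quick NumPy check (a sketch, not part of the original answer) computes the per-column variances:

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])

# Population variance of each boolean column equals p * (1 - p),
# where p is the fraction of ones in that column.
variances = X.var(axis=0)
threshold = 0.8 * (1 - 0.8)

print(variances)              # approximately [0.139, 0.222, 0.25]
print(variances > threshold)  # [False  True  True]: column 0 is removed
```

Column 0 has p = 1/6, so its variance 5/36 ≈ 0.139 falls below 0.16, matching the two-column output above.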

Zero-Variance Removal

You can use groupby().transform('var') to filter rows by their group's variance:

df[df.groupby('Sen').Temp.transform('var') > 0]

Output:

  Sen   Temp
0   A  2.045
1   A  3.056

However, this can fail if some groups have only one valid data point, since their variance is NaN and the comparison silently drops them. On the other hand, since zero variance means only one distinct value across the group, you can use nunique instead:

df[df.groupby('Sen').Temp.transform('nunique') > 1]
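A small sketch of the difference (the data here is hypothetical, shaped like the df above, with a constant sensor B and a single-reading sensor C):

```python
import pandas as pd

# Hypothetical data: sensor B is constant, sensor C has one reading.
df = pd.DataFrame({
    'Sen':  ['A', 'A', 'B', 'B', 'C'],
    'Temp': [2.045, 3.056, 1.0, 1.0, 7.5],
})

# Sample variance is NaN for the singleton group C, so NaN > 0 is
# False and that row is dropped silently, along with constant group B.
by_var = df[df.groupby('Sen').Temp.transform('var') > 0]

# nunique returns a definite count for every group, including singletons.
by_nunique = df[df.groupby('Sen').Temp.transform('nunique') > 1]

print(by_var)      # rows of sensor A only
print(by_nunique)  # rows of sensor A only
```

Both filters keep only sensor A here, but with nunique the singleton group is dropped by an explicit count of 1 rather than by a NaN comparison, so you can change the rule deliberately if singleton groups should be kept.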

Remove rows with zero-variance in R

Assuming you have data like this.

survey <- data.frame(participants = 1:10,
                     q1 = c(1, 2, 5, 5, 5, 1, 2, 3, 4, 2),
                     q2 = c(1, 2, 5, 5, 5, 1, 2, 3, 4, 3),
                     q3 = c(3, 2, 5, 4, 5, 5, 2, 3, 4, 5))

You can do the following.

# Logical indexing avoids both the redundant == T comparison and the
# edge case where no row matches (survey[-integer(0), ] drops every row)
keep <- !apply(survey[, -1], 1, function(x) all(x == 5))
survey[keep, ]

This will remove rows where all values equal 5.
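For comparison, the same row filter in pandas (a sketch, not part of the original R answer), using a boolean mask so the no-match edge case is handled naturally:

```python
import pandas as pd

survey = pd.DataFrame({
    'participants': range(1, 11),
    'q1': [1, 2, 5, 5, 5, 1, 2, 3, 4, 2],
    'q2': [1, 2, 5, 5, 5, 1, 2, 3, 4, 3],
    'q3': [3, 2, 5, 4, 5, 5, 2, 3, 4, 5],
})

# Keep rows where the answers are NOT all equal to 5
mask = ~(survey[['q1', 'q2', 'q3']] == 5).all(axis=1)
print(survey[mask])  # drops participants 3 and 5
```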

Drop column with low variance in pandas

There are some non-numeric columns, which std() excludes by default (newer pandas versions require numeric_only=True instead of excluding them silently):

import pandas as pd

baseline = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 1, 1, 1, 1, 1],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})

# no A, F columns
m = baseline.std(numeric_only=True) > 0.0
print(m)
B     True
C     True
D    False
E     True
dtype: bool

So a possible solution for keeping or removing the string columns is to reindex the mask against all of the DataFrame's columns:

baseline_filtered = baseline.loc[:, m.reindex(baseline.columns, fill_value=True)]
print(baseline_filtered)
   A  B  C  E  F
0  a  4  7  5  a
1  b  5  8  3  a
2  c  4  9  6  a
3  d  5  4  9  b
4  e  5  2  2  b
5  f  4  3  4  b

baseline_filtered = baseline.loc[:, m.reindex(baseline.columns, fill_value=False)]
print(baseline_filtered)
   B  C  E
0  4  7  5
1  5  8  3
2  4  9  6
3  5  4  9
4  5  2  2
5  4  3  4

Another idea is to use DataFrame.nunique, which works with both string and numeric columns:

baseline_filtered = baseline.loc[:, baseline.nunique() > 1]
print(baseline_filtered)
   A  B  C  E  F
0  a  4  7  5  a
1  b  5  8  3  a
2  c  4  9  6  a
3  d  5  4  9  b
4  e  5  2  2  b
5  f  4  3  4  b

