Quickly remove zero variance variables from a data.frame
Don't use table(); it is very slow for this. One option is length(unique(x)):
foo <- function(dat) {
    ## count the distinct values in each column
    out <- lapply(dat, function(x) length(unique(x)))
    ## columns with a single distinct value have zero variance
    want <- which(!out > 1)
    unlist(want)
}
This is an order of magnitude faster than yours on the example data set whilst giving similar output:
> system.time(replicate(1000, zeroVar(dat)))
user system elapsed
3.334 0.000 3.335
> system.time(replicate(1000, foo(dat)))
user system elapsed
0.324 0.000 0.324
Simon's solution here is similarly quick on this example:
> system.time(replicate(1000, which(!unlist(lapply(dat,
+ function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
user system elapsed
0.392 0.000 0.395
but you'll have to see if they scale similarly to real problem sizes.
Remove variable with zero variance
Removing features with low variance
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
There are 3 boolean features here, each with 6 instances. Suppose we wish to remove those that are constant in at least 80% of the instances. A Bernoulli variable with success probability p has variance p * (1 - p), so such features will have variance lower than 0.8 * (1 - 0.8) = 0.16. Consequently, we can use VarianceThreshold from scikit-learn:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
Output will be:
array([[0, 1],
[1, 0],
[0, 0],
[1, 1],
[1, 0],
[1, 1]])
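As a sanity check (a minimal sketch with numpy, not part of the original answer), the per-column variances can be computed directly; only the columns whose variance exceeds p * (1 - p) = 0.16 survive:

```python
import numpy as np

# The same boolean feature matrix as above
X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])

# Population variance per column (ddof=0), matching sklearn's convention
variances = X.var(axis=0)
threshold = 0.8 * (1 - 0.8)  # p * (1 - p) for a Bernoulli feature

keep = variances > threshold      # [False, True, True]
X_reduced = X[:, keep]            # same columns VarianceThreshold keeps
```

The first column is 1 in only one of six rows, so its variance is (1/6) * (5/6) ≈ 0.139 < 0.16 and it is dropped.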
Zero-Variance Removal
You can use groupby().transform() to broadcast each group's variance back to its rows and use it as a mask:
df[df.groupby('Sen').Temp.transform('var') > 0]
Output:
Sen Temp
0 A 2.045
1 A 3.056
However, this might fail if you have some groups with only one valid data point: their sample variance is NaN, so those rows are silently dropped. On the other hand, since zero variance means only one distinct value across the group, you can use nunique instead:
df[df.groupby('Sen').Temp.transform('nunique') > 1]
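The answer above does not show its input frame, so here is a minimal self-contained sketch (the data values are invented for illustration) where group 'B' has zero variance and both masks agree:

```python
import pandas as pd

# Hypothetical sensor readings: group 'A' varies, group 'B' is constant
df = pd.DataFrame({'Sen': ['A', 'A', 'B', 'B'],
                   'Temp': [2.045, 3.056, 1.0, 1.0]})

# Variance-based mask: drops group 'B' (its variance is 0)
by_var = df[df.groupby('Sen').Temp.transform('var') > 0]

# nunique-based mask: same result, and also safe for single-row groups
by_nunique = df[df.groupby('Sen').Temp.transform('nunique') > 1]
```

Both keep only the two 'A' rows; the nunique version is the more robust choice when single-row groups can occur.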
Remove rows with zero-variance in R
Assuming you have data like this.
survey <- data.frame(participants = c(1:10),
q1 = c(1,2,5,5,5,1,2,3,4,2),
q2 = c(1,2,5,5,5,1,2,3,4,3),
q3 = c(3,2,5,4,5,5,2,3,4,5))
You can do the following.
idx <- which(apply(survey[,-1], 1, function(x) all(x == 5)))
survey[-idx, ]
This will remove rows where all responses equal 5. To remove rows that are constant at any value (true zero variance), test length(unique(x)) == 1 instead. Also note that if no row matches, idx is empty and survey[-idx, ] drops every row, so guard with if (length(idx) > 0).
Drop column with low variance in pandas
There are some non-numeric columns, which std excludes from the mask (in recent pandas versions you must request this explicitly with numeric_only=True):
baseline = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,1,1,1,1,1],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
#A and F columns are excluded from the mask
m = baseline.std(numeric_only=True) > 0.0
print (m)
B True
C True
D False
E True
dtype: bool
So a possible solution for keeping or dropping the string columns is to align the mask with every column via Series.reindex:
baseline_filtered = baseline.loc[:, m.reindex(baseline.columns, fill_value=True)]
print (baseline_filtered)
A B C E F
0 a 4 7 5 a
1 b 5 8 3 a
2 c 4 9 6 a
3 d 5 4 9 b
4 e 5 2 2 b
5 f 4 3 4 b
baseline_filtered = baseline.loc[:, m.reindex(baseline.columns, fill_value=False)]
print (baseline_filtered)
B C E
0 4 7 5
1 5 8 3
2 4 9 6
3 5 4 9
4 5 2 2
5 4 3 4
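Put together as a runnable sketch (same baseline frame as above), the reindex step fills the string columns, which are missing from the mask, with True so they are kept:

```python
import pandas as pd

baseline = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 1, 1, 1, 1, 1],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb'),
})

# numeric_only=True is required on mixed-dtype frames in pandas >= 2.0
m = baseline.std(numeric_only=True) > 0.0

# Align the mask with all columns; absent (string) columns default to True
mask = m.reindex(baseline.columns, fill_value=True)
baseline_filtered = baseline.loc[:, mask]  # drops only the constant column D
```

Swapping fill_value=True for fill_value=False gives the second variant above, which drops the string columns as well.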
Another idea is to use DataFrame.nunique, which works with both string and numeric columns:
baseline_filtered=baseline.loc[:,baseline.nunique() > 1]
print (baseline_filtered)
A B C E F
0 a 4 7 5 a
1 b 5 8 3 a
2 c 4 9 6 a
3 d 5 4 9 b
4 e 5 2 2 b
5 f 4 3 4 b