How to Solve Prcomp.Default(): Cannot Rescale a Constant/Zero Column to Unit Variance

PCA and Constant-Zero Column Error

PCA uses only complete observations. In your second definition of df above, the last row is dropped because it contains a missing value, and column c is then constant across the remaining rows, so it cannot be rescaled to unit variance.

Note: my answer is about PCA in general and is not specific to the caret package.
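The effect described above can be reproduced on a small scale. The sketch below uses pandas (the library used further down this page) and a made-up data frame, since the original df is not shown here; the column names and values are assumptions for illustration only. It shows a column that varies on the full data but becomes constant once incomplete rows are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "c" differs only in the last row,
# and that same row has a missing value in column "a".
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [4.0, 1.0, 7.0, 2.0],
    "c": [0.0, 0.0, 0.0, 1.0],
})

full_var = df.var()          # variance per column on all rows (NaN skipped)
complete = df.dropna()       # keep complete observations only
complete_var = complete.var()

print(full_var["c"])         # non-zero: "c" varies on the full data
print(complete_var["c"])     # 0.0: "c" is constant once the last row is gone
```

Checking variances after removing incomplete rows, rather than before, is what catches this case.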

Removal of constant columns in R

The problem here is that the variance of one of your columns is zero. You can check which columns of a data frame are constant like this, for example:

df <- data.frame(x=1:5, y=rep(1,5))
df
#   x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1

# Names of columns that have zero variance
# (index names(df) directly: df[, mask] drops to a plain vector when only one column matches)
names(df)[sapply(df, function(v) var(v, na.rm=TRUE)==0)]
# [1] "y"

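So if you want to exclude these columns, you can use:

# drop=FALSE keeps the result a data frame even if only one column remains
df[, sapply(df, function(v) var(v, na.rm=TRUE)!=0), drop=FALSE]

EDIT: In fact, it is simpler to use apply instead (for all-numeric data, since apply coerces the data frame to a matrix). Something like this:

df[, apply(df, 2, var, na.rm=TRUE) != 0, drop=FALSE]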

R command which(apply(data, 2, var)==0) in Python

Essentially, apply(data, 2, var) in R runs on two-dimensional structures such as matrices or data frames (though it is not advised for the latter, since apply first coerces a data frame to a matrix) and computes the variance of every column:

Data frame

set.seed(73120)

random_df <- data.frame(
  num1 = runif(500, 1, 100),
  num2 = runif(500, 1, 100),
  num3 = runif(500, 1, 100),
  num4 = runif(500, 1, 100),
  num5 = runif(500, 1, 100)
)

apply(random_df, 2, var)
#     num1     num2     num3     num4     num5 
# 822.9465 902.5558 782.4820 804.1448 830.1097

Once which is applied to the resulting named vector (i.e., a 1-D array), it returns the positions of the elements that satisfy the logical condition, labelled with their names.

which(apply(random_df, 2, var) > 900)
# num2 
#    2

Matrix

set.seed(73120)

random_mat <- replicate(5, runif(500, 1, 100))

apply(random_mat, 2, var)
# [1] 822.9465 902.5558 782.4820 804.1448 830.1097

which(apply(random_mat, 2, var) > 900)
# [1] 2

Pandas

In Python, using pandas (a data analysis library), the equivalent is also apply: DataFrame.apply with axis set to 'index' applies the function to each column. Equivalently, you can call DataFrame.aggregate. The result is a pandas Series, which is similar to R's named vector as a 1-D array.

import numpy as np
import pandas as pd

np.random.seed(7312020)

random_df = pd.DataFrame({'num1': np.random.uniform(1, 100, 500),
                          'num2': np.random.uniform(1, 100, 500),
                          'num3': np.random.uniform(1, 100, 500),
                          'num4': np.random.uniform(1, 100, 500),
                          'num5': np.random.uniform(1, 100, 500)
                         })

agg1 = random_df.apply('var', axis='index')
print(agg1)
# num1    828.538378
# num2    810.755215
# num3    820.480400
# num4    811.728108
# num5    885.514924
# dtype: float64

agg2 = random_df.aggregate('var')
print(agg2)
# num1    828.538378
# num2    810.755215
# num3    820.480400
# num4    811.728108
# num5    885.514924
# dtype: float64

R's which can be approximated by filtering the Series with a boolean mask, using simple brackets [...] (also doable in R), .loc, or where (which keeps the original dimensions and fills non-matching entries with NaN):

agg1[agg1 > 850]
# num5    885.514924
# dtype: float64

agg1.loc[agg1 > 850]
# num5    885.514924
# dtype: float64

agg1.where(agg1 > 850)
# num1           NaN
# num2           NaN
# num3           NaN
# num4           NaN
# num5    885.514924
# dtype: float64
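The examples above filter on a threshold, while the R command in the question tests for zero variance. A minimal sketch of that case in pandas follows; the toy data frame mirrors the R example with a constant column y and is only an illustration. Note that which in R returns 1-based positions, whereas the positions below are 0-based.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the R example: column "y" is constant
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 1, 1, 1, 1]})

variances = df.var()            # sample variance per column (ddof=1)
is_constant = variances == 0    # boolean Series indexed by column name

# Names and 0-based positions of the zero-variance columns
constant_names = variances.index[is_constant].tolist()
constant_positions = np.flatnonzero(is_constant.to_numpy()).tolist()

# Keep only the columns whose variance is not zero
df_clean = df.loc[:, ~is_constant]

print(constant_names)
print(constant_positions)
print(list(df_clean.columns))
```

For floating-point columns the computed variance of a constant column may not be exactly zero, so comparing against a small tolerance, or testing df.nunique() <= 1, may be more robust.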

Numpy

Additionally, using Python's numpy (the numeric computing library that supports arrays), you can use numpy.apply_along_axis. To match pandas' var, which defaults to ddof=1, set ddof=1 explicitly, since numpy.var defaults to ddof=0:

random_arry = random_df.to_numpy()

agg = np.apply_along_axis(lambda x: np.var(x, ddof=1), 0, random_arry)
print(agg)
# [828.53837793 810.75521479 820.48039962 811.72810753 885.51492378]

print(agg[agg > 850])
# [885.51492378]
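To make the ddof difference concrete, here is a small sketch on a hypothetical 4x2 array whose second column is constant. It also uses the axis argument of numpy.var directly, which is an alternative to apply_along_axis rather than the approach shown above, and finds zero-variance columns with numpy.flatnonzero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second column is constant
arr = np.array([[1.0, 5.0],
                [2.0, 5.0],
                [3.0, 5.0],
                [4.0, 5.0]])

pop_var = np.var(arr, axis=0)              # ddof=0: divides by n
sample_var = np.var(arr, axis=0, ddof=1)   # ddof=1: divides by n - 1
pandas_var = pd.DataFrame(arr).var().to_numpy()  # pandas default is ddof=1

# 0-based positions of columns with zero variance
zero_var_cols = np.flatnonzero(sample_var == 0)

print(pop_var)
print(sample_var)
print(zero_var_cols)
```

Here the first column has variance 1.25 with ddof=0 and 5/3 with ddof=1; only the latter agrees with pandas.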

Principal Component Analysis throws constant/zero column Error

Sorry, I don't have the rep to comment, so I'm posting this as an answer. After running your code, this line in particular:

 log10(training1[, -13]+1)

returns NaN values in some columns (IL_1alpha and IL_3, actually):

 Warning messages:
 1: In lapply(X = x, FUN = .Generic, ...) : NaNs produced

So that seems to be the source of the error. Maybe you shouldn't take logs of negative numbers, and should consider another transformation instead (or whether a transformation is necessary at all)?
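One way to catch this before transforming is to check which columns would have a negative argument to the logarithm. The sketch below does this in pandas/numpy (the libraries used earlier on this page) rather than R; the column name IL_1alpha is borrowed from the warning above, but the values and the second column are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values: every entry of IL_1alpha is below -1,
# so x + 1 is negative and its logarithm is undefined.
df = pd.DataFrame({
    "IL_1alpha": [-7.5, -6.9, -8.1],
    "other": [0.5, 1.2, 3.3],
})

shifted = df + 1

# Columns where the argument of log10 would be negative
bad_cols = shifted.columns[(shifted < 0).any()].tolist()

# Suppress the runtime warning; the NaN values are inspected explicitly below
with np.errstate(invalid="ignore"):
    logged = np.log10(shifted)

nan_cols = logged.columns[logged.isna().any()].tolist()

print(bad_cols)
print(nan_cols)
```

Both lists name the same column, which confirms that the NaN values come from the transformation and not from the original data.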
