PCA and Constant-Zero Column Error
PCA uses only complete observations. In your second definition of df
above, a PCA analysis will drop the last row because it contains a missing value, and column c
is then constant within the remaining rows.
Note: my answer is about PCA in general and is not specific to the caret package.
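To make the mechanism concrete, here is a minimal Python sketch using pandas. The data frame is hypothetical (it is not the df from the question); it only assumes that one row has a missing value and that column c varies solely because of that row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "c" varies only because of the last row,
# and that row has a missing value in column "a".
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [4.0, 1.0, 7.0, 2.0],
    "c": [0.0, 0.0, 0.0, 5.0],
})

# Keep complete observations only, mirroring the row removal described above.
complete = df.dropna()

# Sample variance (ddof=1) per column on the remaining rows.
variances = complete.var()

# Columns that became constant once the incomplete row was dropped.
constant_cols = variances[variances == 0].index.tolist()
print(constant_cols)  # ['c']
```

On the full data column c has nonzero variance, so the problem only shows up after the incomplete row is removed.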
Removal of constant columns in R
The problem here is that at least one column has a variance of zero. You can check which columns of a data frame are constant like this, for example:
df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Names of the columns that have zero variance
# (index names(df) directly: df[, mask] drops to a plain vector when only one column matches)
names(df)[sapply(df, function(v) var(v, na.rm=TRUE) == 0)]
# [1] "y"
So if you want to exclude these columns, you can use (drop = FALSE keeps the result a data frame even if only one column remains):
df[, sapply(df, function(v) var(v, na.rm=TRUE) != 0), drop = FALSE]
EDIT: In fact, it is simpler to use apply instead. Something like this:
df[, apply(df, 2, var, na.rm=TRUE) != 0, drop = FALSE]
R command which(apply(data, 2, var)==0) in Python
Essentially, the command apply(data, 2, var) in R runs on two-dimensional structures such as matrices or data frames (though it is not advised for the latter) to compute the variance of each column:
Data frame
set.seed(73120)
random_df <- data.frame(
num1 = runif(500, 1, 100),
num2 = runif(500, 1, 100),
num3 = runif(500, 1, 100),
num4 = runif(500, 1, 100),
num5 = runif(500, 1, 100)
)
apply(random_df, 2, var)
# num1 num2 num3 num4 num5
# 822.9465 902.5558 782.4820 804.1448 830.1097
Once which is applied, it returns the positions (with their names, for a named vector) of the elements for which the condition is TRUE.
which(apply(random_df, 2, var) > 900)
# num2
# 2
Matrix
set.seed(73120)
random_mat <- replicate(5, runif(500, 1, 100))
apply(random_mat, 2, var)
# [1] 822.9465 902.5558 782.4820 804.1448 830.1097
which(apply(random_mat, 2, var) > 900)
# [1] 2
Pandas
In Python, using pandas (a data analytics library), the equivalent is also apply: DataFrame.apply with axis set to 'index' (the default) runs the operation on each column. Equivalently, you can run DataFrame.aggregate. The result is a pandas Series, a labeled 1-D array similar to R's named vector.
import numpy as np
import pandas as pd
np.random.seed(7312020)
random_df = pd.DataFrame({'num1': np.random.uniform(1, 100, 500),
'num2': np.random.uniform(1, 100, 500),
'num3': np.random.uniform(1, 100, 500),
'num4': np.random.uniform(1, 100, 500),
'num5': np.random.uniform(1, 100, 500)
})
agg1 = random_df.apply('var', axis='index')
print(agg1)
# num1 828.538378
# num2 810.755215
# num3 820.480400
# num4 811.728108
# num5 885.514924
# dtype: float64
agg2 = random_df.aggregate('var')
print(agg2)
# num1 828.538378
# num2 810.755215
# num3 820.480400
# num4 811.728108
# num5 885.514924
# dtype: float64
R's which can be achieved with simple bracketed [...] indexing (also doable in R), .loc, or where (which keeps the original dimensions):
agg1[agg1 > 850]
# num5 885.514924
# dtype: float64
agg1.loc[agg1 > 850]
# num5 885.514924
# dtype: float64
agg1.where(agg1 > 850)
# num1 NaN
# num2 NaN
# num3 NaN
# num4 NaN
# num5 885.514924
# dtype: float64
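For the zero-variance case in the question title, a minimal sketch might look like the following. The small frame and the names is_constant, positions, labels and kept are illustrative assumptions, not part of the original code. Unlike the bracket indexing above, np.flatnonzero returns positions, which is closer to what R's which gives back.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one constant column, "y".
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 1, 1, 1, 1],
    "z": [2, 4, 6, 8, 10],
})

# pandas' var defaults to ddof=1, like R's var.
variances = df.var()
is_constant = (variances == 0).to_numpy()

# Zero-based positions of the constant columns (R's which is one-based).
positions = np.flatnonzero(is_constant)

# Their labels, and the frame with those columns removed.
labels = df.columns[is_constant].tolist()
kept = df.loc[:, ~is_constant]

print(positions)              # [1]
print(labels)                 # ['y']
print(kept.columns.tolist())  # ['x', 'z']
```

With floating-point data that is only nearly constant, comparing against a small tolerance instead of exactly 0 may be the safer choice.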
Numpy
Additionally, using Python's numpy (the numeric computing library that supports arrays), you can use numpy.apply_along_axis. To match pandas' var, set ddof=1, since numpy's var defaults to ddof=0:
random_arry = random_df.to_numpy()
agg = np.apply_along_axis(lambda x: np.var(x, ddof=1), 0, random_arry)
print(agg)
# [828.53837793 810.75521479 820.48039962 811.72810753 885.51492378]
print(agg[agg > 850])
# [885.51492378]
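The effect of ddof is easy to check on a tiny array. This is just an illustrative sketch with made-up values; it also shows that the array method var(axis=0, ddof=1) gives the same column-wise result as the apply_along_axis call without a lambda.

```python
import numpy as np
import pandas as pd

values = np.array([2.0, 4.0, 4.0, 6.0])

population_var = np.var(values)        # ddof=0: divides by n
sample_var = np.var(values, ddof=1)    # ddof=1: divides by n - 1
pandas_var = pd.Series(values).var()   # pandas defaults to ddof=1

print(population_var)  # 2.0
print(np.isclose(sample_var, pandas_var))  # True

# Column-wise sample variance on a 2-D array, without apply_along_axis.
arr = np.column_stack([values, values * 2])
col_var = arr.var(axis=0, ddof=1)
via_apply = np.apply_along_axis(lambda x: np.var(x, ddof=1), 0, arr)
print(np.allclose(col_var, via_apply))  # True
```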
Principal Component Analysis throws constant/zero column Error
After running your code, I found that this line in particular:
log10(training1[, -13]+1)
returns NaN values in some columns (IL_1alpha and IL_3, specifically):
Warning messages:
1: In lapply(X = x, FUN = .Generic, ...) : NaNs produced
So that seems to be the source of the error. You probably shouldn't take logs of negative numbers; consider another transformation instead (or whether a transformation is necessary at all).
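A quick way to spot the affected columns before transforming is sketched below in Python with pandas. The values are invented for illustration; only the column names IL_1alpha and IL_3 come from the answer above, and "other" is a hypothetical well-behaved column.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the first two column names come from the text above.
df = pd.DataFrame({
    "IL_1alpha": [-7.5, -6.9, -8.1],
    "IL_3": [-4.2, -3.9, -5.0],
    "other": [0.5, 1.2, 3.0],
})

# log10(x + 1) is undefined when x + 1 < 0 and is -inf when x + 1 == 0,
# so flag every column that contains such a value.
mask = (df + 1 <= 0).any()
bad_cols = mask[mask].index.tolist()
print(bad_cols)  # ['IL_1alpha', 'IL_3']

# NumPy yields NaN for the negative inputs (warnings silenced here).
with np.errstate(invalid="ignore", divide="ignore"):
    logged = np.log10(df + 1)

has_nan = logged.isna().any()
nan_cols = has_nan[has_nan].index.tolist()
print(nan_cols)  # ['IL_1alpha', 'IL_3']
```

Checking for the problem up front makes it clear which columns need a different transformation, rather than discovering the NaN values later inside the PCA call.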