When performing PCA, how can we know which principal components were selected?
To understand this, you need to know a bit more about how PCA works. PCA computes all principal components that span the whole vector space, i.e., the eigenvalues and eigenvectors of the covariance matrix of the features. You can therefore rank the eigenvectors by the size of their corresponding eigenvalues and keep the eigenvectors with the largest eigenvalues.
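A minimal from-scratch sketch of this eigenvalue-based selection (using NumPy; the data and shapes are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Center the data, then eigendecompose the covariance matrix of the features.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: the covariance matrix is symmetric

# Sort eigenpairs by descending eigenvalue and keep the top k eigenvectors.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
components = eigvecs[:, :k]              # principal axes, shape (5, k)
X_reduced = Xc @ components              # projected data, shape (100, k)
```
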
Now if you look at the documentation of the PCA class in scikit-learn, you will find some useful attributes, such as the following:
components_ ndarray of shape (n_components, n_features): Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
explained_variance_ratio_ ndarray of shape (n_components,)
Percentage of variance explained by each of the selected components.
If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
explained_variance_ratio_ is a very useful attribute: you can use it to select principal components based on a desired threshold for the percentage of variance covered. For example, suppose the values in this array are [0.4, 0.3, 0.2, 0.1]. If we take the first three components, they cover 90% of the total variance of the original data.
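As a sketch of this threshold-based selection: you can take the cumulative sum of explained_variance_ratio_, and scikit-learn also accepts a float n_components directly to do the same thing (the iris dataset here is purely an example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fit with all components, then count how many are needed to cover 90% variance.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.90)) + 1

# Equivalently, scikit-learn accepts a float threshold directly:
pca_90 = PCA(n_components=0.90).fit(X)
assert pca_90.n_components_ == n_needed
```
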
Implementation of Principal Component Analysis from Scratch Orients the Data Differently than scikit-learn
When computing an eigenvector, you may flip its sign and the result is still a valid eigenvector.
So any PCA axis can be reversed and the solution remains valid.
Nevertheless, you may wish to impose a positive correlation between a PCA axis and one of the original variables in the dataset, inverting the axis if needed.
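One possible sketch of such a sign convention, forcing each axis to load positively on the first original feature (the choice of reference feature is arbitrary, and the data here is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

pca = PCA(n_components=2).fit(X)
components = pca.components_.copy()

# Flip each axis so that it has a positive loading on the first
# original variable (column 0); both orientations are valid solutions.
signs = np.sign(components[:, 0])
signs[signs == 0] = 1.0
components = components * signs[:, None]
```
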
Principal Component Analysis (PCA) in Python
You can find a PCA class in matplotlib's mlab module:
import numpy as np
from matplotlib.mlab import PCA
data = np.array(np.random.randint(10,size=(10,3)))
results = PCA(data)
results will store the various parameters of the PCA.
It comes from the mlab part of matplotlib, which was the compatibility layer with MATLAB syntax. Note that mlab.PCA was deprecated and later removed from matplotlib, so this only works with older versions.
EDIT:
On the nextgenetics blog I found a wonderful demonstration of how to perform and display a PCA with the matplotlib mlab module. Have fun, and check out that blog!
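Since mlab.PCA is no longer available in current matplotlib releases, here is a rough scikit-learn equivalent (the attribute mapping in the comments is approximate; mlab.PCA also standardized each variable by its standard deviation, which sklearn's PCA does not do):

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
data = np.random.randint(10, size=(10, 3)).astype(float)

# Fit on the data and project it onto the principal axes.
pca = PCA()
projected = pca.fit_transform(data)     # roughly mlab's results.Y
fracs = pca.explained_variance_ratio_   # roughly mlab's results.fracs
```
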
Run a Principal Component Analysis (PCA) on the dataset to reduce the number of features (components) from 64 to 2
It's not actually the PCA that is problematic, just the renaming of your columns: the digits dataset has 64 columns, and you are trying to name them using the column names for the 4 columns of the iris dataset.
Because of the nature of the digits dataset (pixels), there isn't really an appropriate naming scheme for the columns. So just don't rename them.
import pandas as pd
from sklearn import datasets, decomposition

digits = datasets.load_digits()
x = pd.DataFrame(digits.data)
pca = decomposition.PCA(n_components=2)
pca.fit(x)
x = pca.transform(x)
# Here is the result of your PCA (2 components)
>>> x
array([[ -1.25946636, 21.27488332],
[ 7.95761139, -20.76869904],
[ 6.99192268, -9.9559863 ],
...,
[ 10.80128366, -6.96025224],
[ -4.87210049, 12.42395326],
[ -0.34438966, 6.36554934]])
Then you can plot the first principal component against the second, if that's what you're going for (which is what I gathered from your code):
import matplotlib.pyplot as plt
plt.scatter(x[:, 0], x[:, 1], s=40)
plt.show()