Principal Component Analysis (PCA) in Python

When doing a PCA analysis, how can we know which principal components were selected?

To understand this, you need to know a little more about PCA. PCA returns all the principal components that span the space of the feature vectors, i.e., the eigenvalues and eigenvectors of the covariance matrix of the features. You can therefore rank the eigenvectors by the size of their corresponding eigenvalues, and keep the eigenvectors belonging to the largest ones.
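
As a minimal sketch of this idea with NumPy (the toy data and variable names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # toy data: 100 samples, 4 features
Xc = X - X.mean(axis=0)                       # center each feature

# eigendecomposition of the covariance matrix (eigh returns ascending order)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]             # biggest eigenvalues first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# the first k columns of eigvecs are the top-k principal axes
k = 2
top_axes = eigvecs[:, :k]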

Now if you look at the documentation of the PCA class in scikit-learn, you will find some useful attributes, such as the following:

components_ ndarray of shape (n_components, n_features)
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.

explained_variance_ratio_ ndarray of shape (n_components,)
Percentage of variance explained by each of the selected components.
If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.

explained_variance_ratio_ is a very useful attribute: you can use it to select principal components based on a desired threshold for the percentage of variance covered. For example, suppose the values in this array are [0.4, 0.3, 0.2, 0.1]. If we take the first three components, they cover 90% of the total variance of the original data.
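
As a minimal sketch of this selection with scikit-learn (the iris data and the 0.90 threshold are illustrative choices):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA()                          # n_components not set: keep all components
pca.fit(X)

# cumulative share of variance covered by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.90)) + 1
print(k, "components cover", cumulative[k - 1], "of the variance")

Note that scikit-learn can also do this selection for you: passing a float between 0 and 1, e.g. PCA(n_components=0.90), keeps the smallest number of components that covers that fraction of the variance.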

Implementation of Principal Component Analysis from Scratch Orients the Data Differently than scikit-learn

When calculating an eigenvector you may flip its sign, and the result is still a valid eigenvector.

So any PCA axis can be reversed and the solution remains valid.

Nevertheless, you may wish to impose a positive correlation of a PCA axis with one of the original variables in the dataset, inverting the axis if needed.
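
A minimal sketch of that convention, assuming a fitted scikit-learn PCA; flipping a component whose loading on a variable is negative also flips the sign of its correlation with that variable:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)

components = pca.components_.copy()
# invert any axis whose loading on the first original variable is negative
signs = np.sign(components[:, 0])
signs[signs == 0] = 1.0
components *= signs[:, np.newaxis]

projected = (X - pca.mean_) @ components.T   # scores with the fixed orientation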

Principal Component Analysis (PCA) in Python

You can find a PCA class in matplotlib's mlab module (note that it was deprecated and later removed from matplotlib, so this only works with old versions):

import numpy as np
from matplotlib.mlab import PCA

data = np.random.randint(10, size=(10, 3))   # toy data: 10 samples, 3 features
results = PCA(data)

results will store the various parameters of the PCA.
It comes from the mlab part of matplotlib, which is the compatibility layer with MATLAB syntax.
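
Since mlab.PCA is gone from current matplotlib, here is a minimal equivalent sketch with scikit-learn (the mapping to the old mlab attributes in the comments is approximate; also, the mlab class standardized the data, while scikit-learn's PCA only centers it, so the numbers will differ):

import numpy as np
from sklearn.decomposition import PCA

data = np.random.randint(10, size=(10, 3)).astype(float)

pca = PCA()
projected = pca.fit_transform(data)          # roughly mlab's results.Y
axes = pca.components_                       # roughly mlab's results.Wt
fracs = pca.explained_variance_ratio_       # roughly mlab's results.fracs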

EDIT:
On the blog nextgenetics I found a wonderful demonstration of how to perform and display a PCA with the matplotlib mlab module. Have fun, and check out that blog!

Run a Principal Component Analysis (PCA) on the dataset to reduce the number of features (components) from 64 to 2

It's not actually the PCA that is problematic, but just the renaming of your columns: the digits dataset has 64 columns, and you are trying to rename them using the column names of the 4 columns in the iris dataset.

Because of the nature of the digits dataset (pixels), there isn't really an appropriate naming scheme for the columns. So just don't rename them.

from sklearn import datasets, decomposition
import pandas as pd

digits = datasets.load_digits()

x = pd.DataFrame(digits.data)

pca = decomposition.PCA(n_components=2)
pca.fit(x)
x = pca.transform(x)

# Here is the result of your PCA (2 components)
>>> x
array([[ -1.25946636,  21.27488332],
       [  7.95761139, -20.76869904],
       [  6.99192268,  -9.9559863 ],
       ...,
       [ 10.80128366,  -6.96025224],
       [ -4.87210049,  12.42395326],
       [ -0.34438966,   6.36554934]])

Then you can plot the first principal component against the second, if that's what you're going for (which is what I gathered from your code):

import matplotlib.pyplot as plt

plt.scatter(x[:, 0], x[:, 1], s=40)
plt.show()

[Scatter plot of the samples projected onto the first two principal components]
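
If you also want to see which digit each point corresponds to, a small extension of the same plot colors the points by their label (c and cmap are standard matplotlib scatter arguments):

# color each projected sample by its digit label (0-9)
plt.scatter(x[:, 0], x[:, 1], c=digits.target, cmap="tab10", s=40)
plt.colorbar()
plt.show()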


