Principal component analysis (PCA) is one of the most widely used techniques for exploratory data analysis and preprocessing.
We start this lab off by introducing some fundamental concepts regarding covariance matrices. We then look at several methods for identifying the principal component (PC) directions and scores for an artificial dataset.
Let's start by importing the packages we will need throughout the lab.
from __future__ import print_function, division
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
Covariance matrices are by definition symmetric positive semi-definite. One property of positive semi-definite matrices is that their eigenvalues are all non-negative. You are provided with the function generate_spsd_matrix()
which generates a random symmetric positive semi-definite matrix. The random_seed
parameter can be used to set the numpy random seed in order to ensure reproducible results.
def generate_spsd_matrix(d, random_seed=None):
    """Reproducible random generation of symmetric
    positive semi-definite matrices.

    Parameters
    ----------
    d : integer
        Number of dimensions.
    random_seed : integer
        Random seed number.

    Returns
    ----------
    A : ndarray, shape (d,d)
        Symmetric positive semi-definite matrix.
    """
    if random_seed is not None:
        np.random.seed(random_seed)
    A = np.random.randn(d, d)
    return np.dot(A.T, A)
Write a function which checks if a matrix is positive semi-definite.
Generate a random 10 x 10 matrix by using the generate_spsd_matrix()
function and double-check that it is positive semi-definite by using the function you just wrote.
Hint: you might find the numpy.linalg.eigh()
function useful.
# Your code goes here
def is_positive_semi_definite(a):
    """Tests whether a matrix is symmetric positive
    semi-definite.

    Parameters
    ----------
    a : ndarray, shape (n,n)
        Symmetric matrix.

    Returns
    ----------
    True if matrix is positive semi-definite.

    Raises
    ----------
    ValueError
        If the provided matrix is not real or symmetric.
    """
    a = np.asarray(a)
    # Check that the matrix is real
    if not np.all(np.isreal(a)):
        raise ValueError("The provided matrix is not real.")
    # Check that the matrix is symmetric
    if not np.array_equal(a, a.T):
        raise ValueError("The provided matrix is not symmetric.")
    # Eigenvalue decomposition: positive semi-definite means
    # all eigenvalues are non-negative
    eigval, _ = np.linalg.eigh(a)
    if np.all(eigval >= 0):
        return True
    else:
        return False

# Example
a = generate_spsd_matrix(d=10, random_seed=10)
if is_positive_semi_definite(a):
    print("Matrix is positive semi-definite.")
else:
    print("Matrix is not positive semi-definite.")
Suppose that you want to sample (i.e. generate) some data points from a multivariate normal distribution with known mean vector and covariance matrix. We will assume that we have access to the tool required to sample from a standard univariate normal distribution, that is, one with zero mean and unit variance (all you need for that is the inverse cumulative distribution function of the normal distribution).
We know that if a univariate gaussian random variable $x$ has zero mean and unit variance, then $y = \sigma x + \mu$ has mean $\mu$ and variance $\sigma^2$ . Therefore, we can use this property to sample from an arbitrary univariate normal distribution with mean $\mu$ and variance $\sigma^2$.
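As a minimal illustration of this scaling trick (the particular values of the mean and standard deviation below are arbitrary, chosen only for this sketch):
# Illustration: sample from N(mu, sigma^2) using draws from N(0, 1)
mu_demo, sigma_demo = 2.0, 3.0          # arbitrary mean and standard deviation
x_demo = np.random.randn(100000)        # draws from the standard normal
y_demo = sigma_demo * x_demo + mu_demo  # draws from N(2, 9)
print(y_demo.mean(), y_demo.var())      # close to 2.0 and 9.0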
By repeating the process described above $d$ times, we can sample from a $d$-dimensional normal distribution with an arbitrary mean vector and diagonal covariance matrix. But, how can we sample from a multivariate normal distribution with given covariance matrix $\mathbf{C}$? The answer is through decomposing the covariance matrix. One such method is the eigen (or spectral) decomposition which, for a real symmetric matrix (such as covariance matrices), takes the following form: $$\mathbf{C} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T$$ where $\mathbf{U}$ is an orthogonal matrix containing the (right) eigenvectors of $\mathbf{C}$ and $\mathbf{\Lambda}$ is a diagonal matrix whose entries are the eigenvalues of $\mathbf{C}$. Now, you might wonder how this decomposition can help us sample from a multivariate distribution.
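As a quick sanity check of this decomposition, using the generate_spsd_matrix() helper defined above (the matrix size and seed are arbitrary):
# Verify that numpy.linalg.eigh() recovers C = U Lambda U^T
C_demo = generate_spsd_matrix(d=4, random_seed=0)
lam, U = np.linalg.eigh(C_demo)                   # eigenvalues (ascending) and orthonormal eigenvectors
print(np.allclose(np.dot(U * lam, U.T), C_demo))  # True: U diag(lam) U^T reconstructs C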
Assume $\mathbf{x}$ is sampled from a standard multivariate normal distribution (i.e. zero mean vector, identity covariance matrix). We can now create a new random variable $\mathbf{y}$ by transforming $\mathbf{x}$ as follows: $$\mathbf{y} = \mathbf{U} \mathbf{\Lambda}^{1/2} \mathbf{x} + \mathbf{\mu}$$.
Assume that $\mathbf{x}$ is a multivariate random variable with zero mean and unit covariance matrix, and $\mathbf{C} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T$ is the eigendecomposition of $\mathbf{C}$.
If $\mathbf{y} = \mathbf{U} \mathbf{\Lambda}^{1/2} \mathbf{x} + \mathbf{\mu}$, show that $\mathbf{y}$ has mean $\mathbf{\mu}$ and covariance $\mathbf{C} $.
Your answer goes here
$E\left[ \mathbf{y} \right] = E \left[ \mathbf{U} \mathbf{\Lambda}^{1/2} \mathbf{x} + \mathbf{\mu} \right] = \mathbf{U} \mathbf{\Lambda}^{1/2} E \left[ \mathbf{x} \right] + E \left[\mathbf{\mu} \right] = \mathbf{0} + \mathbf{\mu} = \mathbf{\mu}$.
$Var\left[ \mathbf{y} \right] = E \left[ \left( \mathbf{y} - E \left[ \mathbf{y} \right] \right) \left( \mathbf{y} - E \left[ \mathbf{y} \right] \right)^T \right] = E \left[ \left( \mathbf{y} - \mathbf{\mu} \right) \left( \mathbf{y} - \mathbf{\mu} \right)^T \right] = E \left[ \mathbf{U} \mathbf{\Lambda}^{1/2} \mathbf{x} \left( \mathbf{U} \mathbf{\Lambda}^{1/2} \mathbf{x} \right) ^T \right] = \mathbf{U} \mathbf{\Lambda}^{1/2} E\left[ \mathbf{x} \mathbf{x}^T \right] \mathbf{\Lambda}^{1/2} \mathbf{U} ^T = \mathbf{U} \mathbf{\Lambda} \mathbf{U} ^T = \mathbf{C}$, where we have used $ \left( \mathbf{\Lambda}^{1/2} \right) ^T=\mathbf{\Lambda}^{1/2}$, since $\mathbf{\Lambda}$ is diagonal, and $E\left[ \mathbf{x} \mathbf{x}^T \right] = Var \left[ \mathbf{x} \right] = \mathbf{I}$.
Please note that the above holds without any assumptions on the probability distribution of $\mathbf{x}$. As a special case, we focus on the multivariate normal distribution.
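To illustrate that only the first two moments of $\mathbf{x}$ matter here, the following sketch (for illustration only, zero mean for simplicity) applies the same transformation to non-Gaussian uniform noise rescaled to zero mean and unit variance; the sample covariance of the result is still close to $\mathbf{C}$:
# The transformed covariance equals C even when x is not Gaussian
C_demo = generate_spsd_matrix(d=3, random_seed=1)
lam, U = np.linalg.eigh(C_demo)
lam = np.maximum(lam, 0)  # guard against tiny negative round-off in the eigenvalues
x_unif = np.random.uniform(-np.sqrt(3), np.sqrt(3), size=(200000, 3))  # zero mean, unit variance, non-Gaussian
y_unif = np.dot(x_unif, (U * np.sqrt(lam)).T)  # row-wise version of y = U Lambda^{1/2} x
print(np.abs(np.cov(y_unif, rowvar=False) - C_demo).max())  # small: only sampling error remains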
The procedure for sampling one data point from a multivariate normal distribution with mean $\mathbf{\mu}$ and covariance matrix $\mathbf{C}$ is therefore as follows:
1. Compute the eigendecomposition $\mathbf{C} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T$.
2. Sample a vector $\mathbf{x}$ from a standard multivariate normal distribution (zero mean, identity covariance).
3. Transform the sample: $\mathbf{y} = \mathbf{U} \mathbf{\Lambda}^{1/2} \mathbf{x} + \mathbf{\mu}$.
Write a function that generates random samples from a multivariate normal distribution with given mean vector and covariance matrix. You should make use of the numpy.random.randn()
function that generates samples from a standard multivariate gaussian distribution and the eigendecomposition of the covariance matrix. For computing the eigendecomposition of a symmetric matrix you should use the numpy.linalg.eigh()
function.
Important:
Most scientific computing packages like scikit-learn
follow the $n \times d$ convention for a data matrix, where $n$ is the number of samples and $d$ the dimensionality, that is, $\mathbf{X} \in \mathbb{R} ^ {n \times d}$. Please note that many textbooks and the lecture notes for this course follow reverse notation (i.e. $d \times n$). Here, you should follow the $n \times d$ convention. If the standard normal data are stored in a matrix $\mathbf{X}$, then you should compute $\mathbf{Y}$ by $$\mathbf{Y} = \mathbf{X} \mathbf{\Lambda}^{\frac{1}{2}} \mathbf{U}^T $$
Finally, generate a 3 x 3 random covariance matrix C
by using the generate_spsd_matrix()
function. Use the function you just wrote to generate 1 million random samples with zero mean and covariance matrix C
. Compute the empirical covariance matrix of the data (you can use numpy.cov()
by setting rowvar=False
) and double-check that it is a good estimate of the true covariance matrix C
.
Tip: don't get stuck on this, move on after 10 mins or so (it's not critical)
# Your code goes here
def sample_multivariate_normal_eig(mean, cov, n, random_seed=None):
    """Sample from a multivariate normal distribution
    with given mean and covariance matrix by
    using the eigendecomposition of the covariance matrix.

    Parameters
    ----------
    mean : array, shape (d,)
        Mean vector.
    cov : array, shape (d,d)
        Covariance matrix.
    n : integer
        Number of samples.
    random_seed : integer (optional)
        Random seed.

    Returns
    ----------
    X : array, shape (n,d)
        Random samples.

    Raises
    ----------
    ValueError
        If the provided covariance matrix is not positive
        semi-definite, or if its shape does not match the
        mean vector.
    """
    # Check that the covariance matrix is positive semi-definite.
    if not is_positive_semi_definite(cov):
        raise ValueError("Covariance matrix must be positive semi-definite.")
    # Check that the mean vector has the same dimensionality
    # as the covariance matrix.
    if mean.size != cov.shape[0]:
        raise ValueError("Mean vector and covariance matrix must have the same dimensionality.")
    if random_seed is not None:
        np.random.seed(random_seed)
    d = mean.size
    X_standard = np.random.randn(n, d)      # Standard normal samples, shape (n, d)
    eigvals, eigvecs = np.linalg.eigh(cov)  # C = U Lambda U^T
    # Y = X Lambda^{1/2} U^T + mu (n x d convention)
    return np.dot(X_standard, (eigvecs * (eigvals**0.5)).T) + mean
# Numerical example
n_sam = int(1e6)
n_dim = 3
C = generate_spsd_matrix(d=n_dim, random_seed=10)
data = sample_multivariate_normal_eig(mean=np.zeros((n_dim,)), cov=C, n=n_sam)
print("True covariance matrix:\n{}\nEmpirical (sample) covariance matrix:\n{}"
.format(C, np.cov(data, rowvar=False)))
Numpy implements the numpy.random.multivariate_normal()
function that can be used to generate samples from a multivariate normal distribution with given mean vector and covariance matrix. You are encouraged to use such built-in functions whenever available, as they will most likely be highly optimised and bug-free. Nevertheless, it is sometimes very useful to know what these functions do under the hood.
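For instance, a quick check with the built-in sampler, reusing the covariance matrix C and the sample size from the example above (illustrative only):
# Built-in alternative to the function written above
data_builtin = np.random.multivariate_normal(mean=np.zeros((n_dim,)), cov=C, size=n_sam)
print(np.cov(data_builtin, rowvar=False))  # again close to the true covariance matrix C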
You are provided with the following function generate_gaussian_data()
that can be used to generate a multivariate gaussian dataset from a given mean and covariance. When the mean and covariance are not defined, they are generated at random. The random_seed
parameter can be used to ensure reproducible results. The function returns a tuple containing three items: the dataset, the true mean, and the true covariance matrix of the probability distribution the data were sampled from. Execute the cell below to load this function.
def generate_gaussian_data(n_samples, n_features=None, mu=None, cov=None, random_seed=None):
    """Generates a multivariate gaussian dataset.

    Parameters
    ----------
    n_samples : integer
        Number of samples.
    n_features : integer
        Number of dimensions (features).
    mu : array, optional (default random), shape (n_features,)
        Mean vector of normal distribution.
    cov : array, optional (default random), shape (n_features,n_features)
        Covariance matrix of normal distribution.
    random_seed : integer
        Random seed.

    Returns
    -------
    x : array, shape (n_samples, n_features)
        Data matrix arranged in rows (i.e.
        columns correspond to features and
        rows to observations).
    mu : array, shape (n_features,)
        Mean vector of normal distribution.
    cov : array, shape (n_features,n_features)
        Covariance matrix of normal distribution.

    Raises
    ------
    ValueError when the shapes of mu and cov are not
    compatible with n_features.
    """
    if random_seed is not None:
        np.random.seed(random_seed)
    if mu is None:
        mu = np.random.randn(n_features,)
    else:
        if n_features is None:
            n_features = mu.shape[0]
        else:
            if mu.shape[0] != n_features:
                raise ValueError("Shape mismatch between mean and number of features.")
    if cov is None:
        cov = generate_spsd_matrix(n_features, random_seed=random_seed)
    else:
        if (cov.shape[0] != n_features) or (cov.shape[1] != n_features):
            raise ValueError("Shape mismatch between covariance and number of features.")
    x = np.random.multivariate_normal(mu, cov, n_samples)
    return (x, mu, cov)
By using the function provided above, generate a 2-dimensional gaussian dataset consisting of 1000 samples. Use a zero mean and the following covariance matrix:
$$ \mathbf{C} = \left( \begin{array}{cc} 1 & 0.3 \\ 0.3 & 2 \\ \end{array} \right) $$
Print the empirical mean, covariance and correlation matrices. You should use numpy built-in functions for this purpose, but you are also free to implement your own functions if you are keen.
Finally, use the seaborn jointplot()
function to produce a joint scatter plot of the two variables. This function also shows the marginal histograms on the top and right hand sides of the plot. Label axes appropriately.
# Your code goes here
# By using numpy functions
mu = np.zeros((2,))
cov = np.array([[1, 0.3], [0.3, 2]])
x_2d, _, _ = generate_gaussian_data(n_samples=1000, mu=mu, cov=cov, random_seed=10)
print("Estimated mean:\n{}".format(np.mean(x_2d, axis=0)))
print("Estimated covariance matrix:\n{}".format(np.cov(x_2d, rowvar=False)))
print("Estimated correlation matrix:\n{}".format(np.corrcoef(x_2d, rowvar=False)))
g = sns.jointplot(x_2d[:,0],x_2d[:,1], size=4, stat_func=None, ratio=5, joint_kws={"s" : 1}, color='grey')
g.set_axis_labels("Dim 1", "Dim 2")
plt.show()
# By using custom-written functions
def data_mean(x):
    x = np.asarray(x)
    if x.ndim != 2:  # Check that data are 2D
        raise ValueError('Array x must be 2-dimensional.')
    n = x.shape[0]  # Number of samples
    return np.sum(x, axis=0) / n

def data_cov(x):
    x = np.asarray(x)
    if x.ndim != 2:  # Check that data are 2D
        raise ValueError('Array x must be 2-dimensional.')
    n = x.shape[0]       # Number of samples
    mean = data_mean(x)  # Sample mean
    return np.dot((x - mean).T, (x - mean)) / (n - 1)

def data_corr(x):
    cov = data_cov(x)
    d = x.shape[1]  # Number of dimensions
    corr = np.zeros_like(cov)
    for dim_i in range(d):
        for dim_j in range(d):
            corr[dim_i, dim_j] = cov[dim_i, dim_j] / np.sqrt(cov[dim_i, dim_i] * cov[dim_j, dim_j])
    return corr

mu = np.zeros((2,))
cov = np.array([[1, 0.3], [0.3, 2]])
x_2d, _, _ = generate_gaussian_data(n_samples=1000, mu=mu, cov=cov, random_seed=10)
print("Estimated mean:\n{}".format(data_mean(x_2d)))
print("Estimated covariance matrix:\n{}".format(data_cov(x_2d)))
print("Estimated correlation matrix:\n{}".format(data_corr(x_2d)))
Repeat the same procedure as above, but now modify the covariance matrix such that the true correlation coefficient between the two variables is 0.6 while the variances stay unchanged.
Visualise the jointplot()
of the new dataset and compare to the one from the previous question. Do your observations agree with the modification you just applied to the covariance matrix?
# Your code goes here
rho_prime = 0.6 # New correlation coefficient
cov_prime = np.copy(cov) # New covariance matrix (copy and modify below)
cov_prime[0,1] = np.sqrt(cov[0,0]*cov[1,1])*rho_prime # rho = cov(x1,x2) / (std(x1)*std(x2))
cov_prime[1,0] = cov_prime[0,1]
if is_positive_semi_definite(cov_prime):  # Double-check that the new covariance matrix is positive semi-definite
    print("Original (true) covariance matrix:\n{}".format(cov))
    print("Modified (true) covariance matrix:\n{}".format(cov_prime))
    x_2d_prime, _, _ = generate_gaussian_data(n_samples=1000, mu=mu, cov=cov_prime, random_seed=10)
    print("\nEstimated mean:\n{}".format(np.mean(x_2d_prime, axis=0)))
    print("Estimated covariance matrix:\n{}".format(np.cov(x_2d_prime, rowvar=False)))
    print("Estimated correlation matrix:\n{}".format(np.corrcoef(x_2d_prime, rowvar=False)))
    g = sns.jointplot(x_2d_prime[:,0], x_2d_prime[:,1], size=4, stat_func=None, ratio=5, joint_kws={"s" : 1}, color='grey')
    g.set_axis_labels("Dim 1", "Dim 2")
    plt.show()
else:
    print("Modified matrix is not positive semi-definite.")
Let us now dive into PCA.
For this section, we will use a 3-dimensional Gaussian dataset. Execute the cell below to generate the dataset and print the true mean and covariance matrix of the distribution the data was sampled from.
# Generates a 3D dataset and prints true mean and covariance
x_3d, mu_true, C_true = generate_gaussian_data(n_samples=1000, n_features=3, random_seed=10)
print("Dataset consists of {} samples and {} features.".format(x_3d.shape[0], x_3d.shape[1]))
print("True mean:\n{}".format(mu_true))
print("True covariance matrix:\n{}".format(C_true))
Estimate the empirical mean and covariance matrix from this dataset and save them in variables mu_est
and C_est
respectively.
# Your code goes here
mu_est = np.mean(x_3d, axis=0)
C_est = np.cov(x_3d, rowvar=False)
print("Estimated mean:\n{}".format(mu_est))
print("Estimated covariance matrix:\n{}".format(C_est))
Sklearn offers a class implementation of pca
. Please spend a minute to read the documentation of this class. The principal component (PC) directions of a dataset are computed by using the fit()
method and stored into the components_
attribute. The PC scores can be computed by using the transform()
method. The amount of variance explained by each of the selected components is stored into the explained_variance_
attribute.
Please note, this method centers the input data (i.e. removes the mean) before computing the PC directions.
Initialise a pca
object and "fit" it using the dataset x_3d
. Print the three PC directions. Store the PC scores for x_3d
in an array called pc_scores
.
Hint: according to the documentation the components are stored row-wise, i.e. the components_ array
has shape [n_components, n_features]
. You should print the PC directions column-wise, i.e. take the transpose of the components_
array.
# Your code goes here
from sklearn.decomposition import PCA
pca = PCA().fit(x_3d)
pc_scores = pca.transform(x_3d)
print(pca.components_.T)
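As a quick check of the centering mentioned above (pca.mean_ holds the mean removed by fit()), transform() simply projects the mean-centered data onto components_:
# PC scores = (X - mean) projected onto the PC directions
scores_manual = np.dot(x_3d - pca.mean_, pca.components_.T)
print(np.allclose(pc_scores, scores_manual))  # True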
Most often, we do not want to compute all PC directions, but only a few (i.e. dimensionality reduction). We can define the desired number of PCs by setting the n_components
parameter appropriately when we initialise the pca
object.
Initialise a pca_new
object with 2 PCs and "fit" it using the dataset x_3d
. Print the two PC directions.
Hint: the 2 PC directions should be the same as the first 2 directions you computed in the previous question. The reason for this is ultimately that PCA by sequential and simultaneous variance maximisation give the same result.
# Your code goes here
pca_new = PCA(n_components=2).fit(x_3d)
print(pca_new.components_.T)
Now we want to implement PCA from scratch. The procedure can be summarised as follows:
1. Center the data by subtracting the (empirical) mean.
2. Compute the covariance matrix of the centered data.
3. Compute the eigendecomposition of the covariance matrix.
4. Sort the eigenvalues in descending order and keep the k eigenvectors corresponding to the k largest eigenvalues (k < d).
5. Project the centered data onto the retained eigenvectors to obtain the PC scores.
(Regarding 3 and 4: some algorithms for eigendecompositions allow you to specify the number of eigenvectors and eigenvalues that should be computed. You then do not have to compute the complete eigendecomposition, which is wasteful if you are only interested in a few principal components.)
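For instance, a minimal sketch of such a partial eigendecomposition (this assumes SciPy >= 1.5, where scipy.linalg.eigh accepts a subset_by_index argument; older versions expose the same functionality through the eigvals argument):
from scipy.linalg import eigh as scipy_eigh
d = C_est.shape[0]
# Compute only the two largest eigenvalues/eigenvectors of the estimated covariance matrix
vals_top2, vecs_top2 = scipy_eigh(C_est, subset_by_index=[d - 2, d - 1])
print(vals_top2[::-1])     # the two largest eigenvalues, in descending order
print(vecs_top2[:, ::-1])  # the corresponding eigenvectors (leading PC directions as columns)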
Compute and print all three PC directions in the dataset x_3d
by using the procedure described above. Hopefully, you will have already performed the first two steps in question 6. Then compute the PC scores for the centered data matrix.
As happens very often when writing code, it is likely that there will be a few bugs in your implementation. To double-check that your code is correct, compare the computed PC directions and scores to the ones achieved in question 7 with scikit-learn.
Hint: you might (or might not) find that some of the PC directions/scores you have computed have opposite signs to the ones returned by the sklearn implementation. Do not worry about this, the two solutions are equivalent (why?). To make debugging easier, you are provided with the following function, solutions_equivalent()
which tests whether two solutions are equivalent, regardless of their signs. Execute the following cell to load this function.
def solutions_equivalent(b1, b2):
    """Checks whether two PC directions/scores
    solutions are equivalent regardless of their
    respective signs.

    Parameters
    ----------
    b1 : array, shape (n_axes,)
        First solution.
    b2 : array, shape (n_axes,)
        Second solution.

    Returns
    -------
    True if solutions are equivalent.

    Raises
    ------
    ValueError if the two solutions do not have
    the same dimensionality.
    """
    s1 = np.asarray(b1)
    s2 = np.asarray(b2)
    if s1.shape != s2.shape:
        raise ValueError("Solutions must have the same dimensionality.")
    dims_equal = []
    for dim in range(s1.shape[1]):
        if np.allclose(s1[:,dim], s2[:,dim]) or np.allclose(s1[:,dim], -s2[:,dim]):
            dims_equal.append(True)
        else:
            dims_equal.append(False)
    return all(dims_equal)
# Your code goes here
def sort_eigvals_eigvec(eigvals, eigvecs):
    """Sorts eigenvalues and eigenvectors
    in eigenvalue descending order."""
    order = np.argsort(eigvals)[::-1]
    eigvals_sorted = eigvals[order]
    eigvecs_sorted = eigvecs[:,order]
    return eigvals_sorted, eigvecs_sorted

x_3d_centered = x_3d - mu_est
# Eigendecomposition of the estimated covariance matrix
eigvals, eigvecs = np.linalg.eigh(C_est)
eigvals_sorted, eigvecs_sorted = sort_eigvals_eigvec(eigvals, eigvecs)
my_pca_1_directions = eigvecs_sorted
my_pca_1_scores = np.dot(x_3d_centered, my_pca_1_directions)
# Tests
print("Principal components:\n{}".format(my_pca_1_directions))
print(solutions_equivalent(pca.components_.T, my_pca_1_directions))  # PC directions
print(solutions_equivalent(pc_scores, my_pca_1_scores))  # PC scores
The SVD of the centered data matrix is: $$\mathbf{X} = \mathbf{V} \mathbf{S} \mathbf{U}^T $$ where the columns of $\mathbf{V} \in \mathbb{R}^{n\times n}$ and $\mathbf{U} \in \mathbb{R}^{d\times d}$ contain the left- and right-singular vectors of $\mathbf{X}$, respectively, and $\mathbf{S} \in \mathbb{R}^{n\times d}$ is a rectangular diagonal matrix containing its singular values.
Compute the PC directions and scores in the dataset x_3d
by using the SVD decomposition of the centered data matrix (cf. lecture notes). Double-check that these are correct by using the solutions_equivalent()
function.
Hint 1: make sure you center the data before computing the SVD of x_3d
.
Hint 2: Remember, we are dealing with n
x d
data matrices here, as opposed to d
x n
matrices in the lecture notes (that is why we have written $\mathbf{X} = \mathbf{V} \mathbf{S} \mathbf{U}^T$ here, as opposed to $\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^T$ in the lecture notes). Which singular vectors of the centered data matrix correspond to the PC directions/scores in this case?
# Your code goes here
# Since we are dealing with the n x d matrices
# the right singular vectors correspond to PC directions
# and the left singular vectors correspond to PC scores
x_3d_centered = x_3d - mu_est
v_x, s_x, u_x_T = np.linalg.svd(x_3d_centered, full_matrices=False)
u_x = u_x_T.T # Recover U from its transpose that was returned by svd
my_pca_2_directions = u_x
my_pca_2_scores = v_x * s_x
# Tests
print("Principal components:\n{}".format(my_pca_2_directions))
print(solutions_equivalent(pca.components_.T, my_pca_2_directions)) # PC directions
print(solutions_equivalent(pc_scores, my_pca_2_scores)) # PC scores
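A useful side note: since $\mathbf{X}^T \mathbf{X} = \mathbf{U} \mathbf{S}^T \mathbf{S} \mathbf{U}^T$ and the sample covariance of the centered data is $\mathbf{X}^T \mathbf{X} / (n-1)$, the squared singular values divided by $n-1$ coincide with the eigenvalues of the empirical covariance matrix. A quick check with the quantities computed above:
# Squared singular values (scaled by 1/(n-1)) equal the covariance eigenvalues
n_samples_3d = x_3d_centered.shape[0]
print(np.allclose(s_x**2 / (n_samples_3d - 1), eigvals_sorted))  # True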
Assume now that you want to do dimensionality reduction by using k
PCs only. One option would be to follow the same procedure as above and discard the last d-k
PC directions. This, however, can be rather wasteful. There exist SVD variations that approximate the matrix $\mathbf{X}$ with a low-rank matrix.
Scipy implements the function svds
that computes only the k
largest singular values/vectors of a given matrix.
Compute and print the first 2 PC directions by using the svds
decomposition of the centered data matrix.
Hint: as opposed to the numpy function numpy.linalg.svd
, svds
does not automatically sort the singular values/vectors in singular value descending order. You should do this operation manually.
from scipy.sparse.linalg import svds
# Your code goes here
v_x_k, s_x_k, u_x_k_T = svds(x_3d_centered, k=2, which='LM')
u_x_k = u_x_k_T.T # Recover U from its transpose that was returned by svds
_, u_x_k_sorted = sort_eigvals_eigvec(s_x_k, u_x_k) # Sort by singular value descending order
my_pca_low_rank_directions = u_x_k_sorted
# Tests
print("Principal components:\n{}".format(my_pca_low_rank_directions))
print(solutions_equivalent(pca.components_.T[:,:2], my_pca_low_rank_directions)) # PC directions
We have seen in the lectures that we can compute the PC directions and scores by performing the eigendecomposition of the Gram matrix $\mathbf{G}$.
The procedure for computing the PC directions and scores based on the eigendecomposition of the Gram matrix is as follows:
1. Center the data by subtracting the (empirical) mean.
2. Compute the Gram matrix $\mathbf{G} = \mathbf{X} \mathbf{X}^T$ of the centered data matrix.
3. Compute the eigendecomposition of the Gram matrix and sort the eigenvalues/eigenvectors in eigenvalue descending order.
4. Keep only the first k eigenvectors. This is required even if we do not want to perform dimensionality reduction, since the eigendecomposition of the Gram matrix will yield n eigenvectors.
5. Obtain the PC scores by scaling the retained eigenvectors by the square roots of the corresponding eigenvalues, and the PC directions by projecting $\mathbf{X}^T$ onto the retained eigenvectors and scaling by the inverse square roots of the corresponding eigenvalues.
Compute the PC directions and scores in the dataset x_3d
by using the eigendecomposition of the Gram matrix. Double-check that these are correct by using the solutions_equivalent()
function.
# Your code goes here
def data_gram(x):
    x = np.asarray(x)
    if x.ndim != 2:  # Check that data are 2D
        raise ValueError('Array x must be 2-dimensional.')
    return np.dot(x, x.T)  # Gram matrix G = X X^T

n_dim = x_3d.shape[1]
K = data_gram(x_3d_centered)
eigvals_k, eigvecs_k = np.linalg.eigh(K)
eigvals_k_sorted, eigvecs_k_sorted = sort_eigvals_eigvec(eigvals_k, eigvecs_k)
my_pca_3_scores = eigvecs_k_sorted[:,:n_dim] * eigvals_k_sorted[:n_dim]**(0.5)
my_pca_3_directions = np.dot(x_3d_centered.T, eigvecs_k_sorted[:,:n_dim]) * eigvals_k_sorted[:n_dim]**(-0.5)
# Tests
print("Principal components:\n{}".format(my_pca_3_directions))
print(solutions_equivalent(pca.components_.T, my_pca_3_directions))  # PC directions
print(solutions_equivalent(pc_scores, my_pca_3_scores))  # PC scores
Can you think of a particular case where it would be beneficial to compute the PC scores by using the eigendecomposition of the Gram matrix, as opposed to SVD decomposition of the data matrix?
Your answer goes here
One particular case where we would use the eigendecomposition of the Gram matrix is when we only have access to that matrix and not to the data matrix itself. In this case, it is still possible to identify the PC scores and hence do dimensionality reduction, as we will see in Lab 3.
In addition, by working directly with the Gram matrix, we can use the kernel trick and identify non-linear manifolds by using almost the same bit of code. This is exactly what kernel principal component analysis (kPCA) does, but more on this in the next lab.
PCA can be used to reduce the dimensionality of a dataset from $d$ to $k$.
Use any implementation of PCA of your choice to project the dataset x_3d
onto 1, 2, and 3 dimensions. This projection should just take the form $$\mathbf{X}_{red} = \mathbf{X} \mathbf{W}_{1:k}$$ where $\mathbf{X}$ is the centered data matrix and $\mathbf{W}$ is the matrix containing the PC directions as columns. The approximation of the data by the $k$ principal components is $$\mathbf{X}_{recon} = \mathbf{X}_{red} \mathbf{W}_{1:k}^T + \mathbf{\mu}$$
Plot the mean-squared-error and explained variance of the approximation as a function of the number of components used. Label axes appropriately. You should make use of the sklearn mean_squared_error()
and explained_variance_score()
metrics. For explained_variance_score()
, you should set the multioutput
parameter to variance_weighted
.
# Your code goes here
from sklearn.metrics import mean_squared_error, explained_variance_score
ev = []
mse = []
num_components = range(1,4)
for k in num_components:
    x_red = np.dot(x_3d_centered, my_pca_1_directions[:,:k])
    x_recon = np.dot(x_red, my_pca_1_directions[:,:k].T) + mu_est
    mse.append(mean_squared_error(x_3d, x_recon))
    ev.append(explained_variance_score(x_3d, x_recon, multioutput='variance_weighted'))
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
for metric, ax, label in zip([ev, mse], [ax1, ax2], ["explained variance", "mean squared error"]):
    ax.scatter(num_components, metric, color='k', s=30)
    ax.plot(num_components, metric, linestyle='--', color='k')
    ax.set_ylabel(label)
ax2.set_xlabel('Number of components')
ax2.set_xticks(num_components)
ax1.set_ylim(top=1.01)
ax2.set_ylim(bottom=-0.01)
plt.show()
# We can also visualise the original data and the approximation with a 3D plot.
# NB: you were not required to do this.
x_orig = x_3d
x_orig_centered = x_3d - mu_est
n_samples = 500 # Number of data points to plot
from mpl_toolkits.mplot3d import Axes3D
s = 10 # Marker size for scatter plot
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x_orig[:n_samples,0], x_orig[:n_samples,1], x_orig[:n_samples,2],
           c='b', s=s, marker='o', label='original')
for (k, c, marker, label) in zip(np.arange(1,3), ['k', 'r'], ['^', '*'], ['approximation, 1 PC', 'approximation, 2 PCs']):
    x_proj = np.dot(x_orig_centered, my_pca_1_directions[:,:k])   # Project the centered data onto the first k PCs
    x_rec = np.dot(x_proj, my_pca_1_directions[:,:k].T) + mu_est  # Reconstruct in the original space
    ax.scatter(x_rec[:n_samples,0], x_rec[:n_samples,1], x_rec[:n_samples,2],
               c=c, s=s, marker=marker, label=label)
ax.set_xlabel('Dim 1')
ax.set_ylabel('Dim 2')
ax.set_zlabel('Dim 3')
ax.azim = 400
ax.elev = 30 # Add a small elevation
plt.legend()
plt.show()
The black points lie on a line (the 1st PC direction). The red points lie in a 2D plane (spanned by the first two PC directions).
How can you compute the variance explained by k principal components by only looking at the eigenvalues of the covariance matrix?
Your answer goes here
The fraction of variance explained by the first k principal components is the cumulative sum of the k largest eigenvalues of the covariance matrix divided by the sum of all eigenvalues.
# Your code goes here
print("Cumulative sum of eigenvalues:\n{}".format(np.cumsum(eigvals_sorted)/np.sum(eigvals_sorted)))
print("Explained variance:\n{}".format(ev))
In the lectures, we have seen that the SVD allows us to find a low-rank approximation of the data matrix. Here, we exemplify this low-rank approximation property of the SVD on an image compression task.
As you might already know, grey-scale images are represented in the digital world as 2D matrices whose elements correspond to pixel intensities. Here, we approximate such a matrix by a low-rank approximation through the SVD. If there are correlations between the pixels in the image (which happens to be the case for natural images), then we should be able to achieve a relatively good reconstruction of the image by using only a few components.
Let us first load a sample image from the scipy package:
# Load sample image
from scipy.misc import face
img = face(gray=True)
print("Image array dimensionality: {}".format(img.shape))
We can visualise the image by using the matplotlib imshow function:
# Show image
sns.set_style("white")
plt.figure()
plt.imshow(img, cmap=plt.cm.gray)
plt.axis("off")
plt.show()
Write a function image_low_rank_approx() that takes as input an image (i.e. 2-dimensional array) and an integer k and reconstructs the image by using a k-rank SVD approximation.
# Your code goes here
def image_low_rank_approx(img, k):
    """Approximates a two-dimensional array by low-rank SVD.

    Parameters
    ----------
    img : array, ndim = 2
        Image array.
    k : integer
        Rank of approximation.

    Returns
    -------
    img_recon : array, ndim = 2
        Reconstructed image.
    """
    if img.ndim != 2:
        raise ValueError("Image array should be two-dimensional.")
    u, s, v = np.linalg.svd(img, full_matrices=False)
    img_recon = np.dot(u[:,:k]*s[:k], v[:k,:])  # Low-rank approximation
    return img_recon
Perform a low-rank approximation of the image stored in img by using a varying number of ranks (i.e. from 1 to 500) and visualise the approximation. What do you observe?
# Your code goes here
k_range = [1, 5, 10, 20, 50, 100, 200, 500]
f = plt.figure(figsize=(10, 10))
for ii, k in enumerate(k_range):
    recon = image_low_rank_approx(img, k)
    plt.subplot(4, 2, ii + 1)
    plt.imshow(recon, cmap=plt.cm.gray)
    plt.axis('off')
    plt.title("Approximation with {} singular vector(s)".format(k), fontsize=9)
f.tight_layout()
Your answer goes here.
We observe that a good approximation of the image can be achieved by using as few as 50 singular vectors and values.
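As a rough, purely illustrative follow-up, we can compare the storage needed by the rank-$k$ factors against the full image: keeping the first $k$ left singular vectors, singular values and right singular vectors requires $k(m + n + 1)$ numbers for an $m \times n$ image, instead of $mn$.
# Storage comparison (number of stored values) for a rank-k approximation
m, n = img.shape
k = 50
print("Full image: {} values".format(m * n))
print("Rank-{} factors: {} values".format(k, k * (m + n + 1)))
print("Compression factor: {:.1f}x".format(float(m * n) / (k * (m + n + 1))))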