Dimensionality Reduction with PCA

Dimensionality Reduction with PCA#

Dimensionality Reduction#

While working with data, it is common to have access to very high-dimensional unstructured informations (e.g. images, sounds, …). To work with them, it is necessary to find a way to project them into a low-dimensional space where data which is semantically similar is close. This approach is called dimensionality reduction.

For example, assume our data can be stored in an \(d \times N\) array,

\[ X = [ x^1 x^2 \dots x^N ] \in \mathbb{R}^{d \times N} \]

where each datapoint \(x^j \in \mathbb{R}^d\). The idea of dimensionality reduction techniques in ML is to find a projector operator \(P: \mathbb{R}^d \to \mathbb{R}^k\), with \(k \ll d\), such that in the projected space \(P(x)\), images semantically similar are close together. If the points in a projected space forms isolated popoulations such that inside of each popoulation the points are close, while the distance between popoulations is large, we call them clusters. A clusering algorithm is an algorithm which is able to find clusters from high-dimensional data.

Principal Component Analysis (PCA)#

Principal Componenti Analyisis (PCA) is probabily the simplest yet effective technique to perform dimensionality reduction and clustering. It is an unsupervised algorithm, thus it does not require any label.

The idea is the following: consider a dataset \(X \in \mathbb{R}^{d \times N}\) of high-dimensional data and assume we want to project it into a low-dimensional space \(\mathbb{R}^k\). Define:

\[ Z = [z^1 z^2 \dots z^N] \in \mathbb{R}^{k \times N} \]

the projected version of \(X\). We want to find a matrix \(P \in \mathbb{R}^{k \times d}\) such that \(Z = PX\), with the constraint that in the projected space we want to keep as much information as possible from the original data \(X\).

You already studied that, when you want to project a matrix by keeping informations, a good idea is to use the Singular Value Decomposition (SVD) of it and, in particular, the Truncated SVD (TSVD). Let \(X \in \mathbb{R}^{d \times N}\), then

\[ X = U \Sigma V^T \]

is the SVD of \(X\), where \(U \in \mathbb{R}^{d \times d}\), \(V \in \mathbb{R}^{N \times N}\) are orthogonal matrices (\(U^T U = U U^T = I\) and \(V V^T = V^T V = I\)), while \(\Sigma \in \mathbb{R}^{d \times N}\) is a diagonal matrix whose diagonal elements \(\sigma_i\) are the singular values of \(X\), in decreasing order (\(\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_d\)). Since the singular values represents the quantity of informations contained in the corresponding singular vectors, keeping the first \(k\) singular values and vectors can be the solution to our projection problem. Indeed, given \(k < d\), we define the Truncated SVD of \(X\) as

\[ X_k = U_k \Sigma_k V^T_k \]

where \(U_k \in \mathbb{R}^{d \times k}\), \(\Sigma_k \in \mathbb{R}^{k \times k}\), and \(V_k \in \mathbb{R}^{k \times N}\).

The PCA use this idea and defines the projection matrix as \(P = U_k^T\), and consequently,

\[ Z = U_k^T X \]

is the projected space. Here, the columns of \(U_k\) are called feature vectors, while the columns of \(Z\) are the principal components of \(X\).

Implementation#

To implement PCA, we first need to center the data. This can be done by defining its centroid.

Given a set \(X = [x^1 x^2 \dots x^N]\), its centroid is defined as \( c(X) = \frac{1}{N} \sum_{i=1}^N x^i\).

Thus, the implementation of PCA is as follows:

Consider the dataset \(X\);
Compute the centered version of \(X\) as \(X_c = X - c(X)\), where the subtraction between matrix and vector is executed column-by-column;
Compute the SVD of \(X_c\), \(X_c = U\Sigma V^T\);
Given \(k < n\), compute the Truncated SVD of \(X_c\): \(X_{c, k} = U_k \Sigma_k V_k^T\);
Compute the projected dataset \(Z_k = U_k^T X_c\);

Python example#

In the following, we consider as an example the MNIST dataset, which can be download from Kaggle (kaggle.com/datasets/animatronbot/mnist-digit-recognizer). For simplicity, I renamed it as data.csv and I placed it a folder named data into the current project folder.

import numpy as np
import pandas as pd

# Load data into memory
data = pd.read_csv('./data/data.csv')

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[1], line 5
import pandas as pd
# Load data into memory
----> 5 data = pd.read_csv('./data/data.csv')

File /opt/hostedtoolcache/Python/3.11.13/x64/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
kwds_defaults = _refine_defaults_read(
   dialect,
   delimiter,
   (...)   1022     dtype_backend=dtype_backend,
)
kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)

File /opt/hostedtoolcache/Python/3.11.13/x64/lib/python3.11/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
_validate_names(kwds.get("names", None))
# Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
if chunksize or iterator:
   return parser

File /opt/hostedtoolcache/Python/3.11.13/x64/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
   self.options["has_index_names"] = kwds["has_index_names"]
self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)

File /opt/hostedtoolcache/Python/3.11.13/x64/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
   if "b" not in mode:
       mode += "b"
-> 1880 self.handles = get_handle(
   f,
   mode,
   encoding=self.options.get("encoding", None),
   compression=self.options.get("compression", None),
   memory_map=self.options.get("memory_map", False),
   is_text=is_text,
   errors=self.options.get("encoding_errors", "strict"),
   storage_options=self.options.get("storage_options", None),
)
assert self.handles is not None
f = self.handles.handle

File /opt/hostedtoolcache/Python/3.11.13/x64/lib/python3.11/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
elif isinstance(handle, str):
   # Check whether the filename is to be opened in binary mode.
   # Binary mode does not support 'encoding' and 'newline'.
   if ioargs.encoding and "b" not in ioargs.mode:
       # Encoding
--> 873         handle = open(
           handle,
           ioargs.mode,
           encoding=ioargs.encoding,
           errors=errors,
           newline="",
       )
   else:
       # Binary mode
       handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: './data/data.csv'

After that, it is important to inspect the data, i.e. to look at its structure and understand how it is distributed. This can be done either by reading at the documentation of the website where the data has been downloaded or by using the pandas method .head().

# Inspect the data
print(f"Shape of the data: {data.shape}")
print("")
print(data.head())

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[2], line 2
      1 # Inspect the data
----> 2 print(f"Shape of the data: {data.shape}")
      3 print("")
      4 print(data.head())

NameError: name 'data' is not defined

which prints out all the columns of data and the first 5 rows of each column. With this command, we realize that our dataset is a \(42000 \times 785\) frame, where the columns from the second to the last are the pixels of an image representing an handwritten digit, while the first column is the target, i.e. the integer describing the represented digit.

# Convert data into a matrix
data = np.array(data)

# Split data into a matrix X and a vector Y where:
#
# X is dimension (42000, 784)
# Y is dimension (42000, )
# Y is the first column of data, while X is the rest
X = data[:, 1:]
X = X.T

Y = data[:, 0]

print(X.shape, Y.shape)

d, N = X.shape

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 2
      1 # Convert data into a matrix
----> 2 data = np.array(data)
      4 # Split data into a matrix X and a vector Y where:
      5 #
      6 # X is dimension (42000, 784)
      7 # Y is dimension (42000, )
      8 # Y is the first column of data, while X is the rest
      9 X = data[:, 1:]

NameError: name 'data' is not defined

Here, we convert the dataframe into a matrix with numpy, and then we split the input matrix \(X\) and the corresponding target vector \(Y\). Finally, we note that \(X\) is an \(N \times d\) matrix, where \(N = 42000\) and \(d = 784\). Since in our notations the shape of \(X\) should be \(d \times N\), we have to transpose it.

Visualizing the digits#

We already said that \(X\) is a dataset of images representing handwritten digits. We can clearly visualize some of them. In the documentation, we can read that each datapoint is a \(28 \times 28\) grey-scale image, which has been flattened. Flattening is the operation of taking a 2-dimensional array (a matrix) and converting it to a 1-dimensional array, by concatenating the rows of it. This can be implemented in numpy with the function a.flatten(), where a is a 2-dimensional numpy array.

Since we know that the dimension of each image was \(28 \times 28\) before flattening, we can invert this procedure by reshaping them. After that, we can simply visualize it with the function plt.imshow() from matplotlib, by setting the cmap to 'gray' since the images are grey-scale.

import matplotlib.pyplot as plt

def visualize(X, idx):
    # Visualize the image of index 'idx' from the dataset 'X'

    # Load an image in memory
    img = X[:, idx]
    
    # Reshape it
    img = np.reshape(img, (28, 28))

    # Visualize
    plt.imshow(img, cmap='gray')
    plt.show()

# Visualize image number 10 and the corresponding digit.
idx = 10
visualize(X, idx)
print(f"The associated digit is: {Y[idx]}")

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[4], line 18
     16 # Visualize image number 10 and the corresponding digit.
     17 idx = 10
---> 18 visualize(X, idx)
     19 print(f"The associated digit is: {Y[idx]}")

NameError: name 'X' is not defined

Splitting the dataset#

Before implementing the algorithm performing PCA, you are required to split the dataset into a training set and a test set. Remember that to correctly split \(X\), it has to be random.

def split_data(X, Y, N_train):
    d, N = X.shape

    idx = np.arange(N)
    np.random.shuffle(idx)

    train_idx = idx[:N_train]
    test_idx = idx[N_train:]

    X_train = X[:, train_idx]
    Y_train = Y[train_idx]
    
    X_test = X[:, test_idx]
    Y_test = Y[test_idx]

    return (X_train, Y_train), (X_test, Y_test)

# Test it
(X_train, Y_train), (X_test, Y_test) = split_data(X, Y, 30_000)

print(X_train.shape, X_test.shape)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[5], line 19
     16     return (X_train, Y_train), (X_test, Y_test)
     18 # Test it
---> 19 (X_train, Y_train), (X_test, Y_test) = split_data(X, Y, 30_000)
     21 print(X_train.shape, X_test.shape)

NameError: name 'X' is not defined

Now the dataset \((X, Y)\) is divided into the train and test components, and we can implement the PCA algorithm on \(X_{train}\). Remember to not access \(X_{test}\) during training.

# Compute centroid
cX = np.mean(X, axis=1)

# Make it a column vector
cX = np.reshape(cX, (d, 1))
print(cX.shape)

# Center the data
Xc = X - cX

# Compute SVD decomposition
U, s, VT = np.linalg.svd(Xc, full_matrices=False)

# Given k, compute reduced SVD
k = 2
Uk = U[:, :k]

# Define projection matrix
P = Uk.T

# Project X_train -> Z_train
Z_train = P @ X_train
print(Z_train.shape)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[6], line 2
      1 # Compute centroid
----> 2 cX = np.mean(X, axis=1)
      4 # Make it a column vector
      5 cX = np.reshape(cX, (d, 1))

NameError: name 'X' is not defined

Warning

When the full SVD decomposition is not required (as it is the case for PCA), one can compute the (reduced) SVD decomposition of a matrix \(X\) as np.linalg.svd(X, full_matrices=False), to save computation time.

Visualizing clusters#

When \(k=2\), it is possible to visualize clusters in Python. In particular, we want to plot the datapoints, with the color of the corresponding class, to check how well the clustering algorithm performed in 2-dimensions. This can be done by the matplotlib function plt.scatter. In particular, if \(Z_{train} = [z^1 z^2 \dots z^N] \in \mathbb{R}^{2 \times N}\) is the projected dataset and \(Y_{train} \in \mathbb{R}^N\) is the vector of the corresponding classes, then the \(Z_{train}\) can be visualized as:

# Visualize the clusters
ax = plt.scatter(Z_train[0, :], Z_train[1, :], c=Y_train)
plt.legend(*ax.legend_elements(), title="Digit") # Add to the legend the list of digits
plt.xlabel(r"$z_1$")
plt.ylabel(r"$z_2$")
plt.title("PCA projection of MNIST digits 0-9")
plt.grid()
plt.show()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[7], line 2
      1 # Visualize the clusters
----> 2 ax = plt.scatter(Z_train[0, :], Z_train[1, :], c=Y_train)
      3 plt.legend(*ax.legend_elements(), title="Digit") # Add to the legend the list of digits
      4 plt.xlabel(r"$z_1$")

NameError: name 'Z_train' is not defined

Due to the high number of represented digits, a few of them appear to be overlapped. This is due to \(k=2\) being too low, and PCA being too simple to be able to capture the complexity of such a large number of digits.

However, a few digits, such as numbers 0 and 1, appear to be well-separated, showing that the algorithm is performing well.

Filtering digits#

To simplify the dimensionality reduction task provided by PCA, we can consider filtering out just a few digits from MNIST dataset. This can be simply done by defining a boolean ndarray whose value in position \(i\) is True if and only if the digit in the corresponding index represents one of the selected digit.

In the following, we show how this can be implemented for the digits \(3\) and \(4\).

# Define the boolean array to filter out digits
filter_3or4 = (Y==3) | (Y==4)

# Define the filtered data
X_3or4 = X[:, filter_3or4]
Y_3or4 = Y[filter_3or4]

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[8], line 2
      1 # Define the boolean array to filter out digits
----> 2 filter_3or4 = (Y==3) | (Y==4)
      4 # Define the filtered data
      5 X_3or4 = X[:, filter_3or4]

NameError: name 'Y' is not defined

Exercise: Repeat the PCA analysis for \(k=2\) on the filtered dataset. Remember the train_test_split().

Dimensionality Reduction with PCA

Contents

Dimensionality Reduction with PCA#