Dimensionality reduction methods

Author

Payam Emami

Published

September 19, 2024

Introduction

In this chapter we are going to learn about a few commonly used dimensionality reduction methods in life science research. We will first define what dimensionality reduction is, and why and when we should use it. We will then learn how to use these methods and how they work.

Dimensionality reduction

Life science is full of different data types. While much of this data is in tabular form (or is reduced to such a form), other data can be images, continuous signals, sequences, or even text, each requiring specialized techniques for analysis and interpretation. What seems to be common, though, is that we keep measuring more and more variables. By ‘variable’, I mean any measurable characteristic or feature, such as gene expression levels, metabolite concentrations, physiological signals, environmental factors and so on. In a dataset, each of these variables is called a dimension. For example, if we have gene expression data, each gene would represent a dimension, with its expression level being the value measured across different samples. As we add more genes or other molecular features, the number of dimensions grows, leading to what is known as high-dimensional data. All these dimensions together reveal patterns and relationships among the samples, helping us understand how they behave or differ from one another.
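To make the idea of a ‘dimension’ concrete, here is a minimal sketch in Python (with purely simulated numbers, not real measurements) of a gene expression matrix in which rows are samples and columns are genes; each gene is one dimension of the data.

```python
import numpy as np

# Simulated gene expression matrix: 50 samples measured across 1,000 genes.
# Each column (gene) is one dimension; every sample is a point in a
# 1,000-dimensional space.
rng = np.random.default_rng(seed=0)
expression = rng.normal(loc=8.0, scale=1.5, size=(50, 1000))

n_samples, n_dimensions = expression.shape
print(n_samples)     # 50 samples
print(n_dimensions)  # 1000 dimensions (one per gene)
```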

Dimensionality reduction is a way to simplify high-dimensional data by reducing the number of variables while preserving the essential patterns and structures. This makes the data easier to visualize and interpret while still capturing the key relationships between samples. It is important to note that most methods do not actively remove original variables (although some approaches, such as feature selection, do); rather, they transform the data into a lower-dimensional space. This transformation retains the most important information (from the perspective of the algorithm), capturing the underlying patterns and relationships without losing critical details. So, for example, instead of having thousands of genes showing the original pattern, after such a transformation we might have just a few new variables (e.g. components) that summarize the key information in the data. One question, though, is if we’ve spent time and money measuring many variables, why would we want to reduce this number? Does it even make sense?
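As a rough illustration of such a transformation (using scikit-learn's PCA only as a convenient example, not as a recommendation of a specific method or settings), the sketch below compresses a simulated 1,000-gene matrix, regenerated here so the snippet is self-contained, into two components; each component is a weighted combination of all the original genes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=0)
expression = rng.normal(loc=8.0, scale=1.5, size=(50, 1000))  # samples x genes

# Transform the 1,000-dimensional data into 2 summary variables (components).
pca = PCA(n_components=2)
scores = pca.fit_transform(expression)

print(expression.shape)  # (50, 1000) original space
print(scores.shape)      # (50, 2)    reduced space

# The original variables are not removed; each component is a weighted
# combination of all genes.
print(pca.components_.shape)  # (2, 1000) gene weights for each component
```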

Why reduce dimensions

There are many reasons why we might want to reduce data dimensions.

  1. Data Visualization: This is probably one of the most common reasons we do dimensionality reduction. High-dimensional data is difficult to visualize in its raw form. Dimensionality reduction techniques like PCA and t-SNE allow us to reduce this data to two or three dimensions. We can then plot the data and see trends, clusters, or outliers; a short sketch of this appears after the list. Obviously, data summarized in a couple of variables is easier for the human eye to comprehend.

  2. Removing Noise and Redundancy: Even though we have measured many variables, not all of them contribute equally to the information contained in the data. Some variables may be noisy, irrelevant, or redundant. Dimensionality reduction methods can help eliminate these less useful dimensions, giving us a cleaner, more informative dataset. Similarly, these methods can be used to adjust for unwanted trends in the data, such as batch effects.

  3. Uncovering Patterns and Trends: Having a lot of variables is not always good; often, the true underlying structure of the data is hidden within many dimensions. Dimensionality reduction helps to reveal the most important patterns and trends by summarizing the data through some form of combination of the raw variables, making it easier to detect relationships between samples and uncover valuable insights.

  4. Improving Model Performance: In machine learning, too many variables can lead to overfitting, where a model performs well on training data but poorly on unseen data. Dimensionality reduction can help prevent this by focusing on the most important features, improving the model’s generalizability and predictive performance.
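As mentioned in point 1 above, here is a minimal visualization sketch. The data are simulated, so the group structure is artificial: two groups of samples with slightly shifted expression profiles are reduced from 500 dimensions to two with PCA and then plotted.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Simulate two groups of samples with slightly shifted expression profiles.
rng = np.random.default_rng(seed=1)
group_a = rng.normal(loc=0.0, scale=1.0, size=(25, 500))
group_b = rng.normal(loc=0.5, scale=1.0, size=(25, 500))
expression = np.vstack([group_a, group_b])
labels = np.array(["A"] * 25 + ["B"] * 25)

# Reduce 500 dimensions to 2 for plotting.
scores = PCA(n_components=2).fit_transform(expression)

for group in ["A", "B"]:
    mask = labels == group
    plt.scatter(scores[mask, 0], scores[mask, 1], label=group)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```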

There might also be other reasons to reduce the dimension of the data. For example, working with large, high-dimensional datasets can be computationally expensive. Dimensionality reduction lowers the number of variables, which reduces the memory and processing power needed to analyze the data.

In general, and especially in omics data analysis, dimensionality reduction is performed at some point in most analysis workflows. Even when its results are not the main interest, they can still affect the overall decision-making process; examples include quality checks and outlier detection.

How dimensionality reduction works

Admittedly, the answer to this question is not so simple. There are many different approaches to dimensionality reduction, each with its own principles and techniques. Some methods, like Principal Component Analysis (PCA), focus on finding directions in the data that capture the most variance. Others, such as t-SNE and UMAP, are more concerned with preserving the local structure and distances between data points. Some, like autoencoders, learn compact representations of the data by compressing it into a lower-dimensional form and then reconstructing the original inputs. Methods like Linear Discriminant Analysis (LDA) and Non-negative Matrix Factorization (NMF) offer yet other ways to reduce dimensions, focusing on class separation or non-negative decomposition, respectively.

Most of these methods, however, work in one way or another with the concept of distances or similarities between data points. For example, PCA seeks to maximize the variance (which is linked to the spread, or “distance,” between data points in the dataset), while t-SNE and UMAP preserve relative distances so that points close together in high-dimensional space remain close after dimensionality reduction. Even methods like autoencoders rely on optimization processes that capture patterns of similarity in the data.
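To connect ‘variance’ with the spread between data points, the minimal sketch below (simulated data again) shows that the first principal component captures more of the total spread than any single original variable does, which is what ‘finding the direction of maximum variance’ means in practice.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=2)
# Two correlated variables: most of the spread lies along their shared direction.
x = rng.normal(size=200)
data = np.column_stack([x + rng.normal(scale=0.3, size=200),
                        x + rng.normal(scale=0.3, size=200)])

pca = PCA(n_components=2).fit(data)

# Spread of each original variable vs spread along the principal components.
print(data.var(axis=0))         # variance of each original variable
print(pca.explained_variance_)  # variance along PC1 and PC2
# The first value in explained_variance_ is larger than either original
# variance, because PC1 points along the direction of maximum spread.
```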

It is, however, very important to pay attention to what the selected method is seeking to show in lower dimensions, because this directly affects the interpretation and usage of the lower-dimensional space. So while the goal of these methods remains the same (preserving the structure in the original dataset), the definition of “structure” varies between these methods. For example, PCA is more focused on capturing global structure, meaning it seeks to maximize the overall variance across the entire dataset. It tries to find the directions in which the data varies the most, but it might overlook subtle local relationships between data points. Methods like t-SNE and UMAP focus on local structure, meaning they try to keep the relative distances between nearby points, which is great for understanding clusters but may not accurately show large-scale patterns in the data.
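To see this global-versus-local contrast in practice, the sketch below (simulated clusters; the t-SNE settings are defaults, not tuned recommendations) runs PCA and t-SNE on the same data. In the PCA embedding the third cluster stays much farther away than the first two (global structure), whereas t-SNE separates all clusters clearly but its between-cluster distances should not be over-interpreted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Three simulated clusters at different distances from each other.
rng = np.random.default_rng(seed=3)
centres = np.array([[0.0] * 50, [2.0] * 50, [10.0] * 50])
data = np.vstack([c + rng.normal(scale=0.5, size=(30, 50)) for c in centres])

pca_scores = PCA(n_components=2).fit_transform(data)
tsne_scores = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(data)

# Both embeddings have one row per sample and two columns; plotting them
# side by side shows PCA preserving the large gap to the distant cluster,
# while t-SNE emphasizes the local neighbourhoods within each cluster.
print(pca_scores.shape, tsne_scores.shape)
```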

The point I want to make is that the choice of dimensionality reduction method directly influences which information from the original data is retained and which is lost. Understanding each method’s definition of “structure” is very important for making decisions about how to interpret and use the reduced data.

What are we going to cover?

PCA is probably the most well-known dimensionality reduction method; you can learn more about it in the PCA chapter (https://payamemami.com/pca_basics/). In the rest of this book, we are going to learn some other useful dimensionality reduction methods, such as t-SNE and UMAP. We might bring up PCA just to compare the results and mathematical formulations. So if you are still learning about dimensionality reduction, please have a look at the PCA chapter first.