Introduction to Self-Organizing Maps
Modern life science technologies can measure thousands to millions of variables in a single experiment. Genomics, proteomics, and metabolomics datasets often consist of extremely high-dimensional matrices, for example, gene expression profiles with tens of thousands of gene features measured across hundreds of samples. While these rich datasets hold the promise of new biological insights, they also present a curse of dimensionality: making sense of such massive data is challenging. Researchers are confronted with complex, multidimensional measurements where meaningful patterns are buried in noise and sheer volume. As Tamayo et al. noted at the dawn of functional genomics, “array technologies have made it straightforward to monitor thousands of genes…, [but] the challenge now is to interpret such massive data sets”, and the first step is often to extract fundamental patterns or lower-dimensional structure inherent in the data. In other words, we need ways to reduce dimensionality while preserving the most informative aspects of the data.
Dimensionality reduction serves multiple purposes in biological data analysis. First, it enables visualization of high-dimensional data in 2 or 3 dimensions, so that researchers can intuitively inspect sample groupings, outliers, or gradients (for instance, plotting cells or patients to see if they form clusters corresponding to biological conditions). Second, it can denoise and compress data by identifying underlying composite variables or “factors” that summarize the behavior of many original features. This helps tackle redundancy and correlation among features, a common situation where many genes or metabolites co-vary as part of the same pathway or process. Third, it can improve modeling and analysis downstream: by reducing thousands of features to a few informative dimensions, one can avoid overfitting in predictive models and highlight the major sources of variation for further biological interpretation.
In practice, a variety of dimensionality reduction techniques have become staples in life science research. Each technique has its own assumptions and strengths, and understanding their differences is important for choosing the right approach. Below we discuss Self-Organizing Maps (SOMs) in the context of several popular methods: Principal Component Analysis (PCA), Independent Component Analysis (ICA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP).
Overview of Dimensionality Reduction Techniques in Biology
Principal Component Analysis (PCA): PCA is a classical linear technique that finds a new set of orthogonal axes (principal components) capturing the greatest variance in the data. It projects high-dimensional data into a smaller number of dimensions such that each successive component explains the largest remaining variance, subject to being uncorrelated with previous components. PCA has long been used in genomics, for example, to summarize genome-wide expression or genotype data, because it provides a convenient way to identify major patterns like batch effects or population structure. However, PCA’s reliance on capturing variance with orthogonal components means it may miss subtler nonlinear relationships; it only leverages second-order statistics (covariances) and effectively assumes the underlying patterns are uncorrelated and Gaussian-like. In fact, the strict orthogonality constraint “may not be appropriate for biomedical data”, which can contain complex, correlated signals. Despite this, PCA’s simplicity, speed, and interpretability (each principal component is a linear combination of original features) make it a common first step in exploring high-throughput biological datasets.
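As a concrete illustration, here is a minimal Python sketch of how PCA is commonly applied to an expression matrix, assuming scikit-learn is available and using a placeholder samples-by-genes matrix `X` (both assumptions for illustration, not part of any specific analysis):

```python
# Minimal PCA sketch (assumes scikit-learn; X is a placeholder samples-by-genes matrix)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 2000)                   # e.g. 100 samples x 2000 genes (placeholder)
X_scaled = StandardScaler().fit_transform(X)    # PCA is variance-driven, so standardize features

pca = PCA(n_components=2)                       # keep the two largest-variance components
scores = pca.fit_transform(X_scaled)            # sample coordinates on PC1/PC2, ready to plot
print(pca.explained_variance_ratio_)            # fraction of variance captured by each component
print(pca.components_.shape)                    # loadings: each PC is a linear combination of genes
```

Each row of `pca.components_` is a loading vector over the original features, which is what makes PCA directly interpretable in terms of individual genes or metabolites.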
Independent Component Analysis (ICA): ICA is another linear approach, but more flexible than PCA in that it does not require components to be orthogonal. Instead, ICA seeks a linear transformation of the data into statistically independent components, which often corresponds to finding underlying source signals in the data. In practice, ICA looks for components that are non-Gaussian and as independent as possible, capturing higher-order statistics. This has been useful in genomics and other fields for separating superimposed signals, for example, disentangling co-regulated gene modules or technical artifacts that vary independently. Studies have shown that ICA can sometimes outperform PCA in biomedical data mining by identifying biologically relevant features that PCA misses. However, ICA components can be more challenging to interpret directly, and the method requires careful tuning (e.g., estimating the number of components). In summary, ICA provides a complementary perspective to PCA: instead of maximizing explained variance, it maximizes statistical independence, which can reveal latent factors (e.g., pathways, cell types, experimental effects) affecting gene expression patterns.
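A comparable sketch with scikit-learn’s FastICA, again using a placeholder samples-by-genes matrix `X`; the number of components is a tuning choice the analyst must make:

```python
# Minimal ICA sketch (assumes scikit-learn; X is a placeholder samples-by-genes matrix)
import numpy as np
from sklearn.decomposition import FastICA

X = np.random.rand(100, 2000)                    # placeholder expression matrix
ica = FastICA(n_components=10, random_state=0)   # the number of sources must be chosen/estimated
S = ica.fit_transform(X)                         # per-sample activities of the independent components
A = ica.mixing_                                  # mixing matrix: how components combine into genes
print(S.shape, A.shape)                          # (100, 10) and (2000, 10)
```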
t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a popular nonlinear dimensionality reduction technique particularly well-suited for visualizing high-dimensional data. Unlike PCA and ICA, t-SNE does not produce an explicit linear mapping or components; instead it focuses on preserving local neighborhoods. It converts pairwise distances between points in high-dimensional space into conditional probabilities, and then finds a low-dimensional arrangement of points that best matches those probability distributions. The result is a map (typically 2D) where points that were close in the original space tend to cluster together, and distant points tend to be apart. This often results in intuitive cluster visualizations of complex data; for example, t-SNE has revolutionized single-cell transcriptomics by enabling scientists to see clusters of cells corresponding to cell types or states. One caveat is that t-SNE primarily preserves local structure; distances between clusters in a t-SNE plot are not reliably meaningful globally. It can also sometimes break continuous trajectories into artificial clusters due to its emphasis on local density. Nonetheless, t-SNE’s ability to reveal natural groupings without any parametric assumptions has made it a go-to tool for exploring omics data.
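A common (though not obligatory) workflow is to pre-reduce with PCA and then embed with t-SNE; the sketch below assumes scikit-learn and a placeholder cells-by-genes matrix `X`:

```python
# Minimal t-SNE sketch (assumes scikit-learn; X is a placeholder cells-by-genes matrix)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(500, 2000)                    # placeholder, e.g. 500 cells x 2000 genes
X_pca = PCA(n_components=50).fit_transform(X)    # common pre-reduction step for speed and denoising

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X_pca)            # 2-D coordinates for plotting clusters of cells
print(embedding.shape)                           # (500, 2)
```

The perplexity parameter (roughly the effective neighborhood size) is the main knob; different values can noticeably change the apparent clustering.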
Uniform Manifold Approximation and Projection (UMAP): UMAP is a more recent nonlinear technique that, like t-SNE, is widely used for visualizing complex biological data such as single-cell RNA sequencing atlases. UMAP can be seen as a manifold learning approach: it assumes the data lie on a manifold of lower dimension embedded in the high-dimensional space, and it tries to learn that manifold’s structure. In practical terms, UMAP constructs a graph of nearest neighbors in the high-dimensional space and then optimizes a low-dimensional layout that preserves as much of the manifold’s structure as possible. One of UMAP’s advantages is that it seeks to preserve more of the global structure compared to t-SNE, while still maintaining local neighbor relations. In other words, UMAP often keeps broader group relationships and continuous trajectories more intact (tending to reflect true distances to some extent), and it tends to produce more stable embeddings (less run-to-run variability). It’s also computationally efficient and scalable. Because of these features, UMAP has quickly become another standard for dimensionality reduction in life sciences.
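A minimal sketch using the umap-learn package (assumed to be installed; `X` is again a placeholder cells-by-genes matrix):

```python
# Minimal UMAP sketch (assumes the umap-learn package; X is a placeholder cells-by-genes matrix)
import numpy as np
import umap

X = np.random.rand(500, 2000)                    # placeholder, e.g. 500 cells x 2000 genes
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=0)
embedding = reducer.fit_transform(X)             # 2-D layout preserving local (and some global) structure
print(embedding.shape)                           # (500, 2)
```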
In summary, PCA and ICA provide linear projections that can be easier to interpret in terms of original variables (each component is a combination of genes/metabolites), but they may fail to capture complex nonlinear relationships. t-SNE and UMAP provide nonlinear embeddings excellent for visualization of clusters and continuums in data, though they sacrifice explicit interpretability of axes and can distort global distances. All these methods reduce dimensionality but with different perspectives: PCA/ICA focus on capturing signals in a single mathematical projection, while t-SNE/UMAP focus on preserving neighborhood structure in a reconstructed map.
Self-Organizing Maps (SOMs) were introduced even earlier (in the 1980s) than these modern nonlinear methods, yet they include principles of both clustering and projection. SOM is an unsupervised neural network algorithm that produces a lower-dimensional (typically 2D) representation of the data while preserving the original topological relationships as much as possible. In practice, SOM combines aspects of clustering (learning prototype vectors) with a geometric structure (mapping those prototypes onto a grid). This results in a “map” of the high-dimensional data that is easier to visualize and interpret, making SOM an attractive technique for high-dimensional omics analysis.
Self-Organizing Maps: Nonlinear Dimensionality Reduction with Interpretability
A Self-Organizing Map (SOM), also known as a Kohonen map (after its inventor Teuvo Kohonen), is a type of artificial neural network trained using unsupervised learning to produce a low-dimensional (usually 2D) discretized representation of the input space. Unlike PCA or ICA, an SOM is not a formula-based projection but rather a grid of “neurons” or nodes that learns to represent the data through a competitive learning process. The key idea is that each node in the map is associated with a prototype vector (sometimes called a codebook vector) of the same dimensionality as the input data. During training, the SOM algorithm adjusts these prototype vectors to approximate the data distribution, while maintaining a predefined topology (neighbor relationships) among the nodes.
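Concretely, the codebook can be thought of as a three-dimensional array holding one prototype vector per grid node; the sketch below (with an arbitrary 10×10 grid and 2,000 input features, both illustrative choices) shows that layout in NumPy:

```python
# A SOM codebook as a 3-D NumPy array: one prototype per grid node,
# each with the same dimensionality as the input data (illustrative sizes).
import numpy as np

n_rows, n_cols, n_features = 10, 10, 2000
rng = np.random.default_rng(0)
codebook = rng.random((n_rows, n_cols, n_features))   # prototype vector stored at codebook[i, j]
print(codebook[3, 7].shape)                           # one node's prototype: (2000,)
```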
Topology preservation means that nodes which are near each other on the map grid will come to represent similar prototypes in the data space. In other words, the SOM tries to arrange its prototype vectors such that neighboring map units respond to (or “win” for) similar inputs. The result is that the two-dimensional grid of nodes becomes an organized map reflecting the structure of the data: clusters of similar data items end up mapped to nearby or adjacent nodes, whereas very different items end up far apart on the map. This is often described as projecting the data into a 2D space in a way that preserves the original structure as faithfully as possible. This gives SOM its power to visualize high-dimensional data in a manner that we can comprehend.
How SOMs Work
Training a self-organizing map involves an iterative procedure of competition and adaptation. At the start, the SOM consists of a lattice (grid) of nodes, for example, a 10×10 grid (the size can be chosen based on the problem). Each node has an associated weight vector initially set to small random values or perhaps sampled from the data distribution. These weight vectors live in the same feature space as the input data. The SOM learning algorithm then proceeds roughly as follows (a minimal code sketch of the whole procedure is given after the three steps):
Competition, finding the Best Matching Unit (BMU): When a data sample (a high-dimensional vector) is presented, the algorithm computes its distance to all the node weight vectors. The node whose weight is closest (typically in Euclidean distance) to the input sample is identified as the “winner”; this is the best matching unit (BMU) for that input. The BMU is the map’s current best approximation for that particular data point.
Cooperation, neighborhood update: Once the BMU is found, the algorithm not only updates that winning node’s weight vector to better match the input, but also updates the weights of nodes in its neighborhood on the map. The idea is that nodes within a certain radius of the BMU on the grid should also move their weights slightly toward the input vector. This neighborhood radius is large at the beginning of training and gradually shrinks over time. The cooperation phase ensures that the map develops smoothly: instead of each node learning independently, neighboring nodes learn to represent similar inputs, creating a continuous ordering on the map surface.
Adaptation, weight update rule: The actual update for a node’s weight vector involves moving it a fraction of the way toward the input vector. For the BMU (winner), the move is largest, and for nodes in the neighborhood, smaller adjustments are made (often weighted by a Gaussian or other kernel that decreases with distance from the BMU). Over many iterations of presenting inputs (often cycling through the dataset multiple times), the map’s weight vectors gradually “self-organize” to approximate the distribution of the data. The learning rate (how big the weight adjustments are) is also gradually decreased to fine-tune convergence. By the end of training, each node’s weight vector can be seen as the prototype representing a cluster of similar data points.
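To make the three steps concrete, here is a compact NumPy sketch of the whole training loop; the grid size, decay schedules, Gaussian neighborhood, and iteration count are illustrative assumptions rather than the only valid choices, and real analyses would typically rely on a dedicated library such as MiniSom (Python) or kohonen (R).

```python
# Sketch of online SOM training: competition, cooperation, adaptation.
# All hyperparameters here are illustrative choices.
import numpy as np

def train_som(X, n_rows=10, n_cols=10, n_iter=2000, lr0=0.5, sigma0=None, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    sigma0 = sigma0 or max(n_rows, n_cols) / 2           # initial neighborhood radius
    codebook = rng.random((n_rows, n_cols, n_features))  # random prototype vectors

    # grid coordinates of every node, used to measure distance on the map itself
    grid = np.stack(np.meshgrid(np.arange(n_rows), np.arange(n_cols),
                                indexing="ij"), axis=-1).astype(float)

    for t in range(n_iter):
        x = X[rng.integers(len(X))]                      # present one random sample

        # 1. Competition: BMU = node whose prototype is closest to the input
        dists = np.linalg.norm(codebook - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)

        # learning rate and neighborhood radius shrink over time
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3

        # 2. Cooperation: Gaussian neighborhood centered on the BMU (on the grid)
        grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))     # shape (n_rows, n_cols)

        # 3. Adaptation: move each prototype a fraction of the way toward the input
        codebook += lr * h[..., None] * (x - codebook)

    return codebook

# usage on a placeholder samples-by-features matrix
X = np.random.default_rng(1).random((300, 50))
codebook = train_som(X)
print(codebook.shape)   # (10, 10, 50): one prototype vector per map node
```

On real data one would present standardized feature vectors (e.g., gene expression profiles) and typically run many more iterations; the returned codebook then holds the learned prototypes, one per map node.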