Introduction to Self-Organizing Maps

Modern life science technologies can measure thousands to millions of variables in a single experiment. Genomics, proteomics, and metabolomics datasets often consist of extremely high-dimensional matrices, for example, gene expression profiles with tens of thousands of gene features measured across hundreds of samples. While these rich datasets hold the promise of new biological insights, they also present a curse of dimensionality: making sense of such massive data is challenging. Researchers are confronted with complex, multidimensional measurements where meaningful patterns are buried in noise and sheer volume. As Tamayo et al. noted at the dawn of functional genomics, “array technologies have made it straightforward to monitor thousands of genes…, [but] the challenge now is to interpret such massive data sets”, and the first step is often to extract fundamental patterns or lower-dimensional structure inherent in the data. In other words, we need ways to reduce dimensionality while preserving the most informative aspects of the data.

Dimensionality reduction serves multiple purposes in biological data analysis. First, it enables visualization of high-dimensional data in 2 or 3 dimensions, so that researchers can intuitively inspect sample groupings, outliers, or gradients (for instance, plotting cells or patients to see if they form clusters corresponding to biological conditions). Second, it can denoise and compress data by identifying underlying composite variables or “factors” that summarize the behavior of many original features. This helps tackle redundancy and correlation among features, a common situation where many genes or metabolites co-vary as part of the same pathway or process. Third, it can improve modeling and analysis downstream: by reducing thousands of features to a few informative dimensions, one can avoid overfitting in predictive models and highlight the major sources of variation for further biological interpretation.

In practice, a variety of dimensionality reduction techniques have become staples in life science research. Each technique has its own assumptions and strengths, and understanding their differences is important for choosing the right approach. Below we discuss Self-Organizing Maps (SOMs) in the context of several popular methods: Principal Component Analysis (PCA), Independent Component Analysis (ICA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP).

Overview of Dimensionality Reduction Techniques in Biology

  • Principal Component Analysis (PCA): PCA is a classical linear technique that finds a new set of orthogonal axes (principal components) capturing the greatest variance in the data. It projects high-dimensional data into a smaller number of dimensions such that each successive component explains the largest remaining variance, subject to being uncorrelated with previous components. PCA has long been used in genomics, for example, to summarize genome-wide expression or genotype data, because it provides a convenient way to identify major patterns like batch effects or population structure. However, PCA’s reliance on capturing variance with orthogonal components means it may miss subtler nonlinear relationships; it only leverages second-order statistics (covariances) and assumes components are uncorrelated Gaussian-like patterns. In fact, the strict orthogonality constraint “may not be appropriate for biomedical data” that can have complex correlated signals. Despite this, PCA’s simplicity, speed, and interpretability (each principal component is a linear combination of original features) make it a common first step in exploring high-throughput biological datasets.

  • Independent Component Analysis (ICA): ICA is another linear approach, but more flexible than PCA in that it does not require components to be orthogonal. Instead, ICA seeks a linear transformation of the data into statistically independent components, which often corresponds to finding underlying source signals in the data. In practice, ICA looks for components that are non-Gaussian and as independent as possible, capturing higher-order statistics. This has been useful in genomics and other fields for separating superimposed signals, for example disentangling co-regulated gene modules or technical artifacts that vary independently. Studies have shown that ICA can sometimes outperform PCA in biomedical data mining by identifying biologically relevant features that PCA misses. However, ICA components can be more challenging to interpret directly, and the method requires careful tuning (e.g. estimating the number of components). In summary, ICA provides a complementary perspective to PCA: instead of maximizing explained variance, it maximizes statistical independence, which can reveal latent factors (e.g., pathways, cell types, experimental effects) affecting gene expression patterns.

  • t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a popular nonlinear dimensionality reduction technique particularly well-suited for visualizing high-dimensional data. Unlike PCA and ICA, t-SNE does not produce an explicit linear mapping or components; instead it focuses on preserving local neighborhoods. It converts pairwise distances between points in high-dimensional space into conditional probabilities, and then finds a low-dimensional arrangement of points that best matches those probability distributions. The result is a map (typically 2D) where points that were close in the original space tend to cluster together, and distant points tend to be apart. This often results in intuitive cluster visualizations of complex data; for example, t-SNE has revolutionized single-cell transcriptomics by enabling scientists to see clusters of cells corresponding to cell types or states. One caveat is that t-SNE primarily preserves local structure; distances between clusters in a t-SNE plot are not reliably meaningful globally. It can also sometimes break continuous trajectories into artificial clusters due to its emphasis on local density. Nonetheless, t-SNE’s ability to reveal natural groupings without any parametric assumptions has made it a go-to tool for exploring omics data.

  • Uniform Manifold Approximation and Projection (UMAP): UMAP is a more recent nonlinear technique that, like t-SNE, is widely used for visualizing complex biological data such as single-cell RNA sequencing atlases. UMAP can be seen as a manifold learning approach: it assumes the data lie on a manifold of lower dimension embedded in the high-dimensional space, and it tries to learn that manifold’s structure. In practical terms, UMAP constructs a graph of nearest neighbors in the high-dimensional space and then optimizes a low-dimensional layout that preserves as much of the manifold’s structure as possible. One of UMAP’s advantages is that it seeks to preserve more of the global structure compared to t-SNE, while still maintaining local neighbor relations. In other words, UMAP often keeps broader group relationships and continuous trajectories more intact (tending to reflect true distances to some extent), and it tends to produce more stable embeddings (less run-to-run variability). It’s also computationally efficient and scalable. Because of these features, UMAP has quickly become another standard for dimensionality reduction in life sciences.

In summary, PCA and ICA provide linear projections that can be easier to interpret in terms of original variables (each component is a combination of genes/metabolites), but they may fail to capture complex nonlinear relationships. t-SNE and UMAP provide nonlinear embeddings excellent for visualization of clusters and continuums in data, though they sacrifice explicit interpretability of axes and can distort global distances. All these methods reduce dimensionality but with different perspectives: PCA/ICA focus on capturing signals in a single mathematical projection, while t-SNE/UMAP focus on preserving neighborhood structure in a reconstructed map.

Self-Organizing Maps (SOMs) were introduced even earlier (in the 1980s) than these modern nonlinear methods, yet they combine principles of both clustering and projection. SOM is an unsupervised neural network algorithm that produces a lower-dimensional (typically 2D) representation of the data while preserving the original topological relationships as much as possible. In practice, SOM combines aspects of clustering (learning prototype vectors) with a geometric structure (mapping those prototypes onto a grid). This results in a “map” of the high-dimensional data that is easier to visualize and interpret, making SOM an attractive technique for high-dimensional omics analysis.

Self-Organizing Maps: Non-linear Dimensionality Reduction with Interpretability

A Self-Organizing Map (SOM), also known as a Kohonen map (after its inventor Teuvo Kohonen), is a type of artificial neural network trained using unsupervised learning to produce a low-dimensional (usually 2D) discretized representation of the input space. Unlike PCA or ICA, an SOM is not a formula-based projection but rather a grid of “neurons” or nodes that learns to represent the data through a competitive learning process. The key idea is that each node in the map is associated with a prototype vector (sometimes called a codebook vector) of the same dimensionality as the input data. During training, the SOM algorithm adjusts these prototype vectors to approximate the data distribution, while maintaining a predefined topology (neighbor relationships) among the nodes.

Topology preservation means that nodes which are near each other on the map grid will come to represent similar prototypes in the data space. In other words, the SOM tries to arrange its prototype vectors such that neighboring map units respond to (or “win” for) similar inputs. The result is that the two-dimensional grid of nodes becomes an organized map reflecting the structure of the data: clusters of similar data items end up mapped to nearby or adjacent nodes, whereas very different items end up far apart on the map. This is often described as projecting the data into a 2D space in a way that preserves the original structure as faithfully as possible. This gives SOM its power to visualize high-dimensional data in a manner that we can comprehend.

How SOMs Work

Training a self-organizing map involves an iterative procedure of competition and adaptation. At the start, the SOM consists of a lattice (grid) of nodes, for example, a 10×10 grid (the size can be chosen based on the problem). Each node has an associated weight vector initially set to small random values or perhaps sampled from the data distribution. These weight vectors live in the same feature space as the input data. The SOM learning algorithm then proceeds roughly as follows:

  1. Competition, finding the Best Matching Unit (BMU): When a data sample (a high-dimensional vector) is presented, the algorithm computes its distance to all the node weight vectors. The node whose weight is closest (typically in Euclidean distance) to the input sample is identified as the “winner”: this is the best matching unit (BMU) for that input. The BMU is the map’s current best approximation for that particular data point.

  2. Cooperation, neighborhood update: Once the BMU is found, the algorithm not only updates that winning node’s weight vector to better match the input, but also updates the weights of nodes in its neighborhood on the map. The idea is that nodes within a certain radius of the BMU on the grid should also move their weights slightly toward the input vector. This neighborhood radius is large at the beginning of training and gradually shrinks over time. The cooperation phase ensures that the map develops smoothly: instead of each node learning independently, neighboring nodes learn to represent similar inputs, creating a continuous ordering on the map surface.

  3. Adaptation, weight update rule: The actual update for a node’s weight vector involves moving it a fraction of the way toward the input vector. For the BMU (winner), the move is largest, and for nodes in the neighborhood, smaller adjustments are made (often weighted by a Gaussian or other kernel that decreases with distance from the BMU). Over many iterations of presenting inputs (often cycling through the dataset multiple times), the map’s weight vectors gradually “self-organize” to approximate the distribution of the data. The learning rate (how big the weight adjustments are) is also gradually decreased to fine-tune convergence. By the end of training, each node’s weight vector can be seen as the prototype representing a cluster of similar data points.

Illustration of the self-organizing map training process. Each orange circle represents a data point in a high-dimensional space. For a given data sample (purple cross \(x\)), the SOM finds the closest map unit (best matching unit, shown in red on the grid). The BMU’s weight vector and its neighbors within a certain radius are then adjusted slightly toward the input point. Over many iterations, this process leads to a topologically ordered map where neighboring nodes on the grid respond to similar patterns in the data.

Through this process, the SOM learns a set of prototype patterns (the node weight vectors) that collectively approximate the overall dataset. These prototypes are organized on the map in a meaningful way: the map typically develops clusters and gradients, with regions of the map specializing in certain types of inputs and similar regions lying adjacent to each other. We end up with what Kohonen described as a “feature map”, a two-dimensional mosaic that reflects high-dimensional relationships.

One can think of a trained SOM as a kind of nonlinear projection of the data into a 2D space, but one that comes with an inherent clustering and neighborhood structure. Each node in the map can be viewed as a cluster centroid (prototype), and all data samples can be mapped to the node that best represents them (their BMU). This yields a partition of the dataset: each sample “belongs” to a particular node (or to that node’s local region). At the same time, because neighboring nodes have been pulling on similar samples, the map preserves a continuity: adjacent clusters on the map are more similar than distant clusters. This is SOM’s distinctive feature: it simultaneously performs clustering and dimension reduction with spatial organization. The low-dimensional representation (positions on the grid) can be plotted and visualized, and the spatial proximity on this map is meaningful with respect to original data similarity.

Topological Structure and Interpretability

The structured nature of a self-organizing map gives it a unique kind of interpretability. Unlike methods such as t-SNE or UMAP that yield a scatterplot of points, an SOM produces an organized lattice of nodes which can be visualized in various informative ways. One common visualization is the so-called U-Matrix (unified distance matrix), which shows the distances between neighboring node weight vectors on the map. This often appears as a relief map or a contour map where low-distance regions (clusters of similar data) form valleys or flat areas, and high-distance boundaries between clusters form ridges or mountains. It effectively shows the cluster landscape of the map.

Example of a Self-Organizing Map visualized as a topographic map (U-Matrix). This map was trained on a simulated dataset of 15 clusters. The peaks (“mountains”) represent large distances (dissimilarities) between neighboring nodes, i.e. cluster boundaries. Flat regions or basins correspond to clusters of similar items.

These visualizations provide a global view of the data topology. You can literally see clusters and how they relate to each other on the map. For example, a trained SOM can show that certain samples form one cluster but lie right next to another cluster, suggesting a gradient or continuum rather than a hard separation. This information might be lost in hierarchical clustering or in a t-SNE plot where distances between clusters are not reliable. We will see an example of these visualizations later.

Another aspect of interpretability is that each SOM node’s prototype vector can itself be interpreted in terms of the original features. In a gene expression dataset, for instance, each node corresponds to a metagene-like profile: its weights are essentially an average expression profile of the genes or samples it represents. Researchers often visualize heatmaps of the prototype vector values (sometimes called component planes for each feature across the map) to see how each gene or metabolite contributes to different regions of the map. Such analysis can pinpoint which features drive the separation between clusters. The localized nature of SOM clusters means one can identify modules of co-expressed genes or co-varying metabolites represented by single map units, and then examine their functional enrichment.

Furthermore, interpretability extends to the sample level: each sample can be represented by the pattern of activation on the map (often called a “fingerprint”). When a new sample is mapped to a trained SOM, it will excite certain nodes strongly (those most similar to it). One can visualize a sample’s data by coloring the SOM nodes according to that sample’s values, producing a fingerprint image. Samples with similar overall patterns will have similar fingerprint images. This approach has been used to give each sample an “individual visual identity”.

How to perform SOMs in R

Before we cover the technical details of SOMs, let’s learn how to effectively apply, visualize, and interpret them in R.

Let’s first create our dataset and then go through how to apply SOMs:

library(mixOmics)
data(breast.TCGA)
data_input <- rbind(breast.TCGA$data.train$mirna, breast.TCGA$data.test$mirna)
group_labels <- c(breast.TCGA$data.train$subtype, breast.TCGA$data.test$subtype)

This data set is a small subset of the full data set from The Cancer Genome Atlas that is distributed with the mixOmics package. It contains the expression or abundance of three matched omics data sets, mRNA, miRNA and proteomics, for 150 breast cancer samples (subtypes Basal, Her2, and Luminal A) in the training set and 70 samples in the test set. The test set is missing the proteomics data. To increase the number of samples, we concatenate the training and test data and focus on the miRNA dataset.

Defining the grid

As said before, Self-Organizing Maps (SOMs) arrange their neurons (units) on a low-dimensional grid, most often 2D. Each unit is associated with a weight vector of the same dimension as the input data. The SOM algorithm organizes these units such that similar data points are mapped to nearby neurons, preserving topological properties. We define a 2D grid layout using the kohonen package:

# load library
library(kohonen)
# Define a 5x5 hexagonal grid (you can change size depending on your data)
som_grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")

# create a fake grid
fakesom <- list(grid = som_grid)
class(fakesom) <- "kohonen"
dists <- unit.distances(som_grid)
plot(fakesom, type="property", property = dists[1,],
     main="Distances to unit 1", zlim=c(0,6),
     palette = rainbow, ncolors = 7)

Let’s take a look at how this works:

som_grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")

This creates a grid with 25 units arranged in a hexagonal pattern. The arguments control several aspects:

  • xdim and ydim: determine the number of units along the horizontal and vertical axes.
  • topo: defines the topology. “hexagonal” results in a more compact, natural layout often used in SOMs. The alternative is “rectangular”, where neurons are aligned in a strict square grid.
  • Other optional arguments include neighbourhood.fct, which affects how updates are spread during training, and toroidal, which makes the grid wrap around (like a video game map with no edges). These will become relevant later during training.

When defining a Self-Organizing Map (SOM) grid using the somgrid() function in R, one of the important decisions you need to make is whether to use a hexagonal or rectangular topology. This choice affects how neurons are connected on the grid and how neighborhood relationships are defined during training.

A rectangular topology places the neurons in a classic grid-like structure, similar to a chessboard. Each unit typically has four immediate neighbors: one above, one below, one to the left, and one to the right. This kind of layout is simple and computationally efficient. The way the neighborhood expands from a winning neuron during training follows a square or boxy pattern. While this approach is straightforward and easy to interpret, it can lead to slightly abrupt transitions between regions of the map, especially in cases where the input data varies smoothly.

In contrast, a hexagonal topology arranges the neurons in a staggered pattern resembling a honeycomb. Each unit in this setup has six neighboring units. This hexagonal layout creates a more circular neighborhood shape around each neuron, which results in smoother transitions across the map during training. The hexagonal pattern better captures gradual variations and tends to provide more natural-looking cluster boundaries.

The difference between these two topologies also affects how the SOM learns. Because the hexagonal layout allows more neighbors to be influenced by each update, it can lead to more robust and gradual learning behavior. It is often favored in situations where you care about smooth transitions, continuous data, or visualizing subtle differences in the input space. On the other hand, rectangular grids may be sufficient when the data structure is more discrete or when simplicity and speed are priorities.

Another important setting in the SOM grid is whether the map should be toroidal or not. A toroidal grid means that the edges of the grid are connected, effectively wrapping around like the surface of a donut. In a toroidal SOM, the left and right edges are considered neighbors, and so are the top and bottom edges. This creates a continuous, borderless topology where there are no true edges on the map.

In contrast, a non-toroidal grid has boundaries. Units at the edges or corners of the map have fewer neighbors than those in the center. This can lead to edge effects during training, where neurons at the borders behave differently simply because they are less connected.

Making the SOM toroidal can be helpful in some contexts, especially when the data has a cyclical or repeating structure, or when you want to avoid giving any neuron a special position due to being at the edge. It helps ensure that all neurons are treated equally in terms of neighborhood structure. However, in most biological applications, such as clustering gene expression or clinical profiles, toroidal maps are rarely necessary and can make interpretation harder, especially when visualizing the map.
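To make these grid options concrete, here is a brief sketch (using only the somgrid() and unit.distances() functions from the kohonen package) that builds a rectangular, a hexagonal, and a toroidal variant of the grid and compares how far apart units can get:

# Rectangular grid: each interior unit has 4 immediate neighbours
grid_rect <- somgrid(xdim = 5, ydim = 5, topo = "rectangular")

# Hexagonal grid: each interior unit has 6 immediate neighbours
grid_hex <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")

# Toroidal hexagonal grid: edges wrap around, so no unit sits on a border
grid_torus <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal", toroidal = TRUE)

# On a torus the largest unit-to-unit distance is smaller, because
# there are no corners that are far away from everything else
max(unit.distances(grid_rect))
max(unit.distances(grid_hex))
max(unit.distances(grid_torus))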

Train SOMs

Now that we have our grid ready and have explored its structure and geometry, we are ready to train the Self-Organizing Map using our data. Training a SOM involves presenting the data to the map, finding the best matching unit (BMU) for each data point, and then updating not only the BMU but also its neighbors according to their distance on the grid. This process repeats for a number of iterations, gradually organizing the neurons in a way that reflects the underlying structure of the data.

In R, we use the som() function from the kohonen package to perform this training. The function takes the scaled data, the grid object we created earlier, and a few additional parameters that control how the learning is performed. One of the key parameters is rlen, which defines how many times the algorithm should cycle through the entire dataset. Another important parameter is alpha, which sets the learning rate at the beginning and end of training. Typically, the learning rate decreases over time to allow the map to stabilize and fine-tune its representation.

Here is an example of how to train the SOM using our miRNA data:

set.seed(101)
# scaled_data <- scale(data_input) # don't forget to scale your data; we skip it here for demonstration purposes
som_model <- som(X = data_input, 
                 grid = som_grid, 
                 rlen = 1000,
                 #alpha = c(0.05, 0.01), 
                 keep.data = TRUE)

In this code we skip the scaling, but remember that scaling is usually essential because SOMs are sensitive to differences in variable ranges; without it, variables with large values can dominate the learning process. The grid argument specifies the structure we defined earlier. The rlen argument sets the number of training iterations, and alpha (commented out here, so left at its default of c(0.05, 0.01)) controls how aggressively the neuron weights are updated, starting at 0.05 and gradually decreasing to 0.01. The keep.data argument allows us to retain the original data within the trained SOM object, which is useful later for visualization and interpretation.

Once training is complete, the SOM will have adjusted the positions of the neurons (or rather, their associated weight vectors) to reflect the structure of the data. Similar samples will be mapped to nearby neurons on the grid, while dissimilar samples will be placed further apart. After training the som() function returns an object of class kohonen, which contains several important components.

The output includes a vector called unit.classif, which contains the index of the winning neuron for each observation. Another important field is distances. This gives the distance between each data point and its winning neuron. These distances are often used to calculate the quantization error or identify poorly mapped samples.

The grid component simply contains the grid structure we defined using somgrid(), including its size, topology, and neighborhood function. The core of the SOM is stored in codes, which is a list of matrices containing the codebook vectors. These vectors define the positions of the neurons in the data space and are what the algorithm adjusts during training to fit the data.

The changes matrix shows how much the codebook vectors changed during each training iteration. Each column corresponds to one map (if multiple layers were used), and each row to an iteration. By inspecting this matrix, you can see whether the map has stabilized: smaller changes toward the end indicate convergence.

If some rows of your input data had too many missing values (based on the maxNA.fraction threshold), they are excluded from training, and their indices are stored in na.rows. This helps you keep track of how much data was used and whether any important samples were dropped.

Finally, the output also includes the arguments used during training: alpha, radius, user.weights, whatmap, maxNA.fraction, and dist.fcts. These are returned for transparency and reproducibility, allowing you to recall exactly how the model was configured. If you enabled normalizeDataLayers, the resulting layer weights will be available in distance.weights, showing how the influence of each data layer was balanced during training.
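A quick way to confirm these components is to inspect the trained object directly; a short sketch (exact numbers will differ between runs):

names(som_model)                 # all components described above
head(som_model$unit.classif)     # winning unit (BMU) for the first samples
dim(som_model$codes[[1]])        # 25 codebook vectors x 184 miRNA features
table(som_model$unit.classif)    # how many samples were assigned to each unit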

Evaluating the Quality of a Trained SOM

After training a SOM, it’s important to assess whether the map has effectively organized the data. Although SOMs are unsupervised and do not have an explicit “loss function” like in supervised learning, there are several ways to evaluate their performance and decide whether the training was sufficient.

One of the most basic indicators is the quantization error, which measures how far each input data point is from its best matching unit (BMU) on the map. This gives an idea of how well the codebook vectors represent the input space. You can calculate the average quantization error using:

mean(som_model$distances)
[1] 124.9259

This value should be as low as possible, though what counts as “low” depends on the scale of the data. You can also inspect the distribution of distances to get a sense of whether some data points are poorly represented:

hist(som_model$distances, main = "Quantization Error Distribution", xlab = "Distance to BMU")

If the distribution is wide or has many high values, it might indicate that the grid is too small or that the number of training iterations was insufficient.

Another important diagnostic tool is the mapping plot, which shows how the input samples are distributed across the grid. If many samples are crowded into the same neuron while others remain empty, it could mean that the SOM hasn’t had enough time to spread out or that the learning rate was too low. It can also be that the grid is too large or too small. This can be visualized using:

plot(som_model, type = "mapping")

You can also use the node count plot, which shows how many data points were assigned to each neuron. A well-trained SOM usually has most nodes occupied with some variation in density. Too many empty neurons may suggest underfitting (too short training or too large grid), while very few highly occupied nodes may indicate overfitting or too small a grid:

plot(som_model, type = "count")

We can also plot the changes values to check whether the distance from each node’s weights to the samples represented by that node is decreasing. This figure shows the progress over the iterations; if the curve is still decreasing without an obvious plateau, more iterations are required.

plot(som_model, type = "changes", main = "Distances")

Additionally, the U-matrix or unified distance matrix is a powerful tool for evaluating the topological quality of the map. It shows the distances between neighboring neurons on the map. Clear valleys and ridges in the U-matrix usually reflect meaningful cluster boundaries. A very flat U-matrix with little contrast might suggest insufficient learning or an overly uniform dataset:

plot(som_model, type = "dis", main = "U-Matrix")

If the U-matrix reveals too little structure or the quantization error remains high, you may consider increasing the number of training iterations (rlen), adjusting the learning rate schedule (alpha), or modifying the radius and grid size. It’s often useful to run the SOM multiple times with different parameter settings and compare these diagnostics to find the best configuration for your data.
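One simple way to run such a comparison is to loop over a few candidate settings and record the average quantization error for each; this is only a sketch, and the grid sizes below are arbitrary choices for illustration:

set.seed(101)
grid_sizes <- c(4, 5, 6, 7)
qe <- sapply(grid_sizes, function(g) {
  m <- som(X = data_input,
           grid = somgrid(xdim = g, ydim = g, topo = "hexagonal"),
           rlen = 1000, keep.data = TRUE)
  mean(m$distances)   # average quantization error for this configuration
})
data.frame(grid = paste0(grid_sizes, "x", grid_sizes), mean_QE = qe)

Keep in mind that quantization error almost always drops as the grid grows, so it should be weighed against the other diagnostics (empty nodes, interpretability) rather than minimized on its own.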

Interpreting the SOM

Once the SOM has been trained, it becomes a powerful tool for visualizing and interpreting high-dimensional data in a structured and intuitive way. Unlike many other unsupervised learning methods, SOMs preserve the topological relationships between data points, meaning that samples that are similar in the input space will tend to be mapped to nearby neurons on the grid.

One of the most common and informative plots is the mapping plot. This shows where each input sample has landed on the grid. In other words, which neuron acts as its best matching unit. If you provide labels or group information (such as known classes or subtypes), these can be used to color the samples and identify whether the SOM has successfully separated them. Clear separation of different groups across the map indicates that the SOM has captured meaningful structure. Overlap or mixing between groups might suggest that the data is not easily separable, or that further tuning of the SOM is needed.

group_levels <- levels(group_labels)
colors <- setNames(c("tomato", "steelblue", "black"), group_levels)
sample_colors <- colors[group_labels]

set.seed(101)
# Plot with colored samples
plot(som_model, type = "mapping",
     pchs = 21,
     main = "SOM Mapping by Group (Basal, Her2, Luminal A)",
     shape = "straight",
     col = sample_colors)
legend(x = "top", legend = group_levels, col = c("tomato", "steelblue", "black"), pch = 21)

What we see in this plot is that different subtypes, Luminal A, Her2, and Basal, are roughly distributed from left to right across the SOM grid. On the left side of the map, most of the samples are labeled as Luminal A. As we move toward the center, we observe more Her2 samples appearing, and on the right side, the map becomes increasingly dominated by Basal samples. This gradient-like separation suggests that the SOM has successfully captured the underlying structure of the data and organized it in a meaningful way.

The fact that the subtypes are not randomly mixed throughout the map but instead appear in distinct regions implies that the SOM has learned to associate specific patterns in the miRNA expression data with these known biological categories. Even though SOM is an unsupervised algorithm and was not told anything about the subtype labels during training, the organization of the map reflects real biological distinctions between the groups.

What the SOM has done, at its core, is to take complex, high-dimensional data and project it onto a two-dimensional grid in such a way that similar samples are placed close to each other, and dissimilar ones are pushed farther apart. Each hexagonal unit on the grid represents a prototype (a kind of average of the data points mapped to it) and through iterative learning, the SOM adjusts these prototypes so that they best represent the structure of the data. The grid layout, combined with the neighborhood updates during training, ensures that the learned representation not only captures individual data patterns but also preserves the topological relationships between them. This is why related subtypes appear near one another, and transitions between clusters occur gradually across the map.

Cluster Boundaries

A crucial step in interpreting a Self-Organizing Map is identifying where meaningful clusters are located on the grid. While SOM itself does not explicitly define clusters, it provides rich information that can be used to infer them through the U-matrix, or unified distance matrix.

The U-matrix visualizes the average distance between each neuron and its immediate neighbors. These distances reflect how different the neurons are from one another in terms of their codebook vectors. Regions of the map where neighboring neurons are similar will have small distances and appear in darker shades (red or orange), while areas where neurons differ significantly will have larger distances, typically colored in lighter shades (yellow or white).

To produce this plot, we use:

plot(som_model, type="dist.neighbours",
     pchs = 21,
     main = "SOM distances to all immediate neighbours",
     shape = "straight")

This produces a color-coded map of pairwise distances between neighboring units. Each hexagon represents a neuron, and its color corresponds to how different it is from the neurons around it. What you can observe here is that most areas are relatively uniform and dark, meaning the neighboring units are similar. However, there are distinct bright regions, especially in the upper-left part of the grid, where neurons differ more sharply from their neighbors. These light-colored boundaries act as natural dividers between clusters, even though no explicit clustering has been applied yet.

Such high-distance boundaries often indicate a transition between different groups of samples. For instance, a patch of neurons with low distances may represent samples from one biological subtype, while a sharp shift to higher distances could mark the start of a new group. This makes the U-matrix an essential tool for exploring cluster structure visually and determining where to draw meaningful divisions on the map.

Later, when we apply clustering algorithms like k-means to the codebook vectors, these distance-based boundaries often align well with the resulting cluster assignments. In that sense, the U-matrix provides a data-driven, visually intuitive way to guide or validate clustering decisions. It is especially helpful in exploratory analysis, where there is no ground truth and the goal is to discover natural groupings in the data.

Property or Heatmap Plot

To understand the property plot, also known as the component plane or heatmap, it’s helpful to briefly recall what the SOM learns during training. Each neuron on the grid is associated with a codebook vector, which is essentially a synthetic prototype representing the average profile of all data points mapped to that neuron. These codebook vectors have the same number of dimensions (variables) as the input data. In the case of our miRNA data, each neuron’s codebook vector contains a value for every miRNA feature.

The property plot displays the values of one specific variable across all the neurons on the grid. In other words, it shows how a single dimension of the codebook vectors is distributed spatially across the map. This plot transforms a column of the high-dimensional codebook matrix into a two-dimensional color-coded surface allowing us to see, where that variable is high or low across the SOM.

This type of visualization is particularly important in SOMs because it reveals how individual features contribute to the organization of the map. It can highlight whether a particular gene or miRNA is associated with specific clusters, subtypes, or regions on the grid, and whether it helps to distinguish between different biological patterns.

For example, to visualize the expression of a specific miRNA such as hsa-mir-130b, we can use the following code:

coolBlueHotRed <- function(n, alpha = 1) {
  colors <- colorRampPalette(c("blue", "white", "red"))(n)
  adjustcolor(colors, alpha.f = alpha)
}

plot(som_model, type = "property", 
     property = som_model$codes[[1]][, "hsa-mir-130b"], 
     main = "Expression of hsa-mir-130b",palette.name = coolBlueHotRed)

This generates a heatmap over the SOM grid where each hexagon’s color corresponds to the expression level of hsa-mir-130b in that neuron’s codebook vector. With the blue-white-red palette defined above, warm colors (red) indicate high expression, while cool colors (blue) indicate low expression.

If the variable is biologically important or strongly associated with a sample group, it will often show a distinct spatial pattern, for example with high values clustered in one part of the grid and low values in another. This can help in identifying which features drive separation between subtypes or clusters. Conversely, if the feature is weakly informative or noisy, it will appear uniformly distributed across the grid.

Component planes can also be compared side by side for several variables, making it possible to identify correlated features, co-expressed genes, or group-specific biomarkers. When combined with cluster or U-matrix plots, property plots offer deep insight into both the structure of the data and the role of individual features.
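For instance, a few component planes can be drawn next to each other; in this sketch the first four miRNAs in the codebook matrix are used purely for illustration and can be replaced by any features of interest:

codes <- som_model$codes[[1]]
features <- colnames(codes)[1:4]   # first four miRNAs, chosen arbitrarily

par(mfrow = c(2, 2))
for (f in features) {
  plot(som_model, type = "property",
       property = codes[, f],
       main = f, palette.name = coolBlueHotRed)
}
par(mfrow = c(1, 1))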

Clustering the SOM Using k-Means and Visualizing Boundaries

While a Self-Organizing Map arranges the samples into a low-dimensional grid that preserves topological relationships, it does not explicitly assign cluster labels. To convert this organization into distinct groups, we can apply k-means clustering (or any other clustering) to the codebook vectors.

To begin, we extract the codebook vectors from the trained SOM model. These vectors capture the patterns that the SOM has learned during training and are ideal for clustering. We then apply k-means clustering. We use the NbClust package to guide us in selecting the number of clusters. NbClust systematically compares many clustering validity indices to recommend the optimal number of clusters.

library(NbClust)

# Extract codebook vectors
codes <- som_model$codes[[1]]

# Use NbClust to determine the optimal number of clusters
set.seed(123)
nb <- NbClust(data = codes, distance = "euclidean", min.nc = 2, max.nc = 10,
              method = "kmeans", index = "gap")
optimal_k <- nb$Best.nc[1]

The NbClust function tests different values of k (from 2 to 10 in this example) using a range of statistical criteria and returns a consensus for the best number of clusters. The result, stored in optimal_k, can then be used to run k-means:

# Perform k-means with the optimal number of clusters
set.seed(123)
km <- kmeans(codes, centers = optimal_k, nstart = 25)

At this point, each neuron has been assigned to a cluster. We can visualize the clustering results directly on the SOM grid by coloring neurons according to their cluster and drawing black boundary lines between them:

# Assign a color to each cluster
cluster_colors <- rainbow(optimal_k)[km$cluster]

# Plot SOM with background colored by cluster
plot(som_model, type = "mapping", 
     bgcol = cluster_colors,pch=sample_colors,shape = "straight",
     main = paste("SOM Clustering with", optimal_k, "Clusters"))

# Add boundaries around clusters
add.cluster.boundaries(som_model, km$cluster)

This visualization brings together the topological layout learned by the SOM and the discrete structure imposed by clustering. Each colored region on the map now represents a distinct cluster, defined by similarities in the codebook vectors. The black lines mark boundaries where transitions occur between clusters. These often correspond to natural divisions in the data, as suggested by the SOM’s organization.

A few more things

Before moving on to the math section, I want to mention a few more things you can do with SOMs that have not been covered here; you can try them yourself easily by reading the help page of the som() function.

SOMs provide methods for projection and prediction of new datasets. Once a SOM is trained, new samples (such as test data or external datasets) can be projected onto the same grid without retraining. This allows you to classify new observations based on where they fall on the trained map. This can be done using the predict function. The algorithm works by comparing each new sample to the codebook vectors in the trained SOM and assigning it to the best matching unit (BMU), the neuron whose weights are closest to the new input. This assignment is returned as unit.classif, indicating the location of each sample on the grid.
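A minimal sketch of this projection; for demonstration the first ten rows of data_input simply stand in for external samples, and in a real analysis these would be held-out or newly measured data:

new_samples <- data_input[1:10, ]            # stand-in for truly new data
pred <- predict(som_model, newdata = new_samples)

pred$unit.classif    # BMU (grid unit) assigned to each projected sample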

You can also train SOMs using a supervised approach. If the SOM was trained with associated labels or response variables (for example, a target outcome, clinical class, or molecular measurement), then predictions can also be generated using the positions of these new samples. This is done by computing the average response (or any other summary) of the training samples that were previously mapped to each neuron. These neuron-level averages are stored in unit.predictions.
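A hedged sketch of such a supervised map using xyf() from the kohonen package and the subtype labels; classvec2classmat() converts the factor into a class-membership matrix that serves as the response layer:

set.seed(101)
som_sup <- xyf(X = data_input,
               Y = classvec2classmat(group_labels),   # subtype labels as response layer
               grid = somgrid(xdim = 5, ydim = 5, topo = "hexagonal"),
               rlen = 1000)

# Predictions for new samples then follow the predict() pattern shown above;
# see ?predict.kohonen for the exact arguments in the multi-layer case.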

Finally, Self-Organizing Maps can also be extended to perform data integration. The idea behind data integration in SOMs is to treat each data type as a separate layer, and to jointly train the SOM on all layers simultaneously. In the kohonen package, this is accomplished by passing a list of matrices (one per data type) to the supersom() function (som() and xyf() are wrappers for the one- and two-layer cases) and optionally assigning weights to each layer using the user.weights argument. Each neuron then has a multi-layer codebook vector, with separate sub-vectors corresponding to each data type.

During training, distances between input samples and neuron codebook vectors are computed layer-wise, and then combined either equally or using customized weights. This allows the SOM to find neurons that reflect shared patterns across all datasets, while also allowing emphasis on one layer over another when necessary.
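A sketch of such a multi-layer map on the matched training data (the mRNA and miRNA layers of breast.TCGA) using supersom(); the 2:1 user.weights are an arbitrary choice purely for illustration:

# Two matched layers from the training set: mRNA and miRNA
layers <- list(mrna  = breast.TCGA$data.train$mrna,
               mirna = breast.TCGA$data.train$mirna)

set.seed(101)
som_multi <- supersom(data = layers,
                      grid = somgrid(xdim = 5, ydim = 5, topo = "hexagonal"),
                      rlen = 1000,
                      user.weights = c(2, 1))   # emphasize the mRNA layer (arbitrary)

# Each unit now has one codebook sub-vector per data layer
lapply(som_multi$codes, dim)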

In summary, SOMs are not only powerful for organizing and visualizing high-dimensional data but also serve as a flexible framework for integrating multiple sources of information, allowing complex biological signals to be explored in a unified and interpretable manner.

Math behind SOMs

In this chapter, we will build a SOM step-by-step, explaining the mathematics in plain language and implementing each step in R. By the end, we will have working SOM code that can be used to reduce the dimensionality of the data.

Data as a Matrix and Distance Computation

To start from the very basics, let’s consider our dataset. We assume we have n observations (also called samples or data points), each described by p features (also called variables). We can organize this data into an \(n \times p\) matrix \(X\). Each row of \(X\) is a point in a p-dimensional space, and each column is one feature across all observations. For example, in a gene expression dataset, each row might be a patient and each column a gene, with the matrix entries being expression levels.

In a SOM, we need to measure similarity between points. We typically use the Euclidean distance, the straight-line distance between two points. If we have a data point \(x = (x_1, x_2, \dots, x_p)\) and another point \(y = (y_1, y_2, \dots, y_p)\), the Euclidean distance between them is defined as:

\[\text{dist}(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2 }.\]

This formula subtracts each feature of \(y\) from the corresponding feature of \(x\), squares those differences, sums them up, and then takes the square root. The larger this distance, the more different the two points are in terms of their features. (Sometimes we omit the final square root and use the squared distance for efficiency, since the smallest distance will also have the smallest squared distance; avoiding the square root does not change which point is closest.)
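In R this can be checked directly, either from the formula or with the built-in dist() function; here, as a small example, on the first two rows of our data matrix:

a <- data_input[1, ]   # first observation
b <- data_input[2, ]   # second observation

sqrt(sum((a - b)^2))   # Euclidean distance straight from the formula
dist(rbind(a, b))      # same value from R's built-in dist()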

The SOM Grid of Neurons

In a self-organizing map, we have a network of neurons arranged in a grid (usually 2-dimensional). Each neuron is also sometimes called a unit or a node. For simplicity, imagine a grid of size 5 x 5 (which gives us 25 neurons). We can picture this as a 5-by-5 chessboard where each square is a neuron.

Each neuron has a weight vector (also called a codebook vector) of the same dimension as our data points (length p). You can think of this weight vector as the neuron’s current position or prototype in the data’s feature space. Initially, these weights can be set randomly, since before training the map doesn’t know anything about the data. As training progresses, the weights will be adjusted to better represent the data.

Let’s set up a 5x5 grid of neurons in R and initialize each neuron’s weight vector randomly. We will use our data’s range to initialize weights so that they start within a reasonable scale of the data:

X <- data_input
# Define grid dimensions
grid_x <- 5
grid_y <- 5
num_neurons <- grid_x * grid_y  # total number of neurons

# Create a grid of neuron coordinates (row, col positions for each neuron)
neuron_coords <- expand.grid(row = 1:grid_x, col = 1:grid_y)

head(neuron_coords, 5)  # show first 5 neuron coordinates
  row col
1   1   1
2   2   1
3   3   1
4   4   1
5   5   1
# Initialize weight vectors for each neuron
set.seed(42)  # for reproducibility
p <- ncol(X)  # dimension of data (number of features)
# Randomly initialize weights in the range [min, max] of each feature
weight_matrix <- matrix(NA, nrow = num_neurons, ncol = p)
for (j in 1:p) {
  min_j <- min(X[, j])
  max_j <- max(X[, j])
  weight_matrix[, j] <- runif(num_neurons, min_j, max_j)
}

Here:

  • neuron_coords is a 25x2 data frame giving each neuron’s grid position. For example, neuron 1 is at row=1, col=1; neuron 2 at row=2, col=1; … neuron 5 at row=5, col=1; neuron 6 at row=1, col=2, etc.
  • weight_matrix is a 25x184 matrix (since p=184 features) holding the weight vectors for the 25 neurons. We filled each column with random values between the min and max of that feature in the dataset. This means initially each neuron’s weight is a random point somewhere within the range of the data. (Other initialization methods exist, such as sampling from a normal distribution or even using a PCA-based initialization, but random uniform initialization is straightforward and commonly used; a sketch of one such alternative follows this list.)
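A sketch of one such alternative: initialize each neuron at a randomly sampled data point, which starts the map inside the data cloud rather than anywhere within the feature ranges.

# Alternative initialization (sketch): start each neuron at a random data row
set.seed(42)
weight_matrix_alt <- X[sample(nrow(X), num_neurons), , drop = FALSE]
dim(weight_matrix_alt)   # still num_neurons x p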

If we look at the output of neuron_coords and the first few weight_matrix rows, they look like this:

head(neuron_coords)
  row col
1   1   1
2   2   1
3   3   1
4   4   1
5   5   1
6   1   2
(weight_matrix[1:3,1:4])
         [,1]     [,2]     [,3]     [,4]
[1,] 15.08110 14.31066 12.52099 16.33687
[2,] 15.18037 13.76164 12.57973 12.26886
[3,] 12.27843 16.04406 12.80791 14.37073

The first output block lists neuron 1 at grid position (1,1), neuron 2 at (2,1), etc. The second block shows example initial weights for neurons 1, 2, 3 (each with four numbers corresponding to the first four features). These are random starting points.

Best Matching Unit (BMU): Finding the Neuron Closest to an Input

With the map initialized, we start the training. The training process will present each data point to the map and adjust the neurons. The first step for any given input data point \(x\) is to find which neuron is the best match for it, i.e., whose weight vector is closest to \(x\) in Euclidean distance. This neuron is called the Best Matching Unit (BMU) for that data point.

Mathematically, if our neurons are indexed by \(j = 1,2,\dots,\text{num_neurons}\) and \(w_j\) is the weight vector of neuron \(j\), then the BMU for input \(x\) is the neuron \(b\) such that:

\[b = \arg\min_{j} \; \|\,x - w_j\,\|,\]

which means \(b\) is the index of the neuron whose weight vector has the minimum distance to \(x\). We compute the distance from \(x\) to every neuron’s weight vector and pick the neuron with the smallest distance. The BMU’s weight vector is the closest in value to the input vector \(x\).

Let’s do this in R for a single example. We’ll take, say, the first data point and find which neuron (out of our 25) is its BMU given the current weights:

# Pick an input data point (e.g., first observation)
x <- X[1, ]  # a vector of length p (here 184 features)
# Compute distance from x to each neuron's weight vector
dists <- apply(weight_matrix, 1, function(w) sqrt(sum((x - w)^2)))
# Find the index of the minimum distance
bmu_index <- which.min(dists)
bmu_distance <- min(dists)
bmu_index; bmu_distance
[1] 2
[1] 30.17694
# Also find the BMU's grid coordinate for clarity
bmu_coord <- neuron_coords[bmu_index, ]
bmu_coord
  row col
2   2   1

In this code:

  • We used apply to compute the distance between x and each row of weight_matrix (each neuron’s weight). Alternatively, we could use vectorized operations or the dist() function, but this approach makes it clear: we calculate all 25 distances.
  • bmu_index is the index of the neuron with smallest distance, and bmu_distance is the distance value itself.
  • bmu_coord translates that index back into a grid position (row, col).

This tells us that for the first data point, the BMU is neuron number 2 (in our numbering scheme) with a distance of ~30.18. Neuron 2 corresponds to grid position (row=2, col=1) on our 5x5 map. So, at this stage of training, the neuron at (2,1) has the weight vector closest to this particular sample’s measurements.

In a sense, the BMU is the neuron that currently “best represents” the input data point. We will adjust this neuron (and its neighbors) to make an even better representation of that input. Repeating this for many data points will distribute the neurons across the data space.

Neighborhood Function: Updating Neighbors of the BMU

Unlike k-means clustering (where only the closest centroid to a point is updated), SOMs also update neurons that are neighbors of the BMU on the grid. The reason is that neurons that are near each other in the 2D grid should also end up near each other in the data space, to preserve the topology (the neighborhood relationships). By updating neighbors, we ensure the map doesn’t become a set of independent points, but rather a smooth mesh that can approximate the data distribution.

Defining neighbors: We need a notion of distance between neurons on the grid. Since our grid is 2D, we can use the grid coordinates to compute distances between any two neurons. Common choices include:

  • Rectangular (grid) distance: If using a square grid topology, one might use Manhattan distance or Chebyshev distance on the grid coordinates.
  • Hexagonal grid distance: If using a hexagonal layout, distance is computed slightly differently (often effectively Euclidean distance on hex coordinates).

For simplicity, we’ll treat our grid as a flat 2D plane and use Euclidean distance between neuron coordinates. This treats neighbors in a circular sense (which is suitable for a hexagonal or “Gaussian” neighborhood). The kohonen package, by default, sets the initial neighborhood radius to cover about 2/3 of the map, meaning initially a BMU’s neighborhood is large, including much of the map, and it shrinks over time.

Neighborhood radius (\(\sigma\)): We introduce a radius \(\sigma\) which defines how far the neighborhood extends at the current time. Initially, \(\sigma\) is large (perhaps covering most of the map), and it will decrease as training progresses. At any iteration, all neurons within distance \(\sigma\) of the BMU (on the grid) are considered “neighbors” to be updated.

Neighborhood function: We also define a function \(h(u,v)\) that gives the influence of the BMU at grid position \(u\) on another neuron at position \(v\). One simple choice is a bubble: every neuron within radius \(\sigma\) gets full influence (h=1) and others get 0. A smoother choice is a Gaussian kernel, where the influence decreases with distance. With a Gaussian neighborhood, we set:

\[h(u,v) = \exp\!\Big(-\frac{d^2(u,v)}{2\,\sigma^2}\Big),\]

where \(d(u,v)\) is the distance between the BMU (at position \(u\)) and neuron \(v\) on the grid, and \(\sigma\) is the current radius (acting like the standard deviation of the Gaussian). This yields \(h(u,v)=1\) for \(d=0\) (the BMU itself) and smaller values for farther neurons, approaching 0 for neurons far away. A Gaussian neighborhood is smooth and tends to produce more natural gradual changes in the map than a sharp bubble function; this can help the map converge more gently and preserve topology better (the updates change gradually over space). The downside is a bit more computation, but for our purposes this is fine.

Let’s calculate the neighborhood influence for the BMU we found (neuron 2 at grid (2,1)) for a given radius. Initially, we might use a fairly large radius. Suppose at the very start \(\sigma \approx 2\) (this is just an example; we’ll formalize the schedule later). We will compute the distance from the BMU to every neuron on the grid and then compute the Gaussian \(h\) value for each:

# Choose a current neighborhood radius (for demonstration, say 2.0)
sigma <- 2.0

# Compute grid distances from the BMU to all neurons
bmu_coord <- as.numeric(bmu_coord)  # (row, col) of BMU as numeric vector
grid_dists <- apply(neuron_coords, 1, function(coord) {
  sqrt(sum((coord - bmu_coord)^2))
})

# Compute Gaussian neighborhood influence for each neuron
neigh_influence <- exp(- (grid_dists^2) / (2 * sigma^2) )

# Look at the influence values for a few neurons:
head(round(neigh_influence, 3), 10)  # first 10 neurons, rounded
 [1] 0.882 1.000 0.882 0.607 0.325 0.779 0.882 0.779 0.535 0.287
# Optionally, reshape into matrix form to see neighborhood pattern:
inf_matrix <- matrix(round(neigh_influence, 2), nrow = grid_x, ncol = grid_y)
inf_matrix
     [,1] [,2] [,3] [,4] [,5]
[1,] 0.88 0.78 0.54 0.29 0.12
[2,] 1.00 0.88 0.61 0.32 0.14
[3,] 0.88 0.78 0.54 0.29 0.12
[4,] 0.61 0.54 0.37 0.20 0.08
[5,] 0.32 0.29 0.20 0.11 0.04

Here:

  • grid_dists is a vector of length 25 giving the distance from the BMU (at (2,1)) to each neuron on the grid.
  • neigh_influence applies the Gaussian formula to those distances.
  • We then inspect some values. The inf_matrix reshapes the influence values back into a 5x5 grid layout, so we can see the pattern around the BMU.

Here, the BMU itself (at row 2, col 1 in our grid labeling, i.e. matrix position [2,1], which holds the highest value of 1.00) has influence 1. Its immediate neighbors have high influence, and neurons further away (bottom-right of this matrix) have lower influence (down to ~0.04 for the farthest corner). This matrix shows a smooth Gaussian hill centered at the BMU, with a radius of 2 producing a fairly broad hill.

library(plotly)

plot_ly(z = ~inf_matrix) %>%
  add_surface(colorscale = "Reds") %>%
  layout(title = "Neighborhood Influence (Gaussian Hill)",
         scene = list(zaxis = list(title = "Influence"),
                      xaxis = list(title = "X"),
                      yaxis = list(title = "Y")))

Please note that we want the map to preserve neighborhood relations; pulling neighbors along with the BMU prevents neurons from chasing individual data points independently. If only the BMU moved, the map would fragment (and effectively it would become similar to k-means clustering). Updating a neighborhood ensures the map’s surface smooths out and adjacent neurons end up representing similar inputs. This helps the SOM form an organized representation: points that are close in the data will end up activating nearby neurons on the map.

Using a Gaussian neighborhood means that neurons close to the BMU move almost as much as the BMU itself, while slightly farther ones move a bit less, creating a gentle gradient of updates. This often leads to better convergence and topology preservation than a binary cutoff. A hard cutoff (bubble) can work too, but it may cause more abrupt changes; the Gaussian is a common choice for more gradual learning. (Another sometimes-used function is the Mexican hat, or difference of Gaussians, which slightly pushes far neighbors in the opposite direction, but that’s beyond our scope.)
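As a quick, optional sketch (reusing the grid_dists and sigma values computed above), we can compare the smooth Gaussian with a hard bubble cutoff at the same radius:

# Compare the Gaussian neighborhood with a hard "bubble" cutoff at the same radius
# (reuses grid_dists and sigma from the example above)
h_gauss  <- exp(-(grid_dists^2) / (2 * sigma^2))  # smooth, gradual decay with grid distance
h_bubble <- as.numeric(grid_dists <= sigma)       # 1 inside the radius, 0 outside

# Side-by-side influence values for the first 10 neurons
round(rbind(gaussian = h_gauss, bubble = h_bubble)[, 1:10], 3)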

Weight Update Rule

We have identified the BMU for an input and determined how strongly each neuron should be influenced (via the neighborhood function). Now we perform the weight update, which is the learning step. The rule for updating weights in a SOM is:

\[w_j^{\text{new}} = w_j^{\text{old}} + \alpha \cdot h(bmu,\; j) \cdot (x - w_j^{\text{old}}),\]

where:

  • \(w_j\) is the weight vector of neuron \(j\),
  • \(x\) is the input data vector,
  • \(\alpha\) is the learning rate (a factor between 0 and 1 that controls how big the change is),
  • \(h(bmu, j)\) is the neighborhood influence of the BMU on neuron \(j\) that we computed (for neurons far from the BMU, this will be near 0; for the BMU itself, this is 1; for other neighbors, somewhere in between).

This formula might look complicated, but it’s basically doing: new weight = old weight + (some fraction) * (difference between input and old weight).

  • If \(j\) is the BMU, \(h(bmu,j) = 1\), so the update is \(w_{bmu}^{new} = w_{bmu}^{old} + \alpha (x - w_{bmu}^{old})\). This is a move of the BMU weight toward the input \(x\). If \(\alpha = 0.1\), for example, the BMU moves 10% of the way from its old position toward the input.
  • If \(j\) is a neighbor, \(0 < h(bmu,j) < 1\), so it also moves toward \(x\), but by a smaller factor (like maybe 5% of the way, etc., depending on distance and \(\alpha\)).
  • If \(j\) is far away (outside the neighborhood), \(h(bmu,j) \approx 0\), so that neuron’s weight stays almost the same for this input.

By applying this rule, the BMU and nearby neurons all shift a bit closer to the input vector \(x\). Intuitively, the map is trying to “pull” its local region toward the data point. Repeated many times for many data points, the neurons will spread out to cover the distribution of the data.
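To make this concrete, here is a tiny sketch with a single feature and hypothetical numbers (input value 10, old weight 4, \(\alpha = 0.1\)):

# Tiny sketch of the update rule on a single feature (hypothetical numbers)
x_val      <- 10    # input value
w_old      <- 4     # a neuron's old weight
alpha_demo <- 0.1   # learning rate for this sketch

w_bmu      <- w_old + alpha_demo * 1.0 * (x_val - w_old)  # BMU (h = 1):        4 + 0.1 * 6  = 4.6
w_neighbor <- w_old + alpha_demo * 0.5 * (x_val - w_old)  # neighbor (h = 0.5): 4 + 0.05 * 6 = 4.3
w_far      <- w_old + alpha_demo * 0.0 * (x_val - w_old)  # far away (h ~ 0):   unchanged, 4.0
c(bmu = w_bmu, neighbor = w_neighbor, far = w_far)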

Let’s perform one weight update step in R for our example input (the first data point). We will use a deliberately large learning rate (\(\alpha=0.5\), just for illustration, so the change is easy to see) and the neighborhood influences we computed (with \(\sigma=2\)). We will update the weight matrix accordingly:

# Set a learning rate for this iteration (e.g., 0.5 for demonstration)
alpha <- 0.5

# Reuse the BMU index found earlier for x and the neigh_influence vector computed for sigma = 2

# Store old weights of BMU and a neighbor for comparison
old_bmu_weight <- weight_matrix[bmu_index, ]
old_neighbor_weight <- weight_matrix[1, ]  # weight of neuron 1, a direct grid neighbor of the BMU

# Update each neuron's weight
for (j in 1:num_neurons) {
  weight_matrix[j, ] <- weight_matrix[j, ] + alpha * neigh_influence[j] * (x - weight_matrix[j, ])
}

# Check how the BMU's weight changed
new_bmu_weight <- weight_matrix[bmu_index, ]
new_neighbor_weight <- weight_matrix[1, ]

head(round(old_bmu_weight, 3))
[1] 15.180 13.762 12.580 12.269  9.124 11.280
head(round(new_bmu_weight, 3))
[1] 13.508 13.306 12.249 13.535 10.030 10.575
head(round(old_neighbor_weight, 3))
[1] 15.081 14.311 12.521 16.337 11.348  8.891
head(round(new_neighbor_weight, 3))
[1] 13.649 13.667 12.255 15.659 11.166  9.324

  • We use the neigh_influence vector (calculated for BMU neuron 2 and \(\sigma=2\)) to scale the update for each neuron.
  • We stored old_bmu_weight and old_neighbor_weight (for neuron 1, just as an example of another neuron) to see the change.

This single update nudged the BMU (and to a lesser extent its neighbors) closer to our data point \(x\).
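As a quick sanity check (reusing the weight vectors we just stored), we can verify that the BMU’s weight really moved closer to \(x\):

# Distance of the BMU's weight vector to x, before and after the update
dist_before <- sqrt(sum((x - old_bmu_weight)^2))
dist_after  <- sqrt(sum((x - new_bmu_weight)^2))
c(before = dist_before, after = dist_after)  # "after" should be smaller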

The learning rate (which will also decrease over time) controls how aggressive each update is. A high \(\alpha\) in the beginning allows the map to quickly adjust to the data. Over time, \(\alpha\) is lowered to fine-tune adjustments. If \(\alpha\) stayed high, the weights might oscillate or never settle (overshooting back and forth); if it’s gradually reduced, the map’s changes become more fine-grained as it converges.

Iterating the Training Loop

So far, we walked through one input and one update. Training a SOM involves repeating this many times over all our data points (usually for multiple epochs, i.e. passes through the dataset). The full training loop typically looks like this:

  1. Initialization: Set the initial weights (done) and choose an initial learning rate \(\alpha_0\) and an initial neighborhood radius \(\sigma_0\).
  2. For each training iteration (one iteration here means one pass through the whole dataset):
    • For each data point \(x\):
      • Find the BMU for \(x\) (the closest neuron).
      • Compute the neighborhood influence \(h(bmu, j)\) for every neuron \(j\) (based on the current \(\sigma\)).
      • Update each neuron’s weight using the formula above.
    • Update (decrease) the learning rate \(\alpha\) and neighborhood radius \(\sigma\) over time.
  3. Stopping: Repeat for a fixed number of iterations or until changes become negligible. In practice, one might use a fixed number of epochs (passes through the data). The R kohonen package by default uses 100 iterations (epochs) if not specified.

We need to decide how \(\alpha\) and \(\sigma\) decrease (their schedule). In our implementation, we’ll use a simple linear decay for both, which is actually what the kohonen package defaults to: the learning rate declines linearly from 0.05 to 0.01 over the course of training, and the neighborhood radius declines from its initial value to 0 (meaning no neighborhood) by the end. Linear decay is easy to implement and understand:

  • \(\alpha\) at iteration \(t\): \(\alpha(t) = \alpha_0 + \frac{t}{T}(\alpha_T - \alpha_0)\), where \(T\) is the total number of iterations (so it interpolates from \(\alpha_0\) to \(\alpha_T\)).
  • \(\sigma\) at iteration \(t\): similarly, \(\sigma(t) = \sigma_0 + \frac{t}{T}(\sigma_T - \sigma_0)\). If we set \(\sigma_T = 0\), this just linearly shrinks to 0.

Alternatively, one can use exponential decay (commonly \(\alpha(t) = \alpha_0 \exp(-t/\tau)\) for some time constant \(\tau\), and similarly for \(\sigma\)), but we’ll stick to linear as in Kohonen’s R implementation.
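As a small illustration, using the values we will adopt below (200 epochs, learning rate from 0.05 to 0.01) plus an arbitrary time constant \(\tau = T/4\) for the exponential variant:

# Sketch: linear vs. exponential decay of the learning rate (tau = T/4 is an illustrative choice)
T_total <- 200     # total number of iterations
alpha0  <- 0.05    # starting learning rate
alphaT  <- 0.01    # final learning rate
t_seq   <- 1:T_total

alpha_linear <- alpha0 + (t_seq / T_total) * (alphaT - alpha0)  # the linear formula above
alpha_exp    <- alpha0 * exp(-t_seq / (T_total / 4))            # exponential decay with tau = T/4

round(cbind(t           = c(1, 50, 100, 200),
            linear      = alpha_linear[c(1, 50, 100, 200)],
            exponential = alpha_exp[c(1, 50, 100, 200)]), 4)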

As mentioned, a good choice for the initial radius \(\sigma_0\) is one that covers a large portion of the map. A practical way to choose it is to take the 67th percentile of all pairwise neuron-to-neuron distances on the grid, so that a neighborhood of radius \(\sigma_0\) reaches roughly two-thirds of the map. For our 5x5 grid, let’s do that:

# Compute all pairwise distances between neurons on the grid
dist_matrix <- as.matrix(dist(neuron_coords))  # Euclidean distances on grid
# Find the 67th percentile distance
sigma0 <- as.numeric(quantile(dist_matrix, 0.67))
sigma0
[1] 3.162278

This gives an initial radius \(\sigma_0\) of about 3.16, the 67th percentile of all neuron-to-neuron distances on the grid. We can use this as our starting radius. By the end of training, we’ll reduce the radius towards 0, meaning that in the final iterations essentially only the BMU itself is updated (little to no neighbor influence).

Now we have all the pieces to train the SOM: initial weights, a loop over epochs and data points, and formulas for BMU, neighborhood, and weight updates, along with schedules for \(\alpha\) and \(\sigma\). Let’s put it all together in R. We’ll use:

  • \(\alpha_0 = 0.05\) and \(\alpha_{\text{final}} = 0.01\) (as defaults),
  • \(\sigma_0\) as calculated above and \(\sigma_{\text{final}} \approx 0\),
  • number of epochs (rlen) = 200 (twice the kohonen default of 100, to give our from-scratch implementation a bit more time to converge).

Full training loop implementation:

# Set training parameters
rlen <- 200               # number of epochs (full passes over the data)
alpha_start <- 0.05        # initial learning rate
alpha_end   <- 0.01        # final learning rate
sigma_start <- sigma0      # initial neighborhood radius (from calculation above)
sigma_end   <- 0.0         # final neighborhood radius

# Re-initialize weights (to start training from scratch)
set.seed(42)
for (j in 1:p) {
  min_j <- min(X[, j]); max_j <- max(X[, j])
  weight_matrix[, j] <- runif(num_neurons, min_j, max_j)
}

# Training loop
n <- nrow(X)  # number of data points
for(epoch in 1:rlen) {
  # Linearly interpolate alpha and sigma for this epoch
  alpha <- alpha_start + (epoch - 1) * (alpha_end - alpha_start) / (rlen - 1)
  sigma <- sigma_start + (epoch - 1) * (sigma_end - sigma_start) / (rlen - 1)
  sigma <- max(0.5, sigma)  # small floor so the Gaussian denominator never becomes zero
  
  # Shuffle the data order each epoch to avoid bias from the row ordering
  data_index_order <- sample(n)
  
  for(i in data_index_order) {
    x <- X[i, ]
    ## 1. Find BMU for x
    dists <- apply(weight_matrix, 1, function(w) sum((x - w)^2))  # using squared distance for speed
    bmu_index <- which.min(dists)
    ## 2. Compute neighbor influences (Gaussian)
    bmu_coord <- as.numeric(neuron_coords[bmu_index, ])
    grid_dists <- apply(neuron_coords, 1, function(coord) {
      sqrt(sum((coord - bmu_coord)^2))
    })
    h <- exp(- (grid_dists^2) / (2 * sigma^2))

    ## 3. Update weights of all neurons
    for(j in 1:num_neurons) {
      weight_matrix[j, ] <- weight_matrix[j, ] + alpha * h[j] * (x - weight_matrix[j, ])
    }
  }
}

Here is what we have done:

  • We set rlen = 200 (meaning 200 epochs through the dataset).
  • We re-initialized the weight_matrix to random values so we start fresh (using the same seed for reproducibility).
  • In each epoch, we shuffle the order of data (data_index_order). Shuffling is generally a good practice to avoid any bias from data ordering.
  • We compute the current alpha and sigma by linear interpolation between the start and end values, based on the epoch number (keeping a small floor on sigma so the Gaussian denominator never becomes zero).
  • Then, for each data point in this epoch, we find the BMU (using squared distances for efficiency: the neuron with the smallest squared distance is also the one with the smallest distance, so there is no need to take square roots), compute the neighborhood influence vector h for the current sigma, and update each neuron’s weights accordingly.
  • After all epochs, weight_matrix contains the trained codebook vectors.

# Training completed. Let's examine the resulting weight vectors (codebook vectors) for a few neurons:
round(weight_matrix[1:5, 1:5], 3)
       [,1]   [,2]   [,3]   [,4]   [,5]
[1,] 12.622 13.610 12.644 14.744 10.728
[2,] 12.293 13.281 12.321 14.363 11.027
[3,] 12.341 13.332 12.371 14.288 11.485
[4,] 12.438 13.426 12.462 14.306 10.997
[5,] 12.619 13.613 12.635 14.206 10.820

Each row is a neuron’s weight vector in the data space. It is not obvious just from looking at the numbers, but these codebook vectors should now be representative of different regions (clusters) of the measurements.

To check that our implementation behaves sensibly, we can map every sample to its BMU and draw the mapping plot.

bmu <- apply(
  X, 1,
  function(xi) {
    # squared Euclidean distance to every neuron
    d2 <- colSums((t(weight_matrix) - xi)^2)
    which.min(d2)
  }
)

## neuron_coords gives each neuron's grid position in columns row and col
rect_df <- transform(
  as.data.frame(neuron_coords),
  xmin = row - 0.5,
  xmax = row + 0.5,
  ymin = col - 0.5,
  ymax = col + 0.5
)

hits         <- tabulate(bmu, nbins = nrow(neuron_coords))  # number of samples mapped to each neuron
rect_df$hits <- hits

set.seed(123)
jitter_amount <- 0.45

map_df <- data.frame(
  x     = neuron_coords[bmu, 1] + runif(length(bmu), -jitter_amount,  jitter_amount),
  y     = neuron_coords[bmu, 2] + runif(length(bmu), -jitter_amount,  jitter_amount)
)
plot( NA, xlim = c(0.5, max(rect_df$xmax)),
           ylim = c(max(rect_df$ymax), 0.5),
           asp  = 1, xlab = "Grid column", ylab = "Grid row",
           main = "SOM mapping" )

## rectangles
with(rect_df, rect(xmin, ymax, xmax, ymin,
                   col = gray(1 - hits / max(hits)), border = "grey60"))
## jittered points
points(map_df$x, map_df$y, pch = 19, col = group_labels)

That gives a reasonably good clustering of the samples. Please note that real implementations involve many small tricks (initialization, randomization, decay schedules, and so on) that can make a from-scratch algorithm diverge from official packages, so we might get somewhat different results depending on, for example, weight initialization or random sampling.
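For comparison, here is a minimal sketch of how an equivalent map could be fitted with the kohonen package (assuming X is the same numeric data matrix we trained on):

# Sketch: fitting a comparable 5x5 SOM with the kohonen package
library(kohonen)

som_grid  <- somgrid(xdim = 5, ydim = 5, topo = "rectangular")
som_model <- som(as.matrix(X), grid = som_grid,
                 rlen  = 100,              # package default: 100 iterations
                 alpha = c(0.05, 0.01))    # package default: linear decay from 0.05 to 0.01

# Mapping plot analogous to the one we drew by hand
plot(som_model, type = "mapping", main = "kohonen SOM mapping")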

This brings us to the end of this chapter. We built a Self-Organizing Map from scratch in R, step by step. Starting from the concept of a data matrix and Euclidean distance, we constructed a grid of neurons with weight vectors, found best matching units, updated neighborhoods of neurons, and iteratively trained the map with decaying learning rate and neighborhood size.

SOMs are powerful for visualizing and clustering high-dimensional, non-linear data while preserving topological relationships. Their structured grid layout provides intuitive interpretation compared to scatter-based methods like t-SNE or UMAP. However, they require careful tuning of parameters such as learning rate and neighborhood radius, and they may struggle with very large datasets or overlapping clusters without proper initialization and normalization.