1  t-distributed stochastic neighbor embedding (t-SNE)

t-SNE is often used when the goal is to uncover patterns or clusters within complex datasets that are not easily seen in the original high-dimensional space. Unlike methods such as PCA, which focus on capturing global patterns across all data points, t-SNE emphasizes local relationships. By local relationships, we mean that t-SNE primarily works towards preserving the relative distances between points that are already close to each other in the original dataset.

It is important to mention up front that t-SNE does not focus as much on global structure, that is, on how distances between far-apart points across the entire dataset are preserved. In other words, we give priority to accurately reflecting the relationships of nearby points, and this usually comes at the expense of the larger global arrangement. We will explain in detail later why this happens, but for now let’s accept that we have a method called t-SNE that is capable of preserving local structures.

Basically, we want to use t-SNE because we believe there are tightly clustered groups in the data. When we suspect that the key information or insights lie in these tightly clustered regions (such as distinct subpopulations or small patterns), t-SNE is great for revealing these local structures. For example, in genomics, t-SNE is often used to identify subtypes of cells based on gene expression data. This is obviously based on the assumption that certain groups of cells will have similar expression profiles, forming tightly clustered groups. Many other fields, such as image analysis, can also benefit from revealing these local structures.

2 t-SNE in R

Without going further into the details, we are going to start using t-SNE in R. The main function used for this is Rtsne() from the Rtsne package. Before that, I want to simulate some data so we can check later how t-SNE is doing.

Code
# Set seed for reproducibility
set.seed(123)

# Number of points per subcluster
n <- 25

# Manually define main cluster centers (global structure)
main_cluster_centers <- data.frame(
  x = c(0, 5, 10, 15),
  y = c(0, 5, 10, 15),
  z = c(0, 5, 10, 15),
  cluster = factor(1:4)
)

# Manually define subcluster offsets relative to each main cluster
# These small offsets will determine subcluster locations within each main cluster
subcluster_offsets <- data.frame(
  x_offset = c(-0.25, 0.25, -0.25, 0.25),
  y_offset = c(-0.25, -0.25, 0.25, 0.25),
  z_offset = c(0.25, -0.25, -0.25, 0.25),
  subcluster = factor(1:4)
)

# Initialize an empty data frame to hold all data
data <- data.frame()

# Generate data for each main cluster with subclusters
for (i in 1:nrow(main_cluster_centers)) {
  for (j in 1:nrow(subcluster_offsets)) {
    # Calculate subcluster center by adding the offset to the main cluster center
    subcluster_center <- main_cluster_centers[i, 1:3] + subcluster_offsets[j, 1:3]
    
    # Generate points for each subcluster with a small spread (to form local clusters)
    subcluster_data <- data.frame(
      gene1 = rnorm(n, mean = subcluster_center$x, sd = 0.25),  # Small spread within subclusters
      gene2 = rnorm(n, mean = subcluster_center$y, sd = 0.25),
      gene3 = rnorm(n, mean = subcluster_center$z, sd = 0.25),
      cluster = main_cluster_centers$cluster[i],
      subcluster = subcluster_offsets$subcluster[j]
    )
    
    # Add generated subcluster data to the main data frame
    data <- rbind(data, subcluster_data)
  }
}

This data has just three dimensions for 400 samples (4 major clusters). We can plot the data here:

Code
# Visualize the original data in 3D, distinguishing clusters and subclusters
library(dplyr)
library(plotly)  # For interactive 3D plots
plot_ly(
  data, x = ~gene1, y = ~gene2, z = ~gene3, color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter3d", mode = "markers",size = 5
)  %>%
  layout(
    title = "Original 3D Data with Clusters and Subclusters",
    scene = list(
      camera = list(
        eye = list(x = 0.3, y =2.5, z = 1.2)  # Change x, y, z to adjust the starting angle
      )
    )
  )

For this data, the features are in the columns and the samples are in the rows.

We are now going to run t-SNE on this data and see the results. There are some parameters to set, but we will just go with the defaults. We also run PCA and show the results side by side.

Code
library(Rtsne)
library(ggplot2)
library(cowplot)

set.seed(123)
tsne_results_30 <- Rtsne(
  as.matrix(data[, c("gene1", "gene2", "gene3")])
)


data$tsne_x_30 <- tsne_results_30$Y[, 1]
data$tsne_y_30 <- tsne_results_30$Y[, 2]


tsne_plot <- plot_ly(
  data, x = ~tsne_x_30, y = ~tsne_y_30, color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
)  %>%
  layout(
    title = ""

  )



pca_results <- prcomp(data[, c("gene1", "gene2", "gene3")], scale. = FALSE)
data$pca_x <- pca_results$x[, 1]
data$pca_y <- pca_results$x[, 2]


pca_plot <- plot_ly(
  data, x = ~pca_x, y = ~pca_y, color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
)  %>%
  layout(
    title = ""

  )

  
subplot(pca_plot%>%layout(showlegend = FALSE), tsne_plot%>%layout(showlegend = FALSE),
        titleX = T,titleY = T,margin = 0.2)%>% layout(annotations = list(
 list(x = 0.13 , y = 1.035, text = "PCA", showarrow = F, xref='paper', yref='paper'),
  list(x = 0.85 , y = 1.035, text = "t-SNE", showarrow = F, xref='paper', yref='paper'))
)

The plot shows a side-by-side comparison of results obtained using PCA (left) and t-SNE (right) for dimensionality reduction. In the PCA plot, the data points are clearly separated by their clusters along the horizontal axis, but the clusters themselves are quite stretched and not very compact, indicating that PCA is capturing global variance sources but doesn’t reveal tight local groupings.

On the other hand, the t-SNE plot reveals well-separated, compact clusters, showing how good t-SNE is at preserving local structures within the data. The clusters are clearly distinct from one another, and their circular, tightly packed shapes indicate that t-SNE effectively keeps points that are similar (close in high-dimensional space) together in the low-dimensional projection. However, the distances between clusters may not reflect global relationships as well as PCA does (we will get back to this later).

2.1 Input parameters

Like any other method, t-SNE also requires some input parameters. Let’s start with the most essential ones and later go through the rest.

The data matrix (X) is the core input for t-SNE, where each row represents an observation, and each column is a variable or feature. This matrix is typically high-dimensional, and t-SNE’s job is to map it into a lower-dimensional space (usually 2D or 3D) to make patterns more interpretable.

Next, dims defines the dimensionality of the output. Typically, we choose 2D for visualization purposes, but 3D is also possible if more complexity is required.

The perplexity parameter is probably the most important one. It controls how t-SNE balances local and global structures of the data. You can think of it as determining how many “neighbors” each data point should consider when projecting into the low-dimensional space. Choosing the right perplexity is very important because it affects how t-SNE interprets the relationships between data points.

Another key parameter in this specific implementation is theta, which adjusts the balance between accuracy and computational speed. For large datasets, using a larger theta can make t-SNE run faster but at the cost of some accuracy. If you prioritize precision, especially for smaller datasets, you can set this parameter to 0 for exact t-SNE.

We also have the max_iter parameter, which controls the number of iterations the algorithm goes through during optimization. More iterations give t-SNE time to better fine-tune the output, though in practice, the default is often sufficient unless you notice the algorithm hasn’t converged.

After these essential parameters, additional settings like PCA preprocessing, momentum terms, and learning rate can further fine-tune the performance of t-SNE for different types of data. We will talk about these later.
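
To make these concrete, here is a minimal sketch of a call that spells out the essential parameters explicitly; the values shown are simply the package defaults, not recommendations for your data.

Code
library(Rtsne)

set.seed(123)
tsne_custom <- Rtsne(
  as.matrix(data[, c("gene1", "gene2", "gene3")]),  # X: samples in rows, features in columns
  dims = 2,         # dimensionality of the output space
  perplexity = 30,  # balances local vs. global structure
  theta = 0.5,      # speed/accuracy trade-off (0 = exact t-SNE)
  max_iter = 1000   # number of optimization iterations
)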

2.2 Perplexity

I had to devote a small section to this parameter because it is probably the most important one (if you already have your data 😉). Perplexity essentially defines how t-SNE balances attention between local and global structures in your data. You can think of it as controlling the size of the neighborhood each data point considers when positioning itself in the lower-dimensional space.

A small perplexity value (e.g., 5–30) means that t-SNE will focus more on preserving the local structure, i.e., ensuring that close neighbors in the high-dimensional space stay close in the low-dimensional projection. This is great for datasets with small clusters or when you’re particularly interested in capturing details.

A larger perplexity (e.g., 30–50 or even higher) encourages t-SNE to preserve global structure, considering more distant points as part of the neighborhood. This can be useful for larger datasets or when you want to capture broader relationships between clusters.

Finding the right perplexity often involves some experimentation. If it’s too small, t-SNE might overfit to the local structure and fail to reveal larger patterns in the data. If it’s too large, you might lose relationships between nearby data points. t-SNE is relatively robust to different perplexity values, so changing this parameter slightly usually won’t result in big changes, but it can make the difference between a good visualization and a great one.

A rule of thumb is to set the perplexity such that \(3 \times \text{perplexity} < n-1\), where \(n\) is the number of data points. Testing several values across this range will help you find the best fit for your data.
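
As a quick sketch of that rule of thumb, you can compute the upper bound for the perplexity directly from your sample size (here using the simulated data from above):

Code
n_samples <- nrow(data)                # 400 samples in our simulated data
max_perplexity <- (n_samples - 1) / 3  # from 3 * perplexity < n - 1
max_perplexity                         # keep the perplexity below this value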

2.2.1 See the impact of perplexity

I did not tell you before (you might have realized it from the code though) but there are substructures in the original data. We are now going to plot them again.

Code
library(dplyr)
library(plotly)  # For interactive 3D plots
plot_ly(
  data, x = ~gene1, y = ~gene2, z = ~gene3, color = ~subcluster,#,symbol=~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter3d", mode = "markers",size = 5
)  %>%
  layout(
    title = "Original 3D Data with Clusters and Subclusters",
    scene = list(
      camera = list(
        eye = list(x = 0.3, y =2.5, z = 1.2)  # Change x, y, z to adjust the starting angle
      )
    )
  )

You should now see that within each cluster we have subclusters. Let’s see if our original t-SNE was successful in separating them.

Code
tsne_plot2<-plot_ly(
  data, x = ~tsne_x_30, y = ~tsne_y_30, color = ~subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
)  %>%
  layout(
    title = ""

  )



pca_plot2 <- plot_ly(
  data, x = ~pca_x, y = ~pca_y, color = ~subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
)  %>%
  layout(
    title = ""

  )

  
subplot(pca_plot2%>%layout(showlegend = FALSE), tsne_plot2%>%layout(showlegend = FALSE),
        titleX = T,titleY = T,margin = 0.2)%>% layout(annotations = list(
 list(x = 0.13 , y = 1.035, text = "PCA", showarrow = F, xref='paper', yref='paper'),
  list(x = 0.85 , y = 1.035, text = "t-SNE", showarrow = F, xref='paper', yref='paper')))

Compared to PCA we have actually done a good job. Most clusters seem to be well separated. But what we want to do now is change the perplexity to see if we can make this better.

Code
set.seed(123)
tsne_results_20 <- Rtsne(
  as.matrix(data[, c("gene1", "gene2", "gene3")]),perplexity = 20
)
data$tsne_x_20 <- tsne_results_20$Y[, 1]
data$tsne_y_20 <- tsne_results_20$Y[, 2]

tsne_plot2<-plot_ly(
  data, x = ~tsne_x_30, y = ~tsne_y_30, color = ~subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
)  %>%
  layout(
    title = ""

  )



tsne_plot20<-plot_ly(
  data, x = ~tsne_x_20, y = ~tsne_y_20, color = ~subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
)  %>%
  layout(
    title = ""

  )
  
subplot(tsne_plot2%>%layout(showlegend = FALSE), tsne_plot20%>%layout(showlegend = FALSE),
        titleX = T,titleY = T,margin = 0.2)%>% layout(annotations = list(
 list(x = 0.13 , y = 1.035, text = "t-SNE (30)", showarrow = F, xref='paper', yref='paper'),
  list(x = 0.85 , y = 1.035, text = "t-SNE (20)", showarrow = F, xref='paper', yref='paper')))

I have now decreased the perplexity to 20. What we can see is that at a perplexity of 30, t-SNE accounts for a larger neighborhood of points when embedding the data, which results in clear separation between well-defined, compact clusters. At 20, however, each cluster also appears distinctly separated in space, maintaining a reasonable balance between local and global structure, and the substructures within the clusters are more prominent, with some separation and internal grouping within each main cluster. This suggests that t-SNE is better at capturing smaller-scale local variations with a lower perplexity.

2.3 Output of Rtsne

This is relatively straightforward. The most important output of this function is \(Y\). You can extract it using tsne_results_20$Y. This is a matrix with exactly the same number of rows as your original data, and the number of columns equals the dims parameter. So basically it is your data transformed into a lower-dimensional space of size dims; in our case, 2.

The rest of the output is basically either information about the optimization process or a summary of the input parameters. We are going to ignore them for now and will get back to them later if needed.
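
As a quick sanity check (a sketch, assuming tsne_results_20 from above is still in your workspace), you can confirm that \(Y\) has one row per sample and dims columns, and list the other stored elements:

Code
dim(tsne_results_20$Y)      # 400 rows (samples) x 2 columns (dims)
head(tsne_results_20$Y)     # the low-dimensional coordinates

names(tsne_results_20)      # remaining elements: stored input parameters and optimization info
tsne_results_20$perplexity  # for example, the perplexity that was used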

Do it yourself

Play around with these parameters (perplexity, theta, and max_iter) to see if you can get better results.

3 What does this t-SNE plot mean!?

At this stage, it’s time to explore a bit more about what the t-SNE results can mean and whether we should trust them.

The first thing you should have noted from the last exercise is that t-SNE results are sensitive to the choice of parameters. As we saw, adjusting the perplexity changes how the data is visualized, with higher perplexities focusing more on global structure and lower perplexities emphasizing local relationships. This flexibility is both a strength and a limitation: it can show different aspects of the data, but it also means that different parameter settings can lead to different interpretations of the same dataset.

Code
tnse_data_frames<-c()
for(it in c(1:10,100,200,300,500,1000,3000,5000))
{
  
  set.seed(123)
tsne_results_it <- Rtsne(
  as.matrix(data[, c("gene1", "gene2", "gene3")]),perplexity = 30,max_iter = it
)
  tnse_data_frames<-rbind(tnse_data_frames,data.frame(tsne_results_it$Y,data[,c(4,5)],max_iter=it))
}


tsne_plot2<-plot_ly(
  tnse_data_frames, x = ~X1, y = ~X2, color = ~subcluster,frame=~max_iter,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
)  %>%
  layout(
    title = ""

  )

tsne_plot2

Click on the “Play” button to see the effect of the max_iter parameter. In this specific case, we could already see the clusters at a very early stage, so very little unexpected happens. But in reality, for noisier data, the number of steps matters and can actually change the interpretation of the data. Please note that in this specific case I have also set the seed. t-SNE is a stochastic method, meaning that each run can give slightly different results unless a random seed is fixed. This randomness can also impact the arrangement of points in the low-dimensional space, especially with small datasets. So if you want to get the same results, set the seed!
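
Here is a small sketch of that point: two runs without fixing the seed will generally not give identical coordinates, while fixing the seed before each run makes the result reproducible.

Code
# Two runs without fixing the seed: the coordinates generally differ
res_a <- Rtsne(as.matrix(data[, c("gene1", "gene2", "gene3")]))
res_b <- Rtsne(as.matrix(data[, c("gene1", "gene2", "gene3")]))
identical(res_a$Y, res_b$Y)  # usually FALSE

# Fixing the seed before each run makes the result reproducible
set.seed(123); res_c <- Rtsne(as.matrix(data[, c("gene1", "gene2", "gene3")]))
set.seed(123); res_d <- Rtsne(as.matrix(data[, c("gene1", "gene2", "gene3")]))
identical(res_c$Y, res_d$Y)  # TRUE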

We can also see how changing the perplexity affects our interpretation.

Code
tnse_data_frames<-c()
for(it in c(10,20,30,40,50,60,70,100))
{
  
  set.seed(123)
tsne_results_it <- Rtsne(
  as.matrix(data[, c("gene1", "gene2", "gene3")]),perplexity = it,
)
  tnse_data_frames<-rbind(tnse_data_frames,data.frame(tsne_results_it$Y,data[,c(4,5)],perplexity=it))
}


tsne_plot2<-plot_ly(
  tnse_data_frames, x = ~X1, y = ~X2, color = ~subcluster,frame=~perplexity,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
)  %>%
  layout(
    title = ""

  )

tsne_plot2

It should not come as a surprise that perplexity has the biggest impact on the interpretation. One can play around with different parameter values to get a few different views of the data in order to arrive at a more robust interpretation.

3.1 Between-cluster distances and densities might not be accurate!

Code
# Set seed for reproducibility
set.seed(123)

# Number of points per subcluster
n <- 25

# Manually define main cluster centers (global structure)
main_cluster_centers <- data.frame(
  x = c(0, 10, 15, 35),
  y = c(0, 10, 15, 35),
  z = c(0, 10, 15, 35),
  cluster = factor(1:4)
)

# Manually define subcluster offsets relative to each main cluster
# These small offsets will determine subcluster locations within each main cluster
subcluster_offsets <- data.frame(
  x_offset = c(-0.25, 0.25, -0.25, 0.25),
  y_offset = c(-0.25, -0.25, 0.25, 0.25),
  z_offset = c(0.25, -0.25, -0.25, 0.25),
  subcluster = factor(1:4)
)

# Initialize an empty data frame to hold all data
data <- data.frame()

# Generate data for each main cluster with subclusters

for (i in 1:nrow(main_cluster_centers)) {
  for (j in 1:nrow(subcluster_offsets)) {
    # Calculate subcluster center by adding the offset to the main cluster center
    subcluster_center <- main_cluster_centers[i, 1:3] + subcluster_offsets[j, 1:3]
    
    # Generate points for each subcluster with a small spread (to form local clusters)
    subcluster_data <- data.frame(
      gene1 = rnorm(n, mean = subcluster_center$x, sd = 0.25*i*i),  # Small spread within subclusters
      gene2 = rnorm(n, mean = subcluster_center$y, sd = 0.25*i*i),
      gene3 = rnorm(n, mean = subcluster_center$z, sd = 0.25*i*i),
      cluster = main_cluster_centers$cluster[i],
      subcluster = subcluster_offsets$subcluster[j]
    )
    
    # Add generated subcluster data to the main data frame
    data <- rbind(data, subcluster_data)
  }
}


plot_ly(
  data, x = ~gene1, y = ~gene2, z = ~gene3, color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter3d", mode = "markers",size = 5
)  %>%
  layout(
    title = "Original 3D Data with Clusters and Subclusters",
    scene = list(
      camera = list(
        eye = list(x = 0.3, y =2.5, z = 1.2)  # Change x, y, z to adjust the starting angle
      )
    )
  )

What I have done now is give each larger cluster a different density and also different distances between the clusters. We can run t-SNE and compare the results to PCA.

Code
set.seed(123)
tsne_results_30 <- Rtsne(
  as.matrix(data[, c("gene1", "gene2", "gene3")])
)


data$tsne_x_30 <- tsne_results_30$Y[, 1]
data$tsne_y_30 <- tsne_results_30$Y[, 2]


tsne_plot <- plot_ly(
  data, x = ~tsne_x_30, y = ~tsne_y_30, color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
)  %>%
  layout(
    title = ""

  )



pca_results <- prcomp(data[, c("gene1", "gene2", "gene3")], scale. = FALSE)
data$pca_x <- pca_results$x[, 1]
data$pca_y <- pca_results$x[, 2]


pca_plot <- plot_ly(
  data, x = ~pca_x, y = ~pca_y, color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
)  %>%
  layout(
    title = ""

  )

  
subplot(pca_plot%>%layout(showlegend = FALSE), tsne_plot%>%layout(showlegend = FALSE),
        titleX = T,titleY = T,margin = 0.2)%>% layout(annotations = list(
 list(x = 0.13 , y = 1.035, text = "PCA", showarrow = F, xref='paper', yref='paper'),
  list(x = 0.85 , y = 1.035, text = "t-SNE", showarrow = F, xref='paper', yref='paper'))
)

PCA did a great job in preserving the relative distances between clusters, reflecting the original distribution of the data. The density of the clusters is also proportional to the original data, with tighter clusters remaining dense and more spread-out clusters maintaining their looser arrangement. In contrast, t-SNE naturally equalizes the densities of the clusters, making them appear more uniform. This is not an artifact, but a deliberate feature of t-SNE, where dense clusters are expanded and sparse ones contracted. As a result, the distances between clusters in the t-SNE plot may appear more similar, and the clusters themselves more evenly distributed, which can distort the true global relationships.

Clustering on t-SNE results

Avoid doing density- and distance-based clustering in t-SNE space. In the vast majority of cases, the distances and densities don’t have much meaning! If you need cluster labels, see the sketch below for an alternative.
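
One option (sketched below; the kmeans() call and the choice of 4 centers are just illustrative) is to cluster in the original feature space and use the t-SNE map only for display:

Code
# Cluster in the original (high-dimensional) space, then color the t-SNE map by those labels
km <- kmeans(data[, c("gene1", "gene2", "gene3")], centers = 4, nstart = 25)

plot_ly(
  data, x = ~tsne_x_30, y = ~tsne_y_30, color = factor(km$cluster),
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
)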

The last thing I want to mention here is that setting the perplexity too low might cause noise to be misinterpreted as clusters. In our previous example, we know that we have four major clusters in our data, but look what happens if we decrease the perplexity too much.

Code
tnse_data_frames<-c()
for(it in c(2:10))
{
  
  set.seed(123)
tsne_results_it <- Rtsne(
  as.matrix(data[, c("gene1", "gene2", "gene3")]),perplexity = it,
)
  tnse_data_frames<-rbind(tnse_data_frames,data.frame(tsne_results_it$Y,data[,c(4,5)],perplexity=it))
}


tsne_plot2<-plot_ly(
  tnse_data_frames, x = ~X1, y = ~X2, color = ~subcluster,frame=~perplexity,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
)  %>%
  layout(
    title = ""

  )

tsne_plot2

Small clusters start to appear, which don’t exist in the original data. Unfortunately, this is largely dependent on the perplexity parameter. Lower perplexities overemphasize local structures, making t-SNE susceptible to identifying random noise as distinct clusters. This can lead to false interpretations, especially when exploring new datasets where the true structure is not well known. Therefore, it’s important to experiment with different perplexity values and validate findings with complementary methods.
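
To see this without any real structure at all, here is a minimal sketch (not part of the simulations above) that runs t-SNE on pure Gaussian noise with a very small perplexity; any apparent grouping it shows is an artefact.

Code
# t-SNE on pure noise: there are no true clusters here
set.seed(123)
noise <- matrix(rnorm(200 * 3), ncol = 3)
noise_tsne <- Rtsne(noise, perplexity = 3)

plot_ly(
  data.frame(noise_tsne$Y), x = ~X1, y = ~X2,
  type = "scatter", mode = "markers", size = 5
)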

4 Other parameters

As said before, t-SNE has a few parameters that we can play with. I will try to explain them here, and a sketch that uses several of them in a single call follows the list. But don’t be scared! You can usually set most of these to their default values.

  1. initial_dims (integer): This sets the number of dimensions used in the initial PCA step, which simplifies the data before t-SNE is applied. PCA helps reduce noise and makes the t-SNE computation faster by keeping only the most important information. This is especially useful for large datasets. Keep this default unless you’re sure you need more/fewer features.

  2. theta (numeric): This controls the speed/accuracy trade-off. A higher value (closer to 1) makes the algorithm faster but less accurate, while a lower value (closer to 0) makes it slower but more precise. For very large datasets, it might take too long to run t-SNE precisely. Increasing theta speeds things up at the cost of some accuracy. Use theta = 0 for small datasets if you want full accuracy. For large datasets, increase theta for faster results.

  3. check_duplicates (logical): If TRUE, t-SNE will check if you have duplicate rows in your data. If duplicates exist, it could lead to inaccurate results. Duplicates can mess up the analysis because t-SNE assumes every data point is unique. For large datasets, this check can slow down the computation, so disable it if you know there are no duplicates.

  4. pca (logical): Controls whether an initial PCA step is applied. PCA reduces the dimensionality of the data before t-SNE processes it. PCA helps with large datasets by reducing noise and computation time. It’s usually a good idea to leave this as TRUE, especially for high-dimensional data. Only set it to FALSE if you don’t want PCA pre-processing (e.g., if your data is already low-dimensional).

  5. partial_pca (logical): If TRUE, a faster version of PCA called truncated PCA is used. It’s useful for very large datasets. This speeds up the computation when the dataset is huge and PCA would take too long.

  6. is_distance (logical): If TRUE, X is treated as a distance matrix rather than raw data. Sometimes, instead of feeding the raw data, you may already have pairwise distances between points (e.g., from a precomputed distance metric). This allows t-SNE to work directly on that distance matrix. Use TRUE if you’ve precomputed distances, like Euclidean or cosine distances.

  7. Y_init (matrix): Allows you to specify initial locations for the data points in the low-dimensional space. If NULL, random initialization is used. Custom initialization can help if you want more control over how the algorithm starts, but random initialization usually works fine. Leave as NULL unless you have a specific reason to initialize positions manually (e.g., restarting an interrupted run).

  8. pca_center (logical): Determines whether the data should be centered (mean subtracted) before PCA is applied. Centering is standard practice in PCA. It ensures that the PCA finds meaningful directions in the data. Default is TRUE, and you should leave it as is.

  9. pca_scale (logical): Determines whether the data should be scaled (divided by standard deviation) before PCA is applied. Scaling ensures that variables with different units (e.g., age vs. income) don’t dominate the PCA. Use scaling if your variables have vastly different ranges. Set to TRUE if your data has features on different scales.

  10. normalize (logical): Controls whether the data is normalized internally before distance calculations. Normalizing ensures that all features are treated equally in terms of distance calculation. Leave this as TRUE unless your data is already pre-normalized.

  11. stop_lying_iter (integer): The iteration after which t-SNE stops using exaggerated similarities (helps separate clusters early on). Default is 250. The “lying” phase exaggerates the distances between clusters to spread the data points more effectively at the beginning. After 250 iterations, t-SNE refines this. Use the default unless you know you need something different.

  12. momentum (numeric): Controls how much momentum is used during the first part of the optimization. Momentum helps the algorithm avoid getting stuck in bad configurations. The default is 0.5, which is a good balance for exploration. Only change this if you’re very familiar with gradient-based optimization.

  13. mom_switch_iter (integer): The iteration where t-SNE switches to final momentum (controls how fast the algorithm converges). Early in optimization, t-SNE uses lower momentum to explore more configurations. After this switch, it refines the layout. Use the default value unless you have a specific reason to change it.

  14. final_momentum (numeric): The momentum used during the final phase of optimization. Default is 0.8. In the final phase, higher momentum helps the algorithm settle into the final configuration.

  15. eta (numeric): The learning rate for optimization. Default is 200. Learning rate controls how fast the algorithm updates the data point positions. Too high can cause instability; too low can make it slow. Stick with the default unless you notice unstable convergence.

  16. exaggeration_factor (numeric): Controls how much the similarities are exaggerated in the early iterations. Default is 12. Exaggeration helps spread out points early on, making it easier to see clusters. Leave it as the default unless you are experienced with t-SNE’s optimization.

  17. num_threads (integer): The number of CPU threads to use for computation. The default is 1. More threads can speed up the computation, especially for large datasets. Set it to the number of cores your machine has to speed up the process.

  18. index (integer matrix): Precomputed nearest neighbors. This helps t-SNE compute distances faster by starting with neighbors already calculated.

  19. distance (numeric matrix): Precomputed distances between the nearest neighbors. This further speeds up t-SNE by skipping the distance calculations.
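
As a rough sketch (the values are only illustrative and are mostly the package defaults), a call that sets several of these parameters explicitly might look like this:

Code
set.seed(123)
tsne_tuned <- Rtsne(
  as.matrix(data[, c("gene1", "gene2", "gene3")]),
  dims = 2,
  perplexity = 30,
  theta = 0.5,              # speed/accuracy trade-off
  max_iter = 1000,
  pca = TRUE,               # PCA pre-processing
  pca_center = TRUE,
  pca_scale = FALSE,
  normalize = TRUE,
  eta = 200,                # learning rate
  exaggeration_factor = 12,
  num_threads = 1,
  check_duplicates = TRUE
)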

5 How does t-SNE work (math details)

We are now going to explore a bit of the internals of t-SNE. In simple terms, t-SNE works by finding pairwise similarities between data points in high dimensions, then mapping them to a lower-dimensional space while preserving these similarities, ensuring that points close together in high dimensions remain close in the lower-dimensional map.

So for now it seems we need three key components:

  1. A way to measure similarities (between the samples or datapoints) in the higher-dimensional space.
  2. A way to measure similarities in the lower-dimensional space.
  3. A way to map points from the higher-dimensional space to the lower-dimensional space while preserving the similarities as much as possible.

Let’s start exploring each step. Probably the first thing that comes to mind is that we should use Euclidean distance to represent the distance between the data points. Unfortunately, directly using Euclidean distance would prioritize global structure, which distorts the smaller-scale relationships between points. This was particularly evident in the PCA plot, where most of the subclusters were merged. Basically, when we use Euclidean distances directly for dimensionality reduction, the algorithm attempts to maintain the relative distances between all pairs of points, including those that are far apart. But in any case, let’s go ahead and try Euclidean distance first.

We are going to start with a smaller version of the dataset we simulated before to save some computational time.

Code
# Set seed for reproducibility
set.seed(123)

# Number of points per subcluster
n <- 5

# Manually define main cluster centers (global structure)
main_cluster_centers <- data.frame(
  x = c(0, 5, 10, 15),
  y = c(0, 5, 10, 15),
  z = c(0, 5, 10, 15),
  cluster = factor(1:4)
)

# Manually define subcluster offsets relative to each main cluster
# These small offsets will determine subcluster locations within each main cluster
subcluster_offsets <- data.frame(
  x_offset = c(-0.25, 0.25, -0.25, 0.25),
  y_offset = c(-0.25, -0.25, 0.25, 0.25),
  z_offset = c(0.25, -0.25, -0.25, 0.25),
  subcluster = factor(1:4)
)

# Initialize an empty data frame to hold all data
data <- data.frame()

# Generate data for each main cluster with subclusters
for (i in 1:nrow(main_cluster_centers)) {
  for (j in 1:nrow(subcluster_offsets)) {
    # Calculate subcluster center by adding the offset to the main cluster center
    subcluster_center <- main_cluster_centers[i, 1:3] + subcluster_offsets[j, 1:3]
    
    # Generate points for each subcluster with a small spread (to form local clusters)
    subcluster_data <- data.frame(
      gene1 = rnorm(n, mean = subcluster_center$x, sd = 0.25),  # Small spread within subclusters
      gene2 = rnorm(n, mean = subcluster_center$y, sd = 0.25),
      gene3 = rnorm(n, mean = subcluster_center$z, sd = 0.25),
      cluster = main_cluster_centers$cluster[i],
      subcluster = subcluster_offsets$subcluster[j]
    )
    
    # Add generated subcluster data to the main data frame
    data <- rbind(data, subcluster_data)
  }
}

It looks like before but smaller.

Code
# Visualize the original data in 3D, distinguishing clusters and subclusters
library(dplyr)
library(plotly)  # For interactive 3D plots
plot_ly(
  data, x = ~gene1, y = ~gene2, z = ~gene3, color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter3d", mode = "markers",size = 5
)  %>%
  layout(
    title = "Original 3D Data with Clusters",
    scene = list(
      camera = list(
        eye = list(x = 0.3, y =2.5, z = 1.2)  # Change x, y, z to adjust the starting angle
      )
    )
  )

We can now simply calculate Euclidean distance between every pair of points. Let’s call this the distance matrix \(D\), where each element \(D_{ij}\) represents the distance between two points \(i\) and \(j\) in the original high-dimensional space. These distances are typically Euclidean distances, calculated as:

\[ D_{ij} = \sqrt{\sum_{d=1}^{p} (x_i^{(d)} - x_j^{(d)})^2} \]

where \(x_i^{(d)}\) and \(x_j^{(d)}\) are the coordinates of points \(i\) and \(j\) in dimension \(d\), and \(p\) is the number of dimensions in the original space.

We can simply calculate this in R using

D <- as.matrix(dist(data[,1:3]))

We now need a starting point for the lower-dimensional \(Z\)-space. Let’s say we aim for just two dimensions and start with random points.

ndim <- 2
set.seed(123456)
Z <- matrix(rnorm(nrow(data[,1:3]) * ndim, sd = 1,mean =1),
             nrow = nrow(data[,1:3]), ncol = ndim)

colnames(Z)<-c("dim1","dim2")

These are our initial points:

Code
# Visualize the original data in 3D, distinguishing clusters and subclusters
plot_ly(
  as.data.frame(Z), x = ~dim1, y = ~dim2, color = data$cluster,
  colors = c("red", "green", "blue", "purple"),
  mode = "markers",size = 5
)  %>%
  layout(
    title = "Z 2D Data with Clusters"
  )

This obviously does not look anything like our original data. We need to update these random points so that the distances become more similar to those in the original data. So basically, what we want to do is minimize the difference between the distances in the original space \(D_{ij}\) and the distances in the projected space \(Dz_{ij}\). This difference is quantified by the cost function.

In this case, we can define the cost function as:

\[ \text{cost} = \sum_{i=1}^{n} \sum_{j=i+1}^{n} (D_{ij} - Dz_{ij})^2 \]

This is the sum of squared differences between the original distances \(D_{ij}\) and the projected distances \(Dz_{ij}\). We try to minimize this function by adjusting the positions of the points in \(Z\)-space.
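
Before minimizing anything, we can write this cost directly as a small helper function (a sketch; stress_cost is just a name introduced here, not from any package) and evaluate it for the random \(Z\) we just created:

Code
# Sum of squared differences between original and projected pairwise distances
# (each pair i < j is counted once, matching the formula above)
stress_cost <- function(D, Z) {
  Dz <- as.matrix(dist(Z))
  sum((D[upper.tri(D)] - Dz[upper.tri(Dz)])^2)
}

stress_cost(D, Z)  # cost of the random initial configuration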

One way to do that is to naively shuffle the points to get a new set of points and calculate the cost, but this is not optimal. We want a bit smarter approach. Instead of random shuffling, we can gradually improve the positions by making small adjustments in the right direction. This is called gradient descent.

You can think about the cost function as a hilly landscape, where the height represents the “cost” or error. Our goal is to find the lowest point, the place where the cost is minimized. Gradient descent works by taking small steps downhill.

Code
# Create a grid of points to represent the "hilly" cost surface
x <- seq(-2, 2, length.out = 50)
y <- seq(-2, 2, length.out = 50)
z <- outer(x, y, function(x, y) x^2 + y^2 + 0.5 * sin(5 * (x + y)))

# Create a meshgrid for the plot
surface_data <- list(x = x, y = y, z = z)

# Define a gradient descent path (manually for this demo)
gradient_path <- data.frame(
  x = c(1.8, 1.4, 1.0, 0.5, 0.1),
  y = c(1.8, 1.5, 1.2, 0.8, 0.3),
  z = c(1.8^2 + 1.8^2 + 0.5 * sin(5 * (1.8 + 1.8)),
        1.4^2 + 1.5^2 + 0.5 * sin(5 * (1.4 + 1.5)),
        1.0^2 + 1.2^2 + 0.5 * sin(5 * (1.0 + 1.2)),
        0.5^2 + 0.8^2 + 0.5 * sin(5 * (0.5 + 0.8)),
        0.1^2 + 0.3^2 + 0.5 * sin(5 * (0.1 + 0.3)))
)

# Plot the surface and gradient descent path using plotly
plot_ly() %>%
  # Add the surface plot
  add_surface(x = surface_data$x, y = surface_data$y, z = surface_data$z, 
              colorscale = list(c(0, 1), c('lightblue', 'blue')),
              showscale = FALSE) %>%
  
  # Add the gradient descent path
  add_markers(x = gradient_path$x, y = gradient_path$y, z = gradient_path$z,
              marker = list(size = 5, color = 'red')) %>%
  
  # Add arrows to show the direction of gradient descent
  add_trace(type = 'scatter3d', mode = 'lines+markers',
            x = gradient_path$x, y = gradient_path$y, z = gradient_path$z,
            line = list(width = 4, color = 'red')) %>%
  
  # Add annotations for start and end points
  add_annotations(
    x = gradient_path$x[1], y = gradient_path$y[1], z = gradient_path$z[1], 
    text = "Start", showarrow = TRUE, arrowcolor = "red", ax = 0, ay = -30
  ) %>%
 
  
  # Add title and labels
  layout(scene = list(
    xaxis = list(title = "x"),
    yaxis = list(title = "y"),
    zaxis = list(title = "Cost Function"),
    title = "Gradient Descent on Hilly Surface"
  ))

In this interactive 3D plot, we are visualizing a cost function (the “hilly” landscape) and the process of gradient descent, which aims to find the lowest point in this landscape. The surface shows the cost function: the height of the surface (the z-axis) represents the value of the cost function at different points on the x and y axes. Points higher up on the surface represent higher cost or error, while lower points represent smaller error. The red points on the plot show the path taken by the gradient descent algorithm. The algorithm starts at a high point and moves toward a lower point. Gradient descent works by iteratively moving from the current position in the direction of steepest descent, that is, the direction in which the cost function decreases the fastest. In mathematical terms, this direction is given by the negative gradient of the cost function at each point.

Mathematically, the gradient is essentially the derivative of the cost function with respect to the positions of the points. We define the cost function as the sum of squared differences between the original distances \(D_{ij}\) and the distances in the lower-dimensional space \(Dz_{ij}\):

\[ \text{Cost}(Z) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left( D_{ij} - Dz_{ij} \right)^2 \]

Where:

  • \(D_{ij}\) is the distance between points \(i\) and \(j\) in the original high-dimensional space.

  • \(Dz_{ij}\) is the distance between points \(i\) and \(j\) in the lower-dimensional space, defined as:

\[ Dz_{ij} = \sqrt{\sum_{d=1}^{k} (z_i^{(d)} - z_j^{(d)})^2} \]

Here, \(z_i^{(d)}\) is the coordinate of point \(i\) in dimension \(d\), and \(k\) is the number of dimensions in the low-dimensional space. We aim to compute the gradient of the cost function with respect to the positions of the points \(z_i\) in the low-dimensional space. As we said before the gradient tells us how the cost function changes when we move the points in the low-dimensional space, and it is used to update the positions in gradient descent.

To compute the gradient, we will use the chain rule. The chain rule helps us differentiate a composite function, which is exactly what we have here: the cost function involves \(Dz_{ij}\), which is itself a function of the positions \(z_i\) and \(z_j\).

Chain rule

If you have a composite function \(f(g(x))\), the chain rule states that the derivative of \(f(g(x))\) with respect to \(x\) is:

\[ \frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x) \]

This means we first differentiate the outer function \(f\) with respect to the inner function \(g\), and then multiply by the derivative of the inner function \(g\) with respect to \(x\).
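
As a quick concrete check of the rule, take \(f(u) = u^2\) and \(g(x) = x^3\):

\[ \frac{d}{dx} \left( x^3 \right)^2 = 2 \left( x^3 \right) \cdot 3x^2 = 6x^5 \]

which is exactly what you get by differentiating \(x^6\) directly.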

We are interested in finding the derivative of the cost function with respect to the position \(z_i\). To do this, we first need to focus on the contribution to the cost from a single pair of points \(i\) and \(j\):

\[ \text{Cost}_{ij} = \left( D_{ij} - Dz_{ij} \right)^2 \]

We need to differentiate this with respect to \(z_i\). Using the chain rule, we proceed in two steps:

  1. Differentiate the outer function: The square term \(\left( D_{ij} - Dz_{ij} \right)^2\) with respect to \(Dz_{ij}\).
  2. Differentiate the inner function: The distance \(Dz_{ij}\) with respect to \(z_i\).

We first differentiate the outer function \(\left( D_{ij} - Dz_{ij} \right)^2\) with respect to \(Dz_{ij}\). Applying the chain rule:

\[ \frac{\partial}{\partial z_i} \left( D_{ij} - Dz_{ij} \right)^2 = 2 \left( D_{ij} - Dz_{ij} \right) \cdot \frac{\partial}{\partial z_i} \left( D_{ij} - Dz_{ij} \right) \]

Since \(D_{ij}\) is a constant (it does not depend on \(z_i\)), the derivative of \(D_{ij}\) with respect to \(z_i\) is zero. Therefore, this reduces to:

\[ \frac{\partial}{\partial z_i} \left( D_{ij} - Dz_{ij} \right)^2 = -2 \left( D_{ij} - Dz_{ij} \right) \cdot \frac{\partial Dz_{ij}}{\partial z_i} \]

Now, we differentiate \(Dz_{ij}\), the Euclidean distance between points \(i\) and \(j\), with respect to \(z_i\).

The Euclidean distance between points \(i\) and \(j\) is:

\[ Dz_{ij} = \sqrt{\sum_{d=1}^{k} (z_i^{(d)} - z_j^{(d)})^2} \]

To differentiate this, we apply the chain rule again. The derivative of a square root is:

\[ \frac{\partial Dz_{ij}}{\partial z_i^{(d)}} = \frac{1}{2 Dz_{ij}} \cdot 2 \left( z_i^{(d)} - z_j^{(d)} \right) \]

This simplifies to:

\[ \frac{\partial Dz_{ij}}{\partial z_i^{(d)}} = \frac{z_i^{(d)} - z_j^{(d)}}{Dz_{ij}} \]

So, the gradient of \(Dz_{ij}\) with respect to the vector \(z_i\) (across all dimensions \(d\)) is:

\[ \frac{\partial Dz_{ij}}{\partial z_i} = \frac{z_i - z_j}{Dz_{ij}} \]

Finally, we substitute the derivative of \(Dz_{ij}\) back into the gradient of the cost function:

\[ \frac{\partial \text{Cost}_{ij}}{\partial z_i} = -2 \left( D_{ij} - Dz_{ij} \right) \cdot \frac{z_i - z_j}{Dz_{ij}} \]

This is the gradient contribution for the pair of points \(i\) and \(j\).

To compute the total gradient for point \(z_i\), we sum up the contributions from all other points \(j\) (since each pair of points \(i\) and \(j\) contributes to the overall cost function):

\[ \frac{\partial \text{Cost}}{\partial z_i} = 2 \sum_{j \neq i} \left( Dz_{ij} - D_{ij} \right) \frac{z_i - z_j}{Dz_{ij}} \]

Using this gradient, we can update the positions of the points in the low-dimensional space using gradient descent:

\[ z_i^{\text{new}} = z_i^{\text{old}} - \eta \cdot \frac{\partial \text{Cost}}{\partial z_i} \]

Where \(\eta\) is the learning rate, which controls the step size of the update. If it’s too large, we might overshoot the minimum; if it’s too small, the process might take a long time. The idea is to keep making these adjustments until the positions of the points stabilize and the cost (stress) no longer decreases significantly. Here is a small implementation of what we just derived.

  colnames(Z)<-c("dim1","dim2")
  Z_frames<-c()

  # Create V matrix (defined here but not used in the loop below)
  V <- nrow(data[,1:3]) * diag(nrow(data[,1:3])) - matrix(1,nrow(data[,1:3]), nrow(data[,1:3]))

  
  # set maximum iterations
  max_iter <- 1000
  
  # Initialize cost vector
  cost <- numeric(max_iter)
  
  # set learning rate
  lr <- 0.001
  
  # convergence tolerance
  tol <- 1e-06
  
  for (i in 1:max_iter) {
    
    # Compute pairwise distances in Z space
    Dz <- as.matrix(dist(Z))
    
    
    # Compute cost
    cost[i] <- sum((D - Dz)^2)
  if(i%%50==0 | i%in%1:50)
    Z_frames<-rbind(Z_frames,data.frame(scale(Z),iteration=i,cluster=data$cluster,subcluster=data$subcluster))
    # Check for convergence
    if (i > 1 && abs(cost[i] - cost[i - 1]) < tol) {
      cost <- cost[1:i]
      print("done")
      break
    }
    
    # Compute gradient
      grad <- matrix(0, nrow = nrow(data[,1:3]), ncol = ndim)  # Initialize gradient matrix
    
    for (p in 1:nrow(data[,1:3])) {
      for (q in 1:nrow(data[,1:3])) {
        if (p != q) {
          # Calculate the difference between the points
          diff_z <- Z[p, ] - Z[q, ]
          
          # Calculate the Euclidean distance in the low-dimensional space
          Dz_pq <- sqrt(sum(diff_z^2))
          
          # Avoid division by zero
          Dz_pq <- max(Dz_pq, 1e-8)
          
          # Calculate the gradient contribution from pair (p, q)
          grad[p, ] <- grad[p, ] + 2 * (Dz_pq - D[p, q]) * diff_z / Dz_pq
        }
      }
    }
    
    
    # Update positions in the lower-dimensional space
    Z <- Z - lr * grad
    
  } 
Code
plot_ly(
  as.data.frame(Z_frames), x = ~dim1, y = ~dim2,frame=~iteration,color=~cluster,
  colors = c("red", "green", "blue", "purple"),
  mode = "markers",size = 5
)  %>%
  layout(
    title = "Z 2D Data with Clusters (animation)"
  )

Click on Play to see how the algorithm finds the best lower-dimensional space. In fact, this is essentially a form of metric MDS (multidimensional scaling), here solved by gradient descent from a random initialization.
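
If you want to check that claim (a quick sketch, assuming D and data from above are still in your workspace), base R’s cmdscale() gives the closely related classical MDS solution on the same distance matrix; up to rotation and reflection, the layout should be broadly similar to what our gradient descent converged to.

Code
# Classical MDS on the same distance matrix, for comparison
mds_fit <- cmdscale(D, k = 2)

plot_ly(
  data.frame(mds_fit), x = ~X1, y = ~X2, color = data$cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
)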

As is clear from the animation above, what we are trying to capture is the overall trend of the data, that is, the differences between the larger clusters. The subclusters, however, are quite mixed. Have a look here:

Code
# Visualize the original data in 3D, distinguishing clusters and subclusters
plot_ly(
  as.data.frame(Z), x = ~dim1, y = ~dim2, color = data$subcluster,
  colors = c("red", "green", "blue", "purple"),
  mode = "markers",size = 5
)  %>%
  layout(
    title = "Z 2D Data with Clusters"
  )

5.1 Gaussian (or normal) distribution

One issue with the Euclidean distance is that it does not inherently prioritize nearby points over distant ones. Every pair of points, whether close or far apart, is treated equally when computing the overall data structure. While this is not necessarily a flaw, our goal here is to capture the local structure rather than the global one. To address this, we can explore modifications to the Euclidean distance to make it slightly more local, such as incorporating a scaling factor like the standard deviation, which emphasizes nearby points without completely discarding global relationships.

More specifically we could divide the Euclidean distance between two points by the standard deviation of the distances from one of the points to its neighbors. This adjustment would make distances in denser regions appear relatively larger, prioritizing local structures without completely ignoring the global one.

So basically we do something like

\[ \tilde{D}_{ij} = \frac{D_{ij}}{\sigma_i} \]

Where:

  • \(D_{ij}\) is the Euclidean distance between points \(i\) and \(j\),
  • \(\sigma_i\) is the standard deviation of the distances from point \(i\) to its neighbors.

Let’s see what it looks like.

Code
# Function to compute local standard deviation based on k-nearest neighbors
compute_sigma <- function(dist_matrix, k_neighbors) {
  n <- nrow(dist_matrix)
  sigma <- numeric(n)
  
  for (i in 1:n) {
    # Get distances from point i to all other points
    dists <- dist_matrix[i, ]
    
    # Sort the distances and select k nearest neighbors (excluding self)
    nearest_neighbors <- sort(dists, decreasing = FALSE)[2:(k_neighbors + 1)]
    
    # Compute standard deviation of distances to nearest neighbors
    sigma[i] <- sd(nearest_neighbors)
    
    # Avoid sigma being (close to) zero
    sigma[i] <- max(sigma[i], 1e-4)
  }
  
  return(sigma)
}

# Compute sigma for original distances D (high-dimensional space)
sigma_D <- compute_sigma(D, 10)

P <- matrix(0, nrow = nrow(D), ncol = nrow(D))
for (i in 1:nrow(D)) {
  for (j in 1:nrow(D)) {
    if (i != j) {
       P[i, j] <- (D[i, j] /( sigma_D[i]))
    }
  }
}
sorted_indices <- order(D[1, ])
sorted_D <- D[1, sorted_indices]
sorted_P <- P[1, sorted_indices]



# Plot  line with sorted x and y values
plot_ly(x = sorted_D[-1], y = sorted_P[-1], type = 'scatter', mode = 'lines') %>%
  layout(title = "Line plot of D vs (D/sigma) 10 NN (Smoothed)",
         xaxis = list(title = "Original distance"),
         yaxis = list(title = "D/sigma"))

We calculated the sigma based on the 10 nearest neighbors of each point and divided the original distances by each sigma. I’m showing the distances from point 1 to all other points here. Admittedly, it did not do much; we are just scaling the distances. Obviously it will have a small impact on the lower dimension, but it is still a global measure. What we are after is to decay the distances as they get larger, and we want to do it smoothly. One option is to apply the exponential function to the negative distances:

\[ K(x, x') = \exp\left( -\frac{\|x - x'\|}{\sigma} \right) \]

Code
# Compute sigma for original distances D (high-dimensional space)
sigma_D <- compute_sigma(D, 10)

P <- matrix(0, nrow = nrow(D), ncol = nrow(D))
for (i in 1:nrow(D)) {
  for (j in 1:nrow(D)) {
    if (i != j) {
       P[i, j] <- exp(-D[i, j] /( sigma_D[i]))
    }
  }
}
sorted_indices <- order(D[1, ])
sorted_D <- D[1, sorted_indices]
sorted_P <- P[1, sorted_indices]



# Plot  line with sorted x and y values
plot_ly(x = sorted_D[-1], y = sorted_P[-1], type = 'scatter', mode = 'lines') %>%
  layout(title = "Line plot of D vs exp(-D/sigma) 10 NN (Smoothed)",
         xaxis = list(title = "Original distance"),
         yaxis = list(title = "exp(-D/sigma)"))

So basically, \(\exp\left(-\frac{D[i,j]}{\sigma_D[i]}\right)\) is converting the Euclidean distance into a similarity measure. The matrix \(D[i,j]\) represents the Euclidean distance between points \(i\) and \(j\). A larger value of \(D[i,j]\) means the points are farther apart, while a smaller value indicates they are closer. The exponential function \(\exp(-x)\) decreases rapidly as \(x\) increases, meaning that larger distances result in smaller similarities. When the distance \(D[i,j]\) is small, \(\exp(-D[i,j])\) is close to 1, indicating high similarity. When the distance is large, the exponential value approaches 0, indicating low similarity. As said before dividing by \(\sigma_D[i]\) allows us to control the “spread” or sensitivity of the similarity function. If \(\sigma_D[i]\) is small, the function decays quickly, meaning only very close points will be considered similar. If \(\sigma_D[i]\) is large, the decay is slower, allowing points that are further away to still be considered somewhat similar. Let’s see the effect of \(\sigma_D[i]\) based on the number of nearest neighbors.

Code
# Number of neighbors to test
neighbor_values <- c(5, 10, 20, 30,50)

# Initialize an empty plotly object
p <- plot_ly()

# Loop over different neighbor values
for (neighbors in neighbor_values) {
  
  # Compute sigma for the current number of neighbors
  sigma_D <- compute_sigma(D, neighbors)
  
  # Compute P matrix based on sigma_D
  P <- matrix(0, nrow = nrow(D), ncol = nrow(D))
  for (i in 1:nrow(D)) {
    for (j in 1:nrow(D)) {
      if (i != j) {
        P[i, j] <- exp(-D[i, j] / sigma_D[i])
      }
    }
  }
  
  # Sort the first row of D and P for plotting
  sorted_indices <- order(D[1, ])
  sorted_D <- D[1, sorted_indices]
  sorted_P <- P[1, sorted_indices]
  
  # Add a trace to the plot for the current number of neighbors
  p <- p %>%
    add_trace(x = sorted_D[-1], y = sorted_P[-1], type = 'scatter', mode = 'lines',
              name = paste0(neighbors, " NN"))
}

# Customize the layout
p <- p %>%
  layout(title = "Line plot of D vs exp(-D/sigma) for different neighbors",
         xaxis = list(title = "Original distance"),
         yaxis = list(title = "exp(-D/sigma)"))

p

As we increase the number of neighbors, the standard deviation increases, which allows more distant points to be considered similar. The formula we just presented, \(\exp\left(-\frac{D[i,j]}{\sigma_D[i]}\right)\), looks very similar to the well-known Gaussian (or normal) distribution function. In statistics, the Gaussian distribution is used to describe data that cluster around a mean, with the probability of a value decreasing as it moves further from the center.

The Gaussian (or normal) distribution’s probability density function for a single-dimensional random variable \(x\) is given by:

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]

Notice the key similarities:

  • The negative exponential term,
  • The squared difference \((x - \mu)^2\),
  • The factor \(2\sigma^2\) in the denominator.

In the Gaussian distribution, the distance between a value and the mean is squared, meaning that both positive and negative deviations contribute equally to the distribution. Squaring ensures that large deviations from the mean (or between points, in our case) are penalized more heavily. This same principle is applied when transforming distances into similarities: squaring the distance emphasizes larger distances, making the similarity decrease more sharply as the distance gets bigger.

We also have the factor \(2\sigma^2\) in the Gaussian function. This normalizes the spread of the data based on the variance. In our case, this factor ensures that the similarity measure accounts for the relative spread of the distances. So, \(\sigma^2\) controls how quickly the similarity decays as the distance increases. A small \(\sigma^2\) causes the similarity to drop off quickly, meaning only very close points are considered similar. Conversely, a larger \(\sigma^2\) results in a slower decay, making points further apart still somewhat similar. The factor of 2 in \(2\sigma^2\) ensures that this behavior is aligned with the properties of the Gaussian function, which decays at a rate proportional to the variance.

We can now transform our similarity function into the more commonly recognized Gaussian kernel by incorporating the squared Euclidean distance and the factor \(2\sigma^2\): \[ K(i,j) = \exp\left( -\frac{D[i,j]^2}{2\sigma_D[i]^2} \right) \]

This will look like:

Code
# Initialize an empty plotly object


sigma_D <- compute_sigma(D, 10)

   # Compute P matrix based on sigma_D
  P <- matrix(0, nrow = nrow(D), ncol = nrow(D))
  for (i in 1:nrow(D)) {
    for (j in 1:nrow(D)) {
      if (i != j) {
        P[i, j] <- exp(-(D[i, j]^2) / (2*sigma_D[i]^2))
      }
    }
  }
  
  # Sort the first row of D and P for plotting
  sorted_indices <- order(D[1, ])
  sorted_D <- D[1, sorted_indices]
  sorted_P <- P[1, sorted_indices]
  

  
# Customize the layout
plot_ly(x = sorted_D[-1], y = sorted_P[-1], type = 'scatter', mode = 'lines')%>%
  layout(title = "Line plot of D vs exp(-D^2/(2*sigma^2)) for different neighbors",
         xaxis = list(title = "Original distance"),
         yaxis = list(title = "exp(-D^2/(2*sigma^2))"))

So in summary, we calculate the standard deviation of distances for each individual sample based on its neighborhood. This standard deviation, \(\sigma_D[i]\), reflects how spread out the distances are around that specific point, giving us a local measure of distance variability for each sample.

Then, we use the Gaussian kernel to measure the similarity between points. The Gaussian kernel transforms the Euclidean distance between two points into a similarity score that decays exponentially with increasing distance. The similarity between two points is determined based on how close they are relative to the local standard deviation of distances.

There is one more step to do here. The distances for each point are now based on its own standard deviation (SD), which means they are likely on different scales, because each point has a different SD reflecting its local neighborhood density. This variability makes it challenging to compare distances directly across different points, since each distance is relative to its local scale. It would be nice to have distances that have the same meaning and statistical properties throughout the dataset. To address this, we convert the distances into probabilities, which normalizes the values to a common scale and ensures that they are comparable. The conversion to probabilities for each point \(i\) is done as follows:

\[ p_{j|i} = \frac{\exp\left(-\frac{D_{ij}^2}{2\sigma_i^2}\right)}{\sum_{k \neq i} \exp\left(-\frac{D_{ik}^2}{2\sigma_i^2}\right)} \]

Where:

  • \(D_{ij}\) is the distance between points \(i\) and \(j\),
  • \(\sigma_i\) is the standard deviation of distances for point \(i\),
  • \(p_{j|i}\) is the probability of selecting point \(j\) given point \(i\).

The numerator \(\exp\left(-\frac{D_{ij}^2}{2\sigma_i^2}\right)\) ensures that smaller distances (closer neighbors) are given higher probabilities, while larger distances contribute less. The denominator \(\sum_{k \neq i} \exp\left(-\frac{D_{ik}^2}{2\sigma_i^2}\right)\) ensures that all probabilities for a given point (i) sum to 1, bringing all points onto a comparable scale.

The entire equation computes the probability \(p_{j|i}\), which represents the likelihood of selecting point \(j\), given point \(i\), based on how close point \(j\) is to point \(i\). This probability is higher for points that are close to \(i\) and lower for points that are far from \(i\).

Code
# Initialize an empty plotly object


sigma_D <- compute_sigma(D, 10)

   # Compute P matrix based on sigma_D
  P <- matrix(0, nrow = nrow(D), ncol = nrow(D))
  for (i in 1:nrow(D)) {
    for (j in 1:nrow(D)) {
      if (i != j) {
        P[i, j] <- exp(-(D[i, j]^2) / (2*sigma_D[i]^2))
        
      }
    }
    P[i, ] <- P[i, ] / sum(P[i, ])
  }
  
  # Sort the first row of D and P for plotting
  sorted_indices <- order(D[1, ])
  sorted_D <- D[1, sorted_indices]
  sorted_P <- P[1, sorted_indices]
  

  
# Customize the layout
plot_ly(x = sorted_D[-1], y = sorted_P[-1], type = 'scatter', mode = 'lines')%>%
  layout(title = "Line plot of D vs exp(-D^2/(2*sigma^2)) for different neighbors",
         xaxis = list(title = "Original distance"),
         yaxis = list(title = "exp(-D^2/(2*sigma^2))"))

You can play around with the number of nearest neighbors to get a feeling for how the neighborhood size affects the resulting probabilities.

Code
data_2d<-data[,1:2]



# Select point 1
point_1 <- data_2d[1, ]

# Create a grid of points around point 1 for the Gaussian kernel
x_seq <- seq(min(data_2d$gene1) - 1, max(data_2d$gene1) + 1, length.out = 100)
y_seq <- seq(min(data_2d$gene2) - 1, max(data_2d$gene2) + 1, length.out = 100)

# Function to compute sigma (standard deviation of distances for the first point)
compute_sigma <- function(D, nn) {
  sigma <- numeric(1)
  dist_to_point1 <- D[1, ]
  nearest_neighbors <- sort(dist_to_point1, decreasing = FALSE)[2:(nn + 1)]
  sigma[1] <- sd(nearest_neighbors)
  return(sigma)
}

# Function to create Gaussian kernel based on nn
create_gaussian_kernel <- function(nn) {

  
  # Get sigma for point 1
  sigma <- compute_sigma(D, nn)
  
  # Compute probabilities for all data points (Gaussian kernel)
  gaussian_probabilities <- numeric(nrow(data_2d))
  for (i in 1:nrow(data_2d)) {
    dist_sq <- (data_2d$gene1[i] - point_1$gene1)^2 + (data_2d$gene2[i] - point_1$gene2)^2
    gaussian_probabilities[i] <- exp(-dist_sq / (2 * sigma^2))
  }
  
  # Normalize the probabilities (sum to 1)
  gaussian_probabilities <- gaussian_probabilities / sum(gaussian_probabilities)
  
  # Create kernel values for contour plot
  gaussian_kernel <- matrix(0, nrow = length(x_seq), ncol = length(y_seq))
  for (i in 1:length(x_seq)) {
    for (j in 1:length(y_seq)) {
      dist_sq <- (x_seq[i] - point_1$gene1)^2 + (y_seq[j] - point_1$gene2)^2
      gaussian_kernel[i, j] <- exp(-dist_sq / (2 * sigma^2))
    }
  }
  
  return(list(kernel = as.vector(gaussian_kernel), probabilities = gaussian_probabilities))
}

# Create steps for slider (each step corresponds to a nn value)
nn_values <- seq(2, 50, by = 1)
steps <- list()
for (i in 1:length(nn_values)) {
  nn <- nn_values[i]
  result <- create_gaussian_kernel(nn)
  
  steps[[i]] <- list(
    args = list(
      list(z = list(result$kernel), "marker.color" = list(result$probabilities))
    ),
    label = paste0("nn: ", nn),
    method = "restyle"
  )
}

# Initial nn and kernel
initial_nn <- 10
initial_result <- create_gaussian_kernel(initial_nn)

# Convert grid and kernel matrix into a data frame for plotting
kernel_df <- expand.grid(x = x_seq, y = y_seq)
kernel_df$z <- initial_result$kernel

# Plot 2D data points and Gaussian kernel around point 1
fig <- plot_ly() %>%
  add_trace(data = data_2d, x = ~gene1, y = ~gene2, type = 'scatter', mode = 'markers',
            marker = list(size = 10, color = initial_result$probabilities, showscale = TRUE),
            name = 'Data Points') %>%
  add_trace(x = kernel_df$x, y = kernel_df$y, z = kernel_df$z, type = 'contour',
            contours = list(showlabels = TRUE), name = 'Gaussian Kernel', showscale = F) %>%
  layout(title = "Interactive Gaussian Kernel",
         sliders = list(
           list(
             active = 8,  # Set the active index (for nn = 10 in our steps)
             currentvalue = list(prefix = ""),
             pad = list(t = 60),
             steps = steps
           )
         ))

# Show the interactive plot with the slider
fig

As you adjust the number of nearest neighbors (nn) using the slider, you’ll notice a significant change in the probability assigned to the points, even though their positions remain fixed. This is because the Gaussian kernel adapts based on the local neighborhood of point_1, and its standard deviation (σ) increases with more neighbors included.

  1. Smaller Number of Neighbors (Low nn):
    • When the number of nearest neighbors is small, the Gaussian kernel is tighter around point_1. This means the local neighborhood is defined narrowly, and only points that are very close to point_1 have high probabilities.
    • As a result, the probability for points near point_1 is high because these points are strongly influenced by the local structure around point_1. In the contour plot, this is reflected by the smaller, more concentrated yellow region, indicating that fewer points have high probabilities.
  2. Larger Number of Neighbors (High nn):
    • As you increase the number of nearest neighbors, the Gaussian kernel widens, taking more distant points into account. This effectively means the local neighborhood grows, and the probability is distributed across a larger area.
    • For the same point that previously had a high probability (when the neighborhood was small), its probability will now decrease because the influence of point_1 is spread out over a larger region. More points are included in the neighborhood, but the individual contribution from each point is reduced.
    • This is visually represented by the expansion of the yellow area in the contour plot, where more points now have non-negligible probabilities, but the magnitude of the probabilities for any specific point (including those close to point_1) is lower.

The probability assigned to an identical point is different under different numbers of nearest neighbors because the Gaussian kernel adapts based on the local neighborhood size. With fewer neighbors, the kernel is more focused, leading to higher probabilities for nearby points. As the number of neighbors increases, the kernel spreads out, and the probability for each individual point decreases, even though the point’s location remains unchanged.
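To see this numerically, we can print the \(\sigma\) used for point_1 at a few neighborhood sizes. This is just a quick check and assumes that D and the compute_sigma() helper defined in the code block above are still in the workspace:

Code
# Sigma for point 1 typically increases as more neighbors are included,
# which is what widens the Gaussian kernel in the interactive plot above
sigma_by_nn <- sapply(c(5, 10, 25, 50), function(nn) compute_sigma(D, nn))
names(sigma_by_nn) <- paste0("nn=", c(5, 10, 25, 50))
sigma_by_nn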

5.2 Gaussian kernel for low dimensions

Now we need to think about the low-dimensional space. We can use the same formula:

\[ q_{j|i} = \frac{\exp\left(-Dz_{ij}^2\right)}{\sum_{k \neq i} \exp\left(-Dz_{ik}^2\right)} \] Here \(Dz_{ij}\) is the distance between points \(i\) and \(j\) in the lower-dimensional space (I will come to the point where we define this lower-dimensional space). The normalization again runs over all the other data points.

It is important to note that once the data is projected into the low-dimensional space, the idea is to preserve local neighborhoods as faithfully as possible. In the low-dimensional space we are no longer working with the highly varying densities of the high-dimensional space; points that are similar should already be close together there. Therefore, we don’t need a dynamic standard deviation (SD) estimation. The aim here is to preserve the distances under this specific, fixed-width distribution.
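As a small illustration, here is a minimal sketch of this computation on a toy low-dimensional layout. The matrix Yl and the other object names are hypothetical; the actual embedding is defined later:

Code
# Hypothetical 2D layout of 10 points, just to illustrate the formula
set.seed(1)
Yl <- matrix(rnorm(10 * 2), ncol = 2)

# Pairwise distances in the low-dimensional space (Dz in the formula)
Dz <- as.matrix(dist(Yl))

# Fixed-width Gaussian kernel: no per-point sigma is needed here
Ql <- exp(-Dz^2)
diag(Ql) <- 0             # exclude j = i
Ql <- Ql / rowSums(Ql)    # each row of q_{j|i} now sums to 1

round(Ql[1, ], 3)         # low-dimensional similarities of every point to point 1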

5.2.1 Making things symmetric

So far, we have measured how similar data points are in both high-dimensional and low-dimensional spaces. To do this, we computed conditional probabilities for each point, but these are inherently asymmetric. Basically, the conditional probability \(p_{j|i}\) is calculated based on the similarity between point \(x_i\) and point \(x_j\) in the high-dimensional space. This is done using a Gaussian kernel:

\[ p_{j|i} = \frac{\exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\right)}{\sum_{k \neq i} \exp\left(-\frac{\|x_i - x_k\|^2}{2\sigma_i^2}\right)} \]

However, these conditional probabilities are not symmetric: \(p_{j|i}\) is not equal to \(p_{i|j}\) because they are computed separately for \(x_i\) and \(x_j\), with different reference points and sigma.

For example, if \(x_i\) is an outlier far from the rest of the data, \(p_{j|i}\) will be very small for all \(j\), but \(p_{i|j}\) might still be large if \(x_j\) is close to other points.

To deal with this asymmetry, we compute the joint probability \(p_{ij}\), which is symmetric:

\[ p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n} \]

This ensures that \(p_{ij} = p_{ji}\), meaning the similarity between \(x_i\) and \(x_j\) is the same regardless of the direction of comparison. Without symmetrization, outliers or points with unusual distances could have a disproportionately small or large influence on our lower-dimensional embedding. It also ensures that the sum of all joint probabilities is 1. We again need to calculate the gradient to update the lower dimension.
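In code this symmetrization is a one-liner. A minimal sketch, assuming P holds the row-normalized conditional probabilities \(p_{j|i}\) computed earlier:

Code
# Joint probabilities: p_ij = (p_{i|j} + p_{j|i}) / (2n)
# After this, P is symmetric and all of its entries sum to 1
n <- nrow(P)
P <- (P + t(P)) / (2 * n)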

5.3 Similarity between higher and lower dimensions

In order to quantify how different our higher- and lower-dimensional representations are, we are going to use the KL (Kullback-Leibler) divergence. This is a measure of how one probability distribution differs from another. We have from before:

  • \(P\) represents similarities (or probabilities) in high-dimensional space.
  • \(Q\) represents similarities in the lower-dimensional space (our reduced embedding).

The goal is to minimize the difference between these two distributions so that the lower-dimensional embedding represents the data as faithfully as possible.

Mathematically, the KL divergence between the high-dimensional and low-dimensional distributions is expressed as:

\[ C = \sum_i KL(P_i || Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} \]

  • \(p_{j|i}\): Probability that point \(j\) is similar to point \(i\) in high-dimensional space.
  • \(q_{j|i}\): Probability that point \(j\) is similar to point \(i\) in low-dimensional space.

This basically tells us how surprised we are by the difference between the two distributions.

Similar to before, we need to calculate the gradient to update the lower dimension. The gradient of the KL divergence with respect to \(y_i\) is:

\[ \frac{\delta C}{\delta y_i} = 4 \sum_j \left( p_{j|i} - q_{j|i} \right) (y_i - y_j) \] Here \(p_{j|i} - q_{j|i}\) shows how much the high-dimensional similarity \(p_{j|i}\) differs from the low-dimensional similarity \(q_{j|i}\); the larger the mismatch, the larger the gradient. \(y_i - y_j\) is the difference between the points \(y_i\) and \(y_j\) in the low-dimensional space. It tells us how far apart the two points are in the lower dimensions, and moving \(y_i\) toward or away from \(y_j\), depending on the sign of the mismatch, reduces the divergence.

Cost function

I’m not going to include the cost function (and a tolerance check) in the code, to speed things up. Maybe you want to add that yourself and stop once the change in cost falls below a tolerance?
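If you do want to track it, a minimal sketch of such a cost function could look like this. P and Q are the probability matrices built in the implementation below; the name kl_cost is just illustrative:

Code
# KL divergence between the high- and low-dimensional distributions;
# a small epsilon avoids log(0) for zero entries (e.g. the diagonal)
kl_cost <- function(P, Q, eps = 1e-12) {
  sum(P * log((P + eps) / (Q + eps)))
}

# e.g. record kl_cost(P, Q) every few iterations and stop once the
# change between iterations drops below a chosen tolerance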

After calculating the gradient, we can use it to update the position of \(y_i\) in the lower-dimensional space. The update rule looks like this:

\[ y_i \leftarrow y_i - \eta \frac{\delta C}{\delta y_i} \]

Where \(\eta\) is the learning rate controlling the step size.

Let’s try to implement this

Code
# calculate orignal distances
D<-as.matrix(dist(data[,1:3]))
# Function to compute local standard deviation based on k-nearest neighbors
compute_sigma <- function(dist_matrix, k_neighbors) {
  n <- nrow(dist_matrix)
  sigma <- numeric(n)
  
  for (i in 1:n) {
    # Get distances from point i to all other points
    dists <- dist_matrix[i, ]
    
    # Sort the distances and select k nearest neighbors (excluding self)
    nearest_neighbors <- sort(dists, decreasing = FALSE)[2:(k_neighbors + 1)]
    
    # Compute standard deviation of distances to nearest neighbors
    sigma[i] <- sd(nearest_neighbors)
    # Avoid sigma being (numerically) zero
    sigma[i] <- max(sigma[i], 1e-4)
  }
  
  return(sigma)
}

# Compute sigma for original distances D (high-dimensional space)
sigma_D <- compute_sigma(D, 10)

# Compute P matrix based on sigma_D
P <- matrix(0, nrow = nrow(D), ncol = nrow(D))
for (i in 1:nrow(D)) {
  for (j in 1:nrow(D)) {
    if (i != j) {
      P[i, j] <- exp(-(D[i, j]^2) / (2*sigma_D[i]^2))
      
    }
    
  }
   P[i, ] <- P[i, ] / sum(P[i,])
}
P = (P + t(P))/(2*nrow(D))

# Create random low dimention
ndim <- 2
set.seed(12345)
Y <- matrix(rnorm(nrow(D) * ndim, sd = 1, mean = 0), nrow = nrow(D), ncol = ndim)
colnames(Y) <- c("dim1", "dim2")

# define data for plot

plot_frames_sne<-c()
for (iter in 1:1000) {
  Dy <- as.matrix(dist(Y))
  
  Q <- matrix(0, nrow = nrow(D), ncol = nrow(D))
  for (i in 1:nrow(D)) {
    for (j in 1:nrow(D)) {
      if (i != j) {
        # Calculate conditional probability q_j|i using the same Gaussian kernel
        Q[i, j] <- exp(-(Dy[i, j])^2)
        
      }
    }
  }

Q <- Q / sum(Q)
  grad <- matrix(0, nrow = nrow(D), ncol = ndim)
  # For each point i and j, compute the gradient of KL divergence
  for (i in 1:nrow(D)) {
    for (j in 1:nrow(D)) {
      if (i != j) {
         diff_y <- Y[i, ] - Y[j, ]
        scaling_factor <- (P[i, j] - Q[i, j])
        grad[i, ] <- grad[i, ] + scaling_factor * diff_y
        
        
      }
    }
     grad[i, ]<-4* grad[i, ]
  }
  Y <- Y - 5 * (grad)

  if (iter %% 50 == 0) {
    plot_frames_sne<-rbind(plot_frames_sne,data.frame(Y, cluster = data$cluster, subcluster = data$subcluster,iteration=iter))

  }
  
}

# 
plot_ly(
  as.data.frame(plot_frames_sne), x = ~dim1, y = ~dim2,frame=~iteration,color=~cluster,symbol=~subcluster,
  colors = c("red", "green", "blue", "purple"),
  mode = "markers",size = 5
)  %>%
  layout(showlegend = FALSE,
    title = "SNE 2D Data with Clusters (animation)"
  )

Click on Play to see how the algorithm maps the higher dimension to the lower one: it starts from random points in the lower-dimensional space and progressively moves them so that their pairwise relationships become more and more similar to those in the higher dimension. The process we just watched is in fact the symmetric SNE algorithm. The results are promising, but we still have work to do to make it better!

5.4 Perplexity (Effective number of neighbors)

So far we have been using a standard deviation that is directly calculated from each data point’s nearest neighbors. This, however, has a potential problem. Using this kernel, we convert distances between points in high-dimensional space into probabilities that represent similarities between points. For two points \(i\) and \(j\), the probability that \(j\) is a neighbor of \(i\) is computed using a Gaussian distribution:

\[ p_{ij} = \frac{\exp \left( - \frac{d_{ij}^2}{2 \sigma_i^2} \right)}{\sum_{k \neq i} \exp \left( - \frac{d_{ik}^2}{2 \sigma_i^2} \right)} \]

Again here, \(\sigma_i\) is the standard deviation used for point \(i\), and \(d_{ij}\) is the distance between points \(i\) and \(j\). The choice of \(\sigma_i\) determines how much weight we assign to close vs. distant neighbors when computing these probabilities.

So far we have used the real standard deviation (based on the actual distances to neighbors in high-dimensional space). The problem lies in dense regions, where many points are packed closely together: there the real standard deviation \(\sigma_i\) will be small because the average distance to neighbors is small. A small \(\sigma_i\) makes the Gaussian distribution very narrow. This means that only the very closest neighbors will have significant probabilities, while slightly more distant neighbors (even if they aren’t that far away) will have almost zero probability. This causes the probabilities \(p_{ij}\) to focus too much on the nearest neighbors, potentially ignoring important local structure beyond the immediate vicinity. With a fixed number of neighbors selected, points in dense clusters might appear farther apart than they should in the lower dimensions, and local clusters that should be tight could become overly spread out.

In sparse regions, where points are more spread out, the real standard deviation \(\sigma_i\) will be larger because the average distance to neighbors is larger. A large \(\sigma_i\) makes the Gaussian distribution wider. This means that the probabilities will be more spread out across neighbors, and distant neighbors will get significant probabilities, even though they are far away. In this case, the algorithm might overestimate the importance of distant neighbors, leading to blurring of the local structure in these regions.

Let’s have a look at an example:

Code
# Set seed for reproducibility
set.seed(123)

# Number of points per subcluster
n <- c(8,6,4,2)

# Manually define main cluster centers (global structure)
main_cluster_centers <- data.frame(
  x = c(0, 5, 10, 15),
  y = c(0, 5, 10, 15),
  z = c(0, 5, 10, 15),
  cluster = factor(1:4)
)

# Manually define subcluster offsets relative to each main cluster
# These small offsets will determine subcluster locations within each main cluster
subcluster_offsets <- data.frame(
  x_offset = c(-0.25, 0.25, -0.25, 0.25),
  y_offset = c(-0.25, -0.25, 0.25, 0.25),
  z_offset = c(0.25, -0.25, -0.25, 0.25),
  subcluster = factor(1:4)
)

# Initialize an empty data frame to hold all data
data <- data.frame()

# Generate data for each main cluster with subclusters

for (i in 1:nrow(main_cluster_centers)) {
  for (j in 1:nrow(subcluster_offsets)) {
    # Calculate subcluster center by adding the offset to the main cluster center
    subcluster_center <- main_cluster_centers[i, 1:3] + subcluster_offsets[j, 1:3]
    
    # Generate points for each subcluster with a small spread (to form local clusters)
    subcluster_data <- data.frame(
      gene1 = rnorm(n[i], mean = subcluster_center$x, sd = 2),  # Same spread (sd = 2) for every cluster
      gene2 = rnorm(n[i], mean = subcluster_center$y, sd = 2),
      gene3 = rnorm(n[i], mean = subcluster_center$z, sd = 2),
      cluster = main_cluster_centers$cluster[i],
      subcluster = subcluster_offsets$subcluster[j]
    )
    
    # Add generated subcluster data to the main data frame
    data <- rbind(data, subcluster_data)
  }
}


plot_ly(
  data, x = ~gene1, y = ~gene2, z = ~gene3, color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter3d", mode = "markers",size = 5
)  %>%
  layout(
    title = "Original 3D Data with Clusters and Subclusters",
    scene = list(
      camera = list(
        eye = list(x = 0.3, y =2.5, z = 1.2)  # Change x, y, z to adjust the starting angle
      )
    )
  )

Different clusters have the same standard deviation, but I have sampled a different number of points from each: cluster one is the densest and cluster four is the sparsest.

Let’s see what our algorithm does on this:

Code
D<-as.matrix(dist(data[,1:3]))
# Function to compute local standard deviation based on k-nearest neighbors
compute_sigma <- function(dist_matrix, k_neighbors) {
  n <- nrow(dist_matrix)
  sigma <- numeric(n)
  
  for (i in 1:n) {
    # Get distances from point i to all other points
    dists <- dist_matrix[i, ]
    
    # Sort the distances and select k nearest neighbors (excluding self)
    nearest_neighbors <- sort(dists, decreasing = FALSE)[2:(k_neighbors + 1)]
    
    # Compute standard deviation of distances to nearest neighbors
    sigma[i] <- sd(nearest_neighbors)
    # Avoid sigma being (numerically) zero
    sigma[i] <- max(sigma[i], 1e-4)
  }
  
  return(sigma)
}

#compute
sigma_D  <-compute_sigma(D,10)

# Compute P matrix based on sigma_D
P <- matrix(0, nrow = nrow(D), ncol = nrow(D))
for (i in 1:nrow(D)) {
  for (j in 1:nrow(D)) {
    if (i != j) {
      P[i, j] <- exp(-(D[i, j]^2) / (2*sigma_D[i]^2))
      
    }
    
  }
  P[i, ] <- P[i, ] / sum(P[i,])
}
P = (P + t(P))/(2*nrow(D))

ndim <- 2
set.seed(12345)
Y <- matrix(rnorm(nrow(D) * ndim, sd = 1, mean = 0), nrow = nrow(D), ncol = ndim)
colnames(Y) <- c("dim1", "dim2")


for (iter in 1:2000) {
  Dy <- as.matrix(dist(Y))
  
  Q <- matrix(0, nrow = nrow(D), ncol = nrow(D))
  for (i in 1:nrow(D)) {
    for (j in 1:nrow(D)) {
      if (i != j) {
        # Calculate conditional probability q_j|i using the same Gaussian kernel
        Q[i, j] <- exp(-(Dy[i, j])^2)
        
      }
    }
  }
  
  Q <- Q / sum(Q)
  grad <- matrix(0, nrow = nrow(D), ncol = ndim)
  # For each point i and j, compute the gradient of KL divergence
  for (i in 1:nrow(D)) {
    for (j in 1:nrow(D)) {
      if (i != j) {
        diff_y <- Y[i, ] - Y[j, ]
        scaling_factor <- (P[i, j] - Q[i, j])
        grad[i, ] <- grad[i, ] + scaling_factor * diff_y
        
        
      }
    }
    grad[i, ]<-4* grad[i, ]
  }
  Y <- Y - 10 * (grad)

  
}
plot_ly(
  data.frame(Y), x = ~dim1, y = ~dim2,color = ~data$cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
) 

For the sparse cluster things look more or less OK, but in the denser regions we spread the points out unnecessarily; they should in fact form tight clusters. We need a better measure of the neighborhood, one that adapts to the density of each region of the map. In this specific case, we probably have to consider a few more neighbors in the denser parts of the data.

In an ideal situation, for a specific point we want its neighbors to have more or less the same influence on it. So basically, we need a method that ensures each point considers a consistent number of significantly influencing neighbors across the entire dataset, regardless of local density variations.

One intuitive way to do that is, rather than setting \(\sigma_i\) solely based on nearest-neighbor distances, to adjust \(\sigma_i\) so that each point considers the same effective number of neighbors.

Now consider this, if all neighbors contributed equally, the probabilities \(p_{ij}\) would be uniformly distributed. For instance, if there were \(k\) neighbors, and each contributed equally, then:

\[ p_{ij} = \frac{1}{k}, \quad \text{for each } j \]

However, in reality, the distances between the points vary, which means the probabilities \(p_{ij}\) are not uniform. Our task is to find a way to measure how far the actual distribution \(p_{ij}\) is from being uniform. To measure this, we can introduce a simple criterion: if we want to have \(k\) neighbors contributing equally, the sum of the probabilities for those neighbors should behave as if they are uniformly distributed across the \(k\) neighbors.

To express this mathematically, we can look at the average contribution of each neighbor and compare it with how far off each neighbor’s actual contribution is. So we want a measure with two properties: if all probabilities \(p_{ij}\) are equal, it should indicate that all neighbors are contributing equally; and if one or a few \(p_{ij}\)’s dominate, it should indicate that some neighbors are contributing more than others.

Let’s consider the concept of information content associated with each probability \(p_{ij}\). The smaller \(p_{ij}\), the more surprising (or informative) it is that neighbor \(j\) has a significant influence. The information content for each neighbor \(j\) can be defined as:

\[ \text{Information content of } p_{ij} = -\log(p_{ij}) \]

This makes sense because if \(p_{ij}\) is large (close neighbors), the information content is small, indicating that these neighbors are not surprising. But if \(p_{ij}\) is small (distant neighbors), the information content is large, indicating that these neighbors are more surprising.

Now, to get a measure of the average information content of all the neighbors, we can take the weighted sum of these individual information contents, weighted by the probabilities \(p_{ij}\) themselves:

\[ \text{Average information content} = - \sum_j p_{ij} \log_2(p_{ij}) \]

This expression tells us the average uncertainty or spread in the probability distribution \(P_i\), based on the Gaussian distances. This is exactly the formula for Shannon entropy (\(H\)).

Now let’s consider a very convenient scenario where all the neighbors contribute equally to the current point.

\[ H(P_i) = -\sum_{j=1}^{n} \frac{1}{n} \log_2 \frac{1}{n} = \log_2 n \]

This is straightforward: the entropy is just the logarithm of the number of outcomes, which makes sense because more outcomes increase uncertainty. How do we transform this \(\log_2 n\) back into a number of neighbors? We simply take \(2^{H(P_i)}\), which in this case gives us exactly \(n\). In cases where the distances are not equal, \(H(P_i)\) is no longer a simple logarithm of the number of outcomes; instead, its value reflects the uncertainty of the actual probability distribution. However, we can still ask how many equally contributing neighbors would generate the same amount of uncertainty, so \(2^{H(P_i)}\) still gives us the effective number of neighbors for a particular point. This quantity is called perplexity. The perplexity tells us the effective number of equally contributing neighbors, and we want to adjust \(\sigma_i\) so that the perplexity matches a desired number. That number is our number of neighbors (a smooth one!).
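A quick numeric check makes this concrete: a uniform distribution over \(k\) neighbors has perplexity exactly \(k\), while a skewed distribution over the same neighbors has a smaller effective number of neighbors. The little helper below is only for illustration and assumes no zero probabilities:

Code
# Perplexity = 2^H, where H is the Shannon entropy in bits
perplexity <- function(p) 2^(-sum(p * log2(p)))

perplexity(rep(1/5, 5))                    # uniform over 5 neighbors -> 5
perplexity(c(0.7, 0.1, 0.1, 0.05, 0.05))   # skewed -> about 2.7 "effective" neighbors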

How do we find all the \(\sigma_i\)? Well, we have to search, really. A grid search or a binary search can figure out the standard deviation for every single data point such that its perplexity matches the desired value.

Here I have changed our function to do that.

Code
# Function to compute perplexity given distances D and sigma
compute_perplexity <- function(D, sigma) {
  # Compute the conditional probabilities P_{j|i}
  P <- exp(-D^2 / (2 * sigma^2))
  # Avoid division by zero
  sumP <- sum(P) + 1e-10
  P <- P / sumP
  # Compute Shannon entropy H(P_i)
  H <- -sum(P * log2(P + 1e-10))
  # Compute perplexity
  perplexity <- 2^H
  return(perplexity)
}

# Function to find sigma given distances D and target perplexity
find_sigma <- function(D, target_perplexity, tol = 1e-5, max_iter = 100, sigma_min = 1e-20, sigma_max = 1000) {
  # Initialize sigma bounds
  sigma_lower <- sigma_min
  sigma_upper <- sigma_max
  # Initial sigma guess
  sigma <- (sigma_lower + sigma_upper) / 2
  for (i in 1:max_iter) {
    # Compute perplexity with current sigma
    perp <- compute_perplexity(D, sigma)
    # Check if perplexity is within tolerance
    if (abs(perp - target_perplexity) < tol) {
      break
    }
    # Adjust sigma bounds based on perplexity
    if (perp > target_perplexity) {
      # Perplexity too big, decrease sigma
      sigma_upper <- sigma
    } else {
      # Perplexity too small, increase sigma
      sigma_lower <- sigma
    }
    # Update sigma
    sigma <- (sigma_lower + sigma_upper) / 2
  }
  return(sigma)
}
# Compute sigma for every point so that its perplexity matches the target
compute_sigma <- function(D, perplexity) {
  sigmas <- numeric(nrow(D))
  for (i in 1:nrow(D)) {
    sigmas[i] <- find_sigma(D[i, -i], perplexity, max_iter = 1000)
  }
  sigmas
}

#compute
D<-as.matrix(dist(data[,1:3]))
sigma_D  <-compute_sigma(D,10)

# Compute P matrix based on sigma_D
P <- matrix(0, nrow = nrow(D), ncol = nrow(D))
for (i in 1:nrow(D)) {
  for (j in 1:nrow(D)) {
    if (i != j) {
      P[i, j] <- exp(-(D[i, j]^2) / (2*sigma_D[i]^2))
      
    }
    
  }
  P[i, ] <- P[i, ] / sum(P[i,])
}
P = (P + t(P))/(2*nrow(D))

ndim <- 2
set.seed(12345)
Y <- matrix(rnorm(nrow(D) * ndim, sd = 1, mean = 0), nrow = nrow(D), ncol = ndim)
colnames(Y) <- c("dim1", "dim2")

# define data for plot

plot_frames_sne<-c()
for (iter in 1:2000) {
  Dy <- as.matrix(dist(Y))
  
  Q <- matrix(0, nrow = nrow(D), ncol = nrow(D))
  for (i in 1:nrow(D)) {
    for (j in 1:nrow(D)) {
      if (i != j) {
        # Calculate conditional probability q_j|i using the same Gaussian kernel
        Q[i, j] <- exp(-(Dy[i, j])^2)
        
      }
    }
  }
  
  Q <- Q / sum(Q)
  grad <- matrix(0, nrow = nrow(D), ncol = ndim)
  # For each point i and j, compute the gradient of KL divergence
  for (i in 1:nrow(D)) {
    for (j in 1:nrow(D)) {
      if (i != j) {
        diff_y <- Y[i, ] - Y[j, ]
        scaling_factor <- (P[i, j] - Q[i, j])
        grad[i, ] <- grad[i, ] + scaling_factor * diff_y
        
        
      }
    }
    grad[i, ]<-4* grad[i, ]
  }
  Y <- Y - 5 * (grad)
}
plot_ly(
  data.frame(Y), x = ~dim1, y = ~dim2,color = ~data$cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
) 

We now get a much nicer clustering of the data. The gradient machinery is kept exactly as before; only the way we choose \(\sigma_i\) has changed. We are very close to wrapping things up, but there is still one thing left to do.

5.5 Crowding problem

The crowding problem arises because high-dimensional data points, when reduced to a lower-dimensional space, often become “crowded” in a way that fails to preserve the relative distances properly. This happens because it is difficult to accurately represent the larger pairwise distances in the lower-dimensional map, causing too many points to cluster closely together. So far our idea of dimensionality reduction has been to preserve the relative distances between points by converting distances into probabilities in both the high-dimensional and low-dimensional spaces. The relationship between probability and distance is important: points that are close in the high-dimensional space should have higher probabilities of being close in the lower-dimensional map, while distant points should have lower probabilities of being near each other in the lower-dimensional map.

We used a Gaussian kernel in the lower-dimensional space. In this kernel the probabilities for distant points decrease rapidly, because the Gaussian distribution assigns very small probabilities to pairs of points that are even moderately far apart. This means that points that are only moderately distant in the high-dimensional space get almost no probability mass in the lower-dimensional map unless they are placed very close together.

Now, since our algorithm tries to match the probabilities (reduce the pairwise differences) between the high-dimensional and low-dimensional spaces, if the Gaussian assigns very low probabilities to moderately distant points in the low-dimensional map, the gradient will try to pull those points closer together to compensate. The problem is that these points are not meant to be so close (they are distant in the high-dimensional space), but the fast decay of the Gaussian distribution forces them closer than they should be in order to match the low probability.

We basically need another kernel that decays more slowly in the lower dimension. The t-distribution is the one! It assigns somewhat larger probabilities to distant points than a Gaussian does, which allows moderately distant points to remain appropriately separated in the low-dimensional map. The slower decay helps prevent crowding because it doesn’t force moderately distant points together as much as the Gaussian does.
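The difference is easy to see numerically. The sketch below evaluates both (unnormalized) kernels at a few low-dimensional distances; the Gaussian has essentially vanished by a distance of 3, while the t-kernel still assigns a noticeable similarity:

Code
d <- c(0.5, 1, 2, 3, 5)

gaussian_kernel <- exp(-d^2)     # kernel we have used so far in the low dimension
t_kernel <- 1 / (1 + d^2)        # Student t with one degree of freedom

round(rbind(distance = d, gaussian = gaussian_kernel, t_dist = t_kernel), 4)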

So we are going to use the t-distribution with one degree of freedom (Student’s t-distribution) in the low-dimensional space instead of a Gaussian. The t-distribution has heavier tails, which allow for larger distances between points. Mathematically, the probability of a pair of points \(i\) and \(j\) in the low-dimensional map is computed as:

\[ q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \neq l}(1 + ||y_k - y_l||^2)^{-1}} \] Our gradient is also going to change to

\[ \frac{\delta C}{\delta y_i} = 4 \sum_j \left( p_{j|i} - q_{j|i} \right) (y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1} \] All the notation is the same as presented before. We can go ahead and implement this in our algorithm:

Code
#compute
D<-as.matrix(dist(data[,1:3]))
sigma_D  <-compute_sigma(D,10)

# Compute P matrix based on sigma_D
P <- matrix(0, nrow = nrow(D), ncol = nrow(D))
for (i in 1:nrow(D)) {
  for (j in 1:nrow(D)) {
    if (i != j) {
      P[i, j] <- exp(-(D[i, j]^2) / (2*sigma_D[i]^2))
      
    }
    
  }
  P[i, ] <- P[i, ] / sum(P[i,])
}
P = (P + t(P))/(2*nrow(D))

ndim <- 2
set.seed(12345)
Y <- matrix(rnorm(nrow(D) * ndim, sd = 1, mean = 0), nrow = nrow(D), ncol = ndim)
colnames(Y) <- c("dim1", "dim2")

for (iter in 1:2000) {
  Dy <- as.matrix(dist(Y))
  
  Q <- matrix(0, nrow = nrow(D), ncol = nrow(D))
  for (i in 1:nrow(D)) {
    for (j in 1:nrow(D)) {
      if (i != j) {
        # Calculate conditional probability q_j|i using the same t-dist kernel
        Q[i, j] <- (1+(Dy[i, j])^2)^-1
        
      }
    }
  }
  
  Q <- Q / sum(Q)
  grad <- matrix(0, nrow = nrow(D), ncol = ndim)
  # For each point i and j, compute the gradient of KL divergence
  for (i in 1:nrow(D)) {
    for (j in 1:nrow(D)) {
      if (i != j) {
        diff_y <- Y[i, ] - Y[j, ]
        scaling_factor <- (P[i, j] - Q[i, j])
        grad[i, ] <- grad[i, ] + scaling_factor * diff_y *((1+(Dy[i, j])^2)^-1)
        
        
      }
    }
    grad[i, ]<-4* grad[i, ]
  }
  Y <- Y - 5 * (grad)
  
}
plot_ly(
  data.frame(Y), x = ~dim1, y = ~dim2,color = ~data$cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",size = 5
) 

5.6 t-SNE (summary)

The algorithm that we derived together is t-SNE. Obviously we have missed or omitted a lot of implementation details, but the general way this algorithm works is more or less what we went through. t-SNE works by finding a standard deviation for each data point based on the perplexity parameter. It then uses the Gaussian kernel to convert the distances between the data points into probabilities. It does a similar thing for the lower dimension (using the t-distribution kernel) and tries to match the higher- and lower-dimensional probabilities. This is done by minimizing the Kullback-Leibler (KL) divergence between the two distributions. By iteratively adjusting the points in the lower-dimensional space, t-SNE captures both local and global structures in the data, with an emphasis on preserving local neighborhoods. The result is a visually intuitive low-dimensional embedding that reflects the high-dimensional relationships between points, allowing us to observe clusters and patterns that may not have been apparent in the original space.

In order to use t-SNE effectively, it is important to carefully choose the perplexity parameter, as it controls the balance between local and global structure preservation. A higher perplexity value emphasizes larger neighborhoods, capturing more global relationships, while a lower value focuses on smaller, local clusters. Additionally, t-SNE can be sensitive to initialization and learning rate, so experimenting with these parameters can help avoid poor embeddings. Preprocessing the data, such as normalizing or reducing dimensions beforehand (e.g., with PCA), can also improve performance and stability of the algorithm.
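Finally, in practice you would not run the hand-rolled loops above but the optimized Rtsne() implementation mentioned at the start. A minimal call on our simulated data could look like the sketch below; the parameter values are only illustrative, and the perplexity has to stay small enough for the number of samples:

Code
library(Rtsne)

set.seed(123)
tsne_out <- Rtsne(as.matrix(data[, 1:3]),
                  dims = 2,         # dimensionality of the embedding
                  perplexity = 10,  # effective number of neighbors
                  pca = TRUE)       # optional PCA preprocessing step

plot_ly(
  data.frame(tsne_out$Y), x = ~X1, y = ~X2, color = ~data$cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
)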