t-SNE is often used when the goal is to uncover patterns or clusters within complex datasets that aren’t easily seen in the original high-dimensional space. Unlike methods such as PCA that focus on capturing global patterns across all data points, t-SNE emphasizes local relationships. By local relationships, we mean that t-SNE primarily works to preserve the relative distances between points that are already close to each other in the original dataset.

It is important to mention up front that t-SNE does not focus as much on global structure, that is, how the distances between far-apart points across the entire dataset are preserved. In other words, we give priority to accurately reflecting the relationships of nearby points, and that usually comes at the expense of the larger global arrangement. We will later explain in detail why this happens, but for now let’s accept that we have a method called t-SNE that is capable of preserving local structure.

Basically, we want to use t-SNE because we believe there are tightly clustered groups in the data. When we suspect that the key information or insights lie in these tightly clustered regions (such as distinct subpopulations or small patterns), t-SNE is great for revealing these local structures. For example, in genomics, t-SNE is often used to identify subtypes of cells based on gene expression data. This is obviously based on the assumption that certain groups of cells will have similar expression profiles, forming tightly clustered groups. There are many other fields, such as image analysis, that can benefit from revealing these local structures.

2 t-SNE in R

Without going further into the details, we are going to start using t-SNE in R. The main function used for this is Rtsne() from the Rtsne package. Before that, I want to simulate some data so we can check later how t-SNE is doing.

Code

# Set seed for reproducibility
set.seed(123)

# Number of points per subcluster
n <- 25

# Manually define main cluster centers (global structure)
main_cluster_centers <- data.frame(
  x = c(0, 5, 10, 15),
  y = c(0, 5, 10, 15),
  z = c(0, 5, 10, 15),
  cluster = factor(1:4)
)

# Manually define subcluster offsets relative to each main cluster
# These small offsets will determine subcluster locations within each main cluster
subcluster_offsets <- data.frame(
  x_offset = c(-0.25, 0.25, -0.25, 0.25),
  y_offset = c(-0.25, -0.25, 0.25, 0.25),
  z_offset = c(0.25, -0.25, -0.25, 0.25),
  subcluster = factor(1:4)
)

# Initialize an empty data frame to hold all data
data <- data.frame()

# Generate data for each main cluster with subclusters
for (i in 1:nrow(main_cluster_centers)) {
  for (j in 1:nrow(subcluster_offsets)) {
    # Calculate subcluster center by adding the offset to the main cluster center
    subcluster_center <- main_cluster_centers[i, 1:3] + subcluster_offsets[j, 1:3]
    # Generate points for each subcluster with a small spread (to form local clusters)
    subcluster_data <- data.frame(
      gene1 = rnorm(n, mean = subcluster_center$x, sd = 0.25),
      gene2 = rnorm(n, mean = subcluster_center$y, sd = 0.25),
      gene3 = rnorm(n, mean = subcluster_center$z, sd = 0.25),
      cluster = main_cluster_centers$cluster[i],
      subcluster = subcluster_offsets$subcluster[j]
    )
    # Add generated subcluster data to the main data frame
    data <- rbind(data, subcluster_data)
  }
}

This data has just three dimensions for 400 samples (4 major clusters). We can plot the data here:

Code

# Visualize the original data in 3D, distinguishing clusters and subclusters
library(dplyr)
library(plotly) # For interactive 3D plots

plot_ly(
  data,
  x = ~gene1, y = ~gene2, z = ~gene3,
  color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter3d", mode = "markers",
  size = 5
) %>%
  layout(
    title = "Original 3D Data with Clusters and Subclusters",
    scene = list(
      camera = list(
        eye = list(x = 0.3, y = 2.5, z = 1.2) # Change x, y, z to adjust the starting angle
      )
    )
  )

For this data, the features are in the columns and the samples are in the rows.

We are now going to run t-SNE on this data and look at the results. There are some parameters to set, but we will just go with the defaults for now. We also run PCA and show the results side by side.

Code

library(Rtsne)
library(ggplot2)
library(cowplot)

set.seed(123)
tsne_results_30 <- Rtsne(as.matrix(data[, c("gene1", "gene2", "gene3")]))
data$tsne_x_30 <- tsne_results_30$Y[, 1]
data$tsne_y_30 <- tsne_results_30$Y[, 2]

tsne_plot <- plot_ly(
  data,
  x = ~tsne_x_30, y = ~tsne_y_30,
  color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",
  size = 5
) %>%
  layout(title = "")

pca_results <- prcomp(data[, c("gene1", "gene2", "gene3")], scale. = FALSE)
data$pca_x <- pca_results$x[, 1]
data$pca_y <- pca_results$x[, 2]

pca_plot <- plot_ly(
  data,
  x = ~pca_x, y = ~pca_y,
  color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",
  size = 5
) %>%
  layout(title = "")

subplot(
  pca_plot %>% layout(showlegend = FALSE),
  tsne_plot %>% layout(showlegend = FALSE),
  titleX = TRUE, titleY = TRUE, margin = 0.2
) %>%
  layout(annotations = list(
    list(x = 0.13, y = 1.035, text = "PCA", showarrow = FALSE, xref = "paper", yref = "paper"),
    list(x = 0.85, y = 1.035, text = "t-SNE", showarrow = FALSE, xref = "paper", yref = "paper")
  ))

The plot shows a side-by-side comparison of results obtained using PCA (left) and t-SNE (right) for dimensionality reduction. In the PCA plot, the data points are clearly separated by their clusters along the horizontal axis, but the clusters themselves are quite stretched and not very compact, indicating that PCA is capturing global variance sources but doesn’t reveal tight local groupings.

On the other hand, the t-SNE plot reveals well-separated, compact clusters, showing how good t-SNE is at preserving local structures within the data. The clusters are clearly distinct from one another, and their circular, tightly packed shapes indicate that t-SNE effectively keeps points that are similar (close in high-dimensional space) together in the low-dimensional projection. However, the distances between clusters may not reflect global relationships as well as PCA does (we will get back to this later).

2.1 Input parameters

Like any other method, t-SNE also requires some input parameters. Let’s start with the most essential ones and later go through the rest.

The data matrix (X) is the core input for t-SNE, where each row represents an observation, and each column is a variable or feature. This matrix is typically high-dimensional, and t-SNE’s job is to map it into a lower-dimensional space (usually 2D or 3D) to make patterns more interpretable.

Next, dims defines the dimensionality of the output. Typically, we choose 2D for visualization purposes, but 3D is also possible if more complexity is required.

The perplexity parameter is probably the most important one. It controls how t-SNE balances local and global structures of the data. You can think of it as determining how many “neighbors” each data point should consider when projecting into the low-dimensional space. Choosing the right perplexity is very important because it affects how t-SNE interprets the relationships between data points.

Another key parameter in this specific implementation is theta, which adjusts the balance between accuracy and computational speed. For large datasets, using a larger theta can make t-SNE run faster but at the cost of some accuracy. If you prioritize precision, especially for smaller datasets, you can set this parameter to 0 for exact t-SNE.

We also have the max_iter parameter, which controls the number of iterations the algorithm goes through during optimization. More iterations give t-SNE time to better fine-tune the output, though in practice, the default is often sufficient unless you notice the algorithm hasn’t converged.

After these essential parameters, additional settings like PCA preprocessing, momentum terms, and learning rate can further fine-tune the performance of t-SNE for different types of data. We will talk about these later.
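Putting the essential parameters together, a call that sets all of them explicitly might look like the following sketch. This assumes the Rtsne package is installed; the built-in iris data is used here purely as a stand-in for a small high-dimensional matrix.

```r
library(Rtsne)

set.seed(42)
X <- unique(as.matrix(iris[, 1:4]))  # Rtsne requires a numeric matrix with no duplicate rows

out <- Rtsne(
  X,                 # data matrix: rows = observations, columns = features
  dims       = 2,    # dimensionality of the output (2D for visualization)
  perplexity = 30,   # neighborhood size, balancing local vs. global structure
  theta      = 0.5,  # Barnes-Hut speed/accuracy trade-off; 0 gives exact t-SNE
  max_iter   = 1000  # number of optimization iterations
)

dim(out$Y)  # one row per observation, dims columns
```

Setting theta = 0 on this small example would give the exact (but slower) algorithm; for a few hundred points the difference in runtime is negligible.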

2.2 Perplexity

I had to devote a small section to this parameter because it is probably the most important one (if you already have your data 😉). Perplexity essentially defines how t-SNE balances attention between local and global structures in your data. You can think of it as controlling the size of the neighborhood each data point considers when positioning itself in the lower-dimensional space.

A small perplexity value (e.g., 5–30) means that t-SNE will focus more on preserving the local structure, i.e., ensuring that close neighbors in the high-dimensional space stay close in the low-dimensional projection. This is great for datasets with small clusters or when you’re particularly interested in capturing details.

A larger perplexity (e.g., 30–50 or even higher) encourages t-SNE to preserve global structure, considering more far away points as part of the neighborhood. This can be useful for larger datasets or when you want to capture broader relationships between clusters.

Finding the right perplexity often involves some experimentation. If it’s too small, t-SNE might overfit to the local structure and fail to reveal larger patterns in the data. If it’s too large, you might lose relationships between nearby data points. t-SNE is relatively robust to different perplexity values, so changing this parameter slightly usually won’t result in big changes, but it can make the difference between a good visualization and a great one.

A rule of thumb is to set the perplexity such that \(3 \times \text{perplexity} < n-1\), where \(n\) is the number of data points. Testing several values across this range will help you find the best fit for your data.
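As a quick sketch, this rule of thumb can be turned into a small helper (the function name is just for illustration):

```r
# Largest integer perplexity satisfying 3 * perplexity < n - 1
max_perplexity <- function(n) floor((n - 2) / 3)

max_perplexity(400)  # for our 400-sample simulated data: 132
```

So for our simulated data any perplexity up to 132 is admissible, and values between roughly 5 and 50 are the usual range to scan.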

2.2.1 See the impact of perplexity

I did not tell you before (you might have realized it from the code though) but there are substructures in the original data. We are now going to plot them again.

Code

library(dplyr)
library(plotly) # For interactive 3D plots

plot_ly(
  data,
  x = ~gene1, y = ~gene2, z = ~gene3,
  color = ~subcluster, # symbol = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter3d", mode = "markers",
  size = 5
) %>%
  layout(
    title = "Original 3D Data with Clusters and Subclusters",
    scene = list(
      camera = list(
        eye = list(x = 0.3, y = 2.5, z = 1.2) # Change x, y, z to adjust the starting angle
      )
    )
  )

You should now see that within each cluster we have subclusters. Let’s see if our original t-SNE was successful in separating them.

Code

tsne_plot2 <- plot_ly(
  data,
  x = ~tsne_x_30, y = ~tsne_y_30,
  color = ~subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",
  size = 5
) %>%
  layout(title = "")

pca_plot2 <- plot_ly(
  data,
  x = ~pca_x, y = ~pca_y,
  color = ~subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",
  size = 5
) %>%
  layout(title = "")

subplot(
  pca_plot2 %>% layout(showlegend = FALSE),
  tsne_plot2 %>% layout(showlegend = FALSE),
  titleX = TRUE, titleY = TRUE, margin = 0.2
) %>%
  layout(annotations = list(
    list(x = 0.13, y = 1.035, text = "PCA", showarrow = FALSE, xref = "paper", yref = "paper"),
    list(x = 0.85, y = 1.035, text = "t-SNE", showarrow = FALSE, xref = "paper", yref = "paper")
  ))

Compared to PCA, we have actually done a good job: most clusters seem to be well separated. But what we want to do now is change the perplexity to see if we can make this better.

Code

set.seed(123)
tsne_results_20 <- Rtsne(
  as.matrix(data[, c("gene1", "gene2", "gene3")]),
  perplexity = 20
)
data$tsne_x_20 <- tsne_results_20$Y[, 1]
data$tsne_y_20 <- tsne_results_20$Y[, 2]

tsne_plot2 <- plot_ly(
  data,
  x = ~tsne_x_30, y = ~tsne_y_30,
  color = ~subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",
  size = 5
) %>%
  layout(title = "")

tsne_plot20 <- plot_ly(
  data,
  x = ~tsne_x_20, y = ~tsne_y_20,
  color = ~subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",
  size = 5
) %>%
  layout(title = "")

subplot(
  tsne_plot2 %>% layout(showlegend = FALSE),
  tsne_plot20 %>% layout(showlegend = FALSE),
  titleX = TRUE, titleY = TRUE, margin = 0.2
) %>%
  layout(annotations = list(
    list(x = 0.13, y = 1.035, text = "t-SNE (30)", showarrow = FALSE, xref = "paper", yref = "paper"),
    list(x = 0.85, y = 1.035, text = "t-SNE (20)", showarrow = FALSE, xref = "paper", yref = "paper")
  ))

I have now decreased perplexity to 20. What we can see is that at 30 perplexity, t-SNE is accounting for a larger neighborhood of points when embedding the data. This results in clearer separation between clusters, with well-defined compact clusters. At 20 however, each cluster also appears distinctly separated in space, maintaining a reasonable balance between local and global structure. The substructures within the clusters are more prominent, with some separation and internal grouping within each main cluster, suggesting that t-SNE is better at capturing smaller-scale local variations with a lower perplexity.

2.3 Output of Rtsne

This is relatively straightforward. The most important output of this function is \(Y\). You can extract it using tsne_results_20$Y. This is a matrix with exactly the same number of rows as your original data, and its number of columns equals the dims parameter. So it is your data transformed into a lower-dimensional space of size dims; in our case, 2.
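As a minimal sketch of this (assuming the Rtsne package is installed; iris is again just stand-in data):

```r
library(Rtsne)

set.seed(123)
X <- unique(as.matrix(iris[, 1:4]))  # duplicates removed, as Rtsne requires
res <- Rtsne(X, dims = 2, perplexity = 20)

Y <- res$Y          # the low-dimensional coordinates
nrow(Y) == nrow(X)  # one embedded point per original observation
ncol(Y)             # equals the dims argument
```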

The rest of the output is basically either information about the optimization process or a summary of the input parameters. We are going to ignore them for now and get back to them later if needed.

Do it yourself

Play around with these parameters (perplexity, theta, and max_iter) to see if you can get better results.

3 What does this t-SNE plot mean!?

At this stage, it’s time to explore a bit more about what the t-SNE results can mean and whether we should trust them.

The first thing you should have noted from the last exercise is that t-SNE results are sensitive to the choice of parameters. As we saw, adjusting the perplexity changes how the data is visualized, with higher perplexities focusing more on global structure and lower perplexities emphasizing local relationships. This flexibility is both a strength and a limitation: it can show different aspects of the data, but it also means that different parameter settings can lead to different interpretations of the same dataset.

Code

tsne_data_frames <- c()
for (it in c(1:10, 100, 200, 300, 500, 1000, 3000, 5000)) {
  set.seed(123)
  tsne_results_it <- Rtsne(
    as.matrix(data[, c("gene1", "gene2", "gene3")]),
    perplexity = 30,
    max_iter = it
  )
  tsne_data_frames <- rbind(
    tsne_data_frames,
    data.frame(tsne_results_it$Y, data[, c(4, 5)], max_iter = it)
  )
}

tsne_plot2 <- plot_ly(
  tsne_data_frames,
  x = ~X1, y = ~X2,
  color = ~subcluster, frame = ~max_iter,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",
  size = 5
) %>%
  layout(title = "")

tsne_plot2

Click on the “Play” button to see the effect of the max_iter parameter. In this specific case, we could already see the clusters at a very early stage, so very little unexpected happens. But in reality, for noisier data, the number of steps matters and can actually change the interpretation of the data. Please note that in this specific case, I have also set the seed. t-SNE is a stochastic method, meaning that each run can give slightly different results unless a random seed is fixed. This randomness can also impact the arrangement of points in the low-dimensional space, especially with small datasets. So if you want to get the same results, set the seed!
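A small sketch of this stochasticity (again assuming Rtsne is installed, with iris as stand-in data): two runs with the same seed agree exactly, while a different seed generally gives a different layout.

```r
library(Rtsne)

X <- unique(as.matrix(iris[, 1:4]))

set.seed(1); emb_a <- Rtsne(X, perplexity = 10)$Y
set.seed(1); emb_b <- Rtsne(X, perplexity = 10)$Y  # same seed: same embedding
set.seed(2); emb_c <- Rtsne(X, perplexity = 10)$Y  # different seed: different layout

identical(emb_a, emb_b)
identical(emb_a, emb_c)
```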

We can also see how changing the perplexity affects our interpretation:

Code

tsne_data_frames <- c()
for (it in c(10, 20, 30, 40, 50, 60, 70, 100)) {
  set.seed(123)
  tsne_results_it <- Rtsne(
    as.matrix(data[, c("gene1", "gene2", "gene3")]),
    perplexity = it
  )
  tsne_data_frames <- rbind(
    tsne_data_frames,
    data.frame(tsne_results_it$Y, data[, c(4, 5)], perplexity = it)
  )
}

tsne_plot2 <- plot_ly(
  tsne_data_frames,
  x = ~X1, y = ~X2,
  color = ~subcluster, frame = ~perplexity,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",
  size = 5
) %>%
  layout(title = "")

tsne_plot2

It should not come as a surprise that perplexity has the biggest impact on the interpretation. One can play around with different parameter values to get a few different views of the data and thereby reach a more robust interpretation.

3.1 Between-cluster distances and densities might not be accurate!

Code

# Set seed for reproducibility
set.seed(123)

# Number of points per subcluster
n <- 25

# Manually define main cluster centers (global structure)
main_cluster_centers <- data.frame(
  x = c(0, 10, 15, 35),
  y = c(0, 10, 15, 35),
  z = c(0, 10, 15, 35),
  cluster = factor(1:4)
)

# Manually define subcluster offsets relative to each main cluster
# These small offsets will determine subcluster locations within each main cluster
subcluster_offsets <- data.frame(
  x_offset = c(-0.25, 0.25, -0.25, 0.25),
  y_offset = c(-0.25, -0.25, 0.25, 0.25),
  z_offset = c(0.25, -0.25, -0.25, 0.25),
  subcluster = factor(1:4)
)

# Initialize an empty data frame to hold all data
data <- data.frame()

# Generate data for each main cluster with subclusters
for (i in 1:nrow(main_cluster_centers)) {
  for (j in 1:nrow(subcluster_offsets)) {
    # Calculate subcluster center by adding the offset to the main cluster center
    subcluster_center <- main_cluster_centers[i, 1:3] + subcluster_offsets[j, 1:3]
    # Spread (sd) grows with the cluster index, giving each main cluster a different density
    subcluster_data <- data.frame(
      gene1 = rnorm(n, mean = subcluster_center$x, sd = 0.25 * i * i),
      gene2 = rnorm(n, mean = subcluster_center$y, sd = 0.25 * i * i),
      gene3 = rnorm(n, mean = subcluster_center$z, sd = 0.25 * i * i),
      cluster = main_cluster_centers$cluster[i],
      subcluster = subcluster_offsets$subcluster[j]
    )
    # Add generated subcluster data to the main data frame
    data <- rbind(data, subcluster_data)
  }
}

plot_ly(
  data,
  x = ~gene1, y = ~gene2, z = ~gene3,
  color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter3d", mode = "markers",
  size = 5
) %>%
  layout(
    title = "Original 3D Data with Clusters and Subclusters",
    scene = list(
      camera = list(
        eye = list(x = 0.3, y = 2.5, z = 1.2) # Change x, y, z to adjust the starting angle
      )
    )
  )

What I have done now is to give each larger cluster a different density and also different distances between the clusters. We can run t-SNE and compare the results to PCA.

Code

set.seed(123)
tsne_results_30 <- Rtsne(as.matrix(data[, c("gene1", "gene2", "gene3")]))
data$tsne_x_30 <- tsne_results_30$Y[, 1]
data$tsne_y_30 <- tsne_results_30$Y[, 2]

tsne_plot <- plot_ly(
  data,
  x = ~tsne_x_30, y = ~tsne_y_30,
  color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",
  size = 5
) %>%
  layout(title = "")

pca_results <- prcomp(data[, c("gene1", "gene2", "gene3")], scale. = FALSE)
data$pca_x <- pca_results$x[, 1]
data$pca_y <- pca_results$x[, 2]

pca_plot <- plot_ly(
  data,
  x = ~pca_x, y = ~pca_y,
  color = ~cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",
  size = 5
) %>%
  layout(title = "")

subplot(
  pca_plot %>% layout(showlegend = FALSE),
  tsne_plot %>% layout(showlegend = FALSE),
  titleX = TRUE, titleY = TRUE, margin = 0.2
) %>%
  layout(annotations = list(
    list(x = 0.13, y = 1.035, text = "PCA", showarrow = FALSE, xref = "paper", yref = "paper"),
    list(x = 0.85, y = 1.035, text = "t-SNE", showarrow = FALSE, xref = "paper", yref = "paper")
  ))

PCA did a great job in preserving the relative distances between clusters, reflecting the original distribution of the data. The density of the clusters is also proportional to the original data, with tighter clusters remaining dense and more spread-out clusters maintaining their looser arrangement. In contrast, t-SNE naturally equalizes the densities of the clusters, making them appear more uniform. This is not an artifact, but a deliberate feature of t-SNE, where dense clusters are expanded and sparse ones contracted. As a result, the distances between clusters in the t-SNE plot may appear more similar, and the clusters themselves more evenly distributed, which can distort the true global relationships.

Clustering on t-SNE results

Avoid doing density- and distance-based clustering in t-SNE space. In the vast majority of cases, the distances and densities there don’t have much meaning!
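If you do need cluster assignments, a safer pattern is to cluster in the original (or PCA-reduced) feature space and use t-SNE only for display. A minimal base-R sketch of this pattern on toy data (the data and names here are purely illustrative):

```r
set.seed(123)
# Toy stand-in for an expression matrix: two well-separated groups in 3D
X <- rbind(
  matrix(rnorm(50 * 3, mean = 0), ncol = 3),
  matrix(rnorm(50 * 3, mean = 5), ncol = 3)
)

# Cluster in the ORIGINAL feature space...
km <- kmeans(X, centers = 2, nstart = 25)

# ...then use the labels only to colour a t-SNE plot, e.g.
# plot_ly(x = tsne_res$Y[, 1], y = tsne_res$Y[, 2], color = factor(km$cluster), ...)
table(km$cluster)
```

The key point is that kmeans sees the original distances, which are meaningful, while the t-SNE coordinates are used purely as a canvas for the colours.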

The last thing I want to mention here is that too low a perplexity may cause noise to be misinterpreted as clusters. In our previous example we know that we have four major clusters in the data, but look what happens if we decrease the perplexity too much.

Code

tsne_data_frames <- c()
for (it in 2:10) {
  set.seed(123)
  tsne_results_it <- Rtsne(
    as.matrix(data[, c("gene1", "gene2", "gene3")]),
    perplexity = it
  )
  tsne_data_frames <- rbind(
    tsne_data_frames,
    data.frame(tsne_results_it$Y, data[, c(4, 5)], perplexity = it)
  )
}

tsne_plot2 <- plot_ly(
  tsne_data_frames,
  x = ~X1, y = ~X2,
  color = ~subcluster, frame = ~perplexity,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers",
  size = 5
) %>%
  layout(title = "")

tsne_plot2