2 Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)

If you have not read the t-SNE section, I strongly suggest skimming it before you continue with UMAP.

Similar to t-SNE, UMAP is used to reduce the dimensionality of data. Although it is slightly better at preserving global structure, I would put it in the same class as t-SNE: its main focus is to accurately reflect the relationships between nearby points. So the idea is the same. We want to map the data to a new, smaller set of dimensions so that we can see the relationships between the data points. We are interested in tightly clustered data points because we think they reflect a stronger or more meaningful relationship in the original high-dimensional space. These tight clusters likely represent similar or related data points, and by preserving the local structure we aim to retain these key relationships in the lower-dimensional representation. This helps in identifying patterns, trends, or groupings that are otherwise difficult to visualize. While global relationships are not entirely ignored, the emphasis on local neighborhoods means that small but significant clusters are faithfully represented, which is particularly useful in tasks like clustering, classification, or understanding complex datasets.

Later we are going to see exactly how UMAP works but for now let’s consider it as a tool that takes the input data, transforms it and gives us some output. We are going to see how to use UMAP in R.

3 UMAP in R

There are a few good implementations of UMAP in R. We are going to use the uwot package, which should work on most platforms. An alternative is the umap package, but uwot gives more flexibility in terms of the functions it provides and the parameters it exposes.

We are going to use the same data that we have been using for t-SNE.

Code

# Visualize the original data in 3D, distinguishing clusters and subclusters
library(dplyr)
library(plotly) # For interactive 3D plots

# Set seed for reproducibility
set.seed(123)

# Number of points per subcluster
n <- 25

# Manually define main cluster centers (global structure)
main_cluster_centers <- data.frame(
  x = c(0, 5, 10, 15),
  y = c(0, 5, 10, 15),
  z = c(0, 5, 10, 15),
  cluster = factor(1:4)
)

# Manually define subcluster offsets relative to each main cluster
# These small offsets will determine subcluster locations within each main cluster
subcluster_offsets <- data.frame(
  x_offset = c(-0.25, 0.25, -0.25, 0.25),
  y_offset = c(-0.25, -0.25, 0.25, 0.25),
  z_offset = c(0.25, -0.25, -0.25, 0.25),
  subcluster = factor(1:4)
)

# Initialize an empty data frame to hold all data
data <- data.frame()

# Generate data for each main cluster with subclusters
for (i in 1:nrow(main_cluster_centers)) {
  for (j in 1:nrow(subcluster_offsets)) {
    # Calculate subcluster center by adding the offset to the main cluster center
    subcluster_center <- main_cluster_centers[i, 1:3] + subcluster_offsets[j, 1:3]

    # Generate points for each subcluster with a small spread (to form local clusters)
    subcluster_data <- data.frame(
      gene1 = rnorm(n, mean = subcluster_center$x, sd = 0.25), # Small spread within subclusters
      gene2 = rnorm(n, mean = subcluster_center$y, sd = 0.25),
      gene3 = rnorm(n, mean = subcluster_center$z, sd = 0.25),
      cluster = main_cluster_centers$cluster[i],
      subcluster = subcluster_offsets$subcluster[j]
    )

    # Add generated subcluster data to the main data frame
    data <- rbind(data, subcluster_data)
  }
}

plot_ly(
  data, x = ~gene1, y = ~gene2, z = ~gene3,
  color = ~cluster, symbol = ~subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter3d", mode = "markers", size = 5
) %>%
  layout(
    title = "Original 3D Data with Clusters and Subclusters",
    scene = list(
      camera = list(
        eye = list(x = 0.3, y = 2.5, z = 1.2) # Change x, y, z to adjust the starting angle
      )
    )
  )

This data has just three dimensions for 400 samples (4 major clusters, each containing 4 subclusters). We are now going to run UMAP (using the umap function) on this data and look at the results. We also run t-SNE and show the results side by side. There are parameters to set, but we just go with the defaults, except for n_neighbors in UMAP, which we set to 30 to match the perplexity used for t-SNE.

Code

library(Rtsne)
library(uwot)

# t-SNE on the three gene columns
set.seed(123)
tsne_results <- Rtsne(
  as.matrix(data[, c("gene1", "gene2", "gene3")]),
  perplexity = 30
)

# UMAP on the same columns, with n_neighbors matching the t-SNE perplexity
set.seed(123)
umap_results <- umap(
  as.matrix(data[, c("gene1", "gene2", "gene3")]),
  n_neighbors = 30
)

tsne_plot <- plot_ly(
  as.data.frame(tsne_results$Y), x = ~V1, y = ~V2,
  color = ~data$cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
) %>%
  layout(title = "")

umap_plot <- plot_ly(
  as.data.frame(umap_results), x = ~V1, y = ~V2,
  color = ~data$cluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
) %>%
  layout(title = "")

# Show both embeddings side by side
subplot(
  tsne_plot %>% layout(showlegend = FALSE),
  umap_plot %>% layout(showlegend = FALSE),
  titleX = TRUE, titleY = TRUE, margin = 0.2
) %>%
  layout(annotations = list(
    list(x = 0.13, y = 1.035, text = "t-SNE", showarrow = FALSE, xref = "paper", yref = "paper"),
    list(x = 0.85, y = 1.035, text = "UMAP", showarrow = FALSE, xref = "paper", yref = "paper")
  ))

What we see here is a lower-dimensional representation of our data (from 3D to 2D). Both methods did a great job (as expected!) of capturing the clusters. We are going to leave t-SNE for now and focus only on how we can use UMAP and what we can do with it.

3.1 Most important parameters

This implementation of UMAP actually offers many parameters to configure. Fortunately, most of them can be left at their default values. However, to get started, there are three key parameters that we need to focus on for now:

X: This represents the input data that we want to reduce in dimensionality. It can be any numerical dataset (or factor, more on this later), such as a matrix or data frame in R.

n_components: The dimension of the space to transform the original data into. This is basically the size of the lower-dimensional space.

n_neighbors: This parameter controls the size of the local neighborhood that UMAP uses for each point. It essentially determines how many nearby points influence the embedding of a given point, with smaller values focusing more on local structure and larger values considering broader relationships.

We can run the function by calling umap(data, n_components = 2, n_neighbors = 10). Remember to set the seed for reproducibility.
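As a minimal sketch of that call, the block below runs umap on a small random matrix. The names toy and toy_embedding and the toy data itself are made up for illustration and are not part of the tutorial's dataset.

```r
library(uwot)

# Hypothetical stand-in data: 100 samples with 5 numeric features
set.seed(123)
toy <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)

# Set the seed right before the call so the embedding can be reproduced
set.seed(123)
toy_embedding <- umap(toy, n_components = 2, n_neighbors = 10)

# The result is a matrix with one row per sample and one column per component
dim(toy_embedding)
```

With the tutorial's data you would pass the three gene columns instead of toy.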

The n_neighbors parameter is the most important here. We can visualize how it affects the outcome of dimensionality reduction.

Code

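# Run UMAP for a range of n_neighbors values and stack the embeddings
umap_data_frames <- c()
for (nn in c(5, 10, 15, 20, 30, 50, 100, 200, 300, 400)) {
  set.seed(123)
  umap_results <- umap(
    as.matrix(data[, c("gene1", "gene2", "gene3")]),
    n_neighbors = nn
  )
  # Keep the cluster and subcluster labels (columns 4 and 5) with each embedding
  umap_data_frames <- rbind(
    umap_data_frames,
    data.frame(umap_results, data[, c(4, 5)], n_neighbors = nn)
  )
}

# One animation frame per n_neighbors value
umap_plot <- plot_ly(
  umap_data_frames, x = ~X1, y = ~X2,
  color = ~subcluster, frame = ~n_neighbors,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
) %>%
  layout(title = "")

umap_plot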

Here I have plotted the data and colored the subclusters. Surprisingly, the algorithm is quite robust to small changes in the n_neighbors parameter, but as we use a larger number of neighbors the algorithm starts capturing more of the global structure than the local one. This is similar to how t-SNE behaves, but UMAP is less sensitive to the changes.

There are a few other parameters to play with, namely spread and min_dist. We are going to talk a bit about them later in the math section, but for now let’s give an intuitive explanation of them. The spread parameter sets how much “space” the UMAP algorithm has to distribute points across the lower-dimensional space. A higher value of spread allows points to be distributed more widely, meaning that the embedding will appear more “spread out.” Conversely, a lower spread value will make the points more tightly packed together across the entire embedding. min_dist, on the other hand, sets how close points can be to each other. Smaller min_dist values lead to tighter local clusters in the embedding. Obviously, this has to be set relative to spread.

As an example, let’s focus on one of our higher n_neighbors results (n_neighbors = 200).

Code

# Vary spread from 1 to 10, scaling min_dist along with it (min_dist = spread / 100)
umap_data_frames <- c()
for (spr in 1:10) {
  set.seed(123)
  umap_results <- umap(
    as.matrix(data[, c("gene1", "gene2", "gene3")]),
    n_neighbors = 200, min_dist = spr / 100, spread = spr
  )
  umap_data_frames <- rbind(
    umap_data_frames,
    data.frame(umap_results, data[, c(4, 5)], spread = spr)
  )
}

# One animation frame per spread value
umap_plot <- plot_ly(
  umap_data_frames, x = ~X1, y = ~X2,
  color = ~subcluster, frame = ~spread,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
) %>%
  layout(title = "")

umap_plot

As you can see, UMAP changes the scale over which the points are spread out in the lower-dimensional space. We can now look at the effect of min_dist:

Code

# Keep spread fixed at 10 and vary min_dist from 1 to 8
umap_data_frames <- c()
for (mn_dst in 1:8) {
  set.seed(123)
  umap_results <- umap(
    as.matrix(data[, c("gene1", "gene2", "gene3")]),
    n_neighbors = 200, min_dist = mn_dst, spread = 10
  )
  umap_data_frames <- rbind(
    umap_data_frames,
    data.frame(umap_results, data[, c(4, 5)], min_dist = mn_dst)
  )
}

# One animation frame per min_dist value
umap_plot <- plot_ly(
  umap_data_frames, x = ~X1, y = ~X2,
  color = ~subcluster, frame = ~min_dist,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
) %>%
  layout(title = "")

umap_plot

In this plot, instead of UMAP “stretching out” the space, the distances between the points themselves are increasing. Although it might be a bit confusing, you can think of spread as acting on the global layout, whereas min_dist acts on local relationships. In any case, there is no best value for these parameters. One has to experiment with them to get the desired output.

3.2 Mixed data types

Sometimes it is possible to encounter mixed data types, combinations of numeric, categorical, and even binary data. This implementation of UMAP offers flexibility for handling such datasets through its metric parameter. It can handle multiple types of distance metrics simultaneously, meaning we can tune the way UMAP processes different subsets of our data based on their characteristics. For instance, we might use Euclidean distance for continuous variables, Hamming distance for binary data, or categorical distance for factor variables.

These are basically the distances that are supported now:

Euclidean: Best suited for continuous numeric data.

Cosine: Useful when you want to measure the angle between vectors (common in text or word embedding tasks).

Manhattan: Focuses on absolute differences between values, which can be useful for certain types of numeric data.

Hamming: Ideal for binary or bit-wise data.

Categorical: Designed specifically for factor variables, ensuring that categorical data is treated appropriately.

If our dataset is a data frame or matrix with multiple data types, UMAP allows us to specify a list of metrics, each applied to specific columns. For example, we could assign Euclidean distance to one block of numeric columns and Manhattan distance to another block. In our case, we have used the default, which is Euclidean distance. Have a look at the help page of the umap function to see how you can incorporate different distance metrics.
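To make the list-of-metrics idea concrete, here is a small sketch on the built-in iris data rather than the tutorial's data. It follows the pattern described in the uwot documentation, where metric is a named list mapping each metric to the columns it applies to. The split of columns between the two metrics is arbitrary and chosen only for illustration.

```r
library(uwot)

# Use the four numeric columns of the built-in iris data as a stand-in dataset
iris_num <- iris[, 1:4]

set.seed(123)
mixed_embedding <- umap(
  iris_num,
  n_neighbors = 15,
  # Each list entry names a metric and the columns it applies to
  metric = list(
    euclidean = c("Sepal.Length", "Sepal.Width"),
    manhattan = c("Petal.Length", "Petal.Width")
  )
)

# One row per flower, two embedding dimensions
dim(mixed_embedding)
```

The same list form is where a categorical or hamming entry would go for factor or binary columns. Check the umap help page for the exact requirements of those metrics before relying on them.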

3.2.1 Machine learning

Although the most common application of UMAP, at least in life sciences, is to visualize data and confirm clusters, its utility goes beyond visualization and can sometimes significantly improve machine learning workflows. The idea here is to project high-dimensional data into a lower-dimensional space and train the model in this smaller space. Basically, we want to mitigate the curse of dimensionality, reduce computational costs, and eliminate noise, so that we get cleaner and more compact data representations.

Training a model is straightforward: we can just take the UMAP latent scores and use them as features in the model. In our case, let’s see if we can use Random Forest to do that. We are going to predict the subclusters in our data using both the original and the transformed features:

Using the original features

Code

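# This embedding is not used by the model below, which is trained on the original features
set.seed(123)
umap_results <- umap(
  as.matrix(data[, c("gene1", "gene2", "gene3")]),
  n_neighbors = 10
)

X <- data[, c("gene1", "gene2", "gene3")] # Predictor variables (genes)
y <- data$subcluster # Target variable

# Set seed for reproducibility
set.seed(123)
# Note: `family` is not a documented randomForest argument; it is kept here so the
# call matches the printed output. Classification is used because y is a factor.
rf_model <- randomForest::randomForest(y ~ ., data = data.frame(X, y), family = binomial)
print(rf_model)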

Call:
randomForest(formula = y ~ ., data = data.frame(X, y), family = binomial)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 26.25%
Confusion matrix:
   1  2  3  4 class.error
1 72 14  6  8        0.28
2  8 73  9 10        0.27
3  7 11 74  8        0.26
4  6 10  8 76        0.24

The performance is OK. We can now perform the same modeling but using the UMAP scores instead of the original data.

Code

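# Fit UMAP and keep the model so that new data can be projected later
set.seed(123)
umap_results <- umap(
  as.matrix(data[, c("gene1", "gene2", "gene3")]),
  n_neighbors = 10,
  ret_model = TRUE
)

X <- umap_results$embedding # Get the lower dimension
y <- data$subcluster # Target variable

# Set seed for reproducibility
set.seed(123)
# Note: `family` is not a documented randomForest argument; it is kept here so the
# call matches the printed output. Classification is used because y is a factor.
rf_model <- randomForest::randomForest(y ~ ., data = data.frame(X, y), family = binomial)
print(rf_model)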

Call:
randomForest(formula = y ~ ., data = data.frame(X, y), family = binomial)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 25.25%
Confusion matrix:
   1  2  3  4 class.error
1 72 12  6 10        0.28
2 10 74  9  7        0.26
3  6  9 75 10        0.25
4  4  9  9 78        0.22

The performance became marginally better (an OOB error rate of 25.25% versus 26.25%) and we are using one less dimension. Please note that when calling the umap function, I have used ret_model = TRUE. This is not because we want to do modeling on the scores (it does not change them) but because retaining the model allows us to reuse the trained UMAP model on new, unseen data. This is an important feature of UMAP: it allows us to project data that was not part of the original dataset onto the same lower-dimensional space. We want to do that because, when testing our machine learning model, the training and testing data must be represented consistently. The model can only make valid predictions if the test data lives in the same lower-dimensional space as the data it was trained on. If the test data were instead embedded separately, differences between the two embeddings could lead to misleading results and reduced model performance.

Let’s generate some test data and see how UMAP does the projection.

Code

# Simulate 20 test points from the overall mean and spread of each gene
set.seed(123)
gene1 <- rnorm(20, mean = mean(data$gene1), sd = sd(data$gene1))
gene2 <- rnorm(20, mean = mean(data$gene2), sd = sd(data$gene2))
gene3 <- rnorm(20, mean = mean(data$gene3), sd = sd(data$gene3))
data_test <- data.frame(gene1 = gene1, gene2 = gene2, gene3 = gene3)

# Plot the test points in the original 3D space
plot_ly(
  data_test, x = ~gene1, y = ~gene2, z = ~gene3,
  type = "scatter3d", mode = "markers", size = 5
) %>%
  layout(
    title = "Simulated 3D Test Data",
    scene = list(
      camera = list(
        eye = list(x = 0.3, y = 2.5, z = 1.2) # Change x, y, z to adjust the starting angle
      )
    )
  )

We are not going to measure the performance here because we just randomly generated these points, but in a real application we would need to do that.

So to summarize, we can use the projection capability of UMAP to map new, unseen data points onto the latent structure and use them for a variety of tasks, including predictive modelling.
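The projection step itself is not shown in the code above. A minimal sketch of it, using uwot's umap_transform function, could look like the following. The names train, test, model and test_embedding and the random data are stand-ins made up so the block runs on its own.

```r
library(uwot)

# Hypothetical stand-in data: 200 training and 20 test samples with 3 features
set.seed(123)
gene_names <- c("gene1", "gene2", "gene3")
train <- matrix(rnorm(200 * 3), ncol = 3, dimnames = list(NULL, gene_names))
test <- matrix(rnorm(20 * 3), ncol = 3, dimnames = list(NULL, gene_names))

# Fit UMAP on the training data and keep the model for later projection
set.seed(123)
model <- umap(train, n_neighbors = 10, ret_model = TRUE)

# Project the unseen points onto the embedding learned from the training data
set.seed(123)
test_embedding <- umap_transform(test, model)

# One row per test sample, same number of columns as the training embedding
dim(test_embedding)
```

With the objects from this section, the equivalent call would be umap_transform(as.matrix(data_test), umap_results), provided umap_results was fitted with ret_model = TRUE. The projected scores can then be passed to predict together with the Random Forest model, as long as the column names match those used in training.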

3.3 Data integration

UMAP’s flexibility and ability to handle multiple data types make it an excellent tool for data integration, that is, for combining datasets from different sources or modalities. In many machine learning and bioinformatics applications, datasets may come from varied experiments or measurement techniques (e.g., RNA-seq, metabolomics, clinical data), and we might want to integrate them into a common analysis space.

You need to have the same samples across the different modalities to be able to perform integration. It is possible to modify UMAP to do integration without corresponding samples across the data sources, but this is not a standard capability of UMAP.

In order to demonstrate data integration, we need to simulate some more data. In this case, I have simulated a dataset that has six genes. There is some correlation between these genes, but the key difference is that three of them show one form of clustering while the other three show different clusters.

Code

# Set seed for reproducibility
set.seed(123)

# Number of points per subcluster
n <- 25

# Function to introduce correlation between gene1:gene3 and gene4:gene6
generate_correlated_data <- function(n, main_cluster_id, subcluster_offset) {
  # Generate subcluster-visible data in gene1:gene3
  gene1 <- rnorm(n, mean = subcluster_offset[1], sd = 0.25)
  gene2 <- rnorm(n, mean = subcluster_offset[2], sd = 0.25)
  gene3 <- rnorm(n, mean = subcluster_offset[3], sd = 0.25)

  # Introduce correlation between gene1:gene3 and gene4:gene6;
  # the main clusters are visible in gene4:gene6
  gene4 <- 0.8 * gene1 + rnorm(n, mean = main_cluster_id * 2, sd = 0.25)
  gene5 <- 0.8 * gene2 + rnorm(n, mean = main_cluster_id * 2, sd = 0.25)
  gene6 <- 0.8 * gene3 + rnorm(n, mean = main_cluster_id * 2, sd = 0.25)

  return(data.frame(gene1, gene2, gene3, gene4, gene5, gene6))
}

# Define subcluster offsets (subclusters in gene1:gene3)
subcluster_offsets <- list(
  c(-0.25, -0.25, 0.25),
  c(0.25, -0.25, 0.25),
  c(-0.25, 0.25, -0.25),
  c(0.25, 0.25, -0.25)
)

# Initialize an empty data frame to hold all data
data <- data.frame()

# Generate data for each main cluster and subcluster
for (main_cluster_id in 1:4) { # 4 main clusters
  for (subcluster_id in 1:4) { # 4 subclusters per main cluster
    # Generate data for the current subcluster
    subcluster_data <- generate_correlated_data(n, main_cluster_id, subcluster_offsets[[subcluster_id]])

    # Add the cluster and subcluster labels
    subcluster_data$main_cluster <- factor(main_cluster_id)
    subcluster_data$subcluster <- factor(subcluster_id)

    # Combine with the main data
    data <- rbind(data, subcluster_data)
  }
}

# UMAP on all six genes, then on each block of three genes separately
set.seed(123)
full_model <- umap(data[, 1:6], n_neighbors = 30)
set.seed(123)
D1 <- umap(data[, 1:3], n_neighbors = 30)
set.seed(123)
D2 <- umap(data[, 4:6], n_neighbors = 30)

full_data <- plot_ly(
  as.data.frame(full_model), x = ~V1, y = ~V2,
  color = ~data$main_cluster, symbol = ~data$subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
)

D1_data <- plot_ly(
  as.data.frame(D1), x = ~V1, y = ~V2,
  color = ~data$main_cluster, symbol = ~data$subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
)

D2_data <- plot_ly(
  as.data.frame(D2), x = ~V1, y = ~V2,
  color = ~data$main_cluster, symbol = ~data$subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
)

# Arrange the three embeddings in a 2 x 2 grid and label each panel
subplot(
  full_data %>% layout(showlegend = FALSE),
  D1_data %>% layout(showlegend = FALSE),
  D2_data %>% layout(showlegend = FALSE),
  nrows = 2, titleX = TRUE, titleY = TRUE, margin = 0.03
) %>%
  layout(annotations = list(
    list(x = 0.13, y = 1.035, text = "Full data", showarrow = FALSE, xref = "paper", yref = "paper"),
    list(x = 0.85, y = 1.035, text = "Dataset 1", showarrow = FALSE, xref = "paper", yref = "paper"),
    list(x = 0.13, y = 0.3, text = "Dataset 2", showarrow = FALSE, xref = "paper", yref = "paper")
  ))

In the three subplots above, you see the results of UMAP on the full data (all six variables), Dataset 1 (first three variables) and Dataset 2 (last three variables). In all the plots, the colors show one type of clusters and shapes show a different type. Considering the full data, it is clear that colors are separated and within each color we have relatively good separation of shapes. However, if we do UMAP on each dataset in isolation, none of the datasets can show us a good separation of both cluster types.

Now the point here is that, for some reason, we don’t want to merge our data sources. It could be because they come from different distributions, because each dataset might have some hidden pattern, or because we want to preserve the original structure of each dataset. In fact, in most cases we really don’t want to merge the data and apply UMAP to the merged data directly. What we are actually interested in is mapping each dataset to an intermediate space that preserves as much information as possible from each dataset individually, while putting all datasets in the same kind of representation so that they can be combined.

In this case, the intermediate representation is a graph that encodes the similarities between the samples. The similarity_graph function from uwot allows us to do such a mapping. It is important to note that, similar to t-SNE, this similarity is affected by many parameters, most importantly n_neighbors. In fact, you can think about the approach as follows: we run the first stage of UMAP on each dataset in isolation to obtain its similarity graph, merge these similarity graphs into a single one, and then visualize only the merged graph.

So let’s start creating the similarity graphs first:
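The code for this step is not shown here, so the block below is only a sketch of what it might look like. It assumes that D1_sim and D2_sim (the names used in the merging code further down) are built with similarity_graph on gene1:gene3 and gene4:gene6, with n_neighbors = 30 as in the earlier UMAP runs. To keep the block runnable on its own, it falls back to a compact stand-in simulation when the six-gene data is not already in the workspace.

```r
library(uwot)

# Use the six-gene `data` simulated above if it is available; otherwise build a
# compact stand-in with the same layout (assumption: 4 x 4 subclusters of 25 points)
have_data <- exists("data", inherits = FALSE) && is.data.frame(data) &&
  all(paste0("gene", 1:6) %in% names(data))
if (!have_data) {
  set.seed(123)
  main_id <- rep(1:4, each = 100)
  sub_id <- rep(rep(1:4, each = 25), times = 4)
  offsets <- rbind(
    c(-0.25, -0.25, 0.25),
    c(0.25, -0.25, 0.25),
    c(-0.25, 0.25, -0.25),
    c(0.25, 0.25, -0.25)
  )
  # Subclusters are visible in gene1:gene3, main clusters in gene4:gene6
  g123 <- offsets[sub_id, ] + matrix(rnorm(400 * 3, sd = 0.25), ncol = 3)
  g456 <- 0.8 * g123 + 2 * main_id + matrix(rnorm(400 * 3, sd = 0.25), ncol = 3)
  data <- data.frame(g123, g456)
  names(data) <- paste0("gene", 1:6)
  data$main_cluster <- factor(main_id)
  data$subcluster <- factor(sub_id)
}

# One similarity graph per dataset (sparse sample-by-sample matrices)
set.seed(123)
D1_sim <- similarity_graph(data[, 1:3], n_neighbors = 30)
set.seed(123)
D2_sim <- similarity_graph(data[, 4:6], n_neighbors = 30)

# Both graphs are defined over the same samples, so their dimensions agree
dim(D1_sim)
dim(D2_sim)
```

Using the same n_neighbors for both datasets is a choice made here for simplicity; each graph can be built with its own settings.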

We can now easily merge these graphs using the simplicial_set_intersect or simplicial_set_union function. In many applications we want to use simplicial_set_intersect, as it focuses on the structures shared between the datasets, but if the total structure is of interest we could use the union. After building the merged similarity graph, we can use the optimize_graph_layout function to map these similarities onto a lower-dimensional space and visualize it.

set.seed(123)
# Combine the two representations into one
integ_umap <- simplicial_set_intersect(x = D1_sim, y = D2_sim)

# Lay out the merged graph in two dimensions
set.seed(123)
umap_scores <- optimize_graph_layout(integ_umap)

plot_ly(
  as.data.frame(umap_scores), x = ~V1, y = ~V2,
  color = ~data$main_cluster, symbol = ~data$subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
) %>%
  layout(title = "", showlegend = FALSE)

In the figure you see the integrated UMAP space. This actually looks really good. But the question is whether we can make it a bit better, for example by making the color clusters a bit tighter.

Fortunately, the simplicial_set_intersect function has a weight parameter that can be used to tune the relative influence of each dataset on the final embedding. When we called the function above, we assigned dataset 1 to x and dataset 2 to y. A weight of 0.5 gives equal influence. Values smaller than 0.5 put more weight on x, meaning dataset 1. Values greater than 0.5 put more weight on y, that is, dataset 2.

In this specific case, since I want tighter colors, I want to give slightly more weight to dataset 2.

set.seed(123)
# Combine the two representations into one, giving more weight to dataset 2 (y)
integ_umap <- simplicial_set_intersect(x = D1_sim, y = D2_sim, weight = 0.6)

# Lay out the merged graph in two dimensions
set.seed(123)
umap_scores <- optimize_graph_layout(integ_umap)

plot_ly(
  as.data.frame(umap_scores), x = ~V1, y = ~V2,
  color = ~data$main_cluster, symbol = ~data$subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
) %>%
  layout(title = "", showlegend = FALSE)

As we can see, the color clusters are now much tighter than before and we still have OK clustering of the shapes. Can you find a better value for the weight? This takes a bit of experimenting. You can also use different parameters when constructing the initial similarity graphs. For example, if you suspect that the clusters are more fine-grained in dataset 1 than in dataset 2, you can use a smaller number of neighbors for dataset 1. Other parameters can also be changed depending on the specific dataset. This wraps up the data integration approach of UMAP.

3.4 Supervised

Although UMAP is an unsupervised method, it can be adapted in a way that mimics supervised learning. This doesn’t mean UMAP maps data from an input space to a specific output space. Instead, the constructed lower-dimensional space is adjusted to reflect class separability, tuning the embedding so that data points from the same class are closer together, while points from different classes are more distinct. In this case, we need some target data. This is often the group of interest or some numerical value, but it can also be a matrix with many different targets of interest. In any case, we can use the y argument of the umap function to pass the target of interest. For example, here we use the subclusters as the target. Please be aware that categorical variables need to be converted to factors.

# Supervised UMAP: the subcluster labels are passed as the target y
set.seed(123)
supervised_score <- umap(
  data[, 1:6],
  n_neighbors = 30,
  y = data$subcluster,
  spread = 7, min_dist = 3
)

plot_ly(
  as.data.frame(supervised_score), x = ~V1, y = ~V2,
  color = ~data$main_cluster, symbol = ~data$subcluster,
  colors = c("red", "green", "blue", "purple"),
  type = "scatter", mode = "markers", size = 5
) %>%
  layout(title = "", showlegend = FALSE)

Here I had to spread the points out a bit so that we can see the results more clearly. Compared to the results we saw before, this looks a bit better: at least some of the subclusters (shapes) have moved closer to each other, which reflects the relationship that we wanted to capture using y. Given some test data, we could project the test data points onto this space and use some form of similarity to perform prediction, but we are going to leave this to you to do :)

3.5 Should I trust UMAP?

While UMAP is a powerful tool that preserves local structure and some of the global structure in your data, it’s important to approach the results with caution. UMAP does not guarantee perfect global preservation, and it may sometimes produce artifacts, such as spurious clusters, that do not reflect meaningful patterns in the data. These clusters can arise from random noise or from the optimization process itself, especially when the parameters are tuned too aggressively.

UMAP is primarily designed to preserve local relationships, meaning that points close to each other in the high-dimensional space should remain close in the low-dimensional embedding. However, global structure may be distorted, and distant relationships may not always be accurately represented. This is particularly true when using smaller values for the n_neighbors parameter, which can make UMAP focus too much on local clusters, potentially missing broader trends.

It’s also important to avoid tuning UMAP’s parameters to match a desired outcome or preconceived conclusion. Over-tuning the algorithm, especially by manipulating n_neighbors or min_dist to achieve specific patterns, can lead to misleading visualizations that don’t reflect the true structure of the data. Instead, parameter tuning should be guided by the nature of the data and the goal of the analysis, not by an attempt to force specific patterns or clusters.

Lastly, as with t-SNE, avoid applying distance-based clustering methods to UMAP embeddings. Density-based clustering might work, but there is no guarantee of that!

Code

library(Rtsne)
library(uwot)

cat_dt <- read.table("https://raw.githubusercontent.com/PayamEmami/pca_basics/refs/heads/master/data/cat.tsv")

# t-SNE, UMAP and PCA on the same data
set.seed(123)
tsne_results_cat <- Rtsne(as.matrix(cat_dt), perplexity = 30)

set.seed(123)
umap_results_cat <- umap(as.matrix(cat_dt), n_neighbors = 30)

pca <- prcomp(cat_dt)$x[, 1:2]

pca_plot <- plot_ly(
  as.data.frame(pca), x = ~PC1, y = ~PC2,
  type = "scatter", mode = "markers", size = 5
) %>%
  layout(title = "")

tsne_plot <- plot_ly(
  as.data.frame(tsne_results_cat$Y), x = ~V1, y = ~V2,
  type = "scatter", mode = "markers", size = 5
) %>%
  layout(title = "")

umap_plot <- plot_ly(
  as.data.frame(umap_results_cat), x = ~V1, y = ~V2,
  type = "scatter", mode = "markers", size = 5
) %>%
  layout(title = "")

# Plot the original data in 3D
colnames(cat_dt) <- c("V1", "V2", "V3")
original_data <- plot_ly(
  as.data.frame(cat_dt), x = ~V1, y = ~V2, z = ~V3,
  type = "scatter3d", mode = "markers", size = 5
) %>%
  layout(title = "Original data")

original_data