7 Exercises

In this section we re-analyze a data set from the mixOmics package. According to the authors: “The Small Round Blue Cell Tumors (SRBCT) dataset includes the expression levels of 2,308 genes measured on 63 samples. The samples are classified into four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).”

The dataset can be downloaded from here: https://github.com/mixOmicsTeam/mixOmics/raw/master/data/srbct.rda

Once the file is downloaded, you can load the data into R using the load function:

load("PATH_TO_FOLDER/srbct.rda")

Please note that PATH_TO_FOLDER must be changed to the path of the folder where you saved the file.
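The download and load steps can also be done from within R. The following is a minimal sketch, not the only way to do it: the helper name load_srbct and the use of a temporary folder are assumptions made for illustration, while the URL is the one given above.

```r
# Hypothetical helper: downloads srbct.rda and returns the srbct list.
# dest_dir is a placeholder; point it at any folder you can write to.
load_srbct <- function(dest_dir = tempdir()) {
  srbct_url <- "https://github.com/mixOmicsTeam/mixOmics/raw/master/data/srbct.rda"
  dest_file <- file.path(dest_dir, "srbct.rda")

  # mode = "wb" keeps the binary .rda file intact on Windows
  download.file(srbct_url, destfile = dest_file, mode = "wb")

  # Load into a private environment so nothing in the workspace is overwritten
  env <- new.env()
  load(dest_file, envir = env)
  env$srbct
}

# Usage (requires an internet connection):
# srbct <- load_srbct()
# str(srbct, max.level = 1)
```

Loading into a separate environment is a design choice: a plain load("PATH_TO_FOLDER/srbct.rda") works just as well, but it silently replaces any existing object named srbct.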

After loading the data, you will have a variable called srbct, which is a list containing the following elements:

gene: a data frame with 63 rows and 2,308 columns holding the expression levels of 2,308 genes in 63 subjects.

class: a vector containing the tumor class of each individual (4 classes in total).

gene.name: a data frame with 2,308 rows and 2 columns containing further information on the genes.

We can combine gene and class into a single data frame using:

srbct_data <- data.frame(srbct$gene, class = srbct$class)
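Before modelling, it is worth checking that the combined data frame has the expected shape. The sketch below uses a small toy list (hypothetical values, built only to mimic the layout described above) so that it runs without the real file; the same two checks apply to srbct_data.

```r
# Toy stand-in for srbct: hypothetical values, same layout as described above
set.seed(1)
toy <- list(
  gene  = as.data.frame(matrix(rnorm(6 * 3), nrow = 6,
                               dimnames = list(NULL, c("g1", "g2", "g3")))),
  class = factor(c("EWS", "EWS", "BL", "NB", "RMS", "RMS"))
)

# Same call as in the text, applied to the toy list
toy_data <- data.frame(toy$gene, class = toy$class)

dim(toy_data)          # one row per sample; one column per gene plus `class`
table(toy_data$class)  # number of samples in each class
```

On the real data, dim(srbct_data) should report 63 rows and 2,309 columns (2,308 genes plus class), and table(srbct_data$class) should match the class counts quoted above.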

We are going to use random forests to find the variables (genes) that are most important for discriminating the four classes.

Randomly split your data into a training set (80 percent of the data) and a testing set (20 percent of the data).
Tune the mtry hyperparameter of the random forest, using only the training data.
Fit a random forest model to the training data and estimate its error rate (either the out-of-bag error or a cross-validated error; the choice is yours).
What is the accuracy of the model when predicting the test set?
Find the top 10 most important genes for discriminating all the classes.
What are the top 10 genes for predicting the EWS class?
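One possible way to approach these steps is sketched below, assuming the randomForest package is installed. The helper names rf_exercise and top_genes, the mtry grid, and the choice of out-of-bag error for tuning are all assumptions made for illustration, not a prescribed solution; cross-validation or a different grid would be equally valid.

```r
library(randomForest)

# Hypothetical helper: runs the exercise steps on a data frame that has a
# `class` column and numeric predictors in all other columns.
rf_exercise <- function(data, train_frac = 0.8, ntree = 500, top_n = 10, seed = 1) {
  set.seed(seed)
  x <- data[, setdiff(names(data), "class"), drop = FALSE]
  y <- factor(data$class)

  # 1. Random split into training and testing sets
  train_idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  x_train <- x[train_idx, , drop = FALSE]
  y_train <- y[train_idx]
  x_test  <- x[-train_idx, , drop = FALSE]
  y_test  <- y[-train_idx]

  # 2. Tune mtry on the training data only, comparing final OOB error rates
  #    over a small grid around sqrt(p) (an arbitrary, illustrative grid)
  p <- ncol(x_train)
  mtry_grid <- unique(pmax(1, pmin(p, round(sqrt(p) * c(0.5, 1, 2, 4)))))
  oob_by_mtry <- sapply(mtry_grid, function(m) {
    fit <- randomForest(x_train, y_train, mtry = m, ntree = ntree)
    fit$err.rate[ntree, "OOB"]
  })
  best_mtry <- mtry_grid[which.min(oob_by_mtry)]

  # 3. Fit the final model; importance = TRUE also stores per-class importance
  rf <- randomForest(x_train, y_train, mtry = best_mtry, ntree = ntree,
                     importance = TRUE)

  # 4. Accuracy on the held-out test set
  accuracy <- mean(predict(rf, x_test) == y_test)

  list(model = rf,
       best_mtry = best_mtry,
       oob_error = unname(rf$err.rate[ntree, "OOB"]),
       accuracy = accuracy,
       importance = importance(rf),
       top_overall = top_genes(importance(rf), "MeanDecreaseAccuracy", top_n))
}

# Hypothetical helper: names of the n variables with the largest value in one
# column of the importance matrix (a class name or "MeanDecreaseAccuracy")
top_genes <- function(imp, column, n = 10) {
  names(head(sort(imp[, column], decreasing = TRUE), n))
}

# Usage on the data prepared above (skipped if srbct_data is not loaded)
if (exists("srbct_data")) {
  res <- rf_exercise(srbct_data)
  print(res$oob_error)                     # error rate estimated on training data
  print(res$accuracy)                      # accuracy on the test set
  print(res$top_overall)                   # top 10 genes across all classes
  print(top_genes(res$importance, "EWS"))  # top 10 genes for the EWS class
}
```

With only 63 samples, the test set contains about 13 observations, so the reported accuracy can change noticeably with the random seed. Repeating the split a few times, or using a stratified split, gives a more stable picture.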