In this section we want to re-analyze a data set from the mixOmics package. According to the authors: “The Small Round Blue Cell Tumors (SRBCT) dataset from Khan et al. (2001) includes the expression levels of 2,308 genes measured on 63 samples. The samples are classified into four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).”
The dataset can be downloaded from here: https://github.com/mixOmicsTeam/mixOmics/raw/master/data/srbct.rda
Once downloaded, you can load the data into R using the load function, pointing it at the srbct.rda file inside PATH_TO_FOLDER.
Please note that PATH_TO_FOLDER must be changed to the path of the folder you downloaded the file into.
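The loading step could look like the minimal sketch below. It assumes the downloaded file kept its name srbct.rda; srbct_file is a helper variable introduced here for illustration, and the file.exists check is only a convenience so that a wrong path gives a readable message.

```r
# PATH_TO_FOLDER is a placeholder: replace it with the folder holding srbct.rda.
srbct_file <- file.path("PATH_TO_FOLDER", "srbct.rda")

if (file.exists(srbct_file)) {
  # load() restores the objects stored in the .rda file into the workspace
  # and returns their names (here we expect an object called srbct).
  loaded_objects <- load(srbct_file)
  print(loaded_objects)
} else {
  message("File not found, please check the path: ", srbct_file)
}
```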
After loading the data, we will have a variable called srbct, which is a list containing the following:
- gene: a data frame with 63 rows and 2,308 columns, holding the expression levels of 2,308 genes in 63 subjects.
- class: a vector containing the tumor class of each individual (4 classes in total).
- gene.name: a data frame with 2,308 rows and 2 columns containing further information on the genes.
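A quick way to confirm that the loaded object matches this description is sketched below. summarise_srbct is a hypothetical helper written for this exercise, not a mixOmics function; it only assumes the three components listed above.

```r
# Hypothetical helper: collect the basic shape of the srbct list.
summarise_srbct <- function(x) {
  list(
    components    = names(x),            # names of the list elements
    gene_dim      = dim(x$gene),         # samples x genes
    class_counts  = table(x$class),      # number of samples per tumor class
    gene_name_dim = dim(x$gene.name)     # genes x annotation columns
  )
}

# Only runs once srbct has been loaded into the workspace.
if (exists("srbct")) {
  print(summarise_srbct(srbct))
}
```

If the data were loaded correctly, gene_dim should report 63 rows and 2,308 columns, and class_counts should show the four classes quoted at the start of this section.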
We can combine gene and class into a single data frame, for example with the data.frame function.
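One possible way to build the combined data frame is sketched here. The names combine_srbct and data_set are assumptions made for illustration; the tumor class is stored as a factor in a column called class so that it can later be used as a classification outcome.

```r
# Hypothetical helper: put the class labels and the expression values
# side by side in one data frame (first column = class, then one column per gene).
combine_srbct <- function(x) {
  data.frame(class = factor(x$class), x$gene)
}

# Only runs once srbct has been loaded into the workspace.
if (exists("srbct")) {
  data_set <- combine_srbct(srbct)
  print(dim(data_set))   # 63 rows; 2,308 gene columns plus the class column
}
```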
We are going to use random forests to find variables that are important for discriminating the 4 classes.
- Randomly split your data into a training (80 percent of the data) and testing set (20 percent of the data).
- Tune a hyperparameter (mtry) of the random forest (only on the training data).
- Fit a random forest model to the training data and find its error rate (either out-of-bag error or cross-validation; it is up to you!).
- What is the accuracy of predicting the test set?
- Find the top 10 most important genes for discriminating all the classes
- What are the top 10 genes for predicting the EWS class?
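
The steps above can be sketched as follows. This is one possible approach under stated assumptions, not a reference solution: it uses the randomForest package, tunes mtry with a small hand-picked grid scored by out-of-bag (OOB) error, and ranks genes by mean decrease in accuracy. The function rf_workflow, its arguments, the grid of mtry values, and the data frame name data_set (with a factor column called class) are all introduced here for illustration.

```r
library(randomForest)

# Hypothetical helper covering the exercise steps on a data frame that has
# a factor column "class" and one numeric column per gene.
rf_workflow <- function(data_set, target_class, train_prop = 0.8,
                        ntree = 500, top_n = 10, seed = 1) {
  stopifnot(is.factor(data_set$class),
            target_class %in% levels(data_set$class))
  set.seed(seed)  # make the random split and the forests reproducible

  # 1. Random split into training and test sets.
  n <- nrow(data_set)
  train_idx <- sample(n, size = round(train_prop * n))
  train <- data_set[train_idx, ]
  test  <- data_set[-train_idx, ]
  gene_cols <- setdiff(names(data_set), "class")

  # 2. Tune mtry on the training data only, comparing OOB error rates
  #    over a small grid centred on sqrt(number of predictors).
  p <- length(gene_cols)
  mtry_grid <- unique(pmax(1, pmin(p, round(sqrt(p) * c(0.5, 1, 2, 4)))))
  oob_by_mtry <- sapply(mtry_grid, function(m) {
    fit <- randomForest(x = train[, gene_cols], y = train$class,
                        mtry = m, ntree = ntree)
    fit$err.rate[ntree, "OOB"]  # OOB error after the last tree
  })
  best_mtry <- mtry_grid[which.min(oob_by_mtry)]

  # 3. Final model on the training data, keeping importance measures.
  rf <- randomForest(x = train[, gene_cols], y = train$class,
                     mtry = best_mtry, ntree = ntree, importance = TRUE)

  # 4. Accuracy on the held-out test set.
  pred <- predict(rf, newdata = test[, gene_cols])
  test_accuracy <- mean(as.character(pred) == as.character(test$class))

  # 5. Variable importance: overall and for the class of interest.
  imp <- importance(rf)
  top_overall <- head(sort(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), top_n)
  top_class   <- head(sort(imp[, target_class], decreasing = TRUE), top_n)

  list(best_mtry = best_mtry,
       oob_error = rf$err.rate[ntree, "OOB"],
       test_accuracy = test_accuracy,
       top_overall = top_overall,
       top_class = top_class)
}

# Only runs once the combined data frame exists in the workspace.
if (exists("data_set")) {
  res <- rf_workflow(data_set, target_class = "EWS")
  print(res)
}
```

With only 63 samples the test set holds about 13 individuals, so the test accuracy will vary noticeably between random splits; changing seed is an easy way to see this. The OOB error is used here because it comes for free with the forest, but cross-validation on the training set would be an equally valid choice.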