7 Exercises
In this section we re-analyze a data set from the mixOmics package. According to the authors: “The Small Round Blue Cell Tumors (SRBCT) dataset from [Khan et al., 2001] includes the expression levels of 2,308 genes measured on 63 samples. The samples are classified into four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).”
The dataset can be downloaded from here: https://github.com/mixOmicsTeam/mixOmics/raw/master/data/srbct.rda
Once downloaded, you can load the data into R using the `load` function:
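The code for this step appears to be missing here. A minimal sketch, assuming the file kept its original name `srbct.rda` and that `PATH_TO_FOLDER` is a placeholder for your own download directory:

```r
# PATH_TO_FOLDER is a placeholder, not a real path: replace it with the
# directory that contains the downloaded srbct.rda file.
load(file.path("PATH_TO_FOLDER", "srbct.rda"))

# load() restores the objects stored in the file under their saved names;
# here that should be a single list called `srbct`.
names(srbct)
```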
Please note that `PATH_TO_FOLDER` must be changed to the path of the folder you downloaded the file into.
After loading the data, we will have a variable called `srbct`, which is a list containing the following:
- `gene`: a data frame with 63 rows and 2,308 columns holding the expression levels of 2,308 genes in 63 subjects.
- `class`: a vector containing the tumor class of each individual (4 classes in total).
- `gene.name`: a data frame with 2,308 rows and 2 columns containing further information on the genes.
We can combine `gene` and `class` into a single data frame using:
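The original code for this step is not shown, so the following is only one possible way to do it. The name `srbct_df` is a hypothetical choice, and the snippet assumes `srbct` has already been loaded as described above:

```r
# Put the class labels in the first column, followed by one column per gene.
# Assumes `srbct` is already in the workspace; `srbct_df` is an illustrative name.
srbct_df <- data.frame(class = srbct$class, srbct$gene)

# Expect 63 rows and 2,309 columns (1 class column + 2,308 genes).
dim(srbct_df)
```

Keeping the outcome and the predictors in one data frame makes it easy to split rows into training and testing sets without the labels getting out of sync.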
We are going to use random forests to find variables that are important for discriminating the 4 classes.
- Randomly split your data into a training (80 percent of the data) and testing set (20 percent of the data).
- Tune the `mtry` hyperparameter of the random forest (using only the training data).
- Fit a random forest model to the training data and find its error rate (either out-of-bag error or cross-validation; it is up to you!).
- What is the accuracy of predicting the test set?
- Find the top 10 most important genes for discriminating all the classes.
- What are the top 10 genes for predicting the EWS class?
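As a starting point, the steps above could be approached roughly as follows. This is a sketch, not a reference solution: it assumes the `randomForest` package is installed and that `srbct` has been loaded, and the seed, the `mtry` grid, the number of trees, and all variable names are arbitrary illustrative choices. Tuning here uses a simple grid search on the out-of-bag (OOB) error.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so that the split is reproducible

# Combined data frame (hypothetical name); assumes `srbct` is already loaded.
srbct_df <- data.frame(class = srbct$class, srbct$gene)

# 1. Random 80/20 split into training and testing sets.
n <- nrow(srbct_df)
train_idx <- sample(n, size = round(0.8 * n))
train <- srbct_df[train_idx, ]
test <- srbct_df[-train_idx, ]
x_train <- train[, names(train) != "class"]
x_test <- test[, names(test) != "class"]

# 2. Tune mtry on the training data only, comparing final OOB error rates.
#    The grid is an assumption, centred on the default floor(sqrt(2308)) = 48.
mtry_grid <- c(12, 24, 48, 96, 192)
oob_by_mtry <- sapply(mtry_grid, function(m) {
  fit <- randomForest(x = x_train, y = train$class, mtry = m, ntree = 500)
  fit$err.rate[fit$ntree, "OOB"]
})
best_mtry <- mtry_grid[which.min(oob_by_mtry)]

# 3. Fit the final model and read off its OOB error rate.
rf <- randomForest(x = x_train, y = train$class, mtry = best_mtry,
                   ntree = 1000, importance = TRUE)
oob_error <- rf$err.rate[rf$ntree, "OOB"]

# 4. Accuracy on the held-out test set.
pred <- predict(rf, newdata = x_test)
accuracy <- mean(pred == test$class)

# 5. Variable importance: one column per class, plus the overall
#    MeanDecreaseAccuracy and MeanDecreaseGini columns.
imp <- importance(rf)
top10_all <- head(sort(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), 10)
top10_ews <- head(sort(imp[, "EWS"], decreasing = TRUE), 10)

oob_error
accuracy
names(top10_all)
names(top10_ews)
```

With only 63 samples, the test set holds about a dozen observations, so the accuracy estimate will vary noticeably between random splits. Repeating the split with different seeds gives a better sense of that variability.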