Using mmRmeta
================
Sebastian Lieske
15 January 2020
# Introduction
The R package mmRmeta is a framework for analyzing multi-omics data
such as RNA-seq and DNA methylation data with a focus on bimodally
expressed genes. The workflow of mmRmeta consists of three major steps.
First, the data is acquired and preprocessed. Second, bimodally
expressed genes are identified; mmRmeta uses mclust for model-based
clustering to obtain parameter estimates and the cluster assignment of
each sample. The third step is the multi-omics analysis, which relates
bimodally expressed genes to the phenotypes of the patient population
by fitting regularized logistic regressions and conducting survival
analyses with Cox proportional hazards models.
<center>
![Overview mmRmeta](overview.svg)
</center>
# Installation
Before doing the data analysis you need to install and load the package.
``` r
devtools::install_github("sebastianlieske/mmRmeta", host = "github.molgen.mpg.de/apu/v3") # link to package
library(mmRmeta)
```
Seven additional packages are required; they can be installed from
Bioconductor.
``` r
requiredBio <- c("Biobase", "biomaRt", "edgeR", "GenomicRanges", "SummarizedExperiment", "EnsDb.Hsapiens.v86", "ComplexHeatmap")
for (i in requiredBio) {
  BiocManager::install(i)
}
```
# Preprocessing
Let us start with preprocessing the data. To follow this vignette,
make sure you have the data available in your R environment, either as
a SummarizedExperiment containing the gene expression and clinical
data, or as a gene expression count matrix together with clinical data
in a data frame. In this example, the data was downloaded from NCI’s
[Genomic Data Commons Data
Portal](https://portal.gdc.cancer.gov/) via
[TCGAbiolinks](https://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html).
To see an example of how to download the data, please go to the
section [Data Acquisition](#anchor1). To combine an RNA expression
count matrix and additional data into a summarized experiment, head to
[Make Summarized Experiment](#anchor2).
## Data Acquisition
To obtain the data needed for the analysis, you first have to install
two packages: BiocManager, which installs other packages from
Bioconductor, and TCGAbiolinks.
``` r
if(!require(BiocManager)) install.packages("BiocManager", repos = "http://cran.us.r-project.org")
if(!require(TCGAbiolinks)) BiocManager::install("TCGAbiolinks", dependencies = TRUE)
```
Now you are set to download the data. You can browse
<https://portal.gdc.cancer.gov/repository> for the primary site of the
tumor you are interested in. We need the raw RNA-seq counts from the
HTSeq workflow. The package mmRmeta also allows the analysis of
additional multi-omics data such as DNA methylation; an example of
that is shown as well.
``` r
library("TCGAbiolinks")
if (!dir.exists("GDCdata")) dir.create("GDCdata") #create a new folder for the data to be downloaded
#obtain the query result for available data in the database
queryLGG <- GDCquery(project = c("TCGA-LGG"), #low grade glioma (brain cancer)
                     data.category = c("Transcriptome Profiling"),
                     data.type = "Gene Expression Quantification",
                     workflow.type = "HTSeq - Counts", legacy = F)
#with the query we are able to download the found data
GDCdownload(query = queryLGG, method = "api",
            files.per.chunk = 40, directory = "GDCdata") #can take several minutes
#read the downloaded data and transform it into an R object
#if you encounter an error, please install the most recent version of Bioconductor
rnaCountLGG <- GDCprepare(queryLGG, summarizedExperiment = TRUE,
                          directory = "GDCdata") #create a summarized experiment
#lastly, save the object as an .rds file
saveRDS(rnaCountLGG, "rnaCountLGG.RDS")
```
To download DNA methylation data, we need to adjust the data category
and platform. Due to the size of the files, the download will take
more time.
``` r
library("TCGAbiolinks")
if (!dir.exists("GDCdata")) dir.create("GDCdata") #create a new folder for the data to be downloaded
queryMethLGG <- GDCquery(project = "TCGA-LGG",
                         data.category = "DNA Methylation",
                         platform = "Illumina Human Methylation 450",
                         legacy = F)
GDCdownload(query = queryMethLGG, method = "api",
            files.per.chunk = 40, directory = "GDCdata")
methylLGG <- GDCprepare(queryMethLGG, summarizedExperiment = T, directory = "GDCdata")
saveRDS(methylLGG, "methylLGG.RDS")
```
Now you have a summarized experiment object saved as an .rds file.
This data needs to be preprocessed because it contains redundant and
missing entries. For that, continue with [Preprocessing Summarized
Experiment](#anchor2).
## Make Summarized Experiment
If you already have your raw counts in a matrix and additional
clinical data of the patients or samples in a data frame, you can
combine them into a summarized experiment to use this R package to its
fullest extent. To learn more about the summarized experiment, head
over to
[Bioconductor](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html)
or go to the section [Structure of a Summarized Experiment](#anchor3).
Make sure that each row corresponds to a single gene and each column
to a patient, and that every patient has a unique identifier. In the
clinical data, the patients or samples should be in the rows and the
variables such as age, gender, and tumor type should be in the
columns. Generally, make sure that your data is
[tidy](https://r4ds.had.co.nz/tidy-data.html).
### Structure of a Summarized Experiment
<center>
![SummarizedExperiment](SEedited.svg)
</center>
To better understand what certain functions do, here is a short
overview of the SummarizedExperiment (SE) class. Rows represent
features such as gene counts, while columns represent samples. All
functions of this package are designed to work either on simple
matrix-like objects or on the SE container. In this workflow, the
assays store the gene expression, classification, and DNA methylation
data, the colData stores the clinical data of each sample or patient,
and the rowRanges contains genomic features such as the location of
each gene on the chromosome. Because of the design of the SE class,
the data always stays in sync: when rows or columns are deleted from
the RNA-seq assay, the corresponding entries are removed from the
other components as well.
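As a quick orientation, the sketch below shows how these components
can be accessed with the standard SummarizedExperiment accessors,
using the `rnaCountLGG` object from the data acquisition step as an
example.
``` r
library(SummarizedExperiment)
assayNames(rnaCountLGG)           #lists the stored assays, e.g. the raw HTSeq counts
counts <- assay(rnaCountLGG, 1)   #genes x samples count matrix
head(colData(rnaCountLGG))        #clinical annotation of each sample/patient
rowRanges(rnaCountLGG)            #genomic coordinates of every gene as a GRanges object
#subsetting keeps assays, colData and rowRanges in sync
subsetSE <- rnaCountLGG[1:100, 1:10] #first 100 genes, first 10 samples
```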
## Preprocessing Summarized Experiment
By now you have the necessary data ready. This R package provides a
variety of functions for data cleansing and data reduction. They were
primarily made to work on data obtained from the GDC Data Portal, but
with the right attribute customization they can be used on other
tabulated data as well.
``` r
#load data
if (!exists("rnaCountLGG")) {
rnaCountLGG <- readRDS("rnaCountLGG.RDS")
}
```
Due to the data design, the colData of the summarized experiment may
contain nested elements, such as a list inside a single cell. One
column in particular causes trouble.
``` r
#delete column "treatments" because its leading to errors and it contains no information
library(mmRmeta)
if ("treatments" %in% colnames(colData((rnaCountLGG)))) {
rnaCountLGG$treatments <- NULL
}
```
The preprocessing of the summarized experiment consists of two parts:
the first one handles the gene expression matrix in the assay and the
second one deals with the colData containing the clinical data. Both
need to be altered for downstream functions to properly work.
### Processing colData
Starting with the clinical data in the colData, there are several
things to pay attention to. We want to filter out columns that contain
only missing values, convert all categorical entries to factors, and
remove columns with only a single factor level. Additionally, we only
want to keep samples from primary solid tumors and filter out
metastatic samples.
``` r
#new object for applied changes
editedCountLGG <- filter_for_factor(rnaCountLGG,
                                    column_value = "Primary solid Tumor") #only keep primary solid tumours
editedCountLGG <- preprocess_coldata(editedCountLGG)
editedCountLGG <- filter_columns_with_na(editedCountLGG, unnest = T, percentage = 0.8,
                                         keep = "days_to_death") #drop columns with 80% missing values and columns with "_id" in their name
editedCountLGG <- factorize_columns(editedCountLGG)
editedCountLGG <- drop_unused_levels(editedCountLGG)
editedCountLGG <- filter_small_binary(editedCountLGG, min_counts = 10) #drop binary factor columns where one level has less than 10 counts
editedCountLGG <- filter_single_level(editedCountLGG) #drop factor columns with only one level
colnames(editedCountLGG) <- substr(colnames(editedCountLGG), 1, 12) #shorten barcodes to patient identifiers
```
The function *preprocess\_coldata* incorporates three single functions
that can also be applied individually. With the default parameters,
all non-numerical columns are converted to factors, columns containing
only NAs or “not reported” are deleted, and columns with only a single
factor level are discarded.
#### Additional Functions for Coldata
These functions are called by *preprocess\_coldata*:
  - factorize\_columns()
  - filter\_columns\_with\_na()
  - filter\_single\_level()
There are some more functions to further filter and restructure the
colData:
  - reorder\_all() = reorders factor levels by count, important for
    figure legends and glmnet
  - collapse\_column() = combines factor levels under a given name,
    useful for small factor levels
  - drop\_unused\_levels()
To successfully conduct the survival analysis, the associated data
columns need to be transformed into a standardized format. This
functionality is implemented by the function add\_coldata\_survival(),
which requires information on the time of the event, the event status
(dead or alive), and the time of censoring.
``` r
editedCountLGG <- add_coldata_survival(editedCountLGG)
```
### Processing Assay
Now the gene expression matrix needs some attention. When analyzing
RNA-seq data, you want to filter out genes with overall low or zero
read counts to save memory and computational time. Duplicated genes
also need to be filtered out.
``` r
#new object for applied changes
editedCountLGG <- filter_chromosome(editedCountLGG, chromosome = c("chrY", "chrX")) #filter genes on x and y chromosome
editedCountLGG <- preprocess_assay(editedCountLGG)
```
In this case we filter out genes located on the Y and X chromosomes
because they would be flagged as bimodal simply due to the sex of the
patients. It is necessary that the summarized experiment also contains
a GRanges object. Applying *preprocess\_assay* discards duplicated
genes as well as rarely expressed genes: a gene is kept only if at
least 5% of all samples have a read count of at least 5, and
additionally at least 10% of the samples with any reads for that gene
must have more than 5 reads, because a gene is considered expressed
when it reaches a count of roughly 5-10 in a library.
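The exact filtering is implemented inside *preprocess\_assay*; the
snippet below is only a rough, hand-written illustration of an
expression filter of this kind (the thresholds follow the description
above, the actual package code may differ).
``` r
#illustration only: keep genes for which at least 5% of all samples have a count of at least 5
counts <- assay(editedCountLGG, 1)
fracAtLeast5 <- rowMeans(counts >= 5) #fraction of samples with >= 5 reads per gene
keep <- fracAtLeast5 >= 0.05          #gene is detected in at least 5% of samples
table(keep)
```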
Some patients have multiple samples in the expression matrix. We want
only one entry per patient, so we merge those additional samples by
calculating either the mean or the median of all samples belonging to
one patient.
``` r
editedCountLGG <- merge_duplicated(editedCountLGG, assay_index = 1,
                                   method = "median", shorten_names = T, original_names = T,
                                   parallel = T) #merge duplicated samples per patient
editedCountLGG <- reorder_all(editedCountLGG) #reorder factors by counts
```
When working with TCGA data, all patients and samples have a barcode,
and the barcodes of the samples are the column names of the expression
matrix. The first 12 characters of a barcode identify the patient and
the following characters specify the sample.
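For illustration, here is how the shortening above behaves on a
correctly formatted example barcode (the barcode itself is only
illustrative).
``` r
barcode <- "TCGA-02-0001-01C-01D-0182-01" #example TCGA barcode
substr(barcode, 1, 12)  #"TCGA-02-0001": identifies the patient
substr(barcode, 14, 15) #"01": sample type code, 01 = primary solid tumor
```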
#### Additional Functions for Assay
These functions are called by *preprocess\_assay*:
  - filter\_duplicates()
  - filter\_rare()
Additionally, these functions are available:
  - use\_silvermantest() = applies the Silverman test to every gene
  - use\_diptest() = applies the dip test to every gene
After gene filtering, the gene expression counts are normalized and
transformed to log counts by add\_assay\_norm() and transform\_assay(),
respectively. To ensure the availability of the original counts, any
alteration of the data is saved as a new assay which is added to the
SummarizedExperiment. The Trimmed Mean of M values (TMM) method was
adopted from the edgeR Bioconductor package.
``` r
#add normalized assay, log transform assay
editedCountLGG <- add_assay_norm(editedCountLGG, 1, method = "TMM",
                                 name = "normalized", add_info_metadata = F) #add new normalized assay
editedCountLGG <- transform_assay(editedCountLGG, assay_index = 2) #log2 transform second assay
```
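For reference, a minimal sketch of the underlying edgeR calls that
such a TMM normalization builds on; *add\_assay\_norm* wraps this kind
of computation and stores the result as a new assay (the exact output
of the wrapper may differ).
``` r
library(edgeR)
rawCounts <- assay(editedCountLGG, 1)
dge <- DGEList(counts = rawCounts)          #container for counts and library sizes
dge <- calcNormFactors(dge, method = "TMM") #compute TMM normalization factors
normLogCpm <- cpm(dge, log = TRUE)          #normalized, log2-transformed counts per million
```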
#### Preprocessing Methylation Data
When preprocessing the methylation data you can use the same functions
as well.
``` r
methylLGG$treatments <- NULL
methylLGG <- filter_columns_with_na(methylLGG)
methylLGG <- filter_duplicates(methylLGG)
methylLGG <- filter_zero(methylLGG)
colnames(methylLGG) <- substr(colnames(methylLGG), 1, 12) #shorten IDs
```
# Finding Bimodal Genes
## Using SIBERG
After preprocessing we are set to analyse the data, starting with
finding bimodal genes. Our matrix has several thousand genes and with
the help of
[SIBERG](https://cran.r-project.org/web/packages/SIBERG/index.html) we
can prefilter possible bimodal candidates relatively fast.
``` r
siberLGG <- use_siber(editedCountLGG, 1) #use SIBER in parallel
#order by descending bimodal index
library(dplyr)
siberLGG <- siberLGG %>% tibble::rownames_to_column() %>% arrange(desc(BI)) %>% tibble::column_to_rownames()
```
When using this function, one obtains a matrix with six columns. We
are interested in the last column, which contains the [bimodal
index](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2730180/). The
bimodal index indicates how clearly the expression of a gene separates
into two modes and therefore how confidently the gene can be
considered bimodal. Theoretically, the BI ranges from zero to
infinity, but in practice it rarely exceeds a value of about 3.
Generally, the higher the index, the more likely the gene is bimodal.
SIBERG uses a generalized bimodal index that allows for different
variances in the two modes. Having that index, we can set a cutoff and
discard every gene below it.
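For intuition, the classical bimodality index of Wang et al. combines
the standardized distance between the two modes with how balanced they
are; the small hand-rolled sketch below shows the equal-variance form
(SIBERG itself uses a generalized version that allows unequal
variances).
``` r
#classical bimodality index for a two-component Gaussian mixture (equal-variance form)
bimodal_index <- function(mu1, mu2, sigma, pi1) {
  delta <- abs(mu1 - mu2) / sigma #standardized distance between the two modes
  sqrt(pi1 * (1 - pi1)) * delta   #penalizes very unbalanced mixtures
}
bimodal_index(mu1 = 2, mu2 = 7, sigma = 1, pi1 = 0.4) #well-separated, fairly balanced modes
```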
Next, we want to filter out genes with a bimodal index below 1.5.
``` r
bimodalNames <- rownames(siberLGG[siberLGG$BI >= 1.5,]) #names of bimodal genes
editedCountLGG <- editedCountLGG[bimodalNames,] #subset
```
With that we have a much smaller summarized experiment with fewer
than 300 rows/genes that are normalized and log2 transformed. The
remaining genes are candidates for being bimodal. In the next section
we are going to narrow down those bimodal genes and use model-based
classification to obtain the modality of every patient for each gene.
## Using mclust for Classification
After applying SIBERG, our summarized experiment has two assays and a
set of bimodal candidates. Now we are going to use model-based
clustering with
[mclust](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/) to
classify the samples of each gene into modes/groups. Mclust uses an
expectation–maximization algorithm for maximum likelihood estimation.
``` r
resultClust <- use_mclust(editedCountLGG, 2, G = c(1:2), modelNames = "V") #apply mclust on normalized counts
```
With that function call, mclust calculates the BIC for having one or
two mixture components (clusters) in the univariate data, and we allow
the model to have variable, i.e. unequal, variance in the mixture
components. We obtain a list with the results for every gene; mclust
returns the model with the highest BIC. Due to the data properties
there may be flawed results, and those need to be corrected.
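Assuming each list element is a plain `Mclust` fit (which is what
mclust itself returns; the exact structure produced by *use\_mclust*
may differ), a single gene's result can be inspected like this:
``` r
oneGene <- resultClust[[1]]   #result for the first gene
summary(oneGene)              #chosen model, number of components, BIC
oneGene$G                     #number of mixture components selected
oneGene$parameters$mean       #estimated mode means
table(oneGene$classification) #how many samples fall into each mode
```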
### Verify Results
When applying mclust with variable variance, it can happen that the
data is not classified correctly. Samples with the lowest gene
expression are classified into the higher mode and vice versa:
<center>
![Misclassification](misclassification.svg)
</center>
Those flaws have to be corrected.
``` r
correctedClust <- verify_mclust(resultClust, simple = T) #look for misclassification
correctedClust <- verify_mclust(correctedClust, min_proportion = 0.10) #look for imbalanced models
correctedClust <- verify_mclust(correctedClust, simple = T) #look for misclassification again
```
The function *verify\_mclust* iterates through the list and searches
for flawed results. It has two modes of operation. Without the
parameter *min\_proportion* it looks for misclassified samples; with
that parameter it searches for results with an imbalanced proportion
of the modes. In this example we want models where the smaller mode
contains at least 10% of the samples. If the function encounters
inconsistencies or flaws, it calls *use\_mclust* again with different
parameters. Next, we filter out results that returned a unimodal model
as well as models with a low fold change.
``` r
correctedClust <- filter_unimodal(correctedClust) #filter out unimodal models
correctedClust <- filter_proportion(correctedClust, proportion = 0.10) #filter out imbalanced proportions
correctedClust <- filter_fold_change(correctedClust, min_fold = 2) #filter out models with a low fold change
```
Now that we have filtered the results, we can add the classification
from mclust to our summarized experiment.
``` r
#add new assay to SE
editedCountLGG <- add_assay_classification(editedCountLGG, correctedClust, index = 3)
#save .RDS
saveRDS(editedCountLGG, "classifiedLGG.RDS")
```
A new assay with the classification matrix was added to the SE. In the
next section we are going to analyse the relation between bimodal genes,
DNA methylation and the clinical data.
# Multi-Omics Analysis
The following part covers the analysis of the additional clinical
data stored in the colData of our summarized experiment together with
the DNA methylation data. The goal is to find possible explanations or
correlations between the features of the samples and the bimodal
expression of certain genes or, from another angle, to find relations
between bimodal genes and their possible influence on patient
characteristics such as survival probability. To achieve this, the
package uses penalized generalized linear models via
[glmnet](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html)
and Cox proportional hazards models for survival analysis.
## Glmnet for Multiple Logistic Regression
For every gene, a generalized linear model is fitted to predict the
mode of the samples. Because we are looking at bimodal genes, each
gene has two classes, labeled 1 and 2; the class is the response
variable. Additionally, there is a variety of predictors in the
clinical data, for example categorical predictors such as gender,
tumour type or grade, and numerical predictors such as the age of the
patients. Due to this abundance, variable selection is needed to
shrink the model and to reduce overfitting.
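To make this step concrete, here is a minimal sketch of the kind of
penalized logistic regression that *use\_glmnet* builds on, shown
directly with glmnet on simulated data (alpha = 0.5 gives the elastic
net mixture of lasso and ridge; the package wrapper adds the nested
cross-validation and bookkeeping).
``` r
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)  #100 samples, 20 predictors
y <- factor(rbinom(100, 1, plogis(x[, 1] - x[, 2]))) #binary response, i.e. the two modes
cvFit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5,
                   type.measure = "auc", nfolds = 5)  #cross-validation selects lambda
coef(cvFit, s = "lambda.1se") #sparse set of active predictors
```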
### Model Matrix
Our data consists of a variety of qualitative and quantitative
variables. For our multivariate analysis we want to use any variable
that seems interesting to us. First, we need to create a model matrix
(design matrix) that contains the values of our explanatory variables
(columns in colData). Each row represents a sample and each column
corresponds to a variable.
``` r
#you can get a first glimpse of which columns have many missing values
sum_is_na(editedCountLGG)
#create a vector with column names from our data
variables <- c("age_at_diagnosis", "race", "gender", "primary_diagnosis", "paper_Mutation.Count",
               "paper_TERT.expression..log2.", "paper_Percent.aneuploidy", "paper_ESTIMATE.combined.score",
               "paper_ATRX.status", "paper_MGMT.promoter.status", "paper_IDH.codel.subtype",
               "paper_Chr.7.gain.Chr.10.loss", "paper_Chr.19.20.co.gain")
#create the model matrix
modelMatrix <- make_model_matrix(editedCountLGG, predictors = variables,
                                 zscore = T, categorical_caret = T, impute_na = T, fun = mean)
```
This function fulfills several tasks. Because glmnet cannot use
factors directly, every categorical variable is transformed into dummy
variables with values of 0 or 1. Furthermore, every continuous
variable (e.g. age) is z-score scaled. Standardizing features is an
important step before model fitting; otherwise, differently scaled
features would influence the penalization negatively. Additionally,
missing continuous entries are imputed. Unfortunately, due to missing
data values, the model may still contain fewer samples than the
original data.
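As a rough illustration of what this encoding amounts to, the base R
sketch below builds a 0/1 dummy column and z-scores a continuous
variable on a toy clinical table (*make\_model\_matrix* additionally
handles the caret-style contrasts and works on the whole colData).
``` r
#toy clinical table, for illustration only
clin <- data.frame(age = c(45, 60, 52, NA),
                   gender = factor(c("female", "male", "female", "male")))
clin$age[is.na(clin$age)] <- mean(clin$age, na.rm = TRUE)     #impute missing age with the mean
clin$age <- as.numeric(scale(clin$age))                       #z-score scaling
mm <- model.matrix(~ gender, data = clin)[, -1, drop = FALSE] #0/1 dummy column for gender
cbind(age = clin$age, mm)
```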
### Nested k-fold Cross-Validation
Next up, we are going to fit a regression model with elastic net
penalization via glmnet. Two examples are given: a simple one that
only takes the colData into account, and a second one with additional
DNA methylation data, which needs more steps. Our model matrix has
many variables from which the final model can draw its coefficients.
We want to know whether the classification of the bimodality can be
predicted by a model containing the clinical data, which predictors
can model the bimodality, and to what extent. Because of the abundance
of predictor variables we apply variable selection.
``` r
#example without methylation
resultModel <- use_glmnet(editedCountLGG, assay_classification = 3, model_matrix = modelMatrix,
                          parallel = F, col_names = variables, use_glm = T, repeats = 1,
                          fast_cv = F, standardize = F, stratify_class = T)
```
Depending on the parameters, we obtain a large list with two parts:
the first is a list of glmnet results for each gene and the second
contains the cross-validated error scores (AUC values). With the
function *make\_result\_matrix* we get a better overview of the results
and can order them by AUC.
``` r
resultCombined <- combine_list(resultModel) #give results gene-wise
resultLGG <- lapply(resultCombined, function(x) {
  unlist(make_result_matrix(x, result_cv = 2,
                            fun1 = mean, fun2 = sd, predictors = variables, repeated = F,
                            include_auc_range = T)) #make a result matrix for every gene
})
resultLGG <- do.call(rbind, args = resultLGG) #combine them all
```
Keep in mind that we performed a nested k-fold cross-validation. By
considering the range/standard deviation and the mean of the area
under the curve (AUC) values, we can get an idea of how stable the
model is and to what extent we can rely on the coefficients obtained
from it. From the resulting matrix you can select and filter genes
that reach a certain AUC value with a small standard deviation.
``` r
filteredResults <- data.frame(resultLGG) %>% filter(between(round(mean, digits = 2), 0.8, 1)) %>%
  filter(between(round(sd, digits = 3), 0, 0.1)) #keep genes with a mean AUC between 0.8 and 1 and an sd between 0 and 0.1
```
#### Example with DNA Methylation
In this section, an example of how to utilize DNA methylation in the
analysis is given. First, the downloaded data needs to be preprocessed
as well.
``` r
if (!exists("methylLGG")) {
methylLGG <- readRDS("methylLGG.RDS")
}
methylLGG$treatments <- NULL
methylLGG <- merge_duplicated_columns(methylLGG)
methylLGG <- filter_duplicates(methylLGG)
methylLGG <- filter_zero(methylLGG)
colnames(methylLGG) <- substr(colnames(methylLGG), 1, 12)
```
The downloaded methylation data has to be mapped to the genes. The
information on the genomic location of the genes is stored as GRanges
in the rowRanges object of the SummarizedExperiment. The first part
consists of annotating the CpG probes by overlapping the GRanges of
the probes with general genomic locations. The EnsDb.Hsapiens.v86
library (Rainer 2017) is used to retrieve the annotation and GRanges
of genes, exons, and introns. Promoter regions are defined by a set
distance from the start of a gene; the default is 2000 bases upstream
and 200 bases downstream of the TSS. The function
*methylation\_cpg\_annotation()* overlaps the GRanges of the
methylation probes with the retrieved genomic information. Each mapped
probe is assigned a prefix indicating whether it matched a promoter,
exon, or intron. Multiple probes matching the same location are
ordered and marked by a secondary prefix. Genes can have zero, a few,
or several hundred mapping probes; in general, longer genes have more
mapping probes.
To start, we need the genomic information of the genes from Ensembl.
``` r
#at first obtain the genomic information
library(EnsDb.Hsapiens.v86) #also attaches ensembldb and its annotation filters
edb <- EnsDb.Hsapiens.v86
seqlevelsStyle(edb) <- "UCSC"
exons <- ensembldb::exonsBy(edb, by = "gene", filter = AnnotationFilterList(
  SeqNameFilter(c(1:22, "X", "Y")),
  GeneIdFilter("ENSG", "startsWith")))
genes <- genes(edb, filter = AnnotationFilterList(
  SeqNameFilter(c(1:22, "X", "Y")),
  GeneIdFilter("ENSG", "startsWith")))
```
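The promoter regions mentioned above can be derived from these gene
ranges; a minimal sketch with the GenomicRanges `promoters()` helper,
using the default window of 2000 bases upstream and 200 bases
downstream of the TSS (the annotation function handles this
internally, so this is only for illustration).
``` r
library(GenomicRanges)
promoterRanges <- promoters(genes, upstream = 2000, downstream = 200) #strand-aware windows around each TSS
head(promoterRanges)
```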
Then the probes in the DNA methylation assay are mapped to the genes.
All mapped probes and their beta values are transformed into a model
matrix as well and stored in a list.
``` r
namelist <- as.list(rownames(editedCountLGG)) #names of the genes
cpgGeneList <- lapply(namelist, function(x)
  methylation_cpg_annotation(edb, methylLGG, 1, genes = genes, gene_name = x, exons = exons)) #get the annotation for every gene
names(cpgGeneList) <- rownames(editedCountLGG)
geneMethylList <- lapply(cpgGeneList, function(x)
  methylation_beta_value(x, methylLGG, shorten_colnames = T, impute_na = T, fun = mean)) #get the beta value for every mapped probe
geneMethylList <- lapply(geneMethylList, function(x)
  if (!is.null(colnames(x))) rename_margin(x, margin = 2, cut_from = 1, cut_to = 12, char_from = "\\.", "-") else x)
methylModelList <- lapply(geneMethylList, function(x)
  if (!is.null(x)) assay_to_model(x) else x) #turn beta values into model matrices
methylModelList <- lapply(methylModelList, function(x)
  if (!is.null(x)) rename_margin(x, margin = 1, char_from = "\\.", "-") else x)
```
Now we have a list of model matrices, one for every mapped gene. Some
genes have no mapping CpG probes and their entries are empty. Now
we're ready to apply glmnet.
``` r
methylpred <- c("promotor", "exon_first", "exon_inter", "exon_last", "intron") #predictor names for methylation
predictors <- c(methylpred, variables) #all predictor variables
intersectn <- intersect(rownames(methylModelList[[1]]), rownames(modelMatrix)) #intersection of sample IDs of the model matrices
#then the two model matrices of each gene are combined
modelList <- lapply(methylModelList, function(x) {
  if (!is.null(x)) cbind(modelMatrix[intersectn, ], x[intersectn, ]) else modelMatrix[intersectn, ] })
```
Due to the large list of model matrices, it is not possible to use
the function *use\_glmnet()* directly. Therefore you have to use lapply
with the two separate functions that are included in *use\_glmnet()*.
``` r
resultLGGMet <- lapply(seq_along(modelList), function(x) {
  use_cvglmnet(response = assay(editedCountLGG, 3)[x, intersectn, drop = F],
               verbose = T, type_measure = "auc", model_matrix = modelList[[x]], stratify_class = T,
               alpha = 0.5, standardize = F, stratify = T, return_both = T, parallel = F)})
names(resultLGGMet) <- rownames(editedCountLGG)
resultLGGMetCv <- lapply(seq_along(modelList), function(x) {
  use_nested_cvglmnet(response = assay(editedCountLGG, 3)[x, intersectn, drop = F],
                      model_matrix = modelList[[x]][intersectn, , drop = F], alpha = 0.5,
                      standardize = F, stratify = T, parallel = F)})
names(resultLGGMetCv) <- rownames(editedCountLGG)
```
Then you combine the results into a single list and create the result
matrix.
``` r
resultsCombined <- rlist::list.zip(resultLGGMet, resultLGGMetCv)
names(resultsCombined) <- editedCountLGG@rowRanges$external_gene_name #change gene name type
resMatrixCombined <- lapply(resultsCombined, function(x)
  unlist(make_result_matrix(x, result_cv = 2, fun1 = mean, fun2 = sd,
                            predictors = predictors, repeated = F, include_auc_range = T)))
resMatrixCombined <- do.call(rbind, args = resMatrixCombined)
```
## Multinomial Regression
Besides doing a binomial regression, it is also possible to allow for
multiple outcome levels. The function *use\_glmnet* and the relevant
helper functions can handle multinomial regression as well. For
example, we can test which genes and variables are most strongly
related to the subtype of the patient.
``` r
exprModel <- assay_to_model(editedCountLGG, 2, zscore = T) #turn the gene expression into a model matrix with scaled values
rownames(editedCountLGG) <- editedCountLGG@rowRanges$external_gene_name #change gene names
variables <- c("age_at_diagnosis", "race", "gender", "primary_diagnosis", "paper_Mutation.Count",
               "paper_TERT.expression..log2.", "paper_Percent.aneuploidy", "paper_ESTIMATE.combined.score",
               "paper_ATRX.status", "paper_MGMT.promoter.status")
multiMatrix <- make_model_matrix(editedCountLGG, predictors = variables)
mergedMatrix <- merge_model_matrices(multiMatrix, exprModel)
```
Then you have to make sure that the model matrix and the outcome
variable have the same length and names.
``` r
multiclass <- as.vector(editedCountLGG$paper_IDH.codel.subtype)
names(multiclass) <- colnames(editedCountLGG) #give it names
multiclass <- multiclass[!is.na(multiclass)] #omit NAs
intersection <- dplyr::intersect(rownames(mergedMatrix), names(multiclass))
multimatrix <- mergedMatrix[intersection, ] #subset
multiclass <- factor(multiclass[intersection]) #subset
```
Then you can apply glmnet to the data.
``` r
multimodelResult <- use_glmnet(classification = multiclass,
                               model_matrix = multimatrix, type.multinomial = "grouped", fast_cv = F,
                               family = "multinomial", standardize = F, repeats = 10, stratify_class = T)
```
From the result one can obtain the active coefficients.
``` r
activeCoef <- get_active_coef(multimodelResult)
```
## Survival Analysis
This package is also able to conduct survival analysis. Within this
implementation, glmnet and its regularized Cox regression (Coxnet) are
used. Coxnet fits a Cox regression model that is regularized by the
elastic net penalty. The regression model describes the relation
between an event (e.g. death) and a set of covariates, as expressed by
the hazard function. For that, we look at the gene expression together
with clinical variables. A nested cross-validation is used.
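As background, the regularized Cox model itself boils down to a glmnet
call with a survival response; a minimal sketch on simulated data (the
package wrapper *use\_cv\_coxglmnet* adds the nested cross-validation
and risk group handling; a recent glmnet version is assumed for the
`Surv` response and the C-index measure).
``` r
library(glmnet)
library(survival)
set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20) #covariates (expression and clinical variables)
time <- rexp(100, rate = 0.1)                       #simulated survival times
status <- rbinom(100, 1, 0.7)                       #1 = event (death), 0 = censored
cvCox <- cv.glmnet(x, Surv(time, status), family = "cox",
                   alpha = 0.5, type.measure = "C") #C-index as cross-validation metric
coef(cvCox, s = "lambda.1se")                       #active covariates of the Cox model
```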
``` r
exprModel <- assay_to_model(editedCountLGG, 2, zscore = T) #turn the gene expression into a model matrix with scaled values
joinedModel <- merge_model_matrices(modelMatrix, exprModel)
#apply the Cox regression
resultSurvival <- use_cv_coxglmnet(editedCountLGG, model_matrix = joinedModel, stratify = T,
                                   k_inner_loop = 5, k_outer_loop = 8, verbose = T,
                                   add_survival = F, seed = 15)
```
The inner cross-validation is used to find the optimal value of the
penalty factor lambda, while the outer loop is used to estimate the
prediction accuracy with the C-index. A final model is also built to
calculate the hazard ratios/risk values and to build risk groups.
As a result, one obtains a list with four elements: first, a vector of
C (AUC) statistics; second, the whole glmnet fit; third, the risk
values of all patients according to the beta values of the model; and
fourth, the final survival or censoring times. To see which variables
are active in the final model, one can extract them from the results.
``` r
survcoef <- coef(resultSurvival$fit, s = "lambda.1se") %>% as.matrix() %>% data.frame() %>% tibble::rownames_to_column()
survcoef <- survcoef %>% filter_if(is.numeric, any_vars(abs(.) > 0)) #keep active (non-zero) coefficients
hazard <- exp(survcoef$X1) #calculate the hazard ratio by exponentiation
survcoefficients <- cbind(survcoef, hazard)
colnames(survcoefficients) <- c("rowname", "coefficient", "hazard")
survcoefficients <- data.frame(survcoefficients)
survcoefficients <- survcoefficients %>% dplyr::filter(hazard != 1) %>% tibble::column_to_rownames()
indexorder <- order(survcoefficients$hazard, decreasing = T) #order predictors by hazard
survcoefficients <- survcoefficients[indexorder, ]
```
A high positive coefficient indicates a lower survival probability,
while a negative coefficient corresponds to favourable survival. The
risk scores can then be used to dichotomize the patients into low- and
high-risk groups based on the median.
``` r
survrisk <- resultSurvival$riskvalues %>% as.matrix()
survnames <- rownames(survrisk)
survmedian <- as.vector(dplyr::ntile(survrisk, 2)) #split at the median into two groups
names(survmedian) <- survnames
survmedian <- data.frame(survmedian) %>% tibble::rownames_to_column()
clinical <- data.frame(colData(editedCountLGG))
clinical <- dplyr::left_join(clinical, survmedian, by = c("patient" = "rowname"))
clinical$riskgroup <- factor(clinical$survmedian, levels = c(1, 2), labels = c("low", "high"))
```
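With the risk groups in place, a typical follow-up is to compare their
survival curves. The sketch below uses the survival package and
assumes that *add\_coldata\_survival()* created columns named
`survival_time` and `survival_status`; these names are an assumption,
so replace them with the ones actually present in your colData.
``` r
library(survival)
#column names survival_time/survival_status are assumptions, adapt them to your object
kmFit <- survfit(Surv(survival_time, survival_status) ~ riskgroup, data = clinical)
plot(kmFit, col = c("blue", "red"), xlab = "Time", ylab = "Survival probability")
legend("topright", legend = levels(clinical$riskgroup), col = c("blue", "red"), lty = 1)
survdiff(Surv(survival_time, survival_status) ~ riskgroup, data = clinical) #log-rank test between the risk groups
```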