Skip to content
Permalink
837de1d39b
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time

Using mmRmeta

Sebastian Lieske 15 Januar 2020

Introduction

The developed R package mmRmeta is a framework to analyze multi-omics data such as RNA-seq and DNA methylation data with the focus on bimodal genes. The workflow of mmRmeta consists of three major steps. First, the data needs to be acquired and preprocessed. This is followed by the identification of bimodally expressed genes. The package mmRmeta uses mclust for model-based clustering to obtain parameter estimates and the cluster assignment of the samples. The third step is the multi-omics analysis to find relations between bimodally expressed genes and the phenotypes of the patient population via modeling regularized logistic regressions and conducting survival analyses with Cox proportional hazard models.

Overview mmRmeta(#anchor5)

Installation

Before doing the data analysis you need to install and load the package.

  devtools::install.github("sebastianlieske/mmRmeta", host = "github.molgen.mpg.de/apu/v3") # link to package
  library(mmRmeta)

There are 7 other packages that are needed that can be downloaded from Bioconductor.

requiredBio <- c("Biobase", "biomaRt", "edgeR", "GenomicRanges", "SummarizedExperiment", "EnsDb.Hsapiens.v86", "ComplexHeatmap")
for (i in requiredBio) {
  BiocManager::install(i)
}

Preprocessing

Let us start with the preprocessing of the data. To proceed this vignette, make sure you have available data either in form of a SummarizedExperiment containing the gene expression data and clinical data or at least one gene expression count matrix and clinical data in form of a data frame in your R environment. In this example, the data was downloaded from NCI’s Genomic Data Commons Data Portal via TCGAbiolinks. To see an example of how to download the data, please go to the following section Data Acquisition. To combine RNA expression count matrix and additional data into a summarized experiment head to Make Summarized Experiment.

Data Acquisition

To obtain the needed data for the analysis you first need to install two needed packages: BiocManager to install other packages from Bioconductor and TCGAbiolinks.

  if(!require(BiocManager)) install.packages("BiocManager", repos = "http://cran.us.r-project.org")
  if(!require(TCGAbiolinks)) BiocManager::install("TCGAbiolinks", dependencies = TRUE)

Now you are set to download the data. You can browse https://portal.gdc.cancer.gov/repository for the primary site of the tumor you desire. We are interested in RNA-seq data from HTSeq and we need the raw counts. The package mmRmeta allows for the analysis of additional multi-omics data like DNA methylation. An example for that is shown as well.

  library("TCGAbiolinks")
  if (!dir.exists("GDCdata")) dir.create("GDCdata") #create a new folder for the data to be downloaded
  #obtain the query result for available data in the data base
  queryLGG <-  GDCquery(project = c("TCGA-LGG"), #low grade glioma (brain cancer)
                 data.category = c("Transcriptome Profiling"),  data.type = "Gene Expression Quantification",
                 workflow.type = "HTSeq - Counts", legacy = F)
  #with the query we are able to download the found data
  GDCdownload(query = queryLGG, method = "api",
            files.per.chunk = 40, directory = "GDCdata") #can take several minutes
  #read downloaded data and transform into an R object
  #if you encounter an error please download most recent version of bioconductor
  rnaCountLGG <- GDCprepare(queryLGG, summarizedExperiment = TRUE,
                 directory = "GDCdata") #create a summarized experiment
  #lastly save object as .rds file
  saveRDS(rnaCountLGG, "rnaCountLGG.RDS")

To download DNA methylation, we need to adjust the data category, data.type and workflow.type. Due to the size of the files, the download will take up some more time.

  library("TCGAbiolinks")
  if (!dir.exists("GDCdata")) dir.create("GDCdata") #create a new folder for the data to be downloaded
  queryMethLGG <- GDCquery(project= "TCGA-LGG",
                         data.category = "DNA Methylation",
                         platform = "Illumina Human Methylation 450",
                         legacy = F)

  GDCdownload(query = queryMethLGG, method = "api",
            files.per.chunk = 40, directory = "GDCdata")

  methylLGG <- GDCprepare(queryMethLGG, summarizedExperiment = T, directory = "GDCdata")
  saveRDS(methylLGG, "methylLGG.RDS")

Now you have a summarized experiment object saved as an rds file. This data needs to be preprocessed due to redundant or missing data. For that continue with Preprocess Summarized Experiment

Make Summarized Experiment

When you already have your raw counts in a matrix and additional clinical data of the patients or samples in a data frame, you can combine them into a summarized experiment to use this R package to its fullest extend. To learn more about the summarized experiment head over to Bioconductor or go to section Structure of a Summarized Experiment. You have to make sure that the rows correspond the a single gene and the columns correspond to each patient. The patients must have an unique identifier. In the clinical data, the patients or samples should be in the rows and the variables like age, gender, tumor type etc. should be in the columns. Generally, make sure that your data is tidy.

Structure of a Summarized Experiment

SummarizedExperiment

To better understand what certain functions will do, you’ll get a short overview of the class summarized experiment. Rows represent features like gene counts, while the columns represent samples. All included functions of this package are designed to either work on simple matrix-like objects or on the SummarizedExperiment (SE) container. In this work, the assays store the gene expression, classification, and DNA methylation data. The colData stores the clinical data of each sample or patient and the rowRanges contains the genomic features like the location of the gene on the chromosome. Due to the characteristic of the SE class, the data will always be in sync, meaning when deleting rows or columns out of the RNA-seq assay the respective data is deleted out of the other objects too.

Preprocessing Summarized Experiment

You have the necessary data ready by now. This R package has a variety of functions for data cleansing and data reduction. These were especially made to work on the data obtained from GDC Data Portal but with the right attribute costimiziation they can be used on other tabulated data as well.

#load data
if (!exists("rnaCountLGG")) {
  rnaCountLGG <- readRDS("rnaCountLGG.RDS")
}

Due to the data design, the colData of the summarized experiment may contain nested elements; like a list as a cell entry. Especially one column is creating trouble.

  #delete column "treatments" because its leading to errors and it contains no information
library(mmRmeta)
  if ("treatments" %in% colnames(colData((rnaCountLGG)))) {
    rnaCountLGG$treatments <- NULL
  }

The preprocessing of the summarized experiment consists of two parts: the first one handles the gene expression matrix in the assay and the second one deals with the colData containing the clinical data. Both need to be altered for downstream functions to properly work.

Processing colData

Starting with the clinical data in the colData there are several things to pay attention to. We want to filter out columns with only missing values, change all categorical entries to factors and filter out single level columns. Additionally, you only want to have samples from primary solid tumors and filter out metastatic samples.

#new object for applied changes
editedCountLGG <- filter_for_factor(rnaCountLGG,
                                    column_value = "Primary solid Tumor") #only keep primary solid tumours
editedCountLGG <- preprocess_coldata(editedCountLGG)
editedCountLGG <- filter_columns_with_na(editedCountLGG, unnest = T, percentage = 0.8, keep = "days_to_death") #drop columns with 80% missing values and drop columns with _id as part of their name
editedCountLGG <- factorize_columns(editedCountLGG)
editedCountLGG <- drop_unused_levels(editedCountLGG)
editedCountLGG <- filter_small_binary(editedCountLGG, min_counts = 10) #drop binary factor columns where one level has less than 10 counts
editedCountLGG <- filter_single_level(editedCountLGG) #drop factor columns with only one level
colnames(editedCountLGG) <- substr(colnames(editedCountLGG), 1, 12)

The function preprocess_coldata incorporates 3 single functions that also can be applied individually. When using the standard parameters all not numerical columns were set to factors, columns with only NAs or “not reported” are deleted and single factor level entries are discarded as well.

Additional Functions for Coldata

These functions are called by preprocess_coldata_setw

  • factorize_columns()
  • filter_columns_with_na()
  • filter_single_level()

There are some more functions to further filter and restructure the colData:

  • reorder_all() = reorder factor levels by count, important for figure legends and glmnet
  • collapse_column() = combines factor levels by given name, useful for small factor levels
  • drop_unused_levels()

To successfully conduct the survival analysis, the associated data columns need to be transformed into a standardized format. This functionality is implemented by the function add_coldata_survival() which requires the information of the time of the event, event status (dead, alive), and time of censoring.

editedCountLGG <- add_coldata_survival(editedCountLGG)

Processing Assay

Now the gene expression matrix needs some attention. When analysing RNA-seq data, you want to filter out genes with over all low or zero read count to safe space and computational time. Also there can be duplicated gene that needs to be filtered out.

#new object for applied changes
editedCountLGG <- filter_chromosome(editedCountLGG, chromosome = c("chrY", "chrX")) #filter genes on x and y chromosome
editedCountLGG <- preprocess_assay(editedCountLGG)

In this case we want to filter out genes located on the Y and X chromosome because they would be flagged as bimodal genes due to the gender. It is necessary that the summarized experiment has also a GRanges object. By applying preprocess_assay duplicated genes and genes with less than 5% of detected samples are discarded with a overall read count lower than 5% of samples having at least 5. Also at least 10% of all samples that have reads must have more than 5 reads for that gene because a gene is considered as expressed when it has at least a count of 5-10 in a library.

Some patients have multiple existing samples in the expression matrix. We want to have only one enty per patient so we need to merge those additional samples by either calculating the mean or median of all samples belonging to one patient.

editedCountLGG <- merge_duplicated(editedCountLGG, assay_index = 1,
                                   method = "median", shorten_names = T, original_names = T, parallel = T) #combine duplicated genes
editedCountLGG <- reorder_all(editedCountLGG) #reorder factors by counts

When working with TCGA data, all patients or samples have a barcode. The barcode of the samples are the names in the columns of the expression matrix. So far the first 12 characters of the barcode represent a patient and the following characters specify the sample.

Additional Functions for Assay

These functions are called by preprocess_assay

  • filter_duplicates()
  • filter_rare()

Additionally, these functions are available:

  • use_silvermantest() = uses silvermantest on every genes
  • use_diptest() = use diptest on every gene

After gene filtering, the gene expression counts are normalized and transformed to log counts by add_assay_norm() and transform_assay(), respectively. To ensure the availability of the original counts, any alteration of the data is saved as a new assay which is added to the SummarizedExperiment. The Trimmed Mean of M values (TMM) method was adopted from the edgeR Bioconductor package.

#add normalized assay, log transform assay
editedCountLGG <- add_assay_norm(editedCountLGG, 1, method = "TMM",
                                name = "normalized", add_info_metadata = F ) #add new normalized assay
editedCountLGG <- transform_assay(editedCountLGG, assay_index = 2) #log2 transform second assay                               

Preprocessing Methylation Data

When preprocessing the methylation data you can use the same functions as well.

methylLGG$treatments <- NULL
methylLGG <- filter_columns_with_na(methylLGG)
methylLGG <- filter_duplicates(methylLGG)
methylLGG <- filter_zero(methylLGG)  
colnames(methylLGG) <- substr(colnames(methylall), 1, 12)  #shorten IDs        

Finding Bimodal Genes

Using SIBERG

After preprocessing we are set to analyse the data, starting with finding bimodal genes. Our matrix has several thousand genes and with the help of SIBERG we can prefilter possible bimodal candidates relatively fast.

siberLGG <- use_siber(editedCountLGG, 1) #use SIBER in parallel
#order by bimodal index
siberLGG  <- siberLGG  %>% tibble::rownames_to_column() %>% arrange(., -BI,) %>% tibble::column_to_rownames()

When using this function, one obtains a matrix with 6 columns.

We are interested in the last column containing the bimodal index. The bimodal index is an indicator of how well a gene has probably two modes and therefore can be considered as bimodal. Theoretically, the BI can have values ranging from zero to infinity. Usually they reach at most a value of 3. Generally, the higher the index the more likely the gene is bimodal. In SIBERG the index is a generalized bimodal index to allow for different variance in the two modes. Having that index, we can set a cutoff and every gene below that cutoff is discarded.

Next, we want to filter out genes with a bimodal index below 1.5.

bimodalNames <- rownames(siberLGG[siberLGG$BI >= 1.5,]) #names of bimodal genes             
editedCountLGG <- editedCountLGG[bimodalNames,] #subset        

With that we have a much smaller summarized experience with less than 300 rows/genes that are normalized and also log2 transformed. The remaining genes are candidates of being bimodal. In the next section we are going to narrow in those bimodal genes and also use model-based classification to get the modality of every patient for each gene.

Using mclust for Classification

After applying SIBERG, our summarized experiment has two assays and a bunch of bimodal candidates. Now, we are going to use model based clustering with mclust to classify our samples for each gene into modes/groups. Mclust uses an expectation–maximizationalgorithm for maximum likelihood estimation.

resultClust <- use_mclust(editedCountLGG, 2, G = c(1:2), modelNames = "V") #apply mclust on normalized counts

With that function call, mclust calculated the BIC of having 1-2 mixture components (clusters) in the univariate data and we allow the model to have variable or unequal variance in the mixture components. We obtained a list with the results for every gene. Mclust returns the model with the highest BIC. Due to the data properties there may be flawed results and those need to be corrected.

Verify Results

When applying mclust with variable variance it can happen that the data is not classified correctly. Samples with lowest gene expression are classified into the higher mode and vice versa:

Misclassification

Those flaws have to be corrected.

correctedClust <- verify_mclust(resultClust, simple = T) #look for misclassification
correctedClust <- verify_mclust(correctedClust, min_proportion = 0.10) #look for imbalanced models
correctedClust <- verify_mclust(resultClust, simple = T) #look for misclassification# again

The function correctedClust iterates through the list and searches for flawed results. It has two modi. The first one without the parameter min_proportion looks for misclassfied samples. With that parameter, it searches for results with imbalanced proportion of the modes. In this example we want to have a model where at least 10% of the samples are assigned to one mode. If the function encounters inconsistencies or flaws, if will call use_mclust again with different parameters. Next up we filter results that returned an unimodal model and models with a low fold change.

correctedClust <- filter_unimodal(resultClust) #filter out unimodal models
correctedClust <- filter_proportion(correctedClust, proportion = 0.10) # filter out imbalanced proportions
correctedClust <- filter_fold_change(correctedClust, min_fold = 2 ) #filter out models with low fold change

Now that we have filtered the results, we can add the classification from mclust to our summarized experiment.

#add new assay to SE
editedCountLGG <- add_assay_classification(editedCountLGG, correctedClust, index = 3)
#save .RDS
saveRDS(editedCountLGG, "classifiedLGG.RDS")

A new assay with the classification matrix was added to the SE. In the next section we are going to analyse the relation between bimodal genes, DNA methylation and the clinical data.

Multi-Omics Analysis

The following part will cover the analysis of the additional clinical data stored in colData of our summarized experiment and DNA methylation. The goal is to find possible explanations or correlations between the features of the samples and the bimodal expression of certain genes or, from an other approach, find relations between bimodal genes and their possible influence on the data characteristics like survival probability of the patients. To achieve this goal this package uses penalized generalized linear models via glmnet and Cox proportional hazard models for survival analysis.

Glmnet for Multiple Logistic Regression

For every gene, a generalized linear model is fit to predict the mode of the samples. Because one is looking for bimodal genes, a single gene has K classes labeled as 1 and 2. The class is the response variable. Additionally, there is a variety of predictors (k) in the clinical data for example categorical predictors like gender, tumour type or grade, and also numerical predictors like the age of the patients. Due to the abundance, variable selection is needed to shrink the model and to decrease overfitting of the model.

Model Matrix

Our data consists of a variety of qualitative and quantitative variables. For our multivariate analysis we want to use any variable that seems interesting to us. First of all, we need to create a model matrix (design matrix) to include the values of our explanatory variables (columns in colData). Each row represents a sample and the columns correspond to the variables.

#you can get a first glimps of which columns have many missing values
sum_is_na(editedCountLGG)
#create vector with column names from our data
variables <-  c("age_at_diagnosis", "race", "gender", "primary_diagnosis", "paper_Mutation.Count",    
                          "paper_TERT.expression..log2.", "paper_Percent.aneuploidy", "paper_ESTIMATE.combined.score",
                          "paper_ATRX.status", "paper_MGMT.promoter.status", "paper_IDH.codel.subtype",
                          "paper_Chr.7.gain.Chr.10.loss", "paper_Chr.19.20.co.gain")
#create model matrix
modelMatrix <- make_model_matrix(editedCountLGG, predictors = variables,
                                 zscore = T, categorical_caret = T, impute_na = T, fun = mean)

This function fulfills several tasks. Because glmnet can not use factors directly, every categorical variable that is encountered is transformed into dummy variables with either values of 0 and 1. Furthermore, every continuous variable (e.g. age) is z-score scaled. Standardizing features is an important feature before model fitting. Otherwise, different scaled features will influence the process negatively. Additionally, missing continuous entries are imputed. Unfortunately, due to missing data values, the model may consist of fewer samples than the original data.

Nested k-fold Cross-Validation

Next up, we are going to apply a regression model with an elastic net penalization via glmnet. Two examples are given, one simple example with only taking the colData into account and the second example with additional DNA methylation data which needs more steps. Our model matrix has many variables that our final model can draw their coefficients from. We want to know if the classification of the bimodality can be predicted by a model containing the clinical data, which predictors can model the bimodality and to which extend. Because of the abundance of our predictor variables we are applying variable selection.

#example without methylation
resultModel <- use_glmnet(editedCountLGG, assay_classification = 3, model_matrix = modelMatrix,
                          parallel = F,col_names = variables, use_glm = T, repeats = 1,
                          fast_cv = F, standardize = F, stratify_class = T)

Depending on the parameters we obtain either a large list with two. The first one is a list of results from glmnet for each gene and the second one are the results of the cross validated error scores (auc values). With the function make_result_matrix we get a better overview of the results and are able to order by AUC.

resultCombined <- combine_list(resultModel) #give result Genewise
resultLGG <- lapply(resultCombined, function(x) {
                    unlist(make_result_matrix(x, result_cv = 2,
                    fun1 = mean, fun2 = sd, predictors = variables, repeated = F,
                    include_auc_range = T)) #make result matrix for every gene
  })
resultLGG <- do.call(rbind, args = resultLGG) #combine them all

Keep in mind that we made a k-fold nested cross validation. By considering the range/standard deviation and the mean of the area under curve values, we can get an idea of how stable the model is and to what extend we can rely on the coefficients obtained by said model. From the obtained result matrix you can consider and filter genes that have a certain AUC value with a small standard deviation.

filteredResults <- data.frame(resultLGG) %>% filter(between(round(mean, digits = 2), 0.8, 1)) %>%
  filter(between(round(sd, digits = 3), 0, 0.1)) #filter genes with AUC between 0.8 and 1 and a sd of 0 and 0.1

Example with DNA Methylation

In this section an example of how to utilize DNA methylation in the analysis is given. At first the preprocessing of the downloaded data is needed as well.

if (!exists("methylLGG")) {
  methylLGG <- readRDS("methylLGG.RDS")
}
methylLGG$treatments <- NULL
methylLGG <- merge_duplicated_columns(methylLGG)
methylLGG <- filter_duplicates(methylLGG)
methylLGG <- filter_zero(methylLGG)
colnames(methylLGG) <- substr(colnames(methylLGG), 1, 12)

The downloaded methylation data has to be mapped to the genes. The information of the genomic location of the genes is stored as GRanges in the rowRanges object of the SummarizedExperiment. The first part consisted of annotating the CpG probes by overlapping the GRanges of the probes with general genomic locations. The EnsDb.Hsapiens.v86 library (Rainer 2017)[!link] is used to retrieve the annotation and GRanges of genes, exons, and introns. Promoter regions were defined by a set distance from the start of a gene. The default value is 2000 bases upstream and 200 bases downstream of the TSS. The function methylation_cpg_annotation() overlaps the GRanges of the methylation probes to the retrieved genomic information. Each mapped probe was assigned a prefix, indicated whether it matched to the promoter, exon or intron. Multiple matching probes on the same location were ordered and marked by a secondary prefix. Genes can have zero, some, or several hundred mapping probes. In general, longer genes have more mapping probes.

To start, we need the genomic information of genes from ensembl.

#at first obtain genomic information
edb <- EnsDb.Hsapiens.v86
seqlevelsStyle(edb) <- "UCSC"
exons <- ensembldb::exonsBy(edb, by = "gene", filter = AnnotationFilterList(
  SeqNameFilter(c(1:22, "X", "Y")),
  GeneIdFilter("ENSG", "startsWith")))
genes <- genes(edb,  filter = AnnotationFilterList(
  SeqNameFilter(c(1:22, "X", "Y")),
  GeneIdFilter("ENSG", "startsWith")))

Then the probes in the DNA methylation assay are mapped to the genes. All mapped probes and their beta values are transformed into a model matrix as well and stored in a list.

namelist <- as.list(rownames(editedCountLGG)) #name of genes
cpgGeneList <- lapply(namelist, function(x) methylation_cpg_annotation(edb, methylLGG, 1, genes = genes, gene_name = x, exons = exons)) #get annotation for every gene
names(cpgGeneList) <- rownames(editedCountLGG)
geneMethylList <- lapply(cpgGeneList, function(x) methylation_beta_value(x,
                  methylLGG, shorten_colnames = T, impute_na = T, fun = mean)) #get beta value for every mapped probe
geneMethylList <- lapply(geneMethylList, function(x) if (!is.null(colnames(x))) rename_margin(x,
                  margin = 2, cut_from = 1, cut_to = 12, char_from = "\\.", "-") else x)
methylModelList <- lapply(geneMethylList, function(x) if(!is.null(x)) assay_to_model(x) else x) #
methylModelList <- lapply(methylModelList, function(x) if(!is.null(x)) rename_margin(x, margin = 1,
                   char_from = "\\.", "-") else x)

Now we have a list of model matrices for every mapped gene. Some genes have no mapping CpG probes and are empty. Now were ready to apply glmnet.

methylpred <- c("promotor", "exon_first", "exon_inter", "exon_last", "intron") #predictor names for methylation
predictors   <- c(methylpred, variables) #all predictor variables

intersectn <- intersect(rownames(methylModelList[[1]]), rownames(modelMatrix)) #get intersection of sample IDs of the model matrices  
#then the two model matrices of each gene are combined
modelList <- lapply(methylModelList, function(x) {
             if(!is.null(x)) cbind(modelMatrix[intersectn, ], x[intersectn,]) else modelMatrix[intersectn,] })

Due to the large list of model matrices, it is not possible to use the function use_glmnet() on its own. Therefore you have to use lapply with the two seperate functions that are included in use_glmnet()

resultLGGMet <- lapply(seq_along(modelList), function(x) {
  use_cvglmnet(response = assay(editedCountLGG, 3)[x, intersectn, drop = F],
  verbose = T, type_measure = "auc", model_matrix = modelList[[x]], stratify_class = T,
  alpha = 0.5, standardize = F, stratify = T,  return_both = T, parallel = F)})
names(resultLGGMet) <- rownames(editedCountLGG)

resultLGGMetCv <- lapply(seq_along(modelList), function(x) {
                  use_nested_cvglmnet(response = assay(editedCountLGG, 3)[x, intersectn, drop = F],
                  model_matrix = modelList[[x]][intersectn,, drop = F], alpha = 0.5,
                  standardize = F, stratify = T, parallel = F)})
names(resultLGGMetCv) <- rownames(editedCountLGG)

Then you combine the results into a single list to then create the result matrix.

resultsCombined <- rlist::list.zip(resultLGGMet, resultLGGMetCv)
names(resultsCombined) <- editedCountLGG@rowRanges$external_gene_name #change gene name type
resMatrixCombined <- lapply(resultsCombined, function(x) unlist(make_result_matrix(x, result_cv = 2, fun1 = mean, fun2 = sd, predictors = predictors, repeated = F, include_auc_range = T)))
resMatrixCombined <- do.call(rbind, args = resMatrixCombined)

Multinomial Regression

Besides doing a binomial regression it is also possible to allow for multiple outcome levels. The function use_glmnet and the relevant functions can handle do multinomial regression as well. For example we can test, which genes and variables are related the most to the subtype of the patient.

exprModel <-  assay_to_model(editedCountLGG, 2, zscore = T) #first turn gene expression into model matrix with scaled Values
rownames(editedCountLGG) <- editedCountLGG@rowRanges$external_gene_name #change gene names
variables <- c("age_at_diagnosis", "race", "gender", "primary_diagnosis", "paper_Mutation.Count",    
                          "paper_TERT.expression..log2.", "paper_Percent.aneuploidy", "paper_ESTIMATE.combined.score",
                          "paper_ATRX.status", "paper_MGMT.promoter.status")
multiMatrix <- make_model_matrix(editedCountLGG, predictors = variables)
megedMatrix <- merge_model_matrices(multiMatrix, exprModel)

Then you have to make sure, that the Matrix and the outcome variable have the same length and names.

multiclass <- as.vector(editedCountLGG$paper_IDH.codel.subtype)
names(multiclass) <- colnames(editedCountLGG) #give it names
multiclass <- multiclass[!is.na(multiclass)] #omit NAs
intersection <- dplyr::intersect(rownames(megedMatrix), names(multiclass))
multimatrix <- megedMatrix[intersection,] #subset
multiclass <- factor(multiclass[intersection]) #subset

Then you can apply glmnet to the data.

multimodelResult <- use_glmnet(classification = multiclass,
                    model_matrix = megedMatrix, type.multinomial = "grouped",  fast_cv = F,
                    family = "multinomial", standardize = F, repeats = 10, stratify_class = T)

From the result one can obtain the active coefficients.

activeCoef <- get_active_coef(multimodelResult)

Survival Analysis

This package is able to conduct survival analysis. Within this implementation, glmnet and the included regularized Cox regression (Coxnet) is used. Coxnet fits a Cox regression model that is regularized by the elastic net penalty. The regression model describes the relation between an event (e.g. death) and a set of covariates as expressed by the hazard function. For that were looking at the gene expression and with addition of clinical variables. A nested cross-validation is used.

exprModel <-  assay_to_model(editedCountLGG, 2, zscore = T) #first turn gene expression into model matrix with scaled Values
joinedModel <- merge_model_matrices(modelMatrix, exprModel)
#apply cox regression
resultSurvival <- use_cv_coxglmnet(editedCountLGG, model_matrix = joinedModel, stratify = T,  k_inner_loop = 5, k_outer_loop = 8, verbose = T, add_survival = F, seed = 15)

The inner cross validation will be used for finding the optimal value for the penality factor lambda, while the outer loop is used to estimate the prediction accuracy with the C-index. Also a final model is build to calculate the hazard ratios/risk values and build risk groups.

As a result one obtains a list with 4 elements. First one is a vector of C (auc) statistics. Second, the whole glmnet fit. Third, risk values of all patients according to beta values oft the model. And fourth, the final survival or censoring times. To see which variables are active in the final model, one can extract them out of the results.

survcoef <- coef(resultSurvival$fit, s = "lambda.1se") %>% as.matrix() %>% data.frame() %>% tibble::rownames_to_column()
survcoef <- survcoef  %>%  filter_if(is.numeric, any_vars(abs(.) > 0))
hazard <- exp(survcoef$X1) #calculate hazard by exponentiation
survcoefficients <- cbind(survcoef, hazard)
colnames(both) <- c("rowname", "coefficient", "hazard")
survcoefficients <- data.frame(survcoefficients)
survcoefficients <- survcoefficients %>%  dplyr::filter(hazard != 1) %>% tibble::column_to_rownames()
indexorder <- order(survcoefficients$hazard, decreasing = T) #order predictors by hazard
survcoefficients <- survcoefficients[indexorder, ]

A high positive coefficient indicates lower survival probability. A negative coefficient corresponds with favourable survival. The risk scores can then be used to dichotomize the patients into low and high risk groups based on the median.

survrisk <- resultSurvival$riskvalues %>% as.matrix()
survnames <- rownames(survrisk)
survmedian <- as.vector(dplyr::ntile(survrisk, 2))
names(survmedian) <- survnames
survmedian <- data.frame(survmedian) %>% tibble::rownames_to_column()
clinical <- data.frame(colData(editedCountLGG))
clinical <- left_join(clinical, survmedian, by = c("patient" = "rowname"))
clinical$riskgroup<- factor(clinical$survmedian, levels = c(1,2), labels = c("low", "high"))