Skip to content

Commit

Permalink
spelling
Browse files Browse the repository at this point in the history
sebastianlieske committed Aug 11, 2020
1 parent 325690b commit 837de1d
Showing 6 changed files with 111 additions and 13 deletions.
7 changes: 5 additions & 2 deletions R/preprocess_coldata.R
Original file line number Diff line number Diff line change
@@ -124,11 +124,14 @@ filter_columns_with_na <- function(summarized_experiment, to_lower = T, unnest =
columnData <- mmRmeta::drop_unused_levels(columnData) #drop unused factor levels

if (!is.null(percentage)) { #drop columns with more than >percentage< NA obseervations
keepMatch <- vector(mode = "character")
keepMatch <- numeric()
sumNA <- colSums(is.na(columnData))/nrow(columnData) %>% round(., digits = 2)
isAbove <- sumNA >= percentage
if (!is.null(keep)) {
keepMatch <- append(keepMatch, grep(pattern = keep[i], names(columnData))) #get index for colums to be saved, only if >keep< is not null
for (i in 1:length(keep)) {
if (!is.null(keep)) keepMatch <- append(keepMatch, grep(pattern = keep[i], names(columnData))) #get index for colums to be saved, only if >keep< is not null
keepMatch <- unique(keepMatch)
}
keepMatch <- unique(keepMatch)
isAbove <- isAbove %notin% keepMatch
}
2 changes: 1 addition & 1 deletion R/use_glmnet.R
Original file line number Diff line number Diff line change
@@ -23,7 +23,7 @@
#' @import magrittr
#' @export

use_glmnet <- function(data = NULL, assay_classification = NULL, classification = NULL, model_matrix = NULL, col_names = NULL, gene_names = NULL, use_glm = T, use_cv = T, fast_cv = T, stratify_class = F, type_measure = "class", type.multinomial = "grouped",
use_glmnet <- function(data = NULL, assay_classification = NULL, classification = NULL, model_matrix = NULL, col_names = NULL, gene_names = NULL, use_glm = T, use_cv = T, fast_cv = F, stratify_class = F, type_measure = "class", type.multinomial = "grouped",
alpha = 0.5, verbose = T, parallel = T, no_cluster = NULL, seed = NULL, repeats = 1, k_outer_loop = 10, k_inner_loop = 5, standardize = F, return_both = T, output_genewise = T, ... ) {
#test for mode: data = null ? --> then only classification and model_matrix oder
###################### 1. Mise en Place ###########################################################
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -8,4 +8,5 @@ The workflow of mmRmeta consists of three major steps (outlined in Figure 2). Fi
This is followed by the identification of bimodally expressed genes. The package mmRmeta uses mclust for model-based clustering to obtain parameter estimates and the cluster assignment of the samples.
The third step is the multi-omics analysis to find relations between bimodally expressed genes and the phenotypes of the patient population via modeling regularized logistic regressions and conducting survival analyses with Cox proportional hazard models.


![Overview](/vignettes/overview.svg)
56 changes: 48 additions & 8 deletions vignettes/Using_mmRmeta.Rmd
Original file line number Diff line number Diff line change
@@ -20,9 +20,17 @@ The developed R package mmRmeta is a framework to analyze multi-omics data such
# Installation
Before doing the data analysis you need to install and load the package.
``` {r eval = FALSE}
devtools::install.package("mmRmeta") # link to package
devtools::install.github("sebastianlieske/mmRmeta", host = "github.molgen.mpg.de/apu/v3") # link to package
library(mmRmeta)
```
There are 7 other packages that are needed that can be downloaded from Bioconductor.
``` {r eval = FALSE}
requiredBio <- c("Biobase", "biomaRt", "edgeR", "GenomicRanges", "SummarizedExperiment", "EnsDb.Hsapiens.v86", "ComplexHeatmap")
for (i in requiredBio) {
BiocManager::install(i)
}
```

# Preprocessing
Let us start with the preprocessing of the data. To proceed this vignette, make sure you have available data either in form of a SummarizedExperiment containing the gene expression data and clinical data or at least one gene expression count matrix and clinical data in form of a data frame in your R environment. In this example, the data was downloaded from NCI's [Genomic Data Commons Data Portal](https://portal.gdc.cancer.gov/) via [TCGAbiolinks](https://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html). To see an example of how to download the data, please go to the following section [Data Acquisition](#anchor1). To combine RNA expression count matrix and additional data into a summarized experiment head to [Make Summarized Experiment](#anchor2).

@@ -91,28 +99,29 @@ if (!exists("rnaCountLGG")) {
Due to the data design, the colData of the summarized experiment may contain nested elements; like a list as a cell entry. Especially one column is creating trouble.
```{r eval = FALSE}
#delete column "treatments" because its leading to errors and it contains no information
library(mmRmeta)
if ("treatments" %in% colnames(colData((rnaCountLGG)))) {
rnaCountLGG$treatments <- NULL
}
```
The preprocessing of the summarized experiment consists of two parts: the first one handles the gene expression matrix in the assay and the second one deals with the colData containing the clinical data. Both need to be altered for downstream functions to properly work.

### Processing colData
Starting with the clinical data in the colData there are several things to pay attention to. We want to filter out columns with only missing values, change all categorical entries to factors and filter out single level columns. Additionally, you only want to have samples from primary solid tumors and filter out metastatic samples.
Starting with the clinical data in the colData, there are several things to pay attention to. We want to filter out columns with only missing values, change all categorical entries to factors and filter out single level columns. Additionally, you only want to have samples from primary solid tumours and filter out metastatic samples.
```{r eval = FALSE}
#new object for applied changes
editedCountLGG <- filter_for_factor(rnaCountLGG,
column_value = "Primary solid Tumor") #only keep primary solid tumours
editedCountLGG <- preprocess_coldata(editedCountLGG)
editedCountLGG <- filter_columns_with_na(editedCountLGG, unnest = T, col_char = "_id", percentage = 0.8, keep = "days_to_death") #drop columns with 80% missing values and drop columns with _id as part of their name
editedCountLGG <- filter_columns_with_na(editedCountLGG, unnest = T, percentage = 0.8, keep = "days_to_death") #drop columns with 80% missing values and drop columns with _id as part of their name
editedCountLGG <- factorize_columns(editedCountLGG)
editedCountLGG <- drop_unused_levels(editedCountLGG)
editedCountLGG <- filter_small_binary(editedCountLGG, min_counts = 10) #drop binary factor columns where one level has less than 10 counts
editedCountLGG <- filter_single_level(editedCountLGG) #drop factor columns with only one level
colnames(editedCountLGG) <- substr(colnames(editedCountLGG), 1, 12)
```
The function _preprocess_coldata_ incorporates 3 single functions that also can be applied individually. When using the standard parameters all not numerical columns were set to factors, columns with only NAs or "not reported" are deleted and single factor level entries are discarded as well.
The function _preprocess_coldata_ incorporates 3 single functions that also can be applied individually. When using the standard parameters, all non-numerical columns were set to factors, columns with only NAs or "not reported" are deleted and single factor level entries are discarded as well.

#### Additional Functions for Coldata
These functions are called by _preprocess_coldata_setw_
@@ -142,7 +151,7 @@ editedCountLGG <- preprocess_assay(editedCountLGG)
In this case we want to filter out genes located on the Y and X chromosome because they would be flagged as bimodal genes due to the gender. It is necessary that the summarized experiment has also a GRanges object.
By applying *preprocess_assay* duplicated genes and genes with less than 5% of detected samples are discarded with a overall read count lower than 5% of samples having at least 5. Also at least 10% of all samples that have reads must have more than 5 reads for that gene because a gene is considered as expressed when it has at least a count of 5-10 in a library.

Some patients have multiple existing samples in the expression matrix. We want to have only one enty per patient so we need to merge those additional samples by either calculating the mean or median of all samples belonging to one patient.
Some patients have multiple existing samples in the expression matrix. We want to have only one entry per patient so we need to merge those additional samples by either calculating the mean or median of all samples belonging to one patient.
```{r eval = FALSE}
editedCountLGG <- merge_duplicated(editedCountLGG, assay_index = 1,
method = "median", shorten_names = T, original_names = T, parallel = T) #combine duplicated genes
@@ -183,7 +192,7 @@ colnames(methylLGG) <- substr(colnames(methylall), 1, 12) #shorten IDs
# Finding Bimodal Genes

## Using SIBERG
After preprocessing we are set to analyse the data, starting with finding bimodal genes. Our matrix has several thousand genes and with the help of [SIBERG](https://cran.r-project.org/web/packages/SIBERG/index.html) we can prefilter possible bimodal candidates relatively fast.
After preprocessing, we are set to analyse the data. Starting with the identification bimodal genes. Our matrix has several thousand genes and with the help of [SIBERG](https://cran.r-project.org/web/packages/SIBERG/index.html) we can prefilter possible bimodal candidates relatively fast.

```{r eval = FALSE}
siberLGG <- use_siber(editedCountLGG, 1) #use SIBER in parallel
@@ -218,8 +227,10 @@ Those flaws have to be corrected.
```{r eval = FALSE}
correctedClust <- verify_mclust(resultClust, simple = T) #look for misclassification
correctedClust <- verify_mclust(correctedClust, min_proportion = 0.10) #look for imbalanced models
correctedClust <- verify_mclust(resultClust, simple = T) #look for misclassification# again
```
The function *correctedClust* iterates through the list and searches for flawed results. It has two modi. The first one without the parameter *min_proportion* looks for misclassfied samples. With that parameter, it searches for results with imbalanced proportion of the modes. In this example we want to have a model where at least 10% of the samples are assigned to one mode. If the function encounters inconsistencies or flaws, if will call *use_mclust* again with different parameters.
The function *correctedClust* iterates through the list and searches for flawed results. It has two modes. The first one without the parameter *min_proportion* looks for misclassified samples. With that parameter, it searches for results with imbalanced proportion of the modes. In this example we want to have a model where at least 10% of the samples are assigned to one mode. If the function encounters inconsistencies or flaws, if will call *use_mclust* again with different parameters.
Next up we filter results that returned an unimodal model and models with a low fold change.

```{r eval = FALSE}
@@ -359,8 +370,37 @@ resMatrixCombined <- do.call(rbind, args = resMatrixCombined)
```

## Multinomial Regression
Besides doing a binomial regression it is also possible to allow for multiple outcome levels. The function *use_glmnet* and the relevant functions can handle do multinomial regression as well. For example we can test, which genes and variables are related the most to the subtype of the patient.

```{r eval = FALSE}
exprModel <- assay_to_model(editedCountLGG, 2, zscore = T) #first turn gene expression into model matrix with scaled Values
rownames(editedCountLGG) <- editedCountLGG@rowRanges$external_gene_name #change gene names
variables <- c("age_at_diagnosis", "race", "gender", "primary_diagnosis", "paper_Mutation.Count",
"paper_TERT.expression..log2.", "paper_Percent.aneuploidy", "paper_ESTIMATE.combined.score",
"paper_ATRX.status", "paper_MGMT.promoter.status")
multiMatrix <- make_model_matrix(editedCountLGG, predictors = variables)
megedMatrix <- merge_model_matrices(multiMatrix, exprModel)
```
Then you have to make sure, that the Matrix and the outcome variable have the same length and names.
```{r eval = FALSE}
multiclass <- as.vector(editedCountLGG$paper_IDH.codel.subtype)
names(multiclass) <- colnames(editedCountLGG) #give it names
multiclass <- multiclass[!is.na(multiclass)] #omit NAs
intersection <- dplyr::intersect(rownames(megedMatrix), names(multiclass))
multimatrix <- megedMatrix[intersection,] #subset
multiclass <- factor(multiclass[intersection]) #subset
```
Then you can apply glmnet to the data.

```{r eval = FALSE}
multimodelResult <- use_glmnet(classification = multiclass,
model_matrix = megedMatrix, type.multinomial = "grouped", fast_cv = F,
family = "multinomial", standardize = F, repeats = 10, stratify_class = T)
```
From the result one can obtain the active coefficients.
```{r eval = FALSE}
activeCoef <- get_active_coef(multimodelResult)
```
## Survival Analysis
This package is able to conduct survival analysis. Within this implementation, glmnet and the included regularized Cox regression (Coxnet) is used. Coxnet fits a Cox regression model that is regularized by the elastic net penalty. The regression model describes the relation between an event (e.g. death) and a set of covariates as expressed by the hazard function. For that were looking at the gene expression and with addition of clinical variables. A nested cross-validation is used.

@@ -370,7 +410,7 @@ joinedModel <- merge_model_matrices(modelMatrix, exprModel)
#apply cox regression
resultSurvival <- use_cv_coxglmnet(editedCountLGG, model_matrix = joinedModel, stratify = T, k_inner_loop = 5, k_outer_loop = 8, verbose = T, add_survival = F, seed = 15)
```
The inner cross validation will be used for finding the optimal value for the penality factor lambda, while the outer loop is used to estimate the prediction accuracy with the C-index. Also a final model is build to calculate the hazard ratios/risk values and build risk groups.
The inner cross validation will be used for finding the optimal value for the penalty factor lambda, while the outer loop is used to estimate the prediction accuracy with the C-index. Also a final model is build to calculate the hazard ratios/risk values and build risk groups.

As a result one obtains a list with 4 elements. First one is a vector of C (auc) statistics. Second, the whole glmnet fit. Third, risk values of all patients according to beta values oft the model. And fourth, the final survival or censoring times. To see which variables are active in the final model, one can extract them out of the results.
```{r eval = FALSE}
58 changes: 56 additions & 2 deletions vignettes/Using_mmRmeta.md
Original file line number Diff line number Diff line change
@@ -28,10 +28,20 @@ hazard models.
Before doing the data analysis you need to install and load the package.

``` r
devtools::install.package("mmRmeta") # link to package
devtools::install.github("sebastianlieske/mmRmeta", host = "github.molgen.mpg.de/apu/v3") # link to package
library(mmRmeta)
```

There are 7 other packages that are needed that can be downloaded from
Bioconductor.

``` r
requiredBio <- c("Biobase", "biomaRt", "edgeR", "GenomicRanges", "SummarizedExperiment", "EnsDb.Hsapiens.v86", "ComplexHeatmap")
for (i in requiredBio) {
BiocManager::install(i)
}
```

# Preprocessing

Let us start with the preprocessing of the data. To proceed this
@@ -164,6 +174,7 @@ column is creating trouble.

``` r
#delete column "treatments" because its leading to errors and it contains no information
library(mmRmeta)
if ("treatments" %in% colnames(colData((rnaCountLGG)))) {
rnaCountLGG$treatments <- NULL
}
@@ -187,7 +198,7 @@ solid tumors and filter out metastatic samples.
editedCountLGG <- filter_for_factor(rnaCountLGG,
column_value = "Primary solid Tumor") #only keep primary solid tumours
editedCountLGG <- preprocess_coldata(editedCountLGG)
editedCountLGG <- filter_columns_with_na(editedCountLGG, unnest = T, col_char = "_id", percentage = 0.8, keep = "days_to_death") #drop columns with 80% missing values and drop columns with _id as part of their name
editedCountLGG <- filter_columns_with_na(editedCountLGG, unnest = T, percentage = 0.8, keep = "days_to_death") #drop columns with 80% missing values and drop columns with _id as part of their name
editedCountLGG <- factorize_columns(editedCountLGG)
editedCountLGG <- drop_unused_levels(editedCountLGG)
editedCountLGG <- filter_small_binary(editedCountLGG, min_counts = 10) #drop binary factor columns where one level has less than 10 counts
@@ -382,6 +393,7 @@ Those flaws have to be corrected.
``` r
correctedClust <- verify_mclust(resultClust, simple = T) #look for misclassification
correctedClust <- verify_mclust(correctedClust, min_proportion = 0.10) #look for imbalanced models
correctedClust <- verify_mclust(resultClust, simple = T) #look for misclassification# again
```

The function *correctedClust* iterates through the list and searches for
@@ -626,6 +638,48 @@ resMatrixCombined <- do.call(rbind, args = resMatrixCombined)

## Multinomial Regression

Besides doing a binomial regression it is also possible to allow for
multiple outcome levels. The function *use\_glmnet* and the relevant
functions can handle do multinomial regression as well. For example we
can test, which genes and variables are related the most to the subtype
of the patient.

``` r
exprModel <- assay_to_model(editedCountLGG, 2, zscore = T) #first turn gene expression into model matrix with scaled Values
rownames(editedCountLGG) <- editedCountLGG@rowRanges$external_gene_name #change gene names
variables <- c("age_at_diagnosis", "race", "gender", "primary_diagnosis", "paper_Mutation.Count",
"paper_TERT.expression..log2.", "paper_Percent.aneuploidy", "paper_ESTIMATE.combined.score",
"paper_ATRX.status", "paper_MGMT.promoter.status")
multiMatrix <- make_model_matrix(editedCountLGG, predictors = variables)
megedMatrix <- merge_model_matrices(multiMatrix, exprModel)
```

Then you have to make sure, that the Matrix and the outcome variable
have the same length and names.

``` r
multiclass <- as.vector(editedCountLGG$paper_IDH.codel.subtype)
names(multiclass) <- colnames(editedCountLGG) #give it names
multiclass <- multiclass[!is.na(multiclass)] #omit NAs
intersection <- dplyr::intersect(rownames(megedMatrix), names(multiclass))
multimatrix <- megedMatrix[intersection,] #subset
multiclass <- factor(multiclass[intersection]) #subset
```

Then you can apply glmnet to the data.

``` r
multimodelResult <- use_glmnet(classification = multiclass,
model_matrix = megedMatrix, type.multinomial = "grouped", fast_cv = F,
family = "multinomial", standardize = F, repeats = 10, stratify_class = T)
```

From the result one can obtain the active coefficients.

``` r
activeCoef <- get_active_coef(multimodelResult)
```

## Survival Analysis

This package is able to conduct survival analysis. Within this
Binary file added vignettes/classifiedLGG.RDS
Binary file not shown.

0 comments on commit 837de1d

Please sign in to comment.