Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
multimodalR/vignettes/multimodalR.Rmd
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
931 lines (740 sloc)
37.9 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "An introduction to the multimodalR package" | |
author: "Svenja Kristina Beck" | |
date: "`r Sys.Date()`" | |
#output: rmarkdown::word_document | |
output: | |
html_document: | |
toc: true | |
number_sections: true | |
theme: readable | |
highlight: haddock | |
# output: rmarkdown::word_document | |
# output: pdf_document | |
vignette: > | |
%\VignetteIndexEntry{An introduction to the multimodalR package} | |
%\VignetteEngine{knitr::rmarkdown} | |
\usepackage[utf8]{inputenc} | |
--- | |
<!-- Vignettes are long form documentation commonly included in packages. Because they are part of the distribution of the package, they need to be as compact as possible. The `html_vignette` output type provides a custom style sheet (and tweaks some options) to ensure that the resulting html is as small as possible. The `html_vignette` format: --> | |
<!-- - Never uses retina figures --> | |
<!-- - Has a smaller default figure size --> | |
<!-- - Uses a custom CSS stylesheet instead of the default Twitter Bootstrap style --> | |
<!-- ```{r knitr-options, echo = FALSE, message = FALSE, warning = FALSE} --> | |
<!-- # To render an HTML version that works nicely with github and web pages, do: --> | |
<!-- # rmarkdown::render("vignettes/multimodalR.Rmd", "all") --> | |
<!-- knitr::opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, --> | |
<!-- dev = 'png') --> | |
<!-- ``` --> | |
# Introduction | |
Welcome to multimodalR! multimodalR is an R package to simulate | |
multimodal gene expression data, to evaluate different algorithms | |
detecting multimodality and to detect bimodality or multimodality | |
in gene expression data sets. | |
This vignette gives an overview and introduction to multimodalR's functionalities. | |
It was written in R Markdown, using the knitr package for production. | |
See help(package="multimodalR") for further details and references provided by citation("multimodalR"). | |
# Installation | |
To install the most recent development version from Github use: | |
```{r install-github, eval = FALSE} | |
# biocLite("loosolab/multimodalR", dependencies = TRUE,build_vignettes = TRUE) | |
``` | |
# Quickstart | |
Assuming you want to simulate gene expression data from scratch | |
there are two simple steps to create a simulated data set with multimodalR. | |
Here is an example: | |
```{r quickstart,eval = FALSE} | |
# Load package | |
library(multimodalR) | |
# Create parameter object | |
params <- newParams() | |
# Simulate data and generating the validationData behind it | |
simulation <- simulateExpression(params = params) | |
validationData <- simulation$validationData | |
expression <- simulation$expressionData | |
``` | |
These steps will be explained in detail in the following sections but in summary | |
the first step generates a Params object containing the parameters for | |
the simulation. The second step takes those parameters and simulates a new data set | |
containing the simulated gene expressions. | |
# The general background | |
When observing gene expression data, mostly 3 different scenarios can be observed: | |
1. a unimodal distribution of the gene expression | |
2. a multimodal (often bimodal) distribution of the gene expression, | |
where lesser and greater expression are observed | |
3. a multimodal (often bimodal) distribution of the gene expression, | |
where a nonexistent expression (0, or nearly 0) | |
and higher expressions are observed | |
For our simulation the first scenario is generated with a Gaussian distribution. | |
The second scenario is generated with multiple Gaussian distributions. | |
The third scenario is generated with a Gamma distribution and multiple Gaussian distributions. | |
Examples for bimodal gene expression: | |
* The gene expression of Plexin-B1 is bimodal in ErbB-2 over-expressing cells ([Worzfeld et al.]) | |
# Workflow | |
The following section describes how multimodalR is used. The framework and all its properties are described and used in examples. | |
{width=650px} | |
## Parameters | |
The parameters required for the gene expression simulation are briefly described here: | |
* **Global parameters** | |
* `nGenes` - A list containing the numbers of genes to simulate for each scenario and modality. | |
* `nPatients` - The number of patients to simulate and | |
how many columns the simulated gene expression should have per gene. | |
* `seed` - Seed to use to generate random numbers. | |
* **Distribution parameters** | |
* `means` - A list containing the parameter range for the mean for each scenario and modality. | |
* `foldChanges` - A list containing the parameter range for the foldChanges between distributions for multimodal genes. Has to be one less than the modality that is simulated. | |
* `sd` - Parameters for generating the standard deviations via an Inverse Gaussian distribution | |
* `gamma` - Shape and rate parameter to generate the gamma distributions. | |
* `proportions` - Parameters for the allowed proportions of multimodal distributions. | |
## The `Params` Object | |
All the parameters to simulate gene expression data | |
are stored in a `Params` object. | |
Let's create a new one and see what it looks like. | |
```{r Params,eval = FALSE} | |
params <- newParams() | |
params | |
``` | |
As well as telling us what type of object we have ("An object of class | |
`Params`") and showing us the values of the parameters | |
this output gives us some extra information. | |
We can see which parameters have been changed from their | |
default values (those in ALL CAPS). | |
### `Params` object explained & Template | |
The names "1","2","3",... stand for the modality of the scenario. If you want to simulate gene expression with a modality > 3, it is very helpful to use the following template: | |
```{r params-Template,eval = FALSE} | |
#template for highest modality >= 3 | |
params <- newParams( | |
nGenes=list( | |
"1"=c(300), | |
"2"=list(gauss = 150, gamma = 150), | |
"3"=list(gauss = 150, gamma = 150)#add more rows | |
), | |
means=list( | |
"1"= c(2, 4), | |
"2"= c(2, 4), | |
"3"= c(2, 4) | |
), | |
foldChanges = list( | |
"1"= NA, | |
"2"= list(gauss = c(2, 4), gamma = c(2, 4)), | |
"3"= list(gauss = list(c(2, 4), c(2, 4)), gamma = list(c(2, 4), c(2, 4))) | |
), | |
sd = list( | |
"mu" = 0.61, | |
"lambda" = 2.21 | |
), | |
gamma = list( | |
shape = 2, | |
rate = 2 | |
), | |
proportions = c(10,20,30,40,50,60,70,80,90), | |
nPatients = c(200) | |
) | |
``` | |
CAUTION: if you want to simulate only certain modalities (e.g. 2,4), you have to use blanks for the others: | |
```{r params-TemplateOnlyCertainModailities,eval = FALSE} | |
params <- newParams( | |
nGenes=list( | |
"1"= c(0), | |
"2"= list(gauss = 10, gamma = 10), | |
"3"= list(gauss = 0, gamma = 0), | |
"4"= list(gauss = 10, gamma = 10) | |
), | |
means=list( | |
"1"= 0, | |
"2"= c(2, 4), | |
"3"= 0, | |
"4"= c(2,4) | |
), | |
foldChanges = list( | |
"1"= NA, | |
"2"= list(gauss = c(2, 4), gamma = c(2, 4)), | |
"3"= NA, | |
"4"= list(gauss = list(c(2, 4),c(2,4),c(2,4)), gamma = list(c(2, 4),c(2,4),c(2,4))) | |
), | |
sd = list( | |
"mu" = 0.61, | |
"lambda" = 2.21 | |
), | |
gamma = list( | |
shape = 2, | |
rate = 2 | |
), | |
proportions = c(10,20,30,40,50,60,70,80,90), | |
nPatients = c(200) | |
) | |
``` | |
### Getting and setting | |
If we want to look at a particular parameter, for example the number of genes to | |
simulate, we can extract it using the `getParam` function: | |
```{r getParam,eval = FALSE} | |
getParam(params, "nGenes") | |
``` | |
Alternatively, to give a parameter a new value we can use the `setParam` | |
function: | |
```{r setParam,eval = FALSE} | |
params <- setParam(params, "nGenes", list("1" = 100,"2"=list("gauss"=150,"gamma"=200))) | |
getParam(params, "nGenes") | |
``` | |
If we want to extract multiple parameters (as a list) or set multiple parameters | |
we can use the `getParams` or `setParams` functions: | |
```{r getParams-setParams,eval = FALSE} | |
# Set multiple parameters at once (using a list) | |
params <- setParams(params, | |
update = list("nGenes" = list("1" = 100,"2"=list("gauss"=150,"gamma"=200)), | |
"nPatients" = 300)) | |
# Extract multiple parameters as a list | |
getParams(params, c("nGenes", "gamma")) | |
# Set multiple parameters at once (using additional arguments) | |
params <- setParams(params, | |
"nGenes" = list("1" = 100,"2"=list("gauss"=150,"gamma"=200)), | |
"nPatients" = 300) | |
params | |
``` | |
The parameters which have changed are now shown in ALL CAPS to indicate that they have | |
been changed from the default. | |
We can also set parameters directly when we call `newParams`: | |
```{r newParams-set,eval = FALSE} | |
# # A basic example for simulating unimodal, bimodal and trimodal distributions | |
params <- newParams( | |
nGenes=list("1"=c(300), | |
"2"=list(gauss = 150, gamma = 150), | |
"3"=list(gauss = 150, gamma = 150)), | |
means=list("1"= c(2, 4), | |
"2"= c(2, 4), | |
"3"= c(2, 4)), | |
foldChanges = list("1"= NA, | |
"2"= list(gauss = c(2, 4), | |
gamma = c(2, 4)), | |
"3"= list(gauss = list(c(2, 4), c(2, 4)), | |
gamma = list(c(2, 4), c(2, 4)))), | |
nPatients=50) | |
params | |
A Params object of class Params | |
Parameters can be 'Default' or 'NOT DEFAULT'. | |
Global: | |
NGENES | |
10, list(gauss = 10, gamma = 10) | |
MEANS | |
c(2, 4), c(2, 4) | |
FOLDCHANGES | |
NA, list(gauss = c(2, 4), gamma = c(2, 4)) | |
sd | |
0.61, 2.21 | |
Gamma | |
2, 2 | |
Proportions | |
10, 20, 30, 40, 50, 60, 70, 80, 90 | |
NPATIENTS | |
50 | |
SEED | |
61414 | |
``` | |
To change the parameters, please consider the following table. | |
*Type* shows with which type the slot has to be filled, | |
*Usage* shows further information for the usage of the values, | |
*Default values* describes the default values used to generate a `Params` | |
object. | |
```{r params-table, eval=FALSE} | |
Type Usage Default values | |
nGenes list values: unimodal,bimodal (gauss,gamma),... 700, list(gauss = 150, gamma = 150) | |
means list values: unimodal,bimodal (gauss),... c(2, 4), c(2, 4) | |
foldChanges list values: unimodal,bimodal (gauss,gamma),... c(2, 4), list(gauss = c(2, 4), gamma = c(3, 5)) | |
sd list use values as µ,lambda 0.61, 2.21 | |
gamma list use values as shape,rate 2, 2 | |
proportions vector pick random value from vector c(10,20,30,40,50,60,70,80,90) | |
nPatients numeric - 100 | |
seed numeric - sample(1:1e6,1) | |
``` | |
## The gene expression simulation | |
Now that we looked at the Params Objects and saw which parameters it contains, | |
let's look at how multimodalR simulates gene expression data. | |
The core of the simulation is the validationData, which is created in the simulation | |
process with the parameters from the created Params object. | |
As the result of the simulation process, the simulated gene expression matrix | |
is returned with the validationData in a list object. | |
The simulation of a gene expression data set requires the parameters (which are | |
stored in the Params object) and the validationData behind the simulation (which was | |
not created yet). | |
The Params object was already created and if needed, default values were | |
changed. The validationData is automatically created when the simulate() command is | |
executed. | |
### validationData | |
The structure of the validationData is the following: | |
* each simulated gene has its own entry in the validationData and can be accessed with validationData$[GeneName] | |
* each entry contains the following information: | |
* modus - modality of this gene shown as a number (1 = unimodal, 2 = bimodal, ...) | |
* scenario - scenario of this gene, either ("unimodal-gaussian","multimodal-gaussian" or "multimodal-gamma+gaussian") | |
* means - means of this gene | |
* foldChanges - foldChanges between the means (NA for unimodal genes) | |
* sds - standard deviation of each mean | |
* proportions - proportions of the different distributions | |
* sizes - sizes of the different groups | |
* groups - patientNames are listed in the corresponding group | |
To generate the validationData, a `Params` object is needed. The parameters from this | |
object are used for generating the validationData by randomizing values starting with a | |
generated `seed`. This is done by drawing values from the `Params` object. | |
Mean values are generated from a span of values, standard | |
deviations are generated by using an inversed Gaussian distribution from the package [LaplacesDemon].That is, | |
because over a large data set a inverseGaussian-like distribution of all | |
standard deviations was noticed. | |
Because of those needed parameters the validationData looks for example like that: | |
```{r validationData-example,eval=FALSE} | |
validationData$Gene0001 | |
$modus | |
[1] 2 | |
$scenario | |
[1] "multimodal-gaussian" | |
$means | |
[1] 3.561552 7.309845 | |
$foldChanges | |
[1] 2.052433 | |
$sds | |
[1] 0.2826536 0.6026159 | |
$proportions | |
[1] 20 80 | |
$sizes | |
[1] 10 40 | |
$groups | |
$groups[[1]] | |
[1] "Patient0001" "Patient0002" "Patient0003" "Patient0004" | |
[5] "Patient0005" "Patient0006" "Patient0007" "Patient0008" | |
[9] "Patient0009" "Patient0010" | |
$groups[[2]] | |
[1] "Patient0011" "Patient0012" "Patient0013" "Patient0014" | |
[5] "Patient0015" "Patient0016" "Patient0017" "Patient0018" | |
[9] "Patient0019" "Patient0020" "Patient0021" "Patient0022" | |
[13] "Patient0023" "Patient0024" "Patient0025" "Patient0026" | |
[17] "Patient0027" "Patient0028" "Patient0029" "Patient0030" | |
[21] "Patient0031" "Patient0032" "Patient0033" "Patient0034" | |
[25] "Patient0035" "Patient0036" "Patient0037" "Patient0038" | |
[29] "Patient0039" "Patient0040" "Patient0041" "Patient0042" | |
[33] "Patient0043" "Patient0044" "Patient0045" "Patient0046" | |
[37] "Patient0047" "Patient0048" "Patient0049" "Patient0050" | |
``` | |
To understand why all those parameters are needed, it is essential to | |
understand how the validationData is used to simulate the gene expression. | |
### Simulation | |
With the validationData the gene expression matrix is simulated. | |
For each gene (each entry of the validationData) the parameters in the validationData are used to simulate | |
the gene expression. | |
How many columns the gene expression matrix will have depends on the number of | |
patients. Per patient one gene expression is simulated. | |
If a negative value was simulated, it is changed to be 0 in the simulation process, | |
as normally no negative gene expressions can be observed. | |
How the gene expression is simulated depends on the modus. | |
The "modus" is the number of occurring distributions within one gene. | |
####Modus "1" - Unimodal gaussian distribution | |
Gene expression for the unimodal gaussian distribution (modus "1") is | |
simulated from the mean and the standard deviation taken from the | |
validation data for this gene. The Gaussian distribution is calculated | |
with [rnorm]. The number of values that are simulated for each distribution | |
equals the number of Patients {"nPatients"}. | |
All other modi consist of two different scenarios: | |
In the "gauss" scenario all occurring distributions are Gaussian distributions | |
and calculated with [rnorm]. | |
In the "gamma" scenario the first occuring distribution is a Gamma distribution | |
and is calculated with [rgamma]. | |
All other distributions are gaussian distributions and are calculated with [rnorm]. | |
####Modus "2" or higher - Bimodal distributions or higher modalities | |
For each distribution gene expressions of the multimodal genes are | |
automatically simulated. | |
#####Scenario "gauss" | |
The different gaussian distributions are simulated with the corresponding mean | |
and a standard deviation and number of values taken from the validation data | |
for this gene. | |
#####Scenario "gamma" | |
The gamma distribution is simulated with the gamma shape and rate values from | |
the params object. The other gaussian distributions are simulated with the values | |
from the validation data. | |
####Exemplary Gene Expression Matrix | |
After using the `simulateExpression()` command, the gene expression matrix looks like | |
this: | |
```{r expression-sample, eval=FALSE} | |
Patient0001 Patient0002 Patient0003 Patient0004 Patient0005 | |
Gene0001 3.9674492 2.80396473 3.78980664 4.095237721 4.00587345 | |
Gene0002 4.1045561 2.26303081 3.75889335 3.522737760 3.02592372 | |
Gene0003 2.4582632 2.22295497 3.17911283 2.947560965 2.74938887 | |
Gene0004 5.9032577 5.95497476 4.75760279 5.161985499 3.70910216 | |
Gene0005 3.7431061 3.68041039 3.62509463 3.740929320 3.66194851 | |
Gene0006 3.0784338 1.79963109 2.16929818 1.160848158 3.49503635 | |
Gene0007 0.7993267 2.21868266 2.62367680 0.152856110 1.58145211 | |
Gene0008 2.0018416 2.22588333 2.90464907 3.049822819 2.28450055 | |
Gene0009 2.2986108 2.08720943 2.02138626 2.407369792 2.00220259 | |
Gene0010 2.1793609 2.85387502 2.23943177 2.423610586 2.35072765 | |
``` | |
### Executing the simulation | |
As the `Params`object can be generated before starting the | |
simulation, the `simulateExpression()` command can be used in two different ways: | |
1. with a created params object (recommended) | |
2. without a created params object (not recommended) | |
The validationData is automatically generated when using the `simulateExpression()` command. | |
The `simulateExpression()` command returns a list object, which contains the created | |
validationData and the simulated expressionData. | |
###Simulating the gene expression | |
#### 1. With a created params object(recommended) | |
It is recommended to create a `Params` object before using the `simulateExpression()` | |
command, because then you can change the parameters and have the params object | |
saved in your global environment. | |
```{r simulateExpression-with-params,eval = FALSE} | |
#with messages printed to console | |
params <- newParams() | |
simulation <- simulateExpression(params = params) | |
#without messages printed to console | |
simulation <- simulateExpression(params = params,verbose = FALSE) | |
#storing the validation data and the simulated gene expression data as | |
#separate data.frames | |
validationData <- simulation$validationData | |
expression <- simulation$expressionData | |
``` | |
#### 2. Without a created params object (not recommended) | |
If the simulation should be executed with default parameters only (or with | |
only certain parameters changed) and you do not mind, having no separate | |
`Params`object in your global environment to be able to look up the different | |
parameters (especially in the case that you changed them), you can use this | |
commands, though it is not recommended, as it is useful to have a `Params` | |
object to look up the parameters behind the validationData. | |
In this case, without a created `Params`object the validationData is not saved as an | |
Rdata object, so there is no way of keeping track of changes to your params object. | |
```{r simulateExpression-without-params,eval = FALSE} | |
#with default parameters and messages printed to console | |
params <- newParams() | |
simulation <- simulateExpression(params = newParams()) | |
#with one changed parameter and messages not printed to console | |
simulation <- simulateExpression(params = newParams("seed"=23),verbose = FALSE) | |
#with some changed parameters and messages printed to console | |
simulation <- simulateExpression(params = newParams("seed"=23,"nPatients"=200)) | |
#storing the validation data and the simulated gene expression data as | |
#a list and a data.frame | |
validationData <- simulation$validationData | |
expression <- simulation$expressionData | |
``` | |
## The unified output data format | |
The output data format is a unified data format, that is needed to work with | |
and use all of the evaluation functions. The output of all implemented | |
algorithms matches the unified data format. | |
The unified data format has the following structure: | |
The data format is structured in three parts: | |
* the Genes section, | |
* the Clinical Data section and | |
* expression matrix section | |
The different parts are accessible with: | |
```{r accessing-the-different-sections, eval=FALSE} | |
mclustOutput$Genes[1:10] #mclustOutput$Genes for all genes | |
mclustOutput$ClinicalData | |
mclustOutput$Expressionmatrix | |
``` | |
The Genes section has many parts. For each gene a separate section is generated. | |
Each gene section contains the following information, which were calculated | |
by using one of the algorithms: | |
* $modus - the modus is the number that equals the found number of occurring distributions | |
within one gene (the modality) | |
* $means - the means section contains the calculated mean values for each modus | |
* $sds - the sds section contains the calculated standard deviation values for each modus | |
* $sizes - the sizes section contains the calculated size of each group | |
* $groups - the groups section contains the patients that are assigned to the different groups | |
The Genes can be accessed by using: | |
```{r accessing-a-specific-gene-section, eval=FALSE} | |
mclustOutput$Genes$Gene0001 | |
#Accessing a specific section within the genes is possible by using the $ tags | |
mclustOutput$Genes$Gene0001$modus | |
mclustOutput$Genes$Gene0001$means | |
mclustOutput$Genes$Gene0001$sds | |
mclustOutput$Genes$Gene0001$sizes | |
mclustOutput$Genes$Gene0001$groups | |
output | |
$Genes | |
$Genes$Gene0001 | |
$Genes$Gene0001$modus | |
[1] 2 | |
$Genes$Gene0001$means | |
[1] 3.613995 7.221655 | |
$Genes$Gene0001$sds | |
[1] 0.1851148 0.5368126 | |
$Genes$Gene0001$sizes | |
[1] 10 40 | |
$Genes$Gene0001$groups | |
$Genes$Gene0001$groups[[1]] | |
[1] "Patient0001" "Patient0002" "Patient0003" "Patient0004" | |
[5] "Patient0005" "Patient0006" "Patient0007" "Patient0008" | |
[9] "Patient0009" "Patient0010" | |
$Genes$Gene0001$groups[[2]] | |
[1] "Patient0011" "Patient0012" "Patient0013" "Patient0014" | |
[5] "Patient0015" "Patient0016" "Patient0017" "Patient0018" | |
[9] "Patient0019" "Patient0020" "Patient0021" "Patient0022" | |
[13] "Patient0023" "Patient0024" "Patient0025" "Patient0026" | |
[17] "Patient0027" "Patient0028" "Patient0029" "Patient0030" | |
[21] "Patient0031" "Patient0032" "Patient0033" "Patient0034" | |
[25] "Patient0035" "Patient0036" "Patient0037" "Patient0038" | |
[29] "Patient0039" "Patient0040" "Patient0041" "Patient0042" | |
[33] "Patient0043" "Patient0044" "Patient0045" "Patient0046" | |
[37] "Patient0047" "Patient0048" "Patient0049" "Patient0050" | |
$ClinicalData | |
[1] "mnt/data/clinicalData.Rdata" | |
$Expressionmatrix | |
[1] "mnt/data/expression.Rdata" | |
``` | |
If this package is used with an algorithm which was not implemented in this package, | |
the output has to be changed to match this unified data format. | |
Then, all the evaluation functions will work with the output of not implemented algorithms. | |
### Requirements for implementing new algorithms | |
The output of algorithms that are implemented in multimodalR have to be in the unified output data format. | |
Whether the output fulfils all requirements of the unified output data format can be easily checked | |
with the 'isValidUnifiedOuputDataFormat()' function. | |
```{r isValidUnifiedOutputDataFormat, eval=FALSE} | |
isValidUnifiedOutputDataFormat(output = output,expressionmatrix = expression) | |
``` | |
The requirements are listed as the structure of the unified output data format is constructed: | |
* output | |
* should be of type 'list' | |
* should hold the 'Genes','ClinicalData' and 'Expressionmatrix' section | |
* should have a length of 3 | |
* ClinicalData | |
* should be of type 'character' | |
* should have a length of 1 | |
* ExpressionMatrix | |
* should be of type 'character' | |
* should have a length of 1 | |
* Genes | |
* should be of type 'list' | |
* should have a length of nrow(expressionmatrix) | |
* EACH GENE | |
* modus | |
* should be of type 'integer' | |
* means | |
* should be of type 'double' | |
* should have a lenth of modus | |
* sds | |
* should be of type 'double' | |
* should have a lenth of modus | |
* sizes | |
* should be of type 'integer' | |
* should have a lenth of modus | |
* sum should equal ncol(expressionmatrix) | |
* groups | |
* should be of type 'list' and should have a lenth of modus | |
* EACH GROUP | |
* should be of type 'character' | |
* size of each group should correspond to sizes[[groupnumber]] | |
If all of those requirements are fulfilled, the function should return 'TRUE', if any of the requirements is not fulfilled, the function returns 'FALSE'. The messages printed to the console will help to determine which requirements are not fulfilled yet. | |
## Usage of the algorithms | |
In this package three algorithms that can be used for detecting multimodality, | |
are implemented. The mclust algorithm that can be used with useMclust() is the one recommended to detect multimodal genes in gene expression data sets. | |
### mclust | |
Mclust uses a finite Gaussian mixture modeling fitted via EM algorithm for | |
model-based clustering, classification, and density estimation, including | |
Bayesian regularization and dimension reduction. | |
For more information look at the vignette of [mclust] "A quick tour of mclust" | |
#### mclust References | |
Scrucca, L. et al. (2016) ‘mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models’, | |
The R journal, 8(1), pp. 289–317. | |
C. Fraley and A. E. Raftery (2002). Model-based clustering, discriminant | |
analysis, and density estimation. Journal of the American Statistical | |
Association 97:611:631. | |
### HDBSCAN | |
The HDBSCAN (Hierarchical Density-based | |
spatial clustering of applications with noise) clustering algorithm | |
computes the hierarchical cluster tree representing density estimates | |
along with the stability-based flat cluster extraction proposed by Campello et | |
al. (2013). HDBSCAN essentially computes the hierarchy of all DBSCAN* | |
clusterings, and then uses a stability-based extraction method to find optimal | |
cuts in the hierarchy, thus producing a flat solution. | |
For more information look at the way [HDBSCAN] works or use the documentation | |
at [HDBSCANdocs] | |
#### HDBSCAN References | |
Campello, R.J.G.B., Moulavi, D. and Sander, J. (2013) ‘Density-Based Clustering Based on Hierarchical Density Estimates’, | |
in Pei, J. et al. (eds.) Advances in knowledge discovery and data mining: 17th Pacific Asia Conference, | |
PAKDD 2013, Gold Coast, Australia, April 14-17, 2013 ; proceedings. | |
(Lecture Notes in Computer Science, 7819 : Lecture notes in artificial intelligence). Berlin: Springer, pp. 160–172 | |
Ester, M. et al. (1996) ‘A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise’, | |
Proceedings of the Second International Conference on Knowledge Discovery and Data Mining: AAAI Press, | |
pp. 226–231. Available at: http://dl.acm.org/citation.cfm?id=3001460.3001507. | |
McInnes, L., Healy, J. and Astels, S. (2016) How HDBSCAN Works. | |
Available at: https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html. | |
McInnes, L., Healy, J. and Astels, S. (2017) ‘hdbscan: Hierarchical density based clustering’, | |
The Journal of Open Source Software, 2(11), p. 205. doi: 10.21105/joss.00205 | |
McInnes, L. and Healy, J. (2017) ‘Accelerated Hierarchical Density Based Clustering’, | |
2017 IEEE International Conference on Data Mining Workshops (ICDMW), | |
pp. 33–42. doi: 10.1109/ICDMW.2017.12 | |
### flexmix | |
FlexMix implements a general framework for finite mixtures of regression | |
models. Parameter estimation is performed using the EM algorithm: the E-step | |
is implemented by flexmix, while the user can specify the M-step. | |
For more information look at the vignette of [flexmix] "flexmix-intro" | |
#### flexmix References | |
Friedrich Leisch. FlexMix: A general framework for finite mixture models and | |
latent class regression in R. Journal of Statistical Software, 11(8), 2004. | |
doi:10.18637/jss.v011.i08 | |
Bettina Gruen and Friedrich Leisch. Fitting finite mixtures of generalized | |
linear regressions in R. Computational Statistics & Data Analysis, 51(11), | |
5247-5252, 2007. doi:10.1016/j.csda.2006.08.014 | |
Bettina Gruen and Friedrich Leisch. FlexMix Version 2: Finite mixtures with | |
concomitant variables and varying and constant parameters Journal of | |
Statistical Software, 28(4), 1-35, 2008. doi:10.18637/jss.v028.i04 | |
### Using the different algorithms | |
To use the different algorithms, it is only needed to have created a simulated | |
expression data frame (which can be done with the `simulateExpression()` command). | |
Of course, it is also possible to use the different algorithms with a data-set | |
that was created elsewhere, if it has the same structure used in this packag, then | |
it is useful to type in the path to the expression data and clinical data. | |
To use this package for detecting bimodality in your data set, we recommend | |
using the **not clear yet** command, because our tests with a stratified | |
data set (included as the test-data set (**not yet**)) have shown that this | |
algorithm is the most accurate. | |
The output of all implemented algorithms was unified and overall shows the | |
same structure. | |
In the following it is shown, how to use the different implemented algorithms | |
and how the output should roughly look like. | |
####Using mclust | |
To use the mclust algorithm on your expression data frame (simulated or not) | |
and gain information which genes are bimodal use the useMclust() command: | |
```{r useMclust,eval = FALSE} | |
#use mclust with the expression data frame expression and messages printed to the console | |
mclustOutput <- useMclust(expression = expression) | |
#use mclust with the expression data frame expression and messages not printed to the console | |
mclustOutput <- useMclust(expression = expression,verbose = FALSE) | |
``` | |
####Using HDBSCAN | |
To use the HDBSCAN algorithm on your expression data frame (simulated or not) | |
and gain information which genes are bimodal use the useHdbscan() command. | |
Using HDBSCAN also needs the component `minClusterSize`, which acts as a | |
minimum cluster size to detect: | |
```{r useHdbscan,eval = FALSE} | |
#use HDBSCAN with the expression data frame expression | |
hdbscanOutput <- useHdbscan(expression = expression,minClusterSize = 10) | |
``` | |
####Using FlexMix: | |
To use the FlexMix algorithm on your expression data frame (simulated or not) | |
and gain information which genes are bimodal use the useFlexmix() command: | |
```{r useFlexmix,eval = FALSE} | |
#use FlexMix with the expression data frame expression | |
flexmixOutput <- useFlexmix(expression = expression,maxModality = 2) | |
``` | |
## Evaluating the algorithms output | |
After using one or more of the implemented algorithms on the gene expression | |
matrix, the packages functions to evaluate which genes are bimodal can be | |
used. All outputs can be used with the same evaluation functions. | |
Using the `evaluateAlgorithmOutput()` command is the first step for using the | |
`getStatistics()` function. | |
```{r evaluateAlgorithmOutput,eval = FALSE} | |
#evaluating for bimodal genes after using useMclust() | |
mclustEvaluation <- evaluateAlgorithmOutput(modality= 2,algorithmOutput = mclustOutput,validationData = validationData, maxDifference = 10) | |
#evaluating for bimodal genes after using useHdbscan() | |
hdbscanEvaluation <- evaluateAlgorithmOutput(modality= 2,algorithmOutput = hdbscanOutput,validationData = validationData, maxDifference = 10) | |
#evaluating for bimodal genes after using useFlexmix() | |
flexmixEvaluation <- evaluateAlgorithmOutput(modality= 2,algorithmOutput = flexmixOutput,validationData = validationData, maxDifference = 10) | |
#the evaluation output looks like this: | |
head(mclustEvaluation) | |
``` | |
After using the algorithms and using the `evaluateAlgorithmOutput` command, all | |
requirements to create the statistics and classifications were met. | |
### Statistics | |
The `getStatistics()` function calculates the percentage of values that were | |
correctly (*TRUE*) and not correctly (*FALSE*) estimated by one or more of the | |
algorithms. To get this statistic, execute the following code: | |
```{r getStatistic(),eval = FALSE} | |
#getStatistic of one evaluation | |
getStatistics(evaluations = list("mclust"=mclustEvaluation)) | |
#getStatistic of two or more evaluations | |
getStatistics(evaluations = list("mclust"=mclustEvaluation,"HDBSCAN"=hdbscanEvaluation)) | |
``` | |
### Classifications | |
To get the classification of correctly and not correctly estimated scenarios, | |
the `getClassifications()` function can be used. | |
#### Reading the output | |
* FN (FALSE Negatives) - validationData: modality, estimated scenario: not bimodal | |
(unimodal or more than bimodal) | |
* FP (FALSE Positives) - validationData: unimodal, estimated scenario: bimodal | |
* TN (TRUE Negatives) - validationData: unimodal, estimated scenario: unimodal | |
* TP (TRUE Positives) - validationData: bimodal, estimated scenario: bimodal | |
#### Using the `getClassifications()` command | |
```{r getClassifications(),eval = FALSE} | |
#get the classification of one algorithmOutput | |
getClassifications(modality = 2,validationData = validationData,outputs = list("mclust"=mclustOutput)) | |
#get the classification of two or more algorithms | |
getClassifications(modality = 2,validationData = validationData, | |
outputs = list("mclust"=mclustOutput,"HDBSCAN"=hdbscanOutput)) | |
``` | |
## Plotting | |
###Plotting the gene expression | |
In order to plot the gene expression of all genes, four components should be | |
specified: | |
* `expression` - The gene expression matrix that should be plotted | |
* `plotsPerPage` - The number of genes that are to be plotted on the same | |
page. This number has to be a divisor of the number of `GenesToPlot`. | |
Please consider that too many plots on one page make the plots unreadable! | |
* `GenesToPlot` - The number of genes to be plotted. Should be the | |
number of genes. | |
* `plotName` - The name under which a pdf file with the plots will be stored | |
in your folder. The pdf must not be added in the plotName! | |
```{r plotExpression(),eval=FALSE} | |
#This will create a pdf called "expression.pdf" with a plot of each gene on a separate page. | |
plotExpression(expression = expression,plotsPerPage = 1,GenesToPlot = 1000,plotName = "expression") | |
#This will create a pdf called "expression200.pdf" with the first 200 genes plotted with two plots on each page. | |
plotExpression(expression = expression,plotsPerPage = 2,GenesToPlot = 200,plotName = "expression200") | |
``` | |
The plotted gene expression will look like this: | |
 | |
###Plotting the gene expression and output | |
The outputs of all algorithms producing output with the unified output format can be plotted. | |
In order to plot the gene expression with the output, four components should be | |
specified: | |
* `expression` - The gene expression matrix that should be plotted | |
* `algorithmOutput` - The algorithm output that should be plotted | |
* `name` - The name under which a pdf file with the plots will be stored | |
in your folder. The pdf must not be added in the plotName! | |
* `GenesToPlot` - The number of genes to be plotted. Should be the | |
number of genes. | |
```{r plotOutput, eval=FALSE} | |
#Plotting one algorithms output | |
## This function call will create a pdf called "mclustOutput.pdf", where the expression for each gene and the estimated expression of mclust is plotted. | |
## The expression is plotted in black, the algorithms output is plotted in red. | |
plotOutput(expression = expression,algorithmOutput = mclustOutput,name = "mclustOutput",GenesToPlot = nrow(expression)) | |
``` | |
The plotted output with the gene expression will look like this: | |
 | |
###Plotting the statistics | |
The produced statistics can be plotted. | |
In order to plot the statistics, two components should be | |
specified: | |
* `statistics` - The statistics that should be plotted | |
* `name` - The name under which a pdf file with the plots will be stored | |
in your folder. The pdf must not be added in the plotName! | |
```{r plotStatistics, eval=FALSE} | |
# This function call will create a pdf called "statistics.pdf",where the statistics are plotted. | |
plotStatistics(statistics = statistics,name = "statistics") | |
``` | |
The plotted statistics will look like this: | |
 | |
###Plotting the classifications | |
The produced classifications can be plotted. | |
In order to plot the classifications, two components should be | |
specified: | |
* `classifications` - The classifications that should be plotted | |
* `name` - The name under which a pdf file with the plots will be stored | |
in your folder. The pdf must not be added in the plotName! | |
```{r plotClassifications, eval=FALSE} | |
# This function call will create a pdf called "classifications.pdf",where the classifications are plotted. | |
plotClassifications(classifications = classifications,name = "classifications") | |
``` | |
The plotted classifications will look like this: | |
 | |
# Citing multimodalR | |
If you use multimodalR in your work please cite our paper: | |
```{r citation,eval = FALSE} | |
citation("multimodalR") | |
``` | |
# Session information {-} | |
```{r sessionInfo} | |
sessionInfo() | |
``` | |
<!-- [gamma]: https://en.wikipedia.org/wiki/Gamma_distribution --> | |
<!-- [poisson]: https://en.wikipedia.org/wiki/Poisson_distribution --> | |
[Worzfeld et al.]: https://doi.org/10.1172/JCI60568 | |
[rnorm]: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Normal.html | |
[rgamma]: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/GammaDist.html | |
[mclust]: http://127.0.0.1:40225/help/library/mclust/doc/mclust.html | |
[hdbscan]: http://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html | |
[HDBSCANdocs]: http://hdbscan.readthedocs.io/ | |
[flexmix]: https://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf | |
[LaplacesDemon]: https://www.rdocumentation.org/packages/LaplacesDemon/versions/16.0.1/topics/dist.Inverse.Gamma |