Skip to content
Permalink
master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
---
title: "An introduction to the multimodalR package"
author: "Svenja Kristina Beck"
date: "`r Sys.Date()`"
#output: rmarkdown::word_document
output:
html_document:
toc: true
number_sections: true
theme: readable
highlight: haddock
# output: rmarkdown::word_document
# output: pdf_document
vignette: >
%\VignetteIndexEntry{An introduction to the multimodalR package}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
<!-- Vignettes are long form documentation commonly included in packages. Because they are part of the distribution of the package, they need to be as compact as possible. The `html_vignette` output type provides a custom style sheet (and tweaks some options) to ensure that the resulting html is as small as possible. The `html_vignette` format: -->
<!-- - Never uses retina figures -->
<!-- - Has a smaller default figure size -->
<!-- - Uses a custom CSS stylesheet instead of the default Twitter Bootstrap style -->
<!-- ```{r knitr-options, echo = FALSE, message = FALSE, warning = FALSE} -->
<!-- # To render an HTML version that works nicely with github and web pages, do: -->
<!-- # rmarkdown::render("vignettes/multimodalR.Rmd", "all") -->
<!-- knitr::opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, -->
<!-- dev = 'png') -->
<!-- ``` -->
# Introduction
Welcome to multimodalR! multimodalR is an R package to simulate
multimodal gene expression data, to evaluate different algorithms
detecting multimodality and to detect bimodality or multimodality
in gene expression data sets.
This vignette gives an overview and introduction to multimodalR's functionalities.
It was written in R Markdown, using the knitr package for production.
See help(package="multimodalR") for further details and references provided by citation("multimodalR").
# Installation
To install the most recent development version from Github use:
```{r install-github, eval = FALSE}
# biocLite("loosolab/multimodalR", dependencies = TRUE,build_vignettes = TRUE)
```
# Quickstart
Assuming you want to simulate gene expression data from scratch
there are two simple steps to create a simulated data set with multimodalR.
Here is an example:
```{r quickstart,eval = FALSE}
# Load package
library(multimodalR)
# Create parameter object
params <- newParams()
# Simulate data and generating the validationData behind it
simulation <- simulateExpression(params = params)
validationData <- simulation$validationData
expression <- simulation$expressionData
```
These steps will be explained in detail in the following sections but in summary
the first step generates a Params object containing the parameters for
the simulation. The second step takes those parameters and simulates a new data set
containing the simulated gene expressions.
# The general background
When observing gene expression data, mostly 3 different scenarios can be observed:
1. a unimodal distribution of the gene expression
2. a multimodal (often bimodal) distribution of the gene expression,
where lesser and greater expression are observed
3. a multimodal (often bimodal) distribution of the gene expression,
where a nonexistent expression (0, or nearly 0)
and higher expressions are observed
For our simulation the first scenario is generated with a Gaussian distribution.
The second scenario is generated with multiple Gaussian distributions.
The third scenario is generated with a Gamma distribution and multiple Gaussian distributions.
Examples for bimodal gene expression:
* The gene expression of Plexin-B1 is bimodal in ErbB-2 over-expressing cells ([Worzfeld et al.])
# Workflow
The following section describes how multimodalR is used. The framework and all its properties are described and used in examples.
![Workflow to simulate gene expression and evaluate which algorithm works best](Workflow_multimodalR.png){width=650px}
## Parameters
The parameters required for the gene expression simulation are briefly described here:
* **Global parameters**
* `nGenes` - A list containing the numbers of genes to simulate for each scenario and modality.
* `nPatients` - The number of patients to simulate and
how many columns the simulated gene expression should have per gene.
* `seed` - Seed to use to generate random numbers.
* **Distribution parameters**
* `means` - A list containing the parameter range for the mean for each scenario and modality.
* `foldChanges` - A list containing the parameter range for the foldChanges between distributions for multimodal genes. Has to be one less than the modality that is simulated.
* `sd` - Parameters for generating the standard deviations via an Inverse Gaussian distribution
* `gamma` - Shape and rate parameter to generate the gamma distributions.
* `proportions` - Parameters for the allowed proportions of multimodal distributions.
## The `Params` Object
All the parameters to simulate gene expression data
are stored in a `Params` object.
Let's create a new one and see what it looks like.
```{r Params,eval = FALSE}
params <- newParams()
params
```
As well as telling us what type of object we have ("An object of class
`Params`") and showing us the values of the parameters
this output gives us some extra information.
We can see which parameters have been changed from their
default values (those in ALL CAPS).
### `Params` object explained & Template
The names "1","2","3",... stand for the modality of the scenario. If you want to simulate gene expression with a modality > 3, it is very helpful to use the following template:
```{r params-Template,eval = FALSE}
#template for highest modality >= 3
params <- newParams(
nGenes=list(
"1"=c(300),
"2"=list(gauss = 150, gamma = 150),
"3"=list(gauss = 150, gamma = 150)#add more rows
),
means=list(
"1"= c(2, 4),
"2"= c(2, 4),
"3"= c(2, 4)
),
foldChanges = list(
"1"= NA,
"2"= list(gauss = c(2, 4), gamma = c(2, 4)),
"3"= list(gauss = list(c(2, 4), c(2, 4)), gamma = list(c(2, 4), c(2, 4)))
),
sd = list(
"mu" = 0.61,
"lambda" = 2.21
),
gamma = list(
shape = 2,
rate = 2
),
proportions = c(10,20,30,40,50,60,70,80,90),
nPatients = c(200)
)
```
CAUTION: if you want to simulate only certain modalities (e.g. 2,4), you have to use blanks for the others:
```{r params-TemplateOnlyCertainModailities,eval = FALSE}
params <- newParams(
nGenes=list(
"1"= c(0),
"2"= list(gauss = 10, gamma = 10),
"3"= list(gauss = 0, gamma = 0),
"4"= list(gauss = 10, gamma = 10)
),
means=list(
"1"= 0,
"2"= c(2, 4),
"3"= 0,
"4"= c(2,4)
),
foldChanges = list(
"1"= NA,
"2"= list(gauss = c(2, 4), gamma = c(2, 4)),
"3"= NA,
"4"= list(gauss = list(c(2, 4),c(2,4),c(2,4)), gamma = list(c(2, 4),c(2,4),c(2,4)))
),
sd = list(
"mu" = 0.61,
"lambda" = 2.21
),
gamma = list(
shape = 2,
rate = 2
),
proportions = c(10,20,30,40,50,60,70,80,90),
nPatients = c(200)
)
```
### Getting and setting
If we want to look at a particular parameter, for example the number of genes to
simulate, we can extract it using the `getParam` function:
```{r getParam,eval = FALSE}
getParam(params, "nGenes")
```
Alternatively, to give a parameter a new value we can use the `setParam`
function:
```{r setParam,eval = FALSE}
params <- setParam(params, "nGenes", list("1" = 100,"2"=list("gauss"=150,"gamma"=200)))
getParam(params, "nGenes")
```
If we want to extract multiple parameters (as a list) or set multiple parameters
we can use the `getParams` or `setParams` functions:
```{r getParams-setParams,eval = FALSE}
# Set multiple parameters at once (using a list)
params <- setParams(params,
update = list("nGenes" = list("1" = 100,"2"=list("gauss"=150,"gamma"=200)),
"nPatients" = 300))
# Extract multiple parameters as a list
getParams(params, c("nGenes", "gamma"))
# Set multiple parameters at once (using additional arguments)
params <- setParams(params,
"nGenes" = list("1" = 100,"2"=list("gauss"=150,"gamma"=200)),
"nPatients" = 300)
params
```
The parameters which have changed are now shown in ALL CAPS to indicate that they have
been changed from the default.
We can also set parameters directly when we call `newParams`:
```{r newParams-set,eval = FALSE}
# # A basic example for simulating unimodal, bimodal and trimodal distributions
params <- newParams(
nGenes=list("1"=c(300),
"2"=list(gauss = 150, gamma = 150),
"3"=list(gauss = 150, gamma = 150)),
means=list("1"= c(2, 4),
"2"= c(2, 4),
"3"= c(2, 4)),
foldChanges = list("1"= NA,
"2"= list(gauss = c(2, 4),
gamma = c(2, 4)),
"3"= list(gauss = list(c(2, 4), c(2, 4)),
gamma = list(c(2, 4), c(2, 4)))),
nPatients=50)
params
A Params object of class Params
Parameters can be 'Default' or 'NOT DEFAULT'.
Global:
NGENES
10, list(gauss = 10, gamma = 10)
MEANS
c(2, 4), c(2, 4)
FOLDCHANGES
NA, list(gauss = c(2, 4), gamma = c(2, 4))
sd
0.61, 2.21
Gamma
2, 2
Proportions
10, 20, 30, 40, 50, 60, 70, 80, 90
NPATIENTS
50
SEED
61414
```
To change the parameters, please consider the following table.
*Type* shows with which type the slot has to be filled,
*Usage* shows further information for the usage of the values,
*Default values* describes the default values used to generate a `Params`
object.
```{r params-table, eval=FALSE}
Type Usage Default values
nGenes list values: unimodal,bimodal (gauss,gamma),... 700, list(gauss = 150, gamma = 150)
means list values: unimodal,bimodal (gauss),... c(2, 4), c(2, 4)
foldChanges list values: unimodal,bimodal (gauss,gamma),... c(2, 4), list(gauss = c(2, 4), gamma = c(3, 5))
sd list use values as µ,lambda 0.61, 2.21
gamma list use values as shape,rate 2, 2
proportions vector pick random value from vector c(10,20,30,40,50,60,70,80,90)
nPatients numeric - 100
seed numeric - sample(1:1e6,1)
```
## The gene expression simulation
Now that we looked at the Params Objects and saw which parameters it contains,
let's look at how multimodalR simulates gene expression data.
The core of the simulation is the validationData, which is created in the simulation
process with the parameters from the created Params object.
As the result of the simulation process, the simulated gene expression matrix
is returned with the validationData in a list object.
The simulation of a gene expression data set requires the parameters (which are
stored in the Params object) and the validationData behind the simulation (which was
not created yet).
The Params object was already created and if needed, default values were
changed. The validationData is automatically created when the simulate() command is
executed.
### validationData
The structure of the validationData is the following:
* each simulated gene has its own entry in the validationData and can be accessed with validationData$[GeneName]
* each entry contains the following information:
* modus - modality of this gene shown as a number (1 = unimodal, 2 = bimodal, ...)
* scenario - scenario of this gene, either ("unimodal-gaussian","multimodal-gaussian" or "multimodal-gamma+gaussian")
* means - means of this gene
* foldChanges - foldChanges between the means (NA for unimodal genes)
* sds - standard deviation of each mean
* proportions - proportions of the different distributions
* sizes - sizes of the different groups
* groups - patientNames are listed in the corresponding group
To generate the validationData, a `Params` object is needed. The parameters from this
object are used for generating the validationData by randomizing values starting with a
generated `seed`. This is done by drawing values from the `Params` object.
Mean values are generated from a span of values, standard
deviations are generated by using an inversed Gaussian distribution from the package [LaplacesDemon].That is,
because over a large data set a inverseGaussian-like distribution of all
standard deviations was noticed.
Because of those needed parameters the validationData looks for example like that:
```{r validationData-example,eval=FALSE}
validationData$Gene0001
$modus
[1] 2
$scenario
[1] "multimodal-gaussian"
$means
[1] 3.561552 7.309845
$foldChanges
[1] 2.052433
$sds
[1] 0.2826536 0.6026159
$proportions
[1] 20 80
$sizes
[1] 10 40
$groups
$groups[[1]]
[1] "Patient0001" "Patient0002" "Patient0003" "Patient0004"
[5] "Patient0005" "Patient0006" "Patient0007" "Patient0008"
[9] "Patient0009" "Patient0010"
$groups[[2]]
[1] "Patient0011" "Patient0012" "Patient0013" "Patient0014"
[5] "Patient0015" "Patient0016" "Patient0017" "Patient0018"
[9] "Patient0019" "Patient0020" "Patient0021" "Patient0022"
[13] "Patient0023" "Patient0024" "Patient0025" "Patient0026"
[17] "Patient0027" "Patient0028" "Patient0029" "Patient0030"
[21] "Patient0031" "Patient0032" "Patient0033" "Patient0034"
[25] "Patient0035" "Patient0036" "Patient0037" "Patient0038"
[29] "Patient0039" "Patient0040" "Patient0041" "Patient0042"
[33] "Patient0043" "Patient0044" "Patient0045" "Patient0046"
[37] "Patient0047" "Patient0048" "Patient0049" "Patient0050"
```
To understand why all those parameters are needed, it is essential to
understand how the validationData is used to simulate the gene expression.
### Simulation
With the validationData the gene expression matrix is simulated.
For each gene (each entry of the validationData) the parameters in the validationData are used to simulate
the gene expression.
How many columns the gene expression matrix will have depends on the number of
patients. Per patient one gene expression is simulated.
If a negative value was simulated, it is changed to be 0 in the simulation process,
as normally no negative gene expressions can be observed.
How the gene expression is simulated depends on the modus.
The "modus" is the number of occurring distributions within one gene.
####Modus "1" - Unimodal gaussian distribution
Gene expression for the unimodal gaussian distribution (modus "1") is
simulated from the mean and the standard deviation taken from the
validation data for this gene. The Gaussian distribution is calculated
with [rnorm]. The number of values that are simulated for each distribution
equals the number of Patients {"nPatients"}.
All other modi consist of two different scenarios:
In the "gauss" scenario all occurring distributions are Gaussian distributions
and calculated with [rnorm].
In the "gamma" scenario the first occuring distribution is a Gamma distribution
and is calculated with [rgamma].
All other distributions are gaussian distributions and are calculated with [rnorm].
####Modus "2" or higher - Bimodal distributions or higher modalities
For each distribution gene expressions of the multimodal genes are
automatically simulated.
#####Scenario "gauss"
The different gaussian distributions are simulated with the corresponding mean
and a standard deviation and number of values taken from the validation data
for this gene.
#####Scenario "gamma"
The gamma distribution is simulated with the gamma shape and rate values from
the params object. The other gaussian distributions are simulated with the values
from the validation data.
####Exemplary Gene Expression Matrix
After using the `simulateExpression()` command, the gene expression matrix looks like
this:
```{r expression-sample, eval=FALSE}
Patient0001 Patient0002 Patient0003 Patient0004 Patient0005
Gene0001 3.9674492 2.80396473 3.78980664 4.095237721 4.00587345
Gene0002 4.1045561 2.26303081 3.75889335 3.522737760 3.02592372
Gene0003 2.4582632 2.22295497 3.17911283 2.947560965 2.74938887
Gene0004 5.9032577 5.95497476 4.75760279 5.161985499 3.70910216
Gene0005 3.7431061 3.68041039 3.62509463 3.740929320 3.66194851
Gene0006 3.0784338 1.79963109 2.16929818 1.160848158 3.49503635
Gene0007 0.7993267 2.21868266 2.62367680 0.152856110 1.58145211
Gene0008 2.0018416 2.22588333 2.90464907 3.049822819 2.28450055
Gene0009 2.2986108 2.08720943 2.02138626 2.407369792 2.00220259
Gene0010 2.1793609 2.85387502 2.23943177 2.423610586 2.35072765
```
### Executing the simulation
As the `Params`object can be generated before starting the
simulation, the `simulateExpression()` command can be used in two different ways:
1. with a created params object (recommended)
2. without a created params object (not recommended)
The validationData is automatically generated when using the `simulateExpression()` command.
The `simulateExpression()` command returns a list object, which contains the created
validationData and the simulated expressionData.
###Simulating the gene expression
#### 1. With a created params object(recommended)
It is recommended to create a `Params` object before using the `simulateExpression()`
command, because then you can change the parameters and have the params object
saved in your global environment.
```{r simulateExpression-with-params,eval = FALSE}
#with messages printed to console
params <- newParams()
simulation <- simulateExpression(params = params)
#without messages printed to console
simulation <- simulateExpression(params = params,verbose = FALSE)
#storing the validation data and the simulated gene expression data as
#separate data.frames
validationData <- simulation$validationData
expression <- simulation$expressionData
```
#### 2. Without a created params object (not recommended)
If the simulation should be executed with default parameters only (or with
only certain parameters changed) and you do not mind, having no separate
`Params`object in your global environment to be able to look up the different
parameters (especially in the case that you changed them), you can use this
commands, though it is not recommended, as it is useful to have a `Params`
object to look up the parameters behind the validationData.
In this case, without a created `Params`object the validationData is not saved as an
Rdata object, so there is no way of keeping track of changes to your params object.
```{r simulateExpression-without-params,eval = FALSE}
#with default parameters and messages printed to console
params <- newParams()
simulation <- simulateExpression(params = newParams())
#with one changed parameter and messages not printed to console
simulation <- simulateExpression(params = newParams("seed"=23),verbose = FALSE)
#with some changed parameters and messages printed to console
simulation <- simulateExpression(params = newParams("seed"=23,"nPatients"=200))
#storing the validation data and the simulated gene expression data as
#a list and a data.frame
validationData <- simulation$validationData
expression <- simulation$expressionData
```
## The unified output data format
The output data format is a unified data format, that is needed to work with
and use all of the evaluation functions. The output of all implemented
algorithms matches the unified data format.
The unified data format has the following structure:
The data format is structured in three parts:
* the Genes section,
* the Clinical Data section and
* expression matrix section
The different parts are accessible with:
```{r accessing-the-different-sections, eval=FALSE}
mclustOutput$Genes[1:10] #mclustOutput$Genes for all genes
mclustOutput$ClinicalData
mclustOutput$Expressionmatrix
```
The Genes section has many parts. For each gene a separate section is generated.
Each gene section contains the following information, which were calculated
by using one of the algorithms:
* $modus - the modus is the number that equals the found number of occurring distributions
within one gene (the modality)
* $means - the means section contains the calculated mean values for each modus
* $sds - the sds section contains the calculated standard deviation values for each modus
* $sizes - the sizes section contains the calculated size of each group
* $groups - the groups section contains the patients that are assigned to the different groups
The Genes can be accessed by using:
```{r accessing-a-specific-gene-section, eval=FALSE}
mclustOutput$Genes$Gene0001
#Accessing a specific section within the genes is possible by using the $ tags
mclustOutput$Genes$Gene0001$modus
mclustOutput$Genes$Gene0001$means
mclustOutput$Genes$Gene0001$sds
mclustOutput$Genes$Gene0001$sizes
mclustOutput$Genes$Gene0001$groups
output
$Genes
$Genes$Gene0001
$Genes$Gene0001$modus
[1] 2
$Genes$Gene0001$means
[1] 3.613995 7.221655
$Genes$Gene0001$sds
[1] 0.1851148 0.5368126
$Genes$Gene0001$sizes
[1] 10 40
$Genes$Gene0001$groups
$Genes$Gene0001$groups[[1]]
[1] "Patient0001" "Patient0002" "Patient0003" "Patient0004"
[5] "Patient0005" "Patient0006" "Patient0007" "Patient0008"
[9] "Patient0009" "Patient0010"
$Genes$Gene0001$groups[[2]]
[1] "Patient0011" "Patient0012" "Patient0013" "Patient0014"
[5] "Patient0015" "Patient0016" "Patient0017" "Patient0018"
[9] "Patient0019" "Patient0020" "Patient0021" "Patient0022"
[13] "Patient0023" "Patient0024" "Patient0025" "Patient0026"
[17] "Patient0027" "Patient0028" "Patient0029" "Patient0030"
[21] "Patient0031" "Patient0032" "Patient0033" "Patient0034"
[25] "Patient0035" "Patient0036" "Patient0037" "Patient0038"
[29] "Patient0039" "Patient0040" "Patient0041" "Patient0042"
[33] "Patient0043" "Patient0044" "Patient0045" "Patient0046"
[37] "Patient0047" "Patient0048" "Patient0049" "Patient0050"
$ClinicalData
[1] "mnt/data/clinicalData.Rdata"
$Expressionmatrix
[1] "mnt/data/expression.Rdata"
```
If this package is used with an algorithm which was not implemented in this package,
the output has to be changed to match this unified data format.
Then, all the evaluation functions will work with the output of not implemented algorithms.
### Requirements for implementing new algorithms
The output of algorithms that are implemented in multimodalR have to be in the unified output data format.
Whether the output fulfils all requirements of the unified output data format can be easily checked
with the 'isValidUnifiedOuputDataFormat()' function.
```{r isValidUnifiedOutputDataFormat, eval=FALSE}
isValidUnifiedOutputDataFormat(output = output,expressionmatrix = expression)
```
The requirements are listed as the structure of the unified output data format is constructed:
* output
* should be of type 'list'
* should hold the 'Genes','ClinicalData' and 'Expressionmatrix' section
* should have a length of 3
* ClinicalData
* should be of type 'character'
* should have a length of 1
* ExpressionMatrix
* should be of type 'character'
* should have a length of 1
* Genes
* should be of type 'list'
* should have a length of nrow(expressionmatrix)
* EACH GENE
* modus
* should be of type 'integer'
* means
* should be of type 'double'
* should have a lenth of modus
* sds
* should be of type 'double'
* should have a lenth of modus
* sizes
* should be of type 'integer'
* should have a lenth of modus
* sum should equal ncol(expressionmatrix)
* groups
* should be of type 'list' and should have a lenth of modus
* EACH GROUP
* should be of type 'character'
* size of each group should correspond to sizes[[groupnumber]]
If all of those requirements are fulfilled, the function should return 'TRUE', if any of the requirements is not fulfilled, the function returns 'FALSE'. The messages printed to the console will help to determine which requirements are not fulfilled yet.
## Usage of the algorithms
In this package three algorithms that can be used for detecting multimodality,
are implemented. The mclust algorithm that can be used with useMclust() is the one recommended to detect multimodal genes in gene expression data sets.
### mclust
Mclust uses a finite Gaussian mixture modeling fitted via EM algorithm for
model-based clustering, classification, and density estimation, including
Bayesian regularization and dimension reduction.
For more information look at the vignette of [mclust] "A quick tour of mclust"
#### mclust References
Scrucca, L. et al. (2016) ‘mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models’,
The R journal, 8(1), pp. 289–317.
C. Fraley and A. E. Raftery (2002). Model-based clustering, discriminant
analysis, and density estimation. Journal of the American Statistical
Association 97:611:631.
### HDBSCAN
The HDBSCAN (Hierarchical Density-based
spatial clustering of applications with noise) clustering algorithm
computes the hierarchical cluster tree representing density estimates
along with the stability-based flat cluster extraction proposed by Campello et
al. (2013). HDBSCAN essentially computes the hierarchy of all DBSCAN*
clusterings, and then uses a stability-based extraction method to find optimal
cuts in the hierarchy, thus producing a flat solution.
For more information look at the way [HDBSCAN] works or use the documentation
at [HDBSCANdocs]
#### HDBSCAN References
Campello, R.J.G.B., Moulavi, D. and Sander, J. (2013) ‘Density-Based Clustering Based on Hierarchical Density Estimates’,
in Pei, J. et al. (eds.) Advances in knowledge discovery and data mining: 17th Pacific Asia Conference,
PAKDD 2013, Gold Coast, Australia, April 14-17, 2013 ; proceedings.
(Lecture Notes in Computer Science, 7819 : Lecture notes in artificial intelligence). Berlin: Springer, pp. 160–172
Ester, M. et al. (1996) ‘A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise’,
Proceedings of the Second International Conference on Knowledge Discovery and Data Mining: AAAI Press,
pp. 226–231. Available at: http://dl.acm.org/citation.cfm?id=3001460.3001507.
McInnes, L., Healy, J. and Astels, S. (2016) How HDBSCAN Works.
Available at: https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html.
McInnes, L., Healy, J. and Astels, S. (2017) ‘hdbscan: Hierarchical density based clustering’,
The Journal of Open Source Software, 2(11), p. 205. doi: 10.21105/joss.00205
McInnes, L. and Healy, J. (2017) ‘Accelerated Hierarchical Density Based Clustering’,
2017 IEEE International Conference on Data Mining Workshops (ICDMW),
pp. 33–42. doi: 10.1109/ICDMW.2017.12
### flexmix
FlexMix implements a general framework for finite mixtures of regression
models. Parameter estimation is performed using the EM algorithm: the E-step
is implemented by flexmix, while the user can specify the M-step.
For more information look at the vignette of [flexmix] "flexmix-intro"
#### flexmix References
Friedrich Leisch. FlexMix: A general framework for finite mixture models and
latent class regression in R. Journal of Statistical Software, 11(8), 2004.
doi:10.18637/jss.v011.i08
Bettina Gruen and Friedrich Leisch. Fitting finite mixtures of generalized
linear regressions in R. Computational Statistics & Data Analysis, 51(11),
5247-5252, 2007. doi:10.1016/j.csda.2006.08.014
Bettina Gruen and Friedrich Leisch. FlexMix Version 2: Finite mixtures with
concomitant variables and varying and constant parameters Journal of
Statistical Software, 28(4), 1-35, 2008. doi:10.18637/jss.v028.i04
### Using the different algorithms
To use the different algorithms, it is only needed to have created a simulated
expression data frame (which can be done with the `simulateExpression()` command).
Of course, it is also possible to use the different algorithms with a data-set
that was created elsewhere, if it has the same structure used in this packag, then
it is useful to type in the path to the expression data and clinical data.
To use this package for detecting bimodality in your data set, we recommend
using the **not clear yet** command, because our tests with a stratified
data set (included as the test-data set (**not yet**)) have shown that this
algorithm is the most accurate.
The output of all implemented algorithms was unified and overall shows the
same structure.
In the following it is shown, how to use the different implemented algorithms
and how the output should roughly look like.
####Using mclust
To use the mclust algorithm on your expression data frame (simulated or not)
and gain information which genes are bimodal use the useMclust() command:
```{r useMclust,eval = FALSE}
#use mclust with the expression data frame expression and messages printed to the console
mclustOutput <- useMclust(expression = expression)
#use mclust with the expression data frame expression and messages not printed to the console
mclustOutput <- useMclust(expression = expression,verbose = FALSE)
```
####Using HDBSCAN
To use the HDBSCAN algorithm on your expression data frame (simulated or not)
and gain information which genes are bimodal use the useHdbscan() command.
Using HDBSCAN also needs the component `minClusterSize`, which acts as a
minimum cluster size to detect:
```{r useHdbscan,eval = FALSE}
#use HDBSCAN with the expression data frame expression
hdbscanOutput <- useHdbscan(expression = expression,minClusterSize = 10)
```
####Using FlexMix:
To use the FlexMix algorithm on your expression data frame (simulated or not)
and gain information which genes are bimodal use the useFlexmix() command:
```{r useFlexmix,eval = FALSE}
#use FlexMix with the expression data frame expression
flexmixOutput <- useFlexmix(expression = expression,maxModality = 2)
```
## Evaluating the algorithms output
After using one or more of the implemented algorithms on the gene expression
matrix, the packages functions to evaluate which genes are bimodal can be
used. All outputs can be used with the same evaluation functions.
Using the `evaluateAlgorithmOutput()` command is the first step for using the
`getStatistics()` function.
```{r evaluateAlgorithmOutput,eval = FALSE}
#evaluating for bimodal genes after using useMclust()
mclustEvaluation <- evaluateAlgorithmOutput(modality= 2,algorithmOutput = mclustOutput,validationData = validationData, maxDifference = 10)
#evaluating for bimodal genes after using useHdbscan()
hdbscanEvaluation <- evaluateAlgorithmOutput(modality= 2,algorithmOutput = hdbscanOutput,validationData = validationData, maxDifference = 10)
#evaluating for bimodal genes after using useFlexmix()
flexmixEvaluation <- evaluateAlgorithmOutput(modality= 2,algorithmOutput = flexmixOutput,validationData = validationData, maxDifference = 10)
#the evaluation output looks like this:
head(mclustEvaluation)
```
After using the algorithms and using the `evaluateAlgorithmOutput` command, all
requirements to create the statistics and classifications were met.
### Statistics
The `getStatistics()` function calculates the percentage of values that were
correctly (*TRUE*) and not correctly (*FALSE*) estimated by one or more of the
algorithms. To get this statistic, execute the following code:
```{r getStatistic(),eval = FALSE}
#getStatistic of one evaluation
getStatistics(evaluations = list("mclust"=mclustEvaluation))
#getStatistic of two or more evaluations
getStatistics(evaluations = list("mclust"=mclustEvaluation,"HDBSCAN"=hdbscanEvaluation))
```
### Classifications
To get the classification of correctly and not correctly estimated scenarios,
the `getClassifications()` function can be used.
#### Reading the output
* FN (FALSE Negatives) - validationData: modality, estimated scenario: not bimodal
(unimodal or more than bimodal)
* FP (FALSE Positives) - validationData: unimodal, estimated scenario: bimodal
* TN (TRUE Negatives) - validationData: unimodal, estimated scenario: unimodal
* TP (TRUE Positives) - validationData: bimodal, estimated scenario: bimodal
#### Using the `getClassifications()` command
```{r getClassifications(),eval = FALSE}
#get the classification of one algorithmOutput
getClassifications(modality = 2,validationData = validationData,outputs = list("mclust"=mclustOutput))
#get the classification of two or more algorithms
getClassifications(modality = 2,validationData = validationData,
outputs = list("mclust"=mclustOutput,"HDBSCAN"=hdbscanOutput))
```
## Plotting
###Plotting the gene expression
In order to plot the gene expression of all genes, four components should be
specified:
* `expression` - The gene expression matrix that should be plotted
* `plotsPerPage` - The number of genes that are to be plotted on the same
page. This number has to be a divisor of the number of `GenesToPlot`.
Please consider that too many plots on one page make the plots unreadable!
* `GenesToPlot` - The number of genes to be plotted. Should be the
number of genes.
* `plotName` - The name under which a pdf file with the plots will be stored
in your folder. The pdf must not be added in the plotName!
```{r plotExpression(),eval=FALSE}
#This will create a pdf called "expression.pdf" with a plot of each gene on a separate page.
plotExpression(expression = expression,plotsPerPage = 1,GenesToPlot = 1000,plotName = "expression")
#This will create a pdf called "expression200.pdf" with the first 200 genes plotted with two plots on each page.
plotExpression(expression = expression,plotsPerPage = 2,GenesToPlot = 200,plotName = "expression200")
```
The plotted gene expression will look like this:
![Example picture for how the plotted output will look like.](plotExpression.png)
###Plotting the gene expression and output
The outputs of all algorithms producing output with the unified output format can be plotted.
In order to plot the gene expression with the output, four components should be
specified:
* `expression` - The gene expression matrix that should be plotted
* `algorithmOutput` - The algorithm output that should be plotted
* `name` - The name under which a pdf file with the plots will be stored
in your folder. The pdf must not be added in the plotName!
* `GenesToPlot` - The number of genes to be plotted. Should be the
number of genes.
```{r plotOutput, eval=FALSE}
#Plotting one algorithms output
## This function call will create a pdf called "mclustOutput.pdf", where the expression for each gene and the estimated expression of mclust is plotted.
## The expression is plotted in black, the algorithms output is plotted in red.
plotOutput(expression = expression,algorithmOutput = mclustOutput,name = "mclustOutput",GenesToPlot = nrow(expression))
```
The plotted output with the gene expression will look like this:
![Example picture for how the plotted output will look like. Black: gene expression, red: estimated distribution](PlottingExample.png)
###Plotting the statistics
The produced statistics can be plotted.
In order to plot the statistics, two components should be
specified:
* `statistics` - The statistics that should be plotted
* `name` - The name under which a pdf file with the plots will be stored
in your folder. The pdf must not be added in the plotName!
```{r plotStatistics, eval=FALSE}
# This function call will create a pdf called "statistics.pdf",where the statistics are plotted.
plotStatistics(statistics = statistics,name = "statistics")
```
The plotted statistics will look like this:
![Example picture for how the plotted statistics will look like.](getStatistics.png)
###Plotting the classifications
The produced classifications can be plotted.
In order to plot the classifications, two components should be
specified:
* `classifications` - The classifications that should be plotted
* `name` - The name under which a pdf file with the plots will be stored
in your folder. The pdf must not be added in the plotName!
```{r plotClassifications, eval=FALSE}
# This function call will create a pdf called "classifications.pdf",where the classifications are plotted.
plotClassifications(classifications = classifications,name = "classifications")
```
The plotted classifications will look like this:
![Example picture for how the plotted classification will look like.](getClassifications.png)
# Citing multimodalR
If you use multimodalR in your work please cite our paper:
```{r citation,eval = FALSE}
citation("multimodalR")
```
# Session information {-}
```{r sessionInfo}
sessionInfo()
```
<!-- [gamma]: https://en.wikipedia.org/wiki/Gamma_distribution -->
<!-- [poisson]: https://en.wikipedia.org/wiki/Poisson_distribution -->
[Worzfeld et al.]: https://doi.org/10.1172/JCI60568
[rnorm]: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Normal.html
[rgamma]: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/GammaDist.html
[mclust]: http://127.0.0.1:40225/help/library/mclust/doc/mclust.html
[hdbscan]: http://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
[HDBSCANdocs]: http://hdbscan.readthedocs.io/
[flexmix]: https://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf
[LaplacesDemon]: https://www.rdocumentation.org/packages/LaplacesDemon/versions/16.0.1/topics/dist.Inverse.Gamma