
![PARrOT](./PARrOT/vignettes/parrot_logo.PNG "PARrOT")

## PAthway pRedictiOn by mulTimodal genes

## Abstract

The aim of this framework is to gather new insights into the regulation, connection and correlation between genes.
It has been shown that the expression level of some genes has a significant impact on the survival prognosis of cancer patients. To determine whether a patient belongs to the high- or low-expressing group, a multimodal distribution has to be fitted to the distribution of a gene's expression across patients.
Multimodal genes of a dataset can be identified with the help of [multimodalR](https://github.molgen.mpg.de/loosolab/multimodalR).
The required input file (JSON) can also be produced with this R package.

![bimodal](./PARrOT/vignettes/bimodality_example.png "bimodal")

Based on this information, gene interactions are suggested.

## Availability

All components of the PARrOT R package and the accompanying Python scripts are available for download from the GitHub repository [PARrOT](https://github.molgen.mpg.de/loosolab/PARrOT/).

### Installation

To install the R package, execute the following commands in R:

```r
library(devtools)
install_github(repo = "loosolab/PARrOT/PARrOT",host = "github.molgen.mpg.de/api/v3")
```

You also need to obtain the Python scripts by cloning the GitHub repository:

```bash
git clone https://github.molgen.mpg.de/loosolab/PARrOT.git
```

Installation instructions for graph-tool, which the Python scripts require, can be found [here](https://git.skewed.de/count0/graph-tool/wikis/installation-instructions).

Get a Docker container [here](https://cloud.docker.com/u/loosolab/repository/docker/loosolab/parrot).

Please make sure to check our other projects at [loosolab](http://loosolab.mpi-bn.mpg.de/).

## Input

Any data that show a multimodal distribution can be used as input. The package was first tested with mRNA data from TCGA.
A standardized format was needed to connect the multimodality detection with the pathway analysis. This JavaScript Object Notation (JSON) file contains each gene with its number of modalities and the corresponding means, standard deviations, sizes and the associated patients in the parameter “groups”. The locations of the files that contain the clinical data and the expression matrix should also be given. An example of the required data format is given in Table 1.

![Table1](./PARrOT/vignettes/table1.PNG "Table 1")

The JSON consists of three main parts. The last two string-value pairs are ClinicalData and Expressionmatrix, which are supposed to hold the metadata. The first part holds an entry for each gene, with the Ensembl gene name as key and the properties of the modalities found in this gene as value.

|key|value|
| :----: |:--:|
|modus|number of modalities which were found in this gene|
|means|the mean for each modality|
|sds|the standard deviation for each modality|
|sizes|the size for each modality|
|groups|the patients belonging to each modality|
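
As a rough illustration of the keys listed above, the following Python sketch builds a minimal input record and checks that the per-modality lists are consistent. The top-level key `Genes`, the gene ID, the patient IDs, the file paths, all numbers and the helper `check_gene_entry` are placeholders chosen for this example, and the nesting of `groups` as a list of lists is an assumption. Table 1 remains the reference for the real layout.

```python
import json

# Hypothetical example record; every value below is a placeholder.
example = {
    "Genes": {  # assumed name for the first part of the file
        "ENSG00000000001": {
            "modus": 2,                    # number of modalities
            "means": [2.0, 8.0],           # one mean per modality
            "sds": [0.5, 1.0],             # one standard deviation per modality
            "sizes": [3, 2],               # number of patients per modality
            "groups": [["P1", "P2", "P3"], ["P4", "P5"]],
        }
    },
    "ClinicalData": "path/to/clinical_data.csv",
    "Expressionmatrix": "path/to/expression_matrix.csv",
}


def check_gene_entry(entry):
    """Return True if all per-modality lists agree with 'modus' and 'sizes'."""
    n = entry["modus"]
    if not all(len(entry[key]) == n for key in ("means", "sds", "sizes", "groups")):
        return False
    # Each group must list exactly as many patients as its declared size.
    return all(len(group) == size
               for group, size in zip(entry["groups"], entry["sizes"]))


# Serialize the record the way it would be written to disk.
text = json.dumps(example, indent=2)
```

A check like this can catch malformed entries before they are handed to the R package.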

We recommend the [multimodalR](https://github.molgen.mpg.de/loosolab/multimodalR) package to find multimodal genes and write them into a valid JSON file, which can be loaded with the readjsonsheet() function of PARrOT.

## Example

In the first step, a table is generated that contains a simulated data set with two slightly more strongly connected spikes within a larger pool of noise:

```r
table <- spike_in()
```

The resulting table is then passed to the calcscorematrix() function. It uses the patient composition and the parameters of the distribution (sizes, proportion, fold change) to compute and normalize the connection probability between each pair of modalities.

```r
matrix <- calcscorematrix(table = table)
```

To obtain significant results, it is highly recommended to use datasets with 150 or more multimodal genes. In this case, the number of vertices and edges is too high. The amount of data can be reduced with the buildsinglenode() function. To reduce the number of vertices, the probability of a gene-gene interaction is calculated as the mean of all corresponding modality-modality interactions.

By default, the number of edges is reduced by keeping only the top 20 edges of each gene. This number can be changed with the cutoff parameter. It is a vector (c()) that contains the string "pernode" as its first entry and the desired number of edges as its second entry. In "global" mode, the threshold applies to the top edges of the whole graph instead of to each gene.

```r
snode <- buildSinglenode(matrix = matrix)
```

This saves a file named 'adjMatrix_singlenode.csv' in the working directory. It contains three columns: the first and second hold the gene names, and the third holds the weight of the connection.
The function also normalizes the adjacency matrix and computes statistics for the given data.
The following command has to be executed from the command line to perform the graph analysis.

````bash
python <git repository>/Graph.py -i <R working dir>/adjMatrix_singlenode.csv -o <your OUTPUT dir>
````

This command generates different models for the given connection list (graph) along with the corresponding graphics and logs.
It also produces the block_member.csv file, which contains the most probable gene clustering.
This file can be loaded into R and prepared for a second analysis run in which all information of the database is used.

The following commands are intended to find structures within the detected clusters and to validate the results against the KEGG database. The generated test data does not contain an inner structure, and the vertices are simply named "spike" and "noise", so no genes will be found in KEGG. The following commands are therefore meant to be executed on real data.

````r
readcluster(clustermember = "<your output dir>/block_member.csv", matrix = matrix)
````

This generates one file for each detected cluster. The files are named as follows: subcluster_< number of cluster >.txt.
These files contain all edges and all modalities of a cluster found in the first run of the graph analysis.

````bash
python <git repository>/Graph_subcluster.py -i <R working dir> -o <your OUTPUT dir>
````

This performs a complete analysis for each cluster.

For a more detailed description of the example, visit the [wiki](https://github.molgen.mpg.de/loosolab/PARrOT/wiki/example).

### Docker

The whole analysis can also be performed with the Docker container.
It can easily be started with this command:

````bash
docker run -i -v <dir containing JSON>:/INPUT/ -v <desired OUTPUT dir>:/OUTPUT/ parrot:latest
````

The input directory should contain only one JSON file, and the output directory should be empty.
The Docker container can be obtained [here](https://cloud.docker.com/u/loosolab/repository/docker/loosolab/parrot).
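
The two directory requirements can be verified before starting the container. The following Python sketch is one possible pre-flight check. The helper name `check_docker_dirs` is hypothetical and not part of PARrOT, and the check assumes that the JSON file uses a lowercase `.json` extension.

```python
from pathlib import Path


def check_docker_dirs(input_dir, output_dir):
    """Return a list of problems; an empty list means both directories look fine.

    Both directories must already exist.
    """
    problems = []
    # Exactly one JSON file is expected in the input directory.
    json_files = sorted(Path(input_dir).glob("*.json"))
    if len(json_files) != 1:
        problems.append(
            f"expected exactly one JSON file in input, found {len(json_files)}"
        )
    # The output directory is expected to be empty.
    if any(Path(output_dir).iterdir()):
        problems.append("output directory is not empty")
    return problems
```

Calling `check_docker_dirs("<dir containing JSON>", "<desired OUTPUT dir>")` with the same paths that are passed to `docker run` returns an empty list when both conditions hold.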

## Structure

The workflow of PARrOT is displayed in the following flowchart.

![flowchart](./PARrOT/vignettes/flowchart_parrot.png "Flowchart")

The flowchart displays the data flow of the whole framework. The blue rectangles represent functions of the R package, while the orange rectangles represent the Python scripts. All displayed graphs are generated in those functions. The transfer format is given on the arrows between the functions.


As entry into the framework, a JSON file in the presented format is necessary. With the functions readjsonsheet() or spike_in(), a table that contains all properties for each modality is generated. This table is passed to the calcscorematrix() function, which generates the first plots that display the statistical properties of the given data. The normalization process is also completed at this point.


The generated adjacency matrix is passed to the buildsinglenode() function, which reduces the number of vertices and filters the edges to obtain a computationally manageable number of edges.


The list of these genes is passed to the graph analysis, and the results are handled and validated by the readcluster() function. It also generates a new list for each detected cluster that contains all edges and modalities that were filtered out in the buildsinglenode() function. In the last step of the framework, those lists are passed to a second graph analysis. This is done to gain a higher resolution of the clusters, which affects the significance of the discovered structures because non-specific members of clusters are sorted out.

## Functions of the R package

### readjsonsheet

This function forms a data.table object from the given JSON file. The columns contain Ensembl ID, component, groupsize, proportion, variance, groupmean, FC (distance between means) and groupmember for each modality.

For this purpose, the packages JSONIO and data.table are used.
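
To make the resulting table shape more concrete, the following Python sketch flattens one gene entry into one row per modality. It is not the R implementation. The function name `flatten_gene` and the sample values are hypothetical, and the derived columns rest on assumptions: proportion is taken as group size divided by the total size, variance as the squared standard deviation, and FC as the distance between the largest and smallest mean.

```python
def flatten_gene(gene_id, entry):
    """Turn one gene entry into one row (dict) per modality."""
    total = sum(entry["sizes"])
    # Assumed definition of FC: distance between the outermost means.
    fc = max(entry["means"]) - min(entry["means"])
    rows = []
    for i in range(entry["modus"]):
        rows.append({
            "ensembl_id": gene_id,
            "component": i + 1,
            "groupsize": entry["sizes"][i],
            "proportion": entry["sizes"][i] / total,
            "variance": entry["sds"][i] ** 2,
            "groupmean": entry["means"][i],
            "FC": fc,
            "groupmember": entry["groups"][i],
        })
    return rows


# Placeholder entry with two modalities; all values are made up.
entry = {
    "modus": 2,
    "means": [2.0, 8.0],
    "sds": [0.5, 1.0],
    "sizes": [3, 2],
    "groups": [["P1", "P2", "P3"], ["P4", "P5"]],
}
rows = flatten_gene("ENSG00000000001", entry)
```

Each element of `rows` corresponds to one line of the table, so a gene with two modalities contributes two rows.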

### calcscorematrix

This function counts the patients shared between each pair of modalities and performs the normalization. To do so, it takes the size of the patient groups, their distribution across the modalities and the fold change into account.
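
The counting step can be sketched in Python as follows. This is only an illustration: the function name `shared_patient_scores` and the sample data are hypothetical, and the Jaccard index (shared patients divided by the union of both groups) is used here as a simple stand-in for the package's own normalization, which additionally accounts for the fold change.

```python
from itertools import combinations


def shared_patient_scores(modalities):
    """Count shared patients for every pair of modalities.

    modalities: dict mapping a modality label to a set of patient IDs.
    Returns {(label_a, label_b): (shared_count, jaccard_index)}.
    """
    scores = {}
    for a, b in combinations(sorted(modalities), 2):
        shared = len(modalities[a] & modalities[b])
        union = len(modalities[a] | modalities[b])
        # Stand-in normalization, not the one used by calcscorematrix().
        scores[(a, b)] = (shared, shared / union if union else 0.0)
    return scores


# Placeholder data: two genes with two modalities each.
modalities = {
    "geneA_1": {"P1", "P2", "P3"},
    "geneA_2": {"P4", "P5"},
    "geneB_1": {"P1", "P2"},
    "geneB_2": {"P3", "P4", "P5"},
}
scores = shared_patient_scores(modalities)
```

Modalities of the same gene share no patients by construction, so their score is zero, while modalities of different genes that group the same patients together receive a high score.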

### buildsinglenode

Because of the number of vertices and edges, it is necessary to reduce the graph. Therefore, only the top n edges of each vertex are used, and the number of vertices is reduced by collapsing modalities into genes.
Alternatively, a global cutoff can be set. In this case, only the top n edges are used.
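
Both reduction steps can be sketched in Python. The names `collapse_to_genes` and `filter_edges` as well as the sample scores are hypothetical and only mirror the description above. Skipping pairs of modalities that belong to the same gene, and keeping an edge if it ranks in the top n of either of its two genes, are assumptions of this sketch.

```python
from collections import defaultdict


def collapse_to_genes(modality_scores, gene_of):
    """Average all modality-modality scores that connect the same pair of genes."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (a, b), score in modality_scores.items():
        gene_a, gene_b = gene_of[a], gene_of[b]
        if gene_a == gene_b:
            continue  # assumption: no edges between modalities of one gene
        key = tuple(sorted((gene_a, gene_b)))
        sums[key] += score
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def filter_edges(gene_scores, mode="pernode", n=20):
    """Keep the top n edges per gene ('pernode') or the top n overall ('global')."""
    ranked = sorted(gene_scores.items(), key=lambda kv: kv[1], reverse=True)
    if mode == "global":
        return dict(ranked[:n])
    by_gene = defaultdict(list)
    for edge, score in ranked:
        for gene in edge:
            by_gene[gene].append((edge, score))
    kept = {}
    for edges in by_gene.values():
        kept.update(edges[:n])  # lists are already sorted by descending score
    return kept


# Placeholder scores between modalities of three genes A, B and C.
modality_scores = {
    ("A_1", "B_1"): 0.8, ("A_1", "B_2"): 0.2, ("A_2", "B_1"): 0.4,
    ("A_1", "C_1"): 0.1, ("B_1", "C_1"): 0.9,
}
gene_of = {"A_1": "A", "A_2": "A", "B_1": "B", "B_2": "B", "C_1": "C"}
gene_scores = collapse_to_genes(modality_scores, gene_of)
per_node = filter_edges(gene_scores, mode="pernode", n=1)
top_global = filter_edges(gene_scores, mode="global", n=1)
```

With `n=1`, the per-node mode keeps the strongest edge of every gene, whereas the global mode keeps only the single strongest edge of the whole graph.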

### Python analyses Graph.py

The Python script fits four different generative models to the given graph. The implementations of the algorithms are taken from [graph-tool](https://graph-tool.skewed.de/).

These generative models are then compared by their minimum description length, and the best-fitting model is used for the subsequent analysis.

The first two models are stochastic block models. The first is the default version, and the second is a variant in which a vertex can be part of several clusters/blocks.

The other models are nested block models (NBM). The first is again the default version of an NBM. The second is an NBM with an additional equilibration step.

For all models, a normal and a degree-corrected variant are compared, and only the more precise variant is used.
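
The selection logic can be sketched in plain Python, independent of graph-tool. The function name `select_best_model`, the model labels and the description length values below are placeholders invented for this example. In a real run, the description lengths would come from the fitted graph-tool states.

```python
def select_best_model(candidates):
    """Pick the candidate with the smallest description length.

    candidates: {model_name: {variant_name: description_length}}.
    Returns (model_name, variant_name, description_length).
    """
    # First keep the better variant (normal vs. degree-corrected) per model.
    best_per_model = {
        model: min(variants.items(), key=lambda kv: kv[1])
        for model, variants in candidates.items()
    }
    # Then compare the remaining candidates across all models.
    model = min(best_per_model, key=lambda m: best_per_model[m][1])
    variant, description_length = best_per_model[model]
    return model, variant, description_length


# Placeholder values, not results of a real analysis.
candidates = {
    "block_model": {"normal": 1200.0, "degree_corrected": 1150.0},
    "overlapping_block_model": {"normal": 1300.0, "degree_corrected": 1280.0},
    "nested_block_model": {"normal": 1100.0, "degree_corrected": 1090.0},
    "nested_equilibrated": {"normal": 1095.0, "degree_corrected": 1080.0},
}
best = select_best_model(candidates)
```

A smaller description length means that the model compresses the graph better, which is why the minimum is selected at both stages.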

### readcluster

This function reads in the results of the graph analysis. Using the clusterProfiler package, it performs a gene set enrichment analysis against KEGG. It also writes the files for the second graph analysis.

### Python analyses Graph_subcluster.py

The second script performs the whole analysis of the first script for each subcluster that readcluster saved with all information.

### Spike in

The functionality has been tested with a randomized data pool into which a slightly more strongly connected spike was added. The function makes it possible to set sizes, proportions, fold changes, overlap and number of patients. Most importantly, the number of spikes can also be set.

For a more detailed description of the functions visit the [wiki](https://github.molgen.mpg.de/loosolab/PARrOT/wiki/Functions).

## How to cite


## License
This project is licensed under the MIT license.