PARrOT

PAthway pRedictiOn by mulTimodal genes

Abstract

This package uses the information contained in large datasets to perform a graph analysis. For this purpose, an adjacency matrix is computed from the information of multimodal genes. Multimodal genes in this context are genes whose expression across patients follows two or more distinguishable normal distributions.

[Figure: example of a bimodal gene expression distribution]

Those bimodal genes can be found with the help of multimodalR.

Availability

All components of the PARrOT R package and the accompanying Python scripts are available for download from the GitHub repository PARrOT.

Get a Docker container here.

Please make sure to check our other projects at loosolab.

Input

As input, any data that shows a multimodal distribution can be used. The package was first tested with mRNA data from TCGA. A standardized format was needed to connect the multimodality detection with the pathway analysis. This JavaScript Object Notation (JSON) file lists each gene with its number of modalities and the corresponding means, standard deviations and group sizes, as well as the assigned patients in the parameter "groups". In addition, the locations of the files containing the clinical data and the expression matrix should be given. An example of the required data format is given in Table 1.

[Table 1: example of the required JSON input format]
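
Since Table 1 is not reproduced here, the following R sketch only illustrates what an input file of this kind could look like, based on the fields described above. The field names, gene ID and values are purely illustrative assumptions and not the package's exact schema.

# Illustrative only: writes a JSON file with roughly the structure described above.
# Field names, the gene ID and all values are assumptions, not the real schema.
library(jsonlite)

example_input <- list(
  clinical = "clinical_data.tsv",        # location of the clinical data (assumed field name)
  expressionmatrix = "expression.tsv",   # location of the expression matrix (assumed field name)
  genes = list(
    ENSG00000000001 = list(              # hypothetical Ensembl ID
      modus = 2,                          # number of modalities
      groups = list(
        list(mean = 3.1, sd = 0.4, size = 120,
             patients = c("TCGA-01", "TCGA-02")),
        list(mean = 7.8, sd = 0.6, size = 95,
             patients = c("TCGA-03", "TCGA-04"))
      )
    )
  )
)

write_json(example_input, "example_input.json", auto_unbox = TRUE, pretty = TRUE)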

Example

table <- readjsonsheet(JSON = JSON)        # read the JSON input into a table (one row per modality)
matrix <- calcscorematrix(table = table)   # score common patients between modalities and normalize
snode <- buildsinglenode(matrix = matrix)  # reduce modalities to genes and keep the top-scoring edges

This saves a file named 'adjMatrix_singlenode.csv' in the working directory. It contains three columns: the first two contain the gene names and the third the weight of the connection. These steps also perform the normalization of the adjacency matrix and compute statistics for the given data.
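
As a quick sanity check, the exported edge list can be inspected in R. The column names below are only illustrative labels for the three-column layout described above, not names written by the package.

# Minimal check of the exported edge list (assumes the three-column layout
# described above; adjust header = TRUE/FALSE to match the actual file).
edges <- read.csv("adjMatrix_singlenode.csv", header = FALSE,
                  col.names = c("gene_a", "gene_b", "weight"))
head(edges)              # first two columns: gene names, third column: edge weight
summary(edges$weight)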

python ./Graph.py -i adjMatrix_singlenode.csv -o <your OUTPUT dir>

This command generates different models for the given connection list (graph) together with the corresponding graphics and logs. It also produces the file block_member.csv, which contains the most probable gene clustering. This file can be loaded into R and prepared for a second analysis run in which all information from the database is used.

readcluster(clustermember = "block_member.csv", matrix = matrix)

This generates one file per detected cluster, named as follows: subcluster_.txt. These files contain all edges and all modalities for a cluster found in the first run of the graph analysis.

python ./Graph_subcluster.py -i <OUTPUT dir> -o <your OUTPUT dir>

This performs a whole analysis for each cluster.

Docker

The whole analysis can also be performed with the help of the Docker container. It can easily be started with the following command.

docker run -i -v <dir containing JSON>:/INPUT/ -v <desired OUTPUT dir>:/OUTPUT/ parrot:latest

The input directory is supposed to contain exactly one JSON file, and the output directory is supposed to be empty. The Docker container can be obtained here.

Structure

The workflow of PARrOT is summarized in the following flowchart.

[Figure: flowchart of the PARrOT workflow]

The flowchart displays the data flow of the whole framework. The blue rectangles represent functions of the R package, while the orange rectangles represent the Python scripts. All displayed graphs are generated in those functions. The transfer format is noted on the arrows between the functions.

As the entry point into the framework, a JSON file in the presented format is required. With the functions readjsonsheet() or spike_in(), a table that contains all properties for each modality is generated. This table is passed to the calcscorematrix() function, which generates the first plots displaying the statistical properties of the given data. The normalization process is also completed at this point.

The generated adjacency matrix is passed to the buildsinglenode() function, which reduces the number of vertices and filters the edges to obtain a computable number of edges.

The list of these genes is passed to the graph analysis, and the results are processed and validated by the readcluster() function. For each detected cluster, it also generates a new list that contains all edges and modalities, including those that had been filtered out in the buildsinglenode() function. In the last step of the framework, these lists are passed to a second graph analysis. This is done to obtain a higher resolution of the clusters, which improves the significance of the discovered structures because non-specific cluster members are sorted out.
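
To make the data flow concrete, the following is a minimal end-to-end sketch of the workflow, run from R. It assumes the function calls shown in the example above, that the package loads as PARrOT, and that the Python scripts are in the working directory; all file paths and output directories are placeholders.

# Minimal end-to-end sketch of the PARrOT workflow (all paths are placeholders).
library(PARrOT)

# R part: read the JSON, score the modalities and build the reduced edge list
table  <- readjsonsheet(JSON = "example_input.json")
matrix <- calcscorematrix(table = table)
snode  <- buildsinglenode(matrix = matrix)        # writes adjMatrix_singlenode.csv

# First graph analysis (Python, graph-tool), invoked from R for convenience
system2("python", c("./Graph.py", "-i", "adjMatrix_singlenode.csv", "-o", "output"))

# Validate the clusters, run the KEGG enrichment and write the subcluster files
readcluster(clustermember = "block_member.csv", matrix = matrix)   # adjust the path to your output directory

# Second graph analysis with the full information for each subcluster
system2("python", c("./Graph_subcluster.py", "-i", "output", "-o", "output_subcluster"))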

Functions of the R package

readjsonsheet

This function creates a data.table object from the given JSON file. For each modality, the columns contain the Ensembl ID, component, groupsize, proportion, variance, groupmean, FC (distance between means) and groupmember.

For this purpose, the packages JSONIO and data.table are used.
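
For orientation, the returned table could be imagined roughly as follows. The column names approximate the fields described above and the values are invented; this is not the exact object created by readjsonsheet().

# Illustrative structure only; column names approximate the description above
# and all values are invented, this is not readjsonsheet() output.
library(data.table)

example_table <- data.table(
  ensembl_id  = c("ENSG00000000001", "ENSG00000000001"),   # hypothetical gene with two modalities
  component   = c(1, 2),
  groupsize   = c(120, 95),
  proportion  = c(0.56, 0.44),
  variance    = c(0.16, 0.36),
  groupmean   = c(3.1, 7.8),
  FC          = c(4.7, 4.7),                               # distance between the two means
  groupmember = list(c("TCGA-01", "TCGA-02"), c("TCGA-03", "TCGA-04"))
)
example_table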

calcscorematrix

This function counts the common patients between each pair of modalities and performs the normalization. To do so, it takes the sizes of the patient groups, their distribution across the modalities and the fold change into account.
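
The exact scoring formula is not reproduced here; the sketch below only illustrates the general idea of scoring a pair of modalities by their patient overlap, weighted by group size and fold change. The weighting shown is an assumption, not the package's normalization.

# Illustrative only: score two modalities by their shared patients, weighted
# by group size and fold change. NOT the package's exact normalization.
overlap_score <- function(patients_a, patients_b, fc_a, fc_b) {
  common <- length(intersect(patients_a, patients_b))
  # normalize the overlap by the smaller group and weight by the fold changes
  (common / min(length(patients_a), length(patients_b))) * mean(c(fc_a, fc_b))
}

overlap_score(c("TCGA-01", "TCGA-02", "TCGA-03"),
              c("TCGA-02", "TCGA-03", "TCGA-04"),
              fc_a = 4.7, fc_b = 3.2)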

buildsinglenode

Because of the large number of vertices and edges, it is necessary to reduce the graph. Therefore, for each vertex only the edges to its top n genes are kept, and the number of vertices is further reduced by collapsing modalities to genes. In addition, a global cutoff can be set; in this case only the top n edges overall are used.
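
The following data.table sketch illustrates the kind of per-vertex top-n filtering and global cutoff described above on a generic edge list; it is not the package's implementation.

# Illustrative per-vertex top-n edge filter on an edge list (gene_a, gene_b, weight);
# not the buildsinglenode() implementation.
library(data.table)

edges <- data.table(
  gene_a = c("G1", "G1", "G1", "G2", "G2"),
  gene_b = c("G2", "G3", "G4", "G3", "G4"),
  weight = c(0.9, 0.7, 0.2, 0.8, 0.4)
)

n <- 2
# keep, for every gene_a, only its n highest-weighted edges
top_edges <- edges[order(-weight), head(.SD, n), by = gene_a]

# optional global cutoff: keep only the n strongest edges overall
global_top <- head(edges[order(-weight)], n)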

Python analysis: Graph.py

The Python script fits four different generative models to the given graphs. The algorithms are implemented using graph-tool.

These generative models are then compared by their minimum description length, and the best-fitting model is used for the subsequent analysis.

The first two models are stochastic block models. The first is the default version and the second a variation in which a vertex can be part of several clusters/blocks.

The other two models are nested block models (NBM). The first is again the default version of an NBM. The second is an NBM for which an additional equilibration step is performed.

For all models, a standard and a degree-corrected variant are compared and only the more precise one is used.

readcluster

This function reads in the results of the graph analysis. Using the clusterProfiler package, it performs a gene set enrichment analysis against KEGG. It also writes the files for the second graph analysis.
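
For orientation, a KEGG gene set enrichment analysis with clusterProfiler typically looks like the sketch below. The ranked gene list and the parameters are illustrative and not necessarily identical to what readcluster() does internally.

# Illustrative only: a generic KEGG GSEA call with clusterProfiler.
# The ranking metric and parameters are assumptions, not readcluster() internals.
library(clusterProfiler)

# gseKEGG() expects a named numeric vector (names: gene IDs, here Entrez IDs),
# sorted in decreasing order; a real run uses the full ranked gene list.
gene_list <- sort(c("7157" = 2.3, "1956" = 1.8, "5290" = 1.1,
                    "672" = -0.7, "4609" = -1.5), decreasing = TRUE)

gsea_kegg <- gseKEGG(geneList = gene_list,
                     organism = "hsa",       # human
                     pvalueCutoff = 0.05)
head(as.data.frame(gsea_kegg))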

Python analysis: Graph_subcluster.py

The second script performs the whole analysis of the first script for each subcluster that has been saved with all its information by readcluster.

Spike in

The functionality has been tested with a randomized data pool into which a slightly more strongly connected spike was added. The function provides the possibility to set the sizes, proportions, fold changes, overlap and number of patients. Most importantly, the number of spikes can also be set.
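
A call to spike_in() could look roughly like the sketch below; the argument names are purely hypothetical placeholders for the options listed above and do not necessarily match the function's real signature.

# Hypothetical call; argument names are placeholders for the options described
# above, not the real signature of spike_in().
table <- spike_in(n_patients  = 500,          # number of simulated patients
                  sizes       = c(250, 250),  # group sizes per modality
                  proportions = c(0.5, 0.5),
                  fold_change = 4,
                  overlap     = 0.8,          # connectedness of the spiked genes
                  n_spikes    = 3)            # number of spiked clusters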

Installation

To install the R package, the following commands have to be executed in R.

library(devtools)
install_github(repo = "loosolab/PARrOT/PARrOT",host = "github.molgen.mpg.de/api/v3")

Installation instructions for graph-tool, which is needed for the Python scripts, can be found here.

The Docker container can easily be obtained from here.

How to cite

License

This project is licensed under the MIT license.