Example


For this example, a table is generated with the spike_in() function of the R package.

By default it generates a table with 500 genes. Each of these genes is bimodal: its 500 patients are split in a 70:30 proportion into low- and high-expressing patients. These two groups are called modalities. Which patients end up in which modality is completely randomized with the function sample() across the 500 patients. The distance between the means of the two modalities is set to 2, which is the fold change (FC) of the gene. These genes are called the noise.
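To make this concrete, here is a minimal sketch of how one such bimodal noise gene could be simulated under the defaults described above; it is an illustration, not the actual spike_in() implementation:

# Illustration only, not the spike_in() code:
# 500 patients, a randomized 70:30 split, and means 2 apart (FC = 2).
n_patients <- 500
high <- sample(n_patients, 0.3 * n_patients)  # 30% high-expressing patients
expr <- rnorm(n_patients, mean = 0)           # low-expressing modality
expr[high] <- rnorm(length(high), mean = 2)   # high-expressing modality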

Into this noise a spike is woven. By default it also covers the 500 patients with a 70:30 proportion and a fold change (FC) of 2, but only 25 genes are generated for the spike. The first part of the spike's patient composition is built in the same way as for the noise. On top of this randomized composition an overlap between the spike genes is added. By default this value is set to 65%, which means that 65% of the patients in one modality of the spike can also be found in another modality of the spike. This is visualized by the following graphic.

[Figure: spike_graphic]
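The overlap can be sketched in the same spirit; how spike_in() actually links the spike genes is an assumption here:

# Illustration only: reuse 65% of one spike gene's high modality
# for the next spike gene.
n_patients <- 500
overlap <- 0.65
high_a  <- sample(n_patients, 0.3 * n_patients)            # first spike gene
shared  <- sample(high_a, round(overlap * length(high_a))) # overlapping patients
fresh   <- sample(setdiff(seq_len(n_patients), high_a),
                  length(high_a) - length(shared))
high_b  <- c(shared, fresh)                                # second spike gene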

library(PARrOT)
table <- spike_in()

The generated table looks like this:

[Figure: table_out]

This table is then passed to calcscorematrix():

matrix <- calcscorematrix(table = table)

This generates a histogram of all weights before and after normalization as well as a PDF containing a heatmap before and after normalization. The returned adjacency matrix is structured as follows:

[Figure: matrix_out]
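For a quick look at the returned matrix from within R (plain base R calls, assuming the result is a numeric matrix; these are not part of PARrOT):

dim(matrix)                      # square, one row/column per gene
matrix[1:5, 1:5]                 # pairwise edge weights between genes
hist(matrix[upper.tri(matrix)])  # distribution of the normalized weights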

As mentioned in the README, 500 genes, or rather 1000 modalities and 1000^2 connections (edges), would be too large a workload to compute in an acceptable amount of time. For this reason the modalities are collapsed here into a single node per gene, and only the top 20 edges per gene (by default) are kept. This reduction is performed by:

reduced_graph <- buildSinglenode(matrix = matrix)

This command generates a histogram of all edge weights after the reduction as well as the file adjMatrix_singlenode.csv, which contains three columns: the first two contain the vertices of an edge and the third the weight between them. It looks like this:

[Figure: reduced_graph]
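The selection rule behind this file can be sketched in a few lines of R. This illustrates the default top-20 reduction only and is not the actual buildSinglenode() implementation:

# Illustration only: keep the 20 strongest edges per gene,
# assuming gene names as row/column names of the matrix.
top_k <- 20
edges <- do.call(rbind, lapply(rownames(matrix), function(g) {
  w <- matrix[g, ]
  w <- sort(w[names(w) != g], decreasing = TRUE)[seq_len(top_k)]
  data.frame(from = g, to = names(w), weight = w)
}))

The generated file itself can be read back for inspection with read.csv("adjMatrix_singlenode.csv").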

This file can then be read and processed by the Graph.py script, which is started from the command line:

python <git repository>/Graph.py -i <R working dir>/adjMatrix_singlenode.csv -o <desired output directory>

The Python script fits four different generative models to the given graph. The algorithms are implemented with graph-tool.

These generative models are then compared by their minimum description length (MDL), and the best-fitting model is used for the subsequent analyses.
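Put briefly, the description length of a fit decomposes as Σ = S + L, where S is the information needed to describe the network given the fitted model and L the information needed to describe the model parameters themselves; the model with the smallest Σ gives the most parsimonious fit.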

The first two models are stochastic block models (SBM). The first is the default version; the second is a variation in which a vertex can be part of several clusters/blocks.

The other two models are nested block models (NBM). The first is again the default version of an NBM; the second is an NBM where an additional equilibration step is performed.

For all models, a normal and a degree-corrected variant are compared, and only the more precise one is used.

In this test scenario the best fit is achieved by the nested model, which can be visualized as follows:

[Figure: nbm] NBM, where each dot on the circle represents a vertex and the edges between the vertices are represented by the connecting lines.

The structure of the graph, with its larger noise component and the inserted spike, can be visualized even better with the stochastic block model.

[Figure: sbm] SBM, where the blue noise vertices can be clearly separated from the yellow vertices, which represent the spike.