Many biological systems involve multiple interacting factors affecting an outcome synergistically and/or redundantly, e.g. the genetic contribution to a phenotype or the tight interplay of genes within a gene-regulatory network (GRN)[^2]. Information theory provides a set of measures that characterize statistical dependencies between pairs of random variables with considerable advantages over simpler measures such as (Pearson) correlation, since these measures capture non-linear dependencies and reflect the dynamics between pairs or groups of genes[^3]. In these settings, we are concerned with the statistics of how two (or more) random variables X1, X2, called source variables, jointly or separately specify/predict another random variable Z, called the target variable. The source variables can provide information about the target uniquely, redundantly, or synergistically (see the decomposition below). The mutual information (I) between the source variables and the target variable is equal to the sum of four partial information terms:
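`I(Z; X1, X2) = Unique(X1) + Unique(X2) + Redundancy + Synergy`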
Here, we implemented the nonnegative decomposition of multivariate information for three random vectors, as proposed by Williams and Beer[^1], as an R package.
Given three random vectors `x1`, `x2` and `z`, the PID can be calculated with the `pid` function.
```r
library(rPID)

# Simulate three continuous random vectors
z  <- rnorm(100)
x1 <- rnorm(100)
x2 <- rnorm(100)

# Discretize the continuous values before estimating information terms
zd  <- discretize(z)
x1d <- discretize(x1)
x2d <- discretize(x2)

# Partial information decomposition of z into contributions from x1 and x2
decomposition <- pid(zd, x1d, x2d)
```
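As a quick consistency check, the four terms should add up (modulo estimation error) to the total mutual information I(z; x1, x2); a minimal sketch, assuming the result is a named list as shown in the circuit example further below:

```r
# Sum of the four partial information terms equals I(z; x1, x2)
with(decomposition, unique_x1 + unique_x2 + synergy + redundancy)
```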
Let's assume that the proteins AmtR and SrpR can form a heterodimer that acts on the YFP promoter and drives YFP expression. They are connected in an AND-gated genetic circuit, which means that expression of AmtR or SrpR alone does not drive YFP expression.
The corresponding expression truth table looks like this:
| AmtR expression | SrpR expression | YFP expression |
|---|---|---|
| low | low | low |
| high | low | low |
| low | high | low |
| high | high | high |
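For a noise-free AND gate with uniform, independent inputs, the expected decomposition can be worked out by hand. Under the redundancy measure of Williams and Beer[^1], the unique terms vanish, redundancy is about 0.311 bits, and synergy is 0.5 bits; a small sketch of the arithmetic:

```r
# Shannon entropy of a discrete distribution (in bits)
h <- function(p) -sum(p[p > 0] * log2(p[p > 0]))

H_Z    <- h(c(1/4, 3/4))              # I(Z; X1, X2) = H(Z) ~ 0.811, since YFP is deterministic
I_Z_X1 <- H_Z - 0.5 * h(c(1/2, 1/2))  # ~ 0.311: given AmtR low, YFP is fixed; given high, a fair coin
redundancy <- I_Z_X1                  # for the symmetric AND gate, all of I(Z; X1) is redundant
synergy    <- H_Z - redundancy        # ~ 0.5 bits, provided only by both inputs jointly
```

With noisy expression data and plug-in estimators, the empirical values reported below deviate somewhat from these idealized numbers.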
Now assume we collected expression data for AmtR, SrpR and YFP from 400 cells, with each of the four expression combinations present in 100 cells.
```r
data(circuit_data)
head(circuit_data)
#           YFP       AmtR        SrpR
# 1 0.006546362 0.04528976 0.003372873
# 2 0.016749429 0.07870458 0.002703958
# 3 0.008629785 0.04528976 0.001836538
```
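For illustration, data of this shape could be simulated along the following lines (a hypothetical sketch; the packaged `circuit_data` may have been generated differently):

```r
set.seed(42)

# Log-normal expression noise around a "low" or "high" mean level (hypothetical choice)
expr_level <- function(state, n) rlnorm(n, meanlog = ifelse(state == "high", -3, -5), sdlog = 0.3)

# 100 cells for each of the four input combinations of the truth table
combos <- expand.grid(AmtR = c("low", "high"), SrpR = c("low", "high"), stringsAsFactors = FALSE)
sim_data <- do.call(rbind, lapply(seq_len(nrow(combos)), function(i) {
  # AND gate: YFP is high only if both AmtR and SrpR are high
  yfp_state <- if (combos$AmtR[i] == "high" && combos$SrpR[i] == "high") "high" else "low"
  data.frame(YFP  = expr_level(yfp_state, 100),
             AmtR = expr_level(combos$AmtR[i], 100),
             SrpR = expr_level(combos$SrpR[i], 100))
}))
```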
The correlation values alone do not allow us to attribute YFP expression to AmtR or SrpR; both correlations are moderate and nearly identical:
```r
cor(circuit_data$YFP, circuit_data$AmtR, method = "pearson")
# [1] 0.3688194
cor(circuit_data$YFP, circuit_data$SrpR, method = "pearson")
# [1] 0.3539511
```
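The same limitation applies to pairwise mutual information: because the circuit is symmetric in its inputs, both sources carry roughly the same amount of information about YFP when considered separately. A sketch using the `entropy` package (a dependency of rPID; the bin counts are an arbitrary choice for illustration):

```r
library(entropy)

# Pairwise mutual information from a discretized 2D contingency table
mi.empirical(discretize2d(circuit_data$YFP, circuit_data$AmtR, numBins1 = 10, numBins2 = 10))
mi.empirical(discretize2d(circuit_data$YFP, circuit_data$SrpR, numBins1 = 10, numBins2 = 10))
```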
However, using partial information decomposition, we can explore three-way interactions and quantify unique, synergistic and redundant information between the target variable (`z = YFP`) and the set of two source variables (`x1 = AmtR`, `x2 = SrpR`):
```r
# Use discretized expression data
data(circuit_data_discrete)
pid(z = circuit_data_discrete$YFP, x1 = circuit_data_discrete$AmtR, x2 = circuit_data_discrete$SrpR)
# $unique_x1
# [1] 0.0370862
#
# $unique_x2
# [1] 0.0368186
#
# $synergy
# [1] 0.597708
#
# $redundancy
# [1] 0.3251131
```
The high synergy value indicates that only both source variables together provide full information about the target variable, exactly as expected for an AND gate. Feel free to explore other combinations of source and target variables:
```r
# No unique or synergistic information, total redundancy:
pid(z = circuit_data_discrete$YFP, x1 = circuit_data_discrete$AmtR, x2 = circuit_data_discrete$AmtR)

# No synergistic information, most of the information is unique to x1:
pid(z = circuit_data_discrete$YFP, x1 = circuit_data_discrete$YFP, x2 = circuit_data_discrete$SrpR)

# High synergistic information and unique information from x2:
pid(z = circuit_data_discrete$SrpR, x1 = circuit_data_discrete$AmtR, x2 = circuit_data_discrete$YFP)
```
To install the package, use:
```r
# Install the 'entropy' dependency, then rPID itself
library(devtools)
install.packages("entropy")
install_github("loosolab/rPID", host = "github.molgen.mpg.de")
```
To use the Bayesian Blocks discretizer, additionally install the `astroML` Python package via `pip install astroML`.
The project is licensed under the MIT license.
[^1]: Williams PL and Beer RD. Nonnegative Decomposition of Multivariate Information. arXiv (2010), https://arxiv.org/abs/1004.2515v1
[^2]: Griffith V and Ho T. Quantifying Redundant Information in Predicting a Target Random Variable. Entropy (2015), doi:10.3390/e17074644
[^3]: Chan et al. Network inference and hypotheses-generation from single-cell transcriptomic data using multivariate information measures. bioRxiv (2016), doi:10.1101/082099