Here is the source code for the biclustering method in the paper "Xinyi Yang and Martin Vingron, Grouping of human promoters by occupancy patterns yields recurring sequence elements, binding combinatorics, and spatial interactions."
Our method is based on a robust version of the s4vd biclustering algorithm [1], which is available as an R package. Due to its use of a randomized selection step, s4vd produces somewhat different results in different runs of the program. We exploit this with the goal of obtaining robust biclustering results by extracting the common cluster assignments from many runs. This script aim to merge biclustering results from many runs according to a probability matrix, so that we can have a more reliable and stable results.
Details can be found in the method part of the original paper. Additional explanation in a more specific mathematical form can be found in the Document/method.pdf
Besides, in the folder original_data/
, we provide the orignal input matrix in our paper, including active and inactive promoter matrix for GM12878 and K562.
Users can run this code in R. Required library:
- s4vd ( https://cran.r-project.org/web/packages/s4vd/index.html )
- lpSolve ( https://cran.r-project.org/web/packages/lpSolve/index.html )
First of all, users need to run s4vd multiple times and collect the results. We created a wrap function for the users who want to run in the R interface and a cmd line R script for users who want to submit to their clusters.
## generate_bicluster_res(Input_matrix, Output_dir,index=1,...)
## Input_mat: A matrix. The normalized input matrix for biclustering. In our papers, rows are TSS and columns are ChIP-seq experiments.
## Output_dir: A character. the directory where you want to collect the output
## index: An integer. User has to give each round of running a different index number, so that the new results won't overwrite the previous ones.
## ...: parameters in s4vd method
> setwd('biclustering/')
> source('source_code/bicluster.r')
> load("orignal_data/Input_matrix.RData")
> generate_bicluster_res(GM12878_active,"OUTPUT",index=1,nbiclust = 50,iter=10000,cols.nc=FALSE,rows.nc=FALSE,row.overlap=FALSE,col.overlap=FALSE,row.min=20,col.min=2)
In the command line you can run R script cmd_code/bicluster_cmd.r
.
# Rscript cmd_code/bicluster_cmd.r input_matrix.txt output_dir index
# input_matrix.txt: a file contain normalized input matrix. Matrix should not contain any header lines. In our papers, rows are TSS and columns are ChIP-seq experiments.
# output_dir: the directory where you want to collect the results
# index: user has to give each round of running a different index number, so that the new results won't overwrite the previous ones. It should be an integer.
# note that in the cmd script user can't change the s4vd parameter, but it is easy to hack the code.
$ Rscript cmd_code/bicluster_cmd.r original_data/GM12878_active.txt OUTPUT 1
This step merges several rounds of s4vd results you generated in Step 1. Note that if you generated your s4vd results without our wrap, you results should be saved as a RData named as "res_INDEX
", where INDEX
is a number. in the each of the RData file, the result of s4vd function should be saved and named as "res".
Here we give a demo folder, where we saved 50 rounds of s4vd results on GM12878_active matrix. Note that to you might want to have a more reliable and reasonable result based on more repeats.
#merge_bicluster(res_dir, cutoff=0.5, rowMin=20, colMin=2)
#res_dir: A character. The directory where you collect your s4vd results
#cutoff: Cutoff for probability matrix. A number in [0,1]
#rowMin: minimal number of rows in each of the cluster
#colMin: minimal number of columns in each of the cluster
#result of this function is a list of length 2. $RowCluster and $Colcluster are two vectors giving assigned cluster index for rows and columns, respectively.
cluster=merge_bicluster_res("OUTPUT_demo/")
[1] Martin Sill, Sebastian Kaiser, Axel Benner and Annette Kopp-Schneider "Robust biclustering by sparse singular value decomposition incorporating stability selection", Bioinformatics, 2011