README.md

# Introduction

Here is the source code for the biclustering method in the paper "Xinyi Yang and Martin Vingron, Grouping of human promoters by occupancy patterns yields recurring sequence elements, binding combinatorics, and spatial interactions."

Our method is based on a robust version of the s4vd biclustering algorithm [1], which is available as an R package. Due to its use of a randomized selection step, s4vd produces somewhat different results in different runs of the program. We exploit this with the goal of obtaining robust biclustering results by extracting the common cluster assignments from many runs. This script aim to merge biclustering results from many runs according to a probability matrix, so that we can have a more reliable and stable results.

Details can be found in the method part of the original paper. Additional explanation in a more specific mathematical form can be found in the `Document/method.pdf`

Besides, in the folder `original_data/`, we provide the orignal input matrix in our paper, including active and inactive promoter matrix for GM12878 and K562.

# Installation

Users can run this code in R. Required library:

 - s4vd ( https://cran.r-project.org/web/packages/s4vd/index.html )
 - lpSolve ( https://cran.r-project.org/web/packages/lpSolve/index.html )

# Step 1: Generate several rounds of biclustering results

First of all, users need to run s4vd multiple times and collect the results. We created a wrap function for the users who want to run in the R interface and a cmd line R script for users who want to submit to their clusters.

### R interface wrap

```R
## generate_bicluster_res(Input_matrix, Output_dir,index=1,...)
## Input_mat: A matrix. The normalized input matrix for biclustering. In our papers, rows are TSS and columns are ChIP-seq experiments.
## Output_dir: A character. the directory where you want to collect the output
## index: An integer. User has to give each round of running a different index number, so that the new results won't overwrite the previous ones.
## ...: parameters in s4vd method
> setwd('biclustering/')
> source('source_code/bicluster.r')
> load("orignal_data/Input_matrix.RData")
> generate_bicluster_res(GM12878_active,"OUTPUT",index=1,nbiclust = 50,iter=10000,cols.nc=FALSE,rows.nc=FALSE,row.overlap=FALSE,col.overlap=FALSE,row.min=20,col.min=2)
```
### Command line wrap

In the command line you can run R script `cmd_code/bicluster_cmd.r`.
```sh
# Rscript cmd_code/bicluster_cmd.r input_matrix.txt output_dir index
# input_matrix.txt: a file contain normalized input matrix. Matrix should not contain any header lines. In our papers, rows are TSS and columns are ChIP-seq experiments.
# output_dir: the directory where you want to collect the results
# index: user has to give each round of running a different index number, so that the new results won't overwrite the previous ones. It should be an integer.
# note that in the cmd script user can't change the s4vd parameter, but it is easy to hack the code.
$ Rscript cmd_code/bicluster_cmd.r original_data/GM12878_active.txt OUTPUT 1
```
# Step 2: Merge results

This step merges several rounds of s4vd results you generated in Step 1. Note that if you generated your s4vd results without our wrap, you results should be saved as a RData named as "res_`INDEX`", where `INDEX` is a number. in the each of the RData file, the result of s4vd function should be saved and named as "res".

Here we give a demo folder, where we saved 50 rounds of s4vd results on GM12878_active matrix. Note that to you might want to have a more reliable and reasonable result based on more repeats.

```R
#merge_bicluster(res_dir, cutoff=0.5, rowMin=20, colMin=2)
#res_dir: A character. The directory where you collect your s4vd results
#cutoff: Cutoff for probability matrix. A number in [0,1]
#rowMin: minimal number of rows in each of the cluster
#colMin: minimal number of columns in each of the cluster
#result of this function is a list of length 2. $RowCluster and $Colcluster are two vectors giving assigned cluster index for rows and columns, respectively.
cluster=merge_bicluster_res("OUTPUT_demo/")
```

# Reference

[1] Martin Sill, Sebastian Kaiser, Axel Benner and Annette Kopp-Schneider "Robust biclustering by sparse singular value decomposition incorporating stability selection", Bioinformatics, 2011
	# Introduction

	Here is the source code for the biclustering method in the paper "Xinyi Yang and Martin Vingron, Grouping of human promoters by occupancy patterns yields recurring sequence elements, binding combinatorics, and spatial interactions."

	Our method is based on a robust version of the s4vd biclustering algorithm [1], which is available as an R package. Due to its use of a randomized selection step, s4vd produces somewhat different results in different runs of the program. We exploit this with the goal of obtaining robust biclustering results by extracting the common cluster assignments from many runs. This script aim to merge biclustering results from many runs according to a probability matrix, so that we can have a more reliable and stable results.

	Details can be found in the method part of the original paper. Additional explanation in a more specific mathematical form can be found in the `Document/method.pdf`

	Besides, in the folder `original_data/`, we provide the orignal input matrix in our paper, including active and inactive promoter matrix for GM12878 and K562.

	# Installation

	Users can run this code in R. Required library:

	- s4vd ( https://cran.r-project.org/web/packages/s4vd/index.html )
	- lpSolve ( https://cran.r-project.org/web/packages/lpSolve/index.html )

	# Step 1: Generate several rounds of biclustering results

	First of all, users need to run s4vd multiple times and collect the results. We created a wrap function for the users who want to run in the R interface and a cmd line R script for users who want to submit to their clusters.

	### R interface wrap

	```R
	## generate_bicluster_res(Input_matrix, Output_dir,index=1,...)
	## Input_mat: A matrix. The normalized input matrix for biclustering. In our papers, rows are TSS and columns are ChIP-seq experiments.
	## Output_dir: A character. the directory where you want to collect the output
	## index: An integer. User has to give each round of running a different index number, so that the new results won't overwrite the previous ones.
	## ...: parameters in s4vd method
	> setwd('biclustering/')
	> source('source_code/bicluster.r')
	> load("orignal_data/Input_matrix.RData")
	> generate_bicluster_res(GM12878_active,"OUTPUT",index=1,nbiclust = 50,iter=10000,cols.nc=FALSE,rows.nc=FALSE,row.overlap=FALSE,col.overlap=FALSE,row.min=20,col.min=2)
	```
	### Command line wrap

	In the command line you can run R script `cmd_code/bicluster_cmd.r`.
	```sh
	# Rscript cmd_code/bicluster_cmd.r input_matrix.txt output_dir index
	# input_matrix.txt: a file contain normalized input matrix. Matrix should not contain any header lines. In our papers, rows are TSS and columns are ChIP-seq experiments.
	# output_dir: the directory where you want to collect the results
	# index: user has to give each round of running a different index number, so that the new results won't overwrite the previous ones. It should be an integer.
	# note that in the cmd script user can't change the s4vd parameter, but it is easy to hack the code.
	$ Rscript cmd_code/bicluster_cmd.r original_data/GM12878_active.txt OUTPUT 1
	```
	# Step 2: Merge results

	This step merges several rounds of s4vd results you generated in Step 1. Note that if you generated your s4vd results without our wrap, you results should be saved as a RData named as "res_`INDEX`", where `INDEX` is a number. in the each of the RData file, the result of s4vd function should be saved and named as "res".

	Here we give a demo folder, where we saved 50 rounds of s4vd results on GM12878_active matrix. Note that to you might want to have a more reliable and reasonable result based on more repeats.

	```R
	#merge_bicluster(res_dir, cutoff=0.5, rowMin=20, colMin=2)
	#res_dir: A character. The directory where you collect your s4vd results
	#cutoff: Cutoff for probability matrix. A number in [0,1]
	#rowMin: minimal number of rows in each of the cluster
	#colMin: minimal number of columns in each of the cluster
	#result of this function is a list of length 2. $RowCluster and $Colcluster are two vectors giving assigned cluster index for rows and columns, respectively.
	cluster=merge_bicluster_res("OUTPUT_demo/")
	```

	# Reference

	[1] Martin Sill, Sebastian Kaiser, Axel Benner and Annette Kopp-Schneider "Robust biclustering by sparse singular value decomposition incorporating stability selection", Bioinformatics, 2011