
# CoMIK : Conformal Multi-Instance Kernels

_CoMIK_ is a novel approach for sequence comparison that allows larger positional freedom than most existing approaches and can identify a possibly dispersed set of features when comparing variable-length sequences in a discriminative setting (classification).
_CoMIK_ identifies not just the features useful for classification but also their locations in the variable-length sequences, aided by recently introduced visualization techniques.

Manuscript authors: Sarvesh Nikumbh, Peter Ebert, Nico Pfeifer

If you use _CoMIK_, please cite us as follows:
```
Nikumbh S, Ebert P, Pfeifer N: All Fingers Are Not the Same: Handling Variable-Length Sequences in a Discriminative Setting Using Conformal Multi-Instance Kernels. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Schloss Dagstuhl; 2017. [Leibniz International Proceedings in Informatics, vol. 88]
```

More information (e.g., Bibtex entry) is available at: http://drops.dagstuhl.de/opus/volltexte/2017/7645/

bioRxiv DOI: https://doi.org/10.1101/139618

## Requirements:
- MATLAB 9.0.0.341360 (R2016a) [We developed _CoMIK_ using this version of MATLAB. Compatibility checked with version 8.6 (R2015b)]
- Python 2.7 or higher
- SHOGUN Release version 3.2.0 [See issue #7 (issues/7) in this regard. _CoMIK_ also works with the latest SHOGUN version 6.0.0]

For the visualizations, the following R (R version 3.4.0) packages are needed:
- Lattice (lattice_0.20-35)


This repository contains MATLAB code for the project *CoMIK*. The various ".m" files define the corresponding MATLAB functions.
We use the multiple kernel learning (MKL) implementation from SHOGUN's modular interface for Python. The Python script `mkl.py` solves the MKL problem.

If you have MATLAB installed, you can run _CoMIK_ from inside MATLAB. If you do not have MATLAB installed, we provide an executable version, for which you additionally need the MATLAB Runtime. The MATLAB Runtime can be downloaded from https://mathworks.com/products/compiler/mcr.html . We recommend installing version 9.0 (R2016a). If that does not work, version 8.6 (R2015b) can be used instead.

## Installation:
```
git clone https://github.molgen.mpg.de/snikumbh/comik.git
cd comik
sh install.sh
```
- Install SHOGUN.
- Only if you plan to use the executable, install the MATLAB Runtime. Follow its installation instructions; it can be installed at any location of your choice on the disk.

Once the dependencies (e.g., SHOGUN, MATLAB Runtime) are installed and the paths are set, you can test _CoMIK_ as follows:
```
a) sh test_install.sh matlab

 OR

b) sh test_install.sh executable <your_MCR_path_here> <version>
```
With (a), _CoMIK_ is tested from inside MATLAB, and with (b), the _CoMIK_ executable is tested. An example command to test the executable for version 9.0 (R2016a):
```
sh test_install.sh executable /usr/lib/matlab-9.0 v90
```

## Usage:
If you have MATLAB, an example function call from inside MATLAB, using simulated dataset 1 provided in the folder `sample_data/simulated_dataset1`, is as follows:
```Matlab
comik_wrapper('config-comik.txt');
```

If not, you can use the executable as follows:

```
# ./run_CoMIK_v90.sh <MATLAB_Runtime_location> <config_file>
# for version 9.0 (R2016a)
./run_CoMIK_v90.sh /usr/lib/matlab-9.0 config-comik.txt
 OR
# for version 8.6 (R2015b)
./run_CoMIK_v86.sh /usr/lib/matlab-8.6 config-comik.txt
```
Here, replace `/usr/lib/matlab-9.0` (or `/usr/lib/matlab-8.6`) with the location of the MATLAB Runtime on your machine. Additionally, when required, you can add your own paths to the `LD_LIBRARY_PATH` environment variable in the file `run_CoMIK_v86.sh` or `run_CoMIK_v90.sh` (for example, the path for SHOGUN can be added here).


_CoMIK_ requires two FASTA files as input: the first contains the sequences of the positive class, the second contains the sequences of the negative class. The other parameters are explained below.

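Because the number of sequences in each FASTA file is a required parameter, it can help to count the records before writing the configuration. The following is a minimal Python 3 sketch, not part of _CoMIK_; the helper name `count_fasta_records` and the file name in the comment are assumptions made for illustration.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    return sum(1 for line in handle if line.startswith(">"))


# Tiny in-memory example. For a real file, pass an open file handle instead,
# e.g. open("positives.fasta") ("positives.fasta" is a placeholder name).
example = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nGGCC\n")
n_sequences = count_fasta_records(example)
print(n_sequences)
```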
Values for the following parameters are required to be set:

- positive FASTA filename [type: str]
- negative FASTA filename [type: str]
- number of positive sequences [type: int]
- number of negative sequences [type: int]
- Indices of the test sequences [type: int, given as a Matlab vector]
- output-folder-name [type: str]

The rest have default values, which can be good starting points.

Param name | type  | default value | Additional comments
-----------|-------|---------------|---------------------
 oligomer-lengths-as-vector | vector of ints | [2] | Required for the _ODH_ representation. Recommended values: 2 or 3, which usually suffice. Passing a vector [2 3] will run _comik_ first with oligomer-length 2, followed by an independent second run with oligomer-length 3. Further, see **Note 2** below
 maximum-distance | int | 50 | Required for the _ODH_ representation. Typically, a maximum distance of 100 basepairs suffices even if the segment-size is larger
 segment-size | int | 100 | See **Note 1** below
 number-of-clusters | vector of ints | [2 5] | Recommended maximum number of clusters: 7
 sigma-values-for-Gaussian-transformation | vector of floats | 10.^[-1:1:2] | Typical values: 10.^[-1:1:2], which is MATLAB notation for the vector [0.1, 1.0, 10.0, 100.0]
 cost-values-for-SVM | vector of floats | 10.^[-3:1:3] | typical values: 10.^[-3:1:3]
 mklNorm | int/float | 2.0 | typically 1.0 or 2.0
 number-of-inner-folds | int | 10 |
 number-of-outer-folds | int | 5 |
 whetherToPlotHeatmap | str | 'No' | Possible values: 'Yes' or 'No'; Set 'Yes' only if all sequences are of the same length.
 whetherToVisualizeWVector | str | 'Yes' | Possible values: 'Yes' or 'No'; Set 'Yes' when you wish to have the distance-centric k-mer visualization.
 debugLevel | int | 2 | Possible values: [0, 1, 2]. Value 0 makes _comik_ completely silent, and 2 makes it maximally verbose. Value 1 may be used in the future.
 debugMsgLocation | int/str | 1 | Value 1 denotes the command prompt, else specify a filename, say 'runLog.txt'. This file is written for each outer fold separately.
 computationVersion | str | 'Looping' | Possible values: 'Looping' or 'AccumArray'. 'Looping' is faster and is recommended for large datasets, where 'AccumArray' can be memory intensive.

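For readers less familiar with MATLAB notation, the two default grids in the table above can be expanded with a short Python 3 sketch. `matlab_power_grid` is a hypothetical helper written for this illustration; it is not part of _CoMIK_ and only handles integer exponents with a positive step.

```python
def matlab_power_grid(start, step, stop, base=10.0):
    """Mimic MATLAB's base.^[start:step:stop] for integer exponents.

    Unlike Python's range, the MATLAB colon operator includes the end point,
    hence the `stop + 1` below.
    """
    return [base ** exponent for exponent in range(start, stop + 1, step)]


sigma_values = matlab_power_grid(-1, 1, 2)  # MATLAB: 10.^[-1:1:2]
cost_values = matlab_power_grid(-3, 1, 3)   # MATLAB: 10.^[-3:1:3]
print(len(sigma_values), sigma_values)
print(len(cost_values), cost_values)
```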
The `comik_wrapper` function handles the creation and running of the outer cross-validation folds (as part of nested cross-validation). The supplied indices of the test sequences (`test_indices`) are used to determine the proportion of positives and negatives from the whole set that is to be treated as unseen test examples. The given set of sequences is then shuffled before it is split into training and unseen test examples as per these proportions. The indices of the samples treated as unseen test examples are also written to disk per outer fold (filename: testIndices.txt).

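The idea of deriving class proportions from the supplied test indices and then splitting a shuffled set can be sketched as follows. This is an illustrative Python 3 sketch under stated assumptions, not the logic of `comik_wrapper` itself: it assumes 1-based indices where `1..n_pos` denote positives and the remaining indices denote negatives, and the function name `split_by_test_proportion` and the `seed` argument are hypothetical.

```python
import random


def split_by_test_proportion(n_pos, n_neg, test_indices, seed=0):
    """Return (train, test) lists of 1-based sequence indices.

    Assumption: indices 1..n_pos are positives, n_pos+1..n_pos+n_neg negatives.
    Only the per-class counts of `test_indices` are used; the actual test
    members are drawn after shuffling each class.
    """
    n_test_pos = sum(1 for i in test_indices if i <= n_pos)
    n_test_neg = len(test_indices) - n_test_pos

    rng = random.Random(seed)
    positives = list(range(1, n_pos + 1))
    negatives = list(range(n_pos + 1, n_pos + n_neg + 1))
    rng.shuffle(positives)
    rng.shuffle(negatives)

    test = positives[:n_test_pos] + negatives[:n_test_neg]
    train = positives[n_test_pos:] + negatives[n_test_neg:]
    return train, test


train, test = split_by_test_proportion(
    n_pos=10, n_neg=10, test_indices=[1, 2, 11, 12, 13]
)
print(len(train), len(test))
```

In this example, two of the five supplied indices are positives, so the shuffled split holds out two positives and three negatives.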

**Note 1**: If you want to perform a quick, exploratory run of _CoMIK_ on your data, keep in mind that the total number of segments depends on the number of sequences in the collection and their lengths. If there are many very long sequences, a small segment-size may lead to a very high total number of segments, and the resulting computation time may be prohibitive for an initial run. Hence, for such an initial run, the following values are recommended:

- oligomer-length: [2]
- maximum-distance: min(segment-size, 50)
- number-of-clusters: [2 5]
- segment-size: 100 or 200

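To get a rough feel for how the segment-size affects the total number of segments, the following Python 3 sketch counts segments for a few sequence lengths. The segmentation scheme used here (fixed-size windows moved by a `stride`, non-overlapping by default) is an assumption made for illustration and may differ from what _CoMIK_ does internally; `count_segments` is a hypothetical helper.

```python
def count_segments(seq_length, segment_size, stride=None):
    """Number of fixed-size windows that fit into a sequence.

    Assumption: windows start every `stride` positions (default: non-overlapping,
    stride == segment_size). Sequences shorter than the segment-size yield none.
    """
    if stride is None:
        stride = segment_size
    if seq_length < segment_size:
        return 0
    return (seq_length - segment_size) // stride + 1


# Hypothetical sequence lengths, in basepairs.
lengths = [250, 1000, 5000, 80]
totals = {}
for segment_size in (100, 200):
    totals[segment_size] = sum(count_segments(n, segment_size) for n in lengths)
    print(segment_size, totals[segment_size])
```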
**Note 2**: Oligomer-lengths of 3 or larger can, depending on the number and length of the sequences and the segment-size used, lead to very high-dimensional vectors, which can be memory-intensive.

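The growth in dimensionality can be estimated with a short Python 3 sketch. It assumes the usual oligomer distance histogram (ODH) layout, with one counter per ordered pair of oligomers and per distance from 0 to the maximum distance; the exact vector layout in _CoMIK_ may differ, and `odh_dimension` is a hypothetical helper.

```python
def odh_dimension(oligomer_length, max_distance, alphabet_size=4):
    """Feature-vector length under the assumed ODH layout.

    One entry per ordered oligomer pair and per distance 0..max_distance.
    """
    n_oligomers = alphabet_size ** oligomer_length
    return n_oligomers * n_oligomers * (max_distance + 1)


dimensions = {}
for k in (2, 3, 4):
    dimensions[k] = odh_dimension(k, max_distance=50)
    megabytes = dimensions[k] * 8 / 1e6  # one dense float64 vector
    print(k, dimensions[k], round(megabytes, 2))
```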
**Note 3**: Presently, _CoMIK_ uses a MATLAB `parfor` loop to execute the outer cross-validation folds in parallel.

During the run,
* _CoMIK_ omits from the run all sequences that are shorter than the specified segment-size. It reports the number of omitted sequences and writes their FASTA-Ids to a separate file named `omittedFastaIds.txt`, separately for each outer fold.

* the following files are written to the disk per outer fold at any intermediate stage of the pipeline. Most of these are used by the pipeline itself in its subsequent stages.
    - Run summary file: The resultString is also written to the summary file which is characterized by the segment-size and oligomer-length. The summary file is typically named: 'runSummary_segment-sizeX_oligoLenY.txt' where X and Y are as set for the pipeline run.
    - The various train and test kernels (as .csv files)
    - The weight vectors corresponding to the kernels
    - The support vector indices
    - The alpha values
    - SVM bias value
    - Kernel weights upon performing MKL
    - The visualizations of the sequence logos
    - Heatmaps if the flag has been set

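To anticipate which sequences would be omitted for a given segment-size, the input files can be checked beforehand. The following Python 3 sketch is independent of _CoMIK_; the helper names `read_fasta_lengths` and `ids_shorter_than` are hypothetical, and the identifier is taken to be the first whitespace-delimited token of the header line, which is an assumption about how FASTA-Ids are reported.

```python
import io


def read_fasta_lengths(handle):
    """Return a list of (identifier, sequence_length) pairs from a FASTA stream."""
    records = []
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, length))
            header = line[1:].split()
            name = header[0] if header else ""
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        records.append((name, length))
    return records


def ids_shorter_than(records, segment_size):
    """Identifiers of sequences shorter than the given segment-size."""
    return [name for name, length in records if length < segment_size]


# In-memory example: one sequence of 120 bp and one of 8 bp.
example = io.StringIO(
    ">long_seq\n" + "ACGT" * 30 + "\n>short_seq some description\nACGTACGT\n"
)
records = read_fasta_lengths(example)
short_ids = ids_shorter_than(records, segment_size=100)
print(short_ids)
```

The same `records` list also shows whether all sequences have the same length, which is the condition for setting `whetherToPlotHeatmap` to 'Yes'.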

For comments and questions, feel free to [report an issue](https://github.molgen.mpg.de/snikumbh/comik/issues/new)