CoMIK : Conformal Multi-Instance Kernels

Manuscript authors: Sarvesh Nikumbh, Peter Ebert, Nico Pfeifer

To appear in Proceedings of WABI 2017. DOI at bioRxiv: https://doi.org/10.1101/139618

Requirements:

MATLAB 9.0.0.341360 (R2016a) [We developed comik using this version of MATLAB. Compatibility with earlier versions has not been checked.]
Python 2.7 or higher
SHOGUN Release version 3.2.0

This repository contains the Matlab code for the project CoMIK. The various ".m" files define the corresponding Matlab functions. We use the MKL implementation from Shogun's modular interface for Python. The Python script mkl.py solves the MKL problem.

Installation:

git clone https://github.molgen.mpg.de/snikumbh/comik
cd comik
chmod +x mkl.py
export PATH=".:$PATH"

Usage:

Example function call from inside Matlab, for simulated dataset 1 provided in the folder sample_data:

comik_wrapper('sample_data/simulated_dataset1/pos.fasta', 'sample_data/simulated_dataset1/neg.fasta', 600, 600, [501:600 1101:1200], 'comik_run_simulated_dataset1', [2], 10, 10, [2 5 7], 10.^[1:1:2], 10.^[-3:1:3], 2.0, 10, 5, 'No', 'Yes', 2, 'runLog.txt');

CoMIK accepts two FASTA files as input: the first contains the sequences of the positive class, and the second contains the sequences of the negative class. The other parameters are explained below.

Further details:

The CoMIK wrapper function takes the following arguments as input

positive FASTA filename [type: str]
negative FASTA filename [type: str]
number of positive sequences [type: int]
number of negative sequences [type: int]
Indices of the test sequences [type: int, given as a Matlab vector]
output-folder-name [type: str]

The above are required while the following have default values

Param name	type	default value	Additional comments
oligomer-lengths-as-vector	vector of ints	[2]	Required for the ODH representation. Recommended values: 2 or 3, which usually suffice. Passing a vector [2 3] runs comik first with oligomer-length 2, followed by an independent second run with oligomer-length 3. See also Note 2 below
maximum-distance	int	50	Required for the ODH representation. Typically, a maximum distance of 100 basepairs suffices even if the segment-size is larger
segment-size	int	100	See Note 1 below
number-of-clusters	vector of ints	[2 5]	Recommended maximum number of clusters: 7
sigma-values-for-Gaussian-transformation	vector of floats	10.^[-1:1:2]	Typical values: 10.^[-1:1:2], which is Matlab notation for the vector [0.1, 1.0, 10.0, 100.0]
cost-values-for-SVM	vector of floats	10.^[-3:1:3]	typical values: 10.^[-3:1:3]
mklNorm	int/float	2.0	typically 1.0 or 2.0
number-of-inner-folds	int	10
number-of-outer-folds	int	5
whetherToPlotHeatmap	str	'No'	Possible values: 'Yes' or 'No'; Set 'Yes' only if all sequences are of the same length.
whetherToVisualizeWVector	str	'Yes'	Possible values: 'Yes' or 'No'; Set 'Yes' when you wish to have the distance-centric k-mer visualization.
debugLevel	int	2	Possible values: [0, 1, 2]. Value 0 makes comik completely silent, and 2 makes it maximally verbose. Value 1 may be used in the future.
debugMsgLocation	int/str	1	Value 1 denotes the command prompt, else specify a filename, say 'runLog.txt'. This file is written for each outer fold separately.
computationVersion	str	'Looping'	Possible values: 'Looping' or 'AccumArray'. 'Looping' is faster and is recommended for large datasets, where 'AccumArray' can be memory-intensive.
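
For readers more at home in Python than in Matlab, the value grids in the table above can be reproduced as follows. This is an illustrative sketch only and is not part of comik; the helper name matlab_pow10_range is hypothetical.

```python
def matlab_pow10_range(start, step, stop):
    """Mimic Matlab's 10.^[start:step:stop] for integer exponents.

    Unlike Python's range(), Matlab's colon operator includes the end
    point, hence the `stop + 1` below (valid for positive steps).
    """
    if step <= 0:
        raise ValueError("this sketch only handles positive integer steps")
    return [10.0 ** e for e in range(start, stop + 1, step)]


sigmas = matlab_pow10_range(-1, 1, 2)  # default sigma grid, 10.^[-1:1:2]
costs = matlab_pow10_range(-3, 1, 3)   # default cost grid, 10.^[-3:1:3]
print(sigmas)
print(len(costs))
```

The default sigma grid therefore has four values and the default cost grid has seven.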

The comik_wrapper function creates and runs the outer cross-validation folds (as part of nested cross-validation). The supplied indices of the test sequences (test_indices) are used to note the proportion of positives and negatives from the whole set that is to be treated as unseen test examples. The given set of sequences is then shuffled before being split into training and unseen test examples as per these proportions. The indices of the samples treated as unseen test examples are also written to disk per outer fold (filename: testIndices.txt).
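
The splitting logic described above can be sketched in Python as follows. This is one reading of the description, not comik's actual code: it assumes that sequences are indexed from 1 with all positives first (so that, in the example call, [501:600 1101:1200] would pick 100 positives and 100 negatives), and the function name and the seed argument are hypothetical.

```python
import random


def shuffled_test_split(n_pos, n_neg, test_indices, seed=0):
    """Return (train, test) lists of 1-based indices.

    Assumption: indices 1..n_pos are positives and the following n_neg
    indices are negatives. Only the *counts* of positive and negative
    test indices are reused; the members are redrawn after shuffling.
    """
    n_test_pos = sum(1 for i in test_indices if i <= n_pos)
    n_test_neg = len(test_indices) - n_test_pos

    rng = random.Random(seed)
    pos = list(range(1, n_pos + 1))
    neg = list(range(n_pos + 1, n_pos + n_neg + 1))
    rng.shuffle(pos)
    rng.shuffle(neg)

    test = sorted(pos[:n_test_pos] + neg[:n_test_neg])
    train = sorted(pos[n_test_pos:] + neg[n_test_neg:])
    return train, test


# Values from the example call: 600 positives, 600 negatives.
requested = list(range(501, 601)) + list(range(1101, 1201))
train, test = shuffled_test_split(600, 600, requested)
print("%d training, %d test" % (len(train), len(test)))
```

Under these assumptions the class proportions of the requested test set are preserved, while the actual test members depend on the shuffle.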

Note 1: For a quick, exploratory run of comik on your data, note that a small segment-size can produce a very large total number of segments when the collection contains many very long sequences, and the resulting computation time may be prohibitive for an initial run. Hence, for such a run, the following values are recommended:

oligomer-length: [2]
maximum-distance: min(segment-size, 50)
number-of-clusters: [2 5]
segment-size: 100 or 200

Note 2: Depending on the number and length of the sequences and the segment-size used, oligomer-lengths of 3 or more can lead to very high-dimensional vectors, which can be memory-intensive.
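
To get a feel for the growth mentioned in Note 2, the following Python sketch estimates the length of one ODH vector. It assumes the standard oligomer distance histogram layout (one bin per ordered pair of oligomers per distance from 0 to the maximum distance) over the four-letter DNA alphabet; comik's exact layout may differ, and the function name is hypothetical.

```python
def odh_dimension(oligomer_length, max_distance, alphabet_size=4):
    """Estimated number of entries in one ODH vector.

    Assumption: one bin for every ordered oligomer pair and every
    distance 0..max_distance, i.e. (A^K)^2 * (D + 1).
    """
    n_oligomers = alphabet_size ** oligomer_length
    return n_oligomers ** 2 * (max_distance + 1)


for k in (2, 3, 4):
    dim = odh_dimension(k, max_distance=50)
    # 8 bytes per entry if stored as dense double-precision values
    print("K=%d: %d entries, approx. %.1f MB per dense vector"
          % (k, dim, dim * 8 / 1e6))
```

Under these assumptions, going from oligomer-length 2 to 3 multiplies the vector length by 16, and this cost is paid once per segment.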

Note 3: Presently, comik uses a MATLAB parfor-loop to execute the outer cross-validation folds in parallel.

During the run,

CoMIK omits from the run the sequences that are shorter than the specified segment-size. It reports the number of omitted sequences and their FASTA-Ids in a file named ommittedFastaIds.txt, written separately per outer fold.
The following files are written to disk per outer fold at intermediate stages of the pipeline. Most of these are used by the pipeline itself in its subsequent stages.
- Run summary file: The resultString is also written to a summary file whose name reflects the segment-size and oligomer-length. The summary file is typically named 'runSummary_segment-sizeX_oligoLenY.txt', where X and Y are the values set for the pipeline run.
- The various train and test kernels (as .csv files)
- The weight vectors corresponding to the kernels
- The support vector indices
- The alpha values
- SVM bias value
- Kernel weights upon performing MKL
- The visualizations of the sequence logos
- Heatmaps if the flag has been set
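
Before a run, the length filter described above can be previewed with a small stand-alone Python script. This is a sketch and not part of comik: it assumes plain FASTA input in which the ID is the first whitespace-delimited token of each header line, and the function names are hypothetical.

```python
import io


def read_fasta_lengths(handle):
    """Yield (fasta_id, sequence_length) for each record in a FASTA stream."""
    fasta_id, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if fasta_id is not None:
                yield fasta_id, length
            # ID = first token of the header line (assumed convention)
            fields = line[1:].split()
            fasta_id, length = (fields[0] if fields else ""), 0
        elif fasta_id is not None:
            length += len(line)
    if fasta_id is not None:
        yield fasta_id, length


def ids_shorter_than(handle, segment_size):
    """IDs of sequences that are shorter than segment_size."""
    return [i for i, n in read_fasta_lengths(handle) if n < segment_size]


# Tiny in-memory example; for real data, pass an opened FASTA file instead.
demo = io.StringIO(u">seq1 first\nACGTACGTAC\nGT\n>seq2\nACGT\n")
print(ids_shorter_than(demo, segment_size=10))
```

Sequences listed by such a check are the ones that would be expected to be left out of a run with that segment-size.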

For comments and questions, write to snikumbh@mpi-inf.mpg.de
