From 9a0e9ecc52a0c61d2b6cefd5f4d9b792cd98c300 Mon Sep 17 00:00:00 2001
From: Sarvesh Prakash Nikumbh
Date: Mon, 31 Jul 2017 15:04:07 +0200
Subject: [PATCH] README updated to reflect the config-file change.

---
 README.md        | 19 ++++++++++---------
 config-comik.txt | 10 ++++++----
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 2fffd3d..4fd91fc 100644
--- a/README.md
+++ b/README.md
@@ -3,10 +3,10 @@ Manuscript authors: Sarvesh Nikumbh, Peter Ebert, Nico Pfeifer
 
 To appear in Proceedings of WABI 2017.
 
-DOI at bioRxiv: https://doi.org/10.1101/139618
+bioRxiv DOI: https://doi.org/10.1101/139618
 
 ## Requirements:
-- MATLAB 9.0.0.341360 (R2016a) [We developed _comik_ using this version of MATLAB. Compatibility with earlier versions is to be checked.]
+- MATLAB 9.0.0.341360 (R2016a) [We developed _CoMIK_ using this version of MATLAB. Compatibility with earlier versions is to be checked.]
 - Python 2.7 or higher
 - SHOGUN Release version 3.2.0
 
@@ -26,14 +26,15 @@ export PATH=".:$PATH"
 Example function call from inside Matlab:
 For simulated dataset 1 provided in the folder `sample_data`
 ```Matlab
-comik_wrapper('sample_data/simulated_dataset1/pos.fasta', 'sample_data/simulated_dataset1/neg.fasta', 600, 600, [501:600 1101:1200], 'comik_run_simulated_dataset1', [2], 10, 10, [2 5 7], 10.^[1:1:2], 10.^[-3:1:3], 2.0, 10, 5, 'No', 'Yes', 2, 'runLog.txt');
+comik_wrapper('config-comik.txt');
 ```
 
-_CoMIK_ accepts two FASTA files as input -- the first FASTA file containing sequences in the positive class followed by a second FASTA file containing the sequences in the negative class. Other params are explained below.
+See the example config file `config-comik.txt`.
+_CoMIK_ requires two FASTA files as input -- the first FASTA file containing sequences in the positive class; the second FASTA file containing the sequences in the negative class. Other params are explained below.
 
 Further details:
 
-The _CoMIK_ wrapper function takes the following arguments as input
+Values for the following parameters are required
 
 - positive FASTA filename [type: str]
 - negative FASTA filename [type: str]
@@ -42,7 +43,7 @@ The _CoMIK_ wrapper function takes the following arguments as input
 - Indices of the test sequences [type: int, given as a Matlab vector]
 - output-folder-name [type: str]
 
-The above are required while the following have default values
+The rest have default values, which are good starting points.
 
 Param name | type | default value | Additional comments
 -----------|-------|---------------|---------------------
@@ -64,7 +65,7 @@ Param name | type | default value | Additional comments
 
 The `comik_wrapper` function handles creation of and running outer cross-validation folds (as part of nested cross-validation). The supplied indices of the test sequences, or `test_indices` are then used to note the proportion of positives and negatives from the whole set that is to be treated as unseen test examples. The given set of sequences are then shuffled before splitting them into training and unseen test examples as per the specified proportions. The indices of the samples treated as unseen test examples is also written to disk per outer fold (filename: testIndices.txt).
 
-**Note 1**: In case you are interested in performing a quick, exploratory run using _comik_ on your data, kindly note that depending on the number of sequences in the collection and their lengths, if there are many very long sequences, a small segment-size may lead to very high number of segments in total thereby increasing the computation time which may be prohibitive for this initial run. Hence, for such an initial run, the following values are recommended:
+**Note 1**: In case you are interested in performing a quick, exploratory run using _CoMIK_ on your data, kindly note that depending on the number of sequences in the collection and their lengths, if there are many very long sequences, a small segment-size may lead to very high number of segments in total thereby increasing the computation time which may be prohibitive for this initial run. Hence, for such an initial run, the following values are recommended:
 
 oligomer-length: [2]
 maximum-distance: min(segment-size, 50)
@@ -73,10 +74,10 @@ segment-size: 100 or 200
 
 **Note 2**: Kindly note that oligomer-lengths of 3 or larger than 3, depending on the number and length of the sequences, and segment-size used, can lead to very high-dimensional vectors which can be memory-intensive.
 
-**Note 3**: Presently, _comik_ uses MATLAB parfor-loop to execute the outer cross-validation folds in parallel.
+**Note 3**: Presently, _CoMIK_ uses MATLAB parfor-loop to execute the outer cross-validation folds in parallel.
 
 During the run,
-* _CoMIK_ omits the sequences whose lengths are shorter than the segment-size specified from the run. It reports the number of sequences that got ommitted, their FASTA-Ids in a separate file named `ommittedFastaIds.txt` per outer fold separately.
+* _CoMIK_ omits the sequences whose lengths are shorter than the segment-size specified from the run. It reports the number of sequences that got omitted, their FASTA-Ids in a separate file named `omittedFastaIds.txt` per outer fold separately.
 * the following files are written to the disk per outer fold at any intermediate stage of the pipeline. Most of these are used by the pipeline itself in its subsequent stages.
 - Run summary file: The resultString is also written to the summary file which is characterized by the segment-size and oligomer-length. The summary file is typically named: 'runSummary_segment-sizeX_oligoLenY.txt' where X and Y are as set for the pipeline run.
 
diff --git a/config-comik.txt b/config-comik.txt
index 22f8cca..16c268e 100644
--- a/config-comik.txt
+++ b/config-comik.txt
@@ -1,6 +1,7 @@
 ## CoMIK CONFIG FILE
+## Values set for simulated dataset 1
 
-## Required Input
+## Required Input parameters
 POSITIVE_FASTA_FILE=./sample_data/simulated_dataset1/pos.fasta
 NEGATIVE_FASTA_FILE=./sample_data/simulated_dataset1/neg.fasta
 NUMBER_OF_POSITIVES=600
@@ -9,11 +10,12 @@ TEST_INDICES=[501:600 1101:1200]
 OUTPUT_FOLDER=comik_run_simulated_dataset1
 
 ## ODH requirements
-OLIGO_LEN=[2 3]
-MAX_DIST=10
+OLIGO_LEN=[2]
+#OLIGO_LEN=[2 3]
+MAX_DIST=50
 
 ## For CoMIK
-SEGMENT_SIZE_IN_BPS=10
+SEGMENT_SIZE_IN_BPS=50
 NUMBER_OF_CLUSTERS=[2 5 7]
 SIGMA_VALUES=10.^[1:1:2]
 COST_VALUES=10.^[-3:1:3]