README updated to reflect the config-file change.
snikumbh committed Jul 31, 2017
1 parent 982dce4 commit 9a0e9ec
Showing 2 changed files with 16 additions and 13 deletions.
19 changes: 10 additions & 9 deletions README.md
@@ -3,10 +3,10 @@
Manuscript authors: Sarvesh Nikumbh, Peter Ebert, Nico Pfeifer

To appear in Proceedings of WABI 2017.
-DOI at bioRxiv: https://doi.org/10.1101/139618
+bioRxiv DOI: https://doi.org/10.1101/139618

## Requirements:
-- MATLAB 9.0.0.341360 (R2016a) [We developed _comik_ using this version of MATLAB. Compatibility with earlier versions is to be checked.]
+- MATLAB 9.0.0.341360 (R2016a) [We developed _CoMIK_ using this version of MATLAB. Compatibility with earlier versions is to be checked.]
- Python 2.7 or higher
- SHOGUN Release version 3.2.0

@@ -26,14 +26,15 @@ export PATH=".:$PATH"
Example function call from inside Matlab:
For simulated dataset 1 provided in the folder `sample_data`
```Matlab
-comik_wrapper('sample_data/simulated_dataset1/pos.fasta', 'sample_data/simulated_dataset1/neg.fasta', 600, 600, [501:600 1101:1200], 'comik_run_simulated_dataset1', [2], 10, 10, [2 5 7], 10.^[1:1:2], 10.^[-3:1:3], 2.0, 10, 5, 'No', 'Yes', 2, 'runLog.txt');
+comik_wrapper('config-comik.txt');
```

-_CoMIK_ accepts two FASTA files as input -- the first FASTA file containing sequences in the positive class followed by a second FASTA file containing the sequences in the negative class. Other params are explained below.
+See the example config file `config-comik.txt`.
+_CoMIK_ requires two FASTA files as input -- the first FASTA file containing sequences in the positive class; the second FASTA file containing the sequences in the negative class. Other params are explained below.

Further details:

-The _CoMIK_ wrapper function takes the following arguments as input
+Values for the following parameters are required

- positive FASTA filename [type: str]
- negative FASTA filename [type: str]
@@ -42,7 +43,7 @@ The _CoMIK_ wrapper function takes the following arguments as input
- Indices of the test sequences [type: int, given as a Matlab vector]
- output-folder-name [type: str]

-The above are required while the following have default values
+The rest have default values, which are good starting points.

Param name | type | default value | Additional comments
-----------|-------|---------------|---------------------
@@ -64,7 +65,7 @@ Param name | type | default value | Additional comments
The `comik_wrapper` function handles creation of and running outer cross-validation folds (as part of nested cross-validation). The supplied indices of the test sequences, or `test_indices` are then used to note the proportion of positives and negatives from the whole set that is to be treated as unseen test examples. The given set of sequences are then shuffled before splitting them into training and unseen test examples as per the specified proportions. The indices of the samples treated as unseen test examples is also written to disk per outer fold (filename: testIndices.txt).
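The split described above can be sketched in Python. This is only an illustration, not the code `comik_wrapper` runs: the helper names `expand_matlab_ranges`, `held_out_proportions` and `shuffled_split` are hypothetical, and the sketch assumes that positives occupy indices 1 to NUMBER_OF_POSITIVES with negatives following, and that shuffling is done per class.

```python
import random


def expand_matlab_ranges(spec):
    """Expand a vector such as '[501:600 1101:1200]' into 1-based integers.

    Only 'a:b' ranges and single integers are handled (no 'a:step:b').
    """
    indices = []
    for token in spec.strip("[] ").split():
        if ":" in token:
            start, stop = (int(part) for part in token.split(":"))
            indices.extend(range(start, stop + 1))  # MATLAB ranges are inclusive
        else:
            indices.append(int(token))
    return indices


def held_out_proportions(test_indices, n_pos, n_neg):
    """Fraction of each class held out, assuming positives are indexed first."""
    n_test_pos = sum(1 for i in test_indices if 1 <= i <= n_pos)
    n_test_neg = sum(1 for i in test_indices if n_pos < i <= n_pos + n_neg)
    return n_test_pos / float(n_pos), n_test_neg / float(n_neg)


def shuffled_split(n_pos, n_neg, frac_pos, frac_neg, seed=0):
    """Shuffle each class, then hold out the requested fraction of each."""
    rng = random.Random(seed)
    pos = list(range(1, n_pos + 1))
    neg = list(range(n_pos + 1, n_pos + n_neg + 1))
    rng.shuffle(pos)
    rng.shuffle(neg)
    k_pos = int(round(frac_pos * n_pos))
    k_neg = int(round(frac_neg * n_neg))
    test = sorted(pos[:k_pos] + neg[:k_neg])
    train = sorted(pos[k_pos:] + neg[k_neg:])
    return train, test


# Values from the sample config for simulated dataset 1.
test_indices = expand_matlab_ranges("[501:600 1101:1200]")
frac_pos, frac_neg = held_out_proportions(test_indices, 600, 600)
train, test = shuffled_split(600, 600, frac_pos, frac_neg)
print(len(train), len(test))
```

Under these assumptions the supplied indices fix only how many sequences of each class are held out; which sequences end up in the test set depends on the shuffle, which is why the chosen indices are written to disk per fold.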


-**Note 1**: In case you are interested in performing a quick, exploratory run using _comik_ on your data, kindly note that depending on the number of sequences in the collection and their lengths, if there are many very long sequences, a small segment-size may lead to very high number of segments in total thereby increasing the computation time which may be prohibitive for this initial run. Hence, for such an initial run, the following values are recommended:
+**Note 1**: In case you are interested in performing a quick, exploratory run using _CoMIK_ on your data, kindly note that depending on the number of sequences in the collection and their lengths, if there are many very long sequences, a small segment-size may lead to very high number of segments in total thereby increasing the computation time which may be prohibitive for this initial run. Hence, for such an initial run, the following values are recommended:

oligomer-length: [2]
maximum-distance: min(segment-size, 50)
@@ -73,10 +74,10 @@ segment-size: 100 or 200

**Note 2**: Kindly note that oligomer-lengths of 3 or larger than 3, depending on the number and length of the sequences, and segment-size used, can lead to very high-dimensional vectors which can be memory-intensive.
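To get a feel for the growth Note 2 warns about, here is a rough Python sketch. The README does not state the feature dimensionality, so the formula is an assumption: it supposes an oligomer distance histogram with one bin per ordered oligomer pair and per distance from 0 to the maximum distance, over the 4-letter DNA alphabet. The function `odh_dimension` is hypothetical and not part of _CoMIK_.

```python
def odh_dimension(oligo_len, max_dist, alphabet_size=4):
    """Assumed feature count: (number of oligomers)^2 * (max_dist + 1)."""
    n_oligos = alphabet_size ** oligo_len
    return n_oligos * n_oligos * (max_dist + 1)


# Compare oligomer lengths 2, 3 and 4 at a maximum distance of 50.
for k in (2, 3, 4):
    print(k, odh_dimension(k, 50))
```

Under this assumption each extra nucleotide in the oligomer multiplies the vector length by 16, which is consistent with the memory warning for oligomer-lengths of 3 or more.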

-**Note 3**: Presently, _comik_ uses MATLAB parfor-loop to execute the outer cross-validation folds in parallel.
+**Note 3**: Presently, _CoMIK_ uses MATLAB parfor-loop to execute the outer cross-validation folds in parallel.

During the run,
-* _CoMIK_ omits the sequences whose lengths are shorter than the segment-size specified from the run. It reports the number of sequences that got ommitted, their FASTA-Ids in a separate file named `ommittedFastaIds.txt` per outer fold separately.
+* _CoMIK_ omits the sequences whose lengths are shorter than the segment-size specified from the run. It reports the number of sequences that got omitted, their FASTA-Ids in a separate file named `omittedFastaIds.txt` per outer fold separately.

* the following files are written to the disk per outer fold at any intermediate stage of the pipeline. Most of these are used by the pipeline itself in its subsequent stages.
- Run summary file: The resultString is also written to the summary file which is characterized by the segment-size and oligomer-length. The summary file is typically named: 'runSummary_segment-sizeX_oligoLenY.txt' where X and Y are as set for the pipeline run.
10 changes: 6 additions & 4 deletions config-comik.txt
@@ -1,6 +1,7 @@
## CoMIK CONFIG FILE
+## Values set for simulated dataset 1

-## Required Input
+## Required Input parameters
POSITIVE_FASTA_FILE=./sample_data/simulated_dataset1/pos.fasta
NEGATIVE_FASTA_FILE=./sample_data/simulated_dataset1/neg.fasta
NUMBER_OF_POSITIVES=600
@@ -9,11 +10,12 @@ TEST_INDICES=[501:600 1101:1200]
OUTPUT_FOLDER=comik_run_simulated_dataset1

## ODH requirements
-OLIGO_LEN=[2 3]
-MAX_DIST=10
+OLIGO_LEN=[2]
+#OLIGO_LEN=[2 3]
+MAX_DIST=50

## For CoMIK
-SEGMENT_SIZE_IN_BPS=10
+SEGMENT_SIZE_IN_BPS=50
NUMBER_OF_CLUSTERS=[2 5 7]
SIGMA_VALUES=10.^[1:1:2]
COST_VALUES=10.^[-3:1:3]
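
The config file uses plain `KEY=VALUE` lines, with `#` lines serving as comments or disabled settings. A minimal Python sketch of reading that layout follows. It is an illustration only and not how `comik_wrapper` parses the file; `parse_config` is a hypothetical helper, and values are kept as raw strings because they are MATLAB expressions.

```python
def parse_config(text):
    """Collect KEY=VALUE pairs, skipping blank lines and '#' comment lines."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected KEY=VALUE, got: %r" % line)
        params[key.strip()] = value.strip()
    return params


# Excerpt of the new config-comik.txt shown in this commit.
SAMPLE = """\
## ODH requirements
OLIGO_LEN=[2]
#OLIGO_LEN=[2 3]
MAX_DIST=50

## For CoMIK
SEGMENT_SIZE_IN_BPS=50
NUMBER_OF_CLUSTERS=[2 5 7]
"""

config = parse_config(SAMPLE)
print(config["OLIGO_LEN"])
```

Because the commented-out `#OLIGO_LEN=[2 3]` line is skipped, only the active `OLIGO_LEN=[2]` setting is picked up.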
