Skip to content

Commit

Permalink
major edit: name changed to comik. Small edits: various files kernel …
Browse files Browse the repository at this point in the history
…dimension stored in a separate variable name.
  • Loading branch information
snikumbh committed Jul 19, 2017
1 parent 9f9278e commit c9f8192
Show file tree
Hide file tree
Showing 12 changed files with 149 additions and 461 deletions.
32 changes: 16 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,22 +1,22 @@
# CoMIKL : Conformal Multi-Instance Kernel Learning
# CoMIK : Conformal Multi-Instance Kernels

Manuscript authors: Sarvesh Nikumbh, Peter Ebert, Nico Pfeifer

Manuscript under preparation. It can be accessed at the department's SVN repository [link]
To appear in Proceedings of WABI 2017.
Preprint at bioRxiv:

## Requirements:
- MATLAB 9.0.0.341360 (R2016a) [We developed _CoMIKL_ using this version of MATLAB. Compatibility with earlier versions is to be checked.]
- MATLAB 9.0.0.341360 (R2016a) [We developed _comik_ using this version of MATLAB. Compatibility with earlier versions is to be checked.]
- Python 2.7 or higher
- SHOGUN Release version 3.2.0


This repository contains Matlab code for the project *CoMIKL*. The various ".m" files define corresponding Matlab functions.
This repository contains Matlab code for the project *CoMIK*. The various ".m" files define corresponding Matlab functions.
We use MKL implementation from Shogun's modular interface for Python. The Python script mkl.py handles solving of the MKL problem.

## Installation:
```MATLAB
git clone https://github.molgen.mpg.de/snikumbh/comikl
cd comikl
git clone https://github.molgen.mpg.de/snikumbh/comik
cd comik
chmod +x mkl.py
export PATH=".:$PATH"
```
Expand All @@ -25,14 +25,14 @@ export PATH=".:$PATH"
Example function call from inside Matlab:
For simulated dataset 1 provided in the folder `sample_data`
```Matlab
comikl_wrapper('sample_data/simulated_dataset1/pos.fasta', 'sample_data/simulated_dataset1/neg.fa', 600, 600, [501:600 1101:1200], 'comikl_run_simulated_dataset1', [2], 10, 10, [2 5 7], 10.^[1:1:2], 10.^[-3:1:3], 2.0, 10, 5, 'No', 'Yes', 2, 'runLog.txt');
comik_wrapper('sample_data/simulated_dataset1/pos.fasta', 'sample_data/simulated_dataset1/neg.fa', 600, 600, [501:600 1101:1200], 'comik_run_simulated_dataset1', [2], 10, 10, [2 5 7], 10.^[1:1:2], 10.^[-3:1:3], 2.0, 10, 5, 'No', 'Yes', 2, 'runLog.txt');
```

_CoMIKL_ accepts two FASTA files as input -- the first FASTA file containing sequences in the positive class followed by a second FASTA file containing the sequences in the negative class. Other params are explained below.
_comik_ accepts two FASTA files as input -- the first FASTA file containing sequences in the positive class followed by a second FASTA file containing the sequences in the negative class. Other params are explained below.

Further details:

The CoMIKL wrapper function takes the following arguments as input
The comik wrapper function takes the following arguments as input

- positive FASTA filename [type: str]
- negative FASTA filename [type: str]
Expand All @@ -45,7 +45,7 @@ The above are required while the following have default values

Param name | type | default value | Additional comments
-----------|-------|---------------|---------------------
oligomer-lengths-as-vector | int | [2] | Required for the _ODH_ representation. Recommended values: 2 or 3 (suffices). Passing a vector [2 3] will run _CoMIKL_ first with oligomer-length 2 followed by an independent second run with oligomer-length 3. Further, see **Note 2** below
oligomer-lengths-as-vector | int | [2] | Required for the _ODH_ representation. Recommended values: 2 or 3 (suffices). Passing a vector [2 3] will run _comik_ first with oligomer-length 2 followed by an independent second run with oligomer-length 3. Further, see **Note 2** below
maximum-distance | int | 50 | Required for the _ODH_ representation. Typically, a maximum distance of 100 basepairs suffices even if the segment-size is larger
segment-size | int | 100 | See **Note 1** below
number-of-clusters | vector of ints | [2 5] | Recommended maximum number of clusters: 7
Expand All @@ -55,14 +55,14 @@ Param name | type | default value | Additional comments
number-of-inner-folds | int | 10 |
number-of-outer-folds | int | 5 |
whetherToPlotHeatmap | str | 'No' | Possible values: 'Yes' or 'No'; Set 'Yes' only if all sequences are of the same length.
debugLevel | int | 2 | Possible values: [0, 1, 2]. Value 0 makes _CoMIKL_ completely silent, and 2 makes it maximally verbose. Value 1 may be used in the future.
debugLevel | int | 2 | Possible values: [0, 1, 2]. Value 0 makes _comik_ completely silent, and 2 makes it maximally verbose. Value 1 may be used in the future.
debugMsgLocation | int/str | 1 | Value 1 denotes the command prompt, else specify a filename, say 'runLog.txt'. This file is written for each outer fold separately.
computationVersion | str | 'Looping'| Possible values: 'Looping' or 'AccumArray'. 'Looping' is faster and preferred/recommended for large datasets when 'AccumArray' can be memory intensive.

The `comikl_wrapper` function handles creation of and running outer cross-validation folds (as part of nested cross-validation). The supplied indices of the test sequences, or `test_indices` are then used to note the proportion of positives and negatives from the whole set that is to be treated as unseen test examples. The given set of sequences are then shuffled before splitting them into training and unseen test examples as per the specified proportions. The indices of the samples treated as unseen test examples is also written to disk per outer fold (filename: testIndices.txt).
The `comik_wrapper` function handles creation of and running outer cross-validation folds (as part of nested cross-validation). The supplied indices of the test sequences, or `test_indices` are then used to note the proportion of positives and negatives from the whole set that is to be treated as unseen test examples. The given set of sequences are then shuffled before splitting them into training and unseen test examples as per the specified proportions. The indices of the samples treated as unseen test examples is also written to disk per outer fold (filename: testIndices.txt).


**Note 1**: In case you are interested in performing a quick, exploratory run using _CoMIKL_ on your data, kindly note that depending on the number of sequences in the collection and their lengths, if there are many very long sequences, a small segment-size may lead to very high number of segments in total thereby increasing the computation time which may be prohibitive for this initial run. Hence, for such an initial run, the following values are recommended:
**Note 1**: In case you are interested in performing a quick, exploratory run using _comik_ on your data, kindly note that depending on the number of sequences in the collection and their lengths, if there are many very long sequences, a small segment-size may lead to very high number of segments in total thereby increasing the computation time which may be prohibitive for this initial run. Hence, for such an initial run, the following values are recommended:

oligomer-length: [2]
maximum-distance: min(segment-size, 50)
Expand All @@ -71,10 +71,10 @@ segment-size: 100 or 200

**Note 2**: Kindly note that oligomer-lengths of 3 or larger than 3, depending on the number and length of the sequences, and segment-size used, can lead to very high-dimensional vectors which can be memory-intensive.

**Note 3**: Presently, _CoMIKL_ uses MATLAB parfor-loop to execute the outer cross-validation folds in parallel.
**Note 3**: Presently, _comik_ uses MATLAB parfor-loop to execute the outer cross-validation folds in parallel.

During the run,
* _CoMIKL_ omits the sequences whose lengths are shorter than the segment-size specified from the run. It reports the number of sequences that got ommitted, their FASTA-Ids in a separate file named `ommittedFastaIds.txt` per outer fold separately.
* _comik_ omits the sequences whose lengths are shorter than the segment-size specified from the run. It reports the number of sequences that got ommitted, their FASTA-Ids in a separate file named `ommittedFastaIds.txt` per outer fold separately.

* the following files are written to the disk per outer fold at any intermediate stage of the pipeline. Most of these are used by the pipeline itself in its subsequent stages.
- Run summary file: The resultString is also written to the summary file which is characterized by the segment-size and oligomer-length. The summary file is typically named: 'runSummary_segment-sizeX_oligoLenY.txt' where X and Y are as set for the pipeline run.
Expand Down
2 changes: 1 addition & 1 deletion analyze_wvector.m
Original file line number Diff line number Diff line change
Expand Up @@ -310,7 +310,7 @@
%set(new_ax, 'Ydir', 'reverse');

pdf_fname = strcat(folderName, '/SeqLogos_MinMax_at_distances_rank', num2str(rankId), '_oligoLen', num2str(oligoLen));
print(pdf_fname,'-dpdf');
print(pdf_fname,'-dpdf', '-r900', '-bestfit');
%print(pdf_fname, '-depsc');
pdfcrop_lines = [pdfcrop_lines; sprintf('pdfcrop %s\n', pdf_fname)];
grid off;
Expand Down
52 changes: 52 additions & 0 deletions barplot-heatmap-visualization.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@


#rscaled <- scale(r, center = FALSE, scale = colSums(r,na.rm=T))
#may not need re-scaling
pdf('samarth_viz_in_r.pdf', width=10.0, height=6.0)
flds <- count.fields('samarth_test1.csv',sep=',')
s <- read.csv('samarth_test1.csv', fill=TRUE, nrows=length(flds), header=F, sep=",", col.names=paste0("Seg-", seq(1,max(flds),by=1)))
r <- s[1:25,]
r_ones <- r
r_ones[!is.na(r)] <- 1
#ncolors <- max(r[complete.cases(r),])
#or
ncolors <- max(flds)
#color_palette <- rainbow(ncolors)
color_palette <- heat.colors(ncolors)

jet.colors <-
colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan",
"#7FFF7F", "yellow", "#FF7F00"))#, "red", "#7F0000"))
#color_palette <- jet.colors(ncolors)

ypos <- apply(r, 2, cumsum)
ypos <- ypos - r/2

#xpos <- barplot(t(r_ones), col=rainbow(ncolors), axes=F, horiz=T)
xpos <- barplot(t(r_ones), col=c("white"), axes=F, horiz=T, axisnames=F)
cp <- 0
for (i in 1:ncol(r_ones)){
temp <- color_palette[r[seq(1,nrow(r),by=1), i]]
cp <- cbind(cp, temp)
#barplot(t(r_ones[]), col=color_palette[], add=T, axes=F)
}
new_cp <- cp[,-1]

for (i in 1:nrow(r_ones)){
#xx = r_ones
#xx[,-i] <- NA
#colnames(xx)[-i] <- NA
barplot(t(r_ones[i,]), col=new_cp[i,], horiz=T, add=T, space=(i-1)*1.2, axes=F, xlab="", axisnames=F)
}

axis(1,at=0.5+seq(0,ncolors-1,1), labels=seq(1,ncolors,1))
#Additional axis showing the numcleotide ranges
#axis(1,at=seq(0,ncolors-1,1), labels=seq(1,ncolors,1))
legend("topright", legend = seq(1,ncolors), fill = color_palette, title="Segment ranks", ncol = 2, x.intersp = 0.1, bty="n", cex=0.8)
#pie(rep(1,ncolors), col=color_palette, labels=paste0("Rank ", seq(1,ncolors)), add=T)

dev.off()
#text(rep(xpos,5), as.vector(t(ypos)), labels=t(r)) # ??



Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
function [allSeqsAsBags, allSeqsConformedSetKernel, subkernelWeights, thetaVals, instanceWeightsInEachBag, resultString, bestParamComb, test_teAUROC, test_teAUPRC, predictions] = comikl_main_with_weight_vector(givenPosFastaFilename, givenNegFastaFilename, givenNPos, givenNNeg, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, debugMsgLocation, outputFolder, runSummaryFilename, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector)
function [allSeqsAsBags, allSeqsConformedSetKernel, subkernelWeights, thetaVals, instanceWeightsInEachBag, resultString, bestParamComb, test_teAUROC, test_teAUPRC, predictions] = comik_main_with_weight_vector(givenPosFastaFilename, givenNegFastaFilename, givenNPos, givenNNeg, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, debugMsgLocation, outputFolder, runSummaryFilename, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector)

% Author: snikumbh
%
Expand Down Expand Up @@ -93,11 +93,11 @@
thisInstances = [];

for k=1:nBags
if oligoLen == 3
%if oligoLen == 3
if rem(k,100) == 0
logMessages(debugMsgLocation,sprintf('%d--', k), debugLevel);
end
end
%end
if any(k == testIndices)
% do-nothing now, we collect them later
% Collection of test sequences can take place later,
Expand Down Expand Up @@ -328,7 +328,7 @@
logMessages(debugMsgLocation, sprintf('---Prediction using weight vector---\n'), debugLevel);
% 6.1 Use the subkernel weights and get transformed instances among the test examples, may be use applyTransformation.m
% instanceWeights for instances in training bags
[thetaVals, instanceWeightsInEachBag] = getInstanceWeights(subkernelWeights, allSeqsTransformationKernel);
[thetaVals, instanceWeightsInEachBag] = getInstanceWeights(subkernelWeights, allSeqsTransformationKernel, 'samarth_train1.csv');
% Get the transformations for test examples, and the diagonal elements which will be used for normalization further
% Should one proceed block-wise at this juncture if the number of test examples is extremely large as is typically the case?
%
Expand All @@ -338,7 +338,7 @@
logMessages(debugMsgLocation, sprintf('done in %.3f seconds\n', toc), debugLevel);
%
% Getting the instance weightings for the test examples
[thetaVals, instanceWeightsInEachBag_Test] = getInstanceWeights(subkernelWeights, allSeqsTransformationKernel_Test);
[thetaVals, instanceWeightsInEachBag_Test] = getInstanceWeights(subkernelWeights, allSeqsTransformationKernel_Test, 'samarth_test1.csv');
% At this point, all (train + test examples) instance weights have been obtained.
%
% 6.2 Use function obtainWeightVector and procure the weight vectors and the bias value
Expand Down Expand Up @@ -387,6 +387,16 @@
end
%
%
% Write the instanceWeightInEachBag to file
%dlmwrite('samarth_test.csv', instanceWeightsInEachBag, '-append');
%disp(instanceWeightsInEachBag{1});
%disp(size(instanceWeightsInEachBag));
%disp(size(instanceWeightsInEachBag{1}));
%disp(size(instanceWeightsInEachBag{2}));
%for i=1:size(instanceWeightsInEachBag,2)
% dlmwrite('samarth_test.csv', cell2mat(instanceWeightsInEachBag{i})', '-append');
%end

if strcmp(whetherToPlotHeatmap, 'Yes')
logMessages(debugMsgLocation,sprintf('Plotting heatmaps...'), debugLevel);
if(predictUsingWeightVector == 1)
Expand All @@ -397,7 +407,7 @@
%
heatmapFname = strcat(outputFolder,'/Heatmap-InstanceWeights', '_oligoLen', num2str(oligoLen), '_segmentSize', num2str(chunkSizeInBps) , '_maxDist', num2str(maxDist), '_nClusters', num2str(bestParamComb.best_nClusters), '_svmC', num2str(bestParamComb.best_C) ,'_sigma', num2str(bestParamComb.best_sigma), '.pdf');
%
p = plot_heatmap(normInstanceWeightsInEachBag, heatmapFname, 1, 2);
p = plot_heatmap(normInstanceWeightsInEachBag, heatmapFname, 0, 1);
logMessages(debugMsgLocation,sprintf('done!\n'), debugLevel);
%
end
Expand Down
17 changes: 9 additions & 8 deletions comikl_wrapper.m → comik_wrapper.m
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
function [] = comikl_wrapper(givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, testIndices, outputFolder, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, nOuterFolds, whetherToPlotHeatmap, whetherToVisualizeWVector, debugLevel, debugMsgLocation, computationVersion)
function [] = comik_wrapper(givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, testIndices, outputFolder, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, nOuterFolds, whetherToPlotHeatmap, whetherToVisualizeWVector, debugLevel, debugMsgLocation, computationVersion)

% COMIKL_WRAPPER
% Usage:
% comikl_wrapper(givenFastaFilename, nPosSequences, nNegSequences, ...
% comik_wrapper(givenFastaFilename, nPosSequences, nNegSequences, ...
% outputFolder, oligoLen, maxDist, chunkSizeInBps, nClusterVals, ...
% sigmaVals, Cs, mklNorm, nFolds, nOuterFolds, whetherToPlotHeatmap, ...
% computationVersion, whetherToVisualizeWVector)
Expand Down Expand Up @@ -306,8 +306,8 @@ case char('distal-proximal-cgi')
posTestIndices = intersect(testIndices, posIndices);
negTestIndices = setdiff(testIndices, posIndices) - nPosSequences;

%thisTestIndices{1} = testIndices;
for i=1:nOuterFolds
thisTestIndices{1} = testIndices;
for i=2:nOuterFolds
thisPosTestIndices = circshift(posIndices, [0, (i-1)*length(posTestIndices)]);
thisNegTestIndices = circshift(negIndices, [0, (i-1)*length(negTestIndices)]) + length(posIndices);
thisTestIndices{i} = [thisPosTestIndices(end-(length(posTestIndices))+1:end) thisNegTestIndices(end-(length(negTestIndices))+1:end)];
Expand All @@ -327,7 +327,7 @@ case char('distal-proximal-cgi')
end
parfor (i=1:nOuterFolds, nOuterFolds)
thisFoldDirectory = strcat(outputFolder, '/outer_fold_', num2str(i));
returnStatusVal = perform_one_outer_fold_for_comikl(i, givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen, ...
returnStatusVal = perform_one_outer_fold_for_comik(i, givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen, ...
maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, ...
thisTestIndices{i}, debugLevel, thisFoldDirectory, whetherToPlotHeatmap, ...
computationVersion, whetherToVisualizeWVector, debugMsgLocation);
Expand All @@ -336,9 +336,9 @@ case char('distal-proximal-cgi')
logMessages(1, sprintf('All outer folds completed!\n'), debugLevel);


end % comikl_wrapper function ends
end % comik_wrapper function ends

function statusVal = perform_one_outer_fold_for_comikl(outerFoldID, givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, outputFolder, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector, debugMsgLocation)
function statusVal = perform_one_outer_fold_for_comik(outerFoldID, givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, outputFolder, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector, debugMsgLocation)
% Runs one outer fold for CoMIKL

% Make the outputFolder for this outer fold
Expand Down Expand Up @@ -414,11 +414,12 @@ case char('distal-proximal-cgi')
fclose(fid);
%
%
[allSeqsAsBags, allSeqsConformedSetKernel, kernelweights, thetaVals, instanceWeightsInEachBag, resultString, bestVals, test_teAUROC, test_teAUPRC, predictions] = comikl_main_with_weight_vector(givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen(l), maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, debugMsgLocation, outputFolder, runSummaryFilename, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector);
[allSeqsAsBags, allSeqsConformedSetKernel, kernelweights, thetaVals, instanceWeightsInEachBag, resultString, bestVals, test_teAUROC, test_teAUPRC, predictions] = comik_main_with_weight_vector(givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen(l), maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, debugMsgLocation, outputFolder, runSummaryFilename, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector);
fid=fopen(runSummaryFilename, 'a');
fprintf(fid, '\n\n');
fprintf(fid, '=====Validation results - SAMARTH=====\n');
fprintf(fid, 'OligoLen: %d, maxDist: %d, segment size: %d\n', oligoLen(l), maxDist, chunkSizeInBps);
fprintf(fid, 'Validation teAUPRC: %.4f\n', test_teAUPRC);
fprintf(fid, 'Validation teAUROC: %.4f\n', test_teAUROC);
fprintf(fid, 'Best C: %.3f\n', bestVals.best_C);
fprintf(fid, 'Best sigma: %.3f\n', bestVals.best_sigma);
Expand Down
Loading

0 comments on commit c9f8192

Please sign in to comment.