major edit: name changed to comik. Small edits: various files kernel …

…dimension stored in a separate variable name.
snikumbh · Jul 19, 2017 · c9f8192 · c9f8192
1 parent 9f9278e
commit c9f8192
Show file tree

Hide file tree

Showing 12 changed files with 149 additions and 461 deletions.
diff --git a/README.md b/README.md
@@ -1,22 +1,22 @@
-# CoMIKL : Conformal Multi-Instance Kernel Learning
+# CoMIK : Conformal Multi-Instance Kernels
 
 Manuscript authors: Sarvesh Nikumbh, Peter Ebert, Nico Pfeifer
-
-Manuscript under preparation. It can be accessed at the department's SVN repository [link]
+To appear in Proceedings of WABI 2017.
+Preprint at bioRxiv: 
 
 ## Requirements:
-- MATLAB 9.0.0.341360 (R2016a) [We developed _CoMIKL_ using this version of MATLAB. Compatibility with earlier versions is to be checked.]
+- MATLAB 9.0.0.341360 (R2016a) [We developed _comik_ using this version of MATLAB. Compatibility with earlier versions is to be checked.]
 - Python 2.7 or higher
 - SHOGUN Release version 3.2.0
 
 
-This repository contains Matlab code for the project *CoMIKL*. The various ".m" files define corresponding Matlab functions.
+This repository contains Matlab code for the project *CoMIK*. The various ".m" files define corresponding Matlab functions.
 We use MKL implementation from Shogun's modular interface for Python. The Python script mkl.py handles solving of the MKL problem.
 
 ## Installation:
 ```MATLAB
-git clone https://github.molgen.mpg.de/snikumbh/comikl
-cd comikl
+git clone https://github.molgen.mpg.de/snikumbh/comik
+cd comik
 chmod +x mkl.py
 export PATH=".:$PATH"
 ```
@@ -25,14 +25,14 @@ export PATH=".:$PATH"
 Example function call from inside Matlab:
 For simulated dataset 1 provided in the folder `sample_data`
 ```Matlab
-comikl_wrapper('sample_data/simulated_dataset1/pos.fasta', 'sample_data/simulated_dataset1/neg.fa', 600, 600, [501:600 1101:1200], 'comikl_run_simulated_dataset1', [2], 10, 10, [2 5 7], 10.^[1:1:2], 10.^[-3:1:3], 2.0, 10, 5, 'No', 'Yes', 2, 'runLog.txt');
+comik_wrapper('sample_data/simulated_dataset1/pos.fasta', 'sample_data/simulated_dataset1/neg.fa', 600, 600, [501:600 1101:1200], 'comik_run_simulated_dataset1', [2], 10, 10, [2 5 7], 10.^[1:1:2], 10.^[-3:1:3], 2.0, 10, 5, 'No', 'Yes', 2, 'runLog.txt');
 ```
 
-_CoMIKL_ accepts two FASTA files as input -- the first FASTA file containing sequences in the positive class followed by a second FASTA file containing the sequences in the negative class. Other params are explained below.
+_comik_ accepts two FASTA files as input -- the first FASTA file containing sequences in the positive class followed by a second FASTA file containing the sequences in the negative class. Other params are explained below.
 
 Further details:
 
-The CoMIKL wrapper function takes the following arguments as input
+The comik wrapper function takes the following arguments as input
 
 - positive FASTA filename [type: str]
 - negative FASTA filename [type: str]
@@ -45,7 +45,7 @@ The above are required while the following have default values
 
 Param name | type  | default value | Additional comments
 -----------|-------|---------------|---------------------
- oligomer-lengths-as-vector | int | [2] | Required for the _ODH_ representation. Recommended values: 2 or 3 (suffices). Passing a vector [2 3] will run _CoMIKL_ first with oligomer-length 2 followed by an independent second run with oligomer-length 3. Further, see **Note 2** below
+ oligomer-lengths-as-vector | int | [2] | Required for the _ODH_ representation. Recommended values: 2 or 3 (suffices). Passing a vector [2 3] will run _comik_ first with oligomer-length 2 followed by an independent second run with oligomer-length 3. Further, see **Note 2** below
  maximum-distance | int | 50 | Required for the _ODH_ representation. Typically, a maximum distance of 100 basepairs suffices even if the segment-size is larger
  segment-size | int | 100 | See **Note 1** below
  number-of-clusters | vector of ints | [2 5] | Recommended maximum number of clusters: 7
@@ -55,14 +55,14 @@ Param name | type  | default value | Additional comments
  number-of-inner-folds | int | 10 |  
  number-of-outer-folds | int | 5 |  
  whetherToPlotHeatmap | str | 'No' | Possible values: 'Yes' or 'No'; Set 'Yes' only if all sequences are of the same length.  
- debugLevel | int | 2 | Possible values: [0, 1, 2]. Value 0 makes _CoMIKL_ completely silent, and 2 makes it maximally verbose. Value 1 may be used in the future.   
+ debugLevel | int | 2 | Possible values: [0, 1, 2]. Value 0 makes _comik_ completely silent, and 2 makes it maximally verbose. Value 1 may be used in the future.   
  debugMsgLocation | int/str | 1 | Value 1 denotes the command prompt, else specify a filename, say 'runLog.txt'. This file is written for each outer fold separately.  
  computationVersion | str | 'Looping'| Possible values: 'Looping' or 'AccumArray'. 'Looping' is faster and preferred/recommended for large datasets when 'AccumArray' can be memory intensive.  
 
-The `comikl_wrapper` function handles creation of and running outer cross-validation folds (as part of nested cross-validation). The supplied indices of the test sequences, or `test_indices` are then used to note the proportion of positives and negatives from the whole set that is to be treated as unseen test examples. The given set of sequences are then shuffled before splitting them into training and unseen test examples as per the specified proportions. The indices of the samples treated as unseen test examples is also written to disk per outer fold (filename: testIndices.txt).
+The `comik_wrapper` function handles creation of and running outer cross-validation folds (as part of nested cross-validation). The supplied indices of the test sequences, or `test_indices` are then used to note the proportion of positives and negatives from the whole set that is to be treated as unseen test examples. The given set of sequences are then shuffled before splitting them into training and unseen test examples as per the specified proportions. The indices of the samples treated as unseen test examples is also written to disk per outer fold (filename: testIndices.txt).
 
 
-**Note 1**: In case you are interested in performing a quick, exploratory run using _CoMIKL_ on your data, kindly note that depending on the number of sequences in the collection and their lengths, if there are many very long sequences, a small segment-size may lead to very high number of segments in total thereby increasing the computation time which may be prohibitive for this initial run. Hence, for such an initial run, the following values are recommended:
+**Note 1**: In case you are interested in performing a quick, exploratory run using _comik_ on your data, kindly note that depending on the number of sequences in the collection and their lengths, if there are many very long sequences, a small segment-size may lead to very high number of segments in total thereby increasing the computation time which may be prohibitive for this initial run. Hence, for such an initial run, the following values are recommended:
 
 oligomer-length: [2]  
 maximum-distance: min(segment-size, 50)  
@@ -71,10 +71,10 @@ segment-size: 100 or 200
 
 **Note 2**: Kindly note that oligomer-lengths of 3 or larger than 3, depending on the number and length of the sequences, and segment-size used, can lead to very high-dimensional vectors which can be memory-intensive.
 
-**Note 3**: Presently, _CoMIKL_ uses MATLAB parfor-loop to execute the outer cross-validation folds in parallel.
+**Note 3**: Presently, _comik_ uses MATLAB parfor-loop to execute the outer cross-validation folds in parallel.
 
 During the run, 
-* _CoMIKL_ omits the sequences whose lengths are shorter than the segment-size specified from the run. It reports the number of sequences that got ommitted, their FASTA-Ids in a separate file named `ommittedFastaIds.txt` per outer fold separately.
+* _comik_ omits the sequences whose lengths are shorter than the segment-size specified from the run. It reports the number of sequences that got ommitted, their FASTA-Ids in a separate file named `ommittedFastaIds.txt` per outer fold separately.
 
 * the following files are written to the disk per outer fold at any intermediate stage of the pipeline. Most of these are used by the pipeline itself in its subsequent stages.  
     - Run summary file: The resultString is also written to the summary file which is characterized by the segment-size and oligomer-length. The summary file is typically named: 'runSummary_segment-sizeX_oligoLenY.txt' where X and Y are as set for the pipeline run.

diff --git a/analyze_wvector.m b/analyze_wvector.m
@@ -310,7 +310,7 @@
 	%set(new_ax, 'Ydir', 'reverse');
 
 	pdf_fname = strcat(folderName, '/SeqLogos_MinMax_at_distances_rank', num2str(rankId), '_oligoLen', num2str(oligoLen));
-	print(pdf_fname,'-dpdf');
+	print(pdf_fname,'-dpdf', '-r900', '-bestfit');
 	%print(pdf_fname, '-depsc');
 	pdfcrop_lines = [pdfcrop_lines; sprintf('pdfcrop %s\n', pdf_fname)];
 	grid off;

diff --git a/barplot-heatmap-visualization.R b/barplot-heatmap-visualization.R
@@ -0,0 +1,52 @@
+
+
+#rscaled <- scale(r, center = FALSE, scale = colSums(r,na.rm=T))
+#may not need re-scaling
+pdf('samarth_viz_in_r.pdf', width=10.0, height=6.0)
+flds <- count.fields('samarth_test1.csv',sep=',')
+s <- read.csv('samarth_test1.csv', fill=TRUE, nrows=length(flds), header=F, sep=",", col.names=paste0("Seg-", seq(1,max(flds),by=1)))
+r <- s[1:25,]
+r_ones <- r
+r_ones[!is.na(r)] <- 1
+#ncolors <- max(r[complete.cases(r),])
+#or 
+ncolors <- max(flds)
+#color_palette <- rainbow(ncolors)
+color_palette <- heat.colors(ncolors)
+
+jet.colors <-
+  colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan",
+                     "#7FFF7F", "yellow", "#FF7F00"))#, "red", "#7F0000"))
+#color_palette <- jet.colors(ncolors)
+
+ypos <- apply(r, 2, cumsum)
+ypos <- ypos - r/2
+
+#xpos <- barplot(t(r_ones), col=rainbow(ncolors), axes=F, horiz=T)
+xpos <- barplot(t(r_ones), col=c("white"), axes=F, horiz=T, axisnames=F)
+cp <- 0
+for (i in 1:ncol(r_ones)){
+    temp <- color_palette[r[seq(1,nrow(r),by=1), i]]
+    cp <- cbind(cp, temp)
+    #barplot(t(r_ones[]), col=color_palette[], add=T, axes=F) 
+}
+new_cp <- cp[,-1]
+
+for (i in 1:nrow(r_ones)){
+    #xx = r_ones
+    #xx[,-i] <- NA
+    #colnames(xx)[-i] <- NA
+    barplot(t(r_ones[i,]), col=new_cp[i,], horiz=T, add=T, space=(i-1)*1.2, axes=F, xlab="", axisnames=F)
+}
+
+axis(1,at=0.5+seq(0,ncolors-1,1), labels=seq(1,ncolors,1))
+#Additional axis showing the numcleotide ranges
+#axis(1,at=seq(0,ncolors-1,1), labels=seq(1,ncolors,1))
+legend("topright", legend = seq(1,ncolors), fill = color_palette, title="Segment ranks", ncol = 2, x.intersp = 0.1, bty="n", cex=0.8)
+#pie(rep(1,ncolors), col=color_palette, labels=paste0("Rank ", seq(1,ncolors)), add=T)
+
+dev.off()
+#text(rep(xpos,5), as.vector(t(ypos)), labels=t(r)) # ?? 
+
+
+
diff --git a/comikl_main_with_weight_vector.m → comik_main_with_weight_vector.m b/comikl_main_with_weight_vector.m → comik_main_with_weight_vector.m
@@ -1,4 +1,4 @@
-function [allSeqsAsBags, allSeqsConformedSetKernel, subkernelWeights, thetaVals, instanceWeightsInEachBag, resultString, bestParamComb, test_teAUROC, test_teAUPRC, predictions] = comikl_main_with_weight_vector(givenPosFastaFilename, givenNegFastaFilename, givenNPos, givenNNeg, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, debugMsgLocation, outputFolder, runSummaryFilename, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector)
+function [allSeqsAsBags, allSeqsConformedSetKernel, subkernelWeights, thetaVals, instanceWeightsInEachBag, resultString, bestParamComb, test_teAUROC, test_teAUPRC, predictions] = comik_main_with_weight_vector(givenPosFastaFilename, givenNegFastaFilename, givenNPos, givenNNeg, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, debugMsgLocation, outputFolder, runSummaryFilename, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector)
 
 % Author: snikumbh 
 %
@@ -93,11 +93,11 @@
 thisInstances = [];
 
 for k=1:nBags
-    if oligoLen == 3
+    %if oligoLen == 3
 	if rem(k,100) == 0
             logMessages(debugMsgLocation,sprintf('%d--', k), debugLevel);
 	end
-    end
+    %end
     if any(k == testIndices)
 	% do-nothing now, we collect them later
 	% Collection of test sequences can take place later, 
@@ -328,7 +328,7 @@
     logMessages(debugMsgLocation, sprintf('---Prediction using weight vector---\n'), debugLevel);
     % 6.1 Use the subkernel weights and get transformed instances among the test examples, may be use applyTransformation.m
     % instanceWeights for instances in training bags
-    [thetaVals, instanceWeightsInEachBag] = getInstanceWeights(subkernelWeights, allSeqsTransformationKernel);
+    [thetaVals, instanceWeightsInEachBag] = getInstanceWeights(subkernelWeights, allSeqsTransformationKernel, 'samarth_train1.csv');
     % Get the transformations for test examples, and the diagonal elements which will be used for normalization further
     % Should one proceed block-wise at this juncture if the number of test examples is extremely large as is typically the case?
     %
@@ -338,7 +338,7 @@
     logMessages(debugMsgLocation, sprintf('done in %.3f seconds\n', toc), debugLevel);
     %
     % Getting the instance weightings for the test examples
-    [thetaVals, instanceWeightsInEachBag_Test] = getInstanceWeights(subkernelWeights, allSeqsTransformationKernel_Test);
+    [thetaVals, instanceWeightsInEachBag_Test] = getInstanceWeights(subkernelWeights, allSeqsTransformationKernel_Test, 'samarth_test1.csv');
     % At this point, all (train + test examples) instance weights have been obtained.
     % 
     % 6.2 Use function obtainWeightVector and procure the weight vectors and the bias value
@@ -387,6 +387,16 @@
 end
 %
 %
+% Write the instanceWeightInEachBag to file
+%dlmwrite('samarth_test.csv', instanceWeightsInEachBag, '-append');
+%disp(instanceWeightsInEachBag{1});
+%disp(size(instanceWeightsInEachBag));
+%disp(size(instanceWeightsInEachBag{1}));
+%disp(size(instanceWeightsInEachBag{2}));
+%for i=1:size(instanceWeightsInEachBag,2)
+%    dlmwrite('samarth_test.csv', cell2mat(instanceWeightsInEachBag{i})', '-append');
+%end
+
 if strcmp(whetherToPlotHeatmap, 'Yes')
     logMessages(debugMsgLocation,sprintf('Plotting heatmaps...'), debugLevel);
     if(predictUsingWeightVector == 1)
@@ -397,7 +407,7 @@
     %
     heatmapFname = strcat(outputFolder,'/Heatmap-InstanceWeights', '_oligoLen', num2str(oligoLen), '_segmentSize', num2str(chunkSizeInBps) , '_maxDist', num2str(maxDist), '_nClusters', num2str(bestParamComb.best_nClusters), '_svmC', num2str(bestParamComb.best_C) ,'_sigma', num2str(bestParamComb.best_sigma), '.pdf');
     %
-    p = plot_heatmap(normInstanceWeightsInEachBag, heatmapFname, 1, 2);
+    p = plot_heatmap(normInstanceWeightsInEachBag, heatmapFname, 0, 1);
     logMessages(debugMsgLocation,sprintf('done!\n'), debugLevel);
     %
 end

diff --git a/comikl_wrapper.m → comik_wrapper.m b/comikl_wrapper.m → comik_wrapper.m
@@ -1,8 +1,8 @@
-function [] = comikl_wrapper(givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, testIndices, outputFolder, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, nOuterFolds, whetherToPlotHeatmap, whetherToVisualizeWVector, debugLevel, debugMsgLocation, computationVersion)
+function [] = comik_wrapper(givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, testIndices, outputFolder, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, nOuterFolds, whetherToPlotHeatmap, whetherToVisualizeWVector, debugLevel, debugMsgLocation, computationVersion)
 
 % COMIKL_WRAPPER
 % Usage:
-%     comikl_wrapper(givenFastaFilename, nPosSequences, nNegSequences, ...
+%     comik_wrapper(givenFastaFilename, nPosSequences, nNegSequences, ...
 %     outputFolder, oligoLen, maxDist, chunkSizeInBps, nClusterVals, ...
 %     sigmaVals, Cs, mklNorm, nFolds, nOuterFolds, whetherToPlotHeatmap, ...
 %     computationVersion, whetherToVisualizeWVector)
@@ -306,8 +306,8 @@ case char('distal-proximal-cgi')
 posTestIndices = intersect(testIndices, posIndices);
 negTestIndices = setdiff(testIndices, posIndices) - nPosSequences;
 
-%thisTestIndices{1} = testIndices;
-for i=1:nOuterFolds
+thisTestIndices{1} = testIndices;
+for i=2:nOuterFolds
     thisPosTestIndices = circshift(posIndices, [0, (i-1)*length(posTestIndices)]);
     thisNegTestIndices = circshift(negIndices, [0, (i-1)*length(negTestIndices)]) + length(posIndices);
     thisTestIndices{i} = [thisPosTestIndices(end-(length(posTestIndices))+1:end) thisNegTestIndices(end-(length(negTestIndices))+1:end)];
@@ -327,7 +327,7 @@ case char('distal-proximal-cgi')
 end
 parfor (i=1:nOuterFolds, nOuterFolds)
     thisFoldDirectory = strcat(outputFolder, '/outer_fold_', num2str(i));
-    returnStatusVal = perform_one_outer_fold_for_comikl(i, givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen, ...
+    returnStatusVal = perform_one_outer_fold_for_comik(i, givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen, ...
 							maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, ...
 							thisTestIndices{i}, debugLevel, thisFoldDirectory, whetherToPlotHeatmap, ...
 							computationVersion, whetherToVisualizeWVector, debugMsgLocation);
@@ -336,9 +336,9 @@ case char('distal-proximal-cgi')
 logMessages(1, sprintf('All outer folds completed!\n'), debugLevel);
 
 
-end % comikl_wrapper function ends
+end % comik_wrapper function ends
 
-function statusVal = perform_one_outer_fold_for_comikl(outerFoldID, givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, outputFolder, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector, debugMsgLocation)
+function statusVal = perform_one_outer_fold_for_comik(outerFoldID, givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen, maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, outputFolder, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector, debugMsgLocation)
 % Runs one outer fold for CoMIKL
 
 % Make the outputFolder for this outer fold
@@ -414,11 +414,12 @@ case char('distal-proximal-cgi')
     fclose(fid);
     %
     %
-    [allSeqsAsBags, allSeqsConformedSetKernel, kernelweights, thetaVals, instanceWeightsInEachBag, resultString, bestVals, test_teAUROC, test_teAUPRC, predictions] = comikl_main_with_weight_vector(givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen(l), maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, debugMsgLocation, outputFolder, runSummaryFilename, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector);
+    [allSeqsAsBags, allSeqsConformedSetKernel, kernelweights, thetaVals, instanceWeightsInEachBag, resultString, bestVals, test_teAUROC, test_teAUPRC, predictions] = comik_main_with_weight_vector(givenPosFastaFilename, givenNegFastaFilename, nPosSequences, nNegSequences, oligoLen(l), maxDist, chunkSizeInBps, nClusterVals, sigmaVals, Cs, mklNorm, nFolds, testIndices, debugLevel, debugMsgLocation, outputFolder, runSummaryFilename, whetherToPlotHeatmap, computationVersion, whetherToVisualizeWVector);
     fid=fopen(runSummaryFilename, 'a');
     fprintf(fid, '\n\n');
     fprintf(fid, '=====Validation results - SAMARTH=====\n');
     fprintf(fid, 'OligoLen: %d, maxDist: %d, segment size: %d\n', oligoLen(l), maxDist, chunkSizeInBps);
+    fprintf(fid, 'Validation teAUPRC: %.4f\n', test_teAUPRC);
     fprintf(fid, 'Validation teAUROC: %.4f\n', test_teAUROC);
     fprintf(fid, 'Best C: %.3f\n', bestVals.best_C);
     fprintf(fid, 'Best sigma: %.3f\n', bestVals.best_sigma);