From 65a1b804bba4a639bcd74aa87f23a1ee006f32c7 Mon Sep 17 00:00:00 2001 From: Peter Ebert Date: Thu, 14 Sep 2017 19:50:06 +0200 Subject: [PATCH] ENH: need one last check, but probably close to final version --- docs/quantification/chip-seq/CHPv5.xml | 224 +++++++++++++++++++++---- 1 file changed, 194 insertions(+), 30 deletions(-) diff --git a/docs/quantification/chip-seq/CHPv5.xml b/docs/quantification/chip-seq/CHPv5.xml index 3b97b0e..8f92919 100644 --- a/docs/quantification/chip-seq/CHPv5.xml +++ b/docs/quantification/chip-seq/CHPv5.xml @@ -8,13 +8,54 @@ pebert@mpi-inf.mpg.de - Key points: - (1) deepTools QC (fingerprint / GC bias) for raw BAM files [same as before] - (2) deepTools QC (fingerprint) for filtered BAM files to create IHEC QC as part of the process [new] - (3) deepTools correlation for filtered and blacklist removed BAM files [same as before] - (4) peak calling with MACS2 and histoneHMM on filtered BAM files [partially new] - (5) deepTools fold-change and coverage tracks for raw BAM files [same as before] - (6) deepTools coverage track for filtered BAM file [same as before] + - This description gives a rough summary of what this process produces as default output + - For details, please do carefully read the respective step given below + (match the output filename as listed under Output files) + + 1. Quality control + 1.1 The GC bias and fingerprint QC plots are based on raw / unfiltered BAM files + 1.2 The correlation heatmap is based on filtered BAM files, restricted to autosomes + + 2. 
IHEC ChIP QC + 2.1 This process collects QC metrics as defined by the IHEC Assay Standards work group + 2.2 The set of metrics is saved in the analysis metadata file (.amd.tsv) that is generated as part of each run + 2.3 All metrics below are computed based on filtered BAM files, following the IHEC reference implementation + 2.4 2017-Sep.: https://github.com/IHEC/ihec-assay-standards + 2.5 final reads: total number of mapped reads in filtered BAM file + 2.6 peak reads: total number of reads overlapping called peak regions + 2.7 frip: peak reads divided by final reads (fraction of reads in peaks) + 2.8 JS dist: Jensen-Shannon distance as computed by deepTools/plotFingerprint + 2.9 CHANCE div: CHANCE divergence as computed by deepTools/plotFingerprint + + 3. bigWig signal tracks + 3.1 log2 fold-change tracks of histone signal over Input + 3.2 Coverage tracks of raw / unfiltered BAM files, normalized to 1x depth + 3.3 Coverage tracks of filtered BAM files, with blacklist removed and 1x normalization based + on autosomes only. These are the recommended signal tracks to use in order to assess + the coverage in regions of interest in generic downstream analyses + + 4. Peak calling + 4.1 We use MACS2 and histoneHMM to call peaks + 4.2 MACS2 narrow marks: H3K27ac, H3K4me1, H3K4me3 + 4.3 MACS2 broad marks: H3K27me3, H3K36me3, H3K9me3 + 4.3.1 The quality and reliability of MACS2 broad peaks are a subject of active debate. + We provide the above files mainly for historical reasons and + to improve compatibility with other epigenome mapping consortia + 4.3.2 We discourage the use of the MACS2 broad output for downstream analysis + 4.4 histoneHMM broad marks: H3K27me3, H3K36me3, H3K4me1, H3K9me3, Input + 4.4.1 We call enriched regions on Input since histoneHMM has no option to use the Input during peak calling + 4.5 Called peaks are flagged for overlapping with a blacklist region (MACS2 and histoneHMM) and for + overlapping with a region enriched in the Input (histoneHMM only). 
The flagging is realized via + the name column in the output files. + 4.6 The peak files are standardized to follow ENCODE's broadPeak and narrowPeak formats, respectively. + For MACS output, this requires rescaling the score column to the range 0...1000. + For histoneHMM, default values are added and the average posterior is taken as the signalValue. + + 5. File naming + 5.1 DEEPID.PROC.DATE.ASSM (plus suffix and extension) + 5.2 Example DEEPID: 43_Hm21_WMAs_Ct_H3K4me3_F_1 + 5.3 PROC = CHPv5 ; DATE = date of run + 5.4 ASSM: assembly like hg38 or m38 (GRCm38) @@ -65,6 +106,12 @@ single 2-column, tab-separated table of chromosome sizes for reference genome + + autosomal_regions + BED + single + A file listing all autosomes as BED regions for filtering + + @@ -117,6 +164,30 @@ collection Zip archive containing other histoneHMM output files (raw data files not needed by most users) + + DEEPID.PROC.DATE.macs.out + zip + collection + Zip archive containing other MACS output files (raw data files not needed by most users) + + + DEEPID.PROC.DATE.hhmm.broad + BED / broadPeak + collection + Histone and Input enriched regions called by histoneHMM + + + DEEPID.PROC.DATE.macs.broad + BED / broadPeak + collection + Histone enriched regions called by MACS + + + DEEPID.PROC.DATE.macs.narrow + BED / narrowPeak + collection + Histone enriched regions called by MACS + DEEPID.PROC.DATE.fgpr SVG @@ -129,6 +200,28 @@ single Fingerprint quality metrics based on raw BAM files + + DEEPID.PROC.DATE.counts-fgpr + tsv + single + Fingerprint raw counts based on raw BAM files + + + + DEEPID.PROC.DATE.auto.counts-summ + tsv + single + multiBamSummary raw counts based on filtered and autosome-restricted BAM files + + + DEEPID.PROC.DATE.auto.summ + npz + single + + multiBamSummary data file based on filtered and autosome-restricted BAM files. + The format is a NumPy-compatible binary file. 
+ + @@ -156,7 +249,7 @@ ]]> GALvX_Histone, GALvX_Input - Compute GC bias on raw BAM files + Compute GC bias on raw BAM files. bamCompare @@ -179,13 +272,17 @@ 2.5.3 no looping - Compute fingerprint on raw BAM files + + Compute fingerprint on raw BAM files. In cases where there are two or more Input control + files, the majority file will be selected as the JSD-sample file. + @@ -214,9 +311,9 @@ DEEPID.tmp.filt.bam - Due to the previous filtering step, counting simply all reads in the filtered BAM - file is equivalent to counting only mapped reads. The number of mapped reads is needed - to compute the FRiP score in a later stage. + Due to the previous filtering step, simply counting all reads in the filtered BAM file + is equivalent to counting only mapped reads. + The number of mapped reads is needed to compute the FRiP score in a later stage. @@ -224,7 +321,7 @@ 2.5.3 @@ -232,6 +329,8 @@ Generate read coverage signal normalized to 1x depth for filtered BAM files. Remove blacklist regions on-the-fly and consider only autosomes for normalization step. + Note that the implementation of this pipeline is designed to support genomes + with and without "chr" prefix, hence the different naming styles for --ignoreForNorm. @@ -239,13 +338,17 @@ 2.5.3 no looping - Compute fingerprint on filtered BAM files to compute IHEC QC measures + + Compute fingerprint on filtered BAM files to collect IHEC QC metrics. + The output files of this step are temporary and are discarded after the analysis. + MACS2 @@ -257,7 +360,24 @@ ]]> GALvX_Histone - MACS2 peak calling on filtered BAM files. Parameter "--broad" for libraries H3K27me3/H3K36me/H3K9me3 + + MACS2 peak calling on filtered BAM files. 
+ Parameter "--broad" set for libraries H3K27me3/H3K36me3/H3K9me3 + + + + zip + 3.0 + + + + GALvX_Histone + + zip all secondary MACS output files per histone mark: _peaks.xls, _summits.bed and _peaks.gappedPeak + depending on parameter "--broad" + @@ -278,12 +398,27 @@ DEEPID.hmm.bed && - mv DEEPID-zinba-emfit.pdf DEEPID.PROC.DATE.hhmm.emfit.pdf + mv DEEPID-zinba-emfit.pdf {DEEPID.PROC.DATE.hhmm.emfit} ]]> DEEPID-regions.gff - Make histoneHMM output BED-like for blacklist intersection and standardize name of EM fit PDF. + Make histoneHMM output BED-like for blacklist intersection and standardize name of EM fit PDF. + + zip + 3.0 + + + + GALvX_Histone + + zip all secondary histoneHMM output files per histone mark: -em-posterior.txt, -zinba-params-em.RData, + -zinba-params-em.txt, .txt + + + sambamba @@ -295,20 +430,47 @@ ]]> DEEPID.tmp.filt.bam - Get flagstat output for filtered BAM files, specifically number of mapped reads in these files + Count the number of reads overlapping peak regions, later used for the FRiP score + - custom - 0.1 + bedtools + 2.26.0 DEEPID.peak-ovl-bl ]]> - no looping + peak_file + Intersect all peak files with blacklist regions for flagging + + + + bedtools + 2.26.0 + + DEEPID.peak-ovl-input + ]]> + + histoneHMM_peak_file + Intersect all histoneHMM peak files with Input peaks for flagging + + + + Python + 2.7.13 + + {DEEPID.PROC.DATE.macs.narrow} {DEEPID.PROC.DATE.macs.broad} {DEEPID.PROC.DATE.hhmm.broad} + ]]> + + peak-file - Compute FRiP score and record in analysis metadata file (.amd.tsv). - Values input from the two previous steps. + Pipeline-internal on-the-fly peak flagging and standardization: for MACS2 output, the score + column is rescaled to the range 0...1000; for histoneHMM, default values are added to fulfill + the standard broadPeak/narrowPeak format specifications. 
@@ -317,7 +479,7 @@ 0.6.6 @@ -331,8 +493,9 @@ 2.5.3 no looping @@ -346,6 +509,7 @@ plotCorrelation bins --corData SAMPLEID.npz --plotFile {DEEPID.PROC.DATE.bamcorr} --whatToPlot heatmap --plotTitle {plot_title} --plotFileFormat svg --corMethod {cor_method} --plotNumbers --zMin -1 --zMax 1 --colorMap coolwarm + --outFileCorMatrix {DEEPID.PROC.DATE.corrmat} ]]> no looping
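
Illustrative note (not part of the patch): the process summary defines frip as peak reads divided by final reads, and states that the MACS2 score column is rescaled to the range 0...1000. The sketch below shows one way these two computations could look. The function names frip and rescale_scores are hypothetical, and the min-max scheme used for rescaling is an assumption, since the patch does not say how the pipeline rescales scores.

```python
def frip(peak_reads, final_reads):
    """Fraction of reads in peaks: peak reads divided by final reads."""
    if final_reads <= 0:
        raise ValueError("final_reads must be positive")
    return peak_reads / float(final_reads)


def rescale_scores(scores, upper=1000):
    """Rescale raw peak scores to integers in the range 0...upper.

    Assumption: plain min-max scaling; the actual pipeline may use
    a different scheme.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: map every peak to the upper bound.
        return [upper for _ in scores]
    return [int(round((s - lo) * upper / float(hi - lo))) for s in scores]
```

For example, frip(250, 1000) gives 0.25, which corresponds to the value recorded as frip in the analysis metadata file when 250 of 1000 filtered reads overlap called peaks.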