ENH: need one last check, but probably close to final version
pebert committed Sep 14, 2017
1 parent 085c009 commit 65a1b80
Showing 1 changed file with 194 additions and 30 deletions: docs/quantification/chip-seq/CHPv5.xml
<email>pebert@mpi-inf.mpg.de</email>
</author>
<description>
- This description gives a rough summary of the default output of this process
- For details, please carefully read the respective step given below
(match the output filename as listed under Output files)

1. Quality control
1.1 The GC bias and fingerprint QC plots are based on raw / unfiltered BAM files
1.2 The correlation heatmap is based on filtered BAM files, restricted to autosomes

2. IHEC ChIP QC
2.1 This process collects QC metrics as defined by the IHEC Assay Standards work group
2.2 The set of metrics is saved in the analysis metadata file (.amd.tsv) that is generated as part of each run
2.3 All metrics below are computed based on filtered BAM files, following the IHEC reference implementation
2.4 Reference implementation as of Sep. 2017: https://github.com/IHEC/ihec-assay-standards
2.5 final reads: total number of mapped reads in filtered BAM file
2.6 peak reads: total number of reads overlapping called peak regions
2.7 frip: peak reads divided by final reads (fraction of reads in peaks)
2.8 JS dist: Jensen-Shannon distance as computed by deepTools/plotFingerprint
2.9 CHANCE div: CHANCE divergence as computed by deepTools/plotFingerprint
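The FRiP metric above (2.5-2.7) is a single division; a minimal sketch with hypothetical read counts:

```python
def compute_frip(peak_reads, final_reads):
    """FRiP = fraction of reads in peaks: peak reads divided by final (mapped) reads."""
    if final_reads == 0:
        raise ValueError("filtered BAM contains no mapped reads")
    return peak_reads / final_reads

# Hypothetical counts for a filtered BAM and its called peaks
final_reads = 25_000_000   # total mapped reads in filtered BAM
peak_reads = 6_250_000     # reads overlapping called peak regions
print(compute_frip(peak_reads, final_reads))  # 0.25
```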

3. bigWig signal tracks
3.1 log2 fold-change tracks of histone signal over Input
3.2 Coverage tracks of raw / unfiltered BAM files, normalized to 1x depth
3.3 Coverage tracks of filtered BAM files, with blacklist regions removed and 1x normalization based
on autosomes only. These are the recommended signal tracks for assessing
coverage in regions of interest in generic downstream analyses

4. Peak calling
4.1 We use MACS2 and histoneHMM to call peaks
4.2 MACS2 narrow marks: H3K27ac, H3K4me1, H3K4me3
4.3 MACS2 broad marks: H3K27me3, H3K36me3, H3K9me3
4.3.1 The quality and reliability of MACS2 broad peaks are a subject of active debate.
We provide the above files mainly for historical reasons and
to improve compatibility with other epigenome mapping consortia
4.3.2 We discourage the use of the MACS2 broad output for downstream analysis
4.4 histoneHMM broad marks: H3K27me3, H3K36me3, H3K4me1, H3K9me3, Input
4.4.1 We call enriched regions on Input since histoneHMM has no option to use the Input during peak calling
4.5 Called peaks are flagged if they overlap a blacklist region (MACS2 and histoneHMM) or
a region enriched in the Input (histoneHMM only). The flag is encoded in
the name column of the output files.
4.6 The peak files are standardized to follow ENCODE's broadPeak and narrowPeak formats, respectively.
For MACS output, this requires rescaling the score column to the range 0...1000.
For histoneHMM, default values are added and the average posterior is taken as the signalValue.

5. File naming
5.1 DEEPID.PROC.DATE.ASSM (plus suffix and extension)
5.2 Example DEEPID: 43_Hm21_WMAs_Ct_H3K4me3_F_1
5.3 PROC = CHPv5 ; DATE = date of run
5.4 ASSM: assembly like hg38 or m38 (GRCm38)
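A sketch of parsing such a name; the DATE format (YYYYMMDD) and the example suffix are assumptions, not specified above:

```python
import re

# Assumed pattern: DEEPID.PROC.DATE.ASSM plus suffix; the date layout
# (YYYYMMDD) is a guess for illustration only.
name_re = re.compile(r"([^.]+)\.(CHPv5)\.(\d{8})\.([^.]+)\.(.+)")

deepid, proc, date, assm, suffix = name_re.match(
    "43_Hm21_WMAs_Ct_H3K4me3_F_1.CHPv5.20170914.hg38.filt.bamcov"
).groups()
print(deepid, assm)  # 43_Hm21_WMAs_Ct_H3K4me3_F_1 hg38
```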
</description>
<inputs>
<filetype>
<quantity>single</quantity>
<comment>2-column, tab-separated table of chromosome sizes for reference genome</comment>
</filetype>
<filetype>
<identifier>autosomal_regions</identifier>
<format>BED</format>
<quantity>single</quantity>
<comment>A file listing all autosomes as BED regions for filtering</comment>
</filetype>
</references>
<outputs>
<filetype>
<quantity>collection</quantity>
<comment>Zip archive containing other histoneHMM output files (raw data files not needed by most users)</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.macs.out</identifier>
<format>zip</format>
<quantity>collection</quantity>
<comment>Zip archive containing other MACS output files (raw data files not needed by most users)</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.hhmm.broad</identifier>
<format>BED / broadPeak</format>
<quantity>collection</quantity>
<comment>Histone and Input enriched regions called by histoneHMM</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.macs.broad</identifier>
<format>BED / broadPeak</format>
<quantity>collection</quantity>
<comment>Histone enriched regions called by MACS</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.macs.narrow</identifier>
<format>BED / narrowPeak</format>
<quantity>collection</quantity>
<comment>Histone enriched regions called by MACS</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.fgpr</identifier>
<format>SVG</format>
<quantity>single</quantity>
<comment>Fingerprint quality metrics based on raw BAM files</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.counts-fgpr</identifier>
<format>tsv</format>
<quantity>single</quantity>
<comment>Fingerprint raw counts based on raw BAM files</comment>
</filetype>

<filetype>
<identifier>DEEPID.PROC.DATE.auto.counts-summ</identifier>
<format>tsv</format>
<quantity>single</quantity>
<comment>multiBamSummary raw counts based on filtered and autosome-restricted BAM files</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.auto.summ</identifier>
<format>npz</format>
<quantity>single</quantity>
<comment>
multiBamSummary data file based on filtered and autosome-restricted BAM files.
The format is a numpy compatible binary file.
</comment>
</filetype>

</outputs>
<software>
]]>
</command_line>
<loop>GALvX_Histone, GALvX_Input</loop>
<comment>Compute GC bias on raw BAM files.</comment>
</tool>
<tool>
<name>bamCompare</name>
<version>2.5.3</version>
<command_line>
<![CDATA[
plotFingerprint -p {deeptools_parallel} --bamfiles {GALvX_*} --plotFile {DEEPID.PROC.DATE.raw.fgpr}
--labels {plot_labels} --plotTitle {plot_title} --numberOfSamples 500000 --plotFileFormat svg
--outQualityMetrics {DEEPID.PROC.DATE.raw.qm-fgpr} --JSDsample {GALvX_Input}
--outRawCounts {DEEPID.PROC.DATE.raw.counts-fgpr}
]]>
</command_line>
<loop>no looping</loop>
<comment>
Compute fingerprint on raw BAM files. If two or more Input control files
exist, the majority file is selected as the JSD sample.
</comment>
</tool>
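plotFingerprint reports the Jensen-Shannon distance between the ChIP and Input coverage distributions. A standalone sketch of the metric itself, on toy binned coverage fractions (deepTools derives the distributions from binned read counts; the numbers here are made up):

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logs, so the value lies in [0, 1])
    between two discrete probability distributions of equal length."""
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability bins
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Toy binned coverage fractions for a ChIP sample and its Input control
chip = [0.05, 0.10, 0.15, 0.70]
ctrl = [0.25, 0.25, 0.25, 0.25]
print(js_distance(chip, chip))  # 0.0 for identical distributions
print(js_distance(chip, ctrl))  # positive; more focal ChIP signal gives larger values
```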

<tool>
</command_line>
<loop>DEEPID.tmp.filt.bam</loop>
<comment>
Due to the previous filtering step, counting simply all reads in the filtered BAM file
is equivalent to counting only mapped reads.
The number of mapped reads is needed to compute the FRiP score in a later stage.
</comment>
</tool>
<tool>
<name>bamCoverage</name>
<version>2.5.3</version>
<command_line>
<![CDATA[
bamCoverage -p {deeptools_parallel} --binSize 25 --bam DEEPID.tmp.filt.bam --outFileName {DEEPID.PROC.DATE.filt.bamcov}
--outFileFormat bigwig --normalizeTo1x {genomesize} --blackListFileName {blacklist_regions} --ignoreForNorm chrX chrY chrM X Y M MT
]]>
</command_line>
<loop>DEEPID.tmp.filt.bam</loop>
<comment>
Generate read coverage signal normalized to 1x depth for filtered BAM files.
Remove blacklist regions on-the-fly and consider only autosomes for normalization step.
Note that this pipeline is designed to support genomes with and without
the "chr" prefix, hence the different naming styles for --ignoreForNorm.
</comment>
</tool>
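The 1x normalization used above boils down to a single scale factor. A minimal sketch of the idea behind --normalizeTo1x, with a hypothetical library (deepTools additionally handles fragment extension and ignored chromosomes):

```python
def one_x_scale_factor(mapped_reads, read_length, effective_genome_size):
    """Scale factor that turns raw coverage into ~1x mean coverage:
    total sequenced bases / genome size is the raw mean depth, and
    each bin is multiplied by its reciprocal."""
    mean_depth = mapped_reads * read_length / effective_genome_size
    return 1.0 / mean_depth

# Hypothetical library: 50M mapped reads of 100 bp on a 2.5 Gbp effective genome
sf = one_x_scale_factor(50_000_000, 100, 2_500_000_000)
print(sf)  # 0.5: raw mean depth is 2x, so every bin value is halved
```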
<tool>
<name>plotFingerprint</name>
<version>2.5.3</version>
<command_line>
<![CDATA[
plotFingerprint -p {deeptools_parallel} --bamfiles DEEPID.tmp.filt.bam --plotFile DEEPID.PROC.DATE.filt.fgpr.tmp
--labels {plot_labels} --plotTitle {plot_title} --numberOfSamples 500000 --plotFileFormat svg
--outQualityMetrics DEEPID.PROC.DATE.filt.qm-fgpr.tmp --JSDsample DEEPID_Input.tmp.filt.bam
--outRawCounts DEEPID.PROC.DATE.filt.counts-fgpr
]]>
</command_line>
<loop>no looping</loop>
<comment>
Compute fingerprint on filtered BAM files to collect IHEC QC metrics.
The output files of this step are temporary and are discarded after the analysis.
</comment>
</tool>
<tool>
<name>MACS2</name>
]]>
</command_line>
<loop>GALvX_Histone</loop>
<comment>
MACS2 peak calling on filtered BAM files.
Parameter &quot;--broad&quot; set for libraries H3K27me3/H3K36me3/H3K9me3
</comment>
</tool>
<tool>
<name>zip</name>
<version>3.0</version>
<command_line>
<![CDATA[
zip -9 -X -j -q -D {DEEPID.PROC.DATE.macs.out} DEEPID_macs*
]]>
</command_line>
<loop>GALvX_Histone</loop>
<comment>
Zip all secondary MACS output files per histone mark: _peaks.xls, _summits.bed and _peaks.gappedPeak,
depending on parameter &quot;--broad&quot;
</comment>
</tool>

<tool>
<command_line>
<![CDATA[
cut -f 1,4,5,9 DEEPID-regions.gff | sort -V -k1,2 > DEEPID.hmm.bed &&
mv DEEPID-zinba-emfit.pdf {DEEPID.PROC.DATE.hhmm.emfit}
]]>
</command_line>
<loop>DEEPID-regions.gff</loop>
<comment>Make histoneHMM output BED-like for blacklist intersection and standardize name of EM fit PDF</comment>
</tool>
<tool>
<name>zip</name>
<version>3.0</version>
<command_line>
<![CDATA[
zip -9 -X -j -q -D {DEEPID.PROC.DATE.hhmm.out} DEEPID_hhmm*
]]>
</command_line>
<loop>GALvX_Histone</loop>
<comment>
Zip all secondary histoneHMM output files per histone mark: -em-posterior.txt, -zinba-params-em.RData,
-zinba-params-em.txt, .txt
</comment>
</tool>


<tool>
<name>sambamba</name>
]]>
</command_line>
<loop>DEEPID.tmp.filt.bam</loop>
<comment>Count number of reads overlapping peak regions, later used for FRiP score</comment>
</tool>

<tool>
<name>bedtools</name>
<version>2.26.0</version>
<command_line>
<![CDATA[
bedtools intersect -u -a peak_file -b {blacklist_regions} > DEEPID.peak-ovl-bl
]]>
</command_line>
<loop>peak_file</loop>
<comment>Intersect all peak files with blacklist regions for flagging</comment>
</tool>

<tool>
<name>bedtools</name>
<version>2.26.0</version>
<command_line>
<![CDATA[
bedtools intersect -u -a histoneHMM_peak_file -b histoneHMM_Input_peaks > DEEPID.peak-ovl-input
]]>
</command_line>
<loop>histoneHMM_peak_file</loop>
<comment>Intersect all histoneHMM peak files with Input peaks for flagging</comment>
</tool>
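In both intersect steps, the -u flag reports each peak at most once if it overlaps anything in the second file; the underlying per-peak test is plain interval overlap, sketched here on hypothetical regions:

```python
def overlaps_any(peak, regions):
    """True if the half-open interval (chrom, start, end) overlaps any region
    in the list -- the per-peak question that `bedtools intersect -u` answers."""
    chrom, start, end = peak
    return any(c == chrom and min(end, r_end) - max(start, r_start) > 0
               for c, r_start, r_end in regions)

blacklist = [("chr1", 1000, 2000), ("chr2", 500, 800)]
print(overlaps_any(("chr1", 1500, 1600), blacklist))  # True
print(overlaps_any(("chr1", 3000, 3100), blacklist))  # False
```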

<tool>
<name>Python</name>
<version>2.7.13</version>
<command_line>
<![CDATA[
pipeline-merge peak-file DEEPID.peak-ovl-bl DEEPID.peak-ovl-input
> {DEEPID.PROC.DATE.macs.narrow} {DEEPID.PROC.DATE.macs.broad} {DEEPID.PROC.DATE.hhmm.broad}
]]>
</command_line>
<loop>peak-file</loop>
<comment>
Pipeline-internal on-the-fly peak flagging and standardization: for MACS2 output, score
column is rescaled to range 0...1000, for histoneHMM, default values are added to fulfill
standard broadPeak/narrowPeak format specifications.
</comment>
</tool>
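The flagging and standardization described above is pipeline-internal and its code is not shown. A hedged sketch of the two operations, with made-up flag labels and min-max scaling assumed:

```python
def rescale_scores(scores, lo=0, hi=1000):
    """Linearly map raw MACS score values into the 0...1000 range required
    by ENCODE narrowPeak/broadPeak (sketch; min-max scaling assumed)."""
    s_min, s_max = min(scores), max(scores)
    if s_max == s_min:
        return [hi for _ in scores]
    return [round((s - s_min) / (s_max - s_min) * (hi - lo) + lo) for s in scores]

def flag_name(name, in_blacklist, in_input=False):
    """Encode overlap flags in the BED name column (the flag labels here are
    invented for illustration; the pipeline's actual labels are not shown)."""
    flags = (["blacklist"] if in_blacklist else []) + (["input"] if in_input else [])
    return name + ("|" + ",".join(flags) if flags else "")

print(rescale_scores([10, 55, 100]))           # [0, 500, 1000]
print(flag_name("peak_1", in_blacklist=True))  # peak_1|blacklist
```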

<version>0.6.6</version>
<command_line>
<![CDATA[
sambamba view --format=bam --nthreads={sambamba_parallel} --output-filename DEEPID.PROC.DATE.tmp.auto.bam
--regions={autosomal_regions} DEEPID.tmp.filt.bam
]]>
</command_line>
<version>2.5.3</version>
<command_line>
<![CDATA[
multiBamSummary bins -p 8 --bamfiles DEEPID.tmp.auto.bam --outFileName {DEEPID.PROC.DATE.auto.summ}
--labels {plot_labels} --binSize 1000 --distanceBetweenBins 2000 --blackListFileName {blacklist_regions}
--outRawCounts {DEEPID.PROC.DATE.auto.counts-summ}
]]>
</command_line>
<loop>no looping</loop>
plotCorrelation bins --corData {DEEPID.PROC.DATE.auto.summ} --plotFile {DEEPID.PROC.DATE.bamcorr} --whatToPlot heatmap
--plotTitle {plot_title} --plotFileFormat svg --corMethod {cor_method}
--plotNumbers --zMin -1 --zMax 1 --colorMap coolwarm
--outFileCorMatrix {DEEPID.PROC.DATE.corrmat}
]]>
</command_line>
<loop>no looping</loop>
