<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://deep.mpi-inf.mpg.de/DAC/files/style/deep_process_style.css"?>
<process>
<name>CHP</name>
<version>5</version>
<author>
<name>Peter Ebert</name>
<email>pebert@mpi-inf.mpg.de</email>
</author>
<description>
- This description gives a rough summary of the default output of this process
- For details, please read the respective step given below carefully
(match the output filename as listed under Output files)
1. Quality control
1.1 The GC bias and fingerprint QC plots are based on raw / unfiltered BAM files
1.2 The correlation heatmap is based on filtered BAM files, restricted to autosomes
2. IHEC ChIP QC
2.1 This process collects QC metrics as defined by the IHEC Assay Standards work group
2.2 The set of metrics is saved in the analysis metadata file (.amd.tsv) that is generated as part of each run
2.3 All metrics below are computed based on filtered BAM files, following the IHEC reference implementation
2.4 2017-Sep.: https://github.com/IHEC/ihec-assay-standards
2.5 final reads: total number of mapped reads in filtered BAM file
2.6 peak reads: total number of reads overlapping called peak regions
2.7 frip: peak reads divided by final reads (fraction of reads in peaks)
2.8 JS dist: Jensen-Shannon distance as computed by deepTools/plotFingerprint
2.9 CHANCE div: CHANCE divergence as computed by deepTools/plotFingerprint
3. bigWig signal tracks
3.1 log2 fold-change tracks of histone signal over Input
3.2 Coverage tracks of raw / unfiltered BAM files, normalized to 1x depth
3.3 Coverage tracks of filtered BAM files, with blacklist removed and 1x normalization based
on autosomes only. These are the recommended signal tracks to use in order to assess
the coverage in regions of interest in generic downstream analyses
4. Peak calling
4.1 We use MACS2 and histoneHMM to call peaks
4.2 MACS2 narrow marks: H3K27ac, H3K4me1, H3K4me3
4.3 MACS2 broad marks: H3K27me3, H3K36me3, H3K9me3
4.3.1 The quality and reliability of MACS2 broad peaks are a subject of active debate.
We provide the above files mainly for historical reasons and
to improve compatibility with other epigenome mapping consortia
4.3.2 We discourage the use of the MACS2 broad output for downstream analysis
4.4 histoneHMM broad marks: H3K27me3, H3K36me3, H3K4me1, H3K9me3, Input
4.4.1 We call enriched regions on Input since histoneHMM has no option to use the Input during peak calling
4.5 Called peaks are flagged if they overlap a blacklist region (MACS2 and histoneHMM) or
a region enriched in the Input (histoneHMM only). The flag is recorded in
the name column of the output files.
4.6 The peak files are standardized to follow ENCODE's broadPeak and narrowPeak format, respectively.
For MACS output, this requires a rescaling of the score column to be in the range of 0...1000.
For histoneHMM, default values are added and the average posterior is taken as the signalValue.
5. File naming
5.1 DEEPID.PROC.DATE.ASSM (plus suffix and extension)
5.2 Example DEEPID: 43_Hm21_WMAs_Ct_H3K4me3_F_1
5.3 PROC = CHPv5 ; DATE = date of run
5.4 ASSM: assembly like hg38 or m38 (GRCm38)
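The naming scheme in section 5 can be sketched as follows. The helper name build_basename, the date string and the file extension are hypothetical and chosen only for illustration; the date format actually used by the pipeline is not specified here.

```python
def build_basename(deepid, proc, date, assm, suffix="", ext=""):
    """Join the four naming components with dots; suffix and extension are optional."""
    parts = [deepid, proc, date, assm]
    if suffix:
        parts.append(suffix)
    if ext:
        parts.append(ext)
    return ".".join(parts)


# Example DEEPID taken from 5.2; date and extension are placeholders
name = build_basename(
    "43_Hm21_WMAs_Ct_H3K4me3_F_1", "CHPv5", "20170901", "hg38",
    suffix="filt.bamcov", ext="bw",
)
print(name)
```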
</description>
<inputs>
<filetype>
<identifier>GALvX_Histone</identifier>
<format>BAM</format>
<quantity>collection</quantity>
<comment>Only paired-end libraries are supported</comment>
</filetype>
<filetype>
<identifier>GALvX_Input</identifier>
<format>BAM</format>
<quantity>single</quantity>
<comment>Only paired-end libraries are supported</comment>
</filetype>
<filetype>
<identifier>GALvX_Index</identifier>
<format>BAI</format>
<quantity>collection</quantity>
<comment>
No distinction between histone and Input library for the index files - one index file per BAM file is required
</comment>
</filetype>
<filetype>
<identifier>GALvX_QCSummary</identifier>
<format>JSON / txt</format>
<quantity>collection</quantity>
<comment>
The median insert size (field: insertSizeMedian) is extracted from the QC summary file.
Note that for compatibility with previous alignment processes, the QC summary files
may also have the old tabular / text-based format (field: PE_insertsize (mapq&gt;0))
</comment>
</filetype>
</inputs>
<references>
<filetype>
<identifier>reference_genome</identifier>
<format>2bit</format>
<quantity>single</quantity>
<comment>The reference genome file; see DCC/download/results/references/genomes</comment>
</filetype>
<filetype>
<identifier>blacklist_regions</identifier>
<format>BED</format>
<quantity>single</quantity>
<comment>Blacklist regions that are removed from coverage tracks and used to flag called peaks</comment>
</filetype>
<filetype>
<identifier>chromosome_sizes</identifier>
<format>TSV</format>
<quantity>single</quantity>
<comment>2-column, tab-separated table of chromosome sizes for reference genome</comment>
</filetype>
<filetype>
<identifier>autosome_regions</identifier>
<format>BED</format>
<quantity>single</quantity>
<comment>A file listing all autosomes as BED regions for filtering</comment>
</filetype>
</references>
<outputs>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.raw.bamcov</identifier>
<format>bigwig</format>
<quantity>collection</quantity>
<comment>Signal coverage track generated from raw BAM files</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.filt.bamcov</identifier>
<format>bigwig</format>
<quantity>collection</quantity>
<comment>
Signal coverage track generated from filtered BAM files (-F 3844, q &gt;= 5, blacklist regions removed)
</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.ses.log2-Input</identifier>
<format>bigwig</format>
<quantity>collection</quantity>
<comment>SES normalized signal-over-Input track</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.cnt.log2-Input</identifier>
<format>bigwig</format>
<quantity>collection</quantity>
<comment>Read-count normalized signal-over-Input track</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.gcbias</identifier>
<format>svg</format>
<quantity>collection</quantity>
<comment>GC bias plot based on raw BAM files</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.gcfreq</identifier>
<format>txt</format>
<quantity>collection</quantity>
<comment>Obs./exp. GC read frequencies based on raw BAM files</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.hhmm.emfit</identifier>
<format>PDF</format>
<quantity>collection</quantity>
<comment>histoneHMM output visualizing the EM fit. Check this before using the histoneHMM output</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.hhmm.out</identifier>
<format>zip</format>
<quantity>collection</quantity>
<comment>Zip archive containing other histoneHMM output files (raw data files not needed by most users)</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.macs.out</identifier>
<format>zip</format>
<quantity>collection</quantity>
<comment>Zip archive containing other MACS output files (raw data files not needed by most users)</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.hhmm.broad</identifier>
<format>BED / broadPeak</format>
<quantity>collection</quantity>
<comment>Histone and Input enriched regions called by histoneHMM</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.macs.broad</identifier>
<format>BED / broadPeak</format>
<quantity>collection</quantity>
<comment>Histone enriched regions called by MACS</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.macs.narrow</identifier>
<format>BED / narrowPeak</format>
<quantity>collection</quantity>
<comment>Histone enriched regions called by MACS</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.fgpr</identifier>
<format>SVG</format>
<quantity>single</quantity>
<comment>Fingerprint plots based on raw BAM files</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.qm-fgpr</identifier>
<format>txt</format>
<quantity>single</quantity>
<comment>Fingerprint quality metrics based on raw BAM files</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.counts-fgpr</identifier>
<format>tsv</format>
<quantity>single</quantity>
<comment>Fingerprint raw counts based on raw BAM files</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.auto.counts-summ</identifier>
<format>tsv</format>
<quantity>single</quantity>
<comment>multiBamSummary raw counts based on filtered and autosome-restricted BAM files</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.auto.summ</identifier>
<format>npz</format>
<quantity>single</quantity>
<comment>
multiBamSummary data file based on filtered and autosome-restricted BAM files.
The format is a numpy compatible binary file.
</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.bamcorr</identifier>
<format>SVG</format>
<quantity>collection</quantity>
<comment>Correlation heatmaps using the Pearson and Spearman correlation measures</comment>
</filetype>
<filetype>
<identifier>DEEPID.PROC.DATE.ASSM.corrmat</identifier>
<format>tsv</format>
<quantity>collection</quantity>
<comment>Raw correlation matrices</comment>
</filetype>
</outputs>
<software>
<tool>
<name>bamCoverage</name>
<version>2.5.3</version>
<command_line>
<![CDATA[
bamCoverage -p {deeptools_parallel} --binSize 25 --bam {GALvX_*} --outFileName {DEEPID.PROC.DATE.ASSM.raw.bamcov}
--outFileFormat bigwig --normalizeTo1x {genomesize}
]]>
</command_line>
<loop>GALvX_Histone, GALvX_Input</loop>
<comment>Generate read coverage signal normalized to 1x depth for raw BAM files</comment>
</tool>
<tool>
<name>computeGCBias</name>
<version>2.5.3</version>
<command_line>
<![CDATA[
computeGCBias -p {deeptools_parallel} --bamfile {GALvX_*} --effectiveGenomeSize {genomesize}
--genome {reference_genome} --sampleSize 50000000 --fragmentLength {*_fraglen}
--GCbiasFrequenciesFile {DEEPID.PROC.DATE.ASSM.gcfreq} --biasPlot {DEEPID.PROC.DATE.ASSM.gcbias} --plotFileFormat svg
]]>
</command_line>
<loop>GALvX_Histone, GALvX_Input</loop>
<comment>Compute GC bias on raw BAM files.</comment>
</tool>
<tool>
<name>bamCompare</name>
<version>2.5.3</version>
<command_line>
<![CDATA[
bamCompare -p {deeptools_parallel} --bamfile1 {GALvX_Histone} --bamfile2 {GALvX_Input}
--outFileName {DEEPID.PROC.DATE.ASSM.*.log2-Input} --outFileFormat bigwig --scaleFactorsMethod {*_scaling}
--ratio log2 --binSize 25
]]>
</command_line>
<loop>GALvX_Histone</loop>
<comment>
Generate log2 fold-change tracks of signal over input for raw BAM files with scaling method
&quot;readCount&quot; for libraries H3K27me3/H3K9me3, and &quot;SES&quot; otherwise
</comment>
</tool>
<tool>
<name>plotFingerprint</name>
<version>2.5.3</version>
<command_line>
<![CDATA[
plotFingerprint -p {deeptools_parallel} --bamfiles {GALvX_*} --plotFile {DEEPID.PROC.DATE.ASSM.fgpr}
--labels {plot_labels} --plotTitle {plot_title} --numberOfSamples 500000 --plotFileFormat svg
--outQualityMetrics {DEEPID.PROC.DATE.ASSM.qm-fgpr} --JSDsample {GALvX_Input}
--outRawCounts {DEEPID.PROC.DATE.ASSM.counts-fgpr}
]]>
</command_line>
<loop>no looping</loop>
<comment>
Compute fingerprint on raw BAM files. In cases where there are two or more Input control
files, the majority file will be selected as JSD-sample file.
</comment>
</tool>
<tool>
<name>sambamba</name>
<version>0.6.6</version>
<command_line>
<![CDATA[
sambamba view --format=bam --nthreads={sambamba_parallel} --output-filename DEEPID.tmp.filt.bam
--filter="not (duplicate or unmapped or failed_quality_control or supplementary or secondary_alignment) and mapping_quality >= 5"
{GALvX_*}
]]>
</command_line>
<loop>GALvX_Histone, GALvX_Input</loop>
<comment>
Apply IHEC ChIP QC standard filtering to all BAM files (equivalent to bitflag 3844).
The resulting BAM files are temporary and discarded after the analysis.
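As a quick check of the bitflag stated above, the five read categories excluded by the sambamba filter add up to 3844. The flag values are those defined in the SAM specification; the dictionary name is only illustrative. The mapping quality threshold is applied in addition and is not part of the bitflag.

```python
# SAM flag bits for the five read categories excluded by the filter expression
EXCLUDED_FLAGS = {
    "unmapped": 0x4,
    "secondary_alignment": 0x100,
    "failed_quality_control": 0x200,
    "duplicate": 0x400,
    "supplementary": 0x800,
}

combined = sum(EXCLUDED_FLAGS.values())
print(combined)  # 3844
```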
</comment>
</tool>
<tool>
<name>sambamba</name>
<version>0.6.6</version>
<command_line>
<![CDATA[
sambamba view --count DEEPID.tmp.filt.bam > DEEPID.mapped.readcount
]]>
</command_line>
<loop>DEEPID.tmp.filt.bam</loop>
<comment>
Due to the previous filtering step, simply counting all reads in the filtered BAM file
is equivalent to counting only mapped reads.
The number of mapped reads is needed to compute the FRiP score in a later stage.
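A minimal sketch of the FRiP computation mentioned above, following the definition in the process description (peak reads divided by final reads). The function name and the example counts are made up for illustration; in the pipeline, both counts would come from the sambamba read-count steps.

```python
def frip(peak_reads, final_reads):
    """Fraction of reads in peaks: reads overlapping peaks / all reads in the filtered BAM."""
    if final_reads == 0:
        raise ValueError("filtered BAM file contains no reads")
    return peak_reads / float(final_reads)


# Illustrative counts only, not taken from a real sample
score = frip(2500000, 10000000)
print(score)  # 0.25
```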
</comment>
</tool>
<tool>
<name>bamCoverage</name>
<version>2.5.3</version>
<command_line>
<![CDATA[
bamCoverage -p {deeptools_parallel} --binSize 25 --bam DEEPID.tmp.filt.bam --outFileName {DEEPID.PROC.DATE.ASSM.filt.bamcov}
--outFileFormat bigwig --normalizeTo1x {genomesize} --blackListFileName {blacklist_regions} --ignoreForNorm chrX chrY chrM X Y M MT
]]>
</command_line>
<loop>DEEPID.tmp.filt.bam</loop>
<comment>
Generate read coverage signal normalized to 1x depth for filtered BAM files.
Remove blacklist regions on-the-fly and consider only autosomes for normalization step.
Note that the implementation of this pipeline is designed to support genomes
with and without "chr" prefix, hence the various different naming styles for --ignoreForNorm.
</comment>
</tool>
<tool>
<name>plotFingerprint</name>
<version>2.5.3</version>
<command_line>
<![CDATA[
plotFingerprint -p {deeptools_parallel} --bamfiles DEEPID.tmp.filt.bam --plotFile DEEPID.PROC.DATE.tmp.filt.fgpr
--labels {plot_labels} --plotTitle {plot_title} --numberOfSamples 500000 --plotFileFormat svg
--outQualityMetrics DEEPID.tmp.filt.qm-fgpr.tmp --JSDsample DEEPID_Input.tmp.filt.bam
--outRawCounts DEEPID.tmp.filt.counts-fgpr
]]>
</command_line>
<loop>no looping</loop>
<comment>
Compute fingerprint on filtered BAM files to collect IHEC QC metrics.
The output files of this step are temporary and are discarded after the analysis.
</comment>
</tool>
<tool>
<name>MACS2</name>
<version>2.1.1.20160309</version>
<command_line>
<![CDATA[
macs2 callpeak -t DEEPID.tmp.filt.bam -c DEEPID_Input.tmp.filt.bam -f BAM --gsize {genomesize}
--keep-dup all --name {*_name_prefix} --nomodel --extsize {*_fraglen} --qvalue 0.05 {*_broad}
]]>
</command_line>
<loop>GALvX_Histone</loop>
<comment>
MACS2 peak calling on filtered BAM files.
Parameter &quot;--broad&quot; set for libraries H3K27me3/H3K36me3/H3K9me3
</comment>
</tool>
<tool>
<name>zip</name>
<version>3.0</version>
<command_line>
<![CDATA[
zip -9 -X -j -q -D {DEEPID.PROC.DATE.ASSM.macs.out} DEEPID_macs*
]]>
</command_line>
<loop>GALvX_Histone</loop>
<comment>
Zip all secondary MACS output files per histone mark: _peaks.xls, _summits.bed and _peaks.gappedPeak,
depending on parameter &quot;--broad&quot;
</comment>
</tool>
<tool>
<name>histoneHMM</name>
<version>1.7</version>
<command_line>
<![CDATA[
histoneHMM_call_regions.R -b 750 --chromlen={chromosome_sizes}
--outprefix=DEEPID-regions.gff --probability=0.1 DEEPID.tmp.filt.bam
]]>
</command_line>
<loop>GALvX_Histone</loop>
<comment>histoneHMM peak calling on filtered BAM files for broad marks: H3K4me1/H3K27me3/H3K9me3/H3K36me3</comment>
</tool>
<tool>
<name>cut, sort, mv</name>
<version>8.13</version>
<command_line>
<![CDATA[
cut -f 1,4,5,9 DEEPID-regions.gff | sort -V -k1,2 > DEEPID.hmm.bed &&
mv DEEPID-zinba-emfit.pdf {DEEPID.PROC.DATE.ASSM.hhmm.emfit}
]]>
</command_line>
<loop>DEEPID-regions.gff</loop>
<comment>Make histoneHMM output BED-like for blacklist intersection and standardize name of EM fit PDF</comment>
</tool>
<tool>
<name>zip</name>
<version>3.0</version>
<command_line>
<![CDATA[
zip -9 -X -j -q -D {DEEPID.PROC.DATE.ASSM.hhmm.out} DEEPID_hhmm*
]]>
</command_line>
<loop>GALvX_Histone</loop>
<comment>
Zip all secondary histoneHMM output files per histone mark: -em-posterior.txt, -zinba-params-em.RData,
-zinba-params-em.txt, .txt
</comment>
</tool>
<tool>
<name>sambamba</name>
<version>0.6.6</version>
<command_line>
<![CDATA[
sambamba view --count --nthreads={sambamba_parallel}
--regions=peak_file DEEPID.tmp.filt.bam > DEEPID.tmp.peak_ovl.cnt
]]>
</command_line>
<loop>DEEPID.tmp.filt.bam</loop>
<comment>Count number of reads overlapping peak regions, later used for FRiP score</comment>
</tool>
<tool>
<name>bedtools</name>
<version>2.26.0</version>
<command_line>
<![CDATA[
bedtools intersect -u -a peak_file -b {blacklist_regions} > DEEPID.peak-ovl-bl
]]>
</command_line>
<loop>peak_file</loop>
<comment>Intersect all peak files with blacklist regions for flagging</comment>
</tool>
<tool>
<name>bedtools</name>
<version>2.26.0</version>
<command_line>
<![CDATA[
bedtools intersect -u -a histoneHMM_peak_file -b histoneHMM_Input_peaks > DEEPID.peak-ovl-input
]]>
</command_line>
<loop>histoneHMM_peak_file</loop>
<comment>Intersect all histoneHMM peak files with Input peaks for flagging</comment>
</tool>
<tool>
<name>Python</name>
<version>2.7.13</version>
<command_line>
<![CDATA[
pipeline-merge peak-file DEEPID.peak-ovl-bl DEEPID.peak-ovl-input
> {DEEPID.PROC.DATE.ASSM.macs.narrow} {DEEPID.PROC.DATE.ASSM.macs.broad} {DEEPID.PROC.DATE.ASSM.hhmm.broad}
]]>
</command_line>
<loop>peak-file</loop>
<comment>
Pipeline-internal on-the-fly peak flagging and standardization: for MACS2 output, score
column is rescaled to range 0...1000, for histoneHMM, default values are added to fulfill
standard broadPeak/narrowPeak format specifications.
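The score rescaling described above could look like the following sketch. The source does not state which rescaling the pipeline applies; linear min-max scaling is assumed here as one plausible choice, and the function name is hypothetical.

```python
def rescale_scores(scores, lower=0, upper=1000):
    """Linearly map scores onto [lower, upper] and round to integers (min-max scaling)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: nothing to spread out, assign the upper bound
        return [upper for _ in scores]
    span = float(hi - lo)
    return [int(round(lower + (s - lo) / span * (upper - lower))) for s in scores]


# Illustrative input values only
rescaled = rescale_scores([12.0, 56.0, 100.0])
print(rescaled)  # [0, 500, 1000]
```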
</comment>
</tool>
<tool>
<name>sambamba</name>
<version>0.6.6</version>
<command_line>
<![CDATA[
sambamba view --format=bam --nthreads={sambamba_parallel} --output-filename DEEPID.PROC.DATE.tmp.auto.bam
--regions={autosome_regions} DEEPID.tmp.filt.bam
]]>
</command_line>
<loop>DEEPID.tmp.filt.bam</loop>
<comment>
Restrict filtered BAM files to autosomal regions. These BAM files will be used to plot the correlation heatmaps.
</comment>
</tool>
<tool>
<name>multiBamSummary</name>
<version>2.5.3</version>
<command_line>
<![CDATA[
multiBamSummary bins -p 8 --bamfiles DEEPID.tmp.auto.bam --outFileName {DEEPID.PROC.DATE.ASSM.auto.summ}
--labels {plot_labels} --binSize 1000 --distanceBetweenBins 2000 --blackListFileName {blacklist_regions}
--outRawCounts {DEEPID.PROC.DATE.ASSM.auto.counts-summ}
]]>
</command_line>
<loop>no looping</loop>
<comment>Create data matrix for correlation plot on filtered BAM files; remove blacklist regions on the fly</comment>
</tool>
<tool>
<name>plotCorrelation</name>
<version>2.5.3</version>
<command_line>
<![CDATA[
plotCorrelation bins --corData SAMPLEID.npz --plotFile {DEEPID.PROC.DATE.ASSM.bamcorr} --whatToPlot heatmap
--plotTitle {plot_title} --plotFileFormat svg --corMethod corr_method
--plotNumbers --zMin -1 --zMax 1 --colorMap coolwarm
--outFileCorMatrix {DEEPID.PROC.DATE.ASSM.corrmat}
]]>
</command_line>
<loop>no looping</loop>
<comment>Create heatmap correlation plot using Spearman and Pearson correlation</comment>
</tool>
</software>
</process>