<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href=""?>
<name>Barbara Hutter</name>
<!-- Following section: free text description of process (what, how, why) -->
* mapping of raw sequences to the reference genome
- N pairs of fastq files that are processed into bam files separately and merged into one at the end
<!-- Following section: list input files [samples to be analysed and similar] -->
<comment>Raw input file with the forward read of the pair ("read1"); reads that failed the Illumina chastity filter have already been removed</comment>
<comment>Raw input file with the reverse read of the pair ("read2"); reads that failed the Illumina chastity filter have already been removed</comment>
<comment>The reference genome file; see > download > results > references > genomes > human/mouse > WholeGenome</comment>
<!-- Following section: list output files -->
<comment>the bam file merged from all input fastq files, duplicates are marked</comment>
<comment>Corresponding BAM index file, produced during merging and duplicate marking</comment>
<comment>simple alignment statistics of the merged, duplicate marked bam</comment>
<comment>A summary of alignment statistics such as number of reads, percent of aligned reads, coverage of the genome, duplication level, etc.</comment>
<comment>Produced by Picard CollectMultipleMetrics</comment>
<comment>produced by Picard CollectMultipleMetrics</comment>
<comment>produced by Picard CollectMultipleMetrics</comment>
<comment>produced by Picard CollectMultipleMetrics</comment>
<comment>produced by Picard CollectMultipleMetrics</comment>
<comment>produced by Picard CollectMultipleMetrics</comment>
<comment>produced by Picard CollectMultipleMetrics</comment>
<comment>produced by Picard CollectMultipleMetrics</comment>
<command_line><![CDATA[ cnybwa-0.6.2 aln -t 12 -q 20 {reference_genome} {SampleID_R*} > sampleID_R*.sai ]]></command_line>
<comment>Production of an intermediate .sai file for each read1 and read2 fastq file, performed on Convey machines. cnybwa-0.6.2 is a hardware re-implementation of bwa version 0.6.2. -t is the number of threads; -q is the parameter for iterative quality trimming of the read, down to a minimum length of 35 bp</comment>
<command_line><![CDATA[ bwa sampe -P -a 1000 -T -t 8 -r readgroupinformation {reference_genome}
sampleID_R1.sai sampleID_R2.sai {SampleID_R1} {SampleID_R2} > sampleID_Sampe_output ]]></command_line>
<comment>Pairing of reads and output in SAM format, piped to the next step (samtools view).
Parameters: -a sets the maximum insert size to 1000 bp, -t the number of threads, -P pre-loads the index, -T uses the original buffer size.
The parameter readgroupinformation is initialized in the script as "@RG\tID:$ID\tSM:$SM\tLB:$LB\tPL:ILLUMINA",
where $ID is composed of run and lane (e.g. run140918_SN7001180_0145_C451VACXX_44_Mm08_WEAd_Db1_H3K9me3_F_1_ACAGTG_L001),
$SM is the sample type (e.g. sample_replicate1-H3K9me3_44_Mm08_WEAd_Db1),
and $LB is the library (e.g. replicate1-H3K9me3_44_Mm08_WEAd_Db1).
These variables are derived from the file path of the fastq files.</comment>
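<comment><![CDATA[
Illustration only, not part of the pipeline script: a minimal Python sketch of how the read group string described above could be assembled once $ID, $SM and $LB are known. The function name build_read_group is hypothetical, and the derivation of the three values from the fastq file path is not shown because this description does not specify it.

```python
def build_read_group(rg_id: str, sample: str, library: str,
                     platform: str = "ILLUMINA") -> str:
    """Return the @RG string with literal backslash-t separators.

    The separators are kept as the two characters '\\' and 't', matching
    the "@RG\\tID:..." form shown in the step description.
    """
    fields = ["@RG", f"ID:{rg_id}", f"SM:{sample}", f"LB:{library}",
              f"PL:{platform}"]
    return "\\t".join(fields)


if __name__ == "__main__":
    # Example values copied from the step description.
    print(build_read_group(
        "run140918_SN7001180_0145_C451VACXX_44_Mm08_WEAd_Db1_H3K9me3_F_1_ACAGTG_L001",
        "sample_replicate1-H3K9me3_44_Mm08_WEAd_Db1",
        "replicate1-H3K9me3_44_Mm08_WEAd_Db1",
    ))
```

The result is the value that would stand in for readgroupinformation after the -r option.
]]></comment>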
<command_line><![CDATA[ cat sampleID_Sampe_output | samtools view -uSbh - | samtools sort -o - > bamfile ]]></command_line>
<comment>Input piped from previous step (bwa sampe), conversion of SAM to BAM and sorting by coordinate</comment>
<command_line><![CDATA[ java8 -Xmx50G -jar picard-tools-1.125.jar MarkDuplicates I=bamfile* ]]></command_line>
<loop>no looping</loop>
<comment>Merging of per-lane bam files, marking of duplicates and index creation {DEEPID.PROC.DATE.bai}.
The Picard command line receives one I=bamfile argument per input bam file; the command line above abbreviates this as I=bamfile*.
Picard version 1.61 was used previously; since January 2015, version 1.125 is used.</comment>
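<comment><![CDATA[
Illustration only: a minimal Python sketch of how the abbreviated I=bamfile* could expand into one I= argument per per-lane bam file. The function name mark_duplicates_command and the example file names are hypothetical. Only the inputs are shown in this step description, so output and metrics arguments are left out here as well.

```python
from typing import List


def mark_duplicates_command(bam_files: List[str]) -> List[str]:
    """Build the MarkDuplicates argument list with one I= entry per bam file."""
    if not bam_files:
        raise ValueError("at least one input bam file is required")
    # Fixed part of the command, copied from the step description.
    command = ["java8", "-Xmx50G", "-jar", "picard-tools-1.125.jar",
               "MarkDuplicates"]
    # One I=<file> argument per per-lane bam file.
    command.extend(f"I={bam}" for bam in bam_files)
    return command


if __name__ == "__main__":
    # Hypothetical per-lane file names, for illustration only.
    print(" ".join(mark_duplicates_command(["lane1.bam", "lane2.bam"])))
```

Keeping the arguments as a list avoids quoting problems if the result is later handed to a process launcher.
]]></comment>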
<command_line><![CDATA[ samtools flagstat {DEEPID.PROC.DATE.bam} > {DEEPID.PROC.DATE.flagstats} ]]></command_line>
<loop>no looping</loop>
<comment>simple alignment statistics of the merged, duplicate marked bam</comment>
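<comment><![CDATA[
Illustration only: a minimal Python sketch of reading the counts from a flagstat file. It assumes the common "N + M label" line layout of samtools flagstat output; the exact layout can differ between samtools versions. The function name parse_flagstat and the numbers in SAMPLE are made up for the example.

```python
import re
from typing import Dict

# Matches lines such as "950 + 0 mapped (95.00%:-nan%)".
_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")

# Invented example text, not real pipeline output.
SAMPLE = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
100 + 0 duplicates
950 + 0 mapped (95.00%:-nan%)
"""


def parse_flagstat(text: str) -> Dict[str, int]:
    """Map each flagstat label to its QC-passed read count."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        # Drop any trailing parenthetical such as "(95.00%:-nan%)".
        label = match.group(3).split(" (")[0]
        counts.setdefault(label, int(match.group(1)))
    return counts


if __name__ == "__main__":
    counts = parse_flagstat(SAMPLE)
    print(counts["mapped"] / counts["in total"])
```
]]></comment>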
<command_line><![CDATA[ java -Xmx4G -cp picard-tools-1.61.jar -jar CollectMultipleMetrics.jar INPUT={DEEPID.PROC.DATE.bam}
OUTPUT={DEEPID.PROC.DATE.Picard*} PROGRAM=CollectAlignmentSummaryMetrics
PROGRAM=CollectInsertSizeMetrics PROGRAM=QualityScoreDistribution PROGRAM=MeanQualityByCycle ]]></command_line>
<loop>no looping</loop>
<comment>Creates several output files that share the OUTPUT prefix {DEEPID.PROC.DATE.Picard*}.</comment>
perl -c samplesID.coverage.txt -d samplesID.diffchrom.txt
-f {DEEPID.PROC.DATE.flagstats} -i samplesID.insertsize.txt
-m {DEEPID.PROC.DATE.PicardMarkDupmetrics} -l "genome" -r "all_merged"
-p sampleID -s sampletype > {DEEPID.PROC.DATE.QcSummary}
<loop>no looping</loop>
<comment>A custom Perl script. Because it is part of the DKFZ whole genome pipeline, it also reads files that are not relevant for DEEP.</comment>