<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://deep.mpi-inf.mpg.de/DAC/files/style/deep_process_style.css"?>
<process>
	<name>GAL</name>
	<version>1</version>
	<author>
		<name>Barbara Hutter</name>
		<email>b.hutter@dkfz.de</email>
	</author>
	<!-- Following section: free text description of process (what, how, why) -->
	<description>
		* Mapping of raw sequences to the reference genome.
		- The input consists of N pairs of FASTQ files; each pair is processed into a BAM file separately, and all BAM files are merged into one at the end.
	</description>
	<!-- Following section: list input files [samples to be analysed and similar] -->
	<inputs>
		<filetype>
			<identifier>SampleID_R1</identifier>
			<format>FASTQ</format>
			<quantity>collection</quantity>
			<comment>raw input file with the forward read of the pair ("read1"); reads that failed the Illumina chastity filter have already been removed</comment>
		</filetype>
		<filetype>
			<identifier>SampleID_R2</identifier>
			<format>FASTQ</format>
			<quantity>collection</quantity>
			<comment>raw input file with the reverse read of the pair ("read2"); reads that failed the Illumina chastity filter have already been removed</comment>
		</filetype>
	</inputs>
	<references>
		<filetype>
			<identifier>reference_genome</identifier>
			<format>FASTA</format>
			<quantity>single</quantity>
			<comment>The reference genome file; see aspera.dkfz.de > download > results > references > genomes > human/mouse > WholeGenome</comment>
		</filetype>
	</references>
	<!-- Following section: list output files [results produced by the process] -->
	<outputs>
		<filetype>
			<identifier>DEEPID.PROC.DATE.bam</identifier>
			<format>BAM</format>
			<quantity>single</quantity>
			<comment>the BAM file merged from all input FASTQ files, with duplicates marked</comment>
		</filetype>
		<filetype>
			<identifier>DEEPID.PROC.DATE.bai</identifier>
			<format>BAI</format>
			<quantity>single</quantity>
			<comment>Corresponding BAM index file, produced during merging and duplicate marking</comment>
		</filetype>
		<filetype>
			<identifier>DEEPID.PROC.DATE.flagstats</identifier>
			<format>text</format>
			<quantity>single</quantity>
			<comment>simple alignment statistics of the merged, duplicate-marked BAM file</comment>
		</filetype>
		<filetype>
			<identifier>DEEPID.PROC.DATE.QcSummary</identifier>
			<format>text</format>
			<quantity>single</quantity>
			<comment>A summary of alignment statistics such as number of reads, percent of aligned reads, coverage of the genome, duplication level, etc.</comment>
		</filetype>
		<filetype>
			<identifier>DEEPID.PROC.DATE.PicardMarkDupmetrics</identifier>
			<format>text</format>
			<quantity>single</quantity>
			<comment>produced by Picard MarkDuplicates (the file given as METRICS_FILE)</comment>
		</filetype>
		<filetype>
			<identifier>DEEPID.PROC.DATE.PicardAlignmentSummarymetrics</identifier>
			<format>text</format>
			<quantity>single</quantity>
			<comment>produced by Picard CollectMultipleMetrics</comment>
		</filetype>
		<filetype>
			<identifier>DEEPID.PROC.DATE.PicardInsertSizemetrics</identifier>
			<format>text</format>
			<quantity>single</quantity>
			<comment>produced by Picard CollectMultipleMetrics</comment>
		</filetype>
		<filetype>
			<identifier>DEEPID.PROC.DATE.PicardQualityByCyclemetrics</identifier>
			<format>text</format>
			<quantity>single</quantity>
			<comment>produced by Picard CollectMultipleMetrics</comment>
		</filetype>
		<filetype>
			<identifier>DEEPID.PROC.DATE.PicardQualityDistributionmetrics</identifier>
			<format>text</format>
			<quantity>single</quantity>
			<comment>produced by Picard CollectMultipleMetrics</comment>
		</filetype>
		<filetype>
			<identifier>DEEPID.PROC.DATE.PicardInsertSizeHistogram</identifier>
			<format>PDF</format>
			<quantity>single</quantity>
			<comment>produced by Picard CollectMultipleMetrics</comment>
		</filetype>
		<filetype>
			<identifier>DEEPID.PROC.DATE.PicardQualityByCyclemetrics</identifier>
			<format>PDF</format>
			<quantity>single</quantity>
			<comment>produced by Picard CollectMultipleMetrics</comment>
		</filetype>
		<filetype>
			<identifier>DEEPID.PROC.DATE.PicardQualityDistributionmetrics</identifier>
			<format>PDF</format>
			<quantity>single</quantity>
			<comment>produced by Picard CollectMultipleMetrics</comment>
		</filetype>
	</outputs>
	<software>
		<tool>
			<name>bwa</name>
			<version>cnybwa-0.6.2</version>
			<command_line><![CDATA[ cnybwa-0.6.2 aln -t 12 -q 20 {reference_genome} {SampleID_R*} > sampleID_R*.sai ]]></command_line>
            <loop>SampleID_R*</loop>
			<comment>Production of an intermediate .sai file for each read1 and read2 FASTQ file, performed on Convey machines. cnybwa-0.6.2 is a hardware re-implementation of bwa version 0.6.2. Parameters: -t is the number of threads; -q is the quality threshold for iterative trimming of the read, which shortens it to no less than 35 bp.</comment>
		</tool>
		<tool>
			<name>bwa</name>
			<version>0.6.2-tpx</version>
			<command_line>
				<![CDATA[
					bwa sampe -P -a 1000 -T -t 8 -r readgroupinformation {reference_genome}
					sampleID_R1.sai sampleID_R2.sai {SampleID_R1} {SampleID_R2} > sampleID_Sampe_output
				]]>
			</command_line>
            <loop>SampleID_R*</loop>
			<comment>
				Pairing of reads and output in SAM format, piped to the next step (samtools view).
				Parameters: -a sets the maximum insert size to 1000 bp, -t the number of threads, -P pre-loads the index, -T uses the original buffer size.
				The parameter readgroupinformation is initialized in the script as "@RG\tID:$ID\tSM:$SM\tLB:$LB\tPL:ILLUMINA",
				where $ID is composed of run and lane (e.g. run140918_SN7001180_0145_C451VACXX_44_Mm08_WEAd_Db1_H3K9me3_F_1_ACAGTG_L001),
				$SM the sampletype (e.g. sample_replicate1-H3K9me3_44_Mm08_WEAd_Db1),
				and $LB the library (e.g. replicate1-H3K9me3_44_Mm08_WEAd_Db1).
				These variables are constructed according to the file path of the fastq files.
			</comment>
		</tool>
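		<!--
		The read group string described in the bwa sampe step above could be assembled as in the
		following minimal Python sketch. The function name build_read_group and its argument names
		are hypothetical and not part of the pipeline. How $ID, $SM and $LB are derived from the
		FASTQ file path is not shown, because this document does not specify it.

```python
def build_read_group(rg_id: str, sample: str, library: str) -> str:
    """Return an @RG string following the template documented for bwa sampe -r.

    The backslash-t sequences are kept as two literal characters, exactly as
    they are written in the documented template (assumption: the aligner, not
    this helper, is responsible for turning them into tab characters).
    """
    for value in (rg_id, sample, library):
        if not value:
            raise ValueError("read group fields must not be empty")
    return f"@RG\\tID:{rg_id}\\tSM:{sample}\\tLB:{library}\\tPL:ILLUMINA"


if __name__ == "__main__":
    # Example values copied from the comment of the bwa sampe step.
    print(build_read_group(
        "run140918_SN7001180_0145_C451VACXX_44_Mm08_WEAd_Db1_H3K9me3_F_1_ACAGTG_L001",
        "sample_replicate1-H3K9me3_44_Mm08_WEAd_Db1",
        "replicate1-H3K9me3_44_Mm08_WEAd_Db1",
    ))
```
		-->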
		<tool>
			<name>samtools</name>
			<version>0.1.19</version>
			<command_line><![CDATA[ cat sampleID_Sampe_output | samtools view -uSbh - | samtools sort -o - > bamfile ]]></command_line>
            <loop>sampleID_Sampe_output</loop>
			<comment>Input piped from previous step (bwa sampe), conversion of SAM to BAM and sorting by coordinate</comment>
		</tool>
		<tool>
			<name>Picard</name>
			<version>1.125</version>
			<command_line>
				<![CDATA[
					java8 -Xmx50G -jar picard-tools-1.125.jar MarkDuplicates I=bamfile*
					OUTPUT={DEEPID.PROC.DATE.bam} VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=FALSE
					ASSUME_SORTED=TRUE CREATE_INDEX=TRUE MAX_RECORDS_IN_RAM=12500000
					METRICS_FILE={DEEPID.PROC.DATE.PicardMarkDupmetrics}
				]]>
			</command_line>
            <loop>no looping</loop>
			<comment>
				Merging of per-lane BAM files, marking of duplicates, and creation of the index {DEEPID.PROC.DATE.bai}.
				The actual Picard command line receives one I=bamfile argument per input BAM file; this is abbreviated as I=bamfile* in the command line above.
				Version 1.61 was used previously; version 1.125 has been in use since January 2015.
			</comment>
		</tool>
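		<!--
		The expansion of I=bamfile* into one I= argument per BAM file, as described in the
		MarkDuplicates step above, could look like the following minimal Python sketch. The function
		name mark_duplicates_args and the file names in the example are hypothetical; the fixed
		options are copied from the documented command line, and the list is only built here, not
		executed.

```python
from typing import List


def mark_duplicates_args(bam_files: List[str], output_bam: str, metrics_file: str) -> List[str]:
    """Build the argument list for the documented Picard MarkDuplicates call."""
    if not bam_files:
        raise ValueError("at least one input BAM file is required")
    args = ["java8", "-Xmx50G", "-jar", "picard-tools-1.125.jar", "MarkDuplicates"]
    # One I=<file> argument per per-lane BAM file.
    args += [f"I={bam}" for bam in bam_files]
    # Remaining options exactly as listed in the command line above.
    args += [
        f"OUTPUT={output_bam}",
        "VALIDATION_STRINGENCY=SILENT",
        "REMOVE_DUPLICATES=FALSE",
        "ASSUME_SORTED=TRUE",
        "CREATE_INDEX=TRUE",
        "MAX_RECORDS_IN_RAM=12500000",
        f"METRICS_FILE={metrics_file}",
    ]
    return args


if __name__ == "__main__":
    # Hypothetical per-lane file names, for illustration only.
    print(" ".join(mark_duplicates_args(
        ["lane1.bam", "lane2.bam"],
        "DEEPID.PROC.DATE.bam",
        "DEEPID.PROC.DATE.PicardMarkDupmetrics",
    )))
```
		-->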
		<tool>
			<name>samtools</name>
			<version>0.1.19</version>
			<command_line><![CDATA[ samtools flagstat {DEEPID.PROC.DATE.bam} > {DEEPID.PROC.DATE.flagstats} ]]></command_line>
            <loop>no looping</loop>
			<comment>simple alignment statistics of the merged, duplicate marked bam</comment>
		</tool>
	
		<tool>
			<name>Picard</name>
			<version>1.61</version>
			<command_line>
				<![CDATA[
					java -Xmx4G -cp picard-tools-1.61.jar -jar CollectMultipleMetrics.jar INPUT={DEEPID.PROC.DATE.bam}
					REFERENCE_SEQUENCE={reference_genome} ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT
					OUTPUT={DEEPID.PROC.DATE.Picard*} PROGRAM=CollectAlignmentSummaryMetrics
					PROGRAM=CollectInsertSizeMetrics PROGRAM=QualityScoreDistribution PROGRAM=MeanQualityByCycle
				]]>
			</command_line>
            <loop>no looping</loop>
			<comment>
				Creates several output files (formats as listed in the outputs section):
				DEEPID.PROC.DATE.PicardAlignmentSummarymetrics (text)
				DEEPID.PROC.DATE.PicardInsertSizemetrics (text)
				DEEPID.PROC.DATE.PicardQualityByCyclemetrics (text)
				DEEPID.PROC.DATE.PicardQualityDistributionmetrics (text)
				DEEPID.PROC.DATE.PicardInsertSizeHistogram (PDF)
				DEEPID.PROC.DATE.PicardQualityByCyclemetrics (PDF)
				DEEPID.PROC.DATE.PicardQualityDistributionmetrics (PDF)
			</comment>
		</tool>
		<tool>
			<name>QCsummary</name>
			<version>n/a</version>
			<command_line>
				<![CDATA[
					perl writeQCsummary.pl -c samplesID.coverage.txt -d samplesID.diffchrom.txt
					-f {DEEPID.PROC.DATE.flagstats} -i samplesID.insertsize.txt
					-m {DEEPID.PROC.DATE.PicardMarkDupmetrics} -l "genome" -r "all_merged"
					-p sampleID -s sampletype > {DEEPID.PROC.DATE.QcSummary}
				]]>
			</command_line>
            <loop>no looping</loop>
			<comment>A custom Perl script. Because it is part of the DKFZ whole genome pipeline, it also reads in files that are not relevant for DEEP.</comment>
		</tool>
	</software>
</process>