From bd70b14e96d6e9e02fe98938204bb77d0a6e1920 Mon Sep 17 00:00:00 2001 From: Peter Ebert Date: Fri, 30 Dec 2016 15:16:43 +0100 Subject: [PATCH] ADD: RBA and GAL; syntactically valid, some minor content-related issues --- docs/alignment/bisulfite/RBAv0.xml | 219 +++++++++++++++++++++++++++++ docs/alignment/genome/GALv1.xml | 211 +++++++++++++++++++++++++++ 2 files changed, 430 insertions(+) create mode 100644 docs/alignment/bisulfite/RBAv0.xml create mode 100644 docs/alignment/genome/GALv1.xml diff --git a/docs/alignment/bisulfite/RBAv0.xml b/docs/alignment/bisulfite/RBAv0.xml new file mode 100644 index 0000000..a0f53df --- /dev/null +++ b/docs/alignment/bisulfite/RBAv0.xml @@ -0,0 +1,219 @@ + + + + RBA + 0 + + Karl Nordström, Charles Imbusch + karl.nordstroem@uni-saarland.de + + + The RBAv0 pipeline is a cloned version of the DEEP BAL process. It trims and aligns RRBS data to a reference genome. + + 0. Generation of MethylCtools reference index + 1. trim reads with Trim Galore! (Cutadapt) + 2. Map reads with MethylCtools (BWA) + 3. Merge bam files with Picard tools + 4. Generate a flagstat file + + Step 0 is run manually and only once. + + + + sampleID_R1.fastq.gz + FASTQ + collection + The current implementation takes a folder as input and trims and maps all fastq files in the folder + + + + + {ASSEMBLY}.fa + FASTA + single + fasta file containing genomic reference sequence + + + + + {OUTNAME}.bam + BAM + single + The resulting alignment in BAM format + + + {OUTNAME}.bam.bai + BAI + single + Index file for the alignment + + + {OUTNAME}.coverage.bw + bigWig + single + Coverage track in bigWig format. + + + {OUTNAME}.flagstat + TXT + single + The output from samtools flagstat + + + {OUTNAME}.rawCov + TXT + single + Contains a single value, the average genomic coverage + + + + + methylCtools + 0.9.2 + + no looping + Introduce C to T conversions to both strands. 
Only runs if the converted file does not exist + + + bwa + 0.7.12-r1039 + + no looping + Generates the bwa index file. This step only runs if the index does not exist + + + Trim Galore! + 0.3.3 + + sampleID_R1.fastq.gz + This step is a one to one process trimming all fastq files. The reads are trimmed for the default adapter (AGATCGGAAGAGC) and for base quality below 20 + + + methylCtools + 0.9.2 + + sampleID_R1.trimmed.fq.gz + A one to one process preparing the trimmed files for mapping by converting C to T and storing converted positions in the header + + + bwa + 0.7.12-r1039 + PIPE1.sai]]> + sampleID_R1.conv.fq + A one to one process mapping each file. Again a quality cutoff at 20. This step is piped to the next. + + + bwa + 0.7.12-r1039 + PIPE2.sam]]> + PIPE1.sai + A one to one process converting each bwa alignment from sai to sam format + + + samtools + 1.2 (using htslib 1.2.1) + PIPE3.bam]]> + PIPE2.sam + A one to one process converting the alignment to bam format + + + methylCtools + 0.9.2 + + PIPE3.bam + A one to one process converting the reads in the alignment files back to their raw format, undoing the C to T conversion. + + + samtools + 1.2 (using htslib 1.2.1) + PIPE5.sam]]> + PIPE4.bam + A one to one process reconverting to sam format in order to correct some peculiarities introduced by bwa + + + awk + 4.0.1 + PIPE6.sam ]]> + PIPE5.sam + A one to one process removing all mapping information present for unmapped reads. bwa sometimes adds this information for unmapped reads. + + + Picardtools + 1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010) + + PIPE6.sam + A one to one process adding reads to read groups according to FLOWCELL and LANE, which are placeholders replaced with the corresponding values. + + + Picardtools + 1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010) + + no looping + Merges all the generated bam files. If multiple fastq files were used as input, I=sampleID_R1.bam has to be repeated once for each generated bam file. 
+ + + samtools + 1.2 (using htslib 1.2.1) + + no looping + Generating the index file {OUTNAME}.bam.bai + + + + samtools + 1.2 (using htslib 1.2.1) + PIPE7.txt]]> + no looping + Extracting the header of the bam file in order to get chromosome lengths for the generation of the coverage file + + + awk + 4.0.1 + ref.lengths]]> + no looping + Extracting the chromosome lengths from the sam header + + + bedtools + v2.20.1 + coverage.bw.tmp ]]> + no looping + Calculating base pair resolution coverage in bedGraph format + + + bedGraphToBigWig + v 4 + + no looping + Converting the bedGraph file to bigWig format + + + samtools + 1.2 (using htslib 1.2.1) + PIPE8.bam ]]> + no looping + Filtering non-primary alignments (256) and reads flagged as PCR or optical duplicates (1024) before calculating average coverage + + + samtools + 1.2 (using htslib 1.2.1) + PIPE9.txt]]> + no looping + Converting to pileup format, limiting to regions with a coverage below 100000 + + + awk + 4.0.1 + {OUTNAME}.rawCov ]]> + no looping + Calculating the average coverage + + + samtools + 1.2 (using htslib 1.2.1) + {OUTNAME}.flagstat]]> + no looping + Generating the flagstat file + + + diff --git a/docs/alignment/genome/GALv1.xml b/docs/alignment/genome/GALv1.xml new file mode 100644 index 0000000..bef8a57 --- /dev/null +++ b/docs/alignment/genome/GALv1.xml @@ -0,0 +1,211 @@ + + + + GAL + 1 + + Barbara Hutter + b.hutter@dkfz.de + + + + * mapping of raw sequences to the reference genome + - N pairs of fastq files that are processed into bam files separately and merged into one at the end + + + + + SampleID_R1 + FASTQ + collection + Raw input file with the forward read of the pair ("read1"); reads that failed the Illumina chastity filter have already been removed + + + SampleID_R2 + FASTQ + collection + Raw input file with the reverse read of the pair ("read2"); reads that failed the Illumina chastity filter have already been removed + + + + + reference_genome + FASTA + single + The reference genome file; see aspera.dkfz.de > download > results > references > genomes > human/mouse > 
WholeGenome + + + + + + DEEPID.PROC.DATE.bam + BAM + single + The bam file merged from the alignments of all input fastq files; duplicates are marked + + + DEEPID.PROC.DATE.bai + BAI + single + Corresponding BAM index file, produced during merging and duplicate marking + + + DEEPID.PROC.DATE.flagstats + text + single + Simple alignment statistics of the merged, duplicate-marked bam + + + DEEPID.PROC.DATE.QcSummary + text + single + A summary of alignment statistics such as number of reads, percent of aligned reads, coverage of the genome, duplication level, etc. + + + DEEPID.PROC.DATE.PicardMarkDupmetrics + text + single + Produced by Picard CollectMultipleMetrics + + + DEEPID.PROC.DATE.PicardAlignmentSummarymetrics + text + single + produced by Picard CollectMultipleMetrics + + + DEEPID.PROC.DATE.PicardInsertSizemetrics + text + single + produced by Picard CollectMultipleMetrics + + + DEEPID.PROC.DATE.PicardQualityByCyclemetrics + text + single + produced by Picard CollectMultipleMetrics + + + DEEPID.PROC.DATE.PicardQualityDistributionmetrics + text + single + produced by Picard CollectMultipleMetrics + + + DEEPID.PROC.DATE.PicardInsertSizeHistogram + PDF + single + produced by Picard CollectMultipleMetrics + + + DEEPID.PROC.DATE.PicardQualityByCyclemetrics + PDF + single + produced by Picard CollectMultipleMetrics + + + DEEPID.PROC.DATE.PicardQualityDistributionmetrics + PDF + single + produced by Picard CollectMultipleMetrics + + + + + bwa + cnybwa-0.6.2 + sampleID_R*.sai ]]> + SampleID_R* + Production of an intermediate .sai file for each read1 and read2 fastq file, performed on Convey machines. cnybwa-0.6.2 is a hardware re-implementation of bwa version 0.6.2. -t is the number of threads, -q the parameter for iterative quality trimming of the read down to 35 bp + + + bwa + 0.6.2-tpx + + sampleID_Sampe_output + ]]> + + SampleID_R* + + Pairing of reads to SAM format, piped to next step (samtools view). 
+ Parameters: -a to set maximum insert size to 1000 bp, -t number of threads, -P pre-load index, -T use original buffer size. + The parameter readgroupinformation is initialized in the script as "@RG\tID:$ID\tSM:$SM\tLB:$LB\tPL:ILLUMINA", + where $ID is composed of run and lane (e.g. run140918_SN7001180_0145_C451VACXX_44_Mm08_WEAd_Db1_H3K9me3_F_1_ACAGTG_L001), + $SM is the sample type (e.g. sample_replicate1-H3K9me3_44_Mm08_WEAd_Db1), + and $LB is the library (e.g. replicate1-H3K9me3_44_Mm08_WEAd_Db1). + These variables are derived from the file path of the fastq files. + + + + samtools + 0.1.19 + bamfile ]]> + sampleID_Sampe_output + Input piped from previous step (bwa sampe), conversion of SAM to BAM and sorting by coordinate + + + Picard + 1.125 + + + + no looping + + Merging of per-lane bam files, marking of duplicates and index creation {DEEPID.PROC.DATE.bai}. + The Picard command line takes one I=bamfile argument for each input bam file; this is simplified in the command line above. + Version 1.61 was used previously; version 1.125 has been used since January 2015. + + + + samtools + 0.1.19 + {DEEPID.PROC.DATE.flagstats} ]]> + no looping + Simple alignment statistics of the merged, duplicate-marked bam + + + + Picard + 1.61 + + + + no looping + + creates several output files: + DEEPID.PROC.DATE.PicardAlignmentSummarymetrics + DEEPID.PROC.DATE.PicardInsertSizemetrics + DEEPID.PROC.DATE.PicardQualityByCyclemetrics + DEEPID.PROC.DATE.PicardQualityDistributionmetrics + DEEPID.PROC.DATE.PicardQualityByCyclemetrics + DEEPID.PROC.DATE.PicardQualityDistributionmetrics + + + + QCsummary + n/a + + {DEEPID.PROC.DATE.QcSummary} + ]]> + + no looping + A custom Perl script; because it is part of the DKFZ whole genome pipeline, it also reads in files that are not relevant for DEEP. + + +