diff --git a/docs/quantification/transcriptome/EXPv1.xml b/docs/quantification/transcriptome/EXPv1.xml new file mode 100644 index 0000000..a18b242 --- /dev/null +++ b/docs/quantification/transcriptome/EXPv1.xml @@ -0,0 +1,146 @@ + + + + EXP + 1 + + Matthias Barann + m.barann@ikmb.uni-kiel.de + + + * bam2wig.py: Converts a BAM file to BigWig coverage tracks. One track per strand will be generated. + * htseq-count: Generates read counts on the gene level. + * cufflinks: Generates FPKM values for genes and transcript isoforms. + + + + .bam + + single + Unfiltered aligned reads + + + .bai + + single + Index file for the BAM file + + + + + chromInfo.txt + text file + single + Tab-delimited file containing the name and length of the reference sequences: [name][tab][length]. + + + gencode.v19.annotation.gtf + GTF + single + Gencode gene annotation file in gene transfer format. + + + reference.fa + multi fasta + single + The reference genome file; see aspera.dkfz.de > download > results > references > genomes > human > WholeGenome + + + + + [sampleID].EXPv1.[DATE].bamcov.Forward.wig + wiggle + single + Forward strand wiggle file. Usually it is not necessary to keep this file. + + + [sampleID].EXPv1.[DATE].bamcov.Reverse.wig + wiggle + single + Reverse strand wiggle file. Usually it is not necessary to keep this file. + + + [sampleID].EXPv1.[DATE].bamcov.Forward.bw + BigWig + single + Forward strand BigWig file. This file will only be generated if the UCSC program wigToBigWig can be found in $PATH. + + + [sampleID].EXPv1.[DATE].bamcov.Reverse.bw + BigWig + single + Reverse strand BigWig file. This file will only be generated if the UCSC program wigToBigWig can be found in $PATH. + + + [sampleID].EXPv1.[DATE].readcounts.txt + text file + single + This file contains the read counts on the gene level. + + + [sampleID].EXPv1.[DATE].genes.fpkm.tracking + text file + single + Output file containing the FPKM counts on the gene level.
+ + + [sampleID].EXPv1.[DATE].isoforms.fpkm.tracking + text file + single + Output file containing the FPKM counts on the isoform level. + + + [sampleID].EXPv1.[DATE].transcripts.gtf + gene transfer format + single + This file contains assembled transcripts. + + + + + Python + 2.7 + + no looping + + + + Samtools + 0.1.19-44428cd + + no looping + + + + bam2wig.py + 2.3.9 + + no looping + This Python script is part of the RSeQC software. It converts a BAM file into two wig files (one for each strand). \ + If the script can locate the UCSC program wigToBigWig, the generated wig files are automatically converted to BigWig. \ + Please note that for some samples the wigToBigWig command might exit with errors. In this case, manually invoking the wigToBigWig \ + command on the generated wig files can solve the problem: \ + wigToBigWig ${_sample}_Forward.wig -s ChromInfo.txt > ${_sample}_Forward.bw + + + htseq-count + 0.5.4p3 + ${_sample}.sam + htseq-count -s reverse -m intersection-strict -a 20 ${_sample}.sam gencode.v19.annotation.gtf > ${_sample}_htseq.txt ]]> + + no looping + htseq-count requires BAM files sorted by read name (step 1). After sorting, all non-primary alignments are removed during the BAM to SAM conversion. \ + htseq-count then counts the number of reads per gene. Using the mode 'intersection-strict' results in a rather conservative read count. \ + Please see http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count for further information. + + + cufflinks + v2.0.2 + + + no looping + Please see http://cufflinks.cbcb.umd.edu/manual.html for further information. + + + diff --git a/docs/quantification/transcriptome/LXPv1.xml b/docs/quantification/transcriptome/LXPv1.xml new file mode 100644 index 0000000..5177b5f --- /dev/null +++ b/docs/quantification/transcriptome/LXPv1.xml @@ -0,0 +1,135 @@ + + + LXP + 1 + + Anupam Sinha + a.sinha@ikmb.uni-kiel.de + + + + * htseq-count: Generates read counts on the gene level.
+ * cufflinks: Generates FPKM values for genes and transcript isoforms. + * StringTie: Generates FPKM values for genes and transcript isoforms. Also generates .ctab files for analysis using Ballgown. + + + + + .bam + + single + Unfiltered aligned reads + + + + + + gencode.v19.annotation.gtf + GTF + single + Gencode gene annotation file in gene transfer format. + + + reference.fa + multi fasta + single + The reference genome file; see aspera.dkfz.de > download > results > references > genomes > human > WholeGenome + + + + + + [sampleID].LXPv1.[DATE].readcounts.txt + text file + single + This file contains the read counts on the gene level. + + + [sampleID].LXPv1.[DATE].genes.fpkm.tracking + text file + single + Output file containing the FPKM counts on the gene level. + + + [sampleID].LXPv1.[DATE].isoforms.fpkm.tracking + text file + single + Output file containing the FPKM counts on the isoform level. + + + [sampleID].LXPv1.[DATE].transcripts.gtf + gene transfer format + single + This file contains the transcripts assembled by cufflinks. + + + [sampleID].LXPv1.[DATE].stringtie.gtf + gene transfer format + single + This file contains the transcripts assembled by StringTie. + + + [sampleID].LXPv1.[DATE].ballgown + tab-separated fields (.ctab) format + five + This is a folder containing 5 .ctab files. These .ctab files contain the expression values of exons, introns and transcripts. Two files list the internal (generated by Ballgown) association IDs between exons, introns, and transcripts. + + + + + + Python + 2.7 + + no looping + + + + Samtools + 0.1.19-44428cd + + no looping + + + + htseq-count + 0.6.1p1 + samtools sort -n -@ 8 -m 4G ${_sample}.bam ${_sample}_sorted + samtools/samtools view -F 256 ${_sample}_sorted.bam > ${_sample}.sam + htseq-count -s reverse -m union -a 20 ${_sample}.sam gencode.v19.annotation.gtf > ${_sample}_htseq.txt + + no looping + htseq-count requires BAM files sorted by read name (step 1). After sorting, all non-primary alignments are removed during the BAM to SAM conversion.
\ + Invoking htseq-count counts the number of reads per gene. \ + Please see http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count for further information. + + + + cufflinks + v2.0.2 + + + + no looping + Please see http://cufflinks.cbcb.umd.edu/manual.html for further information. + + + StringTie + v1.0.3 + + + + no looping + Please see http://ccb.jhu.edu/software/stringtie/ for further information. \ + "-b" option creates a folder which contains the .ctab files for analysis using Ballgown. \ + Please see https://github.com/alyssafrazee/ballgown for further information. + + + + + diff --git a/docs/quantification/transcriptome/SXPv1.xml b/docs/quantification/transcriptome/SXPv1.xml new file mode 100644 index 0000000..589091e --- /dev/null +++ b/docs/quantification/transcriptome/SXPv1.xml @@ -0,0 +1,207 @@ + + + + SXP + 1 + + Filippos Klironomos + filippos.klironomos@mdc-berlin.de + + + *) miRDeep2 pipeline involves: + *) mapping of reads to genome and keeping those uniquely mapped + *) extracting bracketing DNA of the uniquely mapped reads + *) RNAfold extracted sequences and keeping those that form unbifurcated hairpins + *) scoring putative precursors: + *) expect greater number of reads mapping to either the -5p or -3p strand and very little to the hairpin + *) short 3' duplex overhang characteristic of Drosha/Dicer processing adds to the score + *) relative and absolute stabilities contribute to the score + *) if 5' end of mature sequence is identical to that of known mature sequence it adds to the score + *) randomly permuting read signatures with putative precursor sequences in order to determine the FPR + Internally miRDeep2 uses the following packages: + RNAfold version 2.1.7 + RANDFOLD version 2 + + + + config + TSV + single + + this is the configuration file that miRDeep2 uses to locate the FASTQ library and assign the 3-character identification to it + + + + + + genome + fasta + single + + hs37d5 and GRCm38mm10 genomes are modified as follows: 
+ *) IDs are simplified, everything to the right of the first whitespace encountered is removed, + *) all ambiguously called nucleotides [URYSWKMBDHV] have been masked to "N". The following script does all this: + sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' hs37d5.fa > hs37d5_simple.fa + sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' GRCm38mm10.fa > GRCm38mm10_simple.fa + ]]> + + + + genome_index + bowtie-index + collection + + bowtie version 0.12.7 index of hs37d5_simple.fa and GRCm38mm10_simple.fa generated as follows: + bowtie-build -f hs37d5_simple.fa hs37d5_simple.fa + bowtie-build -f GRCm38mm10_simple.fa GRCm38mm10_simple.fa + + + + miRBase_mature + fasta + single + mature known miRNA reference from miRBase Release 20 uploaded to ASPERA + + + miRBase_hairpin + fasta + single + precursor (hairpin) known miRNA reference from miRBase Release 20 uploaded to ASPERA + + + + + SampleID.SXPv1.DATE.known.csv + csv + single + + expression of known miRNAs quantified by miRDeep2 + + + + SampleID.SXPv1.DATE.known.bed + bed + single + + BED track of expression of known miRNAs quantified by miRDeep2 + + + + SampleID.SXPv1.DATE.known.bedGraph + bedGraph + single + + bedGraph track of expression of known miRNAs quantified by miRDeep2 + + + + SampleID.SXPv1.DATE.novel.bed + bed + single + + BED track of expression of novel miRNAs predicted by miRDeep2 + + + + SampleID.SXPv1.DATE.novel.bedGraph + bedGraph + single + + bedGraph track of expression of novel miRNAs predicted by miRDeep2 + + + + + + generate_config + missing + + config ]]> + + no looping + + this command creates the configuration file for miRDeep2 to use in order to locate the FASTQ library {SampleID.fastq} and assign + a 3-letter internal ID to it, in this case ID1 + + + + mapper.pl + miRDeep2.0.0.6 + + mapper_summary.log ]]> + + no looping + + use the configuration file to locate the library; remove adaptor provided by {Adaptor}; + collapse the reads to the file
"read_collapsed.fa"; + map to the reference and output the alignments in the file "reads_vs_genome.arf"; + print out summary in "mapper_summary.log" + + The ARF is a text-based format consisting of the following columns: + + readID # the ID of the read + readLength # length of the read + start # start position of the alignment relative to the read + end # end position of the alignment relative to the read + readSeq # sequence of the read + chr # chromosome of reference where read maps + refLength # length of the reference sequence where read maps to + start # start position of reference sequence where read maps to + end # end position of reference sequence where read maps to + referenceSeq # reference sequence where read maps to + strand # strand of reference + mm # number of mismatches in the alignment + MAPQ-like-string # m==perfect match, M==mismatch + + + + miRDeep2 + miRDeep2.0.0.6 + + miRDeep2.report.log ]]> + + no looping + quantify known miRNAs and predict putative novel miRNAs across samples + + + rename_according_to_metadata_standards + missing + + + + no looping + rename output data file to conform to metadata naming standards + + + mirdeep2_csv2bed.pl + missing + + "{SampleID}.SXPv1.{DATE}.novel.bed" + cat "novel_pres_DATE_t_TIME_score-50_to_na.bed" >> "{SampleID}.SXPv1.{DATE}.novel.bed" + ]]> + + no looping + + Generate BED tracks from the total precursor read counts of known and novel miRNAs and rename them according to metadata standards. + This tool has been uploaded to ASPERA. 
+ + + + bed_to_bedGraph + missing + + FILENAME"Graph"; print $1,$2,$3,$5 >> FILENAME"Graph"} NR>3 {print $1,$2,$3,$5 >> FILENAME"Graph"}' "{SampleID}.SXPv1.{DATE}.known.bed" + gawk 'NR==1 {print "track type=bedGraph description=\"miRDeep2 novel miRNAs\" visibility=2 color=0,0,255 altColor=255,0,0" > FILENAME"Graph"; print $1,$2,$3,$5 >> FILENAME"Graph"} NR>1 {print $1,$2,$3,$5 >> FILENAME"Graph"}' "{SampleID}.SXPv1.{DATE}.novel.bed" + ]]> + + no looping + convert BED tracks to bedGraph + + + diff --git a/docs/quantification/transcriptome/SXPv2.xml b/docs/quantification/transcriptome/SXPv2.xml new file mode 100644 index 0000000..63ad5f6 --- /dev/null +++ b/docs/quantification/transcriptome/SXPv2.xml @@ -0,0 +1,227 @@ + + + + SXP + 2 + + Filippos Klironomos + filippos.klironomos@mdc-berlin.de + + + *) miRDeep2 pipeline involves: + + *) mapping of reads to genome and keeping those uniquely mapped + *) extracting bracketing DNA of the uniquely mapped reads + *) RNAfold extracted sequences and keeping those that form unbifurcated hairpins + *) scoring putative precursors: + *) expect greater number of reads mapping to either the -5p or -3p strand and very little to the hairpin + *) short 3' duplex overhang characteristic of Drosha/Dicer processing adds to the score + *) relative and absolute stabilities contribute to the score + *) if 5' end of mature sequence is identical to that of known mature sequence it adds to the score + *) randomly permuting read signatures with putative precursor sequences in order to determine the FPR + + Internally miRDeep2 uses the following packages: + + RNAfold version 2.1.7 + RANDFOLD version 2 + + + + config + TSV + single + + this is the configuration file that miRDeep2 uses to locate the FASTQ library and assign the 3-character identification to it + + + + + + + genome + fasta + single + + sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' hs37d5.fa > hs37d5_simple.fa + sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e 
'/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' GRCm38mm10.fa > GRCm38mm10_simple.fa +]]> + + + + genome_index + bowtie-index + collection + + bowtie version 1.1.1 index of hs37d5_simple.fa and GRCm38mm10_simple.fa generated as follows: + + bowtie-build -f hs37d5_simple.fa hs37d5_simple.fa + bowtie-build -f GRCm38mm10_simple.fa GRCm38mm10_simple.fa + + + + miRBase_mature + fasta + single + mature known miRNA reference from miRBase Release 20 uploaded to ASPERA + + + miRBase_hairpin + fasta + single + precursor (hairpin) known miRNA reference from miRBase Release 20 uploaded to ASPERA + + + + + + SampleID.SXPv2.DATE.known.csv + csv + single + + expression of known miRNAs quantified by miRDeep2 + + + + SampleID.SXPv2.DATE.known.bed + bed + single + + BED track of expression of known miRNAs quantified by miRDeep2 + + + + SampleID.SXPv2.DATE.known.bedGraph + bedGraph + single + + bedGraph track of expression of known miRNAs quantified by miRDeep2 + + + + SampleID.SXPv2.DATE.novel.bed + bed + single + + bed track of expression of novel miRNAs predicted by miRDeep2 + + + + SampleID.SXPv2.DATE.novel.bedGraph + bedGraph + single + + bedGraph track of expression of novel miRNAs predicted by miRDeep2 + + + + + + + + generate_config + missing + + config + ]]> + + no looping + + this command creates the configuration file for miRDeep2 to use in order to locate the FASTQ library {SampleID.fastq} and assign + a 3-letter internal ID to it, in this case ID1 + + + + mapper.pl + miRDeep2.0.0.7 + + mapper_summary.log + ]]> + + no looping + + use the configuration file to locate the library; remove adaptor provided by {Adaptor}; + collapse the reads to the file "read_collapsed.fa"; + map to the reference and output the alignments in the file "reads_vs_genome.arf"; + print out summary in "mapper_summary.log" + + The ARF is a text-based format consisting of the following columns: + + readID # the ID of the read + readLength # length of the read + start # start position of the alignment relative 
to the read + end # end position of the alignment relative to the read + readSeq # sequence of the read + chr # chromosome of reference where read maps + refLength # length of the reference sequence where read maps to + start # start position of reference sequence where read maps to + end # end position of reference sequence where read maps to + referenceSeq # reference sequence where read maps to + strand # strand of reference + mm # number of mismatches in the alignment + MAPQ-like-string # m==perfect match, M==mismatch + + + + miRDeep2 + miRDeep2.0.0.7 + + miRDeep2.report.log +]]> + + no looping + quantify known miRNAs and predict putative novel miRNAs across samples + + + rename_according_to_metadata_standards + missing + + + + no looping + rename output data file to conform to metadata naming standards + + + mirdeep2_csv2bed.pl + missing + + "{SampleID}.SXPv2.{DATE}.novel.bed" + cat "novel_pres_DATE_t_TIME_score-50_to_na.bed" >> "{SampleID}.SXPv2.{DATE}.novel.bed" +]]> + + no looping + + Generate BED tracks from the total precursor read counts of known and novel miRNAs and rename them according to metadata standards. + This tool has been uploaded to ASPERA. + + + + bed_to_bedGraph + missing + + FILENAME"Graph"; print $1,$2,$3,$5 >> FILENAME"Graph"} NR>3 {print $1,$2,$3,$5 >> FILENAME"Graph"}' "{SampleID}.SXPv2.{DATE}.known.bed" + gawk 'NR==1 {print "track type=bedGraph description=\"miRDeep2 novel miRNAs\" visibility=2 color=0,0,255 altColor=255,0,0" > FILENAME"Graph"; print $1,$2,$3,$5 >> FILENAME"Graph"} NR>1 {print $1,$2,$3,$5 >> FILENAME"Graph"}' "{SampleID}.SXPv2.{DATE}.novel.bed" +]]> + + no looping + convert BED tracks to bedGraph + + +