SALv2.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://deep.mpi-inf.mpg.de/DAC/files/style/deep_process_style.css"?>
<process>
	<name>SAL</name>
	<version>2</version>
	<author>
		<name>Filippos Klironomos</name>
		<email>filippos.klironomos@mdc-berlin.de</email>
	</author>
	<description>
    1) remove adaptors and map to reference without filtering or collapsing the reads
    2) generate coverage track from the aligned reads
    3) cluster reads that overlap windows of 501 nt around TSS/TES
    4) predict TSS/TES-miRNAs by applying the following filters in order:
       a) pick sharply defined, well-covered clusters and identify the peak of each
       b) the peak from (a) should consist of reads with an identical 5&apos; start position
       c) the average read length of the peak from (b) should be between 18 and 24 nt
       d) the average phastCons mean score of the peak from (c) should be 0.8 or above
    5) generate coverage tracks for TSS/TES-miRNAs
	</description>
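	<!--
	Illustrative sketch only, not the pipeline's own code: the numeric
	filters of step 4 in the description, written in Python. The function
	name passes_peak_filters, the (start, end, strand) tuple layout and the
	inclusive treatment of the 18 and 24 nt bounds are assumptions made for
	this example. The "sharply defined, well covered" criterion is not
	quantified in the description and is therefore left out.

```python
def passes_peak_filters(reads, mean_phastcons):
    """Apply the three numeric filters from step 4 to one candidate peak.

    reads: list of (start, end, strand) tuples in 0-based, end-exclusive
           coordinates (hypothetical layout chosen for this sketch).
    mean_phastcons: average phastCons mean score of the peak.
    """
    if not reads:
        return False
    # The 5' end of a read is its start on '+' and its end on '-'.
    five_prime = set(start if strand == "+" else end
                     for start, end, strand in reads)
    if len(five_prime) != 1:
        return False
    # Average read length must lie between 18 and 24 nt (bounds assumed inclusive).
    avg_len = sum(end - start for start, end, _ in reads) / float(len(reads))
    if not 18 <= avg_len <= 24:
        return False
    # Conservation threshold: 0.8 or above.
    return mean_phastcons >= 0.8
```

	Example: two plus-strand reads sharing start 100 with lengths 22 and 21
	and a mean phastCons score of 0.9 pass all three checks.
	-->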
	<inputs>
        <filetype>
            <identifier>library</identifier>
			<format>FASTQ</format>
            <quantity>single</quantity>
            <comment>{SampleID}.fastq library of raw reads to trim and map to the reference</comment>
		</filetype>
	</inputs>
	<references>
		<filetype>
			<identifier>genome</identifier>
			<format>fasta</format>
			<quantity>single</quantity>
            <comment><![CDATA[
                hs37d5 and GRCm38mm10 genomes are modified as follows:
                *) IDs are simplified: everything from the first white space onward is removed,
                *) all ambiguously called nucleotides [URYSWKMBDHV] have been masked to 'N'.
                The following commands perform both modifications:
                sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' hs37d5.fa > hs37d5_simple.fa
                sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' GRCm38mm10.fa > GRCm38mm10_simple.fa
                ]]>
            </comment>
		</filetype>
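		<!--
		For readers less familiar with sed, the two substitutions documented
		for the genome reference can be sketched in Python as below. This is
		an illustration of the same line-by-line rules, not part of the
		pipeline; the function name simplify_fasta_line is hypothetical.

```python
import re

# Header rule: keep only the ID, i.e. the text before the first whitespace.
HEADER = re.compile(r"^>(\S+)\s.*$")
# Sequence rule: mask the listed ambiguity codes (either case) to 'N'.
AMBIGUOUS = re.compile(r"[UuRrYySsWwKkMmBbDdHhVv]")


def simplify_fasta_line(line):
    """Return one FASTA line (without trailing newline) after both rules."""
    if line.startswith(">"):
        return HEADER.sub(r">\1", line)
    return AMBIGUOUS.sub("N", line)
```

		Header lines without any whitespace are left unchanged, as with the
		sed expression, and lowercase a, c, g, t and n are not touched.
		-->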
		<filetype>
			<identifier>genome_index</identifier>
			<format>bowtie-index</format>
			<quantity>collection</quantity>
            <comment><![CDATA[
                bowtie version 1.1.1 index of hs37d5_simple.fa and GRCm38mm10_simple.fa generated as follows:
                bowtie-build -f hs37d5_simple.fa hs37d5_simple.fa
                bowtie-build -f GRCm38mm10_simple.fa GRCm38mm10_simple.fa
                ]]>
            </comment>
		</filetype>
	</references>
	<outputs>
		<filetype>
            <identifier>SampleID.SALv2.DATE.trimmed.bam</identifier>
			<format>BAM</format>
			<quantity>single</quantity>
            <comment>adaptor-trimmed reads mapped to the reference without any filtering or collapsing</comment>
		</filetype>
		<filetype>
            <identifier>SampleID.SALv2.DATE.trimmed.bedGraph</identifier>
			<format>bedGraph</format>
			<quantity>single</quantity>
            <comment>reference genome coverage track of aligned adaptor-trimmed reads</comment>
		</filetype>
		<filetype>
            <identifier>SampleID.SALv2.DATE.TSS.{sense,antisense}.tsv</identifier>
			<format>TSV</format>
			<quantity>single</quantity>
            <comment>Coverage track for clustered reads mapped sense/antisense to TSS regions. Each cluster with a unique clusterId represents a TSS-miRNA prediction.
                The format is BED-like (0-based, end-exclusive) and the columns are:
                chr start end readId score strand clusterId min_phastCons_score max_phastCons_score mean_phastCons_score median_phastCons_score
            </comment>
        </filetype>
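		<!--
		A minimal sketch of reading one row of the per-read TSV described
		above. It assumes tab-separated fields in the documented column order
		and no header line; neither detail is stated explicitly here. The
		names ClusterRead, parse_cluster_read and read_length are hypothetical.
		Only start and end are converted to integers; all other fields are
		kept as strings because their exact types are not documented.

```python
from collections import namedtuple

COLUMNS = [
    "chr", "start", "end", "readId", "score", "strand", "clusterId",
    "min_phastCons_score", "max_phastCons_score",
    "mean_phastCons_score", "median_phastCons_score",
]
ClusterRead = namedtuple("ClusterRead", COLUMNS)


def parse_cluster_read(line):
    """Parse one tab-separated row into a ClusterRead with integer coordinates."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(COLUMNS), len(fields)))
    rec = ClusterRead(*fields)
    return rec._replace(start=int(rec.start), end=int(rec.end))


def read_length(rec):
    """Length in nt; with 0-based, end-exclusive coordinates this is end minus start."""
    return rec.end - rec.start
```

		Because the interval is end-exclusive, a read with start 1000 and
		end 1022 covers 22 nt.
		-->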
		<filetype>
            <identifier>SampleID.SALv2.DATE.TSS.{sense,antisense}.summary.tsv</identifier>
			<format>TSV</format>
			<quantity>single</quantity>
            <comment>Summary results of called peaks (clusters).
                The format is BED-like (0-based, end-exclusive) and the columns are:
                chr start end strand clusterId coverage geneId,symbol consensus_sequence
            </comment>
        </filetype>
		<filetype>
            <identifier>SampleID.SALv2.DATE.TSS.{sense,antisense}.bed</identifier>
			<format>BED</format>
			<quantity>single</quantity>
            <comment>Simplified BED version (6 columns) of the corresponding TSV file with readIds removed.</comment>
		</filetype>
		<filetype>
            <identifier>SampleID.SALv2.DATE.TES.{sense,antisense}.tsv</identifier>
			<format>TSV</format>
			<quantity>single</quantity>
            <comment>Coverage track for clustered reads mapped sense/antisense to TES regions. Each cluster with a unique clusterId represents a TES-miRNA prediction.
                The format is BED-like (0-based, end-exclusive) and the columns are:
                chr start end readId score strand clusterId min_phastCons_score max_phastCons_score mean_phastCons_score median_phastCons_score
            </comment>
		</filetype>
		<filetype>
            <identifier>SampleID.SALv2.DATE.TES.{sense,antisense}.summary.tsv</identifier>
			<format>TSV</format>
			<quantity>single</quantity>
            <comment>Summary results of called peaks (clusters).
                The format is BED-like (0-based, end-exclusive) and the columns are:
                chr start end strand clusterId coverage geneId,symbol consensus_sequence
            </comment>
        </filetype>
		<filetype>
            <identifier>SampleID.SALv2.DATE.TES.{sense,antisense}.bed</identifier>
			<format>BED</format>
			<quantity>single</quantity>
            <comment>Simplified BED version (6 columns) of the corresponding TSV file with readIds removed.</comment>
		</filetype>
	</outputs>
	<software>
		<tool>
            <name>standard</name>
			<version>n/a</version>
            <command_line><![CDATA[ CMDLINE ]]></command_line>
            <loop>n/a</loop>
            <comment>The following software tools are used in the pipeline:
                flexbar version 2.4
                bowtie version 1.1.1
                samtools version 1.1
                bedtools version 2.23.0
                R version 3.2.0
                Bioconductor version 3.1 (BiocInstaller 1.18.1)
                bwtool version 1.0
                custom python 2.7 scripts
                gawk version 4.0.1
            </comment>
		</tool>
	</software>
</process>