WORKFLOW
---

The user can upload either FASTQ or FASTA files bearing reads using the
web interface or the MIRPIPE FTP server. These should ideally be
compressed (.zip, .gz) to reduce upload time. The pipeline can fully
process raw reads originating from Illumina, 454, IonTorrent or Sanger
sequencing instruments including adapter trimming.
--- Parameters:
- Reads: File bearing reads in FASTQ or FASTA format (ideally zip
compressed for Galaxy). This file can either be uploaded using the Galaxy
Upload Tool (Helpful Tools / Get Data / Upload Files) or using an account
on our FTP server. The latter is only possible after the registration of a
user in Galaxy, which automatically creates an account with the same
username and password on the FTP server
(ftp://bioinformatics.mpi-bn.mpg.de/). The data will be deleted from the
server after two weeks.


A reference FASTA database bearing mature target miRNAs can either be
selected from the preprocessed current miRBase release 20 data harbouring
30424 entries of 206 species or can be uploaded by the user in FASTA
format. The user can optionally choose a subset of the miRBase reference
miRNAs bearing only miRNAs of the desired organism to limit the comparison
to e.g. the closest relative. If the chosen reference FASTA file does not
follow the naming convention of miRBase (<species>-miR-<#>-<suffix>), the
"family name clustering" parameter should be turned off.
--- Parameters:
- Reference database: Preprocessed DBs (full miRBase or miRNAs of only one
species) or any user uploaded FASTA file bearing mature miRNAs. The
correct miRBase file can be downloaded for offline usage:
ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz.


The raw read data is then processed to optionally remove an adapter
sequence and trim for a minimum quality (default Q20). Only reads of the
desired size range are selected to limit the pool to likely mature miRNAs
(default: 18-28 nt).
--- Parameters:
- Adapter sequence: Nucleotide sequence of the adapter to be removed from
the 3' end using Cutadapt. By default, the maximum number of mismatches is
the larger of 1 and 10% of the adapter length. These values can be changed
inside the mirpipe.pl script.
- Minimum read length: Minimum length of a read after trimming to be
considered in the analysis.
- Minimum base quality: Minimum phred quality for FASTQ data. Nucleotides
with lower quality will be trimmed. This parameter is not used if FASTA
formatted read data is supplied.
- Maximum read length: Maximum length of a read to be considered in
analysis.
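
The adapter mismatch rule and the read length window described above can
be illustrated with a minimal Python sketch. The function names are
hypothetical and not part of mirpipe.pl, and rounding the 10% value down
is an assumption.

```python
def max_adapter_mismatches(adapter: str) -> int:
    """Larger of 1 and 10% of the adapter length.

    Rounding down is an assumption of this sketch.
    """
    return max(1, len(adapter) // 10)


def within_length_window(read: str, min_len: int = 18, max_len: int = 28) -> bool:
    """True if a trimmed read lies inside the size range (default 18-28 nt)."""
    return min_len <= len(read) <= max_len
```

For a made-up 32 nt adapter this gives 3 allowed mismatches; any adapter
shorter than 20 nt gives 1.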


Duplicate reads are collapsed to decrease the number of necessary homology
searches (the number of duplicates per read is noted). Only those
sequences present a minimum number of times (default = 5) are kept for
further analyses. This measure is intended to remove rare or unique reads,
which frequently denote sequencing errors or lowly expressed miRNAs that
cannot be reliably quantified. Setting this parameter to "1" will increase
sensitivity at the cost of an increased false positive rate.
--- Parameters:
- Minimum read copy number: A read sequence must be present at least this
number of times to be included.
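
The collapsing step can be sketched in Python as follows. This is only an
illustration of the idea, not the code used by MIRPIPE; the function name
collapse_reads is hypothetical.

```python
from collections import Counter


def collapse_reads(reads, min_copies=5):
    """Collapse identical reads and keep sequences seen at least min_copies times.

    Returns a dict mapping each kept sequence to its number of duplicates.
    """
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_copies}
```

With the default of 5, a sequence seen five times is kept together with
its count, while a sequence seen only twice is discarded.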


Read counts from isomiRs of the same miRNA are combined. These isomiR read
sequences differ only at the 3' end and are thus putatively encoded by
the same gene and bear the same target specificity. This step summarizes
putatively functionally equivalent isomiRs that result from imperfect
digestion by the RNases Drosha and Dicer or from RNA editing by
specialized enzymes that modify the 3' end. Only the final 3' nucleotide
may differ between two sequences for them to be counted as isoforms of
the same miRNA, and only the longest isoform sequence is used in the next
step to reduce the number of homology searches per miRNA.


The resulting read sequences are compared against the chosen reference
database of miRNAs. Sensitivity and specificity of this BLASTN homology
search can be controlled using various parameters. The BLASTN options are
optimized for short query sequences (-num_alignments 15 -word_size 7
-evalue 10 -dust no -strand plus). The resulting hits are filtered to
exclude those with too many mismatches ((read length - alignment length) +
mismatches + gaps = final mismatches).
--- Parameters:
- Maximum mismatches: Maximum number of mismatches allowed between
reference miRNA and read sequence ((read length - alignment length) +
mismatches + gaps = final mismatches). This parameter controls the size of
the miRNA clusters: more mismatches allowed = larger clusters.
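
The mismatch formula above can be written as a small Python sketch. The
function and argument names are hypothetical; how the values are taken
from the BLASTN output is not shown here.

```python
def final_mismatches(read_length, alignment_length, mismatches, gaps):
    """(read length - alignment length) + mismatches + gaps."""
    return (read_length - alignment_length) + mismatches + gaps


def passes_mismatch_filter(read_length, alignment_length, mismatches, gaps,
                           max_mismatches):
    """True if a hit does not exceed the allowed number of final mismatches."""
    total = final_mismatches(read_length, alignment_length, mismatches, gaps)
    return total <= max_mismatches
```

For example, a 22 nt read aligned over 20 nt with one mismatch and no gaps
has (22 - 20) + 1 + 0 = 3 final mismatches.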


Mature miRNAs and their precursors are optionally collated by name on the
family level to remove redundancy (ex.
bta-miR-200a,oan-miR-200a-3p,tgu-miR-200a-3p -> miR-200a). Otherwise the
complete miRNA names given in the reference database are carried over
resulting in more detailed but also more redundant output. Turning off
family name clustering is advisable if the reference database of miRNA
sequences does not follow the naming convention of miRBase
(<species>-miR-<#>-<suffix>).
--- Parameters:
- Family name clustering: Collapse the names of all variants of a miRNA to
the miRNA family (ex. bta-miR-200a,oan-miR-200a-3p,tgu-miR-200a-3p ->
miR-200a).
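
The name collapsing shown in the example above can be approximated with
the following Python sketch. It assumes that every name carries a
lowercase species prefix and an optional -3p/-5p suffix; the exact rules
used by MIRPIPE may differ, and the function name family_name is
hypothetical.

```python
import re

# Assumed name layout: <species>-miR-<#>[-3p|-5p]
_SPECIES_PREFIX = re.compile(r"^[a-z0-9]+-")  # e.g. "bta-"
_ARM_SUFFIX = re.compile(r"-[35]p$")          # e.g. "-3p"


def family_name(mirna_name: str) -> str:
    """Strip the species prefix and the arm suffix from a miRBase-style name."""
    name = _SPECIES_PREFIX.sub("", mirna_name, count=1)
    return _ARM_SUFFIX.sub("", name)
```

Applied to bta-miR-200a, oan-miR-200a-3p and tgu-miR-200a-3p, the sketch
returns miR-200a in all three cases. Names that do not follow this layout
would be mangled, which is why family name clustering should be turned
off for such reference files.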


Detected reference miRNA families per read are scored based on the minimum
number of mismatches. If a read matched equally well versus multiple miRNA
families, the respective families are joined by single linkage clustering.
By default only those read sequences that are at least 5% as abundant as
the most abundant sequence per miRNA family cluster are denoted (ex. most
abundant sequence = 100 reads, cutoff = 5 reads). This is intended to
further suppress reads resulting from sequencing errors or biological
miRNA variations that are expressed near the detection limit.
--- Parameters:
- Minimum cluster abundance: Remove read sequences from a cluster that are
less than x% as abundant as the most abundant sequence. This is intended
to suppress reads resulting from sequencing errors or biological miRNA
variations that are expressed near the detection limit. This parameter
controls the size of the miRNA clusters: lower minimum cluster abundance =
larger clusters.
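
The abundance cutoff can be sketched in Python as follows. The function
name filter_cluster and the dict-based input are assumptions made for
illustration only.

```python
def filter_cluster(read_counts, min_abundance_pct=5.0):
    """Keep reads at least min_abundance_pct % as abundant as the top read.

    read_counts maps each read sequence of one cluster to its read count.
    """
    if not read_counts:
        return {}
    cutoff = max(read_counts.values()) * min_abundance_pct / 100.0
    return {seq: n for seq, n in read_counts.items() if n >= cutoff}
```

If the most abundant sequence has 100 reads, the default cutoff is 5
reads: a sequence with 5 reads is kept, one with 4 reads is removed.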


In order to achieve congruent results yielding one count value per miRNA,
miRNA family clusters are finally split. Since some of the reads match
multiple miRNAs equally well, these reads are counted fully for all of the
respective miRNAs. As a result, the summed read counts of all miRNAs can
exceed the total number of matching reads. Each miRNA is associated with
an ambiguity value, denoting the
share of reads that could not be placed clearly (e.g. 11/89 reads
ambiguous = 0.12). If this value is high, the respective miRNA count may
be misleading. Finally, the most abundant sequence matching a miRNA is
given (primary sequence) as well as the number of reads matching it.


OUTPUT FILES
---


- 1. mirpipe_cluster.tsv: MIRPIPE miRNA clusters = output of one read
sequence per line

This file is centred on the different read sequences found per miRNA
cluster that result from biological and technical variation. Only those
read sequences that are >=5% as abundant as the most abundant sequence per
cluster are denoted by default. If a read matched equally well versus
multiple miRNAs, the respective miRNAs or miRNA clusters are joined by
single linkage clustering.

Columns:
Cluster     Cluster number
Sequence    Read sequence
Count       Summarized read count for all duplicates of this read
miRNA       Name of miRNA or miRNA families

Example (sorted for cluster number, expression):
Cluster  Sequence                     Count  miRNA
     90  CAGTACTGTGATAACTGAAGAA          33  miR-101a
     90  GTACTGTGATAACTGACT              17  miR-101c,miR-101a


- 2. mirpipe_cluster.fasta: MIRPIPE cluster sequences

All sequences reported in the MIRPIPE miRNA cluster file
(mirpipe_cluster.tsv), in FASTA format.

Example:
>miR-101a count=33
CAGTACTGTGATAACTGAAGAA
>miR-101a,miR-101c count=17
GTACTGTGATAACTGACT
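
The header layout shown in the example can be read back with a short
Python sketch. It assumes one sequence line per record and a header of
the form ">names count=N", as in the example; the function name
read_cluster_fasta is hypothetical.

```python
def read_cluster_fasta(lines):
    """Yield (miRNA names, count, sequence) from mirpipe_cluster.fasta-style lines.

    Assumes headers like ">miR-101a,miR-101c count=17" and single-line sequences.
    """
    names, count = None, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Split off the trailing " count=N" field from the name list.
            label, _, count_field = line[1:].rpartition(" count=")
            names, count = label.split(","), int(count_field)
        elif names is not None:
            yield names, count, line
```

The function accepts any iterable of lines, for instance an open file
handle.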


- 3. mirpipe_mirna.tsv: MIRPIPE miRNAs on 5% level = output of one miRNA
per line

This file includes one count value per miRNA and can directly serve as
input for subsequent differential expression analyses. It is based on
clusters of highly similar miRNAs, where a clear assignment of reads is
not always possible, since the same read can match equally well to
multiple reference miRNAs. Only those miRNA sequences are reported that
are >=5% as abundant as the most abundant sequence in their cluster.

Columns:
miRNA                   Name of miRNA or miRNA family
Count                   Summarized read count including isomiRs,
                        biological + technical sequence variations
Ambiguous reads         Ratio of reads that mapped equally well to other
                        miRNAs inside the miRNA family cluster
Cluster                 miRNA family cluster number
Primary sequence        Most abundant sequence for this miRNA inside the
                        cluster
Primary sequence count  Count of the most abundant sequence for this miRNA
                        inside the cluster
Cluster members         A comma-separated list of all members of the miRNA
                        family cluster

Example (sorted for cluster number, expression):
miRNA     Expression  Ambiguity  Cluster  Primary Sequence        PS Count  Cluster members
miR-101a         143       0.12       90  CAGTACTGTGATAACTGAAGAA        33  miR-101a,miR-101c
miR-101c          17          1       90  GTACTGTGATAACTGACT            17  miR-101a,miR-101c


EXAMPLE
---

The following example shows a MIRPIPE result using default parameters. Two
miRNAs (miR-2478,miR-3968) were joined into a miRNA cluster based on
BLASTN results.

mirpipe_cluster.tsv:
Cluster	Sequence	Count	miRNA
192	ATCCCACTTCTGACACCA	69	miR-2478
192	ATCCCACTCTCAACACCA	11	miR-3968
192	ATCCCACTCCTGACACCA	11	miR-2478,miR-3968
192	ATCCCATTCTTGACACCA	9	miR-2478
192	TCGAATCCCACTCCTGACACCA	6	miR-3968
192	AATCCCACTCTCAACACCA	5	miR-3968
192	TCAAATCCCACTCTCAACACCA	5	miR-3968

mirpipe_cluster.fasta:
>miR-2478 count=69
ATCCCACTTCTGACACCA
>miR-3968 count=5
AATCCCACTCTCAACACCA
>miR-3968 count=11
ATCCCACTCTCAACACCA
>miR-2478,miR-3968 count=11
ATCCCACTCCTGACACCA
>miR-2478 count=9
ATCCCATTCTTGACACCA
>miR-3968 count=5
TCAAATCCCACTCTCAACACCA
>miR-3968 count=6
TCGAATCCCACTCCTGACACCA

mirpipe_mirna.tsv:
miRNA	Count	Ambiguity	Cluster	Primary Sequence	Primary Sequence Reads	Cluster members
miR-2478	89	0.12	192	ATCCCACTTCTGACACCA	69	miR-2478,miR-3968
miR-3968	38	0.29	192	ATCCCACTCTCAACACCA	11	miR-2478,miR-3968

The mirpipe_cluster.tsv file lists the best BLASTN hit per read sequence,
i.e. the hit with the fewest mismatches. Sequences are sorted by
expression from top to bottom; even the least expressed sequence (5 reads)
is at least 5% as abundant as the most expressed one (69 reads, cutoff =
3.45 reads). The two miRNAs were joined into a cluster because one of the
read sequences produced a BLASTN hit that fit both reference sequences
equally well (192 ATCCCACTCCTGACACCA 11 miR-2478,miR-3968). If another
read had matched e.g. miR-2478 and miR-1000 equally well, the clusters
would have been joined to miR-2478,miR-3968,miR-1000.
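
The joining of miRNAs into clusters can be sketched in Python with a
simple union-find structure. This is a minimal illustration of single
linkage clustering under the assumption that each read contributes the
set of miRNAs it matched equally well; it is not the code used by MIRPIPE,
and the function name single_linkage is hypothetical.

```python
def single_linkage(hits):
    """Join miRNA names that share at least one equally good read.

    hits: iterable of sets of miRNA names, one set per read sequence.
    Returns a list of sets, one per cluster.
    """
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for names in hits:
        names = sorted(names)
        if not names:
            continue
        root = find(names[0])  # registers single-miRNA hits as well
        for other in names[1:]:
            parent[find(other)] = root

    clusters = {}
    for name in list(parent):
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())
```

With the hits {miR-2478}, {miR-3968} and {miR-2478, miR-3968} from the
example, the sketch returns one cluster containing both miRNAs. Adding a
hypothetical hit {miR-2478, miR-1000} would extend the same cluster to
three members.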

The mirpipe_cluster.fasta file shows all read sequences found in
mirpipe_cluster.tsv converted to FASTA format.

The mirpipe_mirna.tsv file attempts to include one count value per miRNA
in order to facilitate later quantification. The count values for each
sequence detected per miRNA are summarized (e.g.: miR-2478 = 69 + 11 + 9 =
89, miR-3968 = 11 + 11 + 6 + 5 + 5 = 38). Since some of the reads matched
two different miRNAs equally well (miR-2478,miR-3968 = 11), these reads
are counted fully for both miRNAs. As a result, the summed read counts of
all miRNAs can exceed the total number of matching reads (here 89 + 38 =
127 versus 116 reads in the cluster). Each miRNA is associated with an ambiguity value,
denoting the share of reads that could not be placed clearly (e.g.
miR-2478: 11/89 ambiguous = 0.12). If this value is high, the respective
miRNA count may be misleading. Finally, the most abundant sequence
matching a miRNA is given (primary sequence) as well as the number of
reads matching it.
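
The summary above can be reproduced with a short Python sketch that uses
the rows of the example mirpipe_cluster.tsv. The function name summarize,
the input layout and the tie-breaking rule (the first sequence seen wins
when counts are equal) are assumptions of this sketch.

```python
def summarize(cluster_rows):
    """Compute count, ambiguity and primary sequence per miRNA.

    cluster_rows: list of (sequence, count, [miRNA names]) for one cluster.
    """
    summary = {}
    for seq, count, names in cluster_rows:
        for name in names:
            entry = summary.setdefault(
                name,
                {"count": 0, "ambiguous": 0, "primary": None, "primary_count": 0},
            )
            entry["count"] += count  # ambiguous reads count fully for each miRNA
            if len(names) > 1:
                entry["ambiguous"] += count
            if count > entry["primary_count"]:
                entry["primary"], entry["primary_count"] = seq, count
    for entry in summary.values():
        entry["ambiguity"] = round(entry["ambiguous"] / entry["count"], 2)
    return summary


rows = [
    ("ATCCCACTTCTGACACCA", 69, ["miR-2478"]),
    ("ATCCCACTCTCAACACCA", 11, ["miR-3968"]),
    ("ATCCCACTCCTGACACCA", 11, ["miR-2478", "miR-3968"]),
    ("ATCCCATTCTTGACACCA", 9, ["miR-2478"]),
    ("TCGAATCCCACTCCTGACACCA", 6, ["miR-3968"]),
    ("AATCCCACTCTCAACACCA", 5, ["miR-3968"]),
    ("TCAAATCCCACTCTCAACACCA", 5, ["miR-3968"]),
]
result = summarize(rows)
```

For these rows the sketch yields a count of 89 and an ambiguity of 0.12
for miR-2478, and a count of 38 and an ambiguity of 0.29 for miR-3968,
matching the example mirpipe_mirna.tsv.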