added documentation on pre-processing samples

proost · Aug 10, 2017 · 766d70c
1 parent 2ec3d28
commit 766d70c

Showing 2 changed files with 124 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -38,6 +38,12 @@ Next, move into the directory and copy **config.template.ini** and **data.templa
 
 Configure config.ini and data.ini using these [guidelines](docs/configuration.md)
 
+## Preparing your data
+
+Before running LSTrAP, make sure you have all required data. RNA-Seq data needs to be de-multiplexed and de-barcoded, with
+one file per sample; paired-end files need to be named properly (*e.g.* sample_one_1.fastq.gz and sample_one_2.fastq.gz).
+
+Instructions on how to do this are included [here](docs/data_preparation.md).
 
 ## Running LSTrAP
 
@@ -60,9 +66,10 @@ Furthermore, steps can be skipped (to avoid re-running steps unnecessarily). Use
 
 ## Further reading
 
+  * [Data preparation](docs/data_preparation.md)
   * [LSTrAP output](docs/example_output.md)
   * [Quality statistics](docs/quality.md): How to check the quality of samples and remove problematic samples
-  * [Helper Scripts](docs/helper.md): To acquire data from the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra)
+  * [Helper scripts](docs/helper.md): To acquire data from the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra)
   and process results.
 
 

diff --git a/docs/data_preparation.md b/docs/data_preparation.md
@@ -0,0 +1,116 @@
+# Preparing your data
+
+LSTrAP has a few expectations about how RNA-Seq files are pre-processed. When you use the [included scripts](helper.md)
+to download samples from the Sequence Read Archive and convert them to (compressed) .fastq files, this is done
+automatically. Your own data might require some pre-processing; the suggestions below explain how to get
+other data ready for processing with LSTrAP.
+
+## Preparing RNA-Seq data
+
+RNA-Seq data needs to be **de-multiplexed** and in **.fastq format** prior to running LSTrAP.
+Single-end samples need one file each (output split over several files needs to be merged); paired-end samples need two files
+with the same filename except for a suffix, _1 for the left reads and _2 for the right (e.g. paired_end_sample_1.fastq.gz and
+paired_end_sample_2.fastq.gz).
+
+The extension should be .fq or .fastq for uncompressed files, and .fq.gz or .fastq.gz for compressed files (only gzip is
+supported).
+
+### Merging files
+
+Samples are commonly split and sequenced in different lanes or runs (to increase the total number of reads); those files
+need to be combined prior to starting LSTrAP. Fastq files can be merged using standard Linux commands. When merging
+paired-end files, merge the left and right reads separately and list the parts in the same order for both.
+
+    # Merge two single-end samples, compressed with gzip
+    zcat sample_one_part_1.fastq.gz sample_one_part_2.fastq.gz | gzip -c > sample_one_merged.fastq.gz
+     
+    # Merge two uncompressed, single-end files
+    cat sample_one_part_1.fastq sample_one_part_2.fastq > sample_one_merged.fastq
+        
+    # Merge two uncompressed, single-end files and compress the result
+    cat sample_one_part_1.fastq sample_one_part_2.fastq | gzip -c > sample_one_merged.fastq.gz
+     
+    # Merge paired-end files
+    zcat sample_one_L001_R1.fastq.gz sample_one_L002_R1.fastq.gz | gzip -c > sample_one_merged_1.fastq.gz
+    zcat sample_one_L001_R2.fastq.gz sample_one_L002_R2.fastq.gz | gzip -c > sample_one_merged_2.fastq.gz
+
+**Note**: In theory, gzipped files can be concatenated directly, which is much more efficient than using zcat paired with
+gzip. However, this might lead to [errors](https://www.biostars.org/p/81924/#82017) with some tools; use this method
+carefully and at your **own risk**.
+
+    # Merge two single-end samples, compressed with gzip
+    cat sample_one_part_1.fastq.gz sample_one_part_2.fastq.gz > sample_one_merged.fastq.gz
+     
+    # Merge paired-end files
+    cat sample_one_L001_R1.fastq.gz sample_one_L002_R1.fastq.gz > sample_one_merged_1.fastq.gz
+    cat sample_one_L001_R2.fastq.gz sample_one_L002_R2.fastq.gz > sample_one_merged_2.fastq.gz   
+
+### Converting files to Fastq
+
+If you have files in .sam/.bam or .sra format, these need to be converted to .fastq or .fastq.gz files. 
+[Samtools](http://www.htslib.org/doc/samtools.html) allows conversion from .sam/.bam to .fastq, while
+[SraToolKit](https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software) includes fastq-dump to convert .sra files.
+
+For converting .sra files a [helper script](helper.md) is included.
+
+    # SAMTools
+    samtools fastq input.bam > output.fastq
+    samtools fastq input.bam | gzip -c > output.fastq.gz
+     
+    # SRAToolKit
+    fastq-dump --gzip --skip-technical --readids --dumpbase --split-3 input.sra -O output_dir/
+
+
+### De-multiplexing files
+
+In some cases the sequencing facility does not de-multiplex the sequences. De-multiplexing RNA-Seq files requires a
+third-party tool, such as:
+
+  * [fastq-multx](https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/FastqMultx.md)
+  * [REAPER](http://wwwdev.ebi.ac.uk/enright-dev/kraken/reaper/src/reaper-latest/doc/reaper.html)
+
+
+## Recommended folder structure (optional)
+
+When using LSTrAP, keeping a consistent file/folder structure for input and output (as specified in data.ini) is
+recommended. This allows data.ini to be quickly adapted for new projects.
+
+Below is the structure we've adopted for our projects.
+
+```
+./
+|-- config.ini
+|-- data.ini
++-- data/
+    +-- fastq/
+        |-- sample_one.fastq.gz
+        |-- paired_end_S01_1.fastq.gz
+        |-- paired_end_S01_2.fastq.gz
+        +-- ...
+    +-- genome/
+        |-- species.genome.fasta
+        |-- species.cds.fasta
+        |-- species.pep.fasta
+        +-- species.genes.gff
++-- output/
+    +-- species/
+        |-- index_files
+        +-- alignment_tophat/
+            |-- sample_one/
+            |-- paired_end_S01/
+            +-- ...
+        +-- htseq/
+            |-- sample_one.htseq
+            |-- paired_end_S01.htseq
+            +-- ...
+        +-- expression/
+            |-- matrix.raw.txt
+            |-- matrix.tpm.txt
+            |-- matrix.rpkm.txt
+            |-- pcc.table.txt
+            |-- pcc.mcl.txt
+            +-- mcl.clusters.txt
+```
+
+
+
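
The paired-end naming convention described in docs/data_preparation.md (a `_1`/`_2` suffix in front of a `.fq`, `.fastq`, `.fq.gz` or `.fastq.gz` extension) can be checked before starting LSTrAP. The following is a minimal sketch and not part of LSTrAP: the function name `check_pairs` is hypothetical, and it assumes all input files sit directly in one directory, such as `data/fastq/` in the recommended structure.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of LSTrAP): report paired-end files whose mate is missing.
# Assumption: mates differ only in the _1/_2 suffix directly before the extension.

check_pairs() {
    local dir="$1" status=0 f mate
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case "$f" in
            *_1.fastq.gz|*_1.fq.gz|*_1.fastq|*_1.fq)
                # Swap the _1 suffix for _2 and keep the extension unchanged
                mate="${f%_1.*}_2.${f##*_1.}" ;;
            *_2.fastq.gz|*_2.fq.gz|*_2.fastq|*_2.fq)
                # Swap the _2 suffix for _1 and keep the extension unchanged
                mate="${f%_2.*}_1.${f##*_2.}" ;;
            *)
                # No _1/_2 suffix: treated as single-end (or not a fastq file), skipped
                continue ;;
        esac
        if [ ! -f "$mate" ]; then
            echo "missing mate for $f (expected $mate)"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_pairs data/fastq` prints one line per file without a mate and returns a non-zero status if any were found. Note that unmerged single-end parts named like `sample_one_part_1.fastq.gz` match the same pattern, so the check is only meaningful after the files have been merged.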