A Snakemake pipeline to preprocess single-cell RNA-seq data. sc-preprocess
can handle data from multiple platforms, e.g. C1, Wafergen and Drop-seq.

    git clone ...
    cd single-cell-preprocessing
    conda env create -f environment.yml

Edit the file `config/dropseq.yml` and adapt the options to your needs. Pay attention to the `data`
section and add the paths to the individual fastq files like this:

    data:
      files:
        condition1:
          r1: path/to/condition1_r1.fastq.gz
          r2: path/to/condition1_r2.fastq.gz
        condition2:
          r1: path/to/condition2_r1.fastq.gz
          r2: path/to/condition2_r2.fastq.gz

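Before starting a run, it can help to confirm that every path listed in the `data` section points to an existing file. The following is a minimal sketch, not part of the pipeline: the function name `missing_fastq_files` is hypothetical, and it assumes the config has already been parsed from YAML into a plain dictionary with the `data` → `files` → condition → `r1`/`r2` layout shown above.

```python
from pathlib import Path


def missing_fastq_files(config: dict) -> list:
    """Return 'condition/read: path' entries whose file does not exist."""
    missing = []
    # Walk data -> files -> <condition> -> r1/r2, as in the example config.
    for condition, reads in config["data"]["files"].items():
        for read, path in reads.items():
            if not Path(path).is_file():
                missing.append(f"{condition}/{read}: {path}")
    return missing


# Hypothetical config, already parsed from YAML into a dict.
example_config = {
    "data": {
        "files": {
            "condition1": {
                "r1": "path/to/condition1_r1.fastq.gz",
                "r2": "path/to/condition1_r2.fastq.gz",
            }
        }
    }
}

for entry in missing_fastq_files(example_config):
    print("missing:", entry)
```

An empty result means all listed files were found; the placeholder paths in the example are expected to be reported as missing.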
A call to `snakemake -s sc-preprocess.snake` will create a whitelist of true barcodes per condition, demultiplex the reads into one fastq file per valid barcode, and quantify the data.
Edit the file `config/plateseq.yml` and adapt the options to your needs. If you already have fastq files for each cell (sample), set `demultiplex: False` in the `action` section.
Plate-based data needs a samplesheet that gives information on each sample to be processed. A samplesheet is a tab-separated table that contains at least a column with the sample name:
Name | Batch | Condition |
---|---|---|
cell1 | batch1 | wildtype |
cell2 | batch1 | mutant |
cell3 | batch2 | wildtype |
cell4 | batch2 | mutant |
Other columns might be needed, depending on the experimental setup:

- C1 data needs a `URL_r1` column with the path to the sample's fastq file, but no `Barcode` column.
- Wafergen data needs a `Barcode` column with the barcode that is used during multiplexing, but no `URL_r1` column.

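The column requirements above can be checked before running the pipeline. This is a minimal sketch under stated assumptions, not the pipeline's own validation: the function `check_columns`, the platform labels used as dictionary keys, and the example barcode value are all illustrative. It only checks that the required columns are present in the header of a tab-separated samplesheet.

```python
import csv
import io

# Required extra column per platform, taken from the list above.
# The platform labels used as keys are illustrative.
REQUIRED = {"C1": "URL_r1", "Wafergen": "Barcode"}


def check_columns(handle, platform, index="Name"):
    """Return a list of problems found in the samplesheet header."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = reader.fieldnames or []
    problems = []
    if index not in columns:
        problems.append(f"missing sample name column '{index}'")
    if REQUIRED[platform] not in columns:
        problems.append(f"{platform} data needs a '{REQUIRED[platform]}' column")
    return problems


# Hypothetical samplesheet content with a made-up barcode.
sheet = "Name\tBatch\tCondition\tBarcode\ncell1\tbatch1\twildtype\tACGT\n"
print(check_columns(io.StringIO(sheet), "Wafergen"))
print(check_columns(io.StringIO(sheet), "C1"))
```

For a real samplesheet, pass an open file handle instead of the `io.StringIO` object, and set `index` to the name of your sample name column.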
The column names can be arbitrary, but the names of the sample name (`index`) and barcode (`barcode`) columns need to be given in the `config/plateseq.yml` file:

    samplesheet:
      file: SampleSheet.txt
      index: Name
      barcode: Barcode

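To illustrate how the `index` and `barcode` settings relate to the samplesheet, here is a minimal sketch that uses them to look up each sample's barcode. It is not how the pipeline itself reads the file: the function `barcodes_by_sample` is hypothetical, the `samplesheet` section is assumed to be available as a plain dictionary, and the barcode values are made up.

```python
import csv
import io


def barcodes_by_sample(handle, samplesheet_cfg):
    """Map each sample name to its barcode using the configured column names."""
    index_col = samplesheet_cfg["index"]
    barcode_col = samplesheet_cfg["barcode"]
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[index_col]: row[barcode_col] for row in reader}


# The samplesheet section of the config, written out as a dict.
cfg = {"file": "SampleSheet.txt", "index": "Name", "barcode": "Barcode"}

# Hypothetical tab-separated samplesheet content.
sheet = "Name\tBarcode\ncell1\tAAAA\ncell2\tCCCC\n"
print(barcodes_by_sample(io.StringIO(sheet), cfg))
# prints {'cell1': 'AAAA', 'cell2': 'CCCC'}
```

Because the column names come from the config, renaming the columns in the samplesheet only requires changing `index` and `barcode` accordingly.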
A call to `snakemake -s sc-preprocess.snake` will demultiplex barcodes into fastq files, if needed, and quantify the data.