scRNA-seq analysis using Python
Aim: Analyze the data to determine
- Number of cells
- Number of cell clusters (generate a cluster map)
- Disease-specific clusters
- Disease-specific transcript signatures
Steps
- Examine the data
- Demultiplexing your data
- Generate a cell matrix
  - reads that carry the same cell barcode come from the same cell
- Filter the cells
  - remove cell barcodes that appear fewer than x times (too few transcripts)
  - consider capping the maximum number of transcripts per barcode (doublets and multiplets carry more transcripts)
  - account for background (set a cut-off point, e.g., the minimum number of genes or transcripts required to call a barcode a cell)
- Filter the genes
  - remove genes that appear fewer than x times, as they are less informative
- Normalization
  - normalisation corrects for differences in total transcript abundance between cells
- Find highly variable genes (feature selection)
  - some genes vary little between cells, and carrying forward a full cells × genes matrix makes computation expensive. Standard pipelines therefore keep only the genes that vary most between cells.
- Scale data (optional, e.g., cell cycle regression)
  - This step is not always performed, although it can make samples sequenced at different depths easier to compare. It scales the variation of each gene so that genes are comparable (otherwise, genes with strong expression differences dominate the analysis and hide subtler differences in other genes). At this step you can also optionally ‘regress out’ genes, so that their variation does not contribute to cluster calling.
- Dimensionality reduction
- Identify cell clusters
  - group the cells by their transcript signatures
- Plot your cells
  - select your cluster plot
  - similar cells will be plotted close together
- Interpret the results
  - were there any cells you couldn't classify?
  - how many total cells did you find?
  - how many cell types (clusters) are in your final map?
  - how did you interpret the results?
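The filtering and scaling steps in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not what any particular pipeline does internally: the `{barcode: {gene: count}}` layout, the function names `filter_matrix` and `scale_genes`, and the thresholds `min_counts`, `max_counts`, `min_cells` and `max_value` are all hypothetical stand-ins for the unspecified cut-off x. Scaling would normally follow normalisation; it is applied to the filtered toy counts here only to keep the example short.

```python
import math
from collections import Counter

def filter_matrix(counts, min_counts=3, max_counts=None, min_cells=2):
    """Filter a {barcode: {gene: count}} matrix (hypothetical layout).

    Cells are kept if their total count lies in [min_counts, max_counts];
    genes are kept if they are detected in at least min_cells kept cells.
    """
    # Step 1: filter cells on their total transcript count.
    kept_cells = {}
    for barcode, genes in counts.items():
        total = sum(genes.values())
        if total < min_counts:
            continue  # too few transcripts: likely an empty droplet or debris
        if max_counts is not None and total > max_counts:
            continue  # unusually many transcripts: possible doublet/multiplet
        kept_cells[barcode] = genes

    # Step 2: count in how many kept cells each gene is detected.
    cells_per_gene = Counter()
    for genes in kept_cells.values():
        for gene, n in genes.items():
            if n > 0:
                cells_per_gene[gene] += 1

    # Step 3: drop rarely detected genes from every kept cell.
    kept_genes = {g for g, n in cells_per_gene.items() if n >= min_cells}
    return {
        barcode: {g: n for g, n in genes.items() if g in kept_genes}
        for barcode, genes in kept_cells.items()
    }

def scale_genes(matrix, genes, max_value=10.0):
    """Z-score each gene across cells and clip to +/- max_value."""
    barcodes = list(matrix)
    scaled = {b: {} for b in barcodes}
    for gene in genes:
        values = [matrix[b].get(gene, 0.0) for b in barcodes]
        mean = sum(values) / len(values)
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        for b, v in zip(barcodes, values):
            z = 0.0 if sd == 0 else (v - mean) / sd  # constant genes map to 0
            scaled[b][gene] = max(-max_value, min(max_value, z))
    return scaled

toy = {
    "AAAC": {"Cd3e": 4, "Actb": 6},
    "AAAG": {"Cd3e": 1, "Actb": 5, "Xist": 2},
    "AAAT": {"Actb": 1},                # too few transcripts
    "AACA": {"Cd3e": 40, "Actb": 60},   # suspiciously many transcripts
}
filtered = filter_matrix(toy, min_counts=3, max_counts=50, min_cells=2)
scaled = scale_genes(filtered, genes=["Cd3e", "Actb"])
```

In the toy data two barcodes survive the cell filter, and `Xist` is dropped because only one surviving cell expresses it. After scaling, every gene has mean 0 and unit variance across cells, so no single highly expressed gene dominates.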
Initial step: align the raw reads to the genome to obtain an Expression Matrix (EM)
Droplet-based scRNA-seq (e.g. 10X v3)
Droplet libraries produce data with cell barcodes + UMIs (unique molecular identifiers)
Read 1: 10x cell barcode + UMI (26 bp with v2 chemistry: 16 bp barcode + 10 bp UMI; v3 chemistry uses a 12 bp UMI, giving 28 bp)
Read 2: 3' or 5' transcript (98 bp)
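As an illustration of the read layout, Read 1 can be split into its cell barcode and UMI by position. A minimal sketch: the function name `split_read1` is hypothetical, and the default 16 bp barcode + 10 bp UMI split is an assumption chosen to match the 26 bp Read 1 quoted above; kits with a different layout need different lengths.

```python
def split_read1(seq, barcode_len=16, umi_len=10):
    """Split a Read 1 sequence into (cell barcode, UMI).

    The default lengths are an assumption (16 + 10 = 26 bp); pass other
    values for chemistries with a different Read 1 layout.
    """
    expected = barcode_len + umi_len
    if len(seq) < expected:
        raise ValueError(
            f"Read 1 is {len(seq)} bp; expected at least {expected} bp"
        )
    barcode = seq[:barcode_len]          # first bases: cell barcode
    umi = seq[barcode_len:expected]      # following bases: UMI
    return barcode, umi

read1 = "ACGTACGTACGTACGT" + "TTGCAATGCC"   # toy 16 bp barcode + 10 bp UMI
barcode, umi = split_read1(read1)
```

Read 2 carries the transcript sequence, so the barcode and UMI extracted here are what tie each mapped Read 2 back to a cell and a molecule.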
- Raw read processing
Available tools:
- 10x Cell Ranger: Reads -> Transcriptome/genome mapping -> Barcode/UMI filtering + correction -> UMI deduplication -> Cell filtering -> Expression Matrix
- UMI-tools: Reads -> Barcode extraction -> Mapping (external aligner) -> Counting + UMI deduplication -> Expression Matrix
- Kallisto/BUStools: Reads -> Pseudoalign -> Barcode correction -> Sort -> Count -> Expression Matrix
- STARsolo: Reads -> Barcode extraction + correction -> Genome mapping -> UMI correction + deduplication -> Count -> Expression Matrix
- Alevin-fry: Reads -> Barcode extraction + frequency + correction -> Transcriptome mapping -> UMI correction + deduplication -> Count -> Expression Matrix
Alevin-fry paper to read
- Basic Quality Control
Assumption: Each droplet contains mRNAs from a single cell.
- Real single cells
a. Cell barcode correction
b. Empty droplet detection
c. Doublet detection
d. Removal of low-quality cells
- Real transcripts
f. UMI deduplication & resolution
g. Removal of ambient mRNAs
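UMI deduplication can be illustrated with a small sketch that collapses reads into molecule counts. It assumes reads have already been reduced to hypothetical `(barcode, umi, gene)` records and treats only exact duplicates as the same molecule; the tools listed earlier additionally correct sequencing errors in UMIs and resolve UMIs that map to several genes, which this example does not attempt.

```python
from collections import defaultdict

def count_molecules(records):
    """Collapse (barcode, umi, gene) read records into a count matrix.

    Reads sharing the same barcode, UMI and gene are treated as PCR
    duplicates of one molecule, so each unique triple is counted once.
    """
    unique_molecules = set(records)      # exact-match UMI deduplication
    matrix = defaultdict(lambda: defaultdict(int))
    for barcode, _umi, gene in unique_molecules:
        matrix[barcode][gene] += 1
    # Convert nested defaultdicts to plain dicts for a clean result.
    return {bc: dict(genes) for bc, genes in matrix.items()}

reads = [
    ("AAAC", "TTGA", "Cd3e"),
    ("AAAC", "TTGA", "Cd3e"),   # PCR duplicate of the read above
    ("AAAC", "GGCA", "Cd3e"),   # different UMI: a second Cd3e molecule
    ("AAAC", "TTGA", "Actb"),   # same UMI, different gene: separate molecule
    ("AAAG", "CCAT", "Actb"),
]
matrix = count_molecules(reads)
```

Five reads collapse to four molecules here: the duplicated `Cd3e` read contributes one count rather than two.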
The barcode rank plot
Doublet detection
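The data behind a barcode rank plot is simply the barcodes sorted by total UMI count. The sketch below builds that ranking and then applies a deliberately naive cut-off, the largest drop in log10 count between consecutive ranks. That heuristic and the names `barcode_ranks` and `largest_drop_cutoff` are assumptions made for illustration; real tools use more robust knee or empty-droplet tests.

```python
import math

def barcode_ranks(umi_totals):
    """Return [(rank, barcode, total)] sorted by descending UMI total."""
    ordered = sorted(umi_totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, bc, total)
            for rank, (bc, total) in enumerate(ordered, start=1)]

def largest_drop_cutoff(ranked):
    """Hypothetical heuristic: cut where log10(total) falls most steeply
    between consecutive ranks. Returns the number of barcodes to keep."""
    best_rank, best_drop = len(ranked), 0.0
    for (rank, _, hi), (_, _, lo) in zip(ranked, ranked[1:]):
        drop = math.log10(hi + 1) - math.log10(lo + 1)
        if drop > best_drop:
            best_rank, best_drop = rank, drop
    return best_rank

totals = {
    "AAAC": 5200, "AAAG": 4800, "AACA": 4100,   # cell-like barcodes
    "AAAT": 35, "AACC": 20, "AACG": 12,         # background-like barcodes
}
ranked = barcode_ranks(totals)
n_cells = largest_drop_cutoff(ranked)
```

Plotting rank against total on log-log axes gives the familiar curve; barcodes above the steep drop are the candidate cells, and those below are candidate empty droplets.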
- Normalisation
Non-uniform variance because of sampling effects -> Divide each cell's counts by that cell's size factor
Non-uniform variance because of heteroskedasticity -> Apply a variance-stabilising transformation (e.g. shifted log)
The goal is to retain only biologically relevant heterogeneity.
Many statistical methods assume uniform variance in the data.
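The two normalisation steps above can be sketched together. This is a minimal example under stated assumptions: the size factor is taken to be a cell's total count divided by the mean total across cells, the shifted log is log(x + 1), and the function name `shifted_log_normalise` and the `{barcode: {gene: count}}` layout are hypothetical.

```python
import math

def shifted_log_normalise(counts):
    """Size-factor normalisation followed by log(x + 1).

    counts: {barcode: {gene: raw count}}. Assumes every cell has a
    non-zero total, i.e. empty barcodes were filtered out beforehand.
    """
    totals = {bc: sum(genes.values()) for bc, genes in counts.items()}
    mean_total = sum(totals.values()) / len(totals)
    normalised = {}
    for bc, genes in counts.items():
        size_factor = totals[bc] / mean_total   # > 1 for deeply sequenced cells
        normalised[bc] = {
            g: math.log1p(n / size_factor) for g, n in genes.items()
        }
    return normalised

raw = {
    "cell_1": {"Actb": 10, "Cd3e": 10},   # total 20
    "cell_2": {"Actb": 30, "Cd3e": 30},   # total 60: same composition, 3x depth
}
norm = shifted_log_normalise(raw)
```

The two toy cells differ only in sequencing depth, so after dividing by the size factor they end up with identical values, which is the point of the first step. The log transform then compresses large counts to reduce their variance.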
Remove non-informative genes that might not convey any biologically relevant variation
Using deviance, identify highly-variable genes
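Deviance-based feature selection can be sketched as follows. The example computes a per-gene binomial deviance against a null model in which each gene makes up the same proportion of every cell; genes with high deviance depart most from that null and are the candidates to keep. The function name `binomial_deviance` and the data layout are assumptions, and a tested library implementation is preferable for real data.

```python
import math

def binomial_deviance(counts):
    """Per-gene binomial deviance under a constant-proportion null model.

    counts: {barcode: {gene: raw count}}. Returns {gene: deviance}.
    """
    totals = {bc: sum(genes.values()) for bc, genes in counts.items()}
    grand_total = sum(totals.values())
    gene_sums = {}
    for genes in counts.values():
        for g, y in genes.items():
            gene_sums[g] = gene_sums.get(g, 0) + y

    def xlogx_ratio(x, expected):
        # x * log(x / expected), with the convention 0 * log(0) = 0
        return x * math.log(x / expected) if x > 0 else 0.0

    deviance = {}
    for g, gene_sum in gene_sums.items():
        pi = gene_sum / grand_total       # pooled proportion of gene g
        d = 0.0
        for bc, n in totals.items():
            y = counts[bc].get(g, 0)
            d += xlogx_ratio(y, n * pi)               # observed vs expected
            d += xlogx_ratio(n - y, n * (1 - pi))     # all other genes
        deviance[g] = 2 * d
    return deviance

raw = {
    "c1": {"Actb": 50, "Cd3e": 40, "Cd8a": 10},
    "c2": {"Actb": 50, "Cd3e": 10, "Cd8a": 40},
    "c3": {"Actb": 50, "Cd3e": 25, "Cd8a": 25},
}
dev = binomial_deviance(raw)
top_genes = sorted(dev, key=dev.get, reverse=True)[:2]
```

`Actb` is half of every toy cell, so its deviance is zero and it would be discarded as uninformative, while `Cd3e` and `Cd8a` vary between cells and rank highest.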
- File formats and tools
Data formats:
Features as the rows (genes.tsv)
Samples as the columns (barcodes.tsv)
Metadata (se)
- Python toolkit for single-cell analysis
Preprocessing: scanpy.pp
Tools: scanpy.tl
Plotting: scanpy.pl
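A rough sketch of how the three Scanpy modules fit together in one workflow. The wrapper name `run_basic_pipeline` and every threshold (`min_genes=200`, `min_cells=3`, `target_sum=1e4`, `n_top_genes=2000`) are illustrative assumptions rather than recommendations, and the function expects an AnnData object that has already been loaded.

```python
def run_basic_pipeline(adata, n_top_genes=2000):
    """Run a conventional Scanpy workflow on an AnnData object in place."""
    import scanpy as sc  # imported lazily so the sketch loads without Scanpy

    # Preprocessing (scanpy.pp)
    sc.pp.filter_cells(adata, min_genes=200)    # illustrative threshold
    sc.pp.filter_genes(adata, min_cells=3)      # illustrative threshold
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)

    # Tools (scanpy.tl)
    sc.tl.leiden(adata)   # clustering; needs the optional leidenalg package
    sc.tl.umap(adata)

    # Plotting (scanpy.pl)
    sc.pl.umap(adata, color="leiden")
    return adata
```

An AnnData object for this function could come from `scanpy.read_10x_mtx`, pointed at a directory holding the matrix, genes/features and barcodes files. The resulting cluster labels are stored in `adata.obs["leiden"]`.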
Single-cell best practices
Best practices for single-cell analysis across modalities
“Pre-processing of 10X Single-Cell RNA Datasets” by Galaxy Training
Exercise dataset:
Bacon, W. A., R. S. Hamilton, Z. Yu, J. Kieckbusch, D. Hawkes et al., 2018. Single-Cell Analysis Identifies Thymic Maturation Delay in Growth-Restricted Neonatal Mice. Frontiers in Immunology 9. doi: 10.3389/fimmu.2018.02523
Objective
Generate a cell × gene matrix for droplet-based single-cell sequencing data
Perform quality control (QC) on the generated matrix
Normalise the counts so that the values in the matrix are in a form that is appropriate for downstream analyses
Identify and retain informative features in the count matrix
Steps
- Create a “spliced + intronic” (splici) reference
- Map the raw scRNA-seq reads to the reference
- Generate a permit list of barcodes to include in the count matrix
- Collate permitted barcodes
- Perform quantification, output count matrix
- Remove empty droplets (optional)
- Filter low-quality reads
- Correct for ambient RNA
- Remove doublets
- Normalise the dataset
- Perform feature selection
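The permit-list and collation steps in the list above can be illustrated with a small sketch. It assumes a plain frequency threshold for building the permit list and rescues a non-permitted barcode only when exactly one permitted barcode is a single mismatch away. Both rules, and the names `build_permit_list`, `hamming` and `collate`, are hypothetical simplifications; the mapping tools themselves offer more refined strategies.

```python
from collections import Counter

def build_permit_list(barcode_counts, min_reads):
    """Keep barcodes observed at least min_reads times (hypothetical rule)."""
    return {bc for bc, n in barcode_counts.items() if n >= min_reads}

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def collate(barcode_counts, permit_list):
    """Map every observed barcode to a permitted barcode, or None.

    A non-permitted barcode is rescued only if exactly one permitted
    barcode lies within one mismatch of it; ambiguous or distant
    barcodes are discarded (mapped to None).
    """
    mapping = {}
    for bc in barcode_counts:
        if bc in permit_list:
            mapping[bc] = bc
            continue
        near = [p for p in permit_list
                if len(p) == len(bc) and hamming(p, bc) == 1]
        mapping[bc] = near[0] if len(near) == 1 else None
    return mapping

observed = Counter({"AAAA": 120, "CCCC": 95, "AAAT": 3, "ACCA": 2, "GGGG": 1})
permit = build_permit_list(observed, min_reads=50)
mapping = collate(observed, permit)
```

In the toy data `AAAT` is one mismatch from `AAAA` and is corrected to it, while `ACCA` and `GGGG` are too far from any permitted barcode and are dropped before quantification.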