scRNA-seq analysis

scRNA-seq analysis using Python

scRNA-seq best practices

Aim: Analyze the data to determine:

Number of cells
Number of cell clusters (generate a cluster map)
Disease-specific clusters
Disease-specific transcript signatures

Steps

Examine the data
Demultiplexing your data
Generate a cell matrix
- reads that carry the same barcode come from the same cell
Filter the cells
- remove cell barcodes that appear fewer than x times (too few transcripts)
- consider whether to cap the highest number of transcripts (doublets and multiplets carry more transcripts)
- background: set a cut-off point, e.g. the minimum number of genes or transcripts required to call a barcode a cell
Filter the genes
- remove any genes that appear fewer than x times, because they are less informative
Normalization
- normalisation corrects for differences in total transcript abundance between cells
Find highly variable genes (Feature selection)
- some genes vary little between cells, and carrying forward the full cells × genes matrix makes computation expensive. Standard pipelines therefore keep only the genes that vary significantly between cells.
Scale data (optional, e.g., cell cycle regression)
- This step is not always performed, but it can make samples sequenced at different depths easier to compare. Scaling puts the variation of each gene on a comparable footing; otherwise, genes with strong expression differences dominate the analysis and hide subtler differences in other genes. At this step you can also optionally ‘regress out’ genes, meaning their variation will not contribute to cluster calling.
Dimensionality reduction
Identify cell clusters
- group the cells by their transcript signatures
Plot your cells
- Select your Cluster Plot here
- Similar cells will be plotted close together
Interpret the results
- were there any cells you couldn't classify?
- how many total cells did you find?
- how many cell types (clusters) are in your final map?
- how did you interpret the results?
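
The cell- and gene-filtering steps above can be sketched in plain Python on a toy count matrix. This is a minimal illustration, not a recommended pipeline: the function name `filter_matrix` and the thresholds `min_counts`, `max_counts` and `min_cells` are hypothetical and would normally be chosen per dataset.

```python
def filter_matrix(counts, min_counts, max_counts, min_cells):
    """Filter a cells x genes count matrix (one row per cell).

    Hypothetical thresholds:
      min_counts: cells with fewer total transcripts are dropped (likely background)
      max_counts: cells with more total transcripts are dropped (possible doublets)
      min_cells:  genes detected in fewer cells are dropped (less informative)
    Returns (filtered matrix, kept cell indices, kept gene indices).
    """
    # 1) Filter cells on their total transcript count.
    kept_cells = [i for i, row in enumerate(counts)
                  if min_counts <= sum(row) <= max_counts]
    # 2) Filter genes on the number of remaining cells in which they are detected.
    n_genes = len(counts[0]) if counts else 0
    kept_genes = [g for g in range(n_genes)
                  if sum(1 for i in kept_cells if counts[i][g] > 0) >= min_cells]
    filtered = [[counts[i][g] for g in kept_genes] for i in kept_cells]
    return filtered, kept_cells, kept_genes


toy = [
    [5, 0, 3, 0],    # cell 0: 8 transcripts
    [1, 0, 0, 0],    # cell 1: 1 transcript (too few)
    [40, 1, 30, 0],  # cell 2: 71 transcripts (above the cap)
    [4, 0, 6, 1],    # cell 3: 11 transcripts
]
filtered, cells, genes = filter_matrix(toy, min_counts=3, max_counts=50, min_cells=2)
```

Filtering cells before genes means the gene filter only counts cells that survived QC; the opposite order is also possible and can give slightly different results.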

Processing data

Raw reads to expression matrix

The first step is to align the raw reads to the genome to obtain an expression matrix (EM).

Droplet-based scRNA-seq (e.g. 10x v3): droplet libraries produce reads with cell barcodes and UMIs (unique molecular identifiers).

Read 1: 10x cell barcode + UMI (26 bp, i.e. a 16 bp barcode + 10 bp UMI as in v2 chemistry; v3 uses a 12 bp UMI, giving 28 bp)
Read 2: 3' or 5' transcript (98 bp)
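
To make the Read 1 layout concrete, here is a small sketch that splits a Read 1 sequence into its cell barcode and UMI. The default lengths (16 bp barcode, 10 bp UMI) are an assumption chosen to add up to the 26 bp quoted above; other chemistries use different UMI lengths, so both are parameters. The function name `split_read1` and the example sequence are made up for illustration.

```python
def split_read1(seq, barcode_len=16, umi_len=10):
    """Split a Read 1 sequence into (cell barcode, UMI).

    Assumes the barcode comes first and the UMI follows immediately;
    the default lengths are an assumption summing to 26 bp.
    """
    if len(seq) < barcode_len + umi_len:
        raise ValueError("Read 1 is shorter than barcode + UMI")
    barcode = seq[:barcode_len]
    umi = seq[barcode_len:barcode_len + umi_len]
    return barcode, umi


# Made-up 26 bp Read 1: 16 bp barcode followed by 10 bp UMI.
read1 = "AAACCTGAGAAACCAT" + "GCTTAGCATT"
barcode, umi = split_read1(read1)
```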

Raw read processing

Available tools:

10x Cell Ranger: Reads -> Transcriptome/genome mapping -> Barcode/UMI filtering + correction -> UMI deduplication -> Cell filtering -> Expression Matrix
UMI-tools: Reads -> Barcode extraction -> -> Counting + UMI deduplication -> Expression Matrix
Kallisto/BUStools: Reads -> Pseudoalign -> Barcode correction -> Sort -> Count -> Expression Matrix
STARsolo: Reads -> Barcode extraction + correction -> Genome mapping -> UMI correction + deduplication -> Count -> Expression Matrix
Alevin-fry: Reads -> Barcode extraction + frequency + correction -> Transcriptome mapping -> UMI correction + deduplication -> Count -> Expression Matrix
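
All of the pipelines above end with UMI deduplication and counting. The sketch below shows the idea in its simplest form: reads that share the same cell barcode, gene and UMI are treated as PCR duplicates of one molecule and counted once. It is not how any of the listed tools is implemented; real tools also correct sequencing errors in UMIs, which this exact-match version ignores. The function name `count_umis` and the records are hypothetical.

```python
from collections import defaultdict


def count_umis(records):
    """Collapse reads into UMI counts.

    records: iterable of (cell_barcode, umi, gene) tuples, one per mapped read.
    Returns {cell_barcode: {gene: number_of_unique_umis}}.
    """
    # Collect the set of distinct UMIs seen for each (barcode, gene) pair.
    seen = defaultdict(set)
    for barcode, umi, gene in records:
        seen[(barcode, gene)].add(umi)

    # The count for a (barcode, gene) pair is the number of distinct UMIs.
    matrix = defaultdict(dict)
    for (barcode, gene), umis in seen.items():
        matrix[barcode][gene] = len(umis)
    return dict(matrix)


records = [
    ("AAAC", "TTGA", "GeneA"),
    ("AAAC", "TTGA", "GeneA"),  # duplicate of the read above
    ("AAAC", "CCGT", "GeneA"),
    ("AAAC", "TTGA", "GeneB"),
    ("GGGT", "TTGA", "GeneA"),
]
matrix = count_umis(records)
```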

Alevin-fry paper to read

Basic Quality Control

Assumption: Each droplet contains mRNAs from a single cell.

Real single cells
a. Cell barcode correction
b. Empty droplet detection
c. Doublet detection
d. Removal of low-quality cells
Real transcripts
f. UMI deduplication & resolution
g. Removal of ambient mRNAs
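
Cell barcode correction (step a) is often described as rescuing barcodes that sit one mismatch away from a known barcode. The sketch below implements that simplified rule, assuming a whitelist of valid barcodes is available. Real tools may additionally use base qualities or barcode frequencies, which are left out here, and `correct_barcode` is a hypothetical helper name.

```python
def correct_barcode(barcode, whitelist):
    """Return the corrected barcode, or None if it cannot be rescued.

    A barcode on the whitelist is returned unchanged. Otherwise it is
    corrected only if exactly one whitelist barcode lies at Hamming
    distance 1; ambiguous or distant barcodes are discarded.
    """
    if barcode in whitelist:
        return barcode
    candidates = set()
    for i in range(len(barcode)):
        for base in "ACGT":
            if base != barcode[i]:
                # Substitute one base and check the whitelist.
                candidate = barcode[:i] + base + barcode[i + 1:]
                if candidate in whitelist:
                    candidates.add(candidate)
    return candidates.pop() if len(candidates) == 1 else None


whitelist = {"AAAA", "CCCC", "AATT"}  # toy 4 bp barcodes
exact = correct_barcode("CCCC", whitelist)
rescued = correct_barcode("CCCG", whitelist)
ambiguous = correct_barcode("AAAT", whitelist)  # 1 mismatch from AAAA and AATT
unknown = correct_barcode("GGGG", whitelist)
```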

The barcode rank plot

Doublet detection
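
A barcode rank plot orders barcodes by their total UMI count and plots count against rank, usually on log-log axes; cell-containing droplets tend to sit above a sharp drop. The sketch below computes the ranked values and picks a threshold at the largest drop in log10 count between consecutive ranks. That threshold rule is a deliberately crude, hypothetical heuristic for illustration only; dedicated empty-droplet methods are more careful. All names and numbers are made up.

```python
import math


def barcode_rank(total_counts):
    """Return (rank, count) pairs sorted by descending count; rank starts at 1."""
    ordered = sorted(total_counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))


def largest_drop_threshold(ranked):
    """Hypothetical heuristic: count just above the largest log10 drop."""
    best_gap, threshold = 0.0, None
    for (_, high), (_, low) in zip(ranked, ranked[1:]):
        if low <= 0:
            break  # log10 is undefined for zero counts
        gap = math.log10(high) - math.log10(low)
        if gap > best_gap:
            best_gap, threshold = gap, high
    return threshold


totals = {"bc1": 12000, "bc2": 9500, "bc3": 8000, "bc4": 150, "bc5": 90, "bc6": 40}
ranked = barcode_rank(totals)
threshold = largest_drop_threshold(ranked)
cells = [bc for bc, count in totals.items() if count >= threshold]
```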

Normalisation

Non-uniform variance because of sampling effects -> Divide each cell by its size factor

Non-uniform variance because of heteroskedasticity -> Apply a variance-stabilising transformation (e.g. shifted log)

The goal is to retain only biologically relevant heterogeneity.

Many statistical methods assume uniform variance in the data.
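
The two steps above (size-factor scaling, then a shifted log) can be sketched as follows. This assumes one common definition of the size factor, a cell's total count divided by the mean total count across cells, and the shifted log `log(y / s + 1)`; other definitions exist. It also assumes empty cells were already filtered out, since a zero total would give a zero size factor.

```python
import math


def shifted_log_normalise(counts):
    """Normalise a cells x genes count matrix (one row per cell).

    Size factor s_c = total count of cell c / mean total count (assumed definition).
    Each count y is transformed to log(y / s_c + 1).
    Returns (normalised matrix, size factors).
    """
    totals = [sum(row) for row in counts]
    mean_total = sum(totals) / len(totals)
    size_factors = [t / mean_total for t in totals]
    normalised = [[math.log1p(y / s) for y in row]
                  for row, s in zip(counts, size_factors)]
    return normalised, size_factors


# Two cells with the same composition but a 3x difference in depth.
toy = [
    [10, 0, 10],
    [30, 0, 30],
]
normalised, size_factors = shifted_log_normalise(toy)
```

After normalisation the two toy cells have the same values, because they differ only in sequencing depth.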

Feature selection

Remove non-informative genes that might not convey any biologically relevant variation

Using deviance, identify highly-variable genes
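
As a sketch of deviance-based selection, the code below computes a binomial deviance per gene: each gene is compared against a null model in which it takes a constant fraction of every cell's total count. The source does not say which deviance is meant, so the binomial form is an assumption here. Genes that follow the null model get a deviance near zero; genes that deviate from it score higher and would be ranked as more informative.

```python
import math


def _term(observed, expected):
    """observed * log(observed / expected), with 0 * log(0) taken as 0."""
    return 0.0 if observed == 0 else observed * math.log(observed / expected)


def binomial_deviance(counts):
    """Per-gene binomial deviance for a cells x genes count matrix.

    For gene j with overall fraction pi_j and cell totals n_i:
    D_j = 2 * sum_i [ y_ij log(y_ij / (n_i pi_j))
                      + (n_i - y_ij) log((n_i - y_ij) / (n_i (1 - pi_j))) ]
    """
    totals = [sum(row) for row in counts]
    grand_total = sum(totals)
    deviances = []
    for g in range(len(counts[0])):
        pi = sum(row[g] for row in counts) / grand_total
        d = 0.0
        for row, n in zip(counts, totals):
            y = row[g]
            d += _term(y, n * pi) + _term(n - y, n * (1 - pi))
        deviances.append(2 * d)
    return deviances


toy_counts = [
    [5, 10, 5],    # total 20
    [10, 0, 30],   # total 40
    [5, 10, 5],    # total 20
    [10, 0, 30],   # total 40
]
deviances = binomial_deviance(toy_counts)
# Rank genes from most to least deviant.
ranking = sorted(range(len(deviances)), key=lambda g: deviances[g], reverse=True)
```

In the toy matrix, gene 0 is always a quarter of each cell's total, so it carries no information beyond sequencing depth and ranks last.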

File formats and tools

Data formats:

- Features as the rows (genes.tsv)
- Samples as the columns (barcodes.tsv)
- metadata(se)
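
The genes.tsv and barcodes.tsv files usually accompany a sparse count matrix stored in Matrix Market coordinate format. That companion file is an assumption here, since it is not named above. The sketch below parses all three from in-memory lines so it runs without any files; `read_mex` and the gene IDs and barcodes are made up.

```python
def read_mex(matrix_lines, gene_lines, barcode_lines):
    """Parse a features x barcodes sparse matrix.

    matrix_lines: Matrix Market coordinate lines ('%' lines are comments,
                  then 'n_rows n_cols n_entries', then 1-based 'row col value').
    gene_lines:   tab-separated lines; the first column is the gene ID (rows).
    barcode_lines: one cell barcode per line (columns).
    Returns (genes, barcodes, {(gene, barcode): count}).
    """
    genes = [line.split("\t")[0].strip() for line in gene_lines if line.strip()]
    barcodes = [line.strip() for line in barcode_lines if line.strip()]

    data = [line for line in matrix_lines
            if line.strip() and not line.startswith("%")]
    n_rows, n_cols, _n_entries = map(int, data[0].split())
    if (n_rows, n_cols) != (len(genes), len(barcodes)):
        raise ValueError("matrix shape does not match genes/barcodes")

    counts = {}
    for line in data[1:]:
        row, col, value = line.split()
        # Indices in the file are 1-based.
        counts[(genes[int(row) - 1], barcodes[int(col) - 1])] = int(value)
    return genes, barcodes, counts


matrix_lines = [
    "%%MatrixMarket matrix coordinate integer general",
    "3 2 3",
    "1 1 4",
    "3 1 1",
    "2 2 7",
]
gene_lines = ["ENSG_A\tGeneA", "ENSG_B\tGeneB", "ENSG_C\tGeneC"]
barcode_lines = ["AAAC-1", "GGGT-1"]
genes, barcodes, counts = read_mex(matrix_lines, gene_lines, barcode_lines)
```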


Scanpy: Python toolkit for single-cell analysis

- Preprocessing: scanpy.pp
- Tools: scanpy.tl
- Plotting: scanpy.pl

Single-cell best practices
Best practices for single-cell analysis across modalities
“Pre-processing of 10X Single-Cell RNA Datasets” by Galaxy Training

exercise dataset:

Bacon, W. A., R. S. Hamilton, Z. Yu, J. Kieckbusch, D. Hawkes et al., 2018 Single-Cell Analysis Identifies Thymic Maturation Delay in Growth-Restricted Neonatal Mice. Frontiers in Immunology 9: 10.3389/fimmu.2018.02523

Objective

Generate a cell × gene matrix for droplet-based single-cell sequencing data
Perform quality control (QC) on the generated matrix
Normalise the counts so that the values in the matrix are in a form that is appropriate for downstream analyses
Identify and retain informative features in the count matrix

Steps

Create a “spliced + intronic” (splici) reference
Map the raw scRNA-seq reads to the reference
Generate a permit list of barcodes to include in the count matrix
Collate permitted barcodes
Perform quantification, output count matrix
Remove empty droplets (optional)
Filter low-quality reads
Correct for ambient RNA
Remove doublets
Normalise the dataset
Perform feature selection
