scRNA seq analysis

scRNA-seq analysis using Python

scRNA best practice

Aim: Analyze the data to determine

  • Number of cells
  • Number of cell clusters (generate a cluster map)
  • Disease-specific clusters
  • Disease specific transcript signatures

Steps

  1. Examine the data

  2. Demultiplexing your data

  3. Generate a cell matrix

    • reads that carry the same barcode come from the same cell
  4. Filter the cells

    • remove cell barcodes that appear fewer than x times (too few transcripts)
    • consider whether to put a cap on the highest number of transcripts (doublets and multiplets have more transcripts)
    • background information (set a cut-off point, e.g., the minimum number of genes or transcripts required to define a cell)
  5. Filter the genes

    • remove any genes that appear fewer than x times, since they are less informative
  6. Normalization

    • normalisation corrects for differences in overall transcript abundance (e.g. sequencing depth) between cells
  7. Find highly variable genes (Feature selection)

    • some genes don't vary much between cells, and carrying forward a matrix of size cells * genes can make computation a bit difficult. Standard pipelines only take into account genes that vary significantly.
  8. Scale data (optional, e.g., cell cycle regression)

    • This step is not always performed, although it can make it easier to compare samples sequenced to different depths. It scales the variation between genes to make them more easily comparable (otherwise, genes with strong expression differences will dominate the analysis, hiding subtle differences from other genes). With this step, you can also optionally ‘regress out’ unwanted sources of variation (e.g. cell-cycle genes), so that their variation does not contribute to cluster calling.
  9. Dimensionality reduction

  10. Identify cell clusters

    • group the cells by their transcript signatures
  11. Plot your cells

    • Select your Cluster Plot here
    • Similar cells will be plotted close together
  12. Interpret the results

    • were there any cells you couldn't classify?
    • how many total cells did you find?
    • how many cell types (clusters) are in your final map?
    • how did you interpret the results?
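
Steps 4, 5, 8 and 9 above can be sketched with plain NumPy. This is a minimal illustration on a simulated matrix, not a replacement for a full pipeline: the function names, the thresholds (50, 500, 3) and the toy data are all assumptions made for the example, and normalisation is reduced to a bare log1p here.

```python
import numpy as np

def filter_cells(counts, min_counts, max_counts=None):
    """Keep cells (rows) whose total count lies in [min_counts, max_counts]."""
    totals = counts.sum(axis=1)
    keep = totals >= min_counts
    if max_counts is not None:
        keep &= totals <= max_counts  # optional cap against doublets/multiplets
    return counts[keep], keep

def filter_genes(counts, min_cells):
    """Keep genes (columns) detected in at least min_cells cells."""
    keep = (counts > 0).sum(axis=0) >= min_cells
    return counts[:, keep], keep

def scale(x):
    """Centre each gene to mean 0 and scale it to unit variance."""
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant genes at 0 instead of dividing by 0
    return (x - x.mean(axis=0)) / sd

def pca(x, n_components):
    """Project cells onto the leading principal components via SVD."""
    centred = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(300, 100))  # toy matrix: 300 cells x 100 genes
counts[:20] = 0                             # simulate 20 empty barcodes

cells, kept_cells = filter_cells(counts, min_counts=50, max_counts=500)
cells, kept_genes = filter_genes(cells, min_cells=3)
scaled = scale(np.log1p(cells))             # log1p stands in for normalisation
embedding = pca(scaled, n_components=10)
print(cells.shape, embedding.shape)
```

The embedding (one row per cell) is what a clustering or plotting step would take as input.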

Processing data

Raw reads to expression matrix

The first step is to align the raw reads to the genome to obtain an Expression Matrix (EM).

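Droplet-based scRNA-seq (e.g. 10x v3): droplet libraries produce data with cell barcodes + UMIs (unique molecular identifiers).

Read 1: 10x cell barcode + UMI (26 bp)
Read 2: 3' or 5' transcript (98 bp)

Note that 26 bp corresponds to the 10x v2 layout (16 bp barcode + 10 bp UMI); the v3 chemistry uses a 12 bp UMI, so its Read 1 is 28 bp.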
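
As a small illustration of the read layout, the sketch below splits a Read 1 sequence into its cell barcode and UMI. It assumes a 16 bp barcode followed by a 10 bp UMI, which adds up to the 26 bp quoted above; the function name and the example sequence are made up for the illustration.

```python
BARCODE_LEN = 16  # assumed cell-barcode length
UMI_LEN = 10      # assumed UMI length; 16 + 10 = 26 bp

def split_read1(seq, barcode_len=BARCODE_LEN, umi_len=UMI_LEN):
    """Split a Read 1 sequence into (cell barcode, UMI)."""
    if len(seq) < barcode_len + umi_len:
        raise ValueError("read 1 is shorter than barcode + UMI")
    barcode = seq[:barcode_len]
    umi = seq[barcode_len:barcode_len + umi_len]
    return barcode, umi

read1 = "ACGTACGTACGTACGT" + "TTTGGGCCCA"  # hypothetical 26 bp read
barcode, umi = split_read1(read1)
print(barcode, umi)
```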

  • Raw read processing

Available tools:

  • 10x Cell Ranger: Reads -> Transcriptome/genome mapping -> Barcode/UMI filtering + correction -> UMI deduplication -> Cell filtering -> Expression Matrix

  • UMI-tools: Reads -> Barcode extraction -> Mapping (with an external aligner) -> Counting + UMI deduplication -> Expression Matrix

  • Kallisto/BUStools: Reads -> Pseudoalign -> Barcode correction -> Sort -> Count -> Expression Matrix

  • STARsolo: Reads -> Barcode extraction + correction -> Genome mapping -> UMI correction + deduplication -> Count -> Expression Matrix

  • Alevin-fry: Reads -> Barcode extraction + frequency + correction -> Transcriptome mapping -> UMI correction + deduplication -> Count -> Expression Matrix

Alevin-fry paper to read
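
All of the pipelines above end with UMI deduplication and counting. The sketch below shows the simplest possible version of that last step: reads sharing the same cell barcode, gene and UMI are collapsed to one molecule. It only handles exact UMI matches, whereas the real tools also correct sequencing errors in UMIs; the function name, barcodes, UMIs and gene labels are invented for the example.

```python
from collections import defaultdict

def count_umis(records):
    """Collapse (barcode, umi, gene) records into {barcode: {gene: n_unique_umis}}."""
    seen = defaultdict(set)
    for barcode, umi, gene in records:
        seen[(barcode, gene)].add(umi)  # duplicates of a UMI collapse in the set
    matrix = defaultdict(dict)
    for (barcode, gene), umis in seen.items():
        matrix[barcode][gene] = len(umis)
    return dict(matrix)

records = [
    ("AAAC", "TTGA", "Cd4"),
    ("AAAC", "TTGA", "Cd4"),   # PCR duplicate: same barcode, UMI and gene
    ("AAAC", "GGCA", "Cd4"),
    ("AAAC", "TTGA", "Cd8a"),  # same UMI but another gene: a separate molecule
    ("CCGT", "ACAC", "Cd4"),
]
matrix = count_umis(records)
print(matrix)
```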

  • Basic Quality Control

Assumption: Each droplet contains mRNAs from a single cell.

  • Real single cells
    a. Cell barcode correction
    b. Empty droplet detection
    c. Doublet detection
    d. Removal of low-quality cells
  • Real transcripts
    e. UMI deduplication & resolution
    f. Removal of ambient mRNAs
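
Cell barcode correction can be illustrated with a toy example: an observed barcode is accepted if it is on the permit list, rescued if exactly one permitted barcode lies within one mismatch, and discarded otherwise. This is a simplified sketch of the general idea, not the algorithm of any specific tool, and the 4 bp barcodes and function names are assumptions for readability.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def correct_barcode(observed, permit_list):
    """Return the permitted barcode for `observed`, or None if absent or ambiguous."""
    if observed in permit_list:
        return observed
    hits = [bc for bc in permit_list
            if len(bc) == len(observed) and hamming(bc, observed) == 1]
    return hits[0] if len(hits) == 1 else None

permit_list = {"AAAA", "CCCC", "AATT"}
exact = correct_barcode("AAAA", permit_list)      # already permitted
rescued = correct_barcode("CCCG", permit_list)    # one mismatch from CCCC
ambiguous = correct_barcode("AAAT", permit_list)  # one mismatch from AAAA and AATT
unknown = correct_barcode("GGGG", permit_list)    # nothing within one mismatch
print(exact, rescued, ambiguous, unknown)
```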

Figures: the barcode rank plot; doublet detection.
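
The data behind a barcode rank plot are simply the barcodes sorted by descending UMI count. The sketch below builds that ranking and picks a crude "knee" as the point just before the steepest drop in log10 count. Dedicated tools use more robust statistics; this heuristic, the function names and the toy counts are assumptions made for illustration only.

```python
import math

def barcode_rank(counts_per_barcode):
    """Return (barcode, count) pairs sorted by descending UMI count."""
    return sorted(counts_per_barcode.items(), key=lambda kv: kv[1], reverse=True)

def knee_threshold(ranked):
    """Count just above the steepest drop in log10(count) between neighbouring ranks."""
    logs = [math.log10(c) for _, c in ranked if c > 0]
    if len(logs) < 2:
        raise ValueError("need at least two barcodes with non-zero counts")
    drops = [logs[i] - logs[i + 1] for i in range(len(logs) - 1)]
    knee = max(range(len(drops)), key=drops.__getitem__)
    return ranked[knee][1]

umi_counts = {"bc1": 9000, "bc4": 40, "bc2": 8000, "bc6": 12, "bc3": 7500, "bc5": 30}
ranked = barcode_rank(umi_counts)
threshold = knee_threshold(ranked)
cells = [bc for bc, c in ranked if c >= threshold]
print(threshold, cells)
```

Barcodes below the threshold would be treated as empty droplets in this toy setting.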

  • Normalisation

Non-uniform variance because of sampling effects -> Divide each cell by its size factor

Non-uniform variance because of heteroskedasticity -> Apply a variance-stabilising transformation (e.g. shifted log)

The goal is to retain only biologically relevant heterogeneity.

Many statistical methods assume uniform variance in the data.
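
The two normalisation steps above (dividing by a size factor, then applying a shifted log) can be sketched as follows. The size factor is taken here as each cell's total count divided by the mean total across cells, which is one common choice and an assumption of this example; the function name and toy matrix are also illustrative.

```python
import numpy as np

def shifted_log_normalise(counts):
    """Divide each cell (row) by its size factor, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    if np.any(totals == 0):
        raise ValueError("filter out cells with zero counts first")
    size_factors = totals / totals.mean()  # size factors average to 1
    return np.log1p(counts / size_factors), size_factors.ravel()

# Cells 0 and 1 have the same composition but a 10-fold depth difference.
counts = np.array([[10, 0, 30],
                   [1, 0, 3],
                   [20, 20, 40]])
normalised, size_factors = shifted_log_normalise(counts)
print(size_factors)
print(normalised)
```

After normalisation the first two cells have identical profiles, because they differ only in sampling depth.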

Feature selection: remove non-informative genes that are unlikely to convey any biologically relevant variation.

Highly variable genes can be identified using deviance.
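
One way to make the deviance idea concrete is the binomial deviance of each gene against a null model in which the gene makes up the same fraction of counts in every cell. Genes that fit this null poorly (high deviance) are candidates for highly variable genes. The sketch below is a minimal NumPy version under that assumption; the function name and the toy matrix are made up for the example.

```python
import numpy as np

def binomial_deviance(counts):
    """Per-gene binomial deviance against a constant-proportion null model."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)             # total counts per cell
    pi = counts.sum(axis=0, keepdims=True) / n.sum()  # pooled proportion per gene
    with np.errstate(divide="ignore", invalid="ignore"):
        term1 = counts * np.log(counts / (n * pi))
        term2 = (n - counts) * np.log((n - counts) / (n * (1 - pi)))
    # 0 * log(0) evaluates to NaN above; its limit (and contribution) is 0.
    term1 = np.nan_to_num(term1, nan=0.0)
    term2 = np.nan_to_num(term2, nan=0.0)
    return 2 * (term1 + term2).sum(axis=0)

# Gene 0 is constant across cells; genes 1 and 2 are each specific to two cells.
counts = np.array([[10, 10, 0],
                   [10, 10, 0],
                   [10, 0, 10],
                   [10, 0, 10]])
deviance = binomial_deviance(counts)
ranking = np.argsort(deviance)[::-1]  # most informative genes first
print(deviance, ranking)
```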

  • File formats and tools

Data formats:

  • Features as the rows (genes.tsv)
  • Samples as the columns (barcodes.tsv)
  • metadata (se)
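
To show how the row and column files fit together, the sketch below reads gene and barcode labels and fills a dense features x samples table from a sparse coordinate file. Only genes.tsv and barcodes.tsv are named above; the MatrixMarket-style matrix text, the gene IDs, the barcodes and the function names are assumptions made for this example, and everything is read from in-memory strings rather than real files.

```python
import io

def read_first_column(handle):
    """Return the first tab-separated field of every non-empty line."""
    return [line.split("\t")[0].strip() for line in handle if line.strip()]

def read_coordinate_matrix(handle):
    """Parse 'row col value' triplets (1-based) into a dense list of lists."""
    lines = (ln for ln in handle if ln.strip() and not ln.startswith("%"))
    n_rows, n_cols, _ = map(int, next(lines).split())  # size line: rows cols entries
    dense = [[0] * n_cols for _ in range(n_rows)]
    for ln in lines:
        row, col, value = ln.split()
        dense[int(row) - 1][int(col) - 1] = int(value)
    return dense

# Hypothetical file contents.
genes_tsv = io.StringIO("GENE_ID_1\tGeneA\nGENE_ID_2\tGeneB\n")
barcodes_tsv = io.StringIO("AAAC-1\nCCGT-1\nGGTA-1\n")
matrix_txt = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "2 3 3\n"
    "1 1 5\n"
    "2 1 1\n"
    "2 3 7\n"
)

genes = read_first_column(genes_tsv)        # row labels (features)
barcodes = read_first_column(barcodes_tsv)  # column labels (samples)
dense = read_coordinate_matrix(matrix_txt)
value = dense[genes.index("GENE_ID_2")][barcodes.index("GGTA-1")]
print(genes, barcodes, dense, value)
```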

  • Python toolkit for single-cell analysis: scanpy

Preprocessing: scanpy.pp

Tools: scanpy.tl

Plotting: scanpy.pl

Single-cell best practices
Best practices for single-cell analysis across modalities
“Pre-processing of 10X Single-Cell RNA Datasets” by Galaxy Training

exercise dataset:

Bacon, W. A., R. S. Hamilton, Z. Yu, J. Kieckbusch, D. Hawkes et al., 2018 Single-Cell Analysis Identifies Thymic Maturation Delay in Growth-Restricted Neonatal Mice. Frontiers in Immunology 9: 10.3389/fimmu.2018.02523

Objective

Generate a cell-by-gene matrix for droplet-based single-cell sequencing data
Perform quality control (QC) on the generated matrix
Normalise the counts so that the values in the matrix are in a form that is appropriate for downstream analyses
Identify and retain informative features in the count matrix

Steps

  1. Create a “spliced + introns” (splici) reference
  2. Map the raw scRNA-seq reads to the reference
  3. Generate a permit list of barcodes to include in the count matrix
  4. Collate permitted barcodes
  5. Perform quantification, output count matrix
  6. Remove empty droplets (optional)
  7. Filter low-quality cells
  8. Correct for ambient RNA
  9. Remove doublets
  10. Normalise the dataset
  11. Perform feature selection
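
For the low-quality filtering in step 7, one common heuristic is to flag values of a QC metric that lie more than a few median absolute deviations (MADs) from the median, instead of fixing a hard cut-off. The sketch below applies this to log-transformed total counts; the function name, the choice of 5 MADs and the toy counts are assumptions for illustration.

```python
import numpy as np

def is_outlier(values, n_mads=5.0):
    """Flag values further than n_mads median absolute deviations from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return np.abs(values - median) > n_mads * mad

total_counts = np.array([900, 1000, 1100, 950, 1050, 5])  # last barcode is nearly empty
flags = is_outlier(np.log1p(total_counts))
kept = total_counts[~flags]
print(flags, kept)
```

The same function can be applied to other per-cell metrics, such as the number of detected genes.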