Commit 8cdfd4d2 authored by Davis McCarthy

Updating materials for workshop in July 2021

parent e12dd381
# SVI-SAHMRI_scRNA-seq-Workshop_2021-July
A one-day hands-on workshop covering highlights of scRNA-seq data analysis. Organised and hosted by Jimmy Breen at the South Australian Health and Medical Research Institute, and presented by Davis McCarthy from St Vincent's Institute of Medical Research.
## Introduction
This workshop provides a quick (~2 hour), hands-on introduction to current computational analysis approaches for studying single-cell RNA sequencing data. We will discuss the data formats used frequently in single-cell analysis, typical steps in quality control and processing of single-cell data, and some of the biological questions these data can be used to answer.
Given we only have 2 hours, we will start with the data already processed into a count matrix, which contains the number of sequencing reads mapping to each gene for each cell. The steps to generate such a count matrix depend on the type of single-cell sequencing technology used and the experimental design. For a more detailed introduction to these methods we recommend the long-form single-cell RNA-seq analysis workshop from the BioCellGen group ([available here]()) and the analysis of single-cell RNA-seq data course put together by folks at the Sanger Institute ([available here]()). Briefly, a count matrix is generated from raw sequencing data using the following steps:
1. If multiple sample libraries were pooled together for sequencing (typically to reduce cost), sequences are separated (i.e. demultiplexed) by sample using barcodes that were added to the reads during library preparation.
1. Quality control on the raw sequencing reads (e.g. using [FastQC]()).
1. Trimming the reads to remove sequencing adapters and low-quality sequences from the ends of each read (e.g. using [Trim Galore!]()).
1. Aligning QCed and trimmed reads to the genome (e.g. using [STARsolo](), [Kallisto-BUStools](), or [CellRanger]()).
1. Quality control of mapping results.
1. Quantifying the expression level of each gene for each cell (e.g. using bulk tools or single-cell specific tools from STAR, Kallisto-BUStools, or CellRanger).
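
As a rough illustration of what the end product of these steps looks like, the following base R sketch builds a tiny count matrix with genes in rows and cells in columns. The gene names, cell names and counts are all invented for the example; a real matrix would have thousands of genes and cells.

```r
# Toy count matrix: rows are genes, columns are cells (all values are made up).
toy_counts <- matrix(
  c( 0, 3, 1,
     5, 0, 2,
    10, 8, 0,
     0, 0, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("geneA", "geneB", "geneC", "geneD"),  # hypothetical gene names
    c("cell1", "cell2", "cell3")            # hypothetical cell barcodes
  )
)

# Library size: total counts observed in each cell.
toy_lib_size <- colSums(toy_counts)

# Number of genes detected (count > 0) in each cell.
toy_genes_detected <- colSums(toy_counts > 0)

print(toy_counts)
print(toy_lib_size)
print(toy_genes_detected)
```

Per-cell summaries such as the library size and the number of detected genes are the kind of quantities that cell-level quality control typically starts from.
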
Starting with the expression counts matrix, this workshop will cover the following topics:
1. The SingleCellExperiment object
1. Empty droplet identification
1. Cell-level quality control
1. Normalisation
1. Dimension reduction and visualisation
1. Clustering
1. Marker gene/cell annotation
# workflowr options
# Version 1.1.1
# The seed to use for random number generation. See ?set.seed for details.
seed: 20190102
# The working directory to build the R Markdown files. The path is relative to
# _workflowr.yml. See ?rmarkdown::render for details.
knit_root_dir: "."
# Session information function
sessioninfo: "devtools::session_info()"
name: SAHMRI scRNA-seq Workshop
output_dir: ../public
title: SAHMRI scRNA-seq Workshop
- text: "Home"
  href: index.html
- text: "About"
  href: about.html
- text: "License"
  href: license.html
- icon: fa-gitlab
toc: true
toc_float: true
theme: journal
highlight: pygments
code_folding: hide
title: "SAHMRI Single-cell RNA-seq Analysis Workshop"
site: workflowr::wflow_site
toc: true
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
In this workshop, we will introduce the data infrastructure for scRNA-seq analysis
in R and practise a workflow of scRNA-seq analysis, from pre-processing and quality
control to dimensionality reduction and clustering. We will then demonstrate the
use of marker genes for cell type annotation and an automatic approach for matching
query cells to a labelled reference atlas.
1. [single-cell RNAseq analysis workflow for cell type identification](sahmri_analysis-workflow.html)
title: "BIOS_sctransform-normalisation"
author: "Ruqian Lyu"
date: "11/30/2020"
bibliography: ../bios.bib
output: html_document
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, dpi = 100)
```
## sctransform
Unlike the `size_factor`-based normalisation method that we used before,
`sctransform` is based on probabilistic models for normalisation and variance
stabilisation of UMI count data from scRNA-seq. Instead of using a constant factor
to normalise all genes in a cell, `sctransform` [@Hafemeister2019-zh]
models and scales each gene individually.
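
For contrast, here is a minimal base R sketch of what "one constant factor per cell" means. It uses simple library-size factors on an invented matrix; this is an assumed simplification for illustration and not necessarily the exact size-factor method used in the main workflow.

```r
# Toy count matrix (genes x cells); values are invented for illustration.
toy_counts <- matrix(c( 2,  4,  8,
                       10, 20, 40,
                        0,  1,  1),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3)))

# Library-size factors, scaled so that they average to 1 across cells.
toy_lib_size <- colSums(toy_counts)
toy_size_factor <- toy_lib_size / mean(toy_lib_size)

# Every gene in a cell is divided by that cell's single factor,
# and the result is log-transformed with a pseudocount of 1.
toy_norm <- sweep(toy_counts, 2, toy_size_factor, "/")
toy_logcounts <- log2(toy_norm + 1)

print(round(toy_size_factor, 2))
print(round(toy_logcounts, 2))
```

Because the same factor is applied to every gene in a cell, all genes are assumed to respond to sequencing depth in the same way, which is the assumption that per-gene modelling relaxes.
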
`sctransform` fits a generalised linear model with a negative binomial error model
for each gene, with the sequencing depth (library size) as a covariate, followed by
a regularisation procedure to control overfitting. The residuals from the
regularised regression model are then treated as normalised expression levels
with the variation caused by sequencing depth removed.
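
The core idea can be sketched for a single simulated gene using `MASS::glm.nb()`. This is only an illustration under stated assumptions: the data are simulated, the variable names are hypothetical, and the sketch deliberately omits the regularisation step across genes, so it is not a re-implementation of `sctransform`.

```r
library(MASS)  # provides glm.nb() for negative binomial regression

set.seed(1)
n_cells <- 500

# Simulated sequencing depth per cell (total UMIs); purely illustrative.
total_umi <- round(10^runif(n_cells, min = 3, max = 4))
log10_umi <- log10(total_umi)

# Simulate one gene whose expected count scales with sequencing depth.
mu_true <- total_umi * 2e-3
y <- rnbinom(n_cells, mu = mu_true, size = 5)

# Negative binomial regression: log(mean) = b0 + b1 * log10_umi.
fit <- glm.nb(y ~ log10_umi)

# Pearson residuals: (observed - fitted mean) / sqrt(NB variance).
mu_hat <- fitted(fit)
theta_hat <- fit$theta
pearson <- (y - mu_hat) / sqrt(mu_hat + mu_hat^2 / theta_hat)

# Raw counts track sequencing depth; the residuals largely do not.
print(cor(y, log10_umi))
print(cor(pearson, log10_umi))
```

The residuals play the role of depth-corrected expression values for this one gene; fitting each gene separately without regularisation would tend to overfit, which is why the regularisation step matters in practice.
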
```{r}
library(scater)  # also attaches SingleCellExperiment
sce.pbmc <- readRDS(file = "raw_data/sce_pbmc.rds")
```
We use `log10_umi`, the log10 of each cell's total UMI count, as the latent
variable whose effect is regressed out of the normalised gene expression values.
```{r runsct, message=FALSE}
colData(sce.pbmc)$log10_umi <- log10(colData(sce.pbmc)$total)
pbmc.sctrans <- suppressWarnings(sctransform::vst(assay(sce.pbmc, "counts"),
                                                  cell_attr = colData(sce.pbmc),
                                                  latent_var = "log10_umi",
                                                  verbosity = 0))
```
## Add to assay field
`pbmc.sctrans` stores the value returned by running variance stabilisation with
sctransform. The normalised values are stored in the matrix `y`. We now add `y`
to the `sce.pbmc` object as an assay named "SCT". This assay plays a role
equivalent to the `logcounts` assay.
```{r}
## Fewer genes are returned by sctransform, which filters out genes that are
## detected in fewer than 5 cells (the default `min_cells` threshold).
sce.pbmc <- sce.pbmc[rownames(pbmc.sctrans$y), ]
assay(sce.pbmc, "SCT") <- pbmc.sctrans$y
```
## Select highly variable genes
Feature selection after `sctransform` normalisation is straightforward. We can
simply select the genes with the highest residual variance, which contribute
the most to the biological variation.
```{r}
## Show the genes with the highest residual variance
head(round(pbmc.sctrans$gene_attr[order(-pbmc.sctrans$gene_attr$residual_variance), ], 2))
```
Select 3,000 highly variable genes for downstream analysis:
```{r}
hvgs_3k <- head(rownames(pbmc.sctrans$gene_attr[order(-pbmc.sctrans$gene_attr$residual_variance), ]), 3000)
```
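
The ranking logic can be checked on a small invented table. The data frame below is a hypothetical stand-in shaped like `gene_attr`, with made-up gene names and values, and `n_top` is an illustrative parameter.

```r
# Hypothetical per-gene table shaped like `gene_attr`; values are invented.
toy_gene_attr <- data.frame(
  residual_variance = c(0.9, 12.4, 1.1, 5.3, 0.8, 7.6),
  row.names = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF")
)

n_top <- 3  # the workshop uses 3,000; 3 keeps the toy example readable

# Sort genes from highest to lowest residual variance and keep the first n_top.
toy_ranked <- toy_gene_attr[order(-toy_gene_attr$residual_variance), , drop = FALSE]
toy_hvgs <- head(rownames(toy_ranked), n_top)

print(toy_hvgs)
```

Using `head()` rather than indexing with `1:n_top` avoids `NA` entries when fewer than `n_top` genes are available.
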
## runPCA with HVGs
Next, we run PCA on the selected highly variable genes.
```{r PCA}
sce.pbmc <- runPCA(sce.pbmc, exprs_values = "SCT", ncomponents = 10,
                   subset_row = hvgs_3k)
```
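
To make the idea of running PCA on a gene subset concrete, here is a base R sketch using `prcomp()` on random data. It is an assumed stand-in for illustration only: the matrix and the gene selection are invented, and `runPCA()` may differ in its defaults and numerical back end.

```r
set.seed(42)

# Toy normalised matrix: 50 genes x 30 cells of random values (illustrative only).
toy_expr <- matrix(rnorm(50 * 30), nrow = 50,
                   dimnames = list(paste0("gene", 1:50), paste0("cell", 1:30)))
toy_hvgs <- paste0("gene", 1:20)  # hypothetical feature selection

# prcomp() expects observations (cells) in rows, so transpose the gene subset.
toy_pca <- prcomp(t(toy_expr[toy_hvgs, ]), center = TRUE, scale. = FALSE)

# Keep the first 10 components as the reduced representation of each cell.
toy_pcs <- toy_pca$x[, 1:10]
print(dim(toy_pcs))
```

Each cell is then described by 10 coordinates instead of thousands of gene values, and those coordinates are what downstream steps such as embedding and clustering operate on.
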
Generate t-SNE and UMAP embeddings from the PCs.
```{r runtsne}
sce.pbmc <- runTSNE(sce.pbmc, dimred = "PCA")
sce.pbmc <- runUMAP(sce.pbmc, dimred = "PCA")
```
## Clustering
The remaining steps are similar to those presented in the main workflow.
```{r}
library(scran)  # provides buildSNNGraph()
g <- buildSNNGraph(sce.pbmc, k = 35, use.dimred = "PCA")
clust <- igraph::cluster_walktrap(g)$membership
colLabels(sce.pbmc) <- factor(clust)
```
## Plot Clusters
```{r}
plotTSNE(sce.pbmc, colour_by = "label")
plotUMAP(sce.pbmc, colour_by = "label")
plotExpression(sce.pbmc, features = c("CD14", "CD68", "MNDA", "FCGR3A"),
               x = "label", colour_by = "label", exprs_values = "SCT")
```
## More info
`sctransform` is also integrated with the `Seurat` package; more information
can be found here:
## References
## SessionInfo