Commit 58bf10e6 authored by Davis McCarthy's avatar Davis McCarthy
Browse files

Adding all of Nhi's material

parent 0d6c5b80
Pipeline #8775 passed with stage
in 2 seconds
pages:
stage: deploy
script:
- echo 'Nothing to do...'
artifacts:
paths:
- public
only:
- master
Version: 1.0
RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Yes
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes
# SAHMRI-Bulk-RNAseq
# Bulk RNA-seq Workshop
Nhi Hin's bulk RNA-seq workshop
\ No newline at end of file
## Instructions
1. Ensure that RStudio is installed. Install the `renv` package by running the following line in the RStudio Console:
```
install.packages("renv")
```
2. Download the workshop material by clicking [here](https://github.com/sagc-bioinformatics/Bulk_RNAseq/archive/refs/heads/master.zip) and unzip the folder.
- Open the `Bulk_RNAseq.Rproj` file to open the R Project in RStudio.
- Navigate to the `analysis` folder, and click the `de.Rmd` file to open up the analysis in RStudio.
- Also, navigate to the `docs` folder, and click on the `de.html` file to open up the analysis documentation in a web browser.
3. In the RStudio Console, run the following lines of code to download and install all required packages using `renv`:
```
renv::restore(library = "./rlib/")
.libPaths( c( .libPaths(), "./rlib") )
```
4. You are ready to start running through the workshop material!
# workflowr options
# Version 1.6.2
# The seed to use for random number generation. See ?set.seed for details.
seed: 20210629
# The working directory to build the R Markdown files. The path is relative to
# _workflowr.yml. See ?rmarkdown::render for details.
knit_root_dir: "."
name: RNAseq_Workshop
output_dir: ../docs
navbar:
title: Bulk RNA-seq
left:
- text: Introduction
href: index.html
- text: Setting Up
href: processing.html
- text: Differential Gene Expression Analysis
href: de.html
output:
workflowr::wflow_html:
toc: yes
toc_float: yes
theme: journal
highlight: textmate
This diff is collapsed.
---
title: "Introduction"
site: workflowr::wflow_site
output:
workflowr::wflow_html:
toc: false
editor_options:
chunk_output_type: console
---
## Instructions
1. Ensure that RStudio is installed. Install the `renv` package by running the following line in the RStudio Console:
```{r eval=FALSE}
install.packages("renv")
```
2. Download the workshop material by clicking [here](https://github.com/sagc-bioinformatics/Bulk_RNAseq/archive/refs/heads/master.zip) and unzip the folder.
- Click on the `Bulk_RNAseq.Rproj` file to open up the R project in RStudio.
- Navigate to the `analysis` folder, and click the `de.Rmd` file to open up the analysis in RStudio.
- Also, navigate to the `docs` folder, and click on the `de.html` file to open up the analysis documentation in a web browser.
3. In the RStudio Console, run the following lines of code to download and install all required packages using `renv`:
```{r eval=FALSE}
renv::restore(library = "./rlib/")
.libPaths( c( .libPaths(), "./rlib") )
```
4. You are ready to start running through the workshop material!
---
title: "Setting up for Differential Gene Expression Analysis"
author: "Nhi Hin"
date: "2021-06-28"
output: workflowr::wflow_html
editor_options:
chunk_output_type: console
---
## Summary
- On this page, we will prepare the data for the differential gene expression analysis. This involves:
1. Importing in the gene counts and sample metadata.
2. Obtaining gene annotations for the genes detected in this experiment.
3. Creating a `DGEList` object to store the gene counts, sample metadata, and gene annotations in a single object.
4. Dimension reduction to check how similar samples are, as well as whether there are any mislabeled samples.
5. Export out `DGEList` for use in the [differential gene expression analysis](de.html).
## Data Description
- The dataset we will be using in this analysis comes from a recent study published in [Cell](https://www.cell.com/iscience/pdf/S2589-0042(21)00679-9.pdf).
- The data has been uploaded onto the public repository GEO with the [GSE171110](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE171110) accession number, and I've copied the description of the data the authors gave there below:
<blockquote>
The identification of COVID-19 patients with high-risk of severe disease is a challenge in routine care. We performed blood RNA-seq gene expression analyses in severe hospitalized patients compared to healthy donors. Supervised and unsupervised analyses revealed a high abundance of CD177, a specific neutrophil activation marker, contributing to the clustering of severe patients. Gene abundance correlated with high serum levels of CD177 in severe patients. These results highlight neutrophil activation as a hallmark of severe disease and CD177 assessment as a reliable prognostic marker for routine care.
</blockquote>
- The data was pre-processed by the authors of the [Cell](https://www.cell.com/iscience/pdf/S2589-0042(21)00679-9.pdf) paper, by aligning .fastq files to the human `hg18` transcriptome using [Salmon](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5600148/). We will not be covering the pre-processing in this workshop due to time constraints, although I would highly encourage those interested to bookmark the [Salmon documentation](https://salmon.readthedocs.io/en/latest/salmon.html#using-salmon) which goes through the commands needed to run Salmon on .fastq files yourself.
## Load R Packages
- The following code chunk contains the R packages that are required to run through the analysis on this page:
```{r message=FALSE, warning=FALSE, error=FALSE}
# Working with data:
library(dplyr)
library(magrittr)
library(readr)
library(tibble)
library(reshape2)
# Visualisation:
library(kableExtra)
library(ggplot2)
library(ggbiplot)
library(ggrepel)
library(grid)
# Set ggplot theme
theme_set(theme_bw())
# Other packages:
library(here)
library(export)
# Bioconductor packages:
library(AnnotationHub)
library(edgeR)
library(limma)
library(Glimma)
```
## Import Data
- The first step is to import the gene counts and sample metadata data into R.
### Gene count matrix
- The raw gene counts after RNA-seq data pre-processing were downloaded from GEO [GSE171110](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE171110) under the "Supplementary Files". The `GSE171110_Data_RNAseq_raw_counts_geo.txt.gz` supplementary file was downloaded and uncompressed into the `data` directory, and the resulting tab-delimited text file (containing gene counts) is imported below.
```{r importGeneCounts, message=FALSE, warning=FALSE}
counts <- readr::read_tsv(here("data",
"GSE171110_Data_RNAseq_raw_counts_geo.txt"),
skip = 2,
col_names = TRUE) %>%
as.data.frame %>%
tibble::column_to_rownames("Code_Patient")
```
- By previewing the `counts` object below, we can see that each row is a gene and each column is a sample. The `counts` object is a type of object in R called a `data.frame` which you can basically think of as being like a spreadsheet but in R.
```{r message=FALSE, warning=FALSE}
# Preview first 5 rows and columns
counts[1:5,1:5]
```
- The `nrow` and `ncol` functions can be used to check how many rows and columns are in a `data.frame` object:
```{r message=FALSE, warning=FALSE}
nrow(counts)
ncol(counts)
```
#### Questions
1. How many genes are in this dataset?
2. How many Healthy and Covid-19 samples are in this dataset?
### Sample Metadata Table
- We import the sample metadata from the table stored in the CSV file at `data/sample_table.csv` and store it in an object called `samples`:
```{r}
samples <- readr::read_csv(here("data", "sample_table.csv")) %>%
as.data.frame
```
- As we can see in the preview below, each row in the `samples` `data.frame` is a sample, and the columns represent different characteristics/information about the samples, including `age`, `sex`, and `group` (Healthy or Covid-19).
```{r}
samples %>%
kable %>%
kable_styling()
```
### Readable sample names
- You may have noticed when previewing the gene counts matrix stored in `counts` using `head(counts)` earlier that the sample names are not very descriptive (eg. 507-V, 1189-V etc.). These correspond to the `patient_code` column in the sample metadata table above.
- Below, we will replace these `patient_code` names in the gene counts with readable sample names so that the rest of the analysis is easier to interpret, especially when visualising the data.
- First, we create a new column in the `samples` metadata table, called `sample`, that will contain readable sample names (the `group` they are in followed by a number):
```{r}
samples <- samples %>%
dplyr::mutate(sample = paste0(group, "_", c(1:10, 1:44)))
rownames(samples) <- samples$sample
# Preview the samples table with the added column:
samples %>%
kable %>%
kable_styling()
```
- We will then set the column names in `counts` (corresponding to sample names) to the new readable sample names:
```{r}
# Reorder the columns in `counts` so they are in the same order as the rows in the samples table
counts <- counts[, samples$patient_code]
# Set the colnames to the readable sample names
colnames(counts) <- samples$sample
head(counts)
```
## Gene annotations
- Gene annotations are useful for later interpreting the results of the differential gene expression analysis. They give information about genes, including genomic location (e.g. start and end coordinates), descriptions, type of gene (e.g. whether it's a non-coding or protein-coding gene) etc.
- Below, we retrieve gene annotation metadata from the Ensembl database (this step can take a while to run if running `AnnotationHub` for the first time, so I have saved out the `genes` object as an R object).
```{r eval=FALSE}
# Below is not run in this RMarkdown document:
# Load annotation hub
ah <- AnnotationHub()
# Search for the correct Ensembl gene annotation
ah %>%
subset(grepl("sapiens", species)) %>%
subset(rdataclass == "EnsDb")
ensDb <- ah[["AH78783"]]
# Get gene annotations
genes <- genes(ensDb) %>%
as.data.frame
# genes %>% saveRDS(here("data", "R", "genes.rds"))
```
```{r}
# I load in a saved version below since the above code can take a while to run
# if AnnotationHub objects are not already pre-downloaded.
genes <- readRDS(here("data", "R", "genes.rds"))
```
- The `genes` data.frame is prepared so that the `row.names` (corresponding to Ensembl gene IDs) are in the same order as for the gene count matrix stored in `counts`.
```{r message=FALSE, warning=FALSE}
genes <- data.frame(gene = rownames(counts)) %>%
left_join(genes %>% as.data.frame,
by = c("gene"="symbol")) %>%
dplyr::distinct(gene, .keep_all=TRUE)
rownames(genes) <- genes$gene
```
- From previewing the gene annotations below, we can see that each row is a gene and each column contains information about the gene.
```{r}
genes %>%
head(50) %>%
kable %>%
kable_styling %>%
scroll_box(height = "600px")
```
- The columns of information about the genes include:
- **seqnames**: Chromosome the gene is located on
- **start**: Start coordinate
- **end**: End coordinate
- **width**: Length of gene (bp)
- **strand** Strand the gene is located on
- **gene_id**: Ensembl gene identifier
- **gene_name**: Gene symbol
- **gene_biotype**: Type of gene
- **seq_coord_system**: Whether the gene is located on a chromosome or scaffold
- **description**: Short description of the gene / what it encodes
- **gene_id_version**: Version of the gene model in Ensembl
- **entrezid**: Entrez gene identifier
- This information can help to put the differential gene expression analysis results in context later. For example, in some analyses, people may wish to focus on looking at particular `gene_biotype`s (e.g. protein-coding genes).
- By using the `table` function, we can count the number of genes belonging to each `gene_biotype` in this particular dataset.
```{r}
genes$gene_biotype %>%
table
```
### Questions
3. What percentage of genes in this dataset are `protein_coding`?
## Create DGEList object
- We will store the gene counts, sample information, and gene information in a `DGEList` (Digital Gene Expression) object called `dge`, which simplifies plotting and interacting with this information. It will also be used for the [differential gene expression analysis](de.html) later on.
- The `DGEList` object is prepared as follows. Here, we remove all genes that have 0 counts across all samples, and apply Trimmed Mean of M-values (TMM) normalisation to normalise counts according to library size differences between samples.
```{r createDGEList}
dge <- DGEList(counts = counts,
genes = genes,
samples = samples,
remove.zeros = TRUE) %>%
calcNormFactors("TMM")
# Preview the DGEList object
dge
```
- The `counts`, `genes` or `samples` can be accessed from the `dge` object using the `$` operator. This is exactly the same information as that stored in the original `counts`, `samples` and `genes` objects, but putting them together in the `dge` object makes it easier to do certain plots/analysis later.
```{r}
# Preview first 5 rows and columns of the counts stored in `dge`:
dge$counts[1:5, 1:5]
# Preview first 5 rows and columns of the samples stored in `dge`:
dge$samples[1:5, 1:5]
# Preview first 5 rows and columns of the genes stored in `dge`:
dge$genes[1:5, 1:5]
```
## Quality Checks
- In this section we will do some basic quality checks to see if any samples are outliers or potentially mislabeled.
### Library Sizes of samples
- Overall, the library sizes of samples are shown. Sometimes if there is an outlier sample with an abnormally large or small library size, it is worth keeping an eye on in case it is a failed sample (e.g. library prep went wrong). We can see that certain samples have a larger library size below, but overall they are quite comparable.
```{r libSizes}
dge$samples %>%
ggplot(aes(x = sample, y = lib.size, fill = group)) +
geom_bar(stat = "identity") +
labs(y = "Library size (total number of mapped and quantified reads)",
x = "Sample", fill = "Group") +
coord_flip()
```
### Dimension Reduction
#### Multi Dimensional Scaling (MDS)
- Multi Dimensional Scaling (MDS) is a dimension reduction technique that summarises expression of the top 500 most variable genes expressed in each sample down to 2 dimensions, known as `Dimension 1` and `Dimension 2`. Dimension 1 represents the largest amount of variance that can be explained in the data. Plotting these principal components can help us visualise which samples show overall similar gene expression to each other -- similar samples will be situated closer to each other on the plot.
- Because our information is stored in a `DGEList` type of object, we can make use of the `glMDSPlot` function from the *Glimma* package to make an interactive MDS plot. The MDS plot produced using the function below can be viewed by navigating to the `glimma-plots` directory and clicking on the MDS-Plot.html file, or by clicking [here](assets/glimma-plots/MDS-Plot.html).
```{r}
glMDSPlot(dge,
labels = dge$samples$sample,
groups = dge$samples$group)
```
#### Principal Component Analysis (PCA)
- Principal component analysis (PCA) is similar to MDS, but it uses all genes (`r nrow(dge)` genes for this particular dataset) to summarise down into a few principal components. Plotting these in the same way as for an MDS plot can also give us information about which samples are similar to each other. In most cases, the MDS and PCA plots will look very similar.
- Below, the PCA plot created using the `ggbiplot` function shows that overall, samples belonging to the same `group` tend to have similar gene expression patterns, and there is quite a large difference between `Healthy` and `Covid-19` samples. There do not appear to be any obvious outlier samples, although there is quite a bit of variation in gene expression amongst `Covid-19` samples.
- We will not remove any samples in this particular dataset as there is no indication so far that any of the samples are obvious outliers.
```{r pca}
# Perform PCA analysis:
pca_analysis <- prcomp(t(cpm(dge, log=TRUE)))
summary(pca_analysis)
# Create the plot:
pca_plot <- ggbiplot::ggbiplot(pca_analysis,
groups = dge$samples$group,
ellipse = TRUE,
var.axes = FALSE)
pca_plot
```
- We also have other sample metadata including `age` and `sex` of the samples in the `dge$samples` table. These variables can also be plotted on the PCA plot by changing what is specified in the `groups` argument below, so that we can see whether samples are grouping by age or sex. Doing this is useful as it can tell us whether these variables are [confounding variables](https://en.wikipedia.org/wiki/Confounding) that have the potential to obscure real biological effects.
- Below, we can see that the ages of patients from which samples were derived ranges from 51 to 61. It does appear that some of the older patients' samples are closer to the top of the plot.
```{r}
ggbiplot::ggbiplot(pca_analysis,
groups = dge$samples$age,
ellipse = FALSE,
var.axes = FALSE)
```
- Below, we can see that `sex` of the samples does not seem to really influence where the points lie on the plot, suggesting that sex does not have a noticeable effect on gene expression of the samples here.
```{r}
ggbiplot::ggbiplot(pca_analysis,
groups = dge$samples$sex,
ellipse = FALSE,
var.axes = FALSE)
```
### Questions
4. Should we treat male and female samples differently in the analysis, based on the information in the MDS/PCA plots?
5. Should we do anything about the older patients' samples, based on the information in the MDS/PCA plots?
## Next Step - Differential gene expression analysis
- Overall the data quality looks good and there do not appear to be outlier samples or mislabeled samples.
- We will export the DGEList object created in this document for the next step of the analysis, [differential gene expression analysis](de.html). The `dge` object is exported into the `data/R` directory.
```{r eval=FALSE}
r.dir <- here("data", "R")
ifelse(!dir.exists(r.dir),
dir.create(r.dir),
FALSE)
dge %>% saveRDS(file.path(r.dir, "dge.rds"))
```
This diff is collapsed.
File added
patient_code,age,sex,group,geo_accession
507-V,53,M,Healthy,GSM5218990
1189-V,57,F,Healthy,GSM5218991
1406-V,61,M,Healthy,GSM5218992
1918-V,56,M,Healthy,GSM5218993
1951-V,57,F,Healthy,GSM5218994
1995-V,55,F,Healthy,GSM5218995
2064-V,55,M,Healthy,GSM5218996
3054-V,60,M,Healthy,GSM5218997
3136-V,51,F,Healthy,GSM5218998
3538-V,60,M,Healthy,GSM5218999
002-0001,51,M,Covid19,GSM5219000
002-0002,60,M,Covid19,GSM5219001
002-0003,53,F,Covid19,GSM5219002
002-0004,57,F,Covid19,GSM5219003
002-0011,58,F,Covid19,GSM5219004
002-0006,55,M,Covid19,GSM5219005
002-0031,53,M,Covid19,GSM5219006
001-0021,52,M,Covid19,GSM5219007
001-0033,60,M,Covid19,GSM5219008
001-0024,61,M,Covid19,GSM5219009
001-0040,58,M,Covid19,GSM5219010
001-0032,59,M,Covid19,GSM5219011
001-0029,61,M,Covid19,GSM5219012
002-0035,51,M,Covid19,GSM5219013
002-0037,55,M,Covid19,GSM5219014
001-0036,59,F,Covid19,GSM5219015
002-0040,55,F,Covid19,GSM5219016
002-0041,57,F,Covid19,GSM5219017
002-0046,60,M,Covid19,GSM5219018
002-0053,55,M,Covid19,GSM5219019
001-0068,56,M,Covid19,GSM5219020
002-0057,51,M,Covid19,GSM5219021
002-0058,57,M,Covid19,GSM5219022
002-0060,53,M,Covid19,GSM5219023
001-0094,60,M,Covid19,GSM5219024
002-0062,55,M,Covid19,GSM5219025
002-0061,57,M,Covid19,GSM5219026
002-0068,53,M,Covid19,GSM5219027
002-0067,55,M,Covid19,GSM5219028
002-0066,51,M,Covid19,GSM5219029
002-0064,52,M,Covid19,GSM5219030
002-0070,58,M,Covid19,GSM5219031
002-0072,57,M,Covid19,GSM5219032
001-0129,56,M,Covid19,GSM5219033
002-0078,55,M,Covid19,GSM5219034
002-0080,55,M,Covid19,GSM5219035
002-0079,59,M,Covid19,GSM5219036
002-0083,54,M,Covid19,GSM5219037
002-0084,59,M,Covid19,GSM5219038
001-0142,53,M,Covid19,GSM5219039
002-0088,61,M,Covid19,GSM5219040
002-0101,58,M,Covid19,GSM5219041
002-0104,58,M,Covid19,GSM5219042
051-0001,54,M,Covid19,GSM5219043
\ No newline at end of file
<!DOCTYPE html>
<html lang='en'>
<meta charset='utf-8' />
<meta name='viewport' content='width=device-width, initial-scale=1'>
<head>
<script src='js/glimma.min.js'></script>
<link rel='stylesheet' href='css/glimma.min.css'>
<title>Glimma Plots</title>
<script>
function clickAdjPval() {
if ($('th:contains(Adj.PValue)').text()) $('th:contains(Adj.PValue)').click(); // sort table by adjusted p-value
}
</script>
</head>
<!-- sort by adjusted p-value on load if possible -->
<body onload=clickAdjPval()>
<div class='container'>
</div>
<script src='js/MDS-Plot.js'></script>
<script>
glimma.init.processCharts();
glimma.init.processTables();
glimma.init.processLinkages();
glimma.init.processInputs();
</script>
<script>
d3.select('.container')
.append('div')
.style('text-align', 'right')
.text('Generated by ')
.append('a')
.text('Glimma')
.attr('href', 'https://www.github.com/shians/Glimma');
</script>
</body>
</html>
<!DOCTYPE html>
<html lang='en'>
<meta charset='utf-8' />
<meta name='viewport' content='width=device-width, initial-scale=1'>
<head>
<script src='js/glimma.min.js'></script>
<link rel='stylesheet' href='css/glimma.min.css'>
<title>Glimma Plots</title>
<script>
function clickAdjPval() {
if ($('th:contains(Adj.PValue)').text()) $('th:contains(Adj.PValue)').click(); // sort table by adjusted p-value
}
</script>
</head>
<!-- sort by adjusted p-value on load if possible -->
<body onload=clickAdjPval()>
<div class='container'>
</div>
<script src='js/XY-Plot.js'></script>
<script>
glimma.init.processCharts();
glimma.init.processTables();
glimma.init.processLinkages();
glimma.init.processInputs();
</script>
<script>
d3.select('.container')
.append('div')
.style('text-align', 'right')
.text('Generated by ')
.append('a')
.text('Glimma')
.attr('href', 'https://www.github.com/shians/Glimma');
</script>
</body>
</html>
This diff is collapsed.
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!