Skip to content
GitLab
Menu
Projects
Groups
Snippets
/
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in
Toggle navigation
Menu
Open sidebar
BioCellGen-public
SVI-SAHMRI_scRNA-seq-Workshop_2021-July
Commits
01a23270
Commit
01a23270
authored
Jun 30, 2021
by
Jeffrey Pullin
Browse files
Edits from Jeffrey
parent
f1e61d56
Changes
1
Hide whitespace changes
Inline
Side-by-side
analysis/sahmri_analysis-workflow.Rmd
View file @
01a23270
...
...
@@ -283,10 +283,10 @@ high.count <- isOutlier(stats$total, type="higher")
table(high.count)
```
---2. Why would be useful
e
to filter out cells with higher total counts---
---2. Why would be useful to filter out cells with higher total counts---
---3. Use `plotColData` to plot other useful cell-level QC metrics
(e.g. number of detected features, subsets_Mito_detected) and colour by whether cells would be discarded.---
(e.g. number of detected features,
`
subsets_Mito_detected
`
) and colour by whether cells would be discarded.---
Hint: you may find the following code helpful.
...
...
@@ -324,13 +324,13 @@ expression profiles.
We are going to use a normalisation method named Normalisation by Deconvolution.
It is a scaling-based procedure that involves dividing all counts for each
cell by a cell-specific scaling factor, often called a “size factor”.The size
cell by a cell-specific scaling factor, often called a “size factor”.
The size
factor for each cell represents the estimate of the relative bias in that cell,
so division of its counts by its size factor should remove that bias.
Adapted from bulk RNAseq analysis, we assume the majority of genes between
two cells are non-DE (not differential expressed) and the difference among the
nonDE genes are used to represent bias that is used to compute an appropriate
non
-
DE genes are used to represent bias that is used to compute an appropriate
size factor for its removal.
Given the characteristics of low-counts in single cell dataset, we first pool
...
...
@@ -443,9 +443,9 @@ differentially expressed across all the cells, we can set the prop to be 5%. In
a
more
heterogeneous
cell
population
,
we
would
raise
that
number
.
Here
we
are
using
top
10
%.
Note
that
th
is
number
selection
is
actually
quite
arbitrary
.
It
is
recommended
to
simply
pick
a
value
and
go
ahead
with
the
downstream
analysis
but
with
the
intention
of
testing
other
parameters
if
needed
.
Note
that
th
e
selection
of
the
number
of
genes
to
keep
is
actually
quite
arbitrary
.
It
is
recommended
to
simply
pick
a
value
and
go
ahead
with
the
downstream
analysis
but
with
the
intention
of
testing
other
parameters
if
needed
.
...
...
@@ -453,7 +453,7 @@ intention of testing other parameters if needed.
###
Dimensionality
reduction
and
visualisation
Now
we
have
selected
the
highly
variable
genes
,
we
can
use
these
genes
'
s
Now
we
have
selected
the
highly
variable
genes
,
we
can
use
these
genes
'
information to reveal cell population structures in a constructed lower dimensional
space. Working with this lower dimensional representation allows for easier
exploration and reduces the computational burden of downstream analyses,
...
...
@@ -485,7 +485,7 @@ reducedDims(sce.pbmc)$PCA[1:5,]
```
tSNE (t-stochastic neighbor embedding
) is a popular dimensionality reduction
tSNE (t-stochastic neighbor embedding) is a popular dimensionality reduction
methods for visualising single-cell dataset with functions to run them available
in `scater`. They attempt to find a low-dimensional representation of the data
that preserves the distances between each point and its neighbours in the
...
...
@@ -510,7 +510,7 @@ reducedDims(sce.pbmc)
Visualising
column
data
or
plotting
gene
expression
in
tSNE
plot
:
```{
r
tse_plot
}
```{
r
ts
n
e_plot
}
plotTSNE
(
sce
.
pbmc
,
colour_by
=
"detected"
)
plotTSNE
(
sce
.
pbmc
,
colour_by
=
"CD14"
)
...
...
@@ -523,12 +523,12 @@ Notice that in the 2D plots of single cells, sometimes it can get really
runs
into
a
over
-
plotting
issue
where
cells
are
two
close
overlaying
/
blocking
To
deal
with
that
,
we
can
use
the
hexbin
plots
[@
Freytag2020
-
us
].
`
schex
`
package
provides
binning
strategies
of
cells
into
hexagon
cells
.
Plotting
summarized
information
of
all
cells
in
their
respective
hexagon
cells
presents
The
`
schex
`
package
strategies
for
binning
the
observed
cells
into
hexagon
al
cells
.
Plotting
summarized
information
of
all
cells
in
their
respective
hexagon
al
cells
presents
information
without
obstructions
.
We
use
the
`
plot_hexbin_feature
`
to
plot
of
feature
expression
of
single
cells
in
bivariate
hexagon
cells
.
We
use
the
`
plot_hexbin_feature
`
to
plot
the
feature
expression
of
single
cells
in
bivariate
hexagon
al
cells
.
```{
r
hexgonplot
}
...
...
@@ -567,13 +567,13 @@ plot_hexbin_feature(pbmc_hex, feature = "CD14",
##
Clustering
and
cell
annotation
One
of
the
most
promising
applications
of
scRNA
-
seq
is
de
novo
discovery
and
One
of
the
most
promising
applications
of
scRNA
-
seq
is
*
de
novo
*
discovery
and
annotation
of
cell
-
types
based
on
transcription
profiles
.
We
are
going
to
use
unsupervised
clustering
method
,
that
is
,
we
need
to
identify
groups
of
cells
based
on
the
similarities
of
the
transcriptomes
without
any
prior
knowledge
of
the
labels
.
In
scRNAseq
analyses
,
shared
K
-
nearest
neighbour
graph
-
based
clustering
is
In
scRNA
-
seq
analyses
,
shared
K
-
nearest
neighbour
graph
-
based
clustering
is
often
applied
.
Each
cell
is
represented
in
the
graph
as
a
node
,
and
the
first
step
is
to
find
all
the
KNN
(
K
-
neareast
neighbours
)
for
each
cell
by
comparing
their
pair
-
wise
Euclidean
distances
.
After
that
,
the
nodes
are
connected
if
they
...
...
@@ -602,7 +602,7 @@ table(factor(clust))
###
Clustree
`
clustree
`
can
be
used
to
visualise
the
different
clustering
results
in
a
The
`
clustree
`
package
can
be
used
to
visualise
the
different
clustering
results
in
a
tree
structure
.
It
helps
to
visualise
how
cells
move
as
the
clustering
resolution
changes
.
...
...
@@ -625,8 +625,8 @@ clustree(multipleKResults, prefix = "k", prop_filter = 0.05, layout = "tree")
```
Varying
clustering
resolutions
is
an
important
part
of
data
exploration
for
scRNAseq
.
It
is
challenging
to
find
the
best
resolution
,
and
you
usually
need
to
experiment
with
different
parameters
multiple
times
and
t
ry
find
the
proper
result
.
scRNA
-
seq
.
It
is
challenging
to
find
the
best
resolution
,
and
you
usually
need
to
experiment
with
different
parameters
multiple
times
and
t
o
find
the
best
result
.
##
Plot
Clusters
...
...
@@ -641,42 +641,44 @@ plotTSNE(sce.pbmc, colour_by = "label")
Perhaps
add
a
section
here
on
MRtree
(
Peng
et
al
,
NAR
,
2021
)
##
Marker
dection
/
cluster
annotation
To
interpret
our
clustering
results
,
we
identify
the
genes
that
drive
separation
between
clusters
.
These
marker
genes
allow
us
to
assign
biological
meaning
to
each
cluster
based
on
their
functional
annotation
.
We
demonstrate
this
approach
in
the
[
"Manual"
annotation
].
between
clusters
.
These
so
-
called
'marker genes'
allow
us
to
assign
biological
meaning
to
each
cluster
based
by
comparing
the
expression
profiles
of
the
marker
genes
to
the
expression
profiles
of
previously
identified
cell
types
.
We
demonstrate
this
approach
in
the
section
[
"Manual"
annotation
].
An
alternative
way
is
to
use
some
automatic
annotation
approach
,
see
the
[
Automatic
annotation
]
[
"
Automatic
"
annotation
]
section
.
###
"Manual"
annotation
`
findMarkers
`
is
done
via
pair
-
wise
comparisions
of
all
pari
-
wise
combinations
of
clusters
.
Each
cluster
is
compared
to
each
of
the
remaining
clusters
and
up
-
regulated
genes
are
identified
using
t
-
test
.
The
returned
results
for
one
cluster
(
versus
any
of
the
remaining
clusters
)
are
combined
into
a
single
list
and
the
combined
.
P
.
values
are
calculated
.
One
can
specifty
the
`
pval
.
type
`
which
determines
how
the
DEGs
are
combined
:
`
any
`,
return
genes
that
are
determined
as
up
-
regulated
in
any
of
the
comparisons
;
`
all
`,
return
genes
that
are
determined
as
up
-
regulated
in
`
all
`
of
the
comparisons
,`
some
`
is
a
compromise
between
`
any
`
and
`
all
`.
```{
r
,
findMarkers
}
To
identify
the
marker
genes
for
each
cluster
we
use
the
`
findMarkers
()`
function
from
the
scran
package
.
The
`
findMarkers
()`
function
uses
pair
-
wise
differential
expression
testing
to
identify
marker
genes
.
This
means
that
,
for
each
gene
,
it
tests
for
differences
in
expression
between
the
cluster
of
interest
and
each
of
the
other
clusters
in
the
dataset
seperately
.
By
default
,
a
t
-
test
is
used
,
but
a
binomial
test
or
the
Wilcoxon
rank
sum
test
are
also
implemented
.
In
the
code
below
we
specify
that
we
only
want
to
idenfity
genes
which
are
`
"up"
`
regulated
in
the
cluster
of
interest
.
The
returned
results
for
one
cluster
(
versus
all
of
the
remaining
clusters
)
are
combined
into
a
single
list
and
combined
p
-
values
are
calculated
.
One
can
specifty
the
`
pval
.
type
`
which
determines
how
the
DEGs
are
combined
:
`
any
`,
return
genes
that
are
determined
as
up
-
regulated
in
any
of
the
comparisons
;
`
all
`,
return
genes
that
are
determined
as
up
-
regulated
in
`
all
`
of
the
comparisons
,
`
some
`
is
a
compromise
between
`
any
`
and
`
all
`.
```{
r
findMarkers
}
markers
<-
findMarkers
(
sce
.
pbmc
,
pval
.
type
=
"some"
,
direction
=
"up"
)
```
We
examine
the
markers
for
cluster
7
and
14
in
more
detail
.
```{
r
,
plotMarkers
}
```{
r
plotMarkers
}
marker
.
set
<-
markers
[[
"7"
]]
as
.
data
.
frame
(
marker
.
set
[
1
:
10
,
1
:
3
])
marker
.
set
<-
markers
[[
"14"
]]
as
.
data
.
frame
(
marker
.
set
[
1
:
10
,
1
:
3
])
```
###
Question
...
...
@@ -730,15 +732,14 @@ gridExtra::grid.arrange(
)
```
```{
r
}
plotExpression
(
sce
.
pbmc
,
features
=
c
(
"CD14"
),
x
=
"label"
,
colour_by
=
"label"
)
```
###
Automatic
annotation
###
"
Automatic
"
annotation
The
i
ntuiti
on
of
automatic
annotation
i
s
by
comparing
current
dataset
with
previously
annotat
ion
reference
samples
and
find
the
cell
type
of
the
most
I
ntuiti
vely
,
automatic
annotation
work
s
by
comparing
the
current
dataset
with
previously
annotat
ed
reference
samples
and
find
ing
the
cell
type
of
the
most
similar
/
correlated
cells
.
`
SingleR
`
method
[@
Aran2019
-
pw
]
provides
one
such
option
for
cell
type
annotation
.
This
method
assigns
labels
to
cells
based
on
the
reference
samples
with
the
highest
Spearman
rank
correlations
,
using
only
the
...
...
@@ -772,13 +773,40 @@ colData(sce.pbmc) <- cbind(colData(sce.pbmc),pred)
plotTSNE
(
sce
.
pbmc
,
colour_by
=
"labels"
)
```
##
Differential
expression
(
DE
)
analysis
###
Introduction
to
DE
analysis
###
DE
in
a
real
dataset
Unlike
analysing
bulk
RNA
-
seq
data
,
where
only
one
type
of
DE
analysis
is
possible
,
many
types
of
DE
analysis
can
be
performed
on
single
cell
data
.
Some
different
types
DE
analysis
possible
using
single
cell
data
are
:
1.
Comparing
one
cluster
to
all
other
clusters
in
a
dataset
1.
Comparing
a
specific
pair
of
clusters
(
or
groups
of
clusters
)
within
a
single
dataset
1.
Comparing
the
same
cell
type
across
samples
in
different
conditions
Each
of
these
different
possible
analyses
require
different
methodologies
.
The
first
type
of
analysis
is
the
same
as
finding
marker
genes
(
described
above
)
and
any
marker
gene
methods
which
use
differential
expression
testing
can
be
used
.
A
variety
of
methods
are
avaliable
for
the
second
type
of
analysis
.
Recent
benchmarking
[@
Soneson2018
-
hy
]
has
suggested
that
[
edgeR
](
https
://
bioconductor
.
org
/
packages
/
release
/
bioc
/
html
/
edgeR
.
html
),
a
differential
expression
method
developed
for
bulk
RNA
-
seq
data
performs
well
.
Other
options
are
[
MAST
](
https
://
bioconductor
.
org
/
packages
/
release
/
bioc
/
html
/
MAST
.
html
),
a
differential
expression
method
developed
for
scRNA
-
seq
data
or
[
limma
-
voom
](
https
://
bioconductor
.
org
/
packages
/
release
/
bioc
/
html
/
limma
.
html
),
another
method
developed
for
bulk
RNA
-
seq
data
.
Finally
,
for
the
third
type
of
analysis
is
it
critical
to
account
for
the
fact
that
measurements
of
cells
within
each
sample
are
not
independent
.
Currently
,
the
best
methods
for
this
analysis
use
a
pseudobulk
approach
-
aggregating
cells
in
each
sample
and
then
using
methods
designed
for
bulk
data
.
This
strategy
is
implemented
in
the
[
muscat
](
https
://
www
.
bioconductor
.
org
/
packages
/
release
/
bioc
/
html
/
muscat
.
html
)
package
.
In
addition
,
other
types
of
analysis
,
similar
to
DE
analysis
,
are
possible
in
scRNA
-
seq
data
.
Firstly
,
we
can
test
for
differences
in
the
number
of
cells
in
each
cluster
between
conditions
.
This
is
called
differential
abundance
or
differential
composition
testing
.
Secondly
,
the
large
number
of
cells
in
single
cell
data
allow
testing
of
differences
in
distribution
-
not
just
mean
as
in
traditional
DE
analysis
between
samples
.
For
both
of
these
analyses
method
development
is
ongoing
and
systematic
benchmarking
of
available
methods
does
not
yet
exist
.
###
DE
in
a
real
dataset
##
Further
exploration
...
...
Write
Preview
Supports
Markdown
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment