scRNAseq-analysis-notes

my scRNAseq analysis notes

The reason

Single cell RNAseq is becoming more and more popular, and as a technique it might become as common as PCR. I just got some 10x Genomics single cell RNAseq data to play with, so it is a good time for me to take notes here. I hope they are useful for other people as well.

readings before doing anything

single cell tutorials

Course material in notebook format for learning about single cell bioinformatics methods
Analysis of single cell RNA-seq data course, Cambridge University. Great tutorial!
F1000 workflow paper: A step-by-step workflow for low-level analysis of single-cell RNA-seq data by Aaron Lun, the author of diffHic, GenomicInteractions and csaw.
2016 Bioconductor workshop: Analysis of single-cell RNA-seq data with R and Bioconductor
paper: Single-Cell Transcriptomics Bioinformatics and Computational Challenges
Variance stabilizing scRNA-seq counts: is log2(x+1) reasonable?
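As a rough illustration of the transform that the last link asks about, here is a minimal sketch of log2(x+1) applied to a toy count matrix after a simple library-size scaling. The toy counts, the scaling to the mean library size and the function name `log2p1` are assumptions made for this example; they are not taken from any of the linked materials.

```python
import math

def log2p1(counts):
    """Scale each cell (row) to the mean library size, then apply log2(x + 1).

    Assumes every cell has at least one count (no empty rows).
    """
    lib_sizes = [sum(row) for row in counts]        # total counts per cell
    mean_size = sum(lib_sizes) / len(lib_sizes)     # common target depth
    return [
        [math.log2(x / size * mean_size + 1.0) for x in row]
        for row, size in zip(counts, lib_sizes)
    ]

# Two made-up cells with the same composition but a 4x difference in depth.
toy = [[0, 1, 9],
       [0, 4, 36]]
transformed = log2p1(toy)
for row in transformed:
    print([round(v, 3) for v in row])
```

Zeros stay at zero after the transform, and the two toy cells end up identical once depth is scaled out. Whether the +1 pseudocount behaves well for sparse real data is exactly what the linked post discusses.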

single cell RNA-seq normalization

paper: Assessment of single cell RNA-seq normalization methods
paper: A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications
Normalizing single-cell RNA sequencing data: challenges and opportunities Nature Methods
SinQC: A Method and Tool to Control Single-cell RNA-seq Data Quality.
Scone Single-Cell Overview of Normalized Expression data

single cell impute

SAVER: gene expression recovery for single-cell RNA sequencing. An expression recovery method for unique molecule index (UMI)-based scRNA-seq data that borrows information across genes and cells to provide accurate expression estimates for all genes.
DeepImpute: an accurate, fast and scalable deep neural network method to impute single-cell RNA-Seq data https://www.biorxiv.org/content/early/2018/06/22/353607
MAGIC (Markov Affinity-based Graph Imputation of Cells) is a method for imputing missing values and restoring the structure of large biological datasets.

single cell batch effect

Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data
Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors
Panoramic stitching of heterogeneous single-cell transcriptomic data. Here we present Scanorama, inspired by algorithms for panorama stitching, that overcomes the limitations of existing methods to enable accurate, heterogeneous scRNA-seq data set integration.

Single cell RNA-seq

Considerable differences are found between the methods in terms of the number and characteristics of the genes that are called differentially expressed. Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data. Generally, however, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.

paper: Power Analysis of Single Cell RNA‐Sequencing Experiments
paper: The contribution of cell cycle to heterogeneity in single-cell RNA-seq data
paper: Batch effects and the effective design of single-cell gene expression studies
On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data
paper: Comparison of methods to detect differentially expressed genes between single-cell populations
review: Single-cell genome sequencing: current state of the science
Ginkgo A web tool for analyzing single-cell sequencing data.
SingleCellExperiment bioc package Defines a S4 class for storing data from single-cell experiments. This includes specialized methods to store and retrieve spike-in information, dimensionality reduction coordinates and size factors for each cell, along with the usual metadata for genes and libraries.
ASAP: a Web-based platform for the analysis and inter-active visualization of single-cell RNA-seq data
Seurat is an R package designed for the analysis and visualization of single cell RNA-seq data. It contains easy-to-use implementations of commonly used analytical techniques, including the identification of highly variable genes, dimensionality reduction (PCA, ICA, t-SNE), standard unsupervised clustering algorithms (density clustering, hierarchical clustering, k-means), and the discovery of differentially expressed genes and markers.
R package for the statistical assessment of cell state hierarchies from single-cell RNA-seq data
Monocle Differential expression and time-series analysis for single-cell RNA-Seq and qPCR experiments.
Single Cell Differential Expression: bioconductor package scde
Sincera: A Computational Pipeline for Single Cell RNA-Seq Profiling Analysis. Bioconductor package will be available soon.
MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data
scDD: A statistical approach for identifying differential distributions in single-cell RNA-seq experiments
Inference and visualisation of Single-Cell RNA-seq Data as a hierarchical tree structure: bioconductor CellTree
Fast and accurate single-cell RNA-Seq analysis by clustering of transcript-compatibility counts by Lior Pachter et al.
cellity: Classification of low quality cells in scRNA-seq data using R.
bioconductor: using scran to perform basic analyses of single-cell RNA-seq data
scater: single-cell analysis toolkit for expression with R
Monovar: single-nucleotide variant detection in single cells
Single-cell mRNA quantification and differential analysis with Census
CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data
CellView: Interactive Exploration Of High Dimensional Single Cell RNA-Seq Data
Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing and simulation of gene regulatory networks. The Python-based implementation efficiently deals with datasets of more than one million cells.

single cell RNA-seq clustering

Single Cell Clustering Comparison A blog post.
A systematic performance evaluation of clustering methods for single-cell RNA-seq data F1000 paper by Mark Robinson. tl;dr version: "SC3 and Seurat show the most favorable results".
Geometry of the Gene Expression Space of Individual Cells
pcaReduce: Hierarchical Clustering of Single Cell Transcriptional Profiles.
CountClust: Clustering and Visualizing RNA-Seq Expression Data using Grade of Membership Models. Fits grade of membership models (GoM, also known as admixture models) to cluster RNA-seq gene expression count data, identifies characteristic genes driving cluster memberships, and provides a visual summary of the cluster memberships
FastProject: A Tool for Low-Dimensional Analysis of Single-Cell RNA-Seq Data
SNN-Cliq Identification of cell types from single-cell transcriptomes using a novel clustering method
Compare clusterings for single-cell sequencing bioconductor package. The goal of this package is to encourage the user to try many different clustering algorithms in one package structure. We give tools for running many different clusterings and choices of parameters. We also provide visualization to compare many different clusterings and algorithm tools to find common shared clustering patterns.
CIDR: Ultrafast and accurate clustering through imputation for single cell RNA-Seq data
SC3: consensus clustering of single-cell RNA-Seq data. SC3 achieves high accuracy and robustness by consistently integrating different clustering solutions through a consensus approach. Tests on twelve published datasets show that SC3 outperforms five existing methods while remaining scalable, as shown by the analysis of a large dataset containing 44,808 cells. Moreover, an interactive graphical implementation makes SC3 accessible to a wide audience of users, and SC3 aids biological interpretation by identifying marker genes, differentially expressed genes and outlier cells.
GiniClust2: a cluster-aware, weighted ensemble clustering method for cell-type detection
FateID infers cell fate bias in multipotent progenitors from single-cell RNA-seq data
matchSCore: Matching Single-Cell Phenotypes Across Tools and Experiments. In this work we introduce matchSCore (https://github.com/elimereu/matchSCore), an approach to match cell populations fast across tools, experiments and technologies. We compared 14 computational methods and evaluated their accuracy in clustering and gene marker identification in simulated data sets.
Cluster Headache: Comparing Clustering Tools for 10X Single Cell Sequencing Data
The celaref (cell labelling by reference) package aims to streamline the cell-type identification step, by suggesting cluster labels on the basis of similarity to an already-characterised reference dataset, whether that's from a similar experiment performed previously in the same lab, or from a public dataset from a similar sample.
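The consensus idea mentioned in the SC3 entry above can be illustrated with a co-association (consensus) matrix: for every pair of cells, record the fraction of clustering runs that put the two cells in the same cluster. This is a generic sketch of that idea, not SC3's actual implementation; the function name and the three label vectors are made up for the example.

```python
def consensus_matrix(labelings):
    """Fraction of clustering runs in which each pair of cells shares a cluster.

    `labelings` is a list of runs; each run is a list with one label per cell.
    Label values only need to be consistent within a run, not across runs.
    """
    n_runs = len(labelings)
    n_cells = len(labelings[0])
    return [
        [sum(run[i] == run[j] for run in labelings) / n_runs
         for j in range(n_cells)]
        for i in range(n_cells)
    ]

# Three hypothetical clustering runs over the same four cells.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],   # same partition as the first run, labels swapped
    [0, 0, 0, 1],
]
cm = consensus_matrix(runs)
for row in cm:
    print([round(v, 2) for v in row])
```

Cells that always co-cluster get a value of 1 and cells that never do get 0. A final clustering can then be derived from this matrix, for example by hierarchical clustering, which is one way to combine the runs.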

dimension reduction and visualization of clusters

Principal Component Analysis Explained Visually
PCA, MDS, k-means, Hierarchical clustering and heatmap. I wrote it.
horseshoe effect from PCA Spurious structures in latent space decomposition and low-dimensional embedding methods
also read chapter 9 of http://web.stanford.edu/class/bios221/book/Chap-MultivaHetero.html
A tale of two heatmaps. I wrote it.
Heatmap demystified. I wrote it.
Cluster Analysis in R - Unsupervised machine learning very practical intro on STHDA website.
I wrote about PCA and heatmaps on RPubs.
A must-read on clustering analysis for high-dimensional biological data: Avoiding common pitfalls when clustering biological data
How does gene expression clustering work? A must read for clustering.
How to read PCA plots for scRNAseq by Valentine Svensson.

See https://t.co/yxCb85ctL1: "MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters" @mikelove @AndrewLBeam
— Rileen Sinha (@RileenSinha) August 25, 2016
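To make the "PCA for variance" part of that summary concrete, here is a minimal sketch that computes how much of the total variance each principal component captures. To stay dependency-free it is restricted to data with two features, where the eigenvalues of the covariance matrix have a closed form. The function name and the toy points are assumptions for illustration only.

```python
import math

def explained_variance_2d(points):
    """Fraction of total variance captured by PC1 and PC2 for 2-feature data.

    Assumes at least two points that are not all identical.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix:
    # these are the variances along PC1 and PC2.
    mid = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [mid + spread, mid - spread]
    total = a + c   # trace = total variance
    return [ev / total for ev in eigenvalues]

# Made-up points lying close to the line y = x.
toy_points = [(0, 0.0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)]
ratios = explained_variance_2d(toy_points)
print([round(r, 4) for r in ratios])
```

Because the toy points vary almost entirely along one direction, nearly all of the variance lands on the first component. For real expression matrices with many genes you would use a library PCA routine instead of this two-feature shortcut.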

paper: Outlier Preservation by Dimensionality Reduction Techniques

"MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters"

How to Use t-SNE Effectively
t-SNE explained in plain javascript
Rtsne R package for T-SNE
rtsne An R package for t-SNE (t-Distributed Stochastic Neighbor Embedding). There was a bug in rtsne: https://gist.github.com/mikelove/74bbf5c41010ae1dc94281cface90d32
t-SNE-Heatmaps Beta version of 1D t-SNE heatmaps to visualize expression patterns of hundreds of genes simultaneously in scRNA-seq.
Generalizable and Scalable Visualization of Single-Cell Data Using Neural Networks. Standard methods, such as t-stochastic neighbor embedding (t-SNE), are not scalable to datasets with millions of cells and the resulting visualizations cannot be generalized to analyze new datasets. Here we introduce net-SNE, a generalizable visualization approach that trains a neural network to learn a mapping function from high-dimensional single-cell gene-expression profiles to a low-dimensional visualization.
PHATE dimensionality reduction method. Paper: http://biorxiv.org/content/early/2017/03/24/120378 PHATE also uncovers and emphasizes progression and transitions (when they exist) in the data, which are often missed in other visualization-capable methods. Such patterns are especially important in biological data that contain, for example, single-cell phenotypes at different phases of differentiation, patients at different stages of disease progression, and gut microbial compositions that vary gradually between individuals, even of the same enterotype.
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data. Run from R: https://gist.github.com/crazyhottommy/caa5a4a4b07ee7f08f7d0649780832ef
umapr UMAP dimensionality reduction in R
uwot An R package implementing the UMAP dimensionality reduction method. UMAP multi-threaded.
Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE). The FIt-SNE implementation is generally faster than UMAP when you have more than 3,000 cells. With tens of thousands of cells, FIt-SNE scales at the same rate as UMAP. However, note that the benchmark uses a log-log scale: even once FIt-SNE starts scaling at the rate of UMAP, it is still consistently about 4 times faster. In other words, a dataset that takes an hour for UMAP will take 15 minutes for FIt-SNE. See the benchmark by Valentine Svensson here: https://nbviewer.jupyter.org/gist/vals/a138b6b13ae566403687a241712e693b

interesting papers to read

database

single cell expression atlas

advance of scRNA-seq tech

Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. No isolation of single cells needed!
Dynamics and Spatial Genomics of the Nascent Transcriptome by Intron seqFISH
Highly Multiplexed Single-Cell RNA-seq for Defining Cell Population and Transcriptional Spaces. Blog post by Lior Pachter: The benefits of multiplexing. Need to re-read carefully.
Three-dimensional intact-tissue sequencing of single-cell transcriptional states

pseudotemporal modelling

large scale single cell analysis

bigSCale: an analytical framework for big-scale single-cell data. github link for millions of cells (starts with a count matrix)
Alevin: An integrated method for dscRNA-seq quantification based on Salmon.
SCope: Visualization of large-scale and high dimensional single cell data

The field is advancing so fast!!

check this website for the tools being added:
https://www.scrna-tools.org/

paper published:
Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database

contamination of 10x data

https://twitter.com/constantamateur/status/994832241107849216?s=11

Did you know that droplet based single cell RNA-seq data (like 10X) is contaminated by ambient mRNAs? Good news though, we've written a paper (https://www.biorxiv.org/content/early/2018/04/20/303727 …) and created an R package called SoupX (https://github.com/constantAmateur/SoupX) to fix this problem.

Is this really a problem? It depends on your experiment. Contamination ranges from 2% - 50%. 10% seems common; it's 8% for 10X PBMC data. Solid tissues are typically worse, but there's no way to know in advance. Wouldn't you like to know how contaminated your data are?

These mRNAs come from the single cell suspension fed into the droplet creation system. They mostly get there from lysed cells and so resemble the cells being studied. This means the profile of the contamination is experiment specific and creates a batch effect.
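To give a feel for what a contamination fraction means in practice, here is a toy calculation that treats a droplet's counts as a mixture of the cell's own mRNA and an ambient ("soup") profile. It is a simplified illustration under made-up numbers, not the algorithm SoupX implements; the function name, `rho` and `soup_profile` are hypothetical.

```python
def expected_ambient_counts(total_umis, rho, soup_profile):
    """Expected UMIs per gene that come from ambient mRNA rather than the cell.

    total_umis   : total UMI count observed in the droplet
    rho          : assumed contamination fraction (0 to 1)
    soup_profile : relative abundance of each gene in the ambient mRNA
    """
    total = sum(soup_profile)
    fractions = [x / total for x in soup_profile]   # normalize to sum to 1
    return [rho * total_umis * f for f in fractions]

# A droplet with 1,000 UMIs at 10% contamination (the "common" value quoted above),
# with a made-up ambient profile over three genes.
soup = [5, 3, 2]
ambient = expected_ambient_counts(1000, 0.10, soup)
print(ambient)
```

Under these assumptions about 100 of the 1,000 UMIs would be ambient, split across genes in proportion to the soup profile. A gene that is abundant in the soup can therefore appear expressed in cells that do not express it at all.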

cellranger is the toolkit developed by 10x Genomics to process the data.

some tools for 10x

DropletUtils Provides a number of utility functions for handling single-cell (RNA-seq) data from droplet technologies such as 10X Genomics. This includes data loading, identification of cells from empty droplets, removal of barcode-swapped pseudo-cells, and downsampling of the count matrix.
