Merging before SCTransform and including nFeature_RNA and nCount_RNA in vars.to.regress. #9595

Open
ChrisSteel-bio opened this issue Jan 3, 2025 · 0 comments


Dear developers,

Thank you for providing such an amazing suite of tools! I have a question regarding SCTransform and its use with regression.

I am working with a dataset of ~100K cells spanning a few different conditions, including technical replicates. The dataset consists of a single cell type (a cell line) derived from a fast-growing, relatively homogeneous tumor.

I’ve observed that merging samples and then running SCTransform results in minimal batch effects, whereas running SCTransform separately for each sample introduces more noise. However, one technical replicate still exhibits a distinct batch cluster that becomes apparent when using more than 10 principal components (PCs) for clustering. This particular replicate has significantly higher UMI and gene counts per cell.

I’ve tried several approaches, including down-sampling at both the matrix and molecule levels, which helps only marginally: the batch cluster still appears when more than 10 PCs are used. I’ve also attempted integration methods (Scanorama, Harmony, etc.), but these tend to over-correct.
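For anyone wanting to quantify which PCs carry the replicate signal described above, here is a minimal base-R sketch. It is an illustration only: `pc_batch_r2` is a hypothetical helper (not part of Seurat), and the toy embedding is simulated, with a shift placed on PC 12 purely to show what the output looks like.

```r
# Hypothetical helper: fraction of variance in each PC explained by batch.
# emb   : cells x PCs numeric matrix
# batch : vector of batch/replicate labels, one per cell
pc_batch_r2 <- function(emb, batch) {
  batch <- factor(batch)
  apply(emb, 2, function(pc) summary(lm(pc ~ batch))$r.squared)
}

# Simulated example (not real data): 3 replicates, 20 PCs,
# with one replicate shifted along PC 12 only.
set.seed(42)
batch <- rep(c("rep1", "rep2", "rep3"), each = 200)
emb <- matrix(rnorm(600 * 20), nrow = 600,
              dimnames = list(NULL, paste0("PC_", 1:20)))
emb[batch == "rep3", 12] <- emb[batch == "rep3", 12] + 2

r2 <- pc_batch_r2(emb, batch)
print(round(r2, 3))  # per-PC R^2; the shifted PC should stand out
```

On a real object, the embedding could be taken from `Embeddings(RBL_merg, reduction = "pca")` and the labels from whichever metadata column holds the replicate identity (the column name depends on how the object was built). Running this before and after a correction gives a rough, per-PC view of how much replicate signal remains.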

When using SCTransform as shown below, the problematic batch cluster is corrected, and the replicates and conditions align well with expectations:

RBL_merg <- SCTransform(
  RBL_merg,
  vars.to.regress = c("S.Score", "G2M.Score", "percent.mt", "nFeature_RNA", "nCount_RNA"),
  conserve.memory = TRUE,
  verbose = TRUE
)

My understanding is that running SCTransform on the merged dataset leads to a consistent per-gene residual calculation across all cells, which in my case is giving good results for comparisons across samples.

I am aware that SCTransform already accounts for differences in nCount_RNA (sequencing depth) as part of its model. However, explicitly specifying both nFeature_RNA and nCount_RNA in vars.to.regress yields robust results for us, with the problematic batch cluster dissolving completely; specifying only one of the two does not. I imagine this combined adjustment applies a more liberal correction, which seems to suit our data well.
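To make concrete what regressing out both covariates amounts to, here is a base-R sketch of per-gene linear regression on a residual matrix. This is my understanding of the general idea behind vars.to.regress, not Seurat's actual implementation; `regress_out` is a hypothetical helper and all the data below are simulated.

```r
# Hypothetical helper: remove the linear effect of covariates from each gene.
# mat    : genes x cells numeric matrix (e.g. residuals)
# covars : data.frame of per-cell covariates, one row per cell
regress_out <- function(mat, covars) {
  design <- cbind(Intercept = 1, as.matrix(covars))
  qr_design <- qr(design)
  # qr.resid() works on columns, so go to cells x genes and back
  out <- t(qr.resid(qr_design, t(mat)))
  dimnames(out) <- dimnames(mat)
  out
}

# Simulated example (not real data): gene counts detected per cell
# track sequencing depth, and the toy matrix depends on nFeature_RNA.
set.seed(1)
n_cells <- 300
n_count <- round(exp(rnorm(n_cells, mean = 9, sd = 0.5)))
n_feature <- round(2000 * log10(n_count) - 5000 + rnorm(n_cells, sd = 150))
covars <- data.frame(nCount_RNA = n_count, nFeature_RNA = n_feature)

toy <- matrix(rnorm(50 * n_cells), nrow = 50)
toy <- toy + outer(runif(50), as.numeric(scale(n_feature)))

corrected <- regress_out(toy, covars)

# After the fit, no gene retains a linear association with either covariate.
print(max(abs(cor(t(corrected), covars$nFeature_RNA))))
print(max(abs(cor(t(corrected), covars$nCount_RNA))))
```

Because nFeature_RNA is usually a non-linear function of nCount_RNA, including both lets the linear fit absorb depth-related variation that either covariate alone would miss, which may be why the combination behaves differently from using just one.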

As I understand it, this approach would only obscure biological signal if global RNA abundance were of interest, which is not the case for these data. If anyone has any thoughts on this approach, I'd be really keen to hear them.
