Compute differential expression (tumor versus normal in paired samples) for cancers #31

Open
wants to merge 10 commits into
base: master

Conversation

dhimmel
Member

@dhimmel dhimmel commented Oct 5, 2016

Note to untrack data/complete/differential-expression.tsv.bz2 before merging.

This is something that @ksimeono -- a cancer biologist -- was interested in. It's potentially out of scope for Cognoma, but I think it's pretty useful.

No rush to merge, just wanted to get this up here.

@dhimmel
Member Author

dhimmel commented Oct 5, 2016

Output looks like this:

acronym entrez_gene_id patients tumor_mean normal_mean mean_diff t_stat mlog10_p_value symbol
BLCA 1 19 5.328 4.966 0.3621 1.062 0.5197 A1BG
BLCA 2 19 12.48 15.25 -2.765 -10.23 8.202 A2M
BLCA 9 19 6.339 6.009 0.3295 1.197 0.6073 NAT1
BLCA 10 19 1.008 0.5923 0.4162 1.472 0.8006 NAT2
BLCA 12 19 6.082 9.63 -3.548 -4.912 3.95 SERPINA3
BLCA 13 19 4.782 4.503 0.2795 0.3025 0.1159 AADAC
BLCA 14 19 11.43 11.45 -0.02632 -0.2713 0.1028 AAMP
BLCA 15 19 0.7949 0.5788 0.2161 1.029 0.4986 AANAT
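
For readers wondering where columns such as mean_diff, t_stat, and mlog10_p_value might come from, here is a minimal sketch for a single gene. The expression values are synthetic, and reading mlog10_p_value as the negative base-10 logarithm of a paired t-test p-value is an assumption based on the column name and the paired design, not something taken from the notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_patients = 19  # same number of paired patients as the BLCA rows above

# Synthetic expression values: one tumor and one normal sample per patient.
normal = rng.normal(loc=15.0, scale=1.0, size=n_patients)
tumor = normal - 2.5 + rng.normal(scale=1.0, size=n_patients)

# Paired t-test on the per-patient tumor/normal pairs.
t_stat, p_value = stats.ttest_rel(tumor, normal)

row = {
    "patients": n_patients,
    "tumor_mean": tumor.mean(),
    "normal_mean": normal.mean(),
    "mean_diff": (tumor - normal).mean(),
    "t_stat": t_stat,
    # Assumed meaning of the column: minus log10 of the p-value.
    "mlog10_p_value": -np.log10(p_value),
}
print({key: f"{value:.4g}" for key, value in row.items()})
```

Because the samples are paired, mean_diff equals tumor_mean minus normal_mean, which matches the rows shown above.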

Tagging @ksimeono who had interest in the colon adenocarcinoma (COAD) data.
entrez_gene_id was a float due to odd behavior by df.merge in pandas. As a
result, the `float_format='%.4g'` argument of to_csv wrote entrez_gene_id in
exponent notation, irreversibly corrupting the IDs.
Rerun with gene data created by cognoma#32.
Should result in all genes having a symbol.
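
The entrez_gene_id corruption described above can be reproduced in miniature. The sketch below shows one way an integer column can turn into floats after a merge (unmatched rows introduce NaN); whether this was the actual cause in the PR is an assumption. The symbol GENE_X, its ID, and the UNMAPPED row are hypothetical and only there for illustration.

```python
import pandas as pd

# Illustrative values only; GENE_X and its ID are hypothetical.
symbols = pd.DataFrame({"symbol": ["A1BG", "A2M", "GENE_X", "UNMAPPED"]})
gene_ids = pd.DataFrame(
    {"symbol": ["A1BG", "A2M", "GENE_X"], "entrez_gene_id": [1, 2, 100133144]}
)

# The unmatched row gets NaN, which forces the integer column to float64.
merged = symbols.merge(gene_ids, on="symbol", how="left")
print(merged["entrez_gene_id"].dtype)

# float_format applies to float columns, so large IDs lose digits.
corrupted = merged.to_csv(sep="\t", index=False, float_format="%.4g")
print(corrupted)

# One possible fix: drop unmatched rows and cast back to int before writing.
fixed_df = merged.dropna(subset=["entrez_gene_id"]).astype({"entrez_gene_id": int})
fixed = fixed_df.to_csv(sep="\t", index=False, float_format="%.4g")
print(fixed)
```

Integer columns are not touched by float_format, so casting the ID column back to an integer type before calling to_csv keeps the IDs intact.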
@dhimmel
Member Author

dhimmel commented Oct 10, 2016

@gwaygenomics what do you think of the plot in 5.differential-expression.ipynb? In other words, do you see biology within?

[Figure: cancer-by-nmf-component]

The heatmap shows differential expression signatures for each cancer. Genes were reduced to 100 components using NMF. Fill color represents the t-statistic.

@gwaybio
Member

gwaybio commented Oct 10, 2016

what do you think of the plot

There is a lot going on in it! I am going to outline what it is and try to extract biology along the way.

  1. NMF for 100 components - this ideally would bicluster your data into samples and linear gene modules.
    • Do you have any sense of whether 100 components reconstruct the data well? A Pareto reconstruction curve would be useful to visualize, if it's not too computationally taxing.
    • You define components this way and then you subset based on matching samples
  2. Disease-type specific t-tests for each NMF component
    • Testing the mean difference between tumor vs. tumor adjacent
    • Equal variance assumption is probably violated - maybe good to use Welch's t-test instead
      • Could be part of the reason for huge t statistics?
      • Not sure if Python has an implementation of that, but R does (t.test(tumor, tumor_adjacent, var.equal=FALSE))
    • This analysis is somewhat similar to Gross et al. 2015
  3. This is tough to interpret without seeing the gene contributions that make up each component (could do some sort of rough pathway analysis (like WebGestalt) using the high-weight genes from the more variable components). If you're curious, I would probably start with some of the components from BRCA or KIRC (bigger sample size)
    • But despite this, it looks really cool - diseases cluster as expected (e.g. LUAD/LUSC and COAD/READ) and there appear to be components that are consistently up or down in tumor vs. tumor adjacent
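
The reconstruction curve suggested in point 1 can be sketched as reconstruction error against the number of components. To stay self-contained, the example uses a hand-written multiplicative-update NMF on synthetic rank-3 data; this is an assumption made for illustration and not the notebook's implementation, where a library NMF would normally be the better choice. The function name `nmf` and all parameters are hypothetical.

```python
import numpy as np


def nmf(V, k, n_iter=300, seed=0, eps=1e-9):
    """Factor non-negative V (samples x genes) into W @ H with k components."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the Frobenius-norm objective.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic non-negative data with a true rank of 3.
rng = np.random.default_rng(1)
V = rng.random((40, 3)) @ rng.random((3, 25))

# Reconstruction error for a few candidate numbers of components.
errors = {}
for k in (1, 2, 3, 5):
    W, H = nmf(V, k)
    errors[k] = np.linalg.norm(V - W @ H)
    print(k, round(errors[k], 4))
```

Plotting these errors against k and looking for the point where the curve flattens would give a rough sense of whether 100 components is more than the data needs.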

I think a rough description of what is going on with the genes in each component would spark more biological discussion. Another thing to keep in mind is that the "normals" are actually "tumor adjacent" and are opportunistically extracted from "nearby" tissue when the surgeon can (therefore, no GBM tumor adjacent). I think it's important not to consider this "normal" (Troester et al. 2016) (to be clear, the terminology is ok, but I mean thinking about this as normal tissue could be a trap!)

@cgreene
Member

cgreene commented Oct 10, 2016

A conceptual summary comparing the approach with Gross et al. would be good somewhere in the notebook - particularly if you link to that paper.

@ksimeono

Agree with @gwaygenomics that some sort of biologically meaningful notation of the metagenes would be beneficial. Out of my element in terms of what's possible, but grouping genes by pathway initially instead of metagene could be something similar but with inherent meaning.

Along the same lines, expanded names for the cancers, rather than just TCGA acronyms, would improve readability.

@dhimmel
Member Author

dhimmel commented Oct 10, 2016

@gwaygenomics we're using a paired t-test, about which the following has been said:

The paired t-test calculates the difference within each before-and-after pair of measurements, determines the mean of these changes, and reports whether this mean of the differences is statistically significant. A paired t-test can be more powerful than a 2-sample t-test because the latter includes additional variation occurring from the independence of the observations. A paired t-test is not subject to this variation because the paired observations are dependent. Also, a paired t-test does not require both samples to have equal variance. Therefore, if you can logically address your research question with a paired design, it may be advantageous to do so, in conjunction with a paired t-test, to get more statistical power.
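
To make the quoted description concrete, here is a small sketch on synthetic data (the effect sizes and noise levels are assumptions chosen for illustration). It checks that a paired t-test is the same as a one-sample t-test on the within-pair differences, and contrasts it with Welch's unpaired test, which SciPy exposes through `ttest_ind` with `equal_var=False`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Each patient has a large baseline shift shared by both of their samples.
patient_effect = rng.normal(scale=3.0, size=19)
normal = 10.0 + patient_effect + rng.normal(scale=0.5, size=19)
tumor = 11.0 + patient_effect + rng.normal(scale=0.5, size=19)

# Paired test: works on the per-patient differences.
paired = stats.ttest_rel(tumor, normal)

# Equivalent formulation: one-sample test of the differences against zero.
one_sample = stats.ttest_1samp(tumor - normal, popmean=0.0)

# Welch's test ignores the pairing, so patient-to-patient variation stays in.
welch = stats.ttest_ind(tumor, normal, equal_var=False)

print(paired.statistic, one_sample.statistic, welch.statistic)
```

In this setup the paired statistic is much larger in magnitude than the Welch statistic, because the shared patient effect cancels out in the differences.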

Thanks @cgreene, @ksimeono, & @gwaygenomics for the comments. Will be at least a week before I get around to addressing them.

@gwaybio
Member

gwaybio commented Oct 10, 2016

A paired t-test is not subject to this variation because the paired observations are dependent. Also, a paired t-test does not require both samples to have equal variance.

Ah yes, good point!

4 participants