Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Function “fast.cophenetic.phylo” failed #6

Open
microbiomeCat opened this issue Jun 21, 2017 · 4 comments
Open

Function “fast.cophenetic.phylo” failed #6

microbiomeCat opened this issue Jun 21, 2017 · 4 comments

Comments

@microbiomeCat
Copy link

microbiomeCat commented Jun 21, 2017

Hi Sebastian,
I am trying to run the "prepare.community_similarity.R" script with a higher PARAM$thresh.otu_size <- 300 for higher speed, however, I get
`[[998]]
[1] "Error in nodepath(x, from = tip, to = my.root) : can't find 'my.root'\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in nodepath(x, from = tip, to = my.root): can't find 'my.root'>

[[999]]
[1] "Error in nodepath(x, from = tip, to = my.root) : can't find 'my.root'\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in nodepath(x, from = tip, to = my.root): can't find 'my.root'>

[[1000]]
[1] "Error in nodepath(x, from = tip, to = my.root) : can't find 'my.root'\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in nodepath(x, from = tip, to = my.root): can't find 'my.root'>

[ reached getOption("max.print") -- omitted 2329 entries ]
Calculate distances between all pairs of tips => 2017-06-21 10:05:55
Done => 2017-06-21 10:06:32
Coerce into matrix => 2017-06-21 10:06:32
Done. Returning => 2017-06-21 10:06:34
Warning message:
In mclapply(seq(1, Ntip(x)), function(tip) { :
all scheduled cores encountered errors in user code`

I replace my.root with 1 in the script “my.node_paths <- mclapply(seq(1, Ntip(x)), function (tip) {nodepath(x, from=tip, to=my.root)}, mc.cores=use.cores);” it works. I don't kown if it's correct?

By the way, I get the histogram.SparCC.pdf file with blank.

bests!
Hongyi

@defleury
Copy link
Owner

Dear Hongyi,

I haven't tried running the script with such a high OTU size cutoff previously; I assume that this way, you probably remove so many OTUs that some samples end up with (almost) no taxa. I am currently out of office, but will be back tomorrow and give this a try. I'll post an update here :-)

Best,

Sebastian

@microbiomeCat
Copy link
Author

Dear Sebastian
I have tried your script with my 16s data (178 soil samples with sequences in each sample rarefied to 40000). I have it run successfully with some modification. I think some of them are worth mentioning.

First, I replace "my.root" with "1" in the script “my.node_paths <- mclapply(seq(1, Ntip(x)), function (tip) {nodepath(x, from=tip, to=my.root)}, mc.cores=use.cores);” in functions.community_similarity.R as I have mentioned before. "my.root" is not defined.

Second, since my sequence number is very large, I have to set the "size.thresh <-10" when doing Sparcc (to avoid consuming excessive amount of memory or time). However, if you look at the script below:
if (get.cs[[t]]$name %in% c("TINA, unweighted", "TINA, weighted")) {
curr.cs <- community.similarity.corr.par(ot, S=S.sparcc, distance=get.cs[[t]]$call, blocksize=600, use.cores=15)
} else if (get.cs[[t]]$name %in% c("PINA, unweighted", "PINA, weighted")) {
curr.cs <- community.similarity.corr.par(ot, S=S.phylo, distance=get.cs[[t]]$call, blocksize=600, use.cores=15)
};
the "ot" is not filtered while the "S.sparcc" or "S.phylo" is calculated with filtered OTU matrix. So I used "ot.2" instead of "ot" and added:
remove.otus <- rowSums(ot) < size.thresh; keep.otus <- ! remove.otus;
otus <- rownames(o.t);
ot.2=ot[otus,]
Using "o.t" directly causes problems. By the way this two script is memory consuming (700G) with the parameter " blocksize=600, use.cores=15".

At last, I found the numbers in the matrix of PINA (both weighted and unweighted) are all very small (1e-3~1e-5), is that right? I found PINA was also small in your ISME paper, but larger than mine. Should I use higher "size.thresh" since the otu table of soil samples seems to be sparser than hmp otu table?

Best,

Hongyi

@microbiomeCat
Copy link
Author

Oh, I forgot to say that "my.tree" should also be filtered for PINA:
my.tree <- drop.tip(tmp.tree, rownames(ot.raw)[ !rownames(ot.raw) %in% rownames(ot.2)]);
my.tree.rooted <- root(my.tree, sample(rownames(ot.2), 1), resolve.root=T);
my.ps <- phyloseq(otu_table(as.matrix(ot.2), taxa_are_rows=T), sample_data(sample.data), phy_tree(my.tree.rooted));
my.tree <- phy_tree(my.ps).
I used the filtered "ot.2" or "my.tree" for PINA or TINA, while "ot" and unfiltered "my.tree" for Unifrac

@defleury
Copy link
Owner

defleury commented Aug 3, 2017

Dear Hongyi,

thanks for the update and the comments, and sorry for not coming back to you about this before...

First, I replace my.root with 1 in the script my.node_paths <- mclapply(seq(1, Ntip(x)), function (tip) {nodepath(x, from=tip, to=my.root)}, mc.cores=use.cores); in functions.community_similarity.R as I have mentioned before. my.root is not defined.

I'm aware of the issue, but I'm not sure if your fix is the correct one. The original problem of my.root not being defined is clearly a bug on my side. I tried to follow this up a bit to come up with a "reasonable" fix, and there seem to be some more general problems with how ape handles (re-)rooted trees.
Currently, I implemented a random rooting, following the practice used in the phyloseq package to calculate UniFrac. This is very likely not ideal, and I've been wondering how to get a more appropriate root automatically, with no or very limited manual tinkering needed. In any case, I think using tip 1 as "root" is probably not correct.
How did you generate your tree in the first place? Does it come with a resolved root? If yes, point explicitly to that; otherwise select the root randomly or manually and provide it explicitly to the function. That's what my.root was doing in the original script I made; but obviously it doesn't work in the current standalone function.

the "ot" is not filtered while the "S.sparcc" or "S.phylo" is calculated with filtered OTU matrix. So I used "ot.2" instead of "ot" and added

I am aware of this problem; I had out-commented the size filtering within the SparCC code for various reasons. Again, I hope to get around to properly re-write the code, and removing such inconsistencies is a big part of that. In the meantime, sorry for the inconvenience!

By the way this two script is memory consuming (700G) with the parameter " blocksize=600, use.cores=15"

Yes, the memory management is the other big problem of the current implementation. I have spent quite some time trying to figure this out, and I believe the problem is at the level of mclapply's memory management. For some reason, even if you use structures as read-only, mclapply can run into very high memory usage. Others have observed the problem (see e.g. here and here, but I am not aware of an easy fix. This is another reason why I am hoping to re-implement core functions for TINA/PINA without relying on mclapply.

At last, I found the numbers in the matrix of PINA (both weighted and unweighted) are all very small (1e-3~1e-5), is that right? I found PINA was also small in your ISME paper, but larger than mine. Should I use higher "size.thresh" since the otu table of soil samples seems to be sparser than hmp otu table?

I have also encountered low PINA values in various contexts. One possible reason is the rooting bug mentioned above when computing the cophenetic distances, although I am not certain how much it would actually influence the absolute value range.
In general, the ranges of both TINA and PINA are "non-standard" in the sense that a distance of 0, 0.5 and 1 have different meanings than e.g. for Jaccard, Bray-Curtis of UniFrac. A T/PINA distance of 0 indicates complete "equivalence" of the samples on the network - all pairwise similarities between taxa between samples are 1. A distance of 0.5 indicates "neutral" similarity, for example if no taxa are shared and all pairwise taxa similarities are 0.5 (as transformed from a correlation of 0). And a distance of 1 means that all pairwise taxa similarities are 0 (complete anti-correlation). So values in the range of [0, 0.5] are not unexpected; the very low absolute values may be a feature of the tree used. In the end, for practical purposes (ordination, PERMANOVA, clustering, etc.), what matters are the ranks of pairwise distances, as well as their ratios, and not so much the scale.

Thank you again for your comments. I hope to be able to clean up the code accordingly; for now, there is no "easy fix" for the rooting issue and the mclapply memory consumption issue. I'll push updates as soon as I figure those out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants