Issue with procuring MSigDB gene lists for mice #192
Comments
Thank you for raising this! It seems that the function is a bit too stringent in validating the input. I'll try to revise the behaviour.
Opening that up would be nice, but I'm also very concerned, because I'm fairly certain it isn't actually pulling the mouse gene lists at all; it seems to be pulling the human gene lists and renaming the genes to their mouse variants. There are numerous examples of gene lists that are unique to the human collections yet somehow appear in the pulled mouse gene lists.
I'll investigate and keep you updated.
I investigated this and there's no need for any change at the moment. pathfindR uses the call
msig_df <- msigdbr::msigdbr(species = species, category = collection, subcategory = subcollection)
and msigdbr itself rejects the mouse collection identifier:
> msig_df <- msigdbr::msigdbr(species = "Mus musculus", category = "MH", subcategory = NULL)
Error in msigdbr::msigdbr(species = "Mus musculus", category = "MH", subcategory = NULL) :
unknown category
This does return human gene sets, but it's best to contact the maintainers of msigdbr. Let me know if I can help further.
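One way to see which collection identifiers an installed msigdbr accepts is to look at its collections table before calling `msigdbr::msigdbr()`. The sketch below is only illustrative: `has_collection` is a hypothetical helper (not part of pathfindR or msigdbr), the toy table and its column names are made up, and it assumes the first column of the table returned by `msigdbr::msigdbr_collections()` holds the collection identifier, which may not hold for every msigdbr version.

```r
# Hypothetical helper (not part of pathfindR or msigdbr): does `id` appear
# among the collection identifiers in the first column of `collections`?
# Assumption: the first column holds the collection identifier.
has_collection <- function(collections, id) {
  id %in% unique(as.character(collections[[1]]))
}

# Toy table shaped roughly like a collections listing; the column names
# are illustrative and vary between msigdbr versions.
toy <- data.frame(
  collection = c("H", "C1", "C2", "C2"),
  subcollection = c("", "", "CGP", "CP"),
  stringsAsFactors = FALSE
)
has_collection(toy, "C2")
has_collection(toy, "MH")

# If msigdbr is installed, run the same check on its own table.
if (requireNamespace("msigdbr", quietly = TRUE)) {
  real <- tryCatch(msigdbr::msigdbr_collections(), error = function(e) NULL)
  if (!is.null(real)) {
    print(has_collection(real, "MH"))
  }
}
```

If the last check prints `FALSE`, the "unknown category" error above would be expected for that msigdbr version.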
Sorry, I'm a bit confused by this result. MSigDB does in fact have fully curated gene sets for mice that are not the same as the ones for humans. Using the human ones isn't acceptable if I'm going to publish my data in mice. I'm not as technically inclined in computer science as you are, but it looks as if you're saying that some connecting library or package that allows pathfindR to do what it does might be outdated and doesn't allow for collecting the mouse gene sets yet, and that you recognize it is collecting the human gene sets, not the mouse ones, as a result. The mouse gene sets aren't mouse equivalents of the human gene sets. So are you saying this is just an msigdbr package issue, where they haven't updated their own package for the new database yet? Or are you saying they don't trust the validity of their mouse gene sets enough to allow them to be collected yet?
I understand your frustration and agree with you that this is very "improper" for msigdbr. You can open another issue in the msigdbr repo, linked above, and raise your rightful concern. From my understanding, they just haven't had the resources to update the package to support the mouse-specific gene sets that are readily available on MSigDB.
Describe the bug
According to this vignette (https://cran.r-project.org/web/packages/pathfindR/vignettes/obtain_data.html), it is possible to bring in the mouse MSigDB gene lists for nonhuman use of pathfindR. The issue with the script recommended there is that it asks for a species identifier as well as a collection, but the human and mouse collections on MSigDB are distinct and have different names. All the mouse collections start with 'M', so giving it an 'H' or 'C' identifier, as suggested for humans, would supposedly pull the wrong gene lists. The obvious thing to do would be to pass the mouse collection identifier, but the function gives an error saying that only collections starting with 'H' or 'C' are accepted. It's therefore unclear whether this is based on an earlier MSigDB in which the collections maybe didn't have unique names, or whether it would successfully pull only the mouse gene lists even when given the 'H' or 'C' identifier. I would assume it's based on an older version of MSigDB, but only because it doesn't include C8 as a collection to pull from, suggesting C8 didn't exist in earlier versions. It would be a big shame for C8/M8 not to be allowed as a gene list, as it's one of the great newer resources for deconvoluting cell types in bulk RNA-seq.
What's also very strange is that there is no 'M7' for mice, while there is a 'C7' for humans. So if you tell it your species is mouse and ask it to collect 'C7', trusting it to pull the mouse version of that, it will in fact pull a unique gene list with mouse gene identifiers. I have no idea where it's getting this from, though, as MSigDB states there is no 'M7'.
Comparing collections that are shared between mouse and human, like C2/M2, I looked at the number of gene lists in the sets on the MSigDB site versus the number pulled by pathfindR. The pulled numbers are much closer to the number of gene lists in the human sets than in the mouse sets, so it looks like it might be converting human gene names to mouse variants rather than pulling the actual mouse gene sets. This suspicion is further supported by the pulled gene lists carrying descriptions that are found in the human sets but not in the mouse sets.
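The comparison described above could be made more systematic by checking gene set names directly. The following is a minimal base R sketch under stated assumptions: `compare_set_names` is a hypothetical helper, and the set names are invented toy values, not real MSigDB entries. In practice `pulled` would be the gene set names returned for the mouse query, and the two reference vectors would come from the human and mouse listings on MSigDB.

```r
# Hypothetical helper: summarise how the pulled gene set names overlap
# with the set names listed for the human and mouse databases.
compare_set_names <- function(pulled, human_ref, mouse_ref) {
  pulled <- unique(pulled)
  c(
    n_pulled = length(pulled),
    in_human = sum(pulled %in% human_ref),
    in_mouse = sum(pulled %in% mouse_ref),
    # Pulled sets that exist in the human listing but not the mouse one.
    human_only = length(setdiff(intersect(pulled, human_ref), mouse_ref))
  )
}

# Toy names, invented for illustration only.
human_ref <- c("SET_A", "SET_B", "SET_C")
mouse_ref <- c("SET_A", "SET_D")
pulled <- c("SET_A", "SET_B", "SET_C", "SET_C")

res <- compare_set_names(pulled, human_ref, mouse_ref)
print(res)
```

A large `human_only` count relative to `in_mouse` would be consistent with the suspicion that the pulled sets are human sets with converted gene names.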
To Reproduce
Steps to reproduce the behavior:
species = "Mus musculus",
collection = "MH")'
collection should be one of “H”, “C1”, “C2”, “C3”, “C4”, “C5”, “C6”, “C7”
Expected behavior
I would expect it to pull the mouse collection by giving it the mouse collection identifier.
Desktop (please complete the following information):
R Session Information:
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] fastcluster_1.2.3 corrplot_0.92 Hmisc_5.1-1 rgl_1.2.8
[5] biomaRt_2.58.0 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[9] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[13] ggplot2_3.4.4 tidyverse_2.0.0 dplyr_1.1.4 pathfindR_2.3.0
[17] pathfindR.data_2.0.0
loaded via a namespace (and not attached):
[1] rstudioapi_0.15.0 jsonlite_1.8.8 magrittr_2.0.3
[4] magick_2.8.2 modeltools_0.2-23 farver_2.1.1
[7] rmarkdown_2.25 zlibbioc_1.48.0 vctrs_0.6.5
[10] memoise_2.0.1 RCurl_1.98-1.13 base64enc_0.1-3
[13] htmltools_0.5.7 progress_1.2.3 curl_5.2.0
[16] broom_1.0.5 Formula_1.2-5 htmlwidgets_1.6.4
[19] cachem_1.0.8 igraph_1.6.0 lifecycle_1.0.4
[22] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.6-4
[25] R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.11
[28] digest_0.6.33 colorspace_2.1-0 AnnotationDbi_1.64.1
[31] S4Vectors_0.40.2 RSQLite_2.3.4 filelock_1.0.3
[34] labeling_0.4.3 fansi_1.0.6 timechange_0.2.0
[37] httr_1.4.7 polyclip_1.10-6 compiler_4.3.2
[40] bit64_4.0.5 withr_2.5.2 doParallel_1.0.17
[43] htmlTable_2.4.2 backports_1.4.1 viridis_0.6.4
[46] DBI_1.2.0 ggforce_0.4.1 R.utils_2.12.3
[49] MASS_7.3-60 rappdirs_0.3.3 tools_4.3.2
[52] foreign_0.8-85 prabclus_2.3-3 nnet_7.3-19
[55] R.oo_1.25.0 glue_1.6.2 grid_4.3.2
[58] checkmate_2.3.1 cluster_2.1.4 generics_0.1.3
[61] gtable_0.3.4 tzdb_0.4.0 R.methodsS3_1.8.2
[64] class_7.3-22 data.table_1.14.10 hms_1.1.3
[67] tidygraph_1.3.0 xml2_1.3.6 utf8_1.2.4
[70] XVector_0.42.0 flexmix_2.3-19 BiocGenerics_0.48.1
[73] ggrepel_0.9.4 foreach_1.5.2 pillar_1.9.0
[76] vroom_1.6.5 robustbase_0.99-1 tweenr_2.0.2
[79] BiocFileCache_2.10.1 lattice_0.22-5 bit_4.0.5
[82] tidyselect_1.2.0 Biostrings_2.70.1 knitr_1.45
[85] gridExtra_2.3 IRanges_2.36.0 stats4_4.3.2
[88] xfun_0.41 graphlayouts_1.0.2 Biobase_2.62.0
[91] diptest_0.77-0 DEoptimR_1.1-3 stringi_1.8.3
[94] evaluate_0.23 codetools_0.2-19 kernlab_0.9-32
[97] ggraph_2.1.0 BiocManager_1.30.22 cli_3.6.2
[100] rpart_4.1.21 munsell_0.5.0 Rsubread_2.16.0
[103] Rcpp_1.0.11 GenomeInfoDb_1.38.5 dbplyr_2.4.0
[106] png_0.1-8 XML_3.99-0.16 parallel_4.3.2
[109] blob_1.2.4 prettyunits_1.2.0 mclust_6.0.1
[112] bitops_1.0-7 viridisLite_0.4.2 scales_1.3.0
[115] crayon_1.5.2 fpc_2.2-11 rlang_1.1.2
[118] KEGGREST_1.42.0