Issue with procuring MSigDB gene lists for mice #192

Closed
t3h4nt1chr15t opened this issue Feb 8, 2024 · 7 comments
@t3h4nt1chr15t

t3h4nt1chr15t commented Feb 8, 2024

Describe the bug
According to the following vignette: https://cran.r-project.org/web/packages/pathfindR/vignettes/obtain_data.html, it is possible to bring in the mouse MSigDB gene lists for non-human use of pathfindR. The issue with the script recommended there is that it asks for a species identifier as well as a collection, while the human and mouse collections on MSigDB are distinctly different and have different names. All the mouse collections start with 'M', so giving it an 'H' or 'C' identifier, as suggested for humans, would presumably pull the wrong gene lists. The obvious thing to do would be to pass the mouse collection identifier, but the function returns an error saying that only collections starting with 'H' or 'C' are accepted. So it's unclear whether this is based on an earlier MSigDB version in which the collections perhaps didn't have unique names, or whether it would successfully pull only the mouse gene lists even when given the 'H' or 'C' identifier. I would assume it's based on an older version of MSigDB, but only because it doesn't include C8 as a collection to pull from, suggesting C8 didn't exist in earlier versions. It would be a big shame for C8/M8 not to be allowed as a gene list, as it's one of the great newer resources for deconvoluting cell types in bulk RNA-seq.

What's also very strange is that there is no 'M7' for mice, while there is a 'C7' for humans. So if you tell it your species is mouse and ask for 'C7', trusting it to pull the mouse version, it does in fact pull a unique gene list with mouse gene identifiers. I have no idea where it's getting this from, though, as MSigDB states there is no 'M7'.

Comparing collections that are shared between mouse and human, like C2/M2, the number of gene sets pulled by pathfindR is much closer to the number listed on the MSigDB site for the human collection than for the mouse collection. So it almost looks like it is converting human gene names to their mouse variants rather than pulling the actual mouse gene sets. This suspicion is further supported by the pulled gene set descriptions matching descriptions found in the human sets but not in the mouse sets.
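
A comparison like the one described above could be made explicit by checking set names rather than counts. The following is only a minimal sketch in base R: the set names are made-up placeholders, and it assumes the pulled names would come from something like `names(gsets_list$gene_sets)` and the official names from a mouse GMT file downloaded from MSigDB.

```r
# Placeholder set names (hypothetical, not real MSigDB content).
# In practice: pulled_names from the list returned by get_gene_sets_list(),
# official_mouse_names from the first column of an official mouse GMT file.
pulled_names <- c("SET_A", "SET_B", "SET_C")
official_mouse_names <- c("SET_A", "SET_C")

# Sets that were pulled but do not exist in the official mouse collection
human_only <- setdiff(pulled_names, official_mouse_names)

# Fraction of pulled sets that are present in the official mouse collection
overlap_fraction <- mean(pulled_names %in% official_mouse_names)

print(human_only)
print(overlap_fraction)
```

A large `human_only` set would be consistent with the suspicion that human sets are being mapped to mouse genes, though it would not by itself prove it.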

To Reproduce
Steps to reproduce the behavior:

  1. Run the following function:

    gsets_list <- get_gene_sets_list(source = "MSigDB",
                                     species = "Mus musculus",
                                     collection = "MH")

  2. See the error:

    Error in get_mgsigdb_gsets(species = species, collection = collection, :
      collection should be one of “H”, “C1”, “C2”, “C3”, “C4”, “C5”, “C6”, “C7”

Expected behavior
I would expect it to pull the mouse collection by giving it the mouse collection identifier.

Desktop:

  • OS: Windows 10

R Session Information:
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] fastcluster_1.2.3 corrplot_0.92 Hmisc_5.1-1 rgl_1.2.8
[5] biomaRt_2.58.0 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[9] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[13] ggplot2_3.4.4 tidyverse_2.0.0 dplyr_1.1.4 pathfindR_2.3.0
[17] pathfindR.data_2.0.0

loaded via a namespace (and not attached):
[1] rstudioapi_0.15.0 jsonlite_1.8.8 magrittr_2.0.3
[4] magick_2.8.2 modeltools_0.2-23 farver_2.1.1
[7] rmarkdown_2.25 zlibbioc_1.48.0 vctrs_0.6.5
[10] memoise_2.0.1 RCurl_1.98-1.13 base64enc_0.1-3
[13] htmltools_0.5.7 progress_1.2.3 curl_5.2.0
[16] broom_1.0.5 Formula_1.2-5 htmlwidgets_1.6.4
[19] cachem_1.0.8 igraph_1.6.0 lifecycle_1.0.4
[22] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.6-4
[25] R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.11
[28] digest_0.6.33 colorspace_2.1-0 AnnotationDbi_1.64.1
[31] S4Vectors_0.40.2 RSQLite_2.3.4 filelock_1.0.3
[34] labeling_0.4.3 fansi_1.0.6 timechange_0.2.0
[37] httr_1.4.7 polyclip_1.10-6 compiler_4.3.2
[40] bit64_4.0.5 withr_2.5.2 doParallel_1.0.17
[43] htmlTable_2.4.2 backports_1.4.1 viridis_0.6.4
[46] DBI_1.2.0 ggforce_0.4.1 R.utils_2.12.3
[49] MASS_7.3-60 rappdirs_0.3.3 tools_4.3.2
[52] foreign_0.8-85 prabclus_2.3-3 nnet_7.3-19
[55] R.oo_1.25.0 glue_1.6.2 grid_4.3.2
[58] checkmate_2.3.1 cluster_2.1.4 generics_0.1.3
[61] gtable_0.3.4 tzdb_0.4.0 R.methodsS3_1.8.2
[64] class_7.3-22 data.table_1.14.10 hms_1.1.3
[67] tidygraph_1.3.0 xml2_1.3.6 utf8_1.2.4
[70] XVector_0.42.0 flexmix_2.3-19 BiocGenerics_0.48.1
[73] ggrepel_0.9.4 foreach_1.5.2 pillar_1.9.0
[76] vroom_1.6.5 robustbase_0.99-1 tweenr_2.0.2
[79] BiocFileCache_2.10.1 lattice_0.22-5 bit_4.0.5
[82] tidyselect_1.2.0 Biostrings_2.70.1 knitr_1.45
[85] gridExtra_2.3 IRanges_2.36.0 stats4_4.3.2
[88] xfun_0.41 graphlayouts_1.0.2 Biobase_2.62.0
[91] diptest_0.77-0 DEoptimR_1.1-3 stringi_1.8.3
[94] evaluate_0.23 codetools_0.2-19 kernlab_0.9-32
[97] ggraph_2.1.0 BiocManager_1.30.22 cli_3.6.2
[100] rpart_4.1.21 munsell_0.5.0 Rsubread_2.16.0
[103] Rcpp_1.0.11 GenomeInfoDb_1.38.5 dbplyr_2.4.0
[106] png_0.1-8 XML_3.99-0.16 parallel_4.3.2
[109] blob_1.2.4 prettyunits_1.2.0 mclust_6.0.1
[112] bitops_1.0-7 viridisLite_0.4.2 scales_1.3.0
[115] crayon_1.5.2 fpc_2.2-11 rlang_1.1.2
[118] KEGGREST_1.42.0

@egeulgen egeulgen self-assigned this Feb 8, 2024
@egeulgen
Owner

egeulgen commented Feb 8, 2024

Thank you for raising this! It seems that the function is too strict in validating its input. I'll try to revise the behaviour.

@t3h4nt1chr15t
Author

t3h4nt1chr15t commented Feb 8, 2024

Opening that up would be nice, but I'm also very concerned that it isn't actually pulling the mouse gene lists at all, and is instead pulling the human gene lists and renaming the genes to their mouse variants. There are numerous examples of gene lists that are unique to the human collections yet somehow appear in the pulled mouse gene lists.

@egeulgen
Owner

egeulgen commented Feb 8, 2024

I'll investigate and keep you updated

@egeulgen
Owner

egeulgen commented Feb 10, 2024

I investigated this and there's no need for any change at the moment. pathfindR uses the msigdbr R package internally (and further processes its output):

    msig_df <- msigdbr::msigdbr(species = species, category = collection, subcategory = subcollection)

msigdbr expects categories as "H" etc:

>> msig_df <- msigdbr::msigdbr(species = "Mus musculus", category = "MH", subcategory = NULL)
Error in msigdbr::msigdbr(species = "Mus musculus", category = "MH", subcategory = NULL) : 
  unknown category

With the human collection identifiers, this does return human gene sets, so it is best to contact the maintainers of msigdbr about it. My thinking is that they do not yet have support for the mouse-specific gene sets (and simply provide mouse-equivalent genes for the human ones).

let me know if I can help further.

@egeulgen egeulgen removed the bug label Feb 10, 2024
@egeulgen
Owner

see igordot/msigdbr#32

@t3h4nt1chr15t
Author

Sorry, I'm a bit confused by this result. MSigDB does in fact have fully curated gene sets for mice that are not the same as they are for humans. Using the ones for humans isn't an acceptable thing to do if I'm going to publish my data in mice.

I'm not as technically inclined in computer science as you are, but it looks as if you're saying that a library or package that pathfindR depends on might be outdated and doesn't yet allow the mouse gene sets to be collected. As a result, you recognize that it is collecting the human gene sets, not the mouse gene sets.

The mouse gene sets aren't mouse equivalents of the human genes. So are you saying this is just a msigdbr package issue where they haven't updated their own package for the new database yet? Or are you saying they don't lend enough trust to the validity of their mouse gene sets to allow them to be collected yet?

@egeulgen
Owner

I understand your frustration and agree with you that this is very "improper" on msigdbr's part. You can open another issue in the msigdbr repo, linked above, and raise your rightful concern. From my understanding, they just haven't had the resources to update the package to support the mouse-specific gene sets that are readily available on MSigDB.
Sorry I couldn't help further, but this falls outside the scope of the main responsibilities of maintaining pathfindR.
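
Until the mouse-specific collections are supported upstream, one possible workaround is to download a mouse GMT file directly from MSigDB and parse it into the gene-set and description lists that pathfindR accepts as custom input. The following is only a sketch in base R, not something confirmed in this thread: `read_gmt` is a hypothetical helper, and the demonstration file contains made-up placeholder sets rather than real MSigDB content. It assumes the usual GMT layout of one set per line, tab-separated as set name, description, then member genes.

```r
# Hypothetical helper: parse a GMT file into named lists.
# Assumed GMT layout per line: set name <TAB> description <TAB> gene1 <TAB> gene2 ...
read_gmt <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(lines)]                       # drop empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)

  set_names <- vapply(fields, function(x) x[1], character(1))
  descriptions <- vapply(fields, function(x) x[2], character(1))
  genes <- lapply(fields, function(x) unique(x[-(1:2)]))

  names(genes) <- set_names
  names(descriptions) <- set_names
  list(gene_sets = genes, descriptions = descriptions)
}

# Demonstration on a tiny made-up GMT file (placeholder names only)
gmt_path <- tempfile(fileext = ".gmt")
writeLines(c(
  "SET_A\tdescription A\tGene1\tGene2\tGene3",
  "SET_B\tdescription B\tGene2\tGene4"
), gmt_path)

mouse_sets <- read_gmt(gmt_path)
str(mouse_sets)
```

The resulting lists could then presumably be supplied to `run_pathfindR()` with `gene_sets = "Custom"`, `custom_genes = mouse_sets$gene_sets` and `custom_descriptions = mouse_sets$descriptions`, together with a mouse protein interaction network. This has not been tested here.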
