bdbag downloaded from portal for superset_collection including files, samples and subjects of subset_collection #356

Open
nsuvarnaiari opened this issue May 25, 2022 · 2 comments

@nsuvarnaiari
Contributor

Hi Deriva team,

Question:
I have a "superset_collection" containing X "subset_collections". The "xxxx_in_collection.tsv" files are filled in for each subset collection, and the superset_collection<->subset_collection links are recorded in "collection_in_collection.tsv". If I download the bdbag for "superset_collection", will it include the files, subjects and samples associated with each "subset_collection"?

Karl thinks files, subjects and samples from "subset_collections" will not be included in the bdbag for "superset_collection". He thinks this could be fixed so that the export dumps the transitive closure of the collection plus all collections subordinate to it via collection_in_collection.
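To make the proposed behavior concrete, here is a minimal sketch of what "transitive closure via collection_in_collection" would compute. It is not how the portal works today. The function names are hypothetical, and the association tables are assumed to be already loaded as simple pairs of IDs (real C2M2 tables key each entity on an id_namespace plus a local_id).

```python
from collections import defaultdict, deque


def collection_closure(root, collection_in_collection):
    """Return `root` plus every collection reachable through subset links.

    `collection_in_collection` is an iterable of (superset_id, subset_id)
    pairs. This flat-pair layout is an assumption made for the sketch.
    """
    children = defaultdict(set)
    for superset_id, subset_id in collection_in_collection:
        children[superset_id].add(subset_id)

    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:  # also guards against cycles
                seen.add(child)
                queue.append(child)
    return seen


def files_in_closure(root, collection_in_collection, file_in_collection):
    """Files attached to `root` or to any of its subordinate collections.

    `file_in_collection` is an iterable of (file_id, collection_id) pairs.
    """
    closure = collection_closure(root, collection_in_collection)
    return {file_id for file_id, coll_id in file_in_collection if coll_id in closure}
```

The same lookup would apply to biosample_in_collection and subject_in_collection.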

@RLC-DCPPC @lliming @karlcz bringing this issue to your notice for future discussion.

Thanks,
Suvvi

@karlcz
Contributor

karlcz commented Jun 14, 2022

Hmm, this didn't get hooked into the project planning and will not be addressed in the upcoming release.

Also, thinking about this a little more, it is unfortunately pretty complicated and nuanced. I think we will need further discussion to see if we can find consensus on export mode(s) that are of general use. I do not know right now which user expectations can be met and/or which export modes are easiest to explain.

But, I think it is infeasible to say that we will walk transitive closures of the many paths in C2M2 because it would often distort a filtered export back into a much larger set of items due to all the interconnectivity. If too many paths effectively mean "full export" I think we might as well just offer a canonical full dump BDBag for those who want to spelunk all the data, while keeping a much slimmer, narrower export mode for dynamic filters so that people can ask for brief subsets directly focused on their search criteria...
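A toy illustration of that concern, with made-up data: once every association may be walked in either direction, a search that matched one file ends up pulling in everything connected to it. The graph and the `reachable` helper are hypothetical and only meant to show the effect.

```python
from collections import deque


def reachable(seeds, edges):
    """All nodes reachable from `seeds` when every edge is walked both ways."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Made-up catalog: two biosamples share a subject, two files share a collection.
edges = [
    ("file:f1", "biosample:b1"),
    ("file:f2", "biosample:b2"),
    ("biosample:b1", "subject:s1"),
    ("biosample:b2", "subject:s1"),
    ("file:f1", "collection:c1"),
    ("file:f3", "collection:c1"),
]

filtered = {"file:f2"}             # what the user actually searched for
expanded = reachable(filtered, edges)
print(len(filtered), "->", len(expanded), "items")
```

Here the single matched file expands to all seven items in the catalog, including file f3, which is related only through a chain of file, biosample, subject, biosample, file, collection.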

@RLC-DCPPC @lliming @abradyIGS @mikedarcy

@karlcz
Contributor

karlcz commented Jun 14, 2022

As a general rule right now, the exports have a focus on the central table from which the user activates the export option.

  1. The central table should have only the C2M2 entities matched by their search criteria, or the single entity if they exported from a single record (detail) page.
  2. Other tables are brought in via a connection/relevance to the entities exported in the central table. We can only follow one "path" for each export table, so we have chosen some reasonable heuristics for the most significant path. (See below)
  3. Sometimes the extra connected tables may dump a superset: we exploit some path through the portal model which we know brings in all the relevant values but which might also bring in irrelevant ones. For example, we might dump some vocabulary terms even though they are not actually referenced by the core entities in the user query.
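Rules 1 and 2 can be sketched as follows. This is an illustration, not the portal's export code: `export_via_path` is a hypothetical helper, and the tables are assumed to be in memory as plain Python structures.

```python
def export_via_path(central_ids, association, related_rows):
    """Select rows of one connected table by following ONE association path.

    central_ids  : set of IDs exported in the central table (rule 1)
    association  : iterable of (central_id, related_id) pairs, i.e. the single
                   path chosen for this export table (rule 2)
    related_rows : dict mapping related_id -> row for the connected table
    """
    related_ids = {rid for cid, rid in association if cid in central_ids}
    return {rid: related_rows[rid] for rid in sorted(related_ids) if rid in related_rows}


# Made-up example with File as the central table and biosample reached
# through file_describes_biosample.
matched_files = {"f1", "f3"}
file_describes_biosample = [("f1", "b1"), ("f2", "b2"), ("f3", "b3")]
biosample = {"b1": {"anatomy": "liver"}, "b2": {"anatomy": "lung"}, "b3": {"anatomy": "heart"}}

biosample_export = export_via_path(matched_files, file_describes_biosample, biosample)
```

Biosample b2 is left out because no matched file points to it. Any other route to a biosample (for example via a shared collection) is ignored, which is what following only one path per export table means.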

Export paths by focus

This is a summary of the export modes in the portal as of 2022-06. Each subsection is named by the central focus table that the user is viewing when they activate an export. The list of exported paths describes what content is exported.

Collection

  1. collection.csv: the exact collections matched by the search
  2. file.csv: all files associated to (1) by file_in_collection records
  3. biosample.csv: all biosamples associated to (1) by biosample_in_collection records
  4. subject.csv: all subjects associated to (1) by subject_in_collection records
  5. file_format.csv, data_type.csv, assay_type.csv, anatomy.csv, disease.csv, phenotype.csv, gene.csv, substance.csv, compound.csv, protein.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
  6. collection_disease.csv, collection_phenotype.csv, collection_gene.csv, collection_compound.csv, collection_substance.csv, collection_taxonomy.csv, collection_anatomy.csv, collection_protein.csv: all associations referencing (1)
  7. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (3)
  8. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (4)
  9. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
  10. project_in_project.csv: all associations where child-project is in (9)
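Items (9) and (10) describe a walk up the project hierarchy. A minimal sketch, assuming project_in_project is available as (parent_id, child_id) pairs; the function names and the pair layout are hypothetical.

```python
def project_ancestors(linked_projects, project_in_project):
    """Reflexive, transitive closure of ancestors for the directly linked projects."""
    parents = {}
    for parent_id, child_id in project_in_project:
        parents.setdefault(child_id, set()).add(parent_id)

    closure = set(linked_projects)  # reflexive: the linked projects themselves
    stack = list(linked_projects)
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in closure:
                closure.add(parent)
                stack.append(parent)
    return closure


def project_in_project_rows(closure, project_in_project):
    """Associations whose child project is in the closure, as in item (10)."""
    return [(p, c) for p, c in project_in_project if c in closure]
```

Unlike a closure over collections and their members, this walk only moves upward toward the root project, so it stays small: sibling projects and their descendants are never pulled in.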

Notable gaps:

  • file_describes_biosample and file_describes_subject do not seem to be dumped at all, inconsistently with biosample_from_subject
  • biosample_from_subject.csv might reference subjects which are not included in subject.csv since the latter is dumped via the subject_in_collection path
  • file_in_collection, biosample_in_collection, subject_in_collection, and collection_in_collection are not dumped at all
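The second gap can be checked on a downloaded bag. The sketch below looks for subjects referenced by biosample_from_subject.csv that are missing from subject.csv. It assumes the exported files use C2M2-style column names (id_namespace and local_id on subject; subject_id_namespace and subject_local_id on the association); adjust these if the portal export names them differently. The helper name and the sample data are made up.

```python
import csv
import io


def dangling_subject_refs(biosample_from_subject_rows, subject_rows):
    """Subject keys referenced by the association but absent from subject.csv."""
    known = {(row["id_namespace"], row["local_id"]) for row in subject_rows}
    referenced = {
        (row["subject_id_namespace"], row["subject_local_id"])
        for row in biosample_from_subject_rows
    }
    return sorted(referenced - known)


# Tiny in-memory stand-ins for the two exported files (made-up data).
subject_csv = "id_namespace,local_id\nns,s1\n"
biosample_from_subject_csv = (
    "biosample_id_namespace,biosample_local_id,subject_id_namespace,subject_local_id\n"
    "ns,b1,ns,s1\n"
    "ns,b2,ns,s2\n"
)

subjects = list(csv.DictReader(io.StringIO(subject_csv)))
links = list(csv.DictReader(io.StringIO(biosample_from_subject_csv)))
missing = dangling_subject_refs(links, subjects)
```

With real files, replace the `io.StringIO(...)` objects with files opened using `open(path, newline="")`.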

File

  1. file.csv: the exact files matched by the user search
  2. biosample.csv: all biosamples linked to (1) by direct file_describes_biosample associations
  3. subject.csv: all subjects linked to (1) by direct file_describes_subject associations
  4. file_format.csv, data_type.csv, assay_type.csv, anatomy.csv, disease.csv, gene.csv, substance.csv, compound.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
  5. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (2)
  6. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (3)
  7. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
  8. project_in_project.csv: all associations where child-project is in (7)

Notable gaps:

  • by design, collection and collection-level associations are not dumped at all
  • file_describes_biosample and file_describes_subject do not seem to be dumped at all, inconsistently with biosample_from_subject
  • biosample_from_subject.csv might reference subjects which are not included in subject.csv since the latter is dumped via the file_describes_subject path

Biosample

  1. biosample.csv: the exact biosamples matched by the user search
  2. subject.csv: all subjects linked to (1) by direct biosample_from_subject associations
  3. assay_type.csv, anatomy.csv, disease.csv, gene.csv, substance.csv, compound.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
  4. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (1)
  5. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (2)
  6. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
  7. project_in_project.csv: all associations where child-project is in (6)

Notable gaps:

  • by design, collection and collection-level associations are not dumped at all
  • by design, file and file-level associations are not dumped at all

Subject

  1. subject.csv: the exact subjects matched by the user search
  2. disease.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact search classes referenced by (1)
  3. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (1)
  4. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
  5. project_in_project.csv: all associations where child-project is in (4)

Notable gaps:

  • by design, collection and collection-level associations are not dumped at all
  • by design, file and file-level associations are not dumped at all
  • by design, biosample and biosample-level associations are not dumped at all
  • substance is not dumped even though subject_substance is!
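Gaps like the last one can be found mechanically by comparing the association files in an export against the vocabulary files they would need. In this sketch the association-to-vocabulary mapping is an assumption inferred from the file names, and `missing_vocab_files` is a hypothetical helper.

```python
# Assumed mapping from association file to the vocabulary files it references.
ASSOCIATION_VOCAB = {
    "subject_role_taxonomy.csv": ["subject_role.csv", "ncbi_taxonomy.csv"],
    "subject_race.csv": ["race.csv"],
    "subject_substance.csv": ["substance.csv"],
    "subject_disease.csv": ["disease.csv"],
    "subject_phenotype.csv": ["phenotype.csv"],
}


def missing_vocab_files(exported_files, association_vocab=ASSOCIATION_VOCAB):
    """Vocabulary files needed by an exported association but not exported."""
    exported = set(exported_files)
    missing = set()
    for assoc, vocabs in association_vocab.items():
        if assoc in exported:
            missing.update(v for v in vocabs if v not in exported)
    return sorted(missing)


# File list of the Subject export, copied from items 1-5 above.
subject_export = [
    "subject.csv",
    "disease.csv", "subject_granularity.csv", "subject_role.csv",
    "ncbi_taxonomy.csv", "sex.csv", "race.csv", "ethnicity.csv",
    "subject_role_taxonomy.csv", "subject_race.csv", "subject_substance.csv",
    "subject_disease.csv", "subject_phenotype.csv",
    "project.csv", "project_in_project.csv",
]

gaps = missing_vocab_files(subject_export)
```

Run against the Subject list as written above, this flags substance.csv and also phenotype.csv, since item 2 lists neither while item 3 includes subject_phenotype.csv. Whether the phenotype gap exists in the portal or only in this summary would need checking.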
