Micronutrient Information Center (MIC) Ingest

| Documentation |

Ingest for Micronutrient Information Center using OntoGPT

Requirements

Data Sources

The Micronutrient Information Center (MIC) Ingest differs from our other standard modular ingests in that the data source is not a set of simple flat files downloaded from an authority. Instead, the information for the MIC ingest comes from the Linus Pauling Institute website (https://lpi.oregonstate.edu/), specifically its Micronutrient Information Center (https://lpi.oregonstate.edu/mic). We intend to use OntoGPT to scrape the content of the site and assemble it into rows of data. We then hope to use our existing Koza system and GitHub release infrastructure to import this resource as nodes and edges into the Monarch KG.

Source Files

As mentioned above, the source files for this ingest will be generated by scraping with OntoGPT rather than downloaded directly from a source. OntoGPT will use the site map to crawl through all of the MIC pages and generate rows of output.
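As a rough illustration of the site map step, the sketch below lists MIC page URLs from a sitemap document that follows the sitemaps.org protocol. It is not part of OntoGPT or this repository: the function name `mic_page_urls` and the sample XML are hypothetical, and the real site map may be structured differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Base URL of the Micronutrient Information Center, as given in this README.
MIC_PREFIX = "https://lpi.oregonstate.edu/mic"

# Hypothetical excerpt for illustration only; not the site's real site map.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://lpi.oregonstate.edu/mic/vitamins/vitamin-C</loc></url>
  <url><loc>https://lpi.oregonstate.edu/mic/minerals/calcium</loc></url>
  <url><loc>https://lpi.oregonstate.edu/about</loc></url>
</urlset>
""".strip()


def mic_page_urls(sitemap_xml: str, prefix: str = MIC_PREFIX) -> list[str]:
    """Return the <loc> URLs in a sitemap that start with the given prefix."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iterfind("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url.startswith(prefix):
            urls.append(url)
    return urls


if __name__ == "__main__":
    for url in mic_page_urls(SAMPLE_SITEMAP):
        print(url)
```

Each URL returned this way could then be passed to the extraction step one page at a time.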

OntoGPT v1.0.10 (the latest version at the time of writing) includes a template for extracting the following relation types from MIC pages:

  • Nutrient to disease
  • Nutrient to phenotype
  • Nutrient to biological process
  • Nutrient to health status of a body part or system (like "calcium supports healthy bones")
  • Nutrient to food source
  • Nutrient to nutrient

A couple of caveats:

  • The relations are not yet grounded to RO or Biolink types. This will require some discussion to identify appropriate mappings.
  • References for each claim are extracted, though only as a list of their numerical identifiers in the page's reference list. Other approaches introduced too much hallucination, or the LLM refused to parse more than a fraction of the reference list. This could be solved with some minimal scraping.
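To make the second caveat concrete, here is a minimal sketch of how the extracted reference numbers could be joined to a reference list obtained by separate scraping. The class `ExtractedClaim`, its field names, and the relation label are hypothetical and do not reflect OntoGPT's template schema; the reference text is a placeholder, not a real citation.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedClaim:
    """One extracted statement; all field names here are hypothetical."""
    nutrient: str
    relation: str
    target: str
    reference_numbers: list[int] = field(default_factory=list)


def resolve_references(
    claim: ExtractedClaim, reference_list: dict[int, str]
) -> tuple[list[str], list[int]]:
    """Split a claim's reference numbers into resolved entries and missing numbers."""
    resolved: list[str] = []
    missing: list[int] = []
    for number in claim.reference_numbers:
        if number in reference_list:
            resolved.append(reference_list[number])
        else:
            missing.append(number)
    return resolved, missing


if __name__ == "__main__":
    # Assumes the page's reference list was scraped into a dict keyed by number.
    scraped_references = {3: "placeholder text for reference 3"}
    claim = ExtractedClaim(
        nutrient="calcium",
        relation="nutrient_to_health_status",  # hypothetical label
        target="healthy bones",
        reference_numbers=[3, 7],
    )
    resolved, missing = resolve_references(claim, scraped_references)
    print(resolved, missing)
```

Reporting the missing numbers separately makes it easy to spot pages where the scraped reference list and the extracted identifiers disagree.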

Nodes and Edges -- Not Complete

Use this section to describe the nodes and edges generated from the ingest, for instance:

  • Gene Nodes - Description of which nodes are created and what data may be excluded from the ingest.
  • Gene → Disease - Similar description of the edges and which edges are created or how the data may be filtered.

Transform Code and Configuration

Metadata for the ingest is in the metadata.yaml file and may require some adjustment depending on your configuration. Data files and locations are listed in the download.yaml file, which is used to download all of the data sources before the transform. The transform.yaml file and the Python file transform.py contain the configuration and transformation code, respectively.
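For orientation, the sketch below shows the kind of per-row logic a transform might contain, written as plain Python rather than against the Koza API. Everything in it is an assumption: the column names `subject_id` and `object_id`, the `EXAMPLE:` identifiers, and the use of `biolink:related_to` as a stand-in predicate, since the relation mappings for this ingest have not been decided.

```python
# Stand-in predicate only; the actual Biolink mappings are still undecided.
PLACEHOLDER_PREDICATE = "biolink:related_to"

# Hypothetical column names for an extracted row.
REQUIRED_COLUMNS = ("subject_id", "object_id")


def row_to_edge(row: dict[str, str]) -> dict[str, str]:
    """Turn one extracted row into a simple edge record."""
    missing = [name for name in REQUIRED_COLUMNS if not row.get(name)]
    if missing:
        raise ValueError(f"row is missing required columns: {missing}")
    return {
        "subject": row["subject_id"],
        "predicate": PLACEHOLDER_PREDICATE,
        "object": row["object_id"],
        "category": "biolink:Association",
    }


if __name__ == "__main__":
    # Identifiers are illustrative placeholders, not real ontology terms.
    example_row = {"subject_id": "EXAMPLE:nutrient1", "object_id": "EXAMPLE:disease1"}
    print(row_to_edge(example_row))
```

Rejecting rows with missing identifiers up front keeps incomplete extractions out of the edge output instead of silently emitting partial edges.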

For more information, see the Koza documentation and kghub-downloader.

Dependencies are listed in the pyproject.toml file. This project uses pytest for development testing; the tests live in the tests directory and check the functionality of the transform.

Documentation

The documentation for this ingest is in this README.md file and additional documentation is in the docs directory.

Note: After the GitHub Actions workflow for deploying documentation runs, the documentation is automatically deployed to GitHub Pages.

GitHub Actions

This project is set up with several GitHub Actions workflows. You should not need to modify these workflows unless you want to change the behavior. The workflows are located in the .github/workflows directory:

  • test.yaml: Run the pytest suite.
  • create-release.yaml: Create a new release once a week, or manually.
  • deploy-docs.yaml: Deploy the documentation to GitHub Pages (on pushes to main).
  • update-docs.yaml: After a release, update the documentation with node/edge reports.

Installation

cd mic-ingest
make install
# or
poetry install

Note that the make install command is just a convenience wrapper around poetry install.

Once installed, you can check that everything is working as expected:

# Run the pytest suite
make test
# Download the data and run the Koza transform
make download
make run

Usage

This project is set up with a Makefile for common tasks.
To see available options:

make help

Download and Transform

Download the data for the mic_ingest transform:

poetry run mic_ingest download

To run the Koza transform for mic-ingest:

poetry run mic_ingest transform

To see available options:

poetry run mic_ingest download --help
# or
poetry run mic_ingest transform --help

Testing

To run the test suite:

make test

This project was generated using monarch-initiative/cookiecutter-monarch-ingest.
Keep this project up to date with cruft by occasionally running the following in the project directory:

cruft update

For more information, see the cruft documentation.