
Review and comments data loaders #109

Closed
fkiraly opened this issue Oct 30, 2024 · 3 comments
Labels: enhancement (New feature or request), modeldev (Developing modeling pipelines for meal annotation task)

Comments


fkiraly commented Oct 30, 2024

Short review of the data loaders, with the aim of ensuring easy usability by other project members and maintainability.

Referring to the current state of 1.01-cjr-change-point-index-creator.ipynb, top cell.

Main comments:

  • first, and importantly, I would move the data loading code out of the Jupyter notebook and into py files, possibly somewhere in the data folder. I would also organize the repository so that all data concerns are separate, possibly also separating processed data from raw data. (More generally, raw data above a certain total size, perhaps 10 MB, should not be in GitHub repos unless they are dedicated data repos, as it clutters the repo.)
  • the Jupyter notebook would then just import from the "mini-package".
  • once the data loader is in the py file, I would strongly recommend refactoring it. At the moment it is a monolithic end-to-end function with subroutines inside. I would suggest a design that separates: (a) switching between file and in-memory, and (b) the processing pipeline; the processing is also a loop over files, so I would make it "process a single file". More precisely, I would suggest splitting things as follows:
    • raw data loader: file to memory, for raw data to in-memory repr of raw data
    • processed data loader: file to memory, for processed data
    • processed data saver: memory to file, for processed data
    • data processing: memory in, memory out, for a single file
    • primary user facing routines with syntactic sugar:
      • processed data loader with sugar: if processed files not present, runs everything until those are created, and makes default choices for locations. Also loops over multiple files. Optionally, one could force rerun and do checksums.
      • raw data loader with sugar
      • data processing for multiple files, with sugar. Could be the same method as above, but it still might make sense to add the sugar in a second step.
  • finally, the different methods should have correct docstrings. The docstrings should reference the data dictionaries (see Review and comments data dictionary #108), so users know what format they can expect the data in.
  • optional, but recommended best practice: tests for the loaders. There are already test files in the repo, so we could test against them via pytest, pydantic, or similar.
RobotPsychologist (Owner) commented:
Thanks for the feedback @fkiraly!

  1. My intention was to move the function, in some form, into this file here once I had demonstrated it was working in the notebook: 0_meal_identification/meal_identification/meal_identification/dataset.py

So, instead of having a dataset.py script for each project, should I have one for all projects in the bg_control/data folder? I was initially planning to have the script with classes and functions available in 0_meal_identification/meal_identification/meal_identification/dataset.py (with a similar file for each project), and the data was going to be stored in:

  • 0_meal_identification/meal_identification/data/raw
  • 0_meal_identification/meal_identification/data/processed
  • 0_meal_identification/meal_identification/data/interim
  • 0_meal_identification/meal_identification/data/external
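
For what it's worth, the per-project layout listed above could be captured in one place in each project's dataset.py, so loaders never hard-code paths. The `PROJECT_ROOT` constant and `DATA_DIRS` name here are hypothetical illustrations, not existing project code:

```python
from pathlib import Path

# Hypothetical: root of one project, relative to the repo checkout.
PROJECT_ROOT = Path("0_meal_identification/meal_identification")

# One entry per data stage listed above: raw, interim, processed, external.
DATA_DIRS = {
    stage: PROJECT_ROOT / "data" / stage
    for stage in ("raw", "interim", "processed", "external")
}
```

Loaders would then take a stage name and file name, keeping the directory convention in a single dictionary per project.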

RobotPsychologist (Owner) commented:

@andytubeee if you're interested.

RobotPsychologist (Owner) commented:

@andytubeee @Phiruby @Tony911029 please check if there is anything that Franz mentioned here that might also be included in the data work as an enhancement.
