
Review and comments data loaders #109

Closed
fkiraly opened this issue Oct 30, 2024 · 3 comments
Labels: enhancement (New feature or request), modeldev (Developing modeling pipelines for meal annotation task)

Comments


fkiraly commented Oct 30, 2024

Short review of the data loaders, with the aim of ensuring easy usability by other project members and maintainability.

Referring to the current state of 1.01-cjr-change-point-index-creator.ipynb, top cell.

Main comments:

  • first, and importantly, I would move the data loading code out of the Jupyter notebook and into py files, possibly somewhere in the data folder. I would also organize the repository so that all data concerns are separate, possibly also separating processed data from raw data. (More generally, raw data above a certain total size, perhaps 10 MB, should not be in GitHub repos unless they are dedicated data repos, as it clutters the repo.)
  • the Jupyter notebook would then just import from the "mini-package".
  • once the data loader is in the py file, I would strongly recommend refactoring it. At the moment it is a monolithic end-to-end function with subroutines inside. I would suggest a design that separates: (a) switching between file and in-memory, and (b) the processing pipeline; the processing is also a loop over files, so I would make it "process a single file". More precisely, I would suggest splitting things as follows:
    • raw data loader: file to memory, for raw data to in-memory repr of raw data
    • processed data loader: file to memory, for processed data
    • processed data saver: memory to file, for processed data
    • data processing: memory in, memory out, for a single file
    • primary user facing routines with syntactic sugar:
      • processed data loader with sugar: if processed files not present, runs everything until those are created, and makes default choices for locations. Also loops over multiple files. Optionally, one could force rerun and do checksums.
      • raw data loader with sugar
      • data processing for multiple files, with sugar. Could be the same method as above, but it still might make sense to add the sugar in a second step.
  • finally, the different methods should have correct docstrings. The docstrings should reference the data dictionaries (see Review and comments data dictionary #108), so users know what format they can expect the data in.
  • optional, but recommended best practice: tests for the loaders. There are already test files in the repo, so we could test against them via pytest, pydantic, or similar.
RobotPsychologist (Owner) commented:
Thanks for the feedback @fkiraly!

  1. My intention was to move the function, in some form, into this file here once I had demonstrated it was working in the notebook: 0_meal_identification/meal_identification/meal_identification/dataset.py

So, instead of having a dataset.py script for each project, should I have one for all projects in the bg_control/data folder? I was initially planning to have the script with classes and functions available in 0_meal_identification/meal_identification/meal_identification/dataset.py (with a similar file for each project), and the data was going to be stored in:

  • 0_meal_identification/meal_identification/data/raw
  • 0_meal_identification/meal_identification/data/processed
  • 0_meal_identification/meal_identification/data/interim
  • 0_meal_identification/meal_identification/data/external
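
For what it's worth, the per-project layout listed above could be captured in one place in each project's dataset.py, so loaders never hard-code paths. The `PROJECT_ROOT` constant and `DATA_DIRS` name here are hypothetical illustrations, not existing project code:

```python
from pathlib import Path

# Hypothetical: root of one project, relative to the repo checkout.
PROJECT_ROOT = Path("0_meal_identification/meal_identification")

# One entry per data stage listed above: raw, interim, processed, external.
DATA_DIRS = {
    stage: PROJECT_ROOT / "data" / stage
    for stage in ("raw", "interim", "processed", "external")
}
```

Loaders would then take a stage name and file name, keeping the directory convention in a single dictionary per project.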

RobotPsychologist (Owner) commented:

@andytubeee if you're interested.

RobotPsychologist (Owner) commented:

@andytubeee @Phiruby @Tony911029 please check if there is anything that Franz mentioned here that might also be included in the data work as an enhancement.
