Short review of the data loaders, with the aim of ensuring easy usability by other project members and maintainability.
Referring to the current state of `1.01-cjr-change-point-index-creator.ipynb`, top cell.
Main comments:
- First, and importantly, I would move the data loading code out of the Jupyter notebook and into `.py` files, possibly somewhere in the `data` folder. I would also organize the repository so that all data concerns are separate, possibly also separating processed data from raw data. (More generally, raw data above a certain total size, perhaps 10 MB, should not live in GitHub repos unless they are dedicated data repos, as it clutters the repo.)
- The Jupyter notebook would then simply import from this "mini-package".
- Once the data loader is in the `.py` file, I would strongly recommend refactoring it. At the moment it is a monolithic end-to-end function with subroutines inside. I would suggest a design that separates (a) switching between file and in-memory representations from (b) the processing pipeline; the processing is also a loop over files, so I would make it "process a single file". More precisely, I would suggest splitting things as follows (a rough sketch follows after this list):
  - raw data loader: file to memory; turns raw data files into an in-memory representation of the raw data
  - processed data loader: file to memory, for processed data
  - processed data saver: memory to file, for processed data
  - data processing: memory in, memory out, for a single file
  - primary user-facing routines with syntactic sugar:
    - processed data loader with sugar: if the processed files are not present, runs everything until they are created, and makes default choices for locations. Also loops over multiple files. Optionally, one could force a rerun and do checksums.
    - raw data loader with sugar
    - data processing for multiple files with sugar. Could be the same method as above, but it might still make sense to add the sugar in a second step.
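For illustration, here is a minimal sketch of what such a module could look like. All function names, default paths, and the use of CSV/pandas are placeholder assumptions on my side, not the project's actual API:

```python
# Illustrative sketch only: names, paths, and the pandas/CSV format are assumptions.
from pathlib import Path

import pandas as pd


def load_raw_data(raw_path: Path) -> pd.DataFrame:
    """Load a single raw data file into memory (file -> in-memory raw representation).

    See the data dictionary (#108) for the expected columns.
    """
    return pd.read_csv(raw_path)


def load_processed_data(processed_path: Path) -> pd.DataFrame:
    """Load a single processed data file into memory."""
    return pd.read_csv(processed_path)


def save_processed_data(df: pd.DataFrame, processed_path: Path) -> None:
    """Write processed data back to disk (memory -> file)."""
    processed_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(processed_path, index=False)


def process_data(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Process a single file's data: memory in, memory out.

    Placeholder for the actual change-point / meal-identification logic.
    """
    processed_df = raw_df.copy()
    # ... actual processing steps go here ...
    return processed_df


def load_processed_data_with_sugar(
    raw_dir: Path = Path("data/raw"),
    processed_dir: Path = Path("data/processed"),
    force_rerun: bool = False,
) -> dict[str, pd.DataFrame]:
    """User-facing convenience routine: loop over files, create processed data if missing."""
    results = {}
    for raw_path in sorted(raw_dir.glob("*.csv")):
        processed_path = processed_dir / raw_path.name
        if force_rerun or not processed_path.exists():
            save_processed_data(process_data(load_raw_data(raw_path)), processed_path)
        results[raw_path.name] = load_processed_data(processed_path)
    return results
```

The point of the split is that each piece can be tested and documented on its own, and the notebook only ever needs to call the sugar routine.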
- Finally, the different methods should have correct docstrings. The docstrings should reference the data dictionaries (see "Review and comments data dictionary" #108), so users know what format they can expect the data in.
- Optional, but recommended best practice: tests for the loaders. There are already test files in the repo, so we could use them for tests via `pytest` and `pydantic` or similar (a rough test sketch follows below).
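As an illustration of how such tests could look with `pytest`, here is a rough sketch. The import path, test file location, and column names are all placeholders and would need to match the actual repo layout and the data dictionary (#108):

```python
# Minimal pytest sketch; the module path and test data location are assumptions
# based on the dataset.py mentioned in this thread, not the actual layout.
from pathlib import Path

import pandas as pd
import pytest

from meal_identification.dataset import (
    load_processed_data,
    load_raw_data,
    process_data,
    save_processed_data,
)

TEST_RAW_FILE = Path("tests/data/sample_raw.csv")  # hypothetical location


@pytest.mark.skipif(not TEST_RAW_FILE.exists(), reason="test data file not present")
def test_load_raw_data_returns_expected_columns():
    df = load_raw_data(TEST_RAW_FILE)
    # Expected columns should come from the data dictionary (#108).
    assert not df.empty
    assert "date" in df.columns  # placeholder column name


def test_process_data_round_trip(tmp_path):
    # Build a tiny in-memory raw frame, process it, save it, and load it back.
    raw_df = pd.DataFrame({"date": ["2024-01-01"], "bgl": [5.5]})  # placeholder schema
    processed_df = process_data(raw_df)
    out_path = tmp_path / "processed.csv"
    save_processed_data(processed_df, out_path)
    assert out_path.exists()
    reloaded = load_processed_data(out_path)
    assert len(reloaded) == len(processed_df)
```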
My intention was to move the function, in some form, into this file here once I had demonstrated it works in the notebook: `0_meal_identification/meal_identification/meal_identification/dataset.py`
So, instead of having a `dataset.py` script for each project, do I have one for all projects in the `bg_control/data` folder? I was initially planning on having the script with classes and functions available to use in `0_meal_identification/meal_identification/meal_identification/dataset.py` (with a similar file for each project), and the data was going to be stored in:
@andytubeee @Phiruby @Tony911029 please check if there is anything that Franz mentioned here that might also be included in the data work as an enhancement.