Research: Getting started and Configuration
As specified in https://github.com/vmware/versatile-data-kit/tree/main/specs/vep-2420-getting-started-with-my-data
We have 5 goals, so let's outline solutions for each. These solutions go beyond the scope of a single initiative, and only some of them will be implemented in this one. The goal, however, is to gather as many ideas as possible so they can later be prioritized and scoped better.
Utilize the metadata already collected by the configuration builder to automatically:
- [CLI] generate a UI in markdown or streamlit
- [Notebook] Automatically populate Jupyter Notebook User settings page (section for VDK)
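For instance, a minimal sketch of the [CLI] idea: render the builder's collected metadata as a markdown settings table. The ConfigKeyMetadata shape below is hypothetical; the real metadata lives in vdk-core's ConfigurationBuilder and may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ConfigKeyMetadata:  # hypothetical stand-in for the builder's metadata
    key: str
    default: str
    description: str
    sensitive: bool = False

def render_markdown(keys: List[ConfigKeyMetadata]) -> str:
    """Render registered configuration keys as a markdown table."""
    lines = ["| Key | Default | Description |", "| --- | --- | --- |"]
    for k in keys:
        default = "*****" if k.sensitive else k.default
        lines.append(f"| {k.key} | {default} | {k.description} |")
    return "\n".join(lines)

print(render_markdown([
    ConfigKeyMetadata("db_default_type", "sqlite", "Default database type"),
    ConfigKeyMetadata("postgres_password", "", "Postgres password", sensitive=True),
]))
```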
Extend add() to include a group_id, which will group related properties together.
Grouping: The add() method signature can be modified as follows:
```
add(key: ConfigKey, default_value: ConfigValue, ..., group_id: str = None)
```
- [CLI] vdk config --group postgres
- [Notebook] Show in Settings under the Postgres group?
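As an illustration, a plugin could register its related keys under one group, assuming the extended add() signature above. The vdk_configure hook follows VDK's plugin hook convention, but group_id is the proposed parameter, not an existing one.

```python
def vdk_configure(config_builder) -> None:
    # group_id is the proposed new parameter, not part of today's add() signature.
    for key, default, description in [
        ("postgres_host", "localhost", "Postgres server host"),
        ("postgres_port", 5432, "Postgres server port"),
        ("postgres_user", None, "Postgres user name"),
    ]:
        config_builder.add(
            key=key,
            default_value=default,
            description=description,
            group_id="postgres",
        )
```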
Consider switching from .ini to Python files for configuration. Then you get autocompletion, type checking, syntax highlighting, hover tooltips, and better deprecation of properties.
Configuration values could be declared in Python like this (potentially auto-generated from the VDK configuration builder):
```python
@dataclass
class SnowflakeConfig(DBConfig):
    account: str
    user: str
    password: str
    warehouse: str
    role: str
    database: str
    schema: str = 'PUBLIC'  # default to the PUBLIC schema

@dataclass
class MainConfig:
    db_default_type: DBConfig
    ...
```
and the user provides config.vdk.py:

```python
config = MainConfig(db_default_type=SnowflakeConfig(account="xxx", ...))
```

You could also have a config.staging.vdk.py with different configuration for staging. It would still use the regular .py extension, but additionally require the .vdk.py suffix to set configuration files apart.
To make sure these classes are not used outside of a configuration file, we can add checks in the constructor. Or we can override the import machinery (e.g., extend sys.meta_path with a new loader) to introduce custom behavior when certain modules or classes are imported.
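A minimal sketch of the sys.meta_path idea, assuming config modules end in .vdk (as in config.staging.vdk.py) and that VDK sets a marker while loading configuration; both the marker name and the guard logic are illustrative:

```python
import os
import sys
from importlib.abc import MetaPathFinder

class VDKConfigImportGuard(MetaPathFinder):
    """Refuses to import *.vdk config modules unless VDK is driving the import."""

    def find_spec(self, fullname, path, target=None):
        # VDK_LOADING_CONFIG is a hypothetical marker set by VDK itself.
        if fullname.endswith(".vdk") and os.environ.get("VDK_LOADING_CONFIG") != "1":
            raise ImportError(
                f"{fullname} can only be imported by VDK while loading configuration"
            )
        return None  # defer to the remaining finders for everything else

sys.meta_path.insert(0, VDKConfigImportGuard())
```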
- [CLI] vdk run --config-file config.staging.vdk.py
- [Notebook] We can add a VDK Config Cell. But these config cells would need to be obfuscated (e.g., if a config option is marked as sensitive). This could be achieved using IPython cell magic (e.g., the user enters a password like 1234 below and on save it is obfuscated):
```
%%vdkconfig
c.Postgres.user = name
c.Postgres.password = ****
```
Possible implementation:
Leverage the traitlets library, which provides some support for Python configuration. Create a new plugin, named vdk-traitlets, to facilitate this (TODO: evaluate alternatives to Traitlets, such as Pydantic or others listed here).
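Independent of the Traitlets-vs-Pydantic choice, the obfuscation part of the [Notebook] idea could look roughly like this cell magic sketch. How a key is known to be sensitive (a hard-coded set here) would really come from the configuration metadata, and rewriting the saved notebook cell itself is not shown:

```python
from IPython.core.magic import register_cell_magic

# Placeholder: in reality the sensitive flags come from the config metadata.
SENSITIVE_KEYS = {"password", "token", "api_key"}

@register_cell_magic
def vdkconfig(line, cell):
    """Parse c.<Group>.<key> = <value> lines, masking sensitive values."""
    settings = {}
    for stmt in cell.strip().splitlines():
        target, _, value = stmt.partition("=")
        key = target.strip().rsplit(".", 1)[-1]
        settings[target.strip()] = "****" if key in SENSITIVE_KEYS else value.strip()
    return settings
```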
Failing at the time of use is sometimes too late. Enhance the add() method to accept a validator function so configuration values can be validated as soon as they are set.
```
add(key: ConfigKey, default_value: ConfigValue, ..., validator: Callable)
```
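For example, assuming the extended signature, a port validator could fail fast at configuration time (the hook shape and key name are illustrative):

```python
def validate_port(value) -> None:
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"postgres_port must be between 1 and 65535, got {value!r}")

def vdk_configure(config_builder) -> None:
    # validator is the proposed new parameter, not part of today's add() signature.
    config_builder.add(
        key="postgres_port",
        default_value=5432,
        description="Postgres server port",
        validator=validate_port,
    )
```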
Introduce a search feature that enables users to easily find properties in the UI or CLI
- [CLI] vdk config --search .?
- [Notebook/IDE] However, if we adopt Python-based properties, we could leverage native Python auto-completion and IDE search capabilities instead.
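On the CLI side, the search could be as simple as a regex filter over the registered keys and their descriptions (reusing the hypothetical ConfigKeyMetadata shape from the earlier sketch):

```python
import re

def search_config(keys, pattern: str):
    """Return the config keys whose name or description matches the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [k for k in keys if rx.search(k.key) or rx.search(k.description)]
```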
Provide templates with pre-filled configurations for common tasks, so users can start with a working example.
See below in section 2 for the workflow
Extend the CLI and Jupyter Notebooks to offer an interactive job or step creation process that handles all needed configuration dynamically.
Below is an example workflow with the CLI.
- Initiate the interactive CLI:

```
vdk create --interactive
```

Here, the --interactive flag initiates the guided workflow.
Step 1: Choose Job Type
Prompt: "What type of job would you like to create?"
- Data Ingestion
- Data Transformation
- Data Validation
- Custom
Choose job type: Data Ingestion
Step 2: Source Configuration
Prompt: "Select your data source type:"
- File
- Database
- Stream
- API
- Custom
Choose source type: Database
Prompt: "What database?"
Choose: some_db
Then database-specific configuration prompts appear based on the group_id (group_id == some_db).
Step 3: Destination Configuration
Prompt: "Select your data destination:"
- File
- Database
- Stream
- API
Similar to the source configuration.
Step 4: Display a summary of all configurations.
Prompt: "Would you like to proceed?"
The system will then generate the necessary code for the chosen configurations. The code will be production-ready: it will have the necessary configuration keys set, call the correct methods for extracting and loading data (in the case of ingestion), and list the necessary plugins and dependencies in the job's requirements.txt, which are installed automatically.
Step 5: Run the job
vdk run <job_name>
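A sketch of how this guided flow could be wired with click (the VDK CLI is click-based); the choices mirror the steps above, and everything past the prompts is elided:

```python
import click

@click.command()
@click.option("--interactive", is_flag=True, help="Start the guided job creation workflow.")
def create(interactive: bool) -> None:
    if not interactive:
        return
    job_type = click.prompt(
        "What type of job would you like to create?",
        type=click.Choice(["Data Ingestion", "Data Transformation", "Data Validation", "Custom"]),
    )
    if job_type == "Data Ingestion":
        source = click.prompt(
            "Select your data source type:",
            type=click.Choice(["File", "Database", "Stream", "API", "Custom"]),
        )
        if source == "Database":
            db = click.prompt("What database")
            # Here the real CLI would prompt for every key registered with group_id == db.
            click.echo(f"(prompting for all configuration in group {db})")
    if click.confirm("Would you like to proceed?"):
        click.echo(f"Generating a production-ready {job_type} job ...")

if __name__ == "__main__":
    create()
```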
It is also important that this be a flexible, extensible framework that allows contributors to easily add new templates with custom workflows.
Key Components:
- Template Repository: A GitHub repository (vdk-templates) where contributors can add their own folder templates with the necessary files and description.
- Configuration Metafile: Each folder (template) should have a config.meta file that describes the parameters and the workflow logic. This can be written in JSON or YAML.
- CLI Interface: Enhance the existing VDK CLI to support the guided workflow by fetching all templates and then interpreting the config.meta file.
- Notebook interface: Enhance the existing Notebook UI to support the guided workflow, similar to the CLI interface.
Structure: Every template folder should contain the code template, a config.meta file, and a README.md with manual instructions:

```
/template_folder
    /example_code_folder
    config.meta
    README.md
```
```
# config.meta
{
  "job_type": "Data Ingestion",
  "parameters": [
    {"name": "Database Host", "type": "string", "group_id": "database"},
    // ...
  ],
  "workflow_logic": "workflow.py"  // optional, more below
}
```
For more complex, dynamic workflow logic, the workflow_logic key in config.meta can point to a Python script that's responsible for conditionally setting parameters.
```python
# workflow.py
def execute_workflow(user_choices):
    if user_choices['source'] == 'API':
        ...  # e.g., prompt for API-specific parameters
    else:
        ...  # e.g., prompt for the other source types
```
Since both the CLI and the UI/Notebook need to be supported, we need to make sure the workflow logic is abstracted.
- CLI and Notebook should have the ability to fetch the list of available templates from the vdk-templates GitHub repo and present them to the user as options.
- Parse config.meta (if there is one); if not, just copy the example job.
- Interpret and validate config.meta for each template.
- Present options and parameters to the user based on the config.meta.
- Dynamic Workflow Logic: Optionally, execute a Python script (workflow.py mentioned above) to allow conditional logic based on user's choices.
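A rough sketch of the interpret-and-validate step, following the config.meta fields from the example above (note the // comments in the example are illustrative; a real file would need valid JSON or YAML):

```python
import json

REQUIRED_FIELDS = ("job_type", "parameters")

def run_template_workflow(meta_path: str) -> dict:
    """Load config.meta, validate its minimal shape, and collect user choices."""
    with open(meta_path) as f:
        meta = json.load(f)
    for field in REQUIRED_FIELDS:
        if field not in meta:
            raise ValueError(f"config.meta is missing required field: {field}")
    choices = {}
    for param in meta["parameters"]:
        choices[param["name"]] = input(f"{param['name']} ({param['type']}): ")
    return choices
```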
Adding a new template involves:
- Creating a new folder in the vdk-templates GitHub repository.
- Adding the necessary code template.
- Optionally, writing a config.meta file that defines the parameters and workflow.
- Optionally, adding a workflow.py for dynamic logic.
TODO: also evaluate leveraging libraries such as:
- Cookiecutter https://github.com/cookiecutter/cookiecutter
- https://github.com/SBoudrias/Inquirer.js or https://inquirerpy.readthedocs.io/
- https://github.com/prompt-toolkit/python-prompt-toolkit
- Formik (React)
Provide users the ability to auto-generate code snippets based on keywords or similar input.
More advanced: auto-generate code snippets based on the user's activity in the IDE to accelerate development.
Store configuration in a centralized configuration system rather than in config.ini in the source code. Leverage VDK Control Service's Properties API and Secrets API for this.
Workflow:
When the user runs vdk deploy -p <directory> --env prod, it will read the configuration from config.ini (or config.vdk.py) and, instead of uploading the information to source control, store it in Secrets or Properties in a dedicated section.
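A sketch of that deploy step, splitting config.ini into Properties and Secrets by a naive sensitivity heuristic; the two upload functions are hypothetical stand-ins for the real Control Service client calls:

```python
import configparser

SENSITIVE_MARKERS = ("password", "secret", "token")

def control_service_set_properties(job_name: str, data: dict) -> None:
    print("would store in Properties API:", job_name, data)  # hypothetical stand-in

def control_service_set_secrets(job_name: str, data: dict) -> None:
    print("would store in Secrets API:", job_name, list(data))  # hypothetical stand-in

def upload_job_config(config_path: str, job_name: str) -> None:
    """Read config.ini and route each key to Properties or Secrets."""
    parser = configparser.ConfigParser()
    parser.read(config_path)
    properties, secrets = {}, {}
    for section in parser.sections():
        for key, value in parser.items(section):
            bucket = secrets if any(m in key for m in SENSITIVE_MARKERS) else properties
            bucket[f"{section}.{key}"] = value
    control_service_set_properties(job_name, properties)
    control_service_set_secrets(job_name, secrets)
```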
Remove the promotion of environment variables from the documentation. Find all environment variables mentioned and replace them with their config.ini equivalents.
A problem with using config.ini is that it is per data job:
- [CLI] We should introduce more global settings using a ~/.vdk/config file
- [Notebook] Use Jupyter Settings
Establish and document precedence rules for when both environment variables and properties are set, as the number of configuration providers grows.
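For illustration, one possible precedence order, highest first: environment variable, then Properties, then config.ini, then the built-in default. Which order is correct is exactly what this item proposes to establish and document.

```python
import os

def resolve(key: str, properties: dict, config_file: dict, default=None):
    """Resolve a config value by one possible (illustrative) precedence order."""
    env_value = os.environ.get(f"VDK_{key.upper()}")
    if env_value is not None:
        return env_value
    if key in properties:
        return properties[key]
    if key in config_file:
        return config_file[key]
    return default
```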
Allow running jobs within the IDE without going through the VDK CLI. This can be facilitated by enabling a method like StandaloneDataJob().run() in the main Python file. Implementation details:
Define a class StandaloneDataJob with a run() method. It can internally call the necessary hooks and setup required by VDK. Developers would then use it like this:
```python
def main():
    result = StandaloneDataJob().run()
```