Research: Getting started and Configuration
This page collects solution ideas for the goals specified in the VEP: https://github.com/vmware/versatile-data-kit/tree/main/specs/vep-2420-getting-started-with-my-data
The VEP has 5 goals, so let's outline solutions for each. These solutions go beyond the scope of a single initiative, and only some of them will be implemented in this one. The aim is to gather as many ideas as possible; they can be prioritized and scoped later.
Prerequisite reading. To make sense of this page, please read:
- The motivation and goals in the VEP
- Optionally, the workflow toolings research
Use the metadata already collected by the configuration builder to automatically:
- [CLI] generate a UI in markdown or streamlit
- [Notebook] Automatically populate Jupyter Notebook User settings page (section for VDK)
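A minimal sketch of the CLI case, assuming the builder's collected metadata can be exported as simple records. `ConfigOption`, `render_markdown` and the `postgres_password` key are hypothetical names used for illustration; they are not part of VDK.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ConfigOption:
    """Hypothetical record of what the configuration builder knows about one key."""
    key: str
    default_value: Any = None
    description: Optional[str] = None
    is_sensitive: bool = False


def render_markdown(options: List[ConfigOption]) -> str:
    """Render the collected metadata as a markdown table."""
    lines = [
        "| Key | Default | Description |",
        "| --- | --- | --- |",
    ]
    for opt in sorted(options, key=lambda o: o.key):
        # Never print the default of a sensitive option.
        default = "******" if opt.is_sensitive else repr(opt.default_value)
        lines.append(f"| {opt.key} | {default} | {opt.description or ''} |")
    return "\n".join(lines)


options = [
    ConfigOption("db_default_type", "sqlite", "Default database type"),
    ConfigOption("postgres_password", None, "Postgres password", is_sensitive=True),
]
markdown = render_markdown(options)
```

The same records could feed a Streamlit form or a Jupyter settings schema instead of a markdown table.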
Extend add() to include a group_id, which will group related properties together.
Grouping: The add() method signature can be modified as follows:
add(key: ConfigKey, default_value: ConfigValue, ..., group_id: str = None)
Groups can be anything. E.g., all Postgres settings would be in one group, Redshift in another, the DAG plugin in yet another, and so on. This would make it easier to search for all relevant and related properties. Grouping can also be used to drive a wizard type of workflow (see below).
- [CLI] vdk config --group postgres
- [Notebook] show in Settings in Group Postgres ?
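To make the grouping idea concrete, here is a small sketch under stated assumptions. `GroupedConfigBuilder` is a toy stand-in written for this page, not the real VDK configuration builder, and the keys are made up.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional


class GroupedConfigBuilder:
    """Toy stand-in for the configuration builder; not the real VDK class."""

    def __init__(self) -> None:
        self._defaults: Dict[str, Any] = {}
        self._groups: Dict[str, List[str]] = defaultdict(list)

    def add(self, key: str, default_value: Any = None,
            group_id: Optional[str] = None) -> None:
        self._defaults[key] = default_value
        if group_id is not None:
            self._groups[group_id].append(key)

    def keys_in_group(self, group_id: str) -> List[str]:
        """What a command like `vdk config --group <group_id>` could list."""
        return list(self._groups.get(group_id, []))


builder = GroupedConfigBuilder()
builder.add("postgres_host", "localhost", group_id="postgres")
builder.add("postgres_port", 5432, group_id="postgres")
builder.add("redshift_host", None, group_id="redshift")
postgres_keys = builder.keys_in_group("postgres")
```

Keeping `group_id` optional means existing `add()` calls would keep working unchanged.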
Consider switching from .ini to Python files for configuration. That gives you autocompletion, type checking, syntax highlighting, tooltips on hover, better deprecation of options, and so on. Tools like Flask and Jupyter already use Python files as configuration, so there is existing tooling that can be reused.
Configuration values could be declared in Python like this (the code below could be auto-generated from the vdk configuration builder):
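from dataclasses import dataclass

@dataclass
class DBConfig:
    """Assumed common base class for database configurations."""

@dataclass
class SnowflakeConfig(DBConfig):
    account: str
    user: str
    password: str
    warehouse: str
    role: str
    database: str
    schema: str = 'PUBLIC'  # default to PUBLIC schema

@dataclass
class MainConfig:
    db_default_type: DBConfig
    ...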
and the user provides config.vdk.py:
config = MainConfig(db_default_type=SnowflakeConfig(account="xxx", ...))
You can also have config.staging.vdk.py with different configuration for staging.
Configuration files would still be .py files, but would require the special .vdk.py extension to set them apart.
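One possible way to load such a file, sketched with the standard library's `runpy`. The `load_config` helper and the convention that the file exposes a module-level `config` variable are assumptions for illustration. A plain dict stands in for the config object to keep the example short.

```python
import runpy
import tempfile
from pathlib import Path
from typing import Any


def load_config(path: Path) -> Any:
    """Execute a *.vdk.py file and return its module-level `config` object."""
    if not path.name.endswith(".vdk.py"):
        raise ValueError(f"{path.name} is not a .vdk.py configuration file")
    namespace = runpy.run_path(str(path))
    if "config" not in namespace:
        raise ValueError(f"{path.name} does not define a `config` variable")
    return namespace["config"]


# Demo: write a throwaway staging file and load it.
with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp) / "config.staging.vdk.py"
    staging.write_text('config = {"db_default_type": "snowflake"}\n')
    loaded = load_config(staging)
```

Because the file is executed as ordinary Python, it should only be loaded from locations the user trusts.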
To make sure these classes are not used outside of a configuration file, we can add checks in the constructor. Alternatively, we can override the import function (or extend sys.meta_path with a new loader) to introduce custom behavior when certain modules or classes are imported.
- [CLI] vdk run --config-file config.staging.vdk.py
- [Notebook] We can add a VDK Config Cell. These config cells would need to be obfuscated (e.g. if a config option is marked as sensitive). This could be achieved using IPython cell magic (e.g. the user enters 1234 as the password below and on save it is obfuscated):
%%vdkconfig
c.Postgres.user = name
c.Postgres.password = ****
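The masking step on save could look roughly like the sketch below. It only covers rewriting the cell text; where the real value is kept (for example in secrets) is left open. `obfuscate_cell` and the `Section.option` naming of sensitive options are assumptions.

```python
import re
from typing import Set

# Matches lines such as `c.Postgres.password = '1234'`.
ASSIGNMENT = re.compile(r"^(\s*c\.(\w+)\.(\w+)\s*=\s*)(.+)$")


def obfuscate_cell(source: str, sensitive: Set[str]) -> str:
    """Mask the right-hand side of assignments to sensitive options."""
    result = []
    for line in source.splitlines():
        match = ASSIGNMENT.match(line)
        if match and f"{match.group(2)}.{match.group(3)}" in sensitive:
            line = match.group(1) + "****"
        result.append(line)
    return "\n".join(result)


cell = "c.Postgres.user = 'name'\nc.Postgres.password = '1234'"
masked = obfuscate_cell(cell, sensitive={"Postgres.password"})
```

The set of sensitive options could come from the configuration builder's existing metadata rather than being passed in by hand.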
Possible implementation:
Leverage the traitlets library, which provides some support for Python configuration. Create a new plugin, named vdk-traitlets, to facilitate this. (TODO: Evaluate alternatives to traitlets, such as Pydantic or others listed here.)
Failing at time of use may be too late; it is better to fail as soon as the user sets the value. Enhance the add() method to accept a validator function that validates configuration values at runtime, when they are set.
add(key: ConfigKey, default_value: ConfigValue, ..., validator: Callable)
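A sketch of what fail-fast validation could look like. `ValidatingConfigBuilder`, `set_value` and `valid_port` are illustrative names, and the convention that a validator raises `ValueError` is an assumption, not an existing VDK contract.

```python
from typing import Any, Callable, Dict, Optional

Validator = Callable[[Any], None]  # assumed contract: raise ValueError on bad input


class ValidatingConfigBuilder:
    """Toy stand-in for the configuration builder; not the real VDK class."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._validators: Dict[str, Validator] = {}

    def add(self, key: str, default_value: Any = None,
            validator: Optional[Validator] = None) -> None:
        self._values[key] = default_value
        if validator is not None:
            self._validators[key] = validator

    def set_value(self, key: str, value: Any) -> None:
        # Validate at the moment the user sets the value, not at time of use.
        validator = self._validators.get(key)
        if validator is not None:
            validator(value)
        self._values[key] = value

    def get_value(self, key: str) -> Any:
        return self._values[key]


def valid_port(value: Any) -> None:
    if not isinstance(value, int) or not 0 < value < 65536:
        raise ValueError(f"invalid port: {value!r}")


builder = ValidatingConfigBuilder()
builder.add("postgres_port", 5432, validator=valid_port)
builder.set_value("postgres_port", 5433)
```

A rejected value leaves the previous one in place, so a bad setting never reaches the job.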
Introduce a search feature that enables users to easily find properties in the UI or CLI.
- [CLI] vdk config --search .?
- [Notebook/IDE] However, if we adopt Python-based properties, we could leverage native Python auto-complete and IDE search capabilities.
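For the CLI side, the search could be as simple as a case-insensitive match over keys and descriptions. The sketch below assumes the descriptions are available as a plain mapping; `search_options` and the sample keys are hypothetical.

```python
from typing import Dict, List


def search_options(descriptions: Dict[str, str], term: str) -> List[str]:
    """Return keys whose name or description contains `term` (case-insensitive)."""
    needle = term.lower()
    return sorted(
        key for key, description in descriptions.items()
        if needle in key.lower() or needle in description.lower()
    )


descriptions = {
    "postgres_host": "Host name of the Postgres server",
    "redshift_host": "Host name of the Redshift cluster",
    "log_level": "Logging verbosity",
}
matches = search_options(descriptions, "postgres")
```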
Provide templates with pre-filled configurations for common tasks, so users can start with a working example.
See below in section 2 for the workflow
Extend the CLI and Jupyter Notebooks to offer an interactive job or step creation process that handles all needed configuration dynamically.
Below is an example workflow with the CLI:
- Initiate Interactive CLI
vdk create --interactive
Here, the --interactive flag initiates the guided workflow.
Step 1: Choose Job Type
Prompt: "What type of job would you like to create?"
- Data Ingestion
- Data Transformation
- Data Validation
- Custom
Choose job type: Data Ingestion
Step 2: Source Configuration
Prompt: "Select your data source type:"
- File
- Database
- Stream
- API
- Custom
Choose source type: Database
Prompt: What database
Choose some_db
Prompt: Database-specific configurations will appear based on the group_id (group_id == some_db)
Step 3: Destination Configuration
Prompt: "Select your data destination:"
- File
- Database
- Stream
- API
Similar to source
Step 4: Display a summary of all configurations.
Prompt: "Would you like to proceed?"
The system will then generate the necessary code for the chosen configurations. The code will be production-ready: it will have the necessary configuration keys set and the correct methods called for extracting and loading data (in the case of ingestion). The necessary plugins and dependencies will be set in the job's requirements.txt and installed automatically.
Step 5: Run the job
vdk run <job_name>
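The steps above can be sketched as a small function that takes the prompting mechanism as a parameter, so the same logic can sit behind a terminal or a notebook. `run_wizard`, `console_ask` and the rule that only ingestion jobs ask for source and destination are assumptions made for this example.

```python
from typing import Callable, Dict, List

Ask = Callable[[str, List[str]], str]  # (prompt, options) -> chosen option

JOB_TYPES = ["Data Ingestion", "Data Transformation", "Data Validation", "Custom"]
SOURCE_TYPES = ["File", "Database", "Stream", "API", "Custom"]
DESTINATION_TYPES = ["File", "Database", "Stream", "API"]


def run_wizard(ask: Ask) -> Dict[str, str]:
    """Walk through the steps and return the collected choices."""
    choices = {"job_type": ask("What type of job would you like to create?", JOB_TYPES)}
    # Assumption: only ingestion jobs need a source and a destination.
    if choices["job_type"] == "Data Ingestion":
        choices["source"] = ask("Select your data source type:", SOURCE_TYPES)
        choices["destination"] = ask("Select your data destination:", DESTINATION_TYPES)
    return choices


def console_ask(prompt: str, options: List[str]) -> str:
    """Terminal front end; re-prompts until the answer is a listed option."""
    while True:
        print(prompt)
        for option in options:
            print(f"- {option}")
        answer = input("> ").strip()
        if answer in options:
            return answer


# A scripted front end, e.g. for tests or as a stand-in for a notebook widget.
scripted = iter(["Data Ingestion", "Database", "File"])
choices = run_wizard(lambda prompt, options: next(scripted))
```

Calling `run_wizard(console_ask)` would run the same steps interactively in a terminal.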
It is also important that this is a flexible, extensible framework that allows contributors to easily add new templates with custom workflows.
Key Components:
- Template Repository: A GitHub repository (vdk-templates) where contributors can add their own folder templates with the necessary files and description.
- Configuration Metafile: Each folder (template) should have a config.meta file that describes the parameters and the workflow logic. This can be written in JSON or YAML.
- CLI Interface: Enhance the existing VDK CLI to support the guided workflow by fetching all templates and then interpreting the config.meta file.
- Notebook interface: Enhance existing Notebook UI to support the guided workflow similar to CLI interface
Structure: Every template folder should contain:
- The code template
- A config.meta file
- README.md for manual instructions
/template_folder
    /example_code_folder
    config.meta
    README.md
# config.meta
{
  "job_type": "Data Ingestion",
  "parameters": [
    {"name": "Database Host", "type": "string", "group_id": "database"},
    // ...
  ],
  "workflow_logic": "workflow.py" // optional, more below
}
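A sketch of how a front end might parse and sanity-check a config.meta file. It assumes the file is strict JSON (so the `//` comments shown above would have to be dropped) and that each parameter needs at least `name` and `type`; both are assumptions, as is the `parse_config_meta` name.

```python
import json
from typing import Any, Dict

# Assumed minimum set of fields per parameter.
REQUIRED_PARAMETER_FIELDS = {"name", "type"}


def parse_config_meta(text: str) -> Dict[str, Any]:
    """Parse config.meta (JSON, without comments) and check its basic shape."""
    meta = json.loads(text)
    if "job_type" not in meta:
        raise ValueError("config.meta must define job_type")
    for parameter in meta.get("parameters", []):
        missing = REQUIRED_PARAMETER_FIELDS - set(parameter)
        if missing:
            raise ValueError(f"parameter is missing fields: {sorted(missing)}")
    return meta


meta = parse_config_meta("""
{
  "job_type": "Data Ingestion",
  "parameters": [
    {"name": "Database Host", "type": "string", "group_id": "database"}
  ],
  "workflow_logic": "workflow.py"
}
""")
```

If YAML were chosen instead, only the `json.loads` call would change; the shape checks stay the same.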
For more complex, dynamic workflow logic, the workflow_logic key in config.meta can point to a Python script that's responsible for conditionally setting parameters.
# workflow.py
def execute_workflow(user_choices):
    if user_choices['source'] == 'API':
        pass  # Do something
    else:
        pass  # Do something else
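One way a front end could load and call a template's workflow script is the standard `importlib` recipe for importing a file by path. In this sketch `run_workflow_logic`, the module name `template_workflow` and the returned `needs_auth_token` flag are all hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Any, Dict

# Stand-in for a template's workflow.py, used only for the demo below.
WORKFLOW_SOURCE = '''
def execute_workflow(user_choices):
    if user_choices["source"] == "API":
        return {"needs_auth_token": True}
    return {"needs_auth_token": False}
'''


def run_workflow_logic(template_dir: Path, meta: Dict[str, Any],
                       user_choices: Dict[str, str]) -> Dict[str, Any]:
    """Run the template's optional workflow script, if config.meta names one."""
    script = meta.get("workflow_logic")
    if script is None:
        return {}
    spec = importlib.util.spec_from_file_location(
        "template_workflow", str(template_dir / script)
    )
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.execute_workflow(user_choices)


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "workflow.py").write_text(WORKFLOW_SOURCE)
    extra = run_workflow_logic(
        Path(tmp), {"workflow_logic": "workflow.py"}, {"source": "API"}
    )
```

Since templates come from a shared repository, running their scripts means executing third-party code, which is worth keeping in mind when reviewing contributions.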
Since both the CLI and the UI/Notebook need to be supported, we need to make sure the workflow logic is abstracted from either front end.
- CLI and Notebook should have the ability to fetch the list of available templates from the vdk-templates GitHub repo and present them to the user as options.
- Parse config.meta if there is one; if not, just copy the example job.
- Interpret and validate config.meta for each template.
- Present options and parameters to the user based on the config.meta.
- Dynamic Workflow Logic: Optionally, execute a Python script (workflow.py mentioned above) to allow conditional logic based on user's choices.
Contributing a new template would involve:
- Creating a new folder in the vdk-templates GitHub repository.
- Adding the necessary code template.
- Optionally, writing a config.meta file that defines the parameters and workflow.
- Optionally, adding a workflow.py for dynamic logic.
TODO: evaluate also leveraging libraries like
- Cookiecutter https://github.com/cookiecutter/cookiecutter
- https://github.com/SBoudrias/Inquirer.js or https://inquirerpy.readthedocs.io/
- https://github.com/prompt-toolkit/python-prompt-toolkit
- Formik(React)
Give users the ability to auto-generate code snippets based on keywords or in some other way.
More advanced: auto-generate code snippets based on the user's activity in the IDE to accelerate development.
Provide the standard API reference documentation that users are used to. We could generate it using Sphinx or a similar tool.
Currently, config.ini is stored in source control, making it difficult to maintain confidential or sensitive information securely.
We can transition to a more secure and centralized configuration by using the VDK Control Service's Properties and Secrets API to store the vdk configuration.
To make the change smooth and ensure the user experience is preserved, the new workflow will allow users to still use config.ini for configuration. However, instead of committing this file to source control, we'll parse it and securely upload its contents to VDK Control Service.
Workflow:
- The user runs vdk deploy -p <directory> --env prod
- VDK will read the configuration from config.ini (or config.vdk.py) or config.prod.ini
- Instead of being uploaded to source control, the configuration would be stored in Secrets or Properties, in a special part set aside for it.
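The parsing half of this workflow could be sketched with the standard `configparser` module. The sketch assumes the settings live in a `[vdk]` section and that the set of sensitive keys is known (for example from the configuration builder); `split_config` and the sample keys are hypothetical, and the actual upload to the Control Service is out of scope here.

```python
import configparser
from typing import Dict, Set, Tuple


def split_config(ini_text: str,
                 sensitive_keys: Set[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split the [vdk] section into (properties, secrets) ready for upload."""
    # Interpolation is disabled so values containing '%' are read verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    properties: Dict[str, str] = {}
    secrets: Dict[str, str] = {}
    if parser.has_section("vdk"):
        for key, value in parser.items("vdk"):
            target = secrets if key in sensitive_keys else properties
            target[key] = value
    return properties, secrets


INI = """
[vdk]
db_default_type = postgres
postgres_password = 1234
"""
properties, secrets = split_config(INI, sensitive_keys={"postgres_password"})
```

The two dictionaries could then be sent to the Properties and Secrets APIs respectively.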
What if a user wants to keep config.ini in their own source control? They can still do that. We could provide a vdk obfuscate-config command to obfuscate only the sensitive values.
New Commands that may be introduced
- vdk generate-config: To generate a new config.ini.
- vdk upload-config: To upload the parsed configuration to the centralized system.
For this to happen we need Dynamic configuration. See the research here for more: https://github.com/vmware/versatile-data-kit/wiki/Research:-Dynamic-Configuration
Remove the promotion of environment variables from documentation. Search for all environment variables mentioned and replace them with config.ini.
The problem with config.ini is that it is per data job, and users have many jobs that share common configuration (e.g. database settings).
- [CLI] We should introduce more global settings using ~/.vdk/config file
- [Notebook] Use Jupyter Settings
Establish and document rules for what takes precedence when both env vars and properties are set as the number of configuration providers rises.
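Once the rules are agreed, they could be expressed very directly with `collections.ChainMap`, where earlier mappings win. The order used below (env vars, then job config.ini, then ~/.vdk/config, then defaults) is only an assumption for illustration, not a decided rule, and `resolve` is a hypothetical helper.

```python
from collections import ChainMap
from typing import Mapping


def resolve(env_vars: Mapping[str, str], job_config: Mapping[str, str],
            global_config: Mapping[str, str], defaults: Mapping[str, str]) -> ChainMap:
    """Earlier mappings win: env vars > job config.ini > ~/.vdk/config > defaults."""
    return ChainMap(dict(env_vars), dict(job_config), dict(global_config), dict(defaults))


config = resolve(
    env_vars={"log_level": "DEBUG"},
    job_config={"log_level": "INFO", "db_default_type": "postgres"},
    global_config={"db_default_type": "sqlite", "team": "my-team"},
    defaults={"log_level": "WARNING", "team": "default"},
)
```

Keeping the order in one place like this also makes it easy to document and to test.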
Allow running jobs within the IDE without going through the VDK CLI. This can be facilitated by enabling a method like StandaloneDataJob().run() in the main Python file. Implementation Details:
Define a class StandaloneDataJob with a run() method. It can internally call the necessary hooks and setup required by VDK. Developers would then use it like this:
def main():
    result = StandaloneDataJob().run()
Here are my preferences and notes on the topic:
- Python Files Configuration - I believe this is a great idea... It would be even better to have an interactive tool, even a CLI tool, which asks a series of prompts to fill in the configuration, validates it, and saves it in a file... Does it make sense?
- Guided workflow / Wizard Assistant - I would prefer this one even to a large library of templates (although I would assume it would be based on exactly such a, potentially extensible, library of templates).
- Environment Variables – this ties in pretty well with my first choice and it has been a bane of mine… I find it very hard to change configuration through environment variables and putting everything in a single well organized file would be great.
Separate notes: Isn’t the IDE Support relatively cheap to implement? It could be a low hanging fruit for. Production-ready jobs – before we/you start implementing these, should we clarify the general idea for staging/prod deployments, environments, etc… this has a long way to go in terms of maturing…
./setup-vdk [existing_config.py]
- which db do you want impala, presto ...
enter impala configs
logging configuration telemetry endpoint smtp server control service
database="SuperCollider"
job_input.execute_query("SuperCollider", "select 1")
my three favorites:
- #1: https://github.com/vmware/versatile-data-kit/wiki/Research:-Getting-started-and-Configuration#guided-workflow--wizard-assistant-1
- #2: https://github.com/vmware/versatile-data-kit/wiki/Research:-Getting-started-and-Configuration#uinotebook-integration
- #3: https://github.com/vmware/versatile-data-kit/wiki/Research:-Getting-started-and-Configuration#fix-documentation
I selected the ones that seem to add the most value when you start with VDK or try to set up your initial PoC with the framework. However, the next thing (which might be extremely important for evaluators) is security. Therefore, after the three in the list above, I would add: https://github.com/vmware/versatile-data-kit/wiki/Research:-Getting-started-and-Configuration#store-configini-in-cs-propertiessecrets or, alternatively (as a cheaper solution), provide a very well documented way for VDK OSS users, with a tutorial, explaining how to securely set their sensitive data (i.e. how to set and use CS secrets, with an example).