
Research: Getting started and Configuration

Antoni Ivanov edited this page Aug 30, 2023 · 20 revisions

This page collects ideas for the goals specified in https://github.com/vmware/versatile-data-kit/tree/main/specs/vep-2420-getting-started-with-my-data

There are 5 goals, so let's outline possible solutions for each. These solutions go beyond the scope of a single initiative, and only some of them will be implemented in this one. The aim is to gather as many ideas as possible so that they can be prioritized and scoped later.

1. Easily finding out which properties need to be set for a given task

UI/Notebook Integration

Use the metadata already collected by the configuration builder to automatically:

  • [CLI] Generate a UI in Markdown or Streamlit
  • [Notebook] Populate the Jupyter Notebook user settings page (section for VDK)

Configuration Grouping

Extend add() with a group_id parameter that groups related properties together. The add() method signature could be modified as follows:

add(key: ConfigKey, default_value: ConfigValue, ..., group_id: str = None)
  • [CLI] vdk config --group postgres
  • [Notebook] Show the properties in Settings under a Postgres group?
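
To make the grouping idea concrete, here is a minimal sketch. GroupedConfigBuilder, ConfigEntry and list_group() are hypothetical names used only for illustration; this is not the existing VDK configuration builder, and the real add() has more parameters than shown.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ConfigEntry:
    key: str
    default_value: Any
    description: str = ""
    group_id: Optional[str] = None


class GroupedConfigBuilder:
    """Hypothetical stand-in for the configuration builder (not the VDK class)."""

    def __init__(self) -> None:
        self._entries: Dict[str, ConfigEntry] = {}

    def add(
        self,
        key: str,
        default_value: Any,
        description: str = "",
        group_id: Optional[str] = None,
    ) -> "GroupedConfigBuilder":
        # Remember the group next to the rest of the metadata.
        self._entries[key] = ConfigEntry(key, default_value, description, group_id)
        return self

    def list_group(self, group_id: str) -> List[ConfigEntry]:
        # What something like `vdk config --group postgres` could print.
        return [e for e in self._entries.values() if e.group_id == group_id]


builder = GroupedConfigBuilder()
builder.add("POSTGRES_HOST", "localhost", "Database host", group_id="postgres")
builder.add("POSTGRES_PORT", 5432, "Database port", group_id="postgres")
builder.add("LOG_LEVEL", "INFO", "Logging level")

postgres_keys = [entry.key for entry in builder.list_group("postgres")]
```

Keeping group_id optional means existing add() calls would keep working, with ungrouped properties simply not appearing under any group.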

Python Files Configuration

Consider switching from .ini to Python files for configuration. This would give users autocompletion, type checking, syntax highlighting, tooltips on hover, and better deprecation of properties.

Configuration values could be declared in Python like this (the classes could be auto-generated from the VDK configuration builder):

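from dataclasses import dataclass

@dataclass
class DBConfig:
    """Base class for database configurations."""

@dataclass
class SnowflakeConfig(DBConfig):
    account: str
    user: str
    password: str
    warehouse: str
    role: str
    database: str
    schema: str = 'PUBLIC'  # default to PUBLIC schema

@dataclass
class MainConfig:
    db_default_type: DBConfig
    ...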

and the user provides config.vdk.py:

config = MainConfig(db_default_type=SnowflakeConfig(account="xxx", ...))

You can also have config.staging.vdk.py with a different configuration for staging.

The file would still be a Python file, but it would require the special .vdk.py suffix to set it apart. To make sure these classes are not used outside of a configuration file, we can add checks in the constructor. Alternatively, we can override the import function (or extend sys.meta_path with a new finder) to introduce custom behavior when certain modules or classes are imported.
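
As a rough sketch of how such a file could be loaded, the standard importlib machinery is enough. The function name load_vdk_config and the convention that the file exposes a module-level variable called config are assumptions made for this example, and a plain dict stands in for MainConfig to keep it self-contained.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Any


def load_vdk_config(path: str) -> Any:
    """Execute a *.vdk.py file and return its module-level `config` object."""
    config_path = Path(path)
    if not config_path.name.endswith(".vdk.py"):
        raise ValueError(f"expected a .vdk.py file, got {config_path.name!r}")
    # The module name is arbitrary; it only labels the loaded module.
    spec = importlib.util.spec_from_file_location("vdk_user_config", str(config_path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {config_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    if not hasattr(module, "config"):
        raise AttributeError(f"{config_path.name} must define a variable named 'config'")
    return module.config


# Usage: write a throwaway staging config and load it.
with tempfile.TemporaryDirectory() as tmp:
    staging_file = Path(tmp) / "config.staging.vdk.py"
    staging_file.write_text('config = {"database": "staging_db"}\n')
    staging_config = load_vdk_config(str(staging_file))
```

Note that loading the file executes arbitrary Python, which is the price of the flexibility described above.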

  • [CLI] vdk run --config-file config.staging.vdk.py
  • [Notebook] We can add a VDK config cell. Sensitive values in these cells would need to be obfuscated (e.g. if a config option is marked as sensitive). This could be achieved using an IPython cell magic (e.g. the user enters 1234 as the password below, and on save it is obfuscated):
%%vdkconfig
c.Postgres.user = name
c.Postgres.password = ****
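
The obfuscation step could look roughly like the sketch below. It assumes cell lines of the form c.<Group>.<option> = <value>, as in the example above, and assumes the set of sensitive option names comes from the configuration metadata. The function name obfuscate_cell is hypothetical, and storing the real value somewhere safe before masking is out of scope here.

```python
import re
from typing import Set

# Matches lines such as: c.Postgres.password = "1234"
ASSIGNMENT = re.compile(r"^(\s*c\.(\w+)\.(\w+)\s*=\s*)(.+)$")


def obfuscate_cell(source: str, sensitive: Set[str]) -> str:
    """Replace the values of sensitive options with **** and keep other lines."""
    masked_lines = []
    for line in source.splitlines():
        match = ASSIGNMENT.match(line)
        if match and f"{match.group(2)}.{match.group(3)}" in sensitive:
            # Keep the "c.Group.option = " prefix, drop the value.
            line = match.group(1) + "****"
        masked_lines.append(line)
    return "\n".join(masked_lines)


cell = 'c.Postgres.user = "name"\nc.Postgres.password = "1234"'
masked = obfuscate_cell(cell, {"Postgres.password"})
```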

Possible implementation:

Leverage the traitlets library, which provides some support for Python-based configuration. Create a new plugin, named vdk-traitlets, to facilitate this. (TODO: Evaluate alternatives to traitlets, such as Pydantic or others listed here.)

Runtime Validation:

Failing at the time of use is sometimes too late. Enhance the add() method to accept a validator function that validates configuration values at runtime.

add(key: ConfigKey, default_value: ConfigValue, ..., validator: Callable) 
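
A minimal sketch of what a validator-aware add() might look like follows. ValidatingConfigBuilder, validate() and valid_port are hypothetical names, and the convention that a validator returns an error message (or None when the value is valid) is an assumption of this example, not an existing VDK contract.

```python
from typing import Any, Callable, Dict, List, Optional

# Assumed contract: return an error message, or None if the value is valid.
Validator = Callable[[Any], Optional[str]]


class ValidatingConfigBuilder:
    """Hypothetical stand-in for the configuration builder (not the VDK class)."""

    def __init__(self) -> None:
        self._defaults: Dict[str, Any] = {}
        self._validators: Dict[str, Validator] = {}

    def add(self, key: str, default_value: Any, validator: Optional[Validator] = None) -> None:
        self._defaults[key] = default_value
        if validator is not None:
            self._validators[key] = validator

    def validate(self, values: Dict[str, Any]) -> List[str]:
        """Check user-provided values (falling back to defaults) up front."""
        errors = []
        for key, validator in self._validators.items():
            message = validator(values.get(key, self._defaults[key]))
            if message:
                errors.append(f"{key}: {message}")
        return errors


def valid_port(value: Any) -> Optional[str]:
    if isinstance(value, int) and 0 < value < 65536:
        return None
    return f"expected a port between 1 and 65535, got {value!r}"


builder = ValidatingConfigBuilder()
builder.add("DB_PORT", 5432, validator=valid_port)
errors = builder.validate({"DB_PORT": 70000})
```

Collecting all messages instead of raising on the first one lets the user fix every misconfigured property in a single pass.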

Search Functionality

Introduce a search feature that enables users to easily find properties in the UI or CLI

  • [CLI] vdk config --search .?
  • [Notebook/IDE] If we adopt Python-based properties, however, we could leverage native Python autocompletion and IDE search capabilities instead.
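
For the CLI side, a search could be as simple as a case-insensitive substring match over keys and descriptions. The sketch below assumes the property metadata is available as a key-to-description mapping; search_config and the sample entries are illustrative only.

```python
from typing import Dict, List


def search_config(entries: Dict[str, str], query: str) -> List[str]:
    """Return the keys whose name or description contains the query."""
    needle = query.lower()
    return sorted(
        key
        for key, description in entries.items()
        if needle in key.lower() or needle in description.lower()
    )


entries = {
    "POSTGRES_HOST": "Host of the Postgres database",
    "POSTGRES_PASSWORD": "Password for the Postgres user",
    "LOG_LEVEL": "Logging verbosity",
}
postgres_matches = search_config(entries, "postgres")
verbosity_matches = search_config(entries, "verbosity")
```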

Guided workflow / Wizard Assistant

Provide templates with pre-filled configurations for common tasks, so users can start with a working example.

See below in section 2 for the workflow

2. Easily finding out which SDK methods need to be set for a given task

Guided workflow / Wizard Assistant

Extend the CLI and Jupyter Notebooks to offer an interactive job or step creation process that handles all needed configuration dynamically.

Below is an example workflow with the CLI:

  • Initiate Interactive CLI
vdk create --interactive

Here, the --interactive flag initiates the guided workflow.

Step 1: Choose Job Type

Prompt: "What type of job would you like to create?"
- Data Ingestion
- Data Transformation
- Data Validation
- Custom

Choose job type: Data Ingestion

Step 2: Source Configuration

Prompt: "Select your data source type:"
- File
- Database
- Stream
- API
- Custom

Choose source type: Database

Prompt: "Which database?"

Choose database: some_db

Prompt: Database-specific configurations will appear based on the group_id (group_id == some_db)

Step 3: Destination Configuration

Prompt: "Select your data destination:"
- File
- Database
- Stream
- API

Similar to source

Step 4: Display a summary of all configurations.

Prompt: "Would you like to proceed?"

The system will then generate the necessary code for the chosen configurations. The code will be production-ready: it will have the necessary configuration keys set and the correct methods called for extracting and loading data (in the case of ingestion). The necessary plugins and dependencies will be set in the job's requirements.txt and automatically installed.

Step 5: Run the job

vdk run <job_name>

It is also important that this is a flexible, extensible framework that allows contributors to easily add new templates with custom workflows.

Possible implementation

Key Components:

  • Template Repository: A GitHub repository (vdk-templates) where contributors can add their own folder templates with the necessary files and description.
  • Configuration Metafile: Each folder (template) should have a config.meta file that describes the parameters and the workflow logic. This can be written in JSON or YAML.
  • CLI Interface: Enhance the existing VDK CLI to support the guided workflow by fetching all templates and then interpreting the config.meta file.
  • Notebook interface: Enhance the existing Notebook UI to support the guided workflow, similar to the CLI interface.

Template repository

Structure: Every template folder should contain:

  • The code template
  • A config.meta file
  • README.md for manual instructions

/template_folder
    /example_code_folder
    config.meta
    README.md

# config.meta
{
  "job_type": "Data Ingestion",
  "parameters": [
    {"name": "Database Host", "type": "string", "group_id": "database"},
    // ...
  ],
  "workflow_logic": "workflow.py" // optional, more below
}

For more complex, dynamic workflow logic, the workflow_logic key in config.meta can point to a Python script that's responsible for conditionally setting parameters.

# workflow.py
def execute_workflow(user_choices):
    if user_choices['source'] == 'API':
        # Do something
        pass
    else:
        # Do something else
        pass

Since both the CLI and the UI/Notebook need to be supported, we need to make sure the workflow logic is abstracted from the interface.

CLI/Notebook interface

  1. The CLI and Notebook should be able to fetch the list of available templates from the vdk-templates GitHub repo and present them to the user as options.

  2. Parse config.meta if there is one; if not, just copy the example job.

  • Interpret and validate config.meta for each template.
  • Present options and parameters to the user based on the config.meta.
  • Dynamic Workflow Logic: Optionally, execute a Python script (workflow.py mentioned above) to allow conditional logic based on user's choices.
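
The interpret-and-validate step could be sketched as below, assuming config.meta is strict JSON (so without the // comments used in the illustration above). The required fields, the function names and the prompt format are assumptions for this example rather than a defined schema.

```python
import json
from typing import Any, Dict, List

# Assumed minimal schema: every parameter needs at least these fields.
REQUIRED_PARAMETER_FIELDS = ("name", "type")


def parse_config_meta(text: str) -> Dict[str, Any]:
    """Parse config.meta and fail early with a readable message if it is invalid."""
    meta = json.loads(text)
    if "job_type" not in meta:
        raise ValueError("config.meta must define 'job_type'")
    for index, parameter in enumerate(meta.get("parameters", [])):
        missing = [field for field in REQUIRED_PARAMETER_FIELDS if field not in parameter]
        if missing:
            raise ValueError(f"parameter #{index} is missing: {', '.join(missing)}")
    return meta


def build_prompts(meta: Dict[str, Any]) -> List[str]:
    """Turn parameters into prompt strings that a CLI or Notebook could display."""
    return [
        f"[{p.get('group_id', 'general')}] {p['name']} ({p['type']}): "
        for p in meta.get("parameters", [])
    ]


example_meta = """
{
  "job_type": "Data Ingestion",
  "parameters": [
    {"name": "Database Host", "type": "string", "group_id": "database"}
  ]
}
"""
meta = parse_config_meta(example_meta)
prompts = build_prompts(meta)
```

Returning plain strings keeps the prompt logic independent of the front end, so the CLI and the Notebook could share it.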

Adding a New Template/Example

  • Creating a new folder in the vdk-templates GitHub repository.
  • Adding the necessary code template.
  • Optionally, writing a config.meta file that defines the parameters and workflow.
  • Optionally, adding a workflow.py for dynamic logic.

TODO: evaluate also leveraging libraries like

Snippet Generation

Give users the ability to auto-generate code snippets based on keywords or some other input.

More advanced: auto-generate code snippets based on the user's activity in the IDE to accelerate development.

3. Production-Ready Jobs

Store config.ini in CS Properties/Secrets

Store configuration in a centralized configuration system rather than in config.ini in the source code. Leverage VDK Control Service's Properties API and Secrets API for this.

Workflow:

When the user runs vdk deploy -p <directory> --env prod, the configuration is read from config.ini (or config.vdk.py) and, instead of being uploaded to source control, it is stored in Secrets or Properties in a dedicated area reserved for it.
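
The first half of that workflow, deciding what goes where, might look like the sketch below. It uses the standard configparser module and assumes that the set of sensitive keys is known from the configuration metadata. The function name split_config and the sample keys are hypothetical, and the actual calls to the Properties and Secrets APIs are deliberately left out.

```python
import configparser
from typing import Dict, Set, Tuple


def split_config(ini_text: str, sensitive_keys: Set[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split config.ini content into (properties, secrets)."""
    # interpolation=None keeps values containing '%' (common in passwords) intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    properties: Dict[str, str] = {}
    secrets: Dict[str, str] = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            # configparser lower-cases option names by default.
            target = secrets if key in sensitive_keys else properties
            target[key] = value
    return properties, secrets


ini_text = """
[vdk]
db_default_type = postgres
postgres_user = job_user
postgres_password = s3cr%t
"""
properties, secrets = split_config(ini_text, sensitive_keys={"postgres_password"})
```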

4. Environment Variables

Fix Documentation

Remove the promotion of environment variables from the documentation. Search for all environment variables mentioned and replace them with config.ini.

Provide machine level Global Settings:

The problem with using config.ini is that it is per data job:

  • [CLI] We should introduce more global settings using a ~/.vdk/config file
  • [Notebook] Use Jupyter Settings

Precedence Rules

Establish and document which source takes precedence when both environment variables and properties are set. This becomes more important as the number of configuration providers rises.
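
One way to make precedence explicit is an ordered list of providers where the first one that defines a key wins. The order used below (environment, then the job's config.ini, then ~/.vdk/config) is purely an assumption for illustration; deciding the actual order is exactly what this item asks for.

```python
from typing import Any, Dict, List, Tuple

Provider = Tuple[str, Dict[str, Any]]


def resolve(key: str, providers: List[Provider]) -> Tuple[Any, str]:
    """Return (value, provider_name) from the first provider that defines the key."""
    for name, values in providers:
        if key in values:
            return values[key], name
    raise KeyError(key)


# Highest precedence first (assumed order, see the note above).
providers: List[Provider] = [
    ("environment", {"log_level": "DEBUG"}),
    ("config.ini", {"db_default_type": "postgres"}),
    ("~/.vdk/config", {"log_level": "INFO", "db_default_type": "sqlite"}),
]

log_level = resolve("log_level", providers)
db_type = resolve("db_default_type", providers)
```

Returning the provider name along with the value makes it possible to show users where an effective setting came from.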

5. IDE Support

Non-CLI entry point option

Allow running jobs within the IDE without going through the VDK CLI. This can be facilitated by enabling a method like StandaloneDataJob().run() in the main Python file.

Implementation details: Define a class StandaloneDataJob with a run() method. It can internally call the necessary hooks and setup required by VDK. Developers would then use it like this:

def main():
    result = StandaloneDataJob().run()