
Research: Getting started and Configuration

Antoni Ivanov edited this page Aug 31, 2023 · 20 revisions

As specified in https://github.com/vmware/versatile-data-kit/tree/main/specs/vep-2420-getting-started-with-my-data, we have 5 goals, so let's outline solutions for each.

These solutions go beyond the scope of a single initiative, and only some of them will be implemented in this one. The goal is to gather as many ideas as possible; they can be prioritized and scoped better later.

Pre-requisite reading. To make sense of the page please read

1. Easily finding out which properties need to be set for a given task

UI/Notebook Integration

Utilize the existing configuration builder collected metadata to automatically

Configuration Grouping

Extend add() to include a group_id, which will group related properties together.

Grouping: The add() method signature can be modified as follows:

add(key: ConfigKey, default_value: ConfigValue, ..., group_id: str = None)

Groups can be anything. For example, all Postgres settings would be in one group, Redshift in another, the DAG plugin in yet another, and so on. This would make it easier to search for relevant and related properties. Grouping can also be used to drive a wizard type of workflow (see below).

  • [CLI] vdk config --group postgres
  • [Notebook] show in Settings in Group Postgres ?
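
A minimal sketch of how group_id could be stored and queried. ConfigBuilder, ConfigEntry, and keys_in_group are hypothetical, simplified stand-ins used for illustration; they are not the actual VDK configuration builder API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ConfigEntry:
    key: str
    default_value: Any
    description: str = ""
    group_id: Optional[str] = None


class ConfigBuilder:
    """Hypothetical, simplified stand-in for the VDK configuration builder."""

    def __init__(self) -> None:
        self._entries: Dict[str, ConfigEntry] = {}

    def add(self, key, default_value, description="", group_id=None) -> None:
        self._entries[key] = ConfigEntry(key, default_value, description, group_id)

    def keys_in_group(self, group_id: str) -> List[str]:
        # What a command like `vdk config --group postgres` could list.
        return sorted(k for k, e in self._entries.items() if e.group_id == group_id)


builder = ConfigBuilder()
builder.add("POSTGRES_HOST", "localhost", "Postgres host", group_id="postgres")
builder.add("POSTGRES_USER", None, "Postgres user", group_id="postgres")
builder.add("REDSHIFT_HOST", None, "Redshift host", group_id="redshift")
print(builder.keys_in_group("postgres"))
```

Keeping group_id optional means existing add() calls would keep working; ungrouped keys simply do not show up in any group listing.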

Python Files Configuration

Consider switching from .ini to Python files for configuration. This gives you autocompletion, type checking, syntax highlighting, tooltips on hover, better deprecation of options, and so on. Some tools, like Flask and Jupyter, already use Python files as configuration, so there is existing tooling that can be reused.

You could declare configuration values in Python like this (the code below could be auto-generated from the VDK configuration builder):

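from dataclasses import dataclass

@dataclass
class DBConfig:
    """Common base class for database configurations."""

@dataclass
class SnowflakeConfig(DBConfig):
    account: str
    user: str
    password: str
    warehouse: str
    role: str
    database: str
    schema: str = 'PUBLIC'  # default to PUBLIC schema

@dataclass
class MainConfig:
    db_default_type: DBConfig
    ...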

and the user provides config.vdk.py:

config = MainConfig(db_default_type=SnowflakeConfig(account="xxx", ...))

You can also have config.staging.vdk.py with a different configuration for staging.

The file would still be a .py file, but it would require the .vdk.py suffix to set it apart. To make sure these classes are not used outside of a configuration file, we can add checks in the constructor. Alternatively, we can override the import function (or extend sys.meta_path with a new loader) to introduce custom behavior when certain modules or classes are imported.
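
A minimal sketch of how a .vdk.py file could be loaded by path, using only the standard importlib machinery. The load_config_file helper and the module name vdk_user_config are assumptions for illustration, and the demo config is a plain dict rather than the dataclasses above so that the snippet stays self-contained.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_config_file(path: Path):
    """Execute a Python config file and return its module-level `config` object."""
    # Loading by explicit path avoids requiring the file to be importable by name.
    spec = importlib.util.spec_from_file_location("vdk_user_config", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.config


# Demo: write a throwaway config.staging.vdk.py and load it.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.staging.vdk.py"
    config_path.write_text('config = {"db_default_type": "snowflake", "schema": "PUBLIC"}\n')
    loaded = load_config_file(config_path)

print(loaded["db_default_type"])
```

Since the file is executed as ordinary Python, it can run arbitrary code; that is the trade-off against a declarative .ini file.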

  • [CLI] vdk run --config-file config.staging.vdk.py
  • [Notebook] We can add a VDK Config Cell. These config cells would need to be obfuscated (e.g. if a config option is marked as sensitive). This could be achieved using IPython cell magic (e.g. the user enters 1234 as the password below, and on save it is obfuscated):
%%vdkconfig
c.Postgres.user = name
c.Postgres.password = ****
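
A minimal sketch of the obfuscation step only, assuming the set of sensitive keys is known from the configuration metadata. The obfuscate helper is hypothetical, and where the real value gets stored once it is masked is left open here.

```python
from typing import Dict, Set


def obfuscate(values: Dict[str, str], sensitive_keys: Set[str]) -> Dict[str, str]:
    """Return a copy where the values of sensitive keys are replaced by a mask."""
    return {k: ("****" if k in sensitive_keys else v) for k, v in values.items()}


cell_values = {"Postgres.user": "name", "Postgres.password": "1234"}
print(obfuscate(cell_values, sensitive_keys={"Postgres.password"}))
```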

Possible implementation:

Leverage the traitlets library, which provides some support for Python-based configuration. Create a new plugin, named vdk-traitlets, to facilitate this. (TODO: Evaluate alternatives to traitlets, such as Pydantic or others listed here )

Runtime Validation:

Failing at the time of use may be too late. It is better to fail as soon as the user sets the value. Enhance the add() method to accept a validator function that validates configuration values at runtime.

add(key: ConfigKey, default_value: ConfigValue, ..., validator: Callable) 
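
A minimal sketch of validating at set time rather than at use time. ValidatingConfig, set_value, and validate_port are hypothetical names; the only idea taken from the text above is that add() accepts a validator callable.

```python
from typing import Any, Callable, Dict, Optional

Validator = Callable[[Any], None]  # expected to raise ValueError on invalid input


class ValidatingConfig:
    """Hypothetical stand-in showing validation when a value is set."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._validators: Dict[str, Validator] = {}

    def add(self, key: str, default_value: Any, validator: Optional[Validator] = None) -> None:
        if validator is not None:
            self._validators[key] = validator
        self._values[key] = default_value

    def set_value(self, key: str, value: Any) -> None:
        validator = self._validators.get(key)
        if validator is not None:
            validator(value)  # fail immediately, before the value is stored
        self._values[key] = value

    def get_value(self, key: str) -> Any:
        return self._values[key]


def validate_port(value: Any) -> None:
    if not isinstance(value, int) or not 1 <= value <= 65535:
        raise ValueError(f"port must be an integer in 1..65535, got {value!r}")


config = ValidatingConfig()
config.add("DB_PORT", 5432, validator=validate_port)
config.set_value("DB_PORT", 5433)

try:
    config.set_value("DB_PORT", "not-a-port")
except ValueError as error:
    print(f"rejected: {error}")
```

Because the validator runs before the assignment, a rejected value never replaces the previous valid one.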

Search Functionality

Introduce a search feature that enables users to easily find properties in the UI or CLI.

  • [CLI] vdk config --search .?
  • [Notebook/IDE] However, if we adopt Python-based properties, we could leverage native Python auto-complete and IDE search capabilities.

Guided workflow / Wizard Assistant

Provide templates with pre-filled configurations for common tasks, so users can start with a working example.

See below in section 2 for the workflow

2. Easily finding out which SDK functionalities and methods are needed for a given task

Guided workflow / Wizard Assistant

Extend the CLI and Jupyter Notebooks to offer an interactive job or step creation process that handles all needed configuration dynamically.

Below is an example workflow with the CLI:

  • Initiate Interactive CLI
vdk create --interactive

Here, the --interactive flag initiates the guided workflow.

Step 1: Choose Job Type

Prompt: "What type of job would you like to create?"
- Data Ingestion
- Data Transformation
- Data Validation
- Custom

Choose job type: Data Ingestion

Step 2: Source Configuration

Prompt: "Select your data source type:"
- File
- Database
- Stream
- API
- Custom

Choose source type: Database

Prompt: "Which database?"

Choose database: some_db

Prompt: Database-specific configurations will appear based on the group_id (group_id == some_db)

Step 3: Destination Configuration

Prompt: "Select your data destination:"
- File
- Database
- Stream
- API

Similar to source

Step 4: Display a summary of all configurations

Prompt: "Would you like to proceed?"

The system will then generate the necessary code for the chosen configuration. The code will be production-ready: it will have the necessary configuration keys set and the correct methods called for extracting and loading data (in the case of ingestion). The necessary plugins and dependencies will be listed in the job's requirements.txt and installed automatically.

Step 5: Run the job

vdk run <job_name>

It is also important that this is a flexible, extensible framework that allows contributors to easily add new templates with custom workflows.

Possible implementation

Key Components:

  • Template Repository: A GitHub repository (vdk-templates) where contributors can add their own folder templates with the necessary files and description.
  • Configuration Metafile: Each folder (template) should have a config.meta file that describes the parameters and the workflow logic. This can be written in JSON or YAML.
  • CLI Interface: Enhance the existing VDK CLI to support the guided workflow by fetching all templates and then interpreting the config.meta file.
  • Notebook interface: Enhance the existing Notebook UI to support the guided workflow, similar to the CLI interface.

Template repository

Structure: Every template folder should contain:

  • The code template
  • A config.meta file
  • README.md for manual instructions

/template_folder
    /example_code_folder
    config.meta
    README.md

# config.meta
{
  "job_type": "Data Ingestion",
  "parameters": [
    {"name": "Database Host", "type": "string", "group_id": "database"},
    // ...
  ],
  "workflow_logic": "workflow.py" // optional, more below
}

For more complex, dynamic workflow logic, the workflow_logic key in config.meta can point to a Python script that's responsible for conditionally setting parameters.

# workflow.py
def execute_workflow(user_choices):
    if user_choices['source'] == 'API':
        # Do something
        pass
    else:
        # Do something else
        pass

Since both the CLI and the UI/Notebook need to be supported, we need to make sure the workflow logic is abstracted from the front end.
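
One way to abstract the workflow logic is to put all user interaction behind a small interface that the CLI and the Notebook each implement. The sketch below is an assumption about how that could look; Prompter, ScriptedPrompter, and run_workflow are hypothetical names, and the questions are taken from the example workflow above.

```python
from typing import Dict, List, Protocol


class Prompter(Protocol):
    """Front-end abstraction: the CLI and the notebook would each implement this."""

    def choose(self, question: str, options: List[str]) -> str: ...


class ScriptedPrompter:
    """Test double that answers from a predefined mapping instead of asking a user."""

    def __init__(self, answers: Dict[str, str]) -> None:
        self._answers = answers

    def choose(self, question: str, options: List[str]) -> str:
        answer = self._answers[question]
        if answer not in options:
            raise ValueError(f"{answer!r} is not one of {options}")
        return answer


def run_workflow(prompter: Prompter) -> Dict[str, str]:
    """Front-end independent workflow logic; it only talks to the Prompter."""
    choices = {"job_type": prompter.choose(
        "What type of job would you like to create?",
        ["Data Ingestion", "Data Transformation", "Data Validation", "Custom"],
    )}
    if choices["job_type"] == "Data Ingestion":
        choices["source"] = prompter.choose(
            "Select your data source type:",
            ["File", "Database", "Stream", "API", "Custom"],
        )
    return choices


choices = run_workflow(ScriptedPrompter({
    "What type of job would you like to create?": "Data Ingestion",
    "Select your data source type:": "Database",
}))
print(choices)
```

A scripted implementation like this also makes it possible to test a template's workflow without a terminal or a notebook.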

CLI/Notebook interface

  1. The CLI and Notebook should be able to fetch the list of available templates from the vdk-templates GitHub repo and present them to the user as options.

  2. Parse config.meta if there is one; if not, just copy the example job.

  • Interpret and validate config.meta for each template.
  • Present options and parameters to the user based on the config.meta.
  • Dynamic Workflow Logic: Optionally, execute a Python script (workflow.py mentioned above) to allow conditional logic based on user's choices.
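
A minimal sketch of the interpret-and-validate step, assuming config.meta is written as strict JSON. Note that strict JSON does not allow the // comments used in the illustration above, so the sample here omits them. Which fields are required is an assumption, and parse_config_meta and prompts_for are hypothetical helpers.

```python
import json
from typing import Any, Dict, List

# Assumption: every parameter must at least declare a name and a type.
REQUIRED_PARAMETER_FIELDS = {"name", "type"}


def parse_config_meta(text: str) -> Dict[str, Any]:
    """Parse and validate the content of a config.meta file written as JSON."""
    meta = json.loads(text)
    if "job_type" not in meta:
        raise ValueError("config.meta must define 'job_type'")
    for parameter in meta.get("parameters", []):
        missing = REQUIRED_PARAMETER_FIELDS - parameter.keys()
        if missing:
            raise ValueError(f"parameter {parameter!r} is missing {sorted(missing)}")
    return meta


def prompts_for(meta: Dict[str, Any]) -> List[str]:
    """Turn each declared parameter into the prompt text shown to the user."""
    return [f"{p['name']} ({p['type']}): " for p in meta.get("parameters", [])]


example = """
{
  "job_type": "Data Ingestion",
  "parameters": [
    {"name": "Database Host", "type": "string", "group_id": "database"}
  ],
  "workflow_logic": "workflow.py"
}
"""
meta = parse_config_meta(example)
print(prompts_for(meta))
```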

Adding a New Template/Example

  • Create a new folder in the vdk-templates GitHub repository.
  • Add the necessary code template.
  • Optionally, write a config.meta file that defines the parameters and workflow.
  • Optionally, add a workflow.py for dynamic logic.

TODO: evaluate also leveraging libraries like

Snippet Generation

Give users the ability to auto-generate code snippets based on keywords or in some other way.

More advanced: auto-generate code snippets based on the user's activity in the IDE to accelerate development.

Documentation (API Reference)

Provide the standard API reference documentation that users are used to. We could generate it using Sphinx or a similar tool.

3. Production-Ready Jobs

Store config.ini in CS Properties/Secrets

Currently, config.ini is stored in source control, making it difficult to maintain confidential or sensitive information securely.

We can transition to a more secure and centralized configuration by leveraging the VDK Control Service's Properties and Secrets API to store the VDK configuration.

To make the change smooth and ensure the user experience is preserved, the new workflow will allow users to still use config.ini for configuration. However, instead of committing this file to source control, we'll parse it and securely upload its contents to VDK Control Service.

Workflow:

  1. The user runs vdk deploy -p <directory> --env prod
  2. VDK reads the configuration from config.ini (or config.vdk.py) or config.prod.ini
  3. Instead of being uploaded to source control, the configuration is stored in Secrets or Properties, in a special part reserved for that purpose.
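
A minimal sketch of step 2 and the split needed for step 3, using the standard configparser module. It assumes the set of sensitive option names is known from the configuration metadata; split_config is a hypothetical helper, and the actual upload to the Control Service is intentionally left out.

```python
import configparser
from typing import Dict, Set, Tuple


def split_config(ini_text: str, sensitive_keys: Set[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split config.ini content into (properties, secrets) keyed by 'section.option'."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    properties: Dict[str, str] = {}
    secrets: Dict[str, str] = {}
    for section in parser.sections():
        for option, value in parser.items(section):
            # Sensitive options would go to Secrets, everything else to Properties.
            target = secrets if option in sensitive_keys else properties
            target[f"{section}.{option}"] = value
    return properties, secrets


ini_text = """
[vdk]
db_default_type = postgres
postgres_user = name
postgres_password = 1234
"""
properties, secrets = split_config(ini_text, sensitive_keys={"postgres_password"})
print(sorted(properties), sorted(secrets))
```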

What if a user wants to keep config.ini in their own source control? They can still do that. We could provide a vdk obfuscate-config command to obfuscate only the sensitive values.

New Commands that may be introduced

  • vdk generate-config: To generate a new config.ini.
  • vdk upload-config: To upload the parsed configuration to the centralized system.

For this to happen we need dynamic configuration. See the research here for more: https://github.com/vmware/versatile-data-kit/wiki/Research:-Dynamic-Configuration

4. Environment Variables

Fix Documentation

Remove the promotion of environment variables from the documentation. Search for all environment variables mentioned and replace them with config.ini.

Provide machine level Global Settings:

The problem with using config.ini is that it is per data job, while users have many jobs that share common configuration (e.g. database settings).

  • [CLI] We should introduce more global settings using a ~/.vdk/config file
  • [Notebook] Use Jupyter Settings

Precedence Rules

As the number of configuration providers rises, establish and document rules for what takes precedence when the same setting comes from more than one of them (e.g. both env vars and properties).
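
A minimal sketch of one possible precedence rule, using collections.ChainMap from the standard library. The order shown (environment variables, then the job's config.ini, then the global ~/.vdk/config, then defaults) is only an assumption for illustration; the actual order is exactly what still needs to be decided and documented.

```python
from collections import ChainMap
from typing import Mapping


def resolve_config(
    env_vars: Mapping[str, str],
    job_config: Mapping[str, str],
    global_config: Mapping[str, str],
    defaults: Mapping[str, str],
) -> ChainMap:
    """Earlier mappings win: env vars > job config.ini > ~/.vdk/config > defaults."""
    return ChainMap(dict(env_vars), dict(job_config), dict(global_config), dict(defaults))


config = resolve_config(
    env_vars={"db_host": "env-host"},
    job_config={"db_host": "job-host", "db_user": "job-user"},
    global_config={"db_user": "global-user", "db_port": "5432"},
    defaults={"db_port": "5433", "db_schema": "public"},
)
print(config["db_host"], config["db_user"], config["db_port"], config["db_schema"])
```

Expressing the rule as an ordered list of providers keeps it easy to document and to extend when a new provider is added.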

5. IDE Support

Non-CLI entry point option

Allow running jobs within the IDE without going through the VDK CLI. This can be facilitated by enabling a method like StandaloneDataJob().run() in the main Python file.

Implementation details:

Define a class StandaloneDataJob with a run() method. Internally, it can call the necessary hooks and perform the setup required by VDK. Developers would then use it like this:

def main():
    result = StandaloneDataJob().run()