
ppl project table command #936

Open · wants to merge 22 commits into base: main

Conversation

@YANG-DB (Member) commented Nov 20, 2024

Description

Add a project table command to allow materializing queries into a concrete table/view that can later be efficiently queried or stored as an OpenSearch MV.

PPL project command

Overview

Use the project command to materialize a query into a dedicated view:
In some cases it is required to construct a projection (a materialized view) of the query results.
This projection can later be used as a source for continued queries that further slice and dice the data. In addition, such tables can also be saved as an MV table that is pushed into OpenSearch, where it can be used for visualization and for enhanced, performant queries.

The command can also function as an ETL process, where the original datasource is transformed and ingested into the output projected view using the PPL transformation and aggregation operators.

Syntax

PROJECT (IF NOT EXISTS)? viewName (USING datasource)? (OPTIONS optionsList)? (PARTITIONED BY partitionColumnNames)? location?

  • viewName
    Specifies a view name, which may optionally be qualified with a database name (see the sketch after this list).

  • USING datasource
    Data Source is the input format used to create the table. Data source can be CSV, TXT, ORC, JDBC, PARQUET, etc.

  • OPTIONS optionsList
    Specifies a set of key-value pairs used to configure the data source. These options vary depending on the chosen data source and may include properties such as file paths, authentication details, format-specific parameters, etc.

  • PARTITIONED BY
    Specifies the columns on which the data should be partitioned. Partitioning splits the data into separate logical divisions based on distinct values of the specified column(s), which can optimize query performance.

  • location
    Specifies the physical location where the view or table data is stored. This could be a path in a distributed file system such as HDFS, S3 object storage, or a local filesystem.

  • QUERY
    The outcome view (viewName) is populated with the data returned by the query.
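
A minimal sketch of the syntax above, showing the optional IF NOT EXISTS clause and a database-qualified view name (mydb and newView are placeholder names, not taken from the examples below; the query portion mirrors the fields of the first example):

project if not exists mydb.newView using parquet | source = table | fields fieldA, fieldB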

Usage Guidelines

The project command produces a view based on the rows returned by the query.
Any query can be used in the AS <query> statement, but attention must be paid to the data volume and compute cost that such queries may incur.

As a precaution, an explain cost | source = table | ... can be run prior to the project statement to get a better estimate.
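
For instance, a rough cost estimate for the first example below could be obtained up front; this is a sketch that simply reuses that example's pipeline with the explain cost form mentioned above:

explain cost | source = table | where fieldA > value | stats count(fieldA) by fieldB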

Examples:

project newTableName using csv | source = table | where fieldA > value | stats count(fieldA) by fieldB

project ipRanges using parquet | source = table | where isV6 = true | eval inRange = case(cidrmatch(ipAddress, '2003:db8::/32'), 'in' else 'out') | fields ip, inRange

project avgBridgesByCountry using json | source = table | fields country, bridges | flatten bridges | fields country, length | stats avg(length) as avg by country

project ageDistribByCountry using parquet partitioned by (age, country)  |
       source = table | stats avg(age) as avg_city_age by country, state, city | eval new_avg_city_age = avg_city_age - 1 | stats 
            avg(new_avg_city_age) as avg_state_age by country, state | where avg_state_age > 18 | stats avg(avg_state_age) as 
            avg_adult_country_age by country

project ageDistribByCountry using parquet OPTIONS('parquet.bloom.filter.enabled'='true', 'parquet.bloom.filter.enabled#age'='false') partitioned by (age, country) |
       source = table | stats avg(age) as avg_city_age by country, state, city | eval new_avg_city_age = avg_city_age - 1 | stats 
            avg(new_avg_city_age) as avg_state_age by country, state | where avg_state_age > 18 | stats avg(avg_state_age) as 
            avg_adult_country_age by country

project ageDistribByCountry using parquet OPTIONS('parquet.bloom.filter.enabled'='true', 'parquet.bloom.filter.enabled#age'='false') partitioned by (age, country) location 's3://demo-app/my-bucket' |
       source = table | stats avg(age) as avg_city_age by country, state, city | eval new_avg_city_age = avg_city_age - 1 | stats 
            avg(new_avg_city_age) as avg_state_age by country, state | where avg_state_age > 18 | stats avg(avg_state_age) as 
            avg_adult_country_age by country

The same guidelines apply to the as <query> form, shown in the following examples:

project newTableName as |
   source = table | where fieldA > value | stats count(fieldA) by fieldB

project ipRanges as |
       source = table | where isV6 = true | eval inRange = case(cidrmatch(ipAddress, '2003:db8::/32'), 'in' else 'out') | fields ip, inRange

project avgBridgesByCountry as |
       source = table | fields country, bridges | flatten bridges | fields country, length | stats avg(length) as avg by country

project ageDistribByCountry as |
       source = table | stats avg(age) as avg_city_age by country, state, city | eval new_avg_city_age = avg_city_age - 1 | stats 
            avg(new_avg_city_age) as avg_state_age by country, state | where avg_state_age > 18 | stats avg(avg_state_age) as 
            avg_adult_country_age by country

Effective SQL push-down query

The project command is translated into an equivalent SQL create table <viewName> [Using <datasource>] As <statement>, as shown here:

CREATE TABLE [ IF NOT EXISTS ] table_identifier
    [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
    USING data_source
    [ OPTIONS ( key1=val1, key2=val2, ... ) ]
    [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
    [ CLUSTERED BY ( col_name3, col_name4, ... ) 
        [ SORTED BY ( col_name [ ASC | DESC ], ... ) ] 
        INTO num_buckets BUCKETS ]
    [ LOCATION path ]
    [ COMMENT table_comment ]
    [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
    [ AS select_statement ]
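
As an illustrative sketch (not actual planner output), the first example above, project newTableName using csv | source = table | where fieldA > value | stats count(fieldA) by fieldB, might be pushed down roughly as:

-- sketch only: assumes the CREATE TABLE ... AS SELECT template shown above
CREATE TABLE newTableName
    USING csv
    AS SELECT fieldB, count(fieldA)
       FROM table
       WHERE fieldA > value
       GROUP BY fieldB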

References

Related Issues

#928

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ete table / view that can later be efficiently queried or stored in OpenSearch MV

Signed-off-by: YANGDB <[email protected]>
@YANG-DB marked this pull request as draft November 20, 2024 06:44
@YANG-DB changed the title from add project table command to ppl project table command Nov 20, 2024
@YANG-DB added the Lang:PPL (Pipe Processing Language support), 0.6, and 0.7 labels Nov 20, 2024
@YANG-DB marked this pull request as ready for review November 26, 2024 22:16
@LantaoJin (Member) commented Nov 27, 2024

Haven't gone into the implementation details yet. One high-level question:
A DDL such as .create materialized view my_view as is defined starting with a dot. The project command seems like a DDL too. Should it be moved to be part of the .create DDL?

Besides, IMO the name dataset or table would be more appropriate than project.

@ykmr1224 (Collaborator) commented:

This raises the question of whether we want to extend PPL to DDL. My idea was that PPL focuses on DQL.
@dai-chen @penghuo Any thoughts?

@dai-chen (Collaborator) commented:

This raises the question of whether we want to extend PPL to DDL. My idea was that PPL focuses on DQL. @dai-chen @penghuo Any thoughts?

I'm thinking about the same. It's not clear in which cases users would have to run DDL in PPL.

I know Kusto supports management commands. Is that because they don't support SQL? https://learn.microsoft.com/en-us/kusto/query/?view=microsoft-fabric#management-commands

@YANG-DB (Member, Author) commented Nov 28, 2024

This raises the question of whether we want to extend PPL to DDL. My idea was that PPL focuses on DQL. @dai-chen @penghuo Any thoughts?

@LantaoJin @dai-chen @ykmr1224
The main goal here is to allow a similar syntax in PPL to match SQL's create table as select ...
It is useful in the sense that when users are familiar with PPL and use it to query the data, they eventually would like to materialize the query results for many purposes (views, reports, pre-calculated joins, summaries, and such), and they would need to take the PPL query and translate it into SQL in order to materialize it, which can be very lengthy and difficult in some cases.

IMO we should review each DDL command to see if it can save time/effort/learning for the customer, and if so we should add it. The goal here is to create a fully functional language that is a one-stop shop, and not to force users back to SQL when they need some missing functionality.

In addition, other pipeline languages (such as Splunk) do offer DDL commands:
outputcsv
outputlookup
meventcollect

@YANG-DB (Member, Author) commented Dec 2, 2024

@dai-chen @ykmr1224
Can you please review my latest comments?

@YANG-DB (Member, Author) commented Dec 2, 2024

Haven't gone into the implementation details yet. One high-level question: A DDL such as .create materialized view my_view as is defined starting with a dot. The project command seems like a DDL too. Should it be moved to be part of the .create DDL?

Besides, IMO the name dataset or table would be more appropriate than project.

Hi @LantaoJin
Can you please point me to where the .create... syntax (starting with the .) is defined?
I haven't seen it in our FlintSparkSqlExtensions.g4 file.
Thanks

@LantaoJin (Member) commented:

Haven't gone into the implementation details yet. One high-level question: A DDL such as .create materialized view my_view as is defined starting with a dot. The project command seems like a DDL too. Should it be moved to be part of the .create DDL?
Besides, IMO the name dataset or table would be more appropriate than project.

Hi @LantaoJin Can you please point me to where the .create... syntax (starting with the .) is defined? I haven't seen it in our FlintSparkSqlExtensions.g4 file. Thanks

This syntax was mentioned in the PPL vision doc. I sent it to you offline.

@LantaoJin (Member) commented Dec 4, 2024

In addition, other pipeline languages (such as Splunk) do offer DDL commands:
outputcsv
outputlookup
meventcollect

Hmm, these search output commands are DML commands IMO. They might be equivalent to an INSERT INTO SELECT SQL query, which is DML. @YANG-DB

@LantaoJin (Member) commented:

The main goal here is to allow a similar syntax in PPL to match SQL's create table as select ...
It is useful in the sense that when users are familiar with PPL and use it to query the data, they eventually would like to materialize the query results for many purposes...

I agree that the DDL-like PPL commands introduce fundamental capabilities. Before delivering this totally new concept into the PPL syntax, I have several questions:

  1. What is the proposed architecture for ACL integration with DDL? Will introducing DDL in PPL influence future ACL implementation?
  2. Will this PPL command be compatible with PPL-on-OpenSearch? What is its complexity?
  3. Have we seen any requests from clients, or any examples from other pipeline languages?

@YANG-DB removed the 0.6 label Dec 4, 2024
@YANG-DB (Member, Author) commented Dec 13, 2024

The main goal here is to allow a similar syntax in PPL to match SQL's create table as select ...
It is useful in the sense that when users are familiar with PPL and use it to query the data, they eventually would like to materialize the query results for many purposes...

I agree that the DDL-like PPL commands introduce fundamental capabilities. Before delivering this totally new concept into the PPL syntax, I have several questions:

  1. What is the proposed architecture for ACL integration with DDL? Will introducing DDL in PPL influence future ACL implementation?
  2. Will this PPL command be compatible with PPL-on-OpenSearch? What is its complexity?
  3. Have we seen any requests from clients, or any examples from other pipeline languages?
My answers:

  1. The same ACL that is used by the create MV / skipping index / covering index commands we use today.
  2. It will be compatible, as it represents a transform or reindex DSL command.
  3. SPL has a similar approach, as previously mentioned, including the collect command.

@LantaoJin I've changed the syntax from project to view as you recommended.
Thanks

@penghuo (Collaborator) commented Jan 16, 2025

Will this PPL command be compatible with PPL-on-OpenSearch? What is its complexity?

+1, as we all agree on a unified OpenSearch PPL experience. @YANG-DB we should also consider the OpenSearch experience.

A DDL such as .create materialized view my_view as is defined starting with a dot. The project command seems like a DDL too. Should it be moved to be part of the .create DDL?

Kusto DDL commands start with the . prefix; maybe that is convenient for the client/front-end to differentiate between DQL and DDL commands?

Labels: 0.7, Lang:PPL (Pipe Processing Language support)
Projects: None yet
5 participants