
ppl project table command #936

Open · wants to merge 22 commits into base: main

Conversation

@YANG-DB (Member) commented Nov 20, 2024

Description

Add a project table command to allow materializing queries into a concrete table/view that can later be efficiently queried or stored as an OpenSearch MV.

PPL project command

Overview

Use the project command to materialize a query into a dedicated view:
In some cases it is required to construct a projection (a materialized view) of the query results.
This projection can later be used as a source for continued queries that further slice and dice the data. In addition, such tables can also be saved as an MV table that is pushed into OpenSearch, where it can be used for visualization and for enhanced, performant queries.

The command can also function as an ETL process, where the original datasource is transformed and ingested into the output projected view using the PPL transformation and aggregation operators.

Syntax

PROJECT (IF NOT EXISTS)? viewName (USING datasource)? (OPTIONS optionsList)? (PARTITIONED BY partitionColumnNames)? location?

  • viewName
    Specifies a view name, which may optionally be qualified with a database name (see the sketch after this list).

  • USING datasource
    Data Source is the input format used to create the table. Data source can be CSV, TXT, ORC, JDBC, PARQUET, etc.

  • OPTIONS optionsList
    Specifies a set of key-value pairs used to configure the data source. These options vary depending on the chosen data source and may include properties such as file paths, authentication details, format-specific parameters, etc.

  • PARTITIONED BY
    Specifies the columns on which the data should be partitioned. Partitioning splits the data into separate logical divisions based on distinct values of the specified column(s), which can optimize query performance.

  • location
    Specifies the physical location where the view or table data is stored. This could be a path in a distributed file system such as HDFS, S3 object storage, or a local filesystem.

  • QUERY
    The outcome view (viewName) is populated with the data returned by the query.
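
A minimal sketch of the syntax above, showing the optional IF NOT EXISTS clause and a database-qualified view name (mydb and newView are placeholder names, not taken from the examples below; the query portion mirrors the fields of the first example):

project if not exists mydb.newView using parquet | source = table | fields fieldA, fieldB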

Usage Guidelines

The project command produces a view based on the rows returned by the query.
Any query can be used in the AS <query> statement, but attention must be paid to the data volume and compute cost that such queries may incur.

As a precaution, an explain cost | source = table | ... can be run prior to the project statement to get a better estimate.
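
For instance, a rough cost estimate for the first example below could be obtained up front; this is a sketch that simply reuses that example's pipeline with the explain cost form mentioned above:

explain cost | source = table | where fieldA > value | stats count(fieldA) by fieldB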

Examples:

project newTableName using csv | source = table | where fieldA > value | stats count(fieldA) by fieldB

project ipRanges using parquet | source = table | where isV6 = true | eval inRange = case(cidrmatch(ipAddress, '2003:db8::/32'), 'in' else 'out') | fields ip, inRange

project avgBridgesByCountry using json | source = table | fields country, bridges | flatten bridges | fields country, length | stats avg(length) as avg by country

project ageDistribByCountry using parquet partitioned by (age, country)  |
       source = table | stats avg(age) as avg_city_age by country, state, city | eval new_avg_city_age = avg_city_age - 1 | stats 
            avg(new_avg_city_age) as avg_state_age by country, state | where avg_state_age > 18 | stats avg(avg_state_age) as 
            avg_adult_country_age by country

project ageDistribByCountry using parquet OPTIONS('parquet.bloom.filter.enabled'='true', 'parquet.bloom.filter.enabled#age'='false') partitioned by (age, country) |
       source = table | stats avg(age) as avg_city_age by country, state, city | eval new_avg_city_age = avg_city_age - 1 | stats 
            avg(new_avg_city_age) as avg_state_age by country, state | where avg_state_age > 18 | stats avg(avg_state_age) as 
            avg_adult_country_age by country

project ageDistribByCountry using parquet OPTIONS('parquet.bloom.filter.enabled'='true', 'parquet.bloom.filter.enabled#age'='false') partitioned by (age, country) location 's3://demo-app/my-bucket' |
       source = table | stats avg(age) as avg_city_age by country, state, city | eval new_avg_city_age = avg_city_age - 1 | stats 
            avg(new_avg_city_age) as avg_state_age by country, state | where avg_state_age > 18 | stats avg(avg_state_age) as 
            avg_adult_country_age by country

The same guidelines apply to the as <query> form, shown in the following examples:

project newTableName as |
   source = table | where fieldA > value | stats count(fieldA) by fieldB

project ipRanges as |
       source = table | where isV6 = true | eval inRange = case(cidrmatch(ipAddress, '2003:db8::/32'), 'in' else 'out') | fields ip, inRange

project avgBridgesByCountry as |
       source = table | fields country, bridges | flatten bridges | fields country, length | stats avg(length) as avg by country

project ageDistribByCountry as |
       source = table | stats avg(age) as avg_city_age by country, state, city | eval new_avg_city_age = avg_city_age - 1 | stats 
            avg(new_avg_city_age) as avg_state_age by country, state | where avg_state_age > 18 | stats avg(avg_state_age) as 
            avg_adult_country_age by country

Effective SQL push-down query

The project command is translated into an equivalent SQL create table <viewName> [Using <datasource>] As <statement>, as shown here:

CREATE TABLE [ IF NOT EXISTS ] table_identifier
    [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
    USING data_source
    [ OPTIONS ( key1=val1, key2=val2, ... ) ]
    [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
    [ CLUSTERED BY ( col_name3, col_name4, ... ) 
        [ SORTED BY ( col_name [ ASC | DESC ], ... ) ] 
        INTO num_buckets BUCKETS ]
    [ LOCATION path ]
    [ COMMENT table_comment ]
    [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
    [ AS select_statement ]
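
As an illustrative sketch (not actual planner output), the first example above, project newTableName using csv | source = table | where fieldA > value | stats count(fieldA) by fieldB, might be pushed down roughly as:

-- sketch only: assumes the CREATE TABLE ... AS SELECT template shown above
CREATE TABLE newTableName
    USING csv
    AS SELECT fieldB, count(fieldA)
       FROM table
       WHERE fieldA > value
       GROUP BY fieldB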

References

Related Issues

#928

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ete table / view that can later be efficiently queried or stored in OpenSearch MV

Signed-off-by: YANGDB <[email protected]>
@YANG-DB marked this pull request as draft November 20, 2024 06:44
@YANG-DB changed the title from add project table command to ppl project table command Nov 20, 2024
@YANG-DB added the Lang:PPL (Pipe Processing Language support), 0.6, and 0.7 labels Nov 20, 2024
@YANG-DB marked this pull request as ready for review November 26, 2024 22:16
@LantaoJin (Member) commented Nov 27, 2024

Haven't gone into the implementation details yet. One high-level question:
A DDL such as .create materialized view my_view as is defined starting with a dot. The project command seems like a DDL too. Should it be moved to be part of the .create DDL?

Besides, IMO the name dataset or table would be more appropriate than project.

@ykmr1224 (Collaborator) commented:

This raises the question of whether we want to extend PPL to DDL. My idea was that PPL focuses on DQL.
@dai-chen @penghuo Any thoughts?

@dai-chen (Collaborator) commented:

This raises the question of whether we want to extend PPL to DDL. My idea was that PPL focuses on DQL. @dai-chen @penghuo Any thoughts?

I'm thinking about the same. It's not clear in which cases users would have to run DDL in PPL.

I know Kusto supports management commands. Is that because they don't support SQL? https://learn.microsoft.com/en-us/kusto/query/?view=microsoft-fabric#management-commands

@YANG-DB (Member, Author) commented Nov 28, 2024

This raises the question of whether we want to extend PPL to DDL. My idea was that PPL focuses on DQL. @dai-chen @penghuo Any thoughts?

@LantaoJin @dai-chen @ykmr1224
The main goal here is to allow a similar syntax in PPL to match SQL's create table as select ...
It is useful in the sense that when users are familiar with PPL and use it to query the data, they eventually would like to materialize the query results for many purposes (views, reports, pre-calculated joins, summaries, and such), and they would need to take the PPL query and translate it into SQL in order to materialize it, which can be very lengthy and difficult in some cases.

IMO we should review each DDL command to see if it can save time/effort/learning for the customer, and if so we should add it. The goal here is to create a fully functional language that is a one-stop shop, and not to force users back to SQL when they need some missing functionality.

In addition, other pipeline languages (such as Splunk) do offer DDL commands:
outputcsv
outputlookup
meventcollect

@YANG-DB (Member, Author) commented Dec 2, 2024

@dai-chen @ykmr1224
Can you please review my latest comments?

@YANG-DB (Member, Author) commented Dec 2, 2024

Haven't gone into the implementation details yet. One high-level question: A DDL such as .create materialized view my_view as is defined starting with a dot. The project command seems like a DDL too. Should it be moved to be part of the .create DDL?

Besides, IMO the name dataset or table would be more appropriate than project.

Hi @LantaoJin
Can you please point me to where the .create... syntax (starting with the .) is defined?
I haven't seen it in our FlintSparkSqlExtensions.g4 file.
Thanks

@LantaoJin (Member) commented:

Haven't gone into the implementation details yet. One high-level question: A DDL such as .create materialized view my_view as is defined starting with a dot. The project command seems like a DDL too. Should it be moved to be part of the .create DDL?
Besides, IMO the name dataset or table would be more appropriate than project.

Hi @LantaoJin Can you please point me to where the .create... syntax (starting with the .) is defined? I haven't seen it in our FlintSparkSqlExtensions.g4 file. Thanks

This syntax was mentioned in the PPL vision doc. I sent it to you offline.

@LantaoJin (Member) commented Dec 4, 2024

In addition, other pipeline languages (such as Splunk) do offer DDL commands:
outputcsv
outputlookup
meventcollect

Hmm, these search output commands are DML commands IMO. They might be equivalent to an INSERT INTO SELECT SQL query, which is DML. @YANG-DB

@LantaoJin (Member) commented:

The main goal here is to allow a similar syntax in PPL to match SQL's create table as select ...
It is useful in the sense that when users are familiar with PPL and use it to query the data, they eventually would like to materialize the query results for many purposes...

I agree that the DDL-like PPL commands introduce fundamental capabilities. Before delivering this totally new concept into the PPL syntax, I have several questions:

  1. What is the proposed architecture for ACL integration with DDL? Will introducing DDL in PPL influence future ACL implementation?
  2. Will this PPL command be compatible with PPL-on-OpenSearch? What is its complexity?
  3. Have we seen any requests from clients, or any examples from other pipeline languages?

@YANG-DB removed the 0.6 label Dec 4, 2024
@YANG-DB (Member, Author) commented Dec 13, 2024

The main goal here is to allow a similar syntax in PPL to match SQL's create table as select ...
It is useful in the sense that when users are familiar with PPL and use it to query the data, they eventually would like to materialize the query results for many purposes...

I agree that the DDL-like PPL commands introduce fundamental capabilities. Before delivering this totally new concept into the PPL syntax, I have several questions:

  1. What is the proposed architecture for ACL integration with DDL? Will introducing DDL in PPL influence future ACL implementation?
  2. Will this PPL command be compatible with PPL-on-OpenSearch? What is its complexity?
  3. Have we seen any requests from clients, or any examples from other pipeline languages?
My answers:

  1. The same ACL that is used by the create MV / skipping index / covering index commands we use today.
  2. It will be compatible, as it represents a transform or reindex DSL command.
  3. SPL has a similar approach, as previously mentioned, including the collect command.

@LantaoJin I've changed the syntax from project to view as you recommended.
Thanks

@penghuo (Collaborator) commented Jan 16, 2025

Will this PPL command be compatible with PPL-on-OpenSearch? What is its complexity?

+1, as we all agree on a unified OpenSearch PPL experience. @YANG-DB we should also consider the OpenSearch experience.

A DDL such as .create materialized view my_view as is defined starting with a dot. The project command seems like a DDL too. Should it be moved to be part of the .create DDL?

Kusto DDL commands start with the . prefix; maybe that is convenient for the client/front-end to differentiate between DQL and DDL commands?

Labels: 0.7, Lang:PPL (Pipe Processing Language support)
Projects: None yet
5 participants