update spark-docker example with jupyter tutorial & notebook #1001

Draft · wants to merge 6 commits into base: main
4 changes: 2 additions & 2 deletions README.md
@@ -19,7 +19,7 @@ Please refer to the [Flint Index Reference Manual](./docs/index.md) for more inf

* For additional details on Spark PPL commands project, see [PPL Project](https://github.com/orgs/opensearch-project/projects/214/views/2)

- * Experiment ppl queries on local spark cluster [PPL on local spark ](docs/ppl-lang/local-spark-ppl-test-instruction.md)
+ * Experiment ppl queries on local spark cluster [PPL on local spark ](docs/local-spark-ppl-test-instruction.md)

## Prerequisites

@@ -88,7 +88,7 @@ bin/spark-shell --packages "org.opensearch:opensearch-spark-ppl_2.12:0.7.0-SNAPS
```

### PPL Run queries on a local spark cluster
- See ppl usage sample on local spark cluster [PPL on local spark ](docs/ppl-lang/local-spark-ppl-test-instruction.md)
+ See ppl usage sample on local spark cluster [PPL on local spark ](docs/local-spark-ppl-test-instruction.md)

### Running integration tests on a local spark cluster
See integration test documentation [Docker Integration Tests](integ-test/script/README.md)
12 changes: 12 additions & 0 deletions docker/apache-spark-iceberg/.env
@@ -0,0 +1,12 @@
MASTER_UI_PORT=8080
MASTER_PORT=7077
UI_PORT=4040
PPL_JAR=../../sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-0.7.0-SNAPSHOT.jar
FLINT_JAR=../../flint-spark-integration/target/scala-2.12/flint-spark-integration-assembly-0.7.0-SNAPSHOT.jar
OPENSEARCH_VERSION=latest
DASHBOARDS_VERSION=latest

OPENSEARCH_NODE_MEMORY=512m
OPENSEARCH_ADMIN_PASSWORD=C0rrecthorsebatterystaple.
OPENSEARCH_PORT=9200
OPENSEARCH_DASHBOARDS_PORT=5601
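
The `PPL_JAR` and `FLINT_JAR` paths above assume the assembly JARs were built locally from the repository root beforehand. As a hedged sketch of how this file is consumed: Compose reads `.env` from the project directory, and variables exported in the shell take precedence over it (standard Compose behavior, not specific to this PR):

```
# override .env defaults for a single invocation (example values)
OPENSEARCH_PORT=9201 OPENSEARCH_VERSION=2.17.0 docker compose up -d

# print the fully resolved compose configuration without starting anything
docker compose config
```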
50 changes: 50 additions & 0 deletions docker/apache-spark-iceberg/README.md
@@ -0,0 +1,50 @@
# Sanity Test OpenSearch Spark PPL
This document shows how to locally test OpenSearch PPL commands on top of Spark Iceberg, using a docker-compose setup based on the open-source [docker-spark-iceberg](https://github.com/databricks/docker-spark-iceberg) repository.

Additional instructions for running the original test harness are available [here](https://iceberg.apache.org/spark-quickstart/).

## Running Docker Compose
To start the stack, run: `docker compose up -d`

See additional instructions [here](../../docs/spark-docker.md)
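
Once the containers are up, a quick sanity check (a sketch — container names, ports, and the admin password come from the compose file and `.env` in this directory):

```
# list the services and their state
docker compose ps

# OpenSearch health, using the admin password from .env (plain HTTP, since SSL is disabled in the compose file)
curl -u admin:C0rrecthorsebatterystaple. http://localhost:9200/_cluster/health

# Jupyter from the tabulario/spark-iceberg image is published at http://localhost:8888
```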

## Running Spark Shell

You can run `spark-sql` on the master node using `docker exec`:

```
docker exec -it spark-tutorial /opt/spark/bin/spark-sql
```

In the spark-sql shell, [run the following create table statements](../../docs/local-spark-ppl-test-instruction.md#testing-ppl-commands).

PPL commands can now be [run](../../docs/local-spark-ppl-test-instruction.md#test-grok--top-commands-combination) on top of the newly created table.

### Using Iceberg Tables
The following example uses an [Iceberg](https://iceberg.apache.org/) table:
```sql
CREATE TABLE iceberg_table (
  id INT,
  name STRING,
  age INT,
  city STRING
)
USING iceberg
PARTITIONED BY (city)
LOCATION 'file:/tmp/iceberg-tables/default/iceberg_table';

INSERT INTO iceberg_table VALUES
  (1, 'Alice', 30, 'New York'),
  (2, 'Bob', 25, 'San Francisco'),
  (3, 'Charlie', 35, 'New York'),
  (4, 'David', 40, 'Chicago'),
  (5, 'Eve', 28, 'San Francisco');
```

### PPL queries
```sql
source=`default`.`iceberg_table`;
source=`default`.`iceberg_table` | where age > 30 | fields id, name, age, city | sort - age;
source=`default`.`iceberg_table` | where age > 30 | stats count() by city;
source=`default`.`iceberg_table` | stats avg(age) by city;
```
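
The same queries can also be run non-interactively with the standard `spark-sql -e` flag. A minimal sketch, assuming the stack is running and the table above was created (PPL support depends on the extensions configured in the mounted spark-defaults.conf):

```
docker exec -it spark-tutorial /opt/spark/bin/spark-sql \
  -e 'source=`default`.`iceberg_table` | stats avg(age) by city;'
```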
136 changes: 136 additions & 0 deletions docker/apache-spark-iceberg/docker-compose.yml
@@ -0,0 +1,136 @@
version: "3"

services:
  spark-tutorial:
    image: tabulario/spark-iceberg
    container_name: spark-tutorial
    networks:
      iceberg_net:
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/PPL
      - type: bind
        source: ./spark-defaults.conf
        target: /opt/spark/conf/spark-defaults.conf
      - type: bind
        source: $PPL_JAR
        target: /opt/spark/jars/ppl-spark-integration.jar
      - type: bind
        source: $FLINT_JAR
        target: /opt/spark/jars/flint-spark-integration.jar
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8080:8080
      - 10000:10000
      - 10001:10001
  rest:
    image: apache/iceberg-rest-fixture
    container_name: iceberg-rest
    networks:
      iceberg_net:
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    volumes:
      - minio-data:/data
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      tail -f /dev/null
      "
  opensearch:
    image: opensearchproject/opensearch:${OPENSEARCH_VERSION:-latest}
    container_name: opensearch
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch
      - discovery.seed_hosts=opensearch
      - cluster.initial_cluster_manager_nodes=opensearch
      - bootstrap.memory_lock=true
      - plugins.security.ssl.http.enabled=false
      - OPENSEARCH_JAVA_OPTS=-Xms${OPENSEARCH_NODE_MEMORY:-512m} -Xmx${OPENSEARCH_NODE_MEMORY:-512m}
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_ADMIN_PASSWORD}
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data:/usr/share/opensearch/data
      - ./opensearch/opensearch.yml:/usr/share/opensearch/config/opensearch.yml
      - ./opensearch/security/config:/usr/share/opensearch/plugins/opensearch-security/securityconfig
    ports:
      - ${OPENSEARCH_PORT:-9200}:9200
      - 9600:9600
    expose:
      - "${OPENSEARCH_PORT:-9200}"
    healthcheck:
      test: ["CMD", "curl", "-f", "-u", "admin:${OPENSEARCH_ADMIN_PASSWORD}", "http://localhost:9200/_cluster/health"]
      interval: 1m
      timeout: 5s
      retries: 3
    networks:
      iceberg_net:
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:${DASHBOARDS_VERSION}
    container_name: opensearch-dashboards
    ports:
      - ${OPENSEARCH_DASHBOARDS_PORT:-5601}:5601
    expose:
      - "${OPENSEARCH_DASHBOARDS_PORT:-5601}"
    environment:
      OPENSEARCH_HOSTS: '["http://opensearch:9200"]'
    depends_on:
      - opensearch
    networks:
      iceberg_net:

networks:
  iceberg_net:

volumes:
  opensearch-data:
  minio-data:
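
A brief usage note for the compose file above (standard Compose lifecycle commands, shown here as a sketch rather than part of the PR):

```
# start everything in the background; .env in this directory is read automatically
docker compose up -d

# follow logs of the Spark/Jupyter container
docker compose logs -f spark-tutorial

# tear down; add -v to also remove the opensearch-data and minio-data volumes
docker compose down
docker compose down -v
```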