# Performance Tuning

Processing performance is always an important topic in data transformation, and Flowman is no exception. Several
configuration settings can improve overall performance: some are well-known Apache Spark parameters, while others
are specific to Flowman.

## Spark Parameters

Since Flowman is based on Apache Spark, all performance tuning strategies that apply to Spark also apply to Flowman.
You can specify almost all settings either in the [`default-namespace.yml` file](../setup/config.md) or in the
`config` section of any other project file. The most important settings are probably the following:

```yaml
config:
  # Use 8 CPU cores per Spark executor
  - spark.executor.cores=8
  # Allocate 54 GB RAM per Spark executor
  - spark.executor.memory=54g
  # Only keep up to 200 jobs in the Spark web UI
  - spark.ui.retainedJobs=200
  # Use 400 partitions in shuffle operations
  - spark.sql.shuffle.partitions=400
  # Number of executors to allocate
  - spark.executor.instances=2
  # Memory overhead as safety margin
  - spark.executor.memoryOverhead=1G
```

Often it is a good idea to make these properties easily configurable via system environment variables:

```yaml
config:
  - spark.executor.cores=$System.getenv('SPARK_EXECUTOR_CORES', '8')
  - spark.executor.memory=$System.getenv('SPARK_EXECUTOR_MEMORY', '54g')
  - spark.ui.retainedJobs=$System.getenv('RETAINED_JOBS', '200')
  - spark.sql.shuffle.partitions=$System.getenv('SPARK_PARTITIONS', '400')
```

## Flowman Parameters

In addition to the classical Spark tuning parameters, Flowman offers advanced functionality that can cut down
processing overhead by parallelizing target execution and mapping instantiation. This does not speed up the data
processing itself, but it hides some of the expensive Spark planning costs, which may involve querying the Hive
metastore or remote file systems, both of which are known to be slow.

```yaml
config:
  # Enable building multiple targets in parallel
  - flowman.execution.executor.class=com.dimajix.flowman.execution.ParallelExecutor
  # Build up to 4 targets in parallel
  - flowman.execution.executor.parallelism=4
  # Instantiate up to 16 mappings in parallel
  - flowman.execution.mapping.parallelism=16
```
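The benefit of such parallelism is easiest to see with a generic sketch (this is plain Python, not Flowman code): when several independent, latency-bound planning steps overlap, the total wall-clock time approaches the slowest single step rather than the sum of all steps.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def build_target(name):
    time.sleep(0.05)          # stand-in for a slow metastore/file-system planning call
    return f"{name}: built"

targets = ["t1", "t2", "t3", "t4"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:   # roughly analogous to executor.parallelism=4
    results = list(pool.map(build_target, targets))
elapsed = time.perf_counter() - start
# With 4 workers the four 50 ms waits overlap, so elapsed stays well below
# the 200 ms a sequential loop would need.
```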
# Iterative SQL Mapping

The `iterativeSql` mapping iteratively executes a SQL transformation containing Spark SQL code. The iteration
stops once the data no longer changes.

## Example

The following example detects trees within a company hierarchy table that provides simple parent-child relations.
The objective of the query is to assign a separate ID to each company tree. The query essentially propagates the
`tree_id` from each parent down to its direct children. This step is repeated until the `tree_id` of the root
companies (those without a parent) has been propagated down to the leaf companies (those without any children).
```yaml
mappings:
  organization_hierarchy:
    kind: iterativeSql
    input: companies
    sql: |
      SELECT
        COALESCE(parent.tree_id, c.tree_id) AS tree_id,
        c.parent_company_number,
        c.company_number
      FROM companies c
      LEFT JOIN __this__ parent
        ON c.parent_company_number = parent.company_number
```

In the first step, the output of the input mapping `companies` is assigned to the identifier `__this__`, and the SQL
query is executed for the first time, producing the start value of the iteration. In each subsequent iteration, the
result of the previous iteration is assigned to `__this__` and the query is executed again. The new result is then
compared to the previous one: if both are equal, a fixed point has been reached and execution stops; otherwise the
iteration continues.

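The fixed-point loop described above can be sketched in plain Python (this is an illustration of the principle, not Flowman's implementation; the toy hierarchy and helper names are made up):

```python
def iterate_to_fixpoint(step, start, max_iterations=99):
    """Repeatedly apply `step` until the result stops changing (a fixed point)."""
    current = start
    for _ in range(max_iterations):
        nxt = step(current)
        if nxt == current:        # fixed point reached -> stop
            return nxt
        current = nxt
    raise RuntimeError("no fixed point within max_iterations")

# Toy company hierarchy: child -> parent; only roots start with a tree_id.
parents = {"A": None, "B": "A", "C": "B", "D": None}
start = {"A": 1, "B": None, "C": None, "D": 2}   # tree_id per company

def propagate(tree_ids):
    # Each company takes its parent's tree_id once the parent has one,
    # mirroring COALESCE(parent.tree_id, c.tree_id) in the SQL above.
    return {c: tree_ids[p] if p is not None and tree_ids[p] is not None
               else tree_ids[c]
            for c, p in parents.items()}

result = iterate_to_fixpoint(propagate, start)
# Every company now carries the tree_id of its root: A and its
# descendants B, C share tree 1, while the isolated root D keeps tree 2.
```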
## Fields

* `kind` **(mandatory)** *(type: string)*: `iterativeSql`

* `broadcast` **(optional)** *(type: boolean)* *(default: false)*:
  Hint for broadcasting the result of this mapping for map-side joins.

* `cache` **(optional)** *(type: string)* *(default: NONE)*:
  Cache mode for the results of this mapping. Supported values are
  * `NONE` - Disables caching of the results of this mapping
  * `DISK_ONLY` - Caches the results on disk
  * `MEMORY_ONLY` - Caches the results in memory. If not enough memory is available, records will be uncached.
  * `MEMORY_ONLY_SER` - Caches the results in memory in a serialized format. If not enough memory is available, records will be uncached.
  * `MEMORY_AND_DISK` - Caches the results first in memory and then spills to disk.
  * `MEMORY_AND_DISK_SER` - Caches the results first in memory in a serialized format and then spills to disk.

* `input` **(required)** *(type: string)*:
  The input mapping which serves as the starting point of the iteration. For the first execution, the identifier
  `__this__` simply refers to the output of this mapping; in all subsequent iterations, `__this__` refers to the
  result of the previous iteration.

* `sql` **(optional)** *(type: string)* *(default: empty)*:
  The SQL statement to execute.

* `file` **(optional)** *(type: string)* *(default: empty)*:
  The name of a file containing the SQL statement to execute.

* `uri` **(optional)** *(type: string)* *(default: empty)*:
  A URL pointing to a resource containing the SQL statement to execute.

* `maxIterations` **(optional)** *(type: int)* *(default: 99)*:
  The maximum number of iterations. The mapping fails if the number of iterations required to reach the fixed point
  exceeds this limit.
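Putting several of these fields together, the example above could illustratively enable caching, bound the iteration count, and load the query from a file instead of inlining it (the file path here is hypothetical):

```yaml
mappings:
  organization_hierarchy:
    kind: iterativeSql
    input: companies
    cache: MEMORY_AND_DISK
    maxIterations: 20
    # Hypothetical path: the SQL statement lives in a separate file
    file: sql/organization_hierarchy.sql
```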

## Outputs

* `main` - the only output of the mapping

## Description

The `iterativeSql` mapping executes recursive SQL statements that refer to their own previous result.

Flowman also supports [`recursiveSql` mappings](recursive-sql.md), which provide similar functionality closer to
the style of classical recursive SQL statements.