This quickstart guide will walk you to a installation of Apache Spark and Flowman on your local Linux box. If you are using Windows, you will find some hints for setting up the required "Hadoop WinUtils", but we generally recommend using Linux. You can also run a Flowman Docker image, which is the simplest way to get up to speed.
Although Flowman directly builds upon the power of Apache Spark, it does not provide a working Hadoop or Spark environment — and there is a good reason for that: In many environments (specifically in companies using Hadoop distributions) a Hadoop/Spark environment is already provided by some platform team. And Flowman tries its best not to mess this up and instead requires a working Spark installation.
Fortunately, Spark is rather simple to install locally on your machine:
As of this writing, the latest release of Flowman is 1.1.0 and is available prebuilt for Spark 3.4.1 on the Spark homepage. So we download the appropriate Spark distribution from the Apache archive and unpack it.
# Create a nice playground which doesn't mess up your system
mkdir playground
cd playground
# Download and unpack Spark & Hadoop
curl -L https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz | tar xvzf -
# Create a nice link
ln -snf spark-3.4.1-bin-hadoop3 spark
The Spark package already contains Hadoop, so with this single download you already have both installed and integrated with each other.
If you are trying to run Flowman on Windows, you also need the Hadoop Winutils, which is a set of
DLLs required for the Hadoop libraries to be working. You can get a copy at https://github.com/kontext-tech/winutils .
Once you downloaded the appropriate version, you need to place the DLLs into a directory $HADOOP_HOME/bin
, where
HADOOP_HOME
refers to some arbitrary location of your choice on your Windows PC. You also need to set the following
environment variables:
HADOOP_HOME
should point to the parent directory of thebin
directoryPATH
should also contain$HADOOP_HOME/bin
You find prebuilt Flowman packages on the corresponding release page on GitHub. For this quickstart, we chose
flowman-dist-1.1.0-oss-spark3.4-hadoop3.3-bin.tar.gz
which nicely fits to the Spark package we just downloaded before.
# Download and unpack Flowman
curl -L https://github.com/dimajix/flowman/releases/download/1.1.0/flowman-dist-1.1.0-oss-spark3.4-hadoop3.3-bin.tar.gz | tar xvzf -
# Create a nice link
ln -snf flowman-1.1.0-oss-spark3.4-hadoop3.3 flowman
Now before you can use Flowman, you need to tell it where it can find the Spark home directory which we just created
in the previous step. This can be either done by providing a valid configuration file in
flowman/conf/flowman-env.sh
(a template can be found at flowman/conf/flowman-env.sh.template
), or we can simply
set an environment variable. For the sake of simplicity, we follow the second approach
# This assumes that we are still in the directory "playground"
export SPARK_HOME=$(pwd)/spark
In order to access S3 in the example below, we also need to provide a default namespace which contains some basic plugin configurations. We simply copy the provided template as follows:
# Copy default namespace
cp flowman/conf/default-namespace.yml.template flowman/conf/default-namespace.yml
cp flowman/conf/flowman-env.sh.template flowman/conf/flowman-env.sh
# Optionally provide AWS keys. The example will use anonymous access to S3 and does not require the keys
export AWS_ACCESS_KEY_ID=<your aws access key>
export AWS_SECRET_ACCESS_KEY=<your aws secret key>
That’s all we need to run the Flowman example.
The example data is stored in a publicly accessible S3 bucket. Since the data is publicly available and the project is configured to use anonymous AWS authentication, you do not need to provide your AWS credentials (you even do not even need to have an account on AWS)
We start Flowman by running the interactive Flowman shell. While this is not the tool that would be used in automatic
batch processing (flowexec
is the right tool for that scenario), it gives us a good idea how ETL projects in Flowman
are organized.
cd flowman
bin/flowshell -f examples/weather
Now we can inspect some of the relations defined in the project. First we list all relations
flowman:weather> relation list
Now we can peek inside the relations stations_raw
and measurements_raw
. Since the second relation is partitioned
by years, we explicitly specify the year via the option -p year=2011
flowman:weather> relation show stations_raw
flowman:weather> relation show measurements_raw -p year=2011
Now we want to execute the projects main job. Again the job is parametrized by year, so we need to specify the year that we'd like to process.
flowman:weather> job build main year=2011
Now we'd like to inspect some of the mappings which have been used during job execution. Since some mappings depend
on job-specific variables, we need to create a job context, which can be done by job enter <job-name> <job-args>
as follows:
flowman:weather> job enter main year=2011
Note how the prompt has changed and will now include the job name. Now we can inspect some mappings:
flowman:weather/main> mapping list
flowman:weather/main> mapping show measurements_raw
flowman:weather/main> mapping show measurements_extracted
flowman:weather/main> mapping show stations_raw
Finally we'd like to leave the job context again.
flowman:weather/main> job leave
The job execution has written its results into some relations again. We can now inspect them again
flowman:weather> relation show stations
flowman:weather> relation show measurements
flowman:weather> relation show aggregates -p year=2011
Flowman also provides an execution history. In the trivial deployment, this information is stored locally in a Derby database, but other databases like MySQL, MariaDB etc are also supported.
flowman:weather> history job search
flowman:weather> history target search -P weather
Flowman cannot only execute all the data transformations specified in the example project, it can also generate a documentation, which will be stored as an html file
flowman:weather> documentation generate
This will create a file in the directory examples/weather/generated-documentation/project.html
which can be viewed
by any web browser of your choice.
Finally, we quit the Flowman shell via the quit
command.
flowman:weather> quit
So far we have only used the Flowman shell for interactive work with projects. Actually, the shell was developed as a
second step to help to analyze problems and debugging data flows. The primary command for working with Flowman projects
is flowexec
which is used for non-interactive batch execution, for example within cron-jobs.
It shares a lot of code with the Flowman shell, so the commands are often exactly the same. The main difference is
that with flowexec
you specify the commands on the command line while flowshell
provides its own prompt.
For example for running the “build” lifecycle of the weather project for the year 2014, you only need to run:
bin/flowexec -f examples/weather job build main year=2014
A very special Thank You! goes to all of you who try to follow the example hands-on on your local machine. If you have problems with following the example, please leave me a note — it’s always difficult to streamline such a process, and I might have overseen some issues.