When working with relational databases for data engineering, one of the main tasks is data modeling:
- What data objects do we need?
- What are the relationships between the different objects?
- What degree of normalization do we need?
- ...
By answering these questions, we come up with a data model, which we then put into place using DDL (Data Definition Language). Example:
CREATE TABLE customers (
first_name VARCHAR(255),
last_name VARCHAR(255)
);
If we want to change that table at a later point, we have to adjust the model with an ALTER statement:
ALTER TABLE customers ADD COLUMN mail VARCHAR(255);
This raises a follow-up question: how do we put this under version control? There are some solutions, but none of them are streamlined.
What if we had an easier way to come up with the models in the first place, to alter them later on, and to keep them under version control? This is what dbt tries to solve.
Disclaimer: Next to dbt Core, there is also dbt Cloud, which is a SaaS offering. In this workshop, we will solely talk about dbt Core, the free, open-source variant.
dbt Core is a command line tool that scans your working folder for SQL files. Each file makes up one model (i.e. an SQL select query). dbt handles turning all these models into objects in our database. Go here to get a quick glance at all the features of dbt.
Relational Database: DuckDB
For the purpose of this workshop, we will use DuckDB, a file-based relational database. It allows everyone in the room to have their own database locally on their computer, and we will configure dbt to work with this local database in the following steps.
- Clone this repository
- Use a command line tool of your choice and cd to the project folder
- Check out the workshop branch by calling
git checkout correlaid-workshop
- Create a virtual environment for the project by using either Poetry or venv:
If you go with Poetry: to install it, simply run
curl -sSL https://install.python-poetry.org | python3 -
or visit the Installation page for more information.
In your project folder run poetry install
to set up your virtual environment.
You have two options to run anything within the virtual environment: either spawn a shell by using poetry shell
or use poetry run [command]
to run any command from within the virtual environment.
If you go with venv instead: create a virtual environment by calling
python -m venv venv
activate that environment with
source venv/bin/activate
and install all requirements
pip install -r requirements.txt
In the subfolder seeds you find three .csv files, which will be the source data for our workshop.
dbt can use these .csv files to fill your database with some source data (more about that later). For now, we will just run
dbt seed
Let's have a look at the data by calling the DuckDB CLI: duckcli correlaid.duckdb
select * from raw_customers;
select * from raw_orders;
select * from raw_payments;
- Project file:
dbt_project.yml
This file lets dbt know that this directory and its subfolders are a dbt project. It contains some general configurations for the project.
- Profile:
profiles.yml
This file defines the database connection. It can sit either in ~/.dbt/profiles.yml or in the project folder itself.
- Models:
models/some_random_model.sql
The subfolder models (and its subfolders) contains the models that we define in our dbt project.
All SQL files will be picked up by dbt and materialized into the database.
- Model configs:
models/schema.yml
We can configure our models in three different places: dbt_project.yml, a .yml file within the model's folder, or a config block at the top of our model.
The configuration is evaluated in that order, which means that we can override generic configurations from dbt_project.yml in the config block at the top of the model file.
Best Practice: place a .yml file in each subfolder under models and define the configs for that subfolder right there. Only make specific changes in the configuration block of a single model.
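For illustration, such a config block at the top of a model file could look like this (a minimal sketch; the view materialization is just an example, not a setting used in this project):
{{ config(materialized='view') }}  -- overrides any materialization set in dbt_project.yml or a .yml file

select * from {{ ref('stg_customers') }}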
Let's have a look at our first dbt model: ./models/staging/stg_customers.sql. The purpose of this model: pick up the source table raw_customers, rename columns, recast values, and rework the format, so that we end up with a consistent format for all of our source tables.
with source as (
select * from {{ ref('raw_customers') }}
),
renamed as (
select
id as customer_id,
first_name,
last_name
from source
)
select * from renamed
Within dbt models, we use a lot of CTEs (Common Table Expressions). These start with a with [name] as clause and are essentially temporary tables that are only valid within the context of this model.
We can chain multiple queries as CTEs by separating them with a comma. Each following CTE has access to all temporary tables that were defined before it.
The last query of the expression is the result of the model and will be materialized to the database.
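A generic sketch of such a chain with three CTEs (the column names are assumptions for illustration; this is not a model from our project):
with orders as (
    select * from {{ ref('raw_orders') }}
),
completed as (
    -- this CTE can reference the previous one
    select * from orders
    where status = 'completed'
),
counted as (
    select customer_id, count(*) as order_count
    from completed
    group by customer_id
)
-- only this final select is materialized
select * from counted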
Let's have a look at the first CTE:
with source as (
select * from {{ ref('raw_customers') }}
),
Apart from raw SQL, we can also use Jinja within our models. Jinja is a lightweight templating language that gives us better control over our models.
In this case, we use the {{ ref() }} function to reference other models that are defined within our dbt project. dbt uses this to learn about the lineage and dependencies of our models.
In other cases, we could use Jinja to introduce control structures like if and for into our models. More about that here.
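For example, a for loop can generate repetitive SQL at compile time. A minimal sketch, assuming stg_orders has a status column (the list of statuses is hypothetical):
select
    order_id,
    {% for s in ['returned', 'completed'] %}
    case when status = '{{ s }}' then 1 else 0 end as is_{{ s }}{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_orders') }}
Before running the model, dbt compiles this template down to plain SQL with one case expression per status.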
With this code block, we have access to our raw customer data, wrapped in the temporary table source.
renamed as (
select
id as customer_id,
first_name,
last_name
from source
)
For the next part, we pick up that raw data and rename the columns to our own liking.
select * from renamed
This last part selects everything from renamed, so that the renamed table is materialized into our database.
The other model in our staging folder, ./models/staging/stg_orders.sql, has the same structure. It transforms the raw orders into a table shaped to our liking.
Then there are also the two models in ./models/mart/. These are more elaborate models that could be used for a dashboard or other downstream consumers. Please have a look on your own and try to understand those models.
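To give you an idea of the pattern, here is a condensed sketch of what a mart-style customers model can look like (an illustration based on the expected columns, not the actual code from ./models/mart/):
with customers as (
    select * from {{ ref('stg_customers') }}
),
orders as (
    select * from {{ ref('stg_orders') }}
),
customer_orders as (
    -- aggregate all orders per customer
    select
        customer_id,
        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders
    group by customer_id
)
select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    customer_orders.first_order,
    customer_orders.most_recent_order,
    customer_orders.number_of_orders
from customers
left join customer_orders using (customer_id)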
One of the main commands of dbt is dbt run
. It scans for the models that are already defined and materializes those models into the database. Let's try it and have another look into the database afterwards.
In your CLI, use
dbt run
dbt run picks up all the defined models, compiles them to valid SQL code, and runs it against your database.
Follow this up with
duckcli correlaid.duckdb
Compare your newly built models:
select * from stg_customers limit 10;
select * from stg_orders limit 10;
select * from customers limit 10;
select * from orders limit 10;
dbt can generate documentation directly from your models. To generate and view the documentation, simply call
dbt docs generate
dbt docs serve
Either a browser window will open up by itself, or you can open http://localhost:8080 manually.
By using the {{ ref() }} macro, you are building dependencies into your models. dbt uses these dependencies to build a DAG (directed acyclic graph). This also works as a lineage graph for your models and data.
To view the DAG of your project, click the blue button in the bottom right corner of the documentation window.
We can use the left panel to navigate through our models. Each model is included with different metadata, including a description, columns, references, dependencies and the code itself.
dbt uses the .yml files in your project to put descriptions for tables and columns into place.
Let's have a look at ./models/mart/schema.yml:
version: 2

models:
  - name: customers
    description: This table has basic information about a customer, as well as some derived facts based on a customer's orders
    columns:
      - name: customer_id
        description: This is a unique identifier for a customer
        tests:
          - unique
          - not_null
      - name: first_name
        description: Customer's first name. PII.
As you can see, we can add a description to each of our models. dbt will pick up this description and put it in the corresponding place in our documentation.
Additionally, we can add a description to each individual column. The tests entries define data tests (here: customer_id must be unique and not null), which dbt can check with dbt test.
There is more to say about the documentation, which you can read up on here.
Finally, there are some tasks for you to solve. Try to do your best with the knowledge that you just got. If you feel stuck, either ask me or have a look into the main branch.
- Rename ./seeds/raw_payments.csv_ignore to ./seeds/raw_payments.csv. Afterwards, run dbt seed to seed this file into your database.
- Introduce a staging model for payments that picks up the data from raw_payments and renames the columns to your liking.
- Within stg_payments, convert the amount from cents to euros. Make sure to document this correctly in the corresponding .yml file.
- Add payments to your mart/orders model. Combine all the payments for one order in one row and have one column for each of the different payment methods. Also add the total amount (a sketch of the general pattern follows this task list). It should look something like this:
order_id | customer_id | order_date | status | credit_card_amount | coupon_amount | bank_transfer_amount | gift_card_amount | total_amount |
---|---|---|---|---|---|---|---|---|
1 | 1 | 2018-01-01 | returned | 10 | 0 | 0 | 0 | 10 |
2 | 3 | 2018-01-02 | completed | 20 | 0 | 0 | 0 | 20 |
3 | 94 | 2018-01-04 | completed | 0 | 1 | 0 | 0 | 1 |
4 | 50 | 2018-01-05 | completed | 0 | 25 | 0 | 0 | 25 |
5 | 64 | 2018-01-05 | completed | 0 | 0 | 17 | 0 | 17 |
6 | 54 | 2018-01-07 | completed | 6 | 0 | 0 | 0 | 6 |
7 | 88 | 2018-01-09 | completed | 16 | 0 | 0 | 0 | 16 |
8 | 2 | 2018-01-11 | returned | 23 | 0 | 0 | 0 | 23 |
9 | 53 | 2018-01-12 | completed | 0 | 0 | 0 | 23 | 23 |
10 | 7 | 2018-01-14 | completed | 0 | 0 | 26 | 0 | 26 |
- Add payments to your mart/customers model. For this, sum up the payment amounts of all orders belonging to each customer and add a column like customer_lifetime_value. Make sure to only include orders that were not returned. It should look like this:
customer_id | first_name | last_name | first_order | most_recent_order | number_of_orders | customer_lifetime_value |
---|---|---|---|---|---|---|
1 | Michael | P. | 2018-01-01 | 2018-02-10 | 2 | 33 |
2 | Shawn | M. | 2018-01-11 | 2018-01-11 | 1 | 23 |
3 | Kathleen | P. | 2018-01-02 | 2018-03-11 | 3 | 65 |
6 | Sarah | R. | 2018-02-19 | 2018-02-19 | 1 | 8 |
7 | Martin | M. | 2018-01-14 | 2018-01-14 | 1 | 26 |
8 | Frank | R. | 2018-01-29 | 2018-03-12 | 2 | 45 |
9 | Jennifer | F. | 2018-03-17 | 2018-03-17 | 1 | 30 |
11 | Fred | S. | 2018-03-23 | 2018-03-23 | 1 | 3 |
12 | Amy | D. | 2018-03-03 | 2018-03-03 | 1 | 4 |
13 | Kathleen | M. | 2018-03-07 | 2018-03-07 | 1 | 26 |
- Add descriptions for the newly added columns.
- Rebuild the docs and have a look at the lineage graph and your added column descriptions.
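For the mart/orders task, the general pattern is conditional aggregation, possibly combined with the Jinja for loop shown earlier. A minimal sketch, assuming your stg_payments model exposes order_id, payment_method, and amount columns (the names are derived from the expected result table, not from the main branch):
select
    order_id,
    -- one conditional sum per payment method
    sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount,
    sum(case when payment_method = 'coupon' then amount else 0 end) as coupon_amount,
    -- ... repeat for the other payment methods ...
    sum(amount) as total_amount
from {{ ref('stg_payments') }}
group by order_id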
- The dbt Fundamentals Course is a free course which takes about 3.5 to 5 hours and gives a great overview of the different features in dbt. There are more features that we have not touched on today.
- The dbt Docs are a great resource for all kinds of knowledge around dbt.
- Feel free to contact me anytime about any questions regarding dbt.