A comprehensive ELT (Extract, Load, Transform) project using Apache Spark, Apache Hive, and DBT. The project includes a CI/CD pipeline implemented with Jenkins for automated build and deployment. Workflow orchestration is handled by Apache Airflow, which schedules and monitors the ELT job. DBT is used to transform and prepare data for analysis in the data warehouse, which in our case is Hive. The project showcases the integration of Spark, Hive, and DBT for efficient data processing and storage, ensuring seamless data pipeline operations.
- I have used an Ubuntu machine with the following installed:
| Language/Tool | Version |
| --- | --- |
| Scala | 2.12.10 |
| Python | 3.8.10 |
| Maven | 3.6.3 |
| Apache Airflow | 2.5.1 |
| Jenkins | 2.332.3 |
| dbt | 1.3.4 |
| dbt-hive plugin | 1.3.1 |
- I have used a Hadoop cluster consisting of one NameNode and three DataNodes, built using Docker. The cluster includes the following installations:
| Language/Tool | Version |
| --- | --- |
| Apache Hadoop | 3.0.0 |
| Apache Hive | 3.1.2 |
| Apache Spark | 2.4.7 |
The Jenkins pipeline contains the following stages:
- Checkout stage
- Build stage
- Generate JAR stage
- Deploy to Airflow Stage
- Post Actions Stage (Gmail Notification)
- In Jenkins, create a pipeline named `spark-hive-elt-cicd` and configure it to check out the source code from the GitHub repository that contains the Jenkinsfile.
- You can trigger the pipeline manually or via the GitHub hook trigger, which listens for code changes and triggers the pipeline automatically.
- I used the email notification plugin to send me an email on success or failure, so I can track the status of the pipeline.
The Airflow pipeline contains the following tasks:
- Extract data from an Excel file that contains multiple sheets
- Load the data into Hive
- Transform the data using the Data Build Tool (DBT)
- After deploying the ELT JAR and the Airflow DAG, you will find that the DAG is added and its tasks are scheduled to run (a minimal sketch of such a DAG is shown below).
- Before loading the data into Hive, the data warehouse is empty.
- After loading the data into Hive, the loaded tables appear in the data warehouse.
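Below is a minimal sketch of how such a DAG could be wired up with two `BashOperator` tasks: one submitting the Spark ELT JAR (extract and load) and one running the DBT models (transform). The DAG id, file paths, Spark master, and main class are assumptions for illustration, not the project's actual code.

```python
# Minimal Airflow DAG sketch. The DAG id, file paths, Spark master, and main
# class are illustrative assumptions, not the project's actual values.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="spark_hive_elt",            # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Extract + Load: submit the Scala ELT JAR that reads the Excel sheets
    # and writes them into Hive.
    extract_and_load = BashOperator(
        task_id="extract_and_load",
        bash_command=(
            "spark-submit --master yarn "            # assumed cluster manager
            "--class com.example.EltJob "            # hypothetical main class
            "/opt/airflow/jars/spark-hive-elt.jar"   # assumed JAR location
        ),
    )

    # Transform: run the DBT models against the Hive warehouse.
    dbt_transform = BashOperator(
        task_id="dbt_transform",
        bash_command="cd /opt/dbt/dbt_with_hive && dbt run",  # assumed project path
    )

    extract_and_load >> dbt_transform
```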
The data is a dataset of SQL Saturday sessions from around the world, hosted on data.world. You can access it at https://data.world/stretcharm/sqlsaturday.
I have used Apache Spark to read these sheets and load them into Hive using the spark-excel library.
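The actual extract/load job in this project is written in Scala and packaged as a JAR, but the same flow can be sketched in PySpark for illustration; the input path, sheet names, and target table names below are assumptions.

```python
# Illustrative PySpark sketch of the extract/load step (the real job is Scala).
# Input path, sheet names, and Hive table names are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sqlsaturday-elt")
    .enableHiveSupport()          # lets saveAsTable write to the Hive metastore
    .getOrCreate()
)

sheets = ["Events", "Sessions", "Speakers"]   # hypothetical sheet names

for sheet in sheets:
    df = (
        spark.read
        .format("com.crealytics.spark.excel")     # spark-excel data source
        .option("dataAddress", f"'{sheet}'!A1")   # read the whole sheet (0.13+ option)
        .option("header", "true")
        .option("inferSchema", "true")
        .load("/data/sqlsaturday.xlsx")           # assumed input path
    )
    # One Hive table per sheet, e.g. default.events, default.sessions, ...
    df.write.mode("overwrite").saveAsTable(f"default.{sheet.lower()}")
```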
DBT is used here to transform the data in Hive using models.
- To connect to Hive, use the following `profiles.yml`:

```yaml
dbt_with_hive:
  target: dev
  outputs:
    dev:
      type: hive
      host: master
      port: 10000      # Hive server (HiveServer2) port
      schema: default
```
The Hive server (HiveServer2) is up and running on port 10000.
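As a quick sanity check that HiveServer2 is reachable on that host and port (independently of DBT), something like the following can be used; it assumes the PyHive package (`pip install 'pyhive[hive]'`), which is not part of this project's stack.

```python
# Quick connectivity check against HiveServer2 (assumes PyHive is installed).
from pyhive import hive

conn = hive.Connection(host="master", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
print(cursor.fetchall())   # should list the Hive databases, e.g. [('default',)]
conn.close()
```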
- The first simple model, `events_2019.sql`, filters events of the year 2019.
- After loading the data into Hive, the DBT job will run, execute the models, and save the results inside the data warehouse.
- Dashboard to visualize data.
- Add data tests in DBT.
- Getting Started With Jenkins: Installation and Setup | Abhishek Jain | Tech Insider, Medium
- Install Apache Airflow the first time | Kin Lau | Medium
- Install DBT (Data Build Tool): Steps to install DBT | Ravi Kishore Reddy Lakkireddy | Medium
- Apache Hive setup | dbt Developer Hub (getdbt.com)
- Sign in with app passwords | Google Account Help (used to generate an app password for the email notification in Jenkins)