This repository details our process of ETL, data analysis, and visualization using Python and PostgreSQL, focusing on vehicle data to extract valuable insights.
PROYECT-CARS-ETL
├── API
│ ├── dags
│ ├── apiCall.py
│ └── EDA_API.ipynb
├── Dashboard
├── Data
│ ├── Clean
│ └── Raws
├── data-README.md
├── Document
├── src
├── Video
├── .env
├── .gitignore
├── connection.py
├── docker-compose.yml
├── EDA.ipynb
├── fact-dimensions.ipynb
├── poetry.lock
├── pyproject.toml
└── README.md
The project is based on a study of the cars bought and sold in the United States. It seeks to show how the automotive market behaves, to understand the preferences of local consumers, and to determine whether variables such as geographic location, brand, or color affect the decision to buy a car. The chosen dataset comes from cars.com listings of new and second-hand cars published by private sellers and car dealers. We got the dataset from: https://www.kaggle.com/datasets/chancev/carsforsale/data
The key steps in this project include:
- Cleaning the dataset through an EDA process.
- Migrating the cleaned data to a PostgreSQL database for further analysis.
For this project, we use Python and Jupyter Notebook, with PostgreSQL as the database to manage and query the clean data.
- Python
- Jupyter
- Ubuntu
- Apache Airflow
- Poetry
- Git and GitHub
- PowerBI
- SQLAlchemy
- Pandas
- Dotenv
- PostgreSQL
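Since the stack includes SQLAlchemy, Dotenv, and PostgreSQL, the sketch below shows one way the connection.py module might build the database engine from the .env file and load a cleaned DataFrame into PostgreSQL. The environment variable names, the CSV path, and the table name are assumptions, not the repository's actual values.

```python
# Hypothetical sketch of connection.py: build a SQLAlchemy engine from .env values
# and load a cleaned DataFrame into PostgreSQL. Variable and table names are assumptions.
import os

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()  # reads credentials from the .env file at the project root


def get_engine():
    """Create a SQLAlchemy engine for the PostgreSQL database (assumes the psycopg2 driver)."""
    user = os.getenv("POSTGRES_USER")
    password = os.getenv("POSTGRES_PASSWORD")
    host = os.getenv("POSTGRES_HOST", "localhost")
    port = os.getenv("POSTGRES_PORT", "5432")
    db = os.getenv("POSTGRES_DB")
    return create_engine(f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{db}")


if __name__ == "__main__":
    # Example: migrate a cleaned CSV into a "cars" table (illustrative path and table name).
    df = pd.read_csv("Data/Clean/cars_clean.csv")
    df.to_sql("cars", get_engine(), if_exists="replace", index=False)
```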
Santiago Gomez Castro
Juan Carlos Quintero
Miguel Angel Ruales
After cloning the repository with git clone, enter the project directory:
# If you don't have poetry
sudo apt install python3-poetry
poetry shell
poetry install
export AIRFLOW_HOME=$(pwd)/airflow
export AIRFLOW__CORE__LOAD_EXAMPLES=false
AIRFLOW_VERSION=2.10.1
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
airflow standalone
Open your browser and go to http://localhost:8080 (the address printed by airflow standalone). To start the webserver on its own instead, run:
airflow webserver --port 8080
Airflow is used to build a pipeline that performs the extraction, transformation, and loading of the data.
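For orientation, here is a minimal sketch of what a DAG placed in API/dags could look like; the DAG id, task names, and callables are illustrative, not the project's actual code.

```python
# Illustrative Airflow DAG sketch (not the repository's actual DAG): an ETL pipeline
# with extract, transform, and load steps wired in sequence.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # e.g. call the API or read the raw dataset


def transform():
    ...  # e.g. clean and merge the data


def load():
    ...  # e.g. write the clean data to PostgreSQL


with DAG(
    dag_id="cars_etl",             # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,                 # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```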
Watch the video on Google Drive
Before running Kafka, we need to use some additional commands:
pip install git+https://github.com/dpkp/kafka-python.git
Launch Docker with:
docker compose up
If you don't have Docker, go to the following link and download it.
Open a bash shell in the Kafka container:
docker exec -it kafka-test bash
and paste:
kafka-topics --bootstrap-server kafka-test:9092 --create --topic kafka_project
exit
Next, we start the consumer with:
python3 ./src/consumer.py
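As a reference, the following is a minimal sketch of a kafka-python consumer for the kafka_project topic; the bootstrap address and JSON message format are assumptions, and the actual logic lives in src/consumer.py.

```python
# Hypothetical sketch of a kafka-python consumer for the kafka_project topic.
# The bootstrap address and JSON message format are assumptions, not the project's actual setup.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "kafka_project",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # read from the beginning of the topic
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    # Each message.value is one record streamed by the producer.
    print(message.value)
```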
This project uses Great Expectations to ensure the quality of the data retrieved from the extract_API stage before further processing. The validation file is the Jupyter Notebook "testin_extractData.ipynb" in the GX folder.
- Column Structure: Ensures the retrieved columns match the expected schema.
- Data Types: Verifies that each column has the correct data type.
- Missing Values: Checks that no columns contain null values, except for the `value` column (as it may contain nulls before data cleaning).
Make sure to initialize a Great Expectations project before using the notebook by running:
great_expectations init
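As an illustration, the checks above could be expressed with expectations along these lines. This is a sketch assuming the legacy Pandas API of Great Expectations (the exact API differs across versions), and the column names, dtypes, and file path are placeholders rather than the project's actual schema.

```python
# Illustrative Great Expectations checks mirroring the validations described above.
# Column names, dtypes, and the file path are placeholders, not the project's actual schema.
import great_expectations as ge
import pandas as pd

df = pd.read_csv("Data/Raws/extract_api_sample.csv")  # hypothetical extract_API output
gdf = ge.from_pandas(df)                              # legacy Pandas API wrapper

# Column structure: the retrieved columns match the expected schema.
gdf.expect_table_columns_to_match_set(["brand", "model", "year", "price", "value"])

# Data types: each column has the correct type.
gdf.expect_column_values_to_be_of_type("year", "int64")
gdf.expect_column_values_to_be_of_type("price", "float64")

# Missing values: no nulls allowed, except in the "value" column.
for column in ["brand", "model", "year", "price"]:
    gdf.expect_column_values_to_not_be_null(column)

print(gdf.validate())
```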
This is our dashboard, which we keep updating over time.