This repository details our process of ETL, data analysis, and visualization using Python and PostgreSQL, focusing on vehicle data to extract valuable insights.
PROYECT-CARS-ETL
├── API
│ ├── dags
│ ├── apiCall.py
│ └── EDA_API.ipynb
├── Dashboard
├── Data
│ ├── Clean
│ └── Raws
├── data-README.md
├── Document
├── src
├── Video
├── .env
├── .gitignore
├── connection.py
├── docker-compose.yml
├── EDA.ipynb
├── fact-dimensions.ipynb
├── poetry.lock
├── pyproject.toml
└── README.md
The project is based on a study of the cars bought and sold in the United States. It seeks to show how the automotive market behaves, to understand the preferences of local consumers, and to determine whether variables such as geographic location, brand, or color affect the decision to buy a car. The chosen dataset comes from cars.com listings of new and second-hand cars published by private sellers and car dealers. We got the dataset from: https://www.kaggle.com/datasets/chancev/carsforsale/data
The key steps in this project include:
- Cleaning the dataset through an EDA process.
- Migrating the cleaned data to a PostgreSQL database for further analysis.
For this project, we use Python and Jupyter Notebook, with PostgreSQL as the database to manage and query the clean data.
- Python
- Jupyter
- Ubuntu
- Apache Airflow
- Poetry
- Git and GitHub
- PowerBI
- SQLAlchemy
- Pandas
- Dotenv
- PostgreSQL
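Since the stack includes SQLAlchemy, Dotenv, and PostgreSQL, the sketch below shows one way the connection.py module might build the database engine from the .env file and load a cleaned DataFrame into PostgreSQL. The environment variable names, the CSV path, and the table name are assumptions, not the repository's actual values.

```python
# Hypothetical sketch of connection.py: build a SQLAlchemy engine from .env values
# and load a cleaned DataFrame into PostgreSQL. Variable and table names are assumptions.
import os

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()  # reads credentials from the .env file at the project root


def get_engine():
    """Create a SQLAlchemy engine for the PostgreSQL database (assumes the psycopg2 driver)."""
    user = os.getenv("POSTGRES_USER")
    password = os.getenv("POSTGRES_PASSWORD")
    host = os.getenv("POSTGRES_HOST", "localhost")
    port = os.getenv("POSTGRES_PORT", "5432")
    db = os.getenv("POSTGRES_DB")
    return create_engine(f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{db}")


if __name__ == "__main__":
    # Example: migrate a cleaned CSV into a "cars" table (illustrative path and table name).
    df = pd.read_csv("Data/Clean/cars_clean.csv")
    df.to_sql("cars", get_engine(), if_exists="replace", index=False)
```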
Santiago Gomez Castro
Juan Carlos Quintero
Miguel Angel Ruales
After cloning the repository with git clone, enter the project directory:
# If you don't have poetry
sudo apt install python3-poetry
poetry shell
poetry install
export AIRFLOW_HOME=$(pwd)/airflow
export AIRFLOW__CORE__LOAD_EXAMPLES=false
AIRFLOW_VERSION=2.10.1
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
airflow standalone
Open your browser and go to http://localhost:8080 (the address printed by airflow standalone). To start the webserver on its own instead, run:
airflow webserver --port 8080
Airflow is used to build a pipeline that performs the extraction, transformation, and loading of the data.
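For orientation, here is a minimal sketch of what a DAG placed in API/dags could look like; the DAG id, task names, and callables are illustrative, not the project's actual code.

```python
# Illustrative Airflow DAG sketch (not the repository's actual DAG): an ETL pipeline
# with extract, transform, and load steps wired in sequence.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # e.g. call the API or read the raw dataset


def transform():
    ...  # e.g. clean and merge the data


def load():
    ...  # e.g. write the clean data to PostgreSQL


with DAG(
    dag_id="cars_etl",             # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,                 # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```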
Watch the video on Google Drive
Before running Kafka, we need to use some additional commands:
pip install git+https://github.com/dpkp/kafka-python.git
Launch Docker with:
docker compose up
If you don't have Docker, go to the following link and download it.
Open a bash shell in the Kafka container:
docker exec -it kafka-test bash
and paste:
kafka-topics --bootstrap-server kafka-test:9092 --create --topic kafka_project
exit
Next, we start the consumer with:
python3 ./src/consumer.py
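As a reference, the following is a minimal sketch of a kafka-python consumer for the kafka_project topic; the bootstrap address and JSON message format are assumptions, and the actual logic lives in src/consumer.py.

```python
# Hypothetical sketch of a kafka-python consumer for the kafka_project topic.
# The bootstrap address and JSON message format are assumptions, not the project's actual setup.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "kafka_project",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # read from the beginning of the topic
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    # Each message.value is one record streamed by the producer.
    print(message.value)
```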
This project uses Great Expectations to ensure the quality of the data retrieved from the extract_API stage before further processing. The validation file is the Jupyter Notebook "testin_extractData.ipynb" in the GX folder.
- Column Structure: Ensures the retrieved columns match the expected schema.
- Data Types: Verifies that each column has the correct data type.
- Missing Values: Checks that no columns contain null values, except for the `value` column (as it may contain nulls before data cleaning).
Make sure to initialize a Great Expectations project before using the notebook by running:
great_expectations init
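As an illustration, the checks above could be expressed with expectations along these lines. This is a sketch assuming the legacy Pandas API of Great Expectations (the exact API differs across versions), and the column names, dtypes, and file path are placeholders rather than the project's actual schema.

```python
# Illustrative Great Expectations checks mirroring the validations described above.
# Column names, dtypes, and the file path are placeholders, not the project's actual schema.
import great_expectations as ge
import pandas as pd

df = pd.read_csv("Data/Raws/extract_api_sample.csv")  # hypothetical extract_API output
gdf = ge.from_pandas(df)                              # legacy Pandas API wrapper

# Column structure: the retrieved columns match the expected schema.
gdf.expect_table_columns_to_match_set(["brand", "model", "year", "price", "value"])

# Data types: each column has the correct type.
gdf.expect_column_values_to_be_of_type("year", "int64")
gdf.expect_column_values_to_be_of_type("price", "float64")

# Missing values: no nulls allowed, except in the "value" column.
for column in ["brand", "model", "year", "price"]:
    gdf.expect_column_values_to_not_be_null(column)

print(gdf.validate())
```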
This is our dashboard, which we keep updating over time.