Preprocessing and analyzing NYC Yellow Taxi passenger counts for 2022 and 2023 using Isolation Forest
- First, download the queried datasets as CSV files into the data/ folder from 2022dataset and 2023dataset
- Install the required packages:
pip install -r requirements.txt
- Run all cells in preprocess.ipynb. The notebook first groups the number of passengers per hour and cleans the data to remove problematic records from the original dataset. It then finds the maximum hourly passenger count for each day to further reduce the data size. Finally, it saves the processed data as CSV files.
- Run the anomaly detection script:
python anomoly_detection.py
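
For orientation, the preprocessing and detection steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in the notebook or script. It assumes the CSV has `tpep_pickup_datetime` and `passenger_count` columns with ISO-style timestamps, and it treats missing values, non-positive counts, and timestamps outside the target year as "problematic" rows. The function names, the inline sample data, and the `contamination` value are all hypothetical. The aggregation uses only the standard library; the last function uses scikit-learn's `IsolationForest`, imported lazily so the rest runs without it.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def hourly_passengers(rows, year):
    """Sum passenger_count per (date, hour), skipping problematic rows."""
    totals = defaultdict(int)
    for row in rows:
        try:
            pickup = datetime.fromisoformat(row["tpep_pickup_datetime"])
            count = int(float(row["passenger_count"]))
        except (KeyError, ValueError, TypeError, OverflowError):
            continue  # missing or malformed field
        if pickup.year != year or count <= 0:
            continue  # assumed cleaning rule: wrong year or non-positive count
        totals[(pickup.date(), pickup.hour)] += count
    return totals


def daily_max(totals):
    """Keep only the busiest hour's passenger total for each day."""
    best = {}
    for (day, _hour), total in totals.items():
        best[day] = max(best.get(day, 0), total)
    return dict(sorted(best.items()))


def detect_anomalies(daily, contamination=0.05):
    """Return the days that IsolationForest labels as outliers (-1)."""
    from sklearn.ensemble import IsolationForest  # lazy import, optional dependency

    days = list(daily)
    features = [[daily[d]] for d in days]  # one feature per day: the daily maximum
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(features)
    return [d for d, label in zip(days, labels) if label == -1]


# Tiny made-up sample standing in for a file under data/.
sample = io.StringIO(
    "tpep_pickup_datetime,passenger_count\n"
    "2022-01-01 00:10:00,2\n"
    "2022-01-01 00:40:00,3\n"
    "2022-01-01 13:05:00,1\n"
    "2022-01-02 08:00:00,4\n"
    "2022-01-02 08:30:00,\n"
    "2009-01-01 00:00:00,6\n"
    "2022-01-02 09:15:00,0\n"
)
result = daily_max(hourly_passengers(csv.DictReader(sample), 2022))
print(result)
```

With real data, `csv.DictReader(open(path, newline=""))` would replace the in-memory sample, and the dictionary returned by `daily_max` could be passed to `detect_anomalies` once enough days are available for the forest to be meaningful.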