The goal of this project is to classify the sentiment of Twitter reviews using online learning.
Online learning is the process of updating (retraining) the model incrementally as new data arrives in a continuous stream.
The dataset contains 1.6M unique tweet records.
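To make the idea of online learning concrete, here is a minimal sketch of a classifier that is updated one record at a time. It is not the project's actual model: the `OnlineLogisticRegression` class, its `partial_fit` method, and the tiny `stream` of labeled tweets are all hypothetical and exist only for illustration.

```python
import math
from collections import defaultdict


class OnlineLogisticRegression:
    """Hypothetical binary classifier updated one example at a time with SGD."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = defaultdict(float)  # one weight per token
        self.bias = 0.0

    def predict_proba(self, tokens):
        # Sigmoid of the summed token weights, clamped to avoid overflow.
        score = self.bias + sum(self.weights[t] for t in tokens)
        score = max(-30.0, min(30.0, score))
        return 1.0 / (1.0 + math.exp(-score))

    def partial_fit(self, tokens, label):
        # One gradient step on the log loss for a single (tokens, label) pair.
        error = self.predict_proba(tokens) - label
        for t in tokens:
            self.weights[t] -= self.learning_rate * error
        self.bias -= self.learning_rate * error


# Made-up stream of (tweet, label) pairs; 1 = positive, 0 = negative.
stream = [
    ("love this great day", 1),
    ("hate this awful day", 0),
    ("great love", 1),
    ("awful hate", 0),
] * 20

model = OnlineLogisticRegression()
for text, label in stream:
    # The model is updated as each record arrives, with no full retraining pass.
    model.partial_fit(text.split(), label)

p_pos = model.predict_proba("love great".split())
p_neg = model.predict_proba("hate awful".split())
```

The key property is that each update touches only the current record, so the model can keep learning from a stream without storing or revisiting the full dataset.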
- Data processing: extraction of useful data features
- ETL: cleaning and filtering of data, including removal of stop words, punctuation, URLs, repeated phrases, and encodings
- Visualizations of data distributions and word clouds
- Microservice implementation: each service can be used as a module in an external environment
- NLP using techniques such as tokenization, stemming, and lemmatization
- Model comparator, which compares the stats of multiple models and saves the best-performing model
- Model selector, which keeps checking for the best-performing model and selects the top model for production
- Clock function, which garbage-collects obsolete models and data files based on business rules
- Model run history, which covers all previous best runs of every model
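The model comparator and run history listed above could be sketched as follows. This is only an illustration under stated assumptions: the function name `select_best_model`, the metric key `f1`, the model names, and the scores are all placeholders, not results or identifiers from this project.

```python
def select_best_model(run_stats, metric="f1"):
    """Return the (name, stats) pair with the highest value of `metric`."""
    if not run_stats:
        raise ValueError("no model runs to compare")
    return max(run_stats.items(), key=lambda item: item[1][metric])


# Placeholder stats for three hypothetical models (not real results).
run_stats = {
    "model_a": {"accuracy": 0.70, "f1": 0.68},
    "model_b": {"accuracy": 0.74, "f1": 0.73},
    "model_c": {"accuracy": 0.72, "f1": 0.71},
}

best_name, best_stats = select_best_model(run_stats)

# A simple run history: one entry per comparison, recording the winner.
history = []
history.append({"best_model": best_name, **best_stats})
```

A selector for production could call the same function on a schedule and switch models only when `best_name` changes.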
TBU
- Data Cleaning, Filtering & Manipulation - Regular expressions, pandas DataFrames, NumPy arrays
- Data Visualization - Plotly, Seaborn, Matplotlib, word cloud
- Data Storage - local
- Webapp - TBA
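As a rough illustration of the regular-expression cleaning mentioned in this list, the sketch below lowercases a tweet, strips URLs and punctuation, collapses repeated characters, and drops stop words. The function name `clean_tweet`, the short `STOP_WORDS` set, and the choice to collapse repeated characters (rather than repeated phrases) are assumptions made for the example, not the project's actual rules.

```python
import re
import string

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "it"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # three or more of the same character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Return a list of cleaned tokens for one tweet."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove URLs
    text = REPEAT_RE.sub(r"\1\1", text)  # "sooooo" -> "soo"
    text = text.translate(PUNCT_TABLE)   # remove punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tokens = clean_tweet("The movie is sooooo GOOD!!! http://t.co/abc")
```

URLs are removed before punctuation so that the URL pattern can still match on `://` and dots.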
Download the project, then install the requirements by running the command below in a terminal opened in the project folder:
pip install -r /path/to/requirements.txt
- Implement logging at a modular level
- Exception Handling for data transformation and model selector
- Refactor model training and history into parameterized modules
- Implement clock function to remove obsolete models/data
- Create and load environment variable file
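For the planned clock function, one possible starting point is an age-based cleanup like the sketch below. The business rules are not specified here, so the function name `remove_obsolete_files`, the `max_age_seconds` threshold, the `keep` set, and the `.pkl` file names are all hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path


def remove_obsolete_files(directory, max_age_seconds, keep=(), now=None):
    """Delete files older than max_age_seconds, skipping names listed in keep."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        if not path.is_file() or path.name in keep:
            continue
        if now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


# Demo in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("model_old.pkl", "model_best.pkl", "model_new.pkl"):
        (Path(tmp) / name).write_bytes(b"")

    # Backdate two files by a week to simulate stale artifacts.
    week_ago = time.time() - 7 * 24 * 3600
    for name in ("model_old.pkl", "model_best.pkl"):
        os.utime(Path(tmp) / name, (week_ago, week_ago))

    removed = remove_obsolete_files(
        tmp, max_age_seconds=24 * 3600, keep={"model_best.pkl"}
    )
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

Passing the current production model's file name in `keep` prevents the cleanup from deleting it even when it is older than the threshold.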