Movie Rating Prediction with TensorFlow
- OS : Ubuntu 16.04+ / Windows 10
- CPU : any (quad core ~)
- GPU : GTX 1060 6GB ~
- RAM : 16GB ~
- Library : TF 1.x with CUDA 9.0~ + cuDNN 7.0~
- Python
- MySQL DB
- tensorflow 1.x
- numpy
- gensim, konlpy, and soynlp
- mecab-ko
- pymysql
- h5py
- tqdm
- (Optional) java 1.7+
- (Optional) PyKoSpacing
- (Optional) MultiTSNE (for visualization)
- (Optional) matplotlib (for visualization)
DataSet | Language | Sentences | Words | Size |
---|---|---|---|---|
NAVER Movie Review | Korean | 8.86M | 391K | About 1GB |
# Necessary
$ sudo python3 -m pip install -r requirements.txt
# Optional
$ sudo python3 -m pip install -r opt_requirements.txt
# `config.py` holds many parameters used by the scripts. Adjust them for your environment before running.
$ python3 movie-parser.py
$ python3 db.py
$ python3 preprocessing.py
usage: preprocessing.py [-h] [--load_from {db,csv}] [--vector {d2v,w2v}]
[--is_analyzed IS_ANALYZED]
Pre-Processing NAVER Movie Review Comment
optional arguments:
-h, --help show this help message and exit
--load_from {db,csv} load DataSet from db or csv
--vector {d2v,w2v} d2v or w2v
--is_analyzed IS_ANALYZED
already analyzed data
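The flags in the help text above can be reproduced with `argparse`. The sketch below is a reconstruction from that help text, not the actual `preprocessing.py`: the default values and the `str2bool` helper are assumptions added for illustration. The helper matters because `type=bool` would turn the string `"False"` into `True`.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False'-style strings into real booleans."""
    text = str(value).strip().lower()
    if text in ("true", "t", "yes", "y", "1"):
        return True
    if text in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    # Flag names, choices, and help strings come from the help text above;
    # the defaults are assumptions made for this sketch.
    parser = argparse.ArgumentParser(
        description="Pre-Processing NAVER Movie Review Comment")
    parser.add_argument("--load_from", choices=["db", "csv"], default="db",
                        help="load DataSet from db or csv")
    parser.add_argument("--vector", choices=["d2v", "w2v"], default="w2v",
                        help="d2v or w2v")
    parser.add_argument("--is_analyzed", type=str2bool, default=False,
                        help="already analyzed data")
    return parser


args = build_parser().parse_args(["--load_from", "csv", "--is_analyzed", "False"])
print(args.load_from, args.vector, args.is_analyzed)
```

With this helper, `--is_analyzed False` is parsed as the boolean `False` rather than as a non-empty (truthy) string.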
$ python3 main.py --refine_data [True or False]
usage: main.py [-h] [--checkpoint CHECKPOINT] [--refine_data REFINE_DATA]
train/test movie review classification model
optional arguments:
-h, --help show this help message and exit
--checkpoint CHECKPOINT
pre-trained model
--refine_data REFINE_DATA
solving data imbalance problem
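The `--refine_data` flag is described only as "solving data imbalance problem"; how `main.py` does this is not shown here. One common approach is to downsample every rating to the size of the rarest one. The sketch below illustrates that idea with a hypothetical helper, `downsample_by_rating`, and toy data. It should not be read as the repository's implementation.

```python
import random
from collections import defaultdict


def downsample_by_rating(samples, seed=42):
    """Keep the same number of reviews for every rating.

    samples: list of (review_text, rating) pairs.
    Every rating bucket is cut down to the size of the smallest bucket.
    """
    buckets = defaultdict(list)
    for text, rating in samples:
        buckets[rating].append((text, rating))
    if not buckets:
        return []

    smallest = min(len(items) for items in buckets.values())
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    balanced = []
    for rating in sorted(buckets):
        balanced.extend(rng.sample(buckets[rating], smallest))
    rng.shuffle(balanced)
    return balanced


# Toy data: rating 10 is heavily over-represented.
reviews = [("good %d" % i, 10) for i in range(8)] + [("bad 0", 1), ("bad 1", 1)]
balanced = downsample_by_rating(reviews)
```

Downsampling discards data, so with a heavily skewed rating distribution it trades training-set size for balance. Class weighting in the loss is an alternative that keeps every sample.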
│
├── comments (NAVER Movie Review DataSets)
│ ├── 10000.sql
│ ├── ...
│ └── 200000.sql
├── w2v (Word2Vec)
│ ├── ko_w2v.model (Word2Vec trained gensim model)
│ └── ...
├── d2v (Doc2Vec)
│ ├── ko_d2v.model (Doc2Vec trained gensim model)
│ └── ...
├── model (Movie Review Rate ML Models)
│ ├── textcnn.py
│ └── textrnn.py
├── image (explanation images)
│ └── *.png
├── ml_model (tf pre-trained model saved in here)
│ ├── checkpoint
│ ├── ...
│ └── charcnn-best_loss.ckpt
├── config.py (Configuration)
├── tfutil.py (handy tfutils)
├── dataloader.py (Doc/Word2Vec model loader)
├── movie-parser.py (NAVER Movie Review Parser)
├── db.py (DataBase processing)
├── preprocessing.py (Korean normalize/tokenize)
├── visualize.py (for visualizing w2v)
└── main.py (for easy use of train/test)
Pre-trained models are available for download via the Google Drive link.
- TextCNN
Credit: 1st-place solution of the Kaggle Toxic Comment Classification challenge
- TextRNN
Credit: 1st-place solution of the Kaggle Toxic Comment Classification challenge
The dataset is noisy, so the results are not as good as I expected :(
The raw sentences need to be refined and normalized!
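As a rough illustration of what such normalization could look like, the sketch below uses only the standard library `re` module. It is not the repository's `preprocessing.py`; the function name `normalize_review`, the allowed character set, and the repeat limit are all assumptions chosen for this example.

```python
import re

# Keep Hangul syllables and jamo, ASCII letters, digits, and basic punctuation.
_NOT_ALLOWED = re.compile(r"[^가-힣ㄱ-ㅎㅏ-ㅣa-zA-Z0-9.,!?\s]")
# Any character repeated three or more times in a row, e.g. "ㅋㅋㅋㅋㅋ".
_REPEATS = re.compile(r"(.)\1{2,}")
_SPACES = re.compile(r"\s+")


def normalize_review(text, max_repeat=2):
    """Drop unusual symbols, shorten repeated characters, collapse whitespace."""
    text = _NOT_ALLOWED.sub(" ", text)
    text = _REPEATS.sub(lambda m: m.group(1) * max_repeat, text)
    return _SPACES.sub(" ", text).strip()


print(normalize_review("이 영화 진짜 최고ㅋㅋㅋㅋㅋ!!!!  ♥♥"))
# prints: 이 영화 진짜 최고ㅋㅋ!!
```

Shortening repeated runs keeps the emotional signal ("ㅋㅋ", "!!") while preventing many variants of the same token from fragmenting the vocabulary.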
- TextCNN (Char2Vec)
Result : train MSE 1.553, val MSE 3.341
Hyper-Parameter : rand, conv kernel size [10,9,7,5,3], conv filters 256, drop out 0.7, fc unit 1024, adam, embed size 384
- TextCNN (Word2Vec)
Result : train MSE 3.410
Hyper-Parameter : non-static, conv kernel size [2,3,4,5], conv filters 256, drop out 0.7, fc unit 1024, adadelta, embed size 300
- TextRNN (Word2Vec)
Result : train MSE 3.646
Hyper-Parameter : non-static, rnn cells 128, attention 128, drop out 0.7, fc unit 1024, adadelta, embed size 300
- TextRNN (Char2Vec)
SOON!
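The results above are reported as mean squared error (MSE) between predicted and actual ratings. For reference, the sketch below computes MSE and its square root (RMSE) on made-up numbers. The ratings and predictions are toy values, not taken from the dataset, and the 1 to 10 rating scale is assumed.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values on an assumed 1-10 rating scale.
ratings = [10, 8, 1, 7]
predictions = [9.0, 8.5, 3.0, 7.0]

mse = mean_squared_error(ratings, predictions)
rmse = math.sqrt(mse)
print(mse)  # 1.3125
```

Taking the square root puts the error back in rating points: a validation MSE of 3.341 corresponds to an RMSE of roughly 1.83.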
To launch TensorBoard, run `tensorboard --logdir=./ml_model/`
Perplexity : 80
Learning rate : 10
Iteration : 310
- Deal with the word spacing problem
Any suggestions, PRs, and issues are WELCOME :)
HyeongChan Kim / @kozistr