Video captioning in PyTorch, based on hobincar/SA-LSTM.
- MSVD and MSR-VTT dataset EDA (see dataset_eda/dataeda.ipynb)
- 2d Feature extraction
- 3d Feature extraction (follow this issue)
- BUTD Feature extraction
- Temporal augmentation
- Joint-Hierarchical Attention Model
- Full pretrained models (CIDEr 50.3 for MSR-VTT, 97.1 for MSVD)
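
To give a rough idea of what the temporal augmentation item in the list above can look like, here is a minimal sketch of segment-based frame sampling: the clip is split into equal segments and one frame index is drawn at random from each segment during training, or taken from the segment center at evaluation time. This is an assumption about one common approach, not necessarily what this repository implements; `sample_frame_indices` and its parameters are hypothetical names used only for illustration.

```python
import random


def sample_frame_indices(num_frames, num_samples, rng=None, train=True):
    """Pick num_samples frame indices from a clip with num_frames frames.

    Hypothetical helper (not from this repository). The clip is divided
    into num_samples contiguous segments. In training mode one index is
    drawn uniformly from each segment, so every epoch sees a slightly
    different temporal view of the same video. In eval mode the center
    of each segment is used, which makes the result deterministic.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    rng = rng or random
    indices = []
    for i in range(num_samples):
        start = (i * num_frames) // num_samples
        end = ((i + 1) * num_frames) // num_samples
        # Short clips can produce empty segments; force at least one frame.
        end = max(end, start + 1)
        if train:
            idx = rng.randrange(start, end)
        else:
            idx = (start + end - 1) // 2
        indices.append(idx)
    return indices


if __name__ == "__main__":
    # Training: a different jittered sample each call.
    print(sample_frame_indices(100, 10, rng=random.Random(0)))
    # Evaluation: deterministic segment centers.
    print(sample_frame_indices(100, 10, train=False))
```

The returned indices could then be used to select rows from a per-frame feature array before feeding it to the captioning model. Clips shorter than `num_samples` simply repeat frames.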