This package is for testing Polish word sense disambiguation with BERT. The current focus is on running tests against the small plWordnet3-annotated corpus made for CoDeS. We compare the BERT embedding of the token to be disambiguated with embeddings of tokens that share its lemma and whose sense is known (because they appear in the reference corpus or in Wordnet glosses).
This is a work in progress. It is also intended to eventually replace the gibber code (better code quality, models etc.).
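The comparison described above can be illustrated with a minimal sketch. It assumes cosine similarity and a nearest-example decision rule; the package itself may use a different metric or aggregate the reference embeddings differently. The function names, the sense labels and the 3-dimensional toy vectors are hypothetical and stand in for real BERT embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_sense(target, sense_examples):
    """Return (sense_id, score) of the reference embedding closest to `target`.

    `sense_examples` maps a sense label to a list of embeddings of tokens
    with the same lemma that are known to carry that sense.
    """
    best_sense, best_score = None, -math.inf
    for sense_id, vectors in sense_examples.items():
        for vec in vectors:
            score = cosine(target, vec)
            if score > best_score:
                best_sense, best_score = sense_id, score
    return best_sense, best_score


# Toy data: hypothetical sense labels and made-up vectors for the lemma "zamek".
examples = {
    "zamek.1 (castle)": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "zamek.2 (lock)": [[0.1, 0.9, 0.2]],
}
sense, score = pick_sense([0.2, 0.8, 0.1], examples)
print(sense, round(score, 3))
```

With real data, each vector would be the contextual embedding of one occurrence of the lemma taken from the annotated corpus or from a Wordnet gloss.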
Requirements:
- Python 3.7 or newer
- pip
- virtualenv
- Docker
- Slavic BERT files for pytorch from DeepPavlov
- Polish Wordnet (Słowosieć) XML file
- the CoDeS small sense-annotated corpus of Polish
- optionally NKJP1M (i.e., the 1-million subcorpus)
- KRNNT (we install it below inside Docker)
To install:
docker pull djstrong/krnnt:1.0.1
virtualenv .
source bin/activate
pip3 install -r requirements.txt # this may be just pip on some platforms
deactivate
In one terminal window:
docker run -p 9003:9003 -it djstrong/krnnt:1.0.1
# To stop the container, press Ctrl+C
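Before starting a run in the second terminal, it can help to confirm that something is listening on the published port. The following is a small sketch using only the Python standard library; the helper name `port_is_open` is hypothetical and not part of this package, and the check only verifies that the TCP port accepts connections, not that KRNNT itself is answering correctly.

```python
import socket


def port_is_open(host="localhost", port=9003, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Port 9003 is open; the KRNNT container appears to be running.")
    else:
        print("Nothing is listening on port 9003; start the container first.")
```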
In another terminal window:
source bin/activate
# After you review local_settings.py, run this to see the options:
python3 run.py --help
# (this may be just python instead of python3 on your machine)
# Plain `python3 run.py` just trains and tests an embedding dictionary built from Wordnet and the training corpus.
# After you're done:
deactivate
To test:
source bin/activate
python3 test.py
# After you're done:
deactivate