Step 3: Final model

Pierre Lison edited this page Apr 19, 2021 · 2 revisions

The aggregation process produces a labelled corpus. The last step is to train a machine learning model (for instance for NER or sentiment analysis) on those automated annotations. Training a separate model generally makes the predictions more robust and less prone to overfitting, since the results depend less directly on the accuracy of each individual labelling function [1].

This final model can take a variety of forms, and skweak places no particular constraint on the type of model that can be applied.
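Because the choice of model is open, the aggregated spans often need to be converted into whatever input the chosen model expects. As a minimal sketch, assuming a token-level sequence labeller that takes BIO tags, the conversion could look like the following. The helper name `spans_to_bio` and the `(start, end, label)` tuple format are hypothetical and not part of skweak; with spaCy documents, the same three values are available on each span as `span.start`, `span.end` and `span.label_`.

```python
from typing import List, Tuple

# Hypothetical span format (not a skweak type):
# (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping token spans into one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        # First token of the span gets B-, the remaining ones get I-
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Toy example standing in for one aggregated document
tokens = ["Ada", "Lovelace", "lived", "in", "London"]
spans = [(0, 2, "PERSON"), (4, 5, "GPE")]
bio_tags = spans_to_bio(len(tokens), spans)
print(list(zip(tokens, bio_tags)))
```

The resulting tag sequences can then be paired with the tokens and passed to any sequence labelling toolkit.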

To train the final model with spaCy, first copy the aggregated spans into doc.ents and save the corpus from Python, then run spaCy's command-line training:

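import skweak.utils

# Python: copy the aggregated spans (stored here under the key "hmm") into doc.ents
for doc in docs:
    doc.ents = doc.spans["hmm"]
skweak.utils.docbin_writer(docs, "/path/to/corpus.spacy")

# Shell: generate a config on the fly and train a spaCy NER model on that corpus
spacy init config - --lang en --pipeline ner --optimize accuracy | \
spacy train - --paths.train /path/to/corpus.spacy --paths.dev /path/to/corpus.spacy \
--initialize.vectors en_core_web_md --output /path/to/trained_model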
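The command above uses the same file for training and development. If a separate development set is preferred, one option is to split the documents before writing them out. The sketch below is only an illustration: `split_train_dev` is a hypothetical helper, not part of skweak or spaCy, and the strings stand in for real documents. Note that a development set produced this way still carries the aggregated labels rather than gold annotations.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_dev(
    items: Sequence[T], dev_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the items reproducibly and split them into (train, dev)."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be strictly between 0 and 1")
    if len(items) < 2:
        raise ValueError("need at least two items to split")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one item in the development set
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Stand-in for the list of annotated documents
example_docs = [f"doc_{i}" for i in range(20)]
train_docs, dev_docs = split_train_dev(example_docs, dev_fraction=0.2)
print(len(train_docs), len(dev_docs))
```

With real documents, each part could then be written to its own file with skweak.utils.docbin_writer and passed to --paths.train and --paths.dev respectively.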

1: Some labelling functions may also be impossible to apply at prediction time. This is the case, for instance, when the annotations of crowd-workers are used as labelling functions. Labelling functions may also rely on additional resources that are available for historical data, but not for new, unseen texts.