Step 3: Final model

The result of the aggregation process is a labelled corpus. The last step is then to train a machine learning model (for instance for NER or sentiment analysis) on those automated annotations. Training a machine learning model makes the predictions more robust, since they depend less directly on the accuracy of each individual labelling function¹.

This final model can take a variety of forms, and skweak does not place any particular constraint on the type of model that can be applied.
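
As an illustration of feeding the aggregated annotations to a model other than spaCy, the sketch below converts token-level spans into BIO tags, a format many sequence-labelling toolkits accept. It is only a minimal example: the helper `spans_to_bio` is hypothetical (not part of skweak), and it assumes the spans have already been extracted as `(start, end, label)` tuples with an exclusive end index, for instance from the `start`, `end` and `label_` attributes of the spans in `doc.spans["hmm"]`.

```python
from typing import List, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, label) token spans (end exclusive) into BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        # First token of the span opens the entity, the rest continue it
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical document of 6 tokens with two non-overlapping spans
tags = spans_to_bio(6, [(0, 2, "PERSON"), (4, 5, "GPE")])
print(tags)  # ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE', 'O']
```

The sketch assumes the spans do not overlap, which should be checked before using it on real data.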

If you wish to use spaCy to train the final model, you can proceed as follows:

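import skweak

# Copy the aggregated "hmm" spans into doc.ents, which spaCy uses as NER labels
for doc in docs:
    doc.ents = doc.spans["hmm"]
# Write the documents to disk in spaCy's binary format
skweak.utils.docbin_writer(docs, "/path/to/corpus.spacy")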

spacy init config - --lang en --pipeline ner --optimize accuracy | \
spacy train - --paths.train /path/to/corpus.spacy --paths.dev /path/to/corpus.spacy \
--initialize.vectors en_core_web_md --output /path/to/trained_model
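
The command above passes the same file as both training and development data, so the scores reported during training are not measured on held-out documents. One possible way to get a separate development set is sketched below. The helper `train_dev_split` and its parameters are hypothetical (not part of skweak or spaCy), and plain strings stand in for the spaCy documents.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_dev_split(
    items: Sequence[T], dev_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the items reproducibly and hold out a fraction for development."""
    if len(items) < 2:
        raise ValueError("need at least two items to split")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one item in the development set
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Stand-ins for the annotated spaCy documents
example_docs = [f"doc_{i}" for i in range(10)]
train_docs, dev_docs = train_dev_split(example_docs)
print(len(train_docs), len(dev_docs))  # 8 2
```

Each list could then be written to its own file with `skweak.utils.docbin_writer` and passed to `--paths.train` and `--paths.dev` respectively.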

1: Some labelling functions may also be impossible to apply at prediction time. This is obviously the case when using the annotations of crowd-workers as labelling functions. Labelling functions may also rely on additional resources that are available for historical data, but not for new, unseen texts.
