
Step 1: Labelling functions


Labelling functions are at the core of skweak. They can be constructed in several ways. The key idea behind all labelling functions is that they take a Doc object as input and return a list of (token-level) spans with associated labels.

For sequence labelling, the spans simply correspond to the entities one wishes to detect. For text classification tasks (such as sentiment analysis), the span corresponds to the full text you wish to classify (which may be a sentence, or perhaps the full document).
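To make this concrete, here is a minimal sketch of a document-level labelling function for sentiment (the cue words and the POSITIVE label are purely illustrative), which yields a single span covering the whole document:

def sentiment_lf(doc):
    # Illustrative heuristic: label the entire document as POSITIVE
    # if it contains one of a few positive cue words
    if any(tok.lower_ in {"great", "excellent", "superb"} for tok in doc):
        yield 0, len(doc), "POSITIVE"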

Building labelling functions

base.SpanAnnotator

The generic class for all labelling functions is SpanAnnotator. A flexible way to create a labelling function is to create a child class of SpanAnnotator and implement the method find_spans(doc), like this:

from skweak import base

class MoneyDetector(base.SpanAnnotator):
    def __init__(self):
        super(MoneyDetector, self).__init__("money_detector")

    def find_spans(self, doc):
        # Searches for a currency symbol followed by a number, e.g. "$100"
        for tok in doc[1:]:
            if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
                yield tok.i-1, tok.i+1, "MONEY"

money_detector = MoneyDetector()

heuristics.FunctionAnnotator

If the function to apply is relatively simple and stateless, one can also use FunctionAnnotator to accomplish the same thing:

def money_detector(doc):
    for tok in doc[1:]:
        if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
            yield tok.i-1, tok.i+1, "MONEY"

money_detector = heuristics.FunctionAnnotator("money_detector", money_detector)

In both cases, you can also specify a number of "incompatible" labelling functions (see the to_exclude parameter) that should take precedence over the current function whenever their spans overlap. For instance, if we were to add a function to detect entities of type CARDINAL, we may want to skip numbers already labelled by money_detector, since such an entity is most likely a MONEY.
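For instance, a CARDINAL detector deferring to money_detector could be sketched as follows (the heuristic is illustrative, and we assume to_exclude accepts a list of source names):

def cardinal_fun(doc):
    # Illustrative heuristic: any number-like token is a CARDINAL candidate
    for tok in doc:
        if tok.like_num:
            yield tok.i, tok.i+1, "CARDINAL"

# Spans overlapping with the output of money_detector are skipped
cardinal_detector = heuristics.FunctionAnnotator("cardinal_detector", cardinal_fun,
                                                 to_exclude=["money_detector"])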

heuristics.TokenConstraintAnnotator

We sometimes wish to find all sequences of tokens that satisfy a given token-level constraint. For instance, we may wish to extract all spans of NNP tokens, and associate those with an ENT label:

nnp_detector = TokenConstraintAnnotator("nnp_detector", lambda tok: tok.tag_ in {"NNP", "NNPS"}, "ENT")

TokenConstraintAnnotator only extracts the longest subsequences in which every token satisfies the constraint. It is also possible to specify tokens that are exempted from the constraint when they occur in the middle of a sequence (see the method add_gap_tokens).
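For instance, one could allow hyphens and the preposition "of" to occur in the middle of an NNP sequence (the choice of gap tokens here is purely illustrative):

# A phrase like "Secretary-General of NATO" can then be covered by a single span
nnp_detector.add_gap_tokens(["-", "of"])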

heuristics.SpanConstraintAnnotator

We can create a similar type of labelling function where the constraint operates at the level of spans instead of tokens. This labelling function depends on having access to the result of another labelling function to generate candidate spans. For instance, one can use the result of the nnp_detector above and add the additional constraint that the span must contain more than one token:

multitoken_nnp_detector = SpanConstraintAnnotator("multitoken_nnp_detector", "nnp_detector", 
                                                  lambda span: len(span) > 1)

heuristics.SpanEditorAnnotator

Instead of specifying a constraint, we can provide an "editor" function which takes a span and returns another span or None. This can be done with SpanEditorAnnotator. For instance, we can extend the money detector we just created with a new function that checks whether the money entity is followed by thousand/million/billion, as in "$40 million", in which case we extend the span to include this last token:

def transform_money_span(span):
    last_token = span[-1]
    if last_token.n_rights and last_token.nbor(1).text in {"thousand", "million", "billion"}:
        return span.doc[span.start:span.end+1]
    else:
        return span

money_detector2 = SpanEditorAnnotator("money_detector2", "money_detector", transform_money_span)

heuristics.VicinityAnnotator

The last type of heuristic we can incorporate is based on cue words in the vicinity of a span. This labelling function again relies on the result of another function, such as nnp_detector. We then specify a list of cue words, along with a window size within which to search for those cue words (either in their original form or as lemmas):

# Typically, entities next to words like say/tell/listen etc. will be PERSON
cue_words = ["say", "indicate", "reply", "claim", "declare", "tell", "answer", "listen"]
VicinityAnnotator("person_detector", {word:"PERSON" for word in cue_words}, "nnp_detector", max_window=2)

Of course, this labelling function will "fail" from time to time, as a company can also be the agent or patient of those verbs. But it is worth stressing that labelling functions do not need to be perfect: the purpose of weak supervision is precisely to combine a set of weaker/noisier supervision signals, leading to a form of denoising. As long as we also integrate other labelling functions to detect e.g. ORG entities, the aggregation model should be able to infer the most likely labels based on the results of all functions and the estimated accuracy of each.

gazetteers.GazetteerAnnotator

Another way to create labelling functions is by compiling lists of entities to detect. This is known in NLP as a gazetteer. skweak comes equipped with a gazetteer implementation that can scale to very large lists with several million entries. Technically, this is done by using a trie to efficiently search for all possible occurrences.

If you have a small list, the easiest is to create the trie directly, and then load it into the gazetteer:

NAMES = [("Barack", "Obama"), ("Donald", "Trump"), ("Joe", "Biden")]
trie = Trie(NAMES)
lf3 = GazetteerAnnotator("presidents", {"PERSON":trie})

For larger lists, you can also store the entities in a JSON file (encoded as a dictionary of lists, with the keys representing the labels), and use the method extract_json_data to automatically build the tries.
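A minimal sketch (the file name and its content are hypothetical, and we assume extract_json_data returns a dictionary mapping each label to its trie):

# Expected JSON structure: {"PERSON": [["Barack", "Obama"], ...], "GPE": [["Oslo"], ...]}
tries = extract_json_data("data/entities.json")
gazetteer = GazetteerAnnotator("my_gazetteer", tries)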

One optional argument of GazetteerAnnotator controls whether the gazetteer is applied in case-sensitive or case-insensitive mode. In other words, if the trie contains the word "Belgium" but the text contains the word "belgium", should we consider it a match? Case-sensitive mode generally gives a higher precision but a lower recall, while the opposite is true for case-insensitive mode. A good idea is therefore to create two gazetteers (one in each mode) to get as much value as possible from the knowledge base.
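A possible setup, assuming the case-sensitivity flag is named case_sensitive:

# Two variants of the same gazetteer, one per casing mode
presidents_cased = GazetteerAnnotator("presidents_cased", {"PERSON": trie}, case_sensitive=True)
presidents_uncased = GazetteerAnnotator("presidents_uncased", {"PERSON": trie}, case_sensitive=False)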

spacy.ModelAnnotator

Labelling functions can also take the form of machine learning models. In principle, any machine learning model can be employed, as long as we can implement the method find_spans. If your model is a spaCy NER model, it becomes even easier:

spacy_model = spacy.ModelAnnotator("spacy", "en_core_web_sm")

spacy.TruecaseAnnotator

Texts that are not properly cased can cause problems for neural NER models trained on "clean" texts. One variant of the model above applies a "truecasing" operation as a preprocessing step. This requires a JSON file containing the most common casing variants for a list of words:

form_frequencies = "data/form_frequencies.json"
spacy_model2 = spacy.TruecaseAnnotator("spacy_truecase", "en_core_web_sm", form_frequencies)

doclevel.DocumentMajorityAnnotator

skweak also makes it possible to create document-level labelling functions that rely on the global document context. In particular, skweak includes a labelling function that takes advantage of label consistency within a document. Entities occurring multiple times throughout a document are highly likely to belong to the same category [1]. For instance, while Komatsu may refer either to a Japanese town or to a multinational corporation, a text mentioning Komatsu will either be about the town or the company, but rarely both at the same time.

This label consistency can be exploited with the DocumentMajorityAnnotator. The labelling function starts from the predictions of another labelling function and computes, for each entity string, the frequency of each label. It then selects the most common label and predicts it for every mention of the entity throughout the document.

Which labelling function should be used to count the label frequencies? Ideally, we would like to take all labelling functions into account, which means we need to aggregate them. The easiest option is to use a MajorityVoter to quickly aggregate all labelling functions. Alternatively, one can also rely on the full generative model (see Step 2), but this means we need to fit the model twice (first without the document-level functions, and then with those functions included).

Here is an example of use:

maj_voter = MajorityVoter("doclevel_voter", ["LOCATION", "ORGANIZATION", "PERSON"],
                          initial_weights={"doc_majority": 0.0}) # do not include doc_majority itself in the vote

doc_majority = DocumentMajorityAnnotator("doc_majority", "doclevel_voter", ["LOCATION", "ORGANIZATION", "PERSON"])

As with gazetteers, DocumentMajorityAnnotator can be run in case-sensitive or case-insensitive mode (and we recommend creating labelling functions for both modes).
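A sketch of the case-insensitive variant, assuming a case_sensitive flag analogous to the gazetteer's:

doc_majority_uncased = DocumentMajorityAnnotator("doc_majority_uncased", "doclevel_voter",
                                                 ["LOCATION", "ORGANIZATION", "PERSON"],
                                                 case_sensitive=False)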

doclevel.DocumentHistoryAnnotator

When introduced for the first time, named entities are often referred to in an explicit and univocal manner, while subsequent mentions (once the entity is a part of the focus structure) frequently rely on shorter references. For instance, the first mention of a person in a news article will typically use the full name of that person, while subsequent mentions may only rely on the person's last name.

This insight can also be exploited as part of a labelling function. Using DocumentHistoryAnnotator, we can propagate the label from an entity's first mention to all subsequent mentions within the same document. The use of this labelling function is very similar to the one above:

maj_voter = MajorityVoter("doclevel_voter", ["ORGANIZATION", "PERSON"],
                          initial_weights={"doc_history": 0.0})

doc_history = DocumentHistoryAnnotator("doc_history", "doclevel_voter", ["ORGANIZATION", "PERSON"])

Applying labelling functions

Once all labelling functions are defined, you can apply them to your documents. To apply a single labelling function to a document, simply run:

annotator(doc)

which returns the same document, completed with the new annotations from the labelling function. To verify that your function worked correctly, you can inspect the content of doc.spans["name_of_your_labelling_function"], which should contain the detected spans along with their labels (accessible via the attribute label_).
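For instance, after applying the money detector defined earlier:

doc = money_detector(doc)
for span in doc.spans["money_detector"]:
    print(span.start, span.end, span.text, span.label_)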

If you have defined many labelling functions (which you should), it is often easier to regroup them. This can be done using base.CombinedAnnotator. Simply add each labelling function:

combined = CombinedAnnotator()
combined.add_annotator(money_detector)
...
combined.add_annotator(doc_history)

You can then easily apply all labelling functions to a document. If you have many documents, it is also often faster to run:

docs = list(combined.pipe(docs))

to apply all labelling functions to the full corpus. Note that the implementation of pipe is lazy. You can easily store such collections of documents using the DocBin format (see the previous section).
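For instance, a sketch using the docbin_writer and docbin_reader utilities from skweak.utils (the file name is illustrative):

import skweak

# Serialise the annotated documents to a DocBin file...
skweak.utils.docbin_writer(docs, "annotated_corpus.spacy")
# ...and read them back later
docs = list(skweak.utils.docbin_reader("annotated_corpus.spacy"))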


[1] See V. Krishnan and C. D. Manning, "An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition", ACL 2006.