kazu.steps.ner.tokenized_word_processor
Classes
MultilabelSpanFinder | A span finder that can handle multiple labels per token, as opposed to the standard 'BIO' format.
SimpleSpanFinder | Since BERT-like models use WordPiece tokenisers to handle the OOV problem, we need to reconstitute this info back into document character indices to represent actual NEs.
TokWordSpan | Dataclass for a span (i.e. a list[TokenizedWord] representing an NE).
TokenizedWord | A convenient container for a word, which may be split into multiple tokens by e.g. WordPiece tokenisation.
TokenizedWordProcessor | Because of the inherent obscurity of the inner workings of transformers, sometimes they produce BIO tags that don't correctly align to whole words, or maybe the classic BIO format gets confused by nested entities.
- class kazu.steps.ner.tokenized_word_processor.MultilabelSpanFinder
Bases: SpanFinder
A span finder that can handle multiple labels per token, as opposed to the standard ‘BIO’ format.
- __call__(words)
Find spans of entities.
- Parameters:
words (list[TokenizedWord])
- Returns:
The spans found
- Return type:
list[TokWordSpan]
- close_spans(class_label)
Close any active spans.
- Parameters:
class_label (str)
- Return type:
None
- get_class_labels(word)
- Parameters:
word (TokenizedWord)
- Return type:
- process_next_word(word)
Process the next word in the sequence, updating span information accordingly.
- Parameters:
word (TokenizedWord)
- Return type:
None
- span_continue_condition(word, class_labels)
A potential entity span will end if any of the following conditions is met:
Any of the BIO classes for word is O.
The character immediately preceding word is in the set self.span_breaking_chars.
- Parameters:
word (TokenizedWord)
class_labels
- Returns:
- Return type:
- start_span(class_label, word)
Start a new TokWordSpan for the given class label.
- Parameters:
class_label (str) – the label to use
word (TokenizedWord) – the word to start the span with
- Return type:
None
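The following is a minimal, self-contained sketch of the multi-label idea described above, not the kazu implementation. The `Word` class, the `find_multilabel_spans` function and the default set of span-breaking characters are all hypothetical. It also assumes that a span for a given label ends when the next word does not carry that label, or when the character before the word is span-breaking.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """Hypothetical stand-in for TokenizedWord: text, character offsets, labels."""

    text: str
    start: int
    end: int
    labels: set = field(default_factory=set)  # an empty set plays the role of "O"


def find_multilabel_spans(text, words, span_breaking_chars=frozenset("()[]{},;")):
    """Return (label, start_char, end_char) spans; one word may feed several spans."""
    active = {}  # label -> words of the span currently open for that label
    closed = []

    def close(label):
        span_words = active.pop(label)
        closed.append((label, span_words[0].start, span_words[-1].end))

    for word in words:
        # Assumption: a span-breaking character directly before the word ends all open spans.
        breaks = word.start > 0 and text[word.start - 1] in span_breaking_chars
        for label in list(active):
            if breaks or label not in word.labels:
                close(label)
        # Start a new span, or extend the open one, for every label on this word.
        for label in sorted(word.labels):
            active.setdefault(label, []).append(word)

    # Close whatever is still open at the end of the sequence.
    for label in list(active):
        close(label)
    return closed
```

Because each label keeps its own open span, overlapping entities of different classes can be represented, which a single BIO tag per word cannot express.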
- class kazu.steps.ner.tokenized_word_processor.SimpleSpanFinder
Bases: SpanFinder
Since BERT-like models use WordPiece tokenisers to handle the out-of-vocabulary (OOV) problem, we need to map the resulting tokens back onto document character indices to represent actual named entities (NEs).
Since transformer NER is an imprecise art, the logic for doing this may need to vary, so this class can be subclassed to change how it is done.
The __call__ method of this class operates on a list of TokenizedWord, processing each sequentially according to logic determined by the implementing class. It returns the spans found, as a list of TokWordSpan. After the call, the spans can also be accessed via self.closed_spans.
- __call__(words)
Find spans of entities.
- Parameters:
words (list[TokenizedWord])
- Returns:
The spans found
- Return type:
list[TokWordSpan]
- get_bio_and_class_labels(word)
Return a set of tuple[<BIO label>,Optional[<class label>]] for a TokenizedWord. Optional[<class label>] is None if the BIO label is “O”.
- Parameters:
word (TokenizedWord)
- Return type:
set[tuple[str, str | None]]
- process_next_word(word)
Process the next word in the sequence, updating span information accordingly.
- Parameters:
word (TokenizedWord)
- Return type:
None
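As a rough illustration of sequential BIO processing, the sketch below splits labels into (BIO label, class label) pairs and groups consecutive words into spans. It is not the kazu implementation: the function names are hypothetical, the "B-gene" / "I-gene" / "O" label format is an assumption, and for simplicity each word carries a single label and spans are reported as word indices rather than character offsets.

```python
from typing import Optional


def bio_and_class(label: str) -> tuple[str, Optional[str]]:
    """Split an assumed 'B-gene' style label; the class is None for 'O'."""
    if label == "O":
        return "O", None
    bio, _, clazz = label.partition("-")
    return bio, clazz


def find_bio_spans(word_labels: list[str]) -> list[tuple[str, int, int]]:
    """Return (class, first_word_index, last_word_index) for each span found."""
    spans = []
    current = None  # (class, index of the first word) of the open span, if any
    for i, label in enumerate(word_labels):
        bio, clazz = bio_and_class(label)
        # Only an I tag of the same class continues the open span.
        continues = bio == "I" and current is not None and current[0] == clazz
        if not continues and current is not None:
            spans.append((current[0], current[1], i - 1))
            current = None
        # B always opens a span; a stray I with nothing open is treated leniently as B.
        if bio in ("B", "I") and current is None:
            current = (clazz, i)
    if current is not None:
        spans.append((current[0], current[1], len(word_labels) - 1))
    return spans
```

Treating a stray I tag as the start of a span is one possible design choice for noisy model output; a stricter finder could discard it instead.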
- class kazu.steps.ner.tokenized_word_processor.SpanFinder
Bases: ABC
- abstract __call__(words)
Find spans of entities.
- Parameters:
words (list[TokenizedWord])
- Returns:
The spans found
- Return type:
list[TokWordSpan]
- class kazu.steps.ner.tokenized_word_processor.TokWordSpan
Bases: object
Dataclass for a span (i.e. a list[TokenizedWord] representing an NE).
- __init__(clazz, tok_words=<factory>)
- Parameters:
clazz (str)
tok_words (list[TokenizedWord])
- Return type:
None
- tok_words: list[TokenizedWord]
Words associated with this span.
- class kazu.steps.ner.tokenized_word_processor.TokenizedWord
Bases: object
A convenient container for a word, which may be split into multiple tokens by e.g. WordPiece tokenisation.
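To make the idea of a word container concrete, here is a small sketch that regroups WordPiece tokens into whole words and recovers each word's character span. It does not reproduce the fields of TokenizedWord, which are not listed here: `WordSketch` and `group_wordpieces` are hypothetical names, and the sketch assumes the BERT convention that continuation pieces are prefixed with "##" and that per-token character offsets are available.

```python
from dataclasses import dataclass, field


@dataclass
class WordSketch:
    """Hypothetical container: the tokens of one word and their character offsets."""

    tokens: list = field(default_factory=list)
    offsets: list = field(default_factory=list)  # (start, end) per token

    @property
    def char_span(self):
        # The word runs from the start of its first token to the end of its last.
        return self.offsets[0][0], self.offsets[-1][1]


def group_wordpieces(tokens, offsets):
    """Group tokens into words, attaching '##' pieces to the preceding word."""
    words = []
    for token, offset in zip(tokens, offsets):
        if token.startswith("##") and words:
            word = words[-1]
        else:
            word = WordSketch()
            words.append(word)
        word.tokens.append(token)
        word.offsets.append(offset)
    return words
```

For the text "hydroxychloroquine treats", the tokens "hydroxy" and "##chloroquine" end up in one word whose character span covers the whole of "hydroxychloroquine".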
- class kazu.steps.ner.tokenized_word_processor.TokenizedWordProcessor
Bases: object
Because of the inherent obscurity of the inner workings of transformers, they sometimes produce BIO tags that don’t correctly align to whole words, or the classic BIO format may be confused by nested entities.
This class is designed to work when an entire sequence of NER labels is known, so that some post-processing logic can be applied. Namely, it uses a SpanFinder to find entity spans according to that finder’s internal logic.
- __init__(labels, use_multilabel=False, strip_re=None)
- Parameters:
labels (Iterable[str]) – mapping of label int id to str label
use_multilabel (bool) – whether to use multilabel classification (needs to be supported by the model)
strip_re (dict[str, str] | None) – an optional dict of {<entity_class>:<python regex to remove>} to process NER results that the model frequently misclassifies.
- attempt_strip_suffixes(start, end, match_str, clazz)
Transformers sometimes get confused about precise offsets, depending on the training data (e.g. “COX2” is often classified as “COX2 gene”). This method offers light post-processing to correct these, for better entity linking results.
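The kind of post-processing described here can be sketched as follows. This is an illustration under stated assumptions rather than the method's actual body: `strip_suffix` is a hypothetical helper, the `strip_re` dict is passed in explicitly, and the example pattern is assumed to be anchored to the end of the match so that only the end offset needs adjusting.

```python
import re


def strip_suffix(start, end, match_str, clazz, strip_re):
    """Remove a class-specific suffix from match_str and shrink the end offset to fit."""
    pattern = strip_re.get(clazz)
    if pattern is None:
        # No pattern configured for this entity class: leave the match untouched.
        return start, end, match_str
    stripped = re.sub(pattern, "", match_str)
    # Accept the result only if something remains and only the tail was removed.
    if not stripped or not match_str.startswith(stripped):
        return start, end, match_str
    return start, start + len(stripped), stripped


# Hypothetical configuration: drop a trailing " gene" from gene mentions.
example_strip_re = {"gene": r"\s+gene$"}
```

With this configuration, a gene match "COX2 gene" at characters 10 to 19 is reduced to "COX2" at characters 10 to 14, while matches of other classes pass through unchanged.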
- calculate_span_offsets(words)
- Parameters:
words (list[TokenizedWord])
- Return type: