kazu.steps.ner.tokenized_word_processor

Classes

MultilabelSpanFinder

A span finder that can handle multiple labels per token, as opposed to the standard 'BIO' format.

SimpleSpanFinder

Since BERT-like models use wordpiece tokenisers to handle the OOV problem, we need to reconstitute the wordpiece-level information back into document character indices in order to represent actual NEs.

SpanFinder

Abstract base class for finding entity spans from a sequence of TokenizedWords.

TokWordSpan

Dataclass for a span (i.e. a list[TokenizedWord] representing an NE).

TokenizedWord

A convenient container for a word, which may be split into multiple tokens by e.g. WordPiece tokenisation.

TokenizedWordProcessor

Because the inner workings of transformers are somewhat opaque, they sometimes produce BIO tags that don't align correctly to whole words, or the classic BIO format may be confused by nested entities.

class kazu.steps.ner.tokenized_word_processor.MultilabelSpanFinder[source]

Bases: SpanFinder

A span finder that can handle multiple labels per token, as opposed to the standard ‘BIO’ format.
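
A minimal construction sketch. Note that in a pipeline the finder is typically created via TokenizedWordProcessor.make_span_finder, and the id2label values shown here (plain class labels with no B-/I- prefixes) are an assumption for illustration:

    from kazu.steps.ner.tokenized_word_processor import MultilabelSpanFinder

    # id2label maps model output ids to class labels; the contents are illustrative.
    finder = MultilabelSpanFinder(
        text="EGFR is a receptor tyrosine kinase",
        id2label={0: "gene", 1: "drug"},
    )
    # spans = finder(words)  # words: list[TokenizedWord] produced by the NER step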

__call__(words)[source]

Find spans of entities.

Parameters:

words (list[TokenizedWord])

Returns:

The spans found

Return type:

list[TokWordSpan]

__init__(text, id2label)[source]
Parameters:
  • text (str) – the raw text to be processed

  • id2label (dict[int, str]) – id to class label mappings

close_spans(class_label)[source]

Close any active spans.

Parameters:

class_label (str)

Return type:

None

get_class_labels(word)[source]
Parameters:

word (TokenizedWord)

Return type:

set[str]

process_next_word(word)[source]

Process the next word in the sequence, updating span information accordingly.

Parameters:

word (TokenizedWord)

Return type:

None

span_continue_condition(word, class_labels)[source]

A potential entity span will end if any of the following conditions are met:

  1. Any of the BIO classes for the word are O

  2. The character preceding the word is in the set of self.span_breaking_chars

Parameters:
  • word (TokenizedWord)

  • class_labels (set[str])

Return type:

bool

start_span(class_label, word)[source]

Start a new TokWordSpan for the given class label.

Parameters:
  • class_label (str) – the label to use

  • word (TokenizedWord) – the word to start the span with

Return type:

None

class kazu.steps.ner.tokenized_word_processor.SimpleSpanFinder[source]

Bases: SpanFinder

Since BERT-like models use wordpiece tokenisers to handle the OOV problem, we need to reconstitute the wordpiece-level information back into document character indices in order to represent actual NEs.

Since transformer NER is an imprecise art, we may want to vary the logic involved, so this class can be subclassed to determine how spans should be built.

The __call__ method of this class operates on a list of TokenizedWord, processing each sequentially according to logic determined by the implementing class. It returns the spans found: a list of TokWordSpan.

After the call, the spans can also be accessed via self.closed_spans.
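
A minimal usage sketch, assuming the id2label mapping below and leaving the construction of the TokenizedWord list to the upstream NER step:

    from kazu.steps.ner.tokenized_word_processor import SimpleSpanFinder, TokenizedWord

    id2label = {0: "O", 1: "B-gene", 2: "I-gene"}  # illustrative BIO label mapping
    finder = SimpleSpanFinder(text="EGFR mutations", id2label=id2label)

    words: list[TokenizedWord] = []  # normally populated from the transformer output
    spans = finder(words)            # list[TokWordSpan]
    later = finder.closed_spans      # the found spans remain accessible after the call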

__call__(words)[source]

Find spans of entities.

Parameters:

words (list[TokenizedWord])

Returns:

The spans found

Return type:

list[TokWordSpan]

__init__(text, id2label)[source]
Parameters:
  • text (str) – the raw text to be processed

  • id2label (dict[int, str]) – id to BIO-class label mappings

close_spans()[source]

Close any active spans.

get_bio_and_class_labels(word)[source]

Return a set of tuple[<BIO label>,Optional[<class label>]] for a TokenizedWord. Optional[<class label>] is None if the BIO label is “O”.

Parameters:

word (TokenizedWord)

Return type:

set[tuple[str, str | None]]

process_next_word(word)[source]

Process the next word in the sequence, updating span information accordingly.

Parameters:

word (TokenizedWord)

Return type:

None

span_continue_condition(word, bio_and_class_labels)[source]

A potential entity span will end if any of the following conditions are met (a standalone sketch of these rules follows this entry):

  1. Any of the BIO classes for the word are O

  2. The character preceding the word is in the set of self.span_breaking_chars

Parameters:
  • word (TokenizedWord)

  • bio_and_class_labels (set[tuple[str, str | None]])

Return type:

bool
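
A standalone sketch of the two continuation rules listed above. This is not kazu's implementation: the span_breaking_chars value and the exact return semantics are assumptions made for illustration:

    from __future__ import annotations

    def span_should_continue(
        text: str,
        word_char_start: int,
        bio_and_class_labels: set[tuple[str, str | None]],
        span_breaking_chars: frozenset[str] = frozenset(";,"),
    ) -> bool:
        # Rule 1: an "O" prediction for the word ends the span.
        if any(bio == "O" for bio, _ in bio_and_class_labels):
            return False
        # Rule 2: a span-breaking character immediately before the word ends the span.
        prev_char = text[word_char_start - 1] if word_char_start > 0 else ""
        return prev_char not in span_breaking_chars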

start_span(bio_and_class_labels, word)[source]

Start a new TokWordSpan if a B label is detected.

Parameters:
  • bio_and_class_labels (set[tuple[str, str | None]])

  • word (TokenizedWord)

Return type:

None

class kazu.steps.ner.tokenized_word_processor.SpanFinder[source]

Bases: ABC

abstract __call__(words)[source]

Find spans of entities.

Parameters:

words (list[TokenizedWord])

Returns:

The spans found

Return type:

list[TokWordSpan]

__init__(text, id2label)[source]
Parameters:
  • text (str) – the raw text to be processed

  • id2label (dict[int, str]) – id to class label mappings

Return type:

None

class kazu.steps.ner.tokenized_word_processor.TokWordSpan[source]

Bases: object

Dataclass for a span (i.e. a list[TokenizedWord] representing an NE).

__init__(clazz, tok_words=<factory>)[source]
Parameters:
  • clazz (str)

  • tok_words (list[TokenizedWord])

Return type:

None

clazz: str

the entity class of this span

tok_words: list[TokenizedWord]

words associated with this span

class kazu.steps.ner.tokenized_word_processor.TokenizedWord[source]

Bases: object

A convenient container for a word, which may be split into multiple tokens by e.g. WordPiece tokenisation.

__init__(token_ids, tokens, token_confidences, token_offsets, word_char_start, word_char_end, word_id)[source]
Parameters:
  • token_ids (list[int])

  • tokens (list[str])

  • token_confidences (Tensor)

  • token_offsets (list[tuple[int, int]])

  • word_char_start (int)

  • word_char_end (int)

  • word_id (int)

Return type:

None

token_confidences: Tensor

softmax over the token logits (per-token confidence scores)

token_ids: list[int]
token_offsets: list[tuple[int, int]]

character indices of tokens

tokens: list[str]

token string representations

word_char_end: int

char end index of word

word_char_start: int

char start index of word

word_id: int
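
A construction sketch for the container above. All values are fabricated; in particular, the shape of token_confidences (one softmax row per wordpiece token over the model's labels) is an assumption about how the NER step populates it:

    import torch

    from kazu.steps.ner.tokenized_word_processor import TokenizedWord

    # the word "EGFR" at document offsets 0-4, split by WordPiece into two tokens
    word = TokenizedWord(
        token_ids=[1234, 5678],
        tokens=["EG", "##FR"],
        token_confidences=torch.tensor([[0.01, 0.97, 0.02],
                                        [0.02, 0.03, 0.95]]),
        token_offsets=[(0, 2), (2, 4)],
        word_char_start=0,
        word_char_end=4,
        word_id=0,
    )
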
class kazu.steps.ner.tokenized_word_processor.TokenizedWordProcessor[source]

Bases: object

Because the inner workings of transformers are somewhat opaque, they sometimes produce BIO tags that don't align correctly to whole words, or the classic BIO format may be confused by nested entities.

This class is designed to work when an entire sequence of NER labels is known and therefore we can apply some post-processing logic. Namely, we use a SpanFinder to find entity spans according to its internal logic.
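
A minimal end-to-end sketch, with an empty word list standing in for real model output (the label list and namespace string are also illustrative):

    from kazu.steps.ner.tokenized_word_processor import TokenizedWord, TokenizedWordProcessor

    processor = TokenizedWordProcessor(labels=["O", "B-gene", "I-gene"], use_multilabel=False)

    words: list[TokenizedWord] = []  # normally produced by the transformer NER step
    entities = processor(words, text="EGFR mutations", namespace="example_ner_step")
    # entities: list[Entity], ready for downstream linking steps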

__call__(words, text, namespace)[source]

Find entity spans in the words and convert them into Entity instances for the given text and namespace.

Parameters:
  • words (list[TokenizedWord])

  • text (str)

  • namespace (str)

Return type:

list[Entity]

__init__(labels, use_multilabel=False, strip_re=None)[source]
Parameters:
  • labels (Iterable[str]) – the str labels the model can produce, used to build an id to label mapping

  • use_multilabel (bool) – whether to use multilabel classification (needs to be supported by the model)

  • strip_re (dict[str, str] | None) – an optional dict of {<entity_class>:<python regex to remove>} to process NER results that the model frequently misclassifies.

attempt_strip_suffixes(start, end, match_str, clazz)[source]

Transformers sometimes get confused about precise offsets, depending on the training data (e.g. “COX2” is often classified as “COX2 gene”). This method offers light post-processing to correct these, for better entity linking results.

Parameters:
  • start (int) – original start

  • end (int) – original end

  • match_str (str) – original string

  • clazz (str) – entity class

Returns:

new string, new end

Return type:

tuple[str, int]
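
A hedged sketch of the strip_re post-processing described above. The regex below and the expected result assume that matches of the configured pattern are simply removed from the matched string; the exact details may differ:

    from kazu.steps.ner.tokenized_word_processor import TokenizedWordProcessor

    processor = TokenizedWordProcessor(
        labels=["O", "B-gene", "I-gene"],
        strip_re={"gene": r"\s*gene$"},
    )
    new_str, new_end = processor.attempt_strip_suffixes(
        start=0, end=9, match_str="COX2 gene", clazz="gene"
    )
    # expected under the assumption above: new_str == "COX2", new_end == 4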

calculate_span_offsets(words)[source]
Parameters:

words (list[TokenizedWord])

Return type:

tuple[int, int]

static id2labels_from_label_list(labels)[source]

Build an id to label mapping from an iterable of str labels.

Parameters:

labels (Iterable[str])

Return type:

dict[int, str]
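
A small usage example (the id assignment by enumeration order is an assumption based on the signature):

    from kazu.steps.ner.tokenized_word_processor import TokenizedWordProcessor

    id2label = TokenizedWordProcessor.id2labels_from_label_list(["O", "B-gene", "I-gene"])
    # presumably {0: "O", 1: "B-gene", 2: "I-gene"}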

make_span_finder(text)[source]

Create the appropriate SpanFinder for the given text.

Parameters:

text (str)

Return type:

SpanFinder

spans_to_entities(spans, text, namespace)[source]

Convert spans to instances of Entity, adding in namespace info as appropriate.

Parameters:
  • spans (list[TokWordSpan]) – list of TokWordSpan to consider

  • text (str) – original text

  • namespace (str) – namespace to add to Entity

Return type:

list[Entity]