kazu.steps.ner.hf_token_classification

Classes

class kazu.steps.ner.hf_token_classification.HFDataset[source]

Bases: IterableDataset[dict[str, Any]]

__init__(encodings, keys_to_use)[source]

Simple implementation of torch.utils.data.IterableDataset, producing HF tokenizer input_ids; a usage sketch follows the parameter list.

Parameters:
  • encodings (BatchEncoding)

  • keys_to_use (Iterable[str]) – the keys to use from the encodings (not all models require token_type_ids)
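
A usage sketch, assuming each yielded item is a per-frame dict containing only the requested keys (the model name and text are illustrative; a distilbert-style tokenizer produces no token_type_ids, so that key is omitted from keys_to_use):

    from transformers import AutoTokenizer

    from kazu.steps.ner.hf_token_classification import HFDataset

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")  # assumed model
    encodings = tokenizer(
        ["EGFR is frequently mutated in lung adenocarcinoma."],
        return_overflowing_tokens=True,
        max_length=128,
        stride=16,
        truncation=True,
    )
    dataset = HFDataset(encodings, keys_to_use=("input_ids", "attention_mask"))
    for item in dataset:
        print(sorted(item))  # ['attention_mask', 'input_ids'] for each frame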

class kazu.steps.ner.hf_token_classification.TransformersModelForTokenClassificationNerStep[source]

Bases: Step

A wrapper for transformers.AutoModelForTokenClassification.

This implementation uses a sliding-window approach to process large documents that don’t fit into the maximum sequence length allowed by a model. The resulting token labels are then post-processed by TokenizedWordProcessor.
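
A sketch of the sliding-window idea (model name and text are illustrative assumptions): a string longer than max_sequence_length is split into overlapping frames, each frame is classified independently, and the overlapping token labels are then reconciled by TokenizedWordProcessor.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed model
    long_text = "BRCA1 and BRCA2 are tumour suppressor genes. " * 50
    enc = tokenizer(
        long_text,
        max_length=64,
        stride=16,
        truncation=True,
        return_overflowing_tokens=True,
    )
    print(len(enc["input_ids"]))              # number of overlapping frames
    print(enc["overflow_to_sample_mapping"])  # every frame maps back to sample 0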

__call__(docs)[source]

Process documents and respond with processed and failed documents.

Note that many steps will be decorated by document_iterating_step() or document_batch_step() which will modify the ‘original’ __call__ function signature to match the expected signature for a step, as the decorators handle the exception/failed documents logic for you.

Parameters:

docs (list[Document])

Returns:

The first element is all the provided docs (now modified by the processing), the second is the docs that failed to (fully) process correctly.

Return type:

tuple[list[Document], list[Document]]
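
A usage sketch, assuming a configured TransformersModelForTokenClassificationNerStep instance named step (see __init__ below) and kazu's Document.create_simple_document convenience constructor:

    from kazu.data import Document

    docs = [Document.create_simple_document("Patients with EGFR T790M often respond to osimertinib.")]
    processed, failed = step(docs)
    for section in processed[0].sections:
        print(section.entities)  # entities produced by the token classification model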

__init__(path, batch_size, stride, max_sequence_length, tokenized_word_processor, keys_to_use, entity_splitter=None, device='cpu')[source]
Parameters:
  • path (str) – path to HF model, config and tokenizer. Passed to HF .from_pretrained()

  • batch_size (int) – batch size for dataloader

  • stride (int) – passed to HF tokenizers (for splitting long docs)

  • max_sequence_length (int) – passed to HF tokenizers (for splitting long docs)

  • tokenized_word_processor (TokenizedWordProcessor)

  • keys_to_use (Iterable[str]) – keys to use from the encodings. Note that this varies depending on the flavour of BERT model (e.g. distilbert does not use token_type_ids)

  • entity_splitter (NonContiguousEntitySplitter | None) – to detect non-contiguous entities if provided

  • device (str) – device to run the model on. Defaults to “cpu”
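
A hedged construction sketch for the parameters above. The model path, keys_to_use and TokenizedWordProcessor construction are illustrative assumptions; in kazu these are normally supplied via the Hydra configuration.

    from kazu.steps.ner.hf_token_classification import (
        TransformersModelForTokenClassificationNerStep,
    )
    from kazu.steps.ner.tokenized_word_processor import TokenizedWordProcessor

    word_processor = TokenizedWordProcessor(...)  # placeholder: see that class's docs for its arguments
    step = TransformersModelForTokenClassificationNerStep(
        path="/models/my-ner-model",  # assumed local HF model directory
        batch_size=4,
        stride=16,
        max_sequence_length=128,
        tokenized_word_processor=word_processor,
        keys_to_use=("input_ids", "attention_mask", "token_type_ids"),
        device="cpu",
    )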

frame_to_tok_word(batch_encoding, number_of_frames, frame_index, section_frame_index, predictions)[source]

Depending on the number of frames generated by a string of text, and whether it is the first or last frame, we need to return different subsets of the frame offsets and frame word_ids.

Parameters:
  • batch_encoding (BatchEncoding)

  • number_of_frames (int) – number of frames created by the tokenizer for the string

  • frame_index (int) – the index of the query frame, relative to the total number of frames

  • section_frame_index (int) – the index of the section frame, relative to the whole BatchEncoding

  • predictions (Tensor)

Returns:

A list of TokenizedWord built from the relevant subset of frame offsets and frame word ids

Return type:

list[TokenizedWord]
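
An illustrative sketch of the idea (not the actual kazu implementation): with overlapping frames, the first frame contributes all of its tokens, while subsequent frames drop the leading tokens they share with the previous frame, so each token is attributed to exactly one frame.

    def frame_token_slice(frame_index: int, number_of_frames: int,
                          frame_length: int, stride: int) -> slice:
        """Pick which token positions of a frame to keep (hypothetical helper)."""
        if number_of_frames == 1 or frame_index == 0:
            return slice(0, frame_length)
        # later frames: skip the tokens already covered by the previous frame
        return slice(stride, frame_length)

    print(frame_token_slice(0, 3, 64, 16))  # slice(0, 64, None)
    print(frame_token_slice(1, 3, 64, 16))  # slice(16, 64, None)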

get_dataloader(docs)[source]

Get a dataloader from a list of kazu.data.Document. Collation is handled via transformers.DataCollatorWithPadding.

Parameters:

docs (list[Document])

Returns:

The DataLoader, together with a dict whose keys map to overflow_to_sample_mapping in the underlying batch encoding, allowing docs longer than the model’s maximum sequence length to be processed

Return type:

tuple[DataLoader, dict[int, Section]]
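
A minimal collation sketch (model name and texts are assumptions): frames yielded by an HFDataset are batched and dynamically padded by transformers.DataCollatorWithPadding.

    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer, DataCollatorWithPadding

    from kazu.steps.ner.hf_token_classification import HFDataset

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed model
    encodings = tokenizer(
        ["first section text", "second section text, typically much longer"],
        return_overflowing_tokens=True,
        max_length=128,
        stride=16,
        truncation=True,
    )
    dataset = HFDataset(encodings, keys_to_use=("input_ids", "attention_mask"))
    loader = DataLoader(
        dataset,
        batch_size=4,
        collate_fn=DataCollatorWithPadding(tokenizer=tokenizer),
    )
    for batch in loader:
        print(batch["input_ids"].shape)  # (batch_size, padded_frame_length)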

static get_list_of_batch_encoding_frames_for_section(batch_encoding, section_index)[source]

For a given dataloader with an HFDataset, return a list of frame indexes associated with a given section index.

Parameters:
  • batch_encoding (BatchEncoding)

  • section_index (int)

Returns:

Return type:

list[int]
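
An illustrative sketch (not the kazu source) of the mapping this method performs: the frames belonging to a section are those whose entry in overflow_to_sample_mapping equals the section index.

    def frames_for_section(overflow_to_sample_mapping: list[int], section_index: int) -> list[int]:
        """Hypothetical helper returning frame indexes for one section."""
        return [
            frame_idx
            for frame_idx, sample_idx in enumerate(overflow_to_sample_mapping)
            if sample_idx == section_index
        ]

    print(frames_for_section([0, 0, 1, 2, 2, 2], section_index=2))  # [3, 4, 5]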

get_multilabel_activations(loader)[source]

Get a tensor consisting of confidences for labels in a multi-label classification context. The output tensor is of shape (n_samples, max_sequence_length, n_labels).

Parameters:

loader (DataLoader)

Returns:

Return type:

Tensor
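
A sketch of the multi-label case (shapes are illustrative): independent per-label confidences are typically obtained by applying a sigmoid to the logits.

    import torch

    logits = torch.randn(8, 128, 5)      # (n_samples, max_sequence_length, n_labels)
    confidences = torch.sigmoid(logits)  # each label scored independently in [0, 1]
    print(confidences.shape)             # torch.Size([8, 128, 5])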

get_single_label_activations(loader)[source]

Get a tensor consisting of one-hot binary classifications in a single-label classification context. The output tensor is of shape (n_samples, max_sequence_length, n_labels).

Parameters:

loader (DataLoader)

Returns:

Return type:

Tensor
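
A sketch of the single-label case (shapes are illustrative): a softmax over the label dimension followed by an argmax yields one-hot classifications.

    import torch
    import torch.nn.functional as F

    logits = torch.randn(8, 128, 5)  # (n_samples, max_sequence_length, n_labels)
    one_hot = F.one_hot(F.softmax(logits, dim=-1).argmax(dim=-1), num_classes=5)
    print(one_hot.shape)             # torch.Size([8, 128, 5])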

section_frames_to_tokenised_words(section_index, batch_encoding, predictions)[source]
Parameters:
  • section_index (int)

  • batch_encoding (BatchEncoding)

  • predictions (Tensor)

Returns:

Return type:

list[TokenizedWord]