kazu.steps.ner.hf_token_classification

Classes

class kazu.steps.ner.hf_token_classification.HFDataset[source]

Bases: IterableDataset[dict[str, Any]]

__init__(encodings, keys_to_use)[source]

Simple implementation of torch.utils.data.IterableDataset, producing HF tokenizer input_ids.

Parameters:
  • encodings (BatchEncoding)

  • keys_to_use (Iterable[str]) – the keys to use from the encodings (not all models require token_type_ids)
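
As a rough illustration of what such a dataset looks like, the sketch below iterates over a BatchEncoding and yields one dict per frame, restricted to the requested keys. The class name SimpleEncodingDataset is purely illustrative and not part of kazu:

   from typing import Iterable

   from torch.utils.data import IterableDataset
   from transformers import BatchEncoding


   class SimpleEncodingDataset(IterableDataset):
       """Yield one dict per frame, keeping only the keys the model expects."""

       def __init__(self, encodings: BatchEncoding, keys_to_use: Iterable[str]):
           self.encodings = encodings
           self.keys_to_use = list(keys_to_use)

       def __iter__(self):
           for i in range(len(self.encodings["input_ids"])):
               yield {key: self.encodings[key][i] for key in self.keys_to_use}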

class kazu.steps.ner.hf_token_classification.TransformersModelForTokenClassificationNerStep[source]

Bases: Step

A wrapper for transformers.AutoModelForTokenClassification.

This implementation uses a sliding window approach to process documents that are too long to fit into the maximum sequence length allowed by a model. The resulting token labels are then post-processed by TokenizedWordProcessor.
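
The sliding window itself comes from the Hugging Face fast tokenizers: passing stride together with return_overflowing_tokens=True splits a long text into overlapping frames. A hedged sketch of the idea (model name and input text are placeholders, not what kazu ships with):

   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder model
   encodings = tokenizer(
       ["a very long clinical abstract ..."],
       max_length=128,
       stride=32,
       truncation=True,
       padding=True,
       return_overflowing_tokens=True,  # emit one frame per window
       return_tensors="pt",
   )
   # overflow_to_sample_mapping records which original text each frame came from
   print(encodings["overflow_to_sample_mapping"])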

__call__(docs)[source]

Process documents and respond with processed and failed documents.

Note that many steps will be decorated by document_iterating_step() or document_batch_step(), which modify the ‘original’ __call__ function signature to match the expected signature for a step, as the decorators handle the exception/failed-documents logic for you.

Parameters:

docs (list[Document])

Returns:

The first element is all the provided docs (now modified by the processing), the second is the docs that failed to (fully) process correctly.

Return type:

tuple[list[Document], list[Document]]
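
Assuming step is an already configured instance (see __init__ below) and docs is a list of kazu.data.Document, a typical call looks like this:

   processed_docs, failed_docs = step(docs)
   if failed_docs:
       print(f"{len(failed_docs)} documents failed to process")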

__init__(path, batch_size, stride, max_sequence_length, tokenized_word_processor, keys_to_use, entity_splitter=None, device='cpu')[source]
Parameters:
  • path (str) – path to HF model, config and tokenizer. Passed to HF .from_pretrained()

  • batch_size (int) – batch size for dataloader

  • stride (int) – passed to HF tokenizers (for splitting long docs)

  • max_sequence_length (int) – passed to HF tokenizers (for splitting long docs)

  • tokenized_word_processor (TokenizedWordProcessor)

  • keys_to_use (Iterable[str]) – keys to use from the encodings. Note that this varies depending on the flavour of BERT model (e.g. distilbert does not use token_type_ids)

  • entity_splitter (NonContiguousEntitySplitter | None) – to detect non-contiguous entities if provided

  • device (str) – device to run the model on. Defaults to “cpu”
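
A hedged instantiation sketch; the checkpoint path, sequence settings and keys are placeholders, and word_processor stands for a pre-configured TokenizedWordProcessor built elsewhere:

   from kazu.steps.ner.hf_token_classification import (
       TransformersModelForTokenClassificationNerStep,
   )

   step = TransformersModelForTokenClassificationNerStep(
       path="/models/my-ner-model",              # placeholder HF checkpoint directory
       batch_size=4,
       stride=32,
       max_sequence_length=128,
       tokenized_word_processor=word_processor,  # pre-configured TokenizedWordProcessor
       keys_to_use=["input_ids", "attention_mask"],
       device="cpu",
   )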

frame_to_tok_word(batch_encoding, number_of_frames, frame_index, section_frame_index, predictions)[source]

Depending on the number of frames generated by a string of text, and whether the query frame is the first or last one, we need to return different subsets of the frame offsets and frame word_ids.

Parameters:
  • batch_encoding (BatchEncoding)

  • number_of_frames (int) – number of frames created by the tokenizer for the string

  • frame_index (int) – the index of the query frame, relative to the total number of frames

  • section_frame_index (int) – the index of the section frame, relative to the whole BatchEncoding

  • predictions (Tensor)

Returns:

The TokenizedWords for the frame, built from the relevant frame offsets and frame word ids

Return type:

list[TokenizedWord]

get_dataloader(docs)[source]

Get a dataloader from a list of kazu.data.Document. Collation is handled via transformers.DataCollatorWithPadding.

Parameters:

docs (list[Document])

Returns:

A DataLoader, and a dict whose keys map to overflow_to_sample_mapping in the underlying batch encoding, allowing the processing of docs longer than can fit within the maximum sequence length of a transformer

Return type:

tuple[DataLoader, dict[int, Section]]
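
A rough sketch of the collation idea, assuming dataset wraps the frames (e.g. an HFDataset built from the encodings above) and tokenizer is the model’s tokenizer; DataCollatorWithPadding pads each batch to the length of its longest member:

   from torch.utils.data import DataLoader
   from transformers import DataCollatorWithPadding

   collate_fn = DataCollatorWithPadding(tokenizer=tokenizer)
   loader = DataLoader(dataset, batch_size=4, collate_fn=collate_fn)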

static get_list_of_batch_encoding_frames_for_section(batch_encoding, section_index)[source]

For a given dataloader with an HFDataset, return a list of frame indices associated with a given section index.

Parameters:
  • batch_encoding (BatchEncoding)

  • section_index (int)

Returns:

The frame indices associated with the given section.

Return type:

list[int]
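
Conceptually, this can be derived from overflow_to_sample_mapping: every frame whose mapping value equals the section index belongs to that section. A hedged illustration (encodings and section_index are assumed from the surrounding context, and this is not necessarily how kazu implements it):

   frame_indices = [
       frame_idx
       for frame_idx, sample_idx in enumerate(encodings["overflow_to_sample_mapping"])
       if int(sample_idx) == section_index
   ]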

get_multilabel_activations(loader)[source]

Get a tensor consisting of confidences for labels in a multi-label classification context.

Parameters:

loader (DataLoader)

Returns:

A tensor of per-label confidences.

Return type:

Tensor
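
In a multi-label setting, per-label confidences are commonly obtained by applying a sigmoid to the raw logits independently for each label. Whether this method does exactly that is not stated here, so treat the following as an illustrative sketch (model and batch are assumed to come from the dataloader):

   import torch

   with torch.no_grad():
       logits = model(**batch).logits       # shape: (batch, seq_len, num_labels)
       confidences = torch.sigmoid(logits)  # independent confidence per label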

get_single_label_activations(loader)[source]

Get a tensor consisting of one-hot binary classifications in a single-label classification context.

Parameters:

loader (DataLoader)

Returns:

A tensor of one-hot label predictions.

Return type:

Tensor
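
For the single-label case, a one-hot prediction per token is typically produced by taking the argmax over the label dimension. Again, an illustrative sketch rather than the exact implementation:

   import torch
   import torch.nn.functional as F

   with torch.no_grad():
       logits = model(**batch).logits     # (batch, seq_len, num_labels)
       predicted = logits.argmax(dim=-1)  # best label per token
       one_hot = F.one_hot(predicted, num_classes=logits.shape[-1])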

section_frames_to_tokenised_words(section_index, batch_encoding, predictions)[source]
Parameters:
  • section_index (int)

  • batch_encoding (BatchEncoding)

  • predictions (Tensor)

Returns:

The TokenizedWords for the section.

Return type:

list[TokenizedWord]