kazu.steps.ner.hf_token_classification¶
Classes

HFDataset
    Simple implementation of torch.utils.data.IterableDataset, producing HF tokenizer input_ids.

TransformersModelForTokenClassificationNerStep
    A wrapper for transformers.AutoModelForTokenClassification.
- class kazu.steps.ner.hf_token_classification.HFDataset[source]¶
Bases: IterableDataset[dict[str, Any]]

- __init__(encodings, keys_to_use)[source]¶
Simple implementation of torch.utils.data.IterableDataset, producing HF tokenizer input_ids.

- Parameters:
encodings (BatchEncoding)
keys_to_use (Iterable[str]) – the keys to use from the encodings (not all models require token_type_ids)
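A minimal usage sketch, assuming HFDataset is constructed directly (in practice the step below builds it internally); the checkpoint name and key selection are illustrative, not taken from this page:

    from transformers import AutoTokenizer

    from kazu.steps.ner.hf_token_classification import HFDataset

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint
    encodings = tokenizer(["EGFR is mutated in NSCLC"], truncation=True)

    # yield only the keys this particular model consumes
    dataset = HFDataset(encodings, keys_to_use=["input_ids", "attention_mask"])
    for item in dataset:
        print(item.keys())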
- class kazu.steps.ner.hf_token_classification.TransformersModelForTokenClassificationNerStep[source]¶
Bases: Step
A wrapper for transformers.AutoModelForTokenClassification.

This implementation uses a sliding window to process large documents that don’t fit into the maximum sequence length allowed by a model. The resulting token labels are then post-processed by TokenizedWordProcessor.

- __call__(docs)[source]¶
Process documents and respond with processed and failed documents.

Note that many steps will be decorated by document_iterating_step() or document_batch_step(), which will modify the ‘original’ __call__ function signature to match the expected signature for a step, as the decorators handle the exception/failed documents logic for you.
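For context, a toy sketch of the decorator pattern mentioned above, assuming the kazu.steps re-exports shown here; the step body and the metadata attribute it touches are illustrative assumptions:

    from kazu.data import Document
    from kazu.steps import Step, document_iterating_step


    class ToyStep(Step):
        """Illustrative step containing per-document logic only."""

        @document_iterating_step
        def __call__(self, doc: Document) -> None:
            # the decorator adapts this per-document method to the batch
            # signature expected of a step and collects failed documents
            doc.metadata["seen_by_toy_step"] = True  # assumes Document.metadata exists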
- __init__(path, batch_size, stride, max_sequence_length, tokenized_word_processor, keys_to_use, entity_splitter=None, device='cpu')[source]¶
- Parameters:
path (str) – path to HF model, config and tokenizer. Passed to HF .from_pretrained()
batch_size (int) – batch size for dataloader
stride (int) – passed to HF tokenizers (for splitting long docs)
max_sequence_length (int) – passed to HF tokenizers (for splitting long docs)
tokenized_word_processor (TokenizedWordProcessor)
keys_to_use (Iterable[str]) – keys to use from the encodings. Note that this varies depending on the flavour of BERT model (e.g. distilbert does not use token_type_ids)
entity_splitter (NonContiguousEntitySplitter | None) – to detect non-contiguous entities if provided
device (str) – device to run the model on. Defaults to “cpu”
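A hedged construction/usage sketch; the model path and document are placeholders, the TokenizedWordProcessor construction is elided rather than guessed, and the (processed, failed) unpacking assumes the standard step contract described under __call__ above:

    from kazu.data import Document
    from kazu.steps.ner.hf_token_classification import (
        TransformersModelForTokenClassificationNerStep,
    )

    word_processor = ...  # a configured TokenizedWordProcessor; see its own docs for arguments

    step = TransformersModelForTokenClassificationNerStep(
        path="/path/to/hf/token_classification_model",  # placeholder path
        batch_size=4,
        stride=16,
        max_sequence_length=128,
        tokenized_word_processor=word_processor,
        keys_to_use=["input_ids", "attention_mask", "token_type_ids"],
        device="cpu",
    )

    docs = [Document.create_simple_document("EGFR mutations are associated with NSCLC.")]
    processed, failed = step(docs)  # assumed (processed, failed) return shape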
- frame_to_tok_word(batch_encoding, number_of_frames, frame_index, section_frame_index, predictions)[source]¶
Depending on the number of frames generated by a string of text, and whether it is the first or last frame, we need to return different subsets of the frame offsets and frame word_ids.
- Parameters:
batch_encoding (BatchEncoding)
number_of_frames (int) – number of frames created by the tokenizer for the string
frame_index (int) – the index of the query frame, relative to the total number of frames
section_frame_index (int) – the index of the section frame, relative to the whole BatchEncoding
predictions (Tensor)
- Returns:
Tuple of 2 lists: frame offsets and frame word ids
- Return type:
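To make the notion of “frames” concrete, a standalone sketch of the HuggingFace sliding-window tokenization that stride and max_sequence_length drive (illustrative checkpoint and values; this is not kazu’s own code):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint
    text = "Long biomedical text. " * 200  # long enough to overflow a single frame

    batch_encoding = tokenizer(
        text,
        max_length=128,
        stride=16,
        truncation=True,
        return_overflowing_tokens=True,  # produces multiple overlapping frames
        return_offsets_mapping=True,
    )

    number_of_frames = len(batch_encoding["input_ids"])
    for frame_index in range(number_of_frames):
        offsets = batch_encoding["offset_mapping"][frame_index]
        word_ids = batch_encoding.word_ids(frame_index)
        # consecutive frames overlap by `stride` tokens, which is why the first,
        # middle and last frames need different subsets of offsets/word_ids
        print(frame_index, len(offsets), len(word_ids))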
- get_dataloader(docs)[source]¶
Get a dataloader from a list of kazu.data.Document. Collation is handled via transformers.DataCollatorWithPadding.
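A sketch of that collation wiring, using HFDataset from this module with torch’s DataLoader (checkpoint and batch size are illustrative):

    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer, DataCollatorWithPadding

    from kazu.steps.ner.hf_token_classification import HFDataset

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint
    encodings = tokenizer(
        ["first section text", "a second, rather longer, section text"],
        truncation=True,
    )

    dataset = HFDataset(encodings, keys_to_use=["input_ids", "attention_mask"])
    loader = DataLoader(
        dataset,
        batch_size=2,
        collate_fn=DataCollatorWithPadding(tokenizer),  # pads each batch to a common length
    )
    for batch in loader:
        print(batch["input_ids"].shape)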
- static get_list_of_batch_encoding_frames_for_section(batch_encoding, section_index)[source]¶
For a given dataloader with an HFDataset, return a list of frame indexes associated with a given section index.
- Parameters:
batch_encoding (BatchEncoding)
section_index (int)
- Returns:
- Return type:
- get_multilabel_activations(loader)[source]¶
Get a tensor consisting of confidences for labels in a multi-label classification context. The output tensor is of shape (n_samples, max_sequence_length, n_labels).
- Parameters:
loader (DataLoader)
- Returns:
- Return type:
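A minimal sketch of the kind of activation described here: independent per-label confidences, conventionally a sigmoid over token-classification logits (the checkpoint and label count are illustrative, and this is not the method’s actual implementation):

    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    model_path = "bert-base-cased"  # placeholder; a trained NER checkpoint would be used
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForTokenClassification.from_pretrained(model_path, num_labels=5)

    batch = tokenizer(
        ["EGFR is mutated in NSCLC"],
        return_tensors="pt",
        padding="max_length",
        max_length=128,
        truncation=True,
    )
    with torch.no_grad():
        logits = model(**batch).logits   # (n_samples, max_sequence_length, n_labels)
    confidences = torch.sigmoid(logits)  # independent confidence per label
    print(confidences.shape)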
- get_single_label_activations(loader)[source]¶
Get a tensor consisting of one-hot binary classifications in a single-label classification context. The output tensor is of shape (n_samples, max_sequence_length, n_labels).
- Parameters:
loader (DataLoader)
- Returns:
- Return type:
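And a corresponding sketch for the single-label case: an argmax over the logits converted to a one-hot tensor of the same (n_samples, max_sequence_length, n_labels) shape (illustrative only, not the method’s actual code):

    import torch

    n_samples, max_sequence_length, n_labels = 2, 128, 5
    logits = torch.randn(n_samples, max_sequence_length, n_labels)  # stand-in for model output

    predicted = logits.argmax(dim=-1)  # best label per token
    one_hot = torch.nn.functional.one_hot(predicted, num_classes=n_labels)
    print(one_hot.shape)  # (n_samples, max_sequence_length, n_labels)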
- section_frames_to_tokenised_words(section_index, batch_encoding, predictions)[source]¶
- Parameters:
section_index (int)
batch_encoding (BatchEncoding)
predictions (Tensor)
- Returns:
- Return type: