kazu.steps.ner.gliner

Classes

`ConflictScorer`
`GLiNERStep`	Wrapper for GLiNER models and library.
`GliNERBatchItem`	GliNERBatchItem(doc: kazu.data.Document, section: kazu.data.Section, start_span: kazu.data.CharSpan, end_span: kazu.data.CharSpan, sentence: str)
`MajorityVoteScorer`
`MaxScoreScorer`

class kazu.steps.ner.gliner.ConflictScorer

Bases: object

__init__()

Return type: None

finalise(namespace)

Parameters: namespace (str)
Return type: None

update(doc, section, entity)

Parameters:

doc (Document)
section (Section)
entity (Entity)

Return type:

None

class kazu.steps.ner.gliner.GLiNERStep

Bases: Step

Wrapper for GLiNER models and library. Requires kazu.data.Section to have sentence spans set on it, because GLiNER processes sentences in batches. This avoids the ‘windowing’ problem, whereby a multi-token entity could be split across two windows, leading to ambiguity over the entity class and spans. Since entities are not expected to cross sentence boundaries, batching by sentence sidesteps this problem.
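As a rough illustration of the sentence batching described above, the sketch below slices a section's text by sentence character spans and groups the sentences into fixed-size batches. It is not kazu's implementation: the `(start, end)` tuples stand in for kazu.data.CharSpan, and `batch_sentences` is a hypothetical helper introduced only for this example.

```python
from typing import Iterator


def batch_sentences(
    text: str, sentence_spans: list[tuple[int, int]], batch_size: int = 2
) -> Iterator[list[tuple[int, str]]]:
    """Yield batches of (sentence_start_offset, sentence_text) pairs.

    Hypothetical helper: (start, end) tuples stand in for CharSpan objects.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: list[tuple[int, str]] = []
    for start, end in sentence_spans:
        # Whole sentences are kept intact, so no entity is cut by a window edge.
        batch.append((start, text[start:end]))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


text = "EGFR is mutated in NSCLC. Gefitinib targets it. Response rates vary."
spans = [(0, 25), (26, 47), (48, 68)]
batches = list(batch_sentences(text, spans, batch_size=2))
```

Keeping each sentence's start offset alongside its text means that entity offsets predicted within a sentence can be shifted back into section coordinates afterwards.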

If multiple classes are detected for the same string, the conflict is resolved by the supplied ConflictScorer class.
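To make the idea of conflict resolution concrete, here is a minimal sketch of two strategies that the names MajorityVoteScorer and MaxScoreScorer suggest. This page does not describe how those classes work internally, so the functions below are assumptions written for illustration. They do not use the kazu API, and the prediction tuples are made-up data.

```python
from collections import Counter, defaultdict

# Hypothetical predictions: (matched string, entity class, model score).
predictions = [
    ("EGFR", "gene", 0.55),
    ("EGFR", "gene", 0.50),
    ("EGFR", "disease", 0.90),
]


def majority_vote(preds):
    """Pick, per string, the class that was predicted most often."""
    votes = defaultdict(Counter)
    for text, label, _score in preds:
        votes[text][label] += 1
    return {text: counts.most_common(1)[0][0] for text, counts in votes.items()}


def max_score(preds):
    """Pick, per string, the class carrying the single highest score."""
    best = {}
    for text, label, score in preds:
        if text not in best or score > best[text][1]:
            best[text] = (label, score)
    return {text: label for text, (label, _score) in best.items()}


by_vote = majority_vote(predictions)
by_score = max_score(predictions)
```

With this toy data the two strategies disagree: voting favours "gene" (two predictions against one), while the highest single score belongs to "disease". Which behaviour is preferable depends on how much the individual scores are trusted.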

Attention

To use this step, you will need gliner installed. It is not part of the default kazu install because this step isn’t used in the default pipeline. You can either run:

$ pip install gliner

Or you can install required dependencies for all steps included in kazu with:

$ pip install kazu[all-steps]

Paper:

GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer.
Urchade Zaratiana, Nadi Tomeh, Pierre Holat, Thierry Charnois
https://arxiv.org/abs/2311.08526

Bibtex Citation Details

@misc{zaratiana2023gliner,
      title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
      author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
      year={2023},
      eprint={2311.08526},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

__call__(docs)

Process documents and respond with processed and failed documents.

Note that many steps will be decorated by document_iterating_step() or document_batch_step() which will modify the ‘original’ __call__ function signature to match the expected signature for a step, as the decorators handle the exception/failed documents logic for you.

Parameters:

docs (list[Document])
self (Self)

Returns:

The first element is all the provided docs (now modified by the processing), the second is the docs that failed to (fully) process correctly.

Return type:

tuple[list[Document], list[Document]]
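The return contract above can be sketched with a small decorator that catches per-document failures. This is only a simplified stand-in, not the actual document_iterating_step() from kazu: `iterating_step` and `uppercase_text` are hypothetical names, plain dicts replace Document objects, and the real decorators wrap methods rather than free functions.

```python
from typing import Callable


def iterating_step(per_doc: Callable[[dict], None]):
    """Hypothetical stand-in for a decorator like document_iterating_step()."""

    def wrapper(docs: list[dict]) -> tuple[list[dict], list[dict]]:
        failed = []
        for doc in docs:
            try:
                per_doc(doc)  # modifies the document in place
            except Exception:
                # A failing document is recorded but does not stop the batch.
                failed.append(doc)
        # First element: all docs; second element: those that failed.
        return docs, failed

    return wrapper


@iterating_step
def uppercase_text(doc: dict) -> None:
    doc["text"] = doc["text"].upper()


docs = [{"text": "egfr"}, {"text": None}]
processed, failed = uppercase_text(docs)
```

Note that a failed document appears in both returned lists, matching the description that the first element contains all provided docs.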

__init__(pretrained_model_name_or_path, gliner_class_prompt_to_entity_class, threshold=0.3, batch_size=2, device=None, local_files_only=True, conflict_scorer=<class 'kazu.steps.ner.gliner.MajorityVoteScorer'>, max_context_size=None, iterations=5)

Parameters:

pretrained_model_name_or_path (str) – Passed to GLiNER.from_pretrained. Note that this could attempt to download a model from the HuggingFace Hub (see docs for ModelHubMixin).
gliner_class_prompt_to_entity_class (dict[str, str]) – GLiNER is driven by entity class prompts, which might not map exactly to our global NER classes. This dictionary provides the mapping from prompt to entity class.
threshold (float) – Passed to GLiNER.predict_entities.
batch_size (int) – The number of sentences to process in a single batch. This is to avoid memory issues.
device (str | None) – Passed to GLiNER.to.
local_files_only (bool) – Passed to GLiNER.from_pretrained.
conflict_scorer (type[ConflictScorer]) – The method to use to resolve conflicts between entity classes. Defaults to MajorityVoteScorer.
max_context_size (int | None) – The maximum number of tokens to process in a single batch. This is to avoid memory issues. If None, the default is the model’s max_len - 10 (the 10 tokens are reserved for special tokens).
iterations (int) – The number of times to shuffle the entity class prompts. This is to avoid any bias in the model.

Return type:

None
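The following sketch illustrates how three of the parameters above might interact: a prompt-to-class mapping, repeated shuffling of the prompts, and the default context size. It is written under assumptions and does not reproduce the step's real code. The prompt strings, the helper names, and the fixed `seed` are all invented for the example.

```python
import random

# Hypothetical mapping from GLiNER prompts to pipeline entity classes.
prompt_to_class = {
    "gene or protein": "gene",
    "medical condition": "disease",
    "medicine": "drug",
}


def shuffled_prompt_orders(prompts, iterations, seed=0):
    """Return one shuffled copy of the prompts per iteration."""
    rng = random.Random(seed)  # seeded only to make the example repeatable
    orders = []
    for _ in range(iterations):
        order = list(prompts)
        rng.shuffle(order)
        orders.append(order)
    return orders


def to_entity_class(prompt_label):
    """Translate a predicted prompt label into a pipeline entity class."""
    return prompt_to_class[prompt_label]


def default_context_size(model_max_len):
    """Mirror the documented default: max_len - 10 for special tokens."""
    return model_max_len - 10


orders = shuffled_prompt_orders(prompt_to_class, iterations=5)
```

Each ordering would presumably be used for one pass over the same sentences, with the resulting labels translated by the mapping and any disagreements handed to the conflict scorer.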

class kazu.steps.ner.gliner.GliNERBatchItem

Bases: object

GliNERBatchItem(doc: kazu.data.Document, section: kazu.data.Section, start_span: kazu.data.CharSpan, end_span: kazu.data.CharSpan, sentence: str)

__init__(doc, section, start_span, end_span, sentence)

Parameters:

doc (Document)
section (Section)
start_span (CharSpan)
end_span (CharSpan)
sentence (str)

Return type:

None

doc: Document

end_span: CharSpan

section: Section

sentence: str

start_span: CharSpan

class kazu.steps.ner.gliner.MajorityVoteScorer

Bases: ConflictScorer

__init__()

Return type: None

class kazu.steps.ner.gliner.MaxScoreScorer

Bases: ConflictScorer

__init__()

Return type: None