kazu.steps.ner.gliner
Classes
GLiNERStep | Wrapper for GLiNER models and library.
GliNERBatchItem | GliNERBatchItem(doc: kazu.data.Document, section: kazu.data.Section, start_span: kazu.data.CharSpan, end_span: kazu.data.CharSpan, sentence: str)
MajorityVoteScorer |
MaxScoreScorer |
- class kazu.steps.ner.gliner.GLiNERStep
Bases: Step
Wrapper for GLiNER models and library. Requires kazu.data.Section to have sentence spans set on it, as sentences are processed in batches by GLiNER. This avoids the ‘windowing’ problem, whereby a multi-token entity could be split across two windows, leading to ambiguity over the entity class and spans. Since entities cannot, in theory, cross sentence boundaries, batching by sentence eliminates this problem.
If multiple classes are detected for the same string, the conflict is resolved via the supplied ConflictScorer class.
Attention
To use this step, you will need gliner installed, which is not installed as part of the default kazu install because this step isn’t used as part of the default pipeline. You can either do:
$ pip install gliner
Or you can install required dependencies for all steps included in kazu with:
$ pip install kazu[all-steps]
Paper:
GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer. Urchade Zaratiana, Nadi Tomeh, Pierre Holat, Thierry Charnois.
Bibtex citation:
@misc{zaratiana2023gliner,
  title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
  author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
  year={2023},
  eprint={2311.08526},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
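The sentence batching described in the class description can be pictured with a small, library-free sketch. It assumes sentence spans are available as plain (start, end) character offsets; SentenceItem, sentence_items and batched are hypothetical names used only for illustration and are not part of kazu.

```python
from typing import Iterator, NamedTuple


class SentenceItem(NamedTuple):
    """Hypothetical stand-in for a batch item: a sentence plus its offsets."""

    start: int
    end: int
    sentence: str


def sentence_items(text: str, spans: list[tuple[int, int]]) -> list[SentenceItem]:
    """Slice each (start, end) sentence span out of the section text."""
    return [SentenceItem(start, end, text[start:end]) for start, end in spans]


def batched(items: list[SentenceItem], batch_size: int) -> Iterator[list[SentenceItem]]:
    """Yield consecutive groups of at most batch_size whole sentences."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]


# Assumed example text and sentence spans (end offsets are exclusive).
text = "EGFR is mutated in lung cancer. Gefitinib inhibits EGFR. It is taken orally."
spans = [(0, 31), (32, 56), (57, 76)]

batches = list(batched(sentence_items(text, spans), batch_size=2))
for batch in batches:
    print([item.sentence for item in batch])
```

Because every item is a whole sentence, no entity is cut by a window boundary, which is the property the step relies on.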
- __call__(docs)
Process documents and respond with processed and failed documents.
Note that many steps will be decorated by document_iterating_step() or document_batch_step(), which will modify the ‘original’ __call__ function signature to match the expected signature for a step, as the decorators handle the exception/failed documents logic for you.
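To illustrate what "handling the exception/failed documents logic" can look like, here is a minimal sketch of a decorator in the same spirit. It is not kazu's implementation: Doc and iterating_step are hypothetical names, and returning every input document alongside the failed subset is an assumption made for the example.

```python
from typing import Callable, Optional


class Doc:
    """Hypothetical minimal document; not kazu.data.Document."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.error: Optional[str] = None


def iterating_step(
    per_doc: Callable[[Doc], None]
) -> Callable[[list[Doc]], tuple[list[Doc], list[Doc]]]:
    """Turn a per-document function into one that takes a list of documents
    and returns (all documents, documents that raised)."""

    def wrapper(docs: list[Doc]) -> tuple[list[Doc], list[Doc]]:
        failed: list[Doc] = []
        for doc in docs:
            try:
                per_doc(doc)
            except Exception as exc:
                # Record the failure on the document instead of aborting the run.
                doc.error = repr(exc)
                failed.append(doc)
        return docs, failed

    return wrapper


@iterating_step
def uppercase(doc: Doc) -> None:
    if not doc.text:
        raise ValueError("empty document")
    doc.text = doc.text.upper()


docs, failed = uppercase([Doc("gliner"), Doc("")])
```

The decorated function is written against a single document, while callers see the list-in, two-lists-out signature; this is why the documented signature of a decorated __call__ can differ from the one in the source.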
- __init__(pretrained_model_name_or_path, gliner_class_prompt_to_entity_class, threshold=0.3, batch_size=2, device=None, local_files_only=True, conflict_scorer=<class 'kazu.steps.ner.gliner.MajorityVoteScorer'>, max_context_size=None, iterations=5)
- Parameters:
pretrained_model_name_or_path (str) – Passed to GLiNER.from_pretrained. Note that this could attempt to download a model from the HuggingFace Hub (see docs for ModelHubMixin).
gliner_class_prompt_to_entity_class (dict[str, str]) – Since GLiNER needs entity class prompts, these might not map exactly to our global NER classes. Therefore, this dictionary provides this mapping.
threshold (float) – Passed to GLiNER.predict_entities.
batch_size (int) – The number of sentences to process in a single batch. This is to avoid memory issues.
device (str | None) – Passed to GLiNER.to.
local_files_only (bool) – Passed to GLiNER.from_pretrained.
conflict_scorer (type[ConflictScorer]) – The method to use to resolve conflicts between entity classes. Defaults to MajorityVoteScorer.
max_context_size (int | None) – The maximum number of tokens to process in a single batch. This is to avoid memory issues. If None, the default is the model’s max_len - 10 (for special tokens).
iterations (int) – The number of times to shuffle the entity class prompts. This is to avoid any bias in the model.
- Return type:
None
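A minimal construction sketch, using only the keyword arguments from the signature above. The model path, the prompt strings and the entity class names in the mapping are assumptions chosen for illustration; gliner_step_kwargs and build_step are hypothetical helpers, and build_step needs both kazu and gliner installed.

```python
from typing import Any


def gliner_step_kwargs(model_path: str) -> dict[str, Any]:
    """Collect keyword arguments for GLiNERStep.__init__ (illustrative values)."""
    return {
        "pretrained_model_name_or_path": model_path,
        # Keys are the prompts given to GLiNER; values are the entity classes
        # they map to. Both sides are assumed names, not kazu defaults.
        "gliner_class_prompt_to_entity_class": {
            "gene or protein": "gene",
            "disease": "disease",
            "medication": "drug",
        },
        "threshold": 0.3,
        "batch_size": 2,
        "device": None,
        "local_files_only": True,  # avoid an unexpected download from the Hub
        "iterations": 5,
    }


def build_step(model_path: str):
    """Instantiate the step; imported lazily so this sketch loads without kazu."""
    from kazu.steps.ner.gliner import GLiNERStep

    return GLiNERStep(**gliner_step_kwargs(model_path))


kwargs = gliner_step_kwargs("/path/to/local/gliner-model")
```

conflict_scorer and max_context_size are left at their documented defaults here.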
- class kazu.steps.ner.gliner.GliNERBatchItem
Bases: object
GliNERBatchItem(doc: kazu.data.Document, section: kazu.data.Section, start_span: kazu.data.CharSpan, end_span: kazu.data.CharSpan, sentence: str)
- class kazu.steps.ner.gliner.MajorityVoteScorer
Bases: ConflictScorer
- class kazu.steps.ner.gliner.MaxScoreScorer
Bases: ConflictScorer
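The two scorers are not documented beyond their names, so the following is only a sketch of what majority-vote and max-score conflict resolution generally mean, not the ConflictScorer interface. It assumes that the competing predictions for one string are available as (entity class, score) pairs; majority_vote and max_score are hypothetical function names.

```python
from collections import Counter

# One prediction for a given string: (entity_class, model_score).
Prediction = tuple[str, float]


def majority_vote(predictions: list[Prediction]) -> str:
    """Pick the class predicted most often; ties go to the class seen first."""
    counts = Counter(entity_class for entity_class, _ in predictions)
    return counts.most_common(1)[0][0]


def max_score(predictions: list[Prediction]) -> str:
    """Pick the class of the single highest-scoring prediction."""
    return max(predictions, key=lambda pred: pred[1])[0]


# Assumed example: the same string was labelled three times.
predictions = [("gene", 0.55), ("drug", 0.90), ("gene", 0.40)]

by_votes = majority_vote(predictions)
by_score = max_score(predictions)
```

With these inputs the two strategies disagree: voting favours the class that appears more often, while max score favours the single most confident prediction.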