kazu.steps.linking.rules_based_disambiguation

Classes

MatcherResult

RulesBasedEntityClassDisambiguationFilterStep

Removes instances of Entity from Sections that don't meet rules-based disambiguation requirements in at least one location in the document.

class kazu.steps.linking.rules_based_disambiguation.MatcherResult[source]

Bases: AutoNameEnum

__new__(value)[source]
HIT = 'HIT'
MISS = 'MISS'
NOT_CONFIGURED = 'NOT_CONFIGURED'
class kazu.steps.linking.rules_based_disambiguation.RulesBasedEntityClassDisambiguationFilterStep[source]

Bases: Step

Removes instances of Entity from Sections that don’t meet rules-based disambiguation requirements in at least one location in the document.

This step uses spaCy Matcher rules to determine whether an entity class and/or individual entity mentions are valid. These Matcher rules operate on the sentence in which each mention under consideration is located.

Rules can have both true positive (tp) and false positive (fp) aspects. If an aspect is defined, it MUST be correct at least once in the document for all entities with the same key (composed of the matched string and entity class) to be valid.

Non-contiguous entities are evaluated on the full span of the text they cover, rather than the specific tokens.
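The "at least once in the document" behaviour for a shared key can be illustrated with a minimal sketch. It is not kazu's implementation: the MatcherResult enum below is a stand-in that only mirrors the three documented members, aggregate_by_key is a hypothetical helper, and the sample observations are invented. It covers only the grouping of per-sentence outcomes for a single rule aspect, and does not attempt to show how tp and fp results are combined.

```python
from collections import defaultdict
from enum import Enum


class MatcherResult(Enum):
    # Stand-in mirroring the documented members; not kazu's own class.
    HIT = "HIT"
    MISS = "MISS"
    NOT_CONFIGURED = "NOT_CONFIGURED"


def aggregate_by_key(observations):
    """Collapse per-sentence outcomes into one result per key.

    observations: iterable of (matched_string, entity_class, rule_matched),
    one item per entity mention. The key is (matched_string, entity_class),
    as described above. Hypothetical helper, for illustration only.
    """
    hits = defaultdict(bool)
    for matched_string, entity_class, rule_matched in observations:
        # A single matching sentence anywhere in the document is enough.
        hits[(matched_string, entity_class)] |= rule_matched
    return {
        key: MatcherResult.HIT if hit else MatcherResult.MISS
        for key, hit in hits.items()
    }


# Invented sample data: the same string tagged under two entity classes.
observations = [
    ("MAP", "gene", False),
    ("MAP", "gene", True),
    ("MAP", "disease", False),
]
results = aggregate_by_key(observations)
```

Here the two "gene" mentions of "MAP" share one key, so a match in either sentence counts for both, while the "disease" reading is tracked separately.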

__call__(doc)[source]

Process documents and respond with processed and failed documents.

Note that many steps will be decorated by document_iterating_step() or document_batch_step() which will modify the ‘original’ __call__ function signature to match the expected signature for a step, as the decorators handle the exception/failed documents logic for you.

Parameters:
Returns:

The first element is all the provided docs (now modified by the processing), the second is the docs that failed to (fully) process correctly.

Return type:

tuple[list[Document], list[Document]]

__init__(class_matcher_rules, mention_matcher_rules)[source]
Parameters:
  • class_matcher_rules (dict[str, dict[Literal['tp', 'fp'], list[list[dict[str, ~typing.Any]]] | None]]) –

    these should follow the format:

    {
        "<entity class>": {
            "<tp or fp (for true positive or false positive rules respectively)>": [
                "<a list of rules>",
                "<according to the spaCy pattern matcher syntax>",
            ]
        }
    }
    

  • mention_matcher_rules (dict[str, dict[str, dict[Literal['tp', 'fp'], list[list[dict[str, ~typing.Any]]] | None]]]) –

    these should follow the format:

    {
        "<entity class>": {
            "<mention to disambiguate>": {
                "<tp or fp>": [
                    "<a list of rules>",
                    "<according to the spaCy pattern matcher syntax>",
                ]
            }
        }
    }
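
As an illustration of the two formats above, the following sketch builds both dictionaries as plain Python objects. The entity class ("gene"), the mention ("IVF") and the token patterns are hypothetical values chosen for the example, not rules shipped with kazu. Each rule is a list of token dictionaries in the spaCy Matcher pattern syntax, and None is used where an aspect is left unconfigured, as the type annotation permits.

```python
# Hypothetical class-level rules: keyed by entity class, then by "tp"/"fp".
class_matcher_rules = {
    "gene": {
        # One pattern: a single token whose lowercase form is "gene" or "protein".
        "tp": [[{"LOWER": {"IN": ["gene", "protein"]}}]],
        # No false positive rules configured for this class.
        "fp": None,
    }
}

# Hypothetical mention-level rules: keyed by entity class, then mention,
# then "tp"/"fp".
mention_matcher_rules = {
    "gene": {
        "IVF": {
            "tp": [[{"LOWER": "gene"}]],
            "fp": [
                # Two separate patterns; the second spans two tokens.
                [{"LOWER": "fertilisation"}],
                [{"LOWER": "in"}, {"LOWER": "vitro"}],
            ],
        }
    }
}
```

Both dictionaries would then be passed to __init__ as class_matcher_rules and mention_matcher_rules respectively.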