kazu.steps.joint_ner_and_linking.memory_efficient_string_matching

Classes

MemoryEfficientStringMatchingStep

A wrapper for the Aho-Corasick string-matching algorithm.

class kazu.steps.joint_ner_and_linking.memory_efficient_string_matching.MemoryEfficientStringMatchingStep

Bases: ParserDependentStep

A wrapper for the Aho-Corasick string-matching algorithm.

In testing, this implementation is comparable in speed to a spaCy PhraseMatcher, and uses a fraction of the memory. Since this implementation is unaware of NLP concepts such as tokenization, we backfill this capability by checking for word boundaries with a custom spaCy tokenizer.
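The idea of matching raw strings and then filtering by word boundaries can be sketched in plain Python. This is a minimal illustration only, not the code this step uses: the helper names (build_automaton, is_boundary, find_terms) are hypothetical, matching is case-insensitive by assumption, and a simple alphanumeric check stands in for the custom spaCy tokenizer mentioned above.

```python
from collections import deque


def build_automaton(terms):
    """Build an Aho-Corasick automaton (trie plus failure links) over the terms."""
    goto = [{}]  # goto[state][char] -> next state
    fail = [0]   # fail[state] -> state to fall back to on a mismatch
    out = [[]]   # out[state] -> terms that end at this state
    for term in terms:
        state = 0
        for ch in term.lower():
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(term)
    # Breadth-first pass: depth-1 states keep fail == 0, deeper states
    # inherit the failure link and outputs of their longest matching suffix.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def is_boundary(text, start, end):
    """Stand-in for a tokenizer check: the span must not touch alphanumerics."""
    left_ok = start == 0 or not text[start - 1].isalnum()
    right_ok = end == len(text) or not text[end].isalnum()
    return left_ok and right_ok


def find_terms(text, automaton):
    """Return (start, end, term) for every match that sits on word boundaries."""
    goto, fail, out = automaton
    lowered = text.lower()  # assumes lowercasing does not change string length
    state = 0
    hits = []
    for i, ch in enumerate(lowered):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term in out[state]:
            start = i + 1 - len(term.lower())
            if is_boundary(lowered, start, i + 1):
                hits.append((start, i + 1, term))
    return hits


automaton = build_automaton(["EGFR", "lung cancer", "cancer"])
for start, end, term in find_terms("EGFR mutations in lung cancer; not pEGFRx.", automaton):
    print(start, end, term)
```

The automaton scans the text once regardless of how many terms it holds. The boundary filter is what rejects the "EGFR" embedded inside "pEGFRx".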

__call__(doc)

Process documents and respond with processed and failed documents.

Note that many steps are decorated with document_iterating_step() or document_batch_step(), which modify the ‘original’ __call__ signature to match the signature expected of a step. The decorators handle the exception and failed-document logic for you.

Parameters:

docs
self (Self)
doc (Document)

Returns:

The first element is all the provided docs (now modified by the processing), the second is the docs that failed to (fully) process correctly.

Return type:

tuple[list[Document], list[Document]]
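As a rough illustration of how a decorator can turn a per-document function into a call that returns both the processed and the failed documents, consider the sketch below. It is not Kazu's implementation of document_iterating_step(): the decorator name iterate_documents, the UppercaseStep class, and the use of plain dicts in place of Document objects are all assumptions made for the example.

```python
import functools


def iterate_documents(per_doc_fn):
    """Hypothetical decorator: lift a per-document function to a batch call."""

    @functools.wraps(per_doc_fn)
    def batch_call(self, docs):
        docs = list(docs)
        failed = []
        for doc in docs:
            try:
                per_doc_fn(self, doc)  # modifies the document in place
            except Exception:
                # Record the failure and carry on with the remaining documents.
                failed.append(doc)
        # First element: all provided docs; second: those that failed.
        return docs, failed

    return batch_call


class UppercaseStep:
    """Toy step whose undecorated __call__ handles a single document."""

    @iterate_documents
    def __call__(self, doc):
        doc["text"] = doc["text"].upper()


step = UppercaseStep()
processed, failed = step([{"text": "abc"}, {"text": None}])
print(len(processed), len(failed))
```

The step author writes only the single-document logic, while callers see the batch signature and the two-list return value described above.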

__init__(parsers)

Parameters:

parsers (Iterable[OntologyParser]) – parsers that this step requires