kazu.steps.step

Module Attributes

Self

A TypeVar for the type of the class whose method is decorated with document_iterating_step() or document_batch_step().

Functions

document_batch_step(batch_doc_callable)

Add error handling to a method that processes batches of Documents.

document_iterating_step(per_doc_callable)

Handle a list of Documents and add error handling.

Classes

ParserDependentStep

A step that depends on ontology parsers in any form.

Step

class kazu.steps.step.ParserDependentStep

Bases: Step

A step that depends on ontology parsers in any form.

Steps that need information from parsers should subclass this class, in order for the internal databases to be correctly populated. Generally, these will be steps that have anything to do with Entity Linking.

__init__(parsers)

Parameters:

parsers (Iterable[OntologyParser]) – parsers that this step requires

class kazu.steps.step.Self

A TypeVar for the type of the class whose method is decorated with document_iterating_step() or document_batch_step().

alias of TypeVar(‘Self’)

class kazu.steps.step.Step

Bases: Protocol

__call__(docs)

Process documents and respond with processed and failed documents.

Note that many steps decorate their __call__ method with document_iterating_step() or document_batch_step(). These decorators change the signature of the ‘original’ __call__ function to the one expected for a step, and they handle the exception and failed-document logic for you.

Parameters:

docs (list[Document])

Returns:

The first element is all the provided docs (now modified by the processing), the second is the docs that failed to (fully) process correctly.

Return type:

tuple[list[Document], list[Document]]
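
As an illustration of this return contract, the following is a minimal sketch of a step that implements __call__ by hand. SketchDocument and UppercaseStep are hypothetical stand-ins rather than kazu classes, and the "PROCESSING_EXCEPTION" key string is assumed for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a kazu Document: text plus a metadata dict."""

    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


class UppercaseStep:
    """Hypothetical step that handles failures itself, without a decorator."""

    def __call__(
        self, docs: list[SketchDocument]
    ) -> tuple[list[SketchDocument], list[SketchDocument]]:
        failed: list[SketchDocument] = []
        for doc in docs:
            try:
                if not doc.text:
                    raise ValueError("empty document")
                doc.text = doc.text.upper()  # mutate the document in place
            except Exception as exc:
                # Assumed metadata key; record the error on the document.
                doc.metadata["PROCESSING_EXCEPTION"] = repr(exc)
                failed.append(doc)
        # First element: every input doc. Second element: only the failures.
        return docs, failed


docs = [SketchDocument("abc"), SketchDocument("")]
processed, failed = UppercaseStep()(docs)
```

Note that the failed document appears in both elements of the result: the first list always contains every document that was passed in.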

__init__(*args, **kwargs)

classmethod namespace()

Metadata to name/describe the step, used in various places.

Defaults to cls.__name__.

Return type:

str
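
The default can be pictured with a small sketch. SketchStep and its subclasses are hypothetical classes written for this example; they only mirror the documented behaviour (a classmethod that defaults to cls.__name__) and are not kazu's own code.

```python
class SketchStep:
    """Hypothetical base class mirroring the documented default."""

    @classmethod
    def namespace(cls) -> str:
        # cls is the class the method is called on, so subclasses
        # report their own name without overriding anything.
        return cls.__name__


class MyLinkingStep(SketchStep):
    pass


class RenamedStep(SketchStep):
    @classmethod
    def namespace(cls) -> str:
        # Overriding gives the step a name independent of the class name.
        return "custom_linker"


default_name = MyLinkingStep.namespace()
custom_name = RenamedStep.namespace()
```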

kazu.steps.step.document_batch_step(batch_doc_callable)

Add error handling to a method that processes batches of Documents.

Use this to decorate a method that processes a batch of Documents at a time. The resulting method wraps the call to the decorated function in error handling that records exceptions in the PROCESSING_EXCEPTION metadata of the documents. Failed documents are returned as the second element of the return value, as expected by Step.__call__().

Generally speaking, it saves effort and repetition to decorate the __call__ method of a Step with either document_iterating_step() or document_batch_step(), rather than implementing the error handling in the Step itself.

Normally, document_iterating_step() would be used in preference to document_batch_step(), unless the method involves computation which is more efficient when run in a batch, such as inference with a transformer-based Machine Learning model, or using spaCy’s pipe method.

Note that this only works on a method of a class, not on a standalone function, because the resulting wrapper expects to pass ‘self’ through as a parameter.

Parameters:

batch_doc_callable (Callable[[Self, list[Document]], Any]) – A function that processes a batch of documents and that you want to use as the __call__ method of a Step. It must do its work by mutating the input documents, because the return value is ignored.

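Returns:

A method with the signature expected by Step.__call__(), which calls batch_doc_callable inside the error handling described above.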

Return type:

Callable[[Self, list[Document]], tuple[list[Document], list[Document]]]
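
To make the described behaviour concrete, here is a minimal sketch of a decorator of this kind. It is not kazu's implementation: batch_step_sketch, SketchDocument and the two example steps are hypothetical, the PROCESSING_EXCEPTION key string is assumed, and the choice to mark every document in the batch as failed when the batch raises is an assumption of the sketch.

```python
import functools
import traceback
from dataclasses import dataclass, field
from typing import Any

PROCESSING_EXCEPTION = "PROCESSING_EXCEPTION"  # assumed metadata key


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a kazu Document."""

    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def batch_step_sketch(batch_doc_callable):
    """Wrap a method taking (self, docs) so it returns (docs, failed_docs)."""

    @functools.wraps(batch_doc_callable)
    def step_call(self, docs):
        try:
            # The return value of the wrapped method is ignored.
            batch_doc_callable(self, docs)
        except Exception:
            # Assumption: the wrapper cannot tell which document caused
            # the error, so the whole batch is reported as failed.
            message = traceback.format_exc()
            for doc in docs:
                doc.metadata[PROCESSING_EXCEPTION] = message
            return docs, list(docs)
        return docs, []

    return step_call


class TokenCountStep:
    @batch_step_sketch
    def __call__(self, docs):
        for doc in docs:
            doc.metadata["n_tokens"] = len(doc.text.split())


class AlwaysFailsStep:
    @batch_step_sketch
    def __call__(self, docs):
        raise RuntimeError("model unavailable")


docs = [SketchDocument("a b c"), SketchDocument("d e")]
ok_docs, ok_failed = TokenCountStep()(docs)
bad_docs, bad_failed = AlwaysFailsStep()([SketchDocument("x")])
```

Because the decorated function is looked up on the class and bound like any other method, the wrapper receives the instance as its first argument, which is why the decorator is applied to a method rather than to a standalone function.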

kazu.steps.step.document_iterating_step(per_doc_callable)

Handle a list of Documents and add error handling.

Use this to decorate a method that processes a single Document. The resulting method will then iterate over a list of Documents, calling the decorated function for each Document. Errors are handled automatically and added to the PROCESSING_EXCEPTION metadata of documents, with failed docs returned as the second element of the return value, as expected by Step.__call__().

Generally speaking, it saves effort and repetition to decorate the __call__ method of a Step with either document_iterating_step() or document_batch_step(), rather than implementing the error handling in the Step itself.

Normally, document_iterating_step() would be used in preference to document_batch_step(), unless the method involves computation which is more efficient when run in a batch, such as inference with a transformer-based Machine Learning model, or using spaCy’s pipe method.

Note that this only works on a method of a class, not on a standalone function, because the resulting wrapper expects to pass ‘self’ through as a parameter.

Parameters:

per_doc_callable (Callable[[Self, Document], Any]) – A function that processes a single document and that you want to use as the __call__ method of a Step. It must do its work by mutating the input document, because the return value is ignored.

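Returns:

A method with the signature expected by Step.__call__(), which calls per_doc_callable on each document inside the error handling described above.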

Return type:

Callable[[Self, list[Document]], tuple[list[Document], list[Document]]]
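
The per-document behaviour described above could be sketched as follows. This is an illustration under stated assumptions, not kazu's implementation: iterating_step_sketch, SketchDocument and LowercaseStep are hypothetical names, and the PROCESSING_EXCEPTION key string and the use of a formatted traceback as the stored value are assumed.

```python
import functools
import traceback
from dataclasses import dataclass, field
from typing import Any

PROCESSING_EXCEPTION = "PROCESSING_EXCEPTION"  # assumed metadata key


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a kazu Document."""

    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def iterating_step_sketch(per_doc_callable):
    """Wrap a method taking (self, doc) so it accepts a list of docs."""

    @functools.wraps(per_doc_callable)
    def step_call(self, docs):
        failed_docs = []
        for doc in docs:
            try:
                # The return value of the wrapped method is ignored.
                per_doc_callable(self, doc)
            except Exception:
                # Only the document that raised is marked as failed;
                # the loop carries on with the remaining documents.
                doc.metadata[PROCESSING_EXCEPTION] = traceback.format_exc()
                failed_docs.append(doc)
        return docs, failed_docs

    return step_call


class LowercaseStep:
    @iterating_step_sketch
    def __call__(self, doc):
        if not doc.text:
            raise ValueError("empty text")
        doc.text = doc.text.lower()


docs = [SketchDocument("ABC"), SketchDocument(""), SketchDocument("Def")]
processed, failed = LowercaseStep()(docs)
```

In this sketch one bad document does not stop the others from being processed, which is the practical difference from the batch variant, where a single exception affects the whole call.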