kazu.steps.other.cleanup

Classes

class kazu.steps.other.cleanup.CleanupAction[source]

Bases: Protocol

__init__(*args, **kwargs)[source]

cleanup(doc)[source]
Parameters:

doc (Document)

Return type:

None

class kazu.steps.other.cleanup.CleanupStep[source]

Bases: Step

__call__(doc)[source]

Process documents and respond with processed and failed documents.

Note that many steps will be decorated by document_iterating_step() or document_batch_step(), which modify the ‘original’ __call__ function signature to match the expected signature for a step; the decorators handle the exception/failed documents logic for you.

Parameters:
Returns:

The first element is all the provided docs (now modified by the processing), the second is the docs that failed to (fully) process correctly.

Return type:

tuple[list[Document], list[Document]]
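The return contract described above can be illustrated with a small, self-contained sketch. It deliberately does not import kazu: FakeDoc, UppercaseAction, RejectEmptyAction and SketchCleanupStep are hypothetical stand-ins, and the try/except loop is only an assumption about the kind of failure handling the decorators provide, not the library's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a Document; not the real kazu class."""

    text: str


class UppercaseAction:
    """Hypothetical action exposing the cleanup(doc) -> None method."""

    def cleanup(self, doc: FakeDoc) -> None:
        doc.text = doc.text.upper()


class RejectEmptyAction:
    """Hypothetical action that fails on empty documents."""

    def cleanup(self, doc: FakeDoc) -> None:
        if not doc.text:
            raise ValueError("empty document")


class SketchCleanupStep:
    """Sketch of a step that runs every action on every document."""

    def __init__(self, cleanup_actions):
        self.cleanup_actions = cleanup_actions

    def __call__(self, docs):
        failed = []
        for doc in docs:
            try:
                for action in self.cleanup_actions:
                    action.cleanup(doc)  # actions mutate the doc in place
            except Exception:
                # Assumption: a doc that raises is reported as failed.
                failed.append(doc)
        # First element: all provided docs; second: those that failed.
        return docs, failed
```

Calling SketchCleanupStep([RejectEmptyAction(), UppercaseAction()]) on a list of documents returns the same list as the first element, with any document that raised during cleanup repeated in the second.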

__init__(cleanup_actions)[source]
Parameters:

cleanup_actions (list[CleanupAction])

class kazu.steps.other.cleanup.DropByMinLenFilter[source]

Bases: object

__call__(entity)[source]

Call self as a function.

Parameters:

entity (Entity)

Return type:

bool

__init__(min_len)[source]
Parameters:

min_len (int)

class kazu.steps.other.cleanup.DropEntityIfClassNotMatchedFilter[source]

Bases: object

__call__(entity)[source]

Call self as a function.

Parameters:

entity (Entity)

Return type:

bool

__init__(required_classes)[source]
Parameters:

required_classes (Iterable[str])

class kazu.steps.other.cleanup.DropEntityIfMatchInSetFilter[source]

Bases: object

__call__(entity)[source]

Call self as a function.

Parameters:

entity (Entity)

Return type:

bool

__init__(drop_dict)[source]
Parameters:

drop_dict (dict[str, Iterable[str]])

class kazu.steps.other.cleanup.DropMappingsByConfidenceMappingFilter[source]

Bases: object

__call__(mapping)[source]

Call self as a function.

Parameters:

mapping (Mapping)

Return type:

bool

__init__(string_match_ranks_to_drop, disambiguation_ranks_to_drop)[source]
Parameters:

string_match_ranks_to_drop

disambiguation_ranks_to_drop

class kazu.steps.other.cleanup.DropMappingsByParserNameRankAction[source]

Bases: CleanupAction

Removes instances of Mapping based upon some preferential order of parsers.

Useful when an entity class maps to multiple parsers and you want to filter results by a predefined order of preference. For instance, you may prefer Meddra entities over Mondo ones, but will accept Mondo ones if Meddra mappings aren’t available.

Caution

For this class to be configured correctly, all the parsers you intend to use with it must have populated the metadata database first. See populate_databases().

__init__(entity_class_to_parser_name_rank)[source]
Parameters:

entity_class_to_parser_name_rank (dict[str, list[str]]) – For a given entity class, only retain the mappings from the first parser that an entity has mappings for, based on list ordering (first is preferred).
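The ranking rule described for entity_class_to_parser_name_rank can be sketched in plain Python. This is a minimal illustration, not the library's code: mappings are modelled as hypothetical (parser_name, idx) tuples, the function name retain_preferred_parser is invented for this example, and leaving mappings untouched when the entity class or its parsers are not ranked is an assumption.

```python
def retain_preferred_parser(entity_class, mappings, entity_class_to_parser_name_rank):
    """Keep only the mappings from the most preferred parser that is present.

    mappings is a list of (parser_name, idx) tuples for a single entity.
    """
    rank = entity_class_to_parser_name_rank.get(entity_class)
    if rank is None:
        # Assumption: entity classes without a ranking are left as they are.
        return list(mappings)
    present = {parser for parser, _ in mappings}
    for parser in rank:  # list order encodes preference, first is preferred
        if parser in present:
            return [m for m in mappings if m[0] == parser]
    # Assumption: if no ranked parser produced a mapping, keep everything.
    return list(mappings)
```

With a ranking of {"disease": ["MEDDRA", "MONDO"]}, an entity that has both MEDDRA and MONDO mappings keeps only the MEDDRA ones, while an entity with only MONDO mappings keeps those.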

cleanup(doc)[source]
Parameters:

doc (Document)

Return type:

None

class kazu.steps.other.cleanup.DropUnmappedEntityFilter[source]

Bases: object

__call__(ent)[source]

Call self as a function.

Parameters:

ent (Entity)

Return type:

bool

__init__(from_ent_namespaces=None, min_confidence_level=MentionConfidence.PROBABLE)[source]
Parameters:

from_ent_namespaces

min_confidence_level

class kazu.steps.other.cleanup.EntityFilterCleanupAction[source]

Bases: object

__init__(filter_fns)[source]
Parameters:

filter_fns (list[Callable[[Entity], bool]])

cleanup(doc)[source]
Parameters:

doc (Document)

Return type:

None
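How a list of filter_fns might be combined is sketched below. This page does not state what a True return value means; the sketch assumes, based only on the Drop... naming of the filter classes, that True means "drop this entity". FakeEntity, MinLenFilter and apply_entity_filters are hypothetical stand-ins, and treating min_len as an exclusive lower bound is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FakeEntity:
    """Hypothetical stand-in for an Entity; not the real kazu class."""

    match: str
    entity_class: str


class MinLenFilter:
    """Hypothetical filter: True (drop) when the matched text is too short."""

    def __init__(self, min_len: int) -> None:
        self.min_len = min_len

    def __call__(self, entity: FakeEntity) -> bool:
        # Assumption: entities shorter than min_len are dropped.
        return len(entity.match) < self.min_len


def apply_entity_filters(
    entities: List[FakeEntity],
    filter_fns: List[Callable[[FakeEntity], bool]],
) -> List[FakeEntity]:
    """Keep entities for which no filter returns True."""
    return [e for e in entities if not any(fn(e) for fn in filter_fns)]
```

Because each filter is just a callable taking an entity and returning a bool, a plain function or lambda can be mixed into the same list as a filter object.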

class kazu.steps.other.cleanup.LinkingCandidateRemovalCleanupAction[source]

Bases: object

__init__()[source]
Return type:

None

cleanup(doc)[source]
Parameters:

doc (Document)

Return type:

None

class kazu.steps.other.cleanup.MappingFilterCleanupAction[source]

Bases: object

__init__(filter_fns)[source]
Parameters:

filter_fns (list[Callable[[Mapping], bool]])

cleanup(doc)[source]
Parameters:

doc (Document)

Return type:

None

class kazu.steps.other.cleanup.StripMappingURIsAction[source]

Bases: object

Strip the IDs in kazu.data.Mapping to just the final part of the URI.

For example, this will turn http://purl.obolibrary.org/obo/MONDO_0004979 into just MONDO_0004979.

If you don’t want URI stripping at all, don’t use this Action as part of the CleanupStep/in the pipeline.

__init__(parsers_to_strip=None)[source]
Parameters:

parsers_to_strip (Iterable[str] | None) – If you only want to strip URIs for some parsers and not others, provide the parsers to strip here. Otherwise, all parsers will have their IDs stripped, which avoids having to keep a full list of parsers in sync here.
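The stripping behaviour can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the class's actual implementation: strip_uri and strip_mapping_ids are hypothetical helper names, mappings are modelled as (parser_name, idx) tuples, and only http/https identifiers are treated as URIs.

```python
from urllib.parse import urlparse


def strip_uri(idx: str) -> str:
    """Return the final path segment of an http(s) URI, else idx unchanged."""
    parsed = urlparse(idx)
    if parsed.scheme not in ("http", "https"):
        # Assumption: identifiers that are not http(s) URIs are left alone.
        return idx
    return parsed.path.rstrip("/").split("/")[-1]


def strip_mapping_ids(mappings, parsers_to_strip=None):
    """Strip ids for the chosen parsers, or for all parsers when None.

    mappings is a list of (parser_name, idx) tuples.
    """
    wanted = None if parsers_to_strip is None else set(parsers_to_strip)
    return [
        (parser, strip_uri(idx)) if wanted is None or parser in wanted else (parser, idx)
        for parser, idx in mappings
    ]
```

Passing parsers_to_strip=None mirrors the documented default of stripping every parser's IDs, while passing a list restricts stripping to the named parsers.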

cleanup(doc)[source]
Parameters:

doc (Document)

Return type:

None