kazu.ontology_preprocessing.curation_utils

Functions

batch(iterable[, n])

dump_ontology_string_resources(resources, path)

Dump an iterable of kazu.data.OntologyStringResources to the file system.

load_global_actions(path)

Load an instance of GlobalParserActions from a file path.

load_ontology_string_resources(path)

Load kazu.data.OntologyStringResources from a file path or directory.

Classes

AutofixStrategy

LinkingCandidateModificationResult

OntologyResourceProcessor

An OntologyResourceProcessor is responsible for modifying the set of LinkingCandidates produced by a kazu.ontology_preprocessing.base.OntologyParser with any relevant GlobalParserActions and/or OntologyStringResource associated with the parser.

OntologyResourceSetCompleteReport

Describes the state of kazu.data.OntologyStringResources configured for a given kazu.ontology_preprocessing.base.OntologyParser, such as any conflicts in how string matching is configured, or discrepancies between resources produced by kazu.ontology_preprocessing.autocuration.AutoCurator and their human curated overrides.

OntologyResourceSetConflictReport

OntologyResourceSetConflictReport(clean_resources: set[kazu.data.OntologyStringResource], merged_resources: set[kazu.data.OntologyStringResource], normalisation_conflicts: set[frozenset[kazu.data.OntologyStringResource]], case_conflicts: set[frozenset[kazu.data.OntologyStringResource]])

OntologyResourceSetMergeReport

OntologyResourceSetMergeReport(obsolete_resources: set[kazu.data.OntologyStringResource], effective_resources: set[kazu.data.OntologyStringResource], superfluous_resources: set[kazu.data.OntologyStringResource], resources_with_discrepancies: set[tuple[kazu.data.OntologyStringResource, kazu.data.OntologyStringResource]])

OntologyStringConflictAnalyser

Find and potentially fix conflicting behaviour in a set of kazu.data.OntologyStringResources.

Exceptions

exception kazu.ontology_preprocessing.curation_utils.CurationError[source]

Bases: Exception

class kazu.ontology_preprocessing.curation_utils.AutofixStrategy[source]

Bases: AutoNameEnum

__new__(value)[source]
NONE = 'NONE'
OPTIMISTIC = 'OPTIMISTIC'
PESSIMISTIC = 'PESSIMISTIC'
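Each member listed above has a value equal to its own name. The following is a minimal sketch of how an enum with that property can be built using only the standard library. AutoNameEnumSketch and AutofixStrategySketch are hypothetical stand-ins for illustration and are not the kazu classes themselves.

```python
from enum import Enum, auto


class AutoNameEnumSketch(Enum):
    """Hypothetical base class: auto() yields the member's own name as its value."""

    @staticmethod
    def _generate_next_value_(name, start, count, last_values):
        return name


class AutofixStrategySketch(AutoNameEnumSketch):
    """Hypothetical mirror of the three members documented above."""

    NONE = auto()
    OPTIMISTIC = auto()
    PESSIMISTIC = auto()


# Because values equal names, a member can be looked up from its string form.
strategy = AutofixStrategySketch("OPTIMISTIC")
```

A practical consequence of this design is that the string form of a member round-trips through serialisation without a separate lookup table.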
class kazu.ontology_preprocessing.curation_utils.LinkingCandidateModificationResult[source]

Bases: AutoNameEnum

__new__(value)[source]
ID_SET_MODIFIED = 'ID_SET_MODIFIED'
LINKING_CANDIDATE_ADDED = 'LINKING_CANDIDATE_ADDED'
LINKING_CANDIDATE_DROPPED = 'LINKING_CANDIDATE_DROPPED'
NO_ACTION = 'NO_ACTION'
class kazu.ontology_preprocessing.curation_utils.OntologyResourceProcessor[source]

Bases: object

An OntologyResourceProcessor is responsible for modifying the set of LinkingCandidates produced by a kazu.ontology_preprocessing.base.OntologyParser with any relevant GlobalParserActions and/or OntologyStringResource associated with the parser.

This class should be used before instances of LinkingCandidates are loaded into the internal database representation.

__init__(parser_name, entity_class, global_actions, resources, linking_candidates)[source]
Parameters:
export_resources_and_final_candidates()[source]

Perform any updates required to the linking candidates as specified in the curations/global actions.

The returned OntologyStringResources can be used for Dictionary based NER, whereas the returned LinkingCandidates can be loaded into the internal database for linking.

Return type:

tuple[list[OntologyStringResource], set[LinkingCandidate]]

classmethod resource_sort_key(resource)[source]

Determines the order resources are processed in.

We use associated_id_sets as a key, so that any overrides will be processed after any original behaviours.

Parameters:

resource (OntologyStringResource)

Return type:

tuple[int, bool]

BEHAVIOUR_APPLICATION_ORDER = (OntologyStringBehaviour.ADD_FOR_NER_AND_LINKING, OntologyStringBehaviour.ADD_FOR_LINKING_ONLY, OntologyStringBehaviour.DROP_FOR_LINKING)
class kazu.ontology_preprocessing.curation_utils.OntologyResourceSetCompleteReport[source]

Bases: object

Describes the state of kazu.data.OntologyStringResources configured for a given kazu.ontology_preprocessing.base.OntologyParser, such as any conflicts in how string matching is configured, or discrepancies between resources produced by kazu.ontology_preprocessing.autocuration.AutoCurator and their human curated overrides.

__init__(intermediate_linking_candidates, final_conflict_report, human_conflict_report=None, merge_report=None)[source]
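Parameters:
  • intermediate_linking_candidates (set[LinkingCandidate])

  • final_conflict_report (OntologyResourceSetConflictReport)

  • human_conflict_report (OntologyResourceSetConflictReport | None)

  • merge_report (OntologyResourceSetMergeReport | None)

Return type: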

None

write_reports_for_parser(path, parser_name)[source]
Parameters:
Return type:

None

final_conflict_report: OntologyResourceSetConflictReport

report of resource conflict resolution after merging of autogenerated resources and human curations

human_conflict_report: OntologyResourceSetConflictReport | None = None

report of resource conflict in human curation set, if available

intermediate_linking_candidates: set[LinkingCandidate]

parser linking candidates before they are processed with OntologyStringResources

merge_report: OntologyResourceSetMergeReport | None = None

report of result of merging human and autogenerated resources, if available

class kazu.ontology_preprocessing.curation_utils.OntologyResourceSetConflictReport[source]

Bases: object

OntologyResourceSetConflictReport(clean_resources: set[kazu.data.OntologyStringResource], merged_resources: set[kazu.data.OntologyStringResource], normalisation_conflicts: set[frozenset[kazu.data.OntologyStringResource]], case_conflicts: set[frozenset[kazu.data.OntologyStringResource]])

__init__(clean_resources, merged_resources, normalisation_conflicts, case_conflicts)[source]
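Parameters:
  • clean_resources (set[OntologyStringResource])

  • merged_resources (set[OntologyStringResource])

  • normalisation_conflicts (set[frozenset[OntologyStringResource]])

  • case_conflicts (set[frozenset[OntologyStringResource]])

Return type: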

None

write_normalisation_conflict_report(path)[source]
Parameters:

path (str | Path)

Return type:

None

case_conflicts: set[frozenset[OntologyStringResource]]

Resources that conflict on case

clean_resources: set[OntologyStringResource]

Resources with no conflicts

merged_resources: set[OntologyStringResource]

Resources that can be safely merged without affecting OntologyStringBehaviour. However, they may still conflict on MentionConfidence and/or case sensitivity

normalisation_conflicts: set[frozenset[OntologyStringResource]]

Resources that conflict on normalisation value

class kazu.ontology_preprocessing.curation_utils.OntologyResourceSetMergeReport[source]

Bases: object

OntologyResourceSetMergeReport(obsolete_resources: set[kazu.data.OntologyStringResource], effective_resources: set[kazu.data.OntologyStringResource], superfluous_resources: set[kazu.data.OntologyStringResource], resources_with_discrepancies: set[tuple[kazu.data.OntologyStringResource, kazu.data.OntologyStringResource]])

__init__(obsolete_resources, effective_resources, superfluous_resources, resources_with_discrepancies)[source]
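Parameters:
  • obsolete_resources (set[OntologyStringResource])

  • effective_resources (set[OntologyStringResource])

  • superfluous_resources (set[OntologyStringResource])

  • resources_with_discrepancies (set[tuple[OntologyStringResource, OntologyStringResource]])

Return type: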

None

write_ontology_merge_report(path)[source]
Parameters:

path (str | Path)

Return type:

None

effective_resources: set[OntologyStringResource]

human and autogenerated resources that are actively in use

obsolete_resources: set[OntologyStringResource]

human resources that no longer match any strings in the underlying parser data

resources_with_discrepancies: set[tuple[OntologyStringResource, OntologyStringResource]]

pairs of (human resource, autogenerated resource) that only partially match on their original forms, suggesting that the underlying parser data has changed in some way since the last upgrade. In this scenario, it is recommended to replace the human form with the new autogenerated version for consistency

superfluous_resources: set[OntologyStringResource]

human resources that match any strings in the underlying parser data, but result in the same behaviour as the autogenerated resource, and therefore can be eliminated to reduce the management burden of human resources

class kazu.ontology_preprocessing.curation_utils.OntologyStringConflictAnalyser[source]

Bases: object

Find and potentially fix conflicting behaviour in a set of kazu.data.OntologyStringResources.

__init__(entity_class, autofix=AutofixStrategy.NONE)[source]
Parameters:
  • entity_class (str) – entity class that this analyser will handle

  • autofix (AutofixStrategy) – Should any conflicts be automatically fixed, such that the behaviour is consistent within this set? Note that this does not guarantee that the optimal behaviour for a conflict is preserved.

autofix_resources(resource_conflicts)[source]

Fix conflicts in resources by producing a new set of resources with consistent behaviour.

This ensures that there are no conflicts regarding any of these properties:

  • the combination of case sensitivity and mention confidence

  • associated id set

  • resource behaviour

Parameters:

resource_conflicts (set[frozenset[OntologyStringResource]])

Return type:

set[OntologyStringResource]

static build_synonym_defaultdict(resources)[source]
Parameters:

resources (Iterable[OntologyStringResource])

Return type:

defaultdict[str, set[OntologyStringResource]]

static check_for_case_conflicts_across_resources(resources, strict=False)[source]

Find conflicts in case sensitivity within a set of resources.

Conflicts can occur when strings differ by case sensitivity, and a case-insensitive synonym will produce a MentionConfidence of equal or higher rank than a case-sensitive one.

Parameters:
  • resources (set[OntologyStringResource])

  • strict (bool) – if True, a string is treated as conflicted whenever it has multiple mention confidences, regardless of case sensitivity

Returns:

a set of conflicted subsets, and a set of clean resources.

Return type:

tuple[set[frozenset[OntologyStringResource]], set[OntologyStringResource]]
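The rule described above can be illustrated with a small, self-contained sketch. It assumes a simplified model in which each string carries a case-sensitivity flag and a confidence rank. Confidence, SynonymSketch and find_case_conflicts_sketch are hypothetical names, not kazu classes or functions, and the strict option is left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    """Hypothetical ranking: a larger value means a more confident mention."""

    POSSIBLE = 1
    PROBABLE = 2
    HIGHLY_LIKELY = 3


@dataclass(frozen=True)
class SynonymSketch:
    """Hypothetical, simplified stand-in for one string of a resource."""

    text: str
    case_sensitive: bool
    confidence: Confidence


def find_case_conflicts_sketch(synonyms):
    """Return (conflicted subsets, clean synonyms), grouped by lowercased text."""
    by_lower = defaultdict(set)
    for syn in synonyms:
        by_lower[syn.text.lower()].add(syn)

    conflicts, clean = set(), set()
    for group in by_lower.values():
        sensitive = [s.confidence for s in group if s.case_sensitive]
        insensitive = [s.confidence for s in group if not s.case_sensitive]
        # Conflict: a case-insensitive entry ranks equal to or above a case-sensitive one.
        if sensitive and insensitive and max(insensitive) >= min(sensitive):
            conflicts.add(frozenset(group))
        else:
            clean.update(group)
    return conflicts, clean


als_cs = SynonymSketch("ALS", True, Confidence.PROBABLE)
als_ci = SynonymSketch("als", False, Confidence.HIGHLY_LIKELY)
asthma = SynonymSketch("asthma", False, Confidence.HIGHLY_LIKELY)
conflicts, clean = find_case_conflicts_sketch({als_cs, als_ci, asthma})
```

Here the two "ALS" entries are reported together as one conflicted subset, because the case-insensitive form would outrank the case-sensitive one, while "asthma" is returned as clean.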

check_for_normalised_behaviour_conflicts_and_merge_if_possible(resources)[source]

Find behaviour conflicts in the resource set indexed by syn_norm.

If possible, resources will be merged. If not, they will be added to a set of conflicting resources.

Parameters:

resources (set[OntologyStringResource])

Returns:

A set of newly created merged resources, a set of resources eliminated through the merge, and a set of resource subsets that conflict and cannot be merged.

Return type:

tuple[set[OntologyStringResource], set[OntologyStringResource], set[frozenset[OntologyStringResource]]]
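As a rough illustration of the merge-or-flag logic, the sketch below groups simplified resources by a normalised string, merges groups that share one behaviour, and reports the rest as conflicts. ResourceSketch, normalise_sketch and merge_or_flag are hypothetical names, the normaliser is a toy, and ValueError stands in for CurationError. None of this is the library's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSketch:
    """Hypothetical stand-in: a behaviour plus the original strings it covers."""

    behaviour: str
    strings: frozenset


def normalise_sketch(text):
    # Assumption: a toy normaliser (upper-case, collapse whitespace).
    return " ".join(text.upper().split())


def norm_key(resource):
    """Return the single normalised value of a resource, or fail if there are several."""
    keys = {normalise_sketch(s) for s in resource.strings}
    if len(keys) != 1:
        raise ValueError("resource produces multiple normalised values")
    return keys.pop()


def merge_or_flag(resources):
    """Return (merged, eliminated, conflicts) after grouping by normalised string."""
    by_norm = defaultdict(set)
    for res in resources:
        by_norm[norm_key(res)].add(res)

    merged, eliminated, conflicts = set(), set(), set()
    for group in by_norm.values():
        if len(group) < 2:
            continue  # nothing to merge or compare
        behaviours = {res.behaviour for res in group}
        if len(behaviours) == 1:
            # Same behaviour: fold the group into one new resource.
            strings = frozenset().union(*(res.strings for res in group))
            merged.add(ResourceSketch(behaviours.pop(), strings))
            eliminated.update(group)
        else:
            conflicts.add(frozenset(group))
    return merged, eliminated, conflicts


a = ResourceSketch("ADD_FOR_NER_AND_LINKING", frozenset({"Breast cancer"}))
b = ResourceSketch("ADD_FOR_NER_AND_LINKING", frozenset({"breast  cancer"}))
c = ResourceSketch("ADD_FOR_LINKING_ONLY", frozenset({"egfr"}))
d = ResourceSketch("DROP_FOR_LINKING", frozenset({"EGFR"}))
merged, eliminated, conflicts = merge_or_flag({a, b, c, d})
```

In this example a and b collapse into one merged resource, while c and d share a normalised form but disagree on behaviour, so they are returned as a conflicting subset.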

static find_case_conflicts(maybe_good_resources_by_active_syn_lower, strict=False)[source]
Parameters:
Return type:

tuple[set[frozenset[OntologyStringResource]], set[OntologyStringResource]]

merge_human_and_auto_resources(human_curated_resources, autocurated_resources)[source]

Merge a set of human curated resources with a set of automatically curated resources, preferring the human set where possible.

Note that the output is not guaranteed to be conflict-free; consider calling verify_resource_set_integrity() on the result.

Parameters:
Return type:

OntologyResourceSetMergeReport
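The categories in OntologyResourceSetMergeReport can be pictured with a deliberately simplified model in which each resource is reduced to a string mapped to a behaviour name. The sketch assumes that every string in the parser data has an autogenerated entry, and it leaves out resources_with_discrepancies. merge_sketch is a hypothetical helper and is not part of kazu.

```python
def merge_sketch(human, auto):
    """Classify string -> behaviour mappings (hypothetical, simplified model)."""
    obsolete, superfluous = {}, {}
    effective = dict(auto)  # start from the autogenerated behaviour
    for text, behaviour in human.items():
        if text not in auto:
            # No longer matches any string in the parser data.
            obsolete[text] = behaviour
        elif auto[text] == behaviour:
            # Same outcome as the autogenerated entry, so the human entry adds nothing.
            superfluous[text] = behaviour
        else:
            # The human curation is preferred over the autogenerated one.
            effective[text] = behaviour
    return obsolete, superfluous, effective


human = {
    "egfr": "DROP_FOR_LINKING",
    "her2": "ADD_FOR_LINKING_ONLY",
    "old term": "ADD_FOR_LINKING_ONLY",
}
auto = {
    "egfr": "ADD_FOR_NER_AND_LINKING",
    "her2": "ADD_FOR_LINKING_ONLY",
    "tp53": "ADD_FOR_NER_AND_LINKING",
}
obsolete, superfluous, effective = merge_sketch(human, auto)
```

In this toy run "old term" is obsolete, "her2" is superfluous, and the human override for "egfr" replaces the autogenerated behaviour in the effective set.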

verify_resource_set_integrity(resources)[source]

Verify that a set of resources has consistent behaviour.

Conflicts can occur for the following reasons:

  1. If two or more resources normalise to the same string, but have different kazu.data.OntologyStringBehaviour.

  2. If two or more resources normalise to the same value, but have different associated ID sets specified, such that one would override the other.

  3. If two or more resources have conflicting values for case sensitivity and kazu.data.MentionConfidence. For example, a case-insensitive resource cannot have a higher mention confidence value than a case-sensitive one for the same string.

Parameters:

resources (set[OntologyStringResource])

Raises:

CurationError – if one or more resources produce multiple normalised values

Return type:

OntologyResourceSetConflictReport

kazu.ontology_preprocessing.curation_utils.batch(iterable, n=1)[source]
Parameters:
Return type:

Iterable[list[OntologyStringResource]]
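No description is given for batch beyond its signature and return type. Assuming it splits an iterable into consecutive lists of at most n items, the behaviour could look like the sketch below. batch_sketch is a hypothetical stand-in written with the standard library only.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batch_sketch(iterable: Iterable[T], n: int = 1) -> Iterator[list[T]]:
    """Yield consecutive lists of at most n items (assumed behaviour)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    iterator = iter(iterable)
    # islice pulls up to n items per pass; an empty chunk means the input is exhausted.
    while chunk := list(islice(iterator, n)):
        yield chunk


chunks = list(batch_sketch(["a", "b", "c", "d", "e"], n=2))
```

With five items and n=2, the final list holds the single leftover item.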

kazu.ontology_preprocessing.curation_utils.dump_ontology_string_resources(resources, path, force=False, split_at=10000)[source]

Dump an iterable of kazu.data.OntologyStringResources to the file system.

Parameters:
Return type:

None

kazu.ontology_preprocessing.curation_utils.load_global_actions(path)[source]

Load an instance of GlobalParserActions from a file path.

Parameters:

path (str | Path) – path to a json serialised GlobalParserActions

Return type:

GlobalParserActions

kazu.ontology_preprocessing.curation_utils.load_ontology_string_resources(path)[source]

Load kazu.data.OntologyStringResources from a file path or directory.

Parameters:

path (str | Path) – path to a jsonl file or directory of jsonl files that map to kazu.data.OntologyStringResource

Return type:

set[OntologyStringResource]
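To make the "jsonl file or directory of jsonl files" input concrete, here is a minimal sketch that reads one JSON object per line from either kind of path. It returns plain dictionaries rather than OntologyStringResource instances. load_jsonl_records and the "text" field are hypothetical and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl_records(path):
    """Read one JSON object per line from a .jsonl file or a directory of them."""
    path = Path(path)
    # A directory is expanded to its .jsonl files, in sorted order for reproducibility.
    files = sorted(path.glob("*.jsonl")) if path.is_dir() else [path]
    records = []
    for file in files:
        with file.open(encoding="utf-8") as handle:
            # Blank lines are skipped; every other line must be a JSON document.
            records.extend(json.loads(line) for line in handle if line.strip())
    return records


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.jsonl").write_text('{"text": "EGFR"}\n{"text": "HER2"}\n', encoding="utf-8")
    (Path(tmp) / "b.jsonl").write_text('{"text": "TP53"}\n', encoding="utf-8")
    loaded = load_jsonl_records(tmp)
```

The documented function returns a set of resources, so any ordering seen in this sketch should not be relied on when using the real loader.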