kazu.ontology_preprocessing.curation_utils¶
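Functions
- batch
- dump_ontology_string_resources: Dump an iterable of kazu.data.OntologyStringResource objects to the file system.
- load_global_actions: Load an instance of GlobalParserActions from a file path.
- load_ontology_string_resources: Load kazu.data.OntologyStringResource objects from a file path or directory.
Classes
- AutofixStrategy
- LinkingCandidateModificationResult
- OntologyResourceProcessor: An OntologyResourceProcessor is responsible for modifying the set of LinkingCandidate objects produced by an OntologyParser.
- OntologyResourceSetCompleteReport: Describes the state of the kazu.data.OntologyStringResource objects configured for a given OntologyParser.
- OntologyResourceSetConflictReport: OntologyResourceSetConflictReport(clean_resources: set[kazu.data.OntologyStringResource], merged_resources: set[kazu.data.OntologyStringResource], normalisation_conflicts: set[frozenset[kazu.data.OntologyStringResource]], case_conflicts: set[frozenset[kazu.data.OntologyStringResource]])
- OntologyResourceSetMergeReport: OntologyResourceSetMergeReport(obsolete_resources: set[kazu.data.OntologyStringResource], effective_resources: set[kazu.data.OntologyStringResource], superfluous_resources: set[kazu.data.OntologyStringResource], resources_with_discrepancies: set[tuple[kazu.data.OntologyStringResource, kazu.data.OntologyStringResource]])
- OntologyStringConflictAnalyser: Find and potentially fix conflicting behaviour in a set of kazu.data.OntologyStringResource objects.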
Exceptions
- class kazu.ontology_preprocessing.curation_utils.AutofixStrategy[source]¶
Bases:
AutoNameEnum
- NONE = 'NONE'¶
- OPTIMISTIC = 'OPTIMISTIC'¶
- PESSIMISTIC = 'PESSIMISTIC'¶
- class kazu.ontology_preprocessing.curation_utils.LinkingCandidateModificationResult[source]¶
Bases:
AutoNameEnum
- ID_SET_MODIFIED = 'ID_SET_MODIFIED'¶
- LINKING_CANDIDATE_ADDED = 'LINKING_CANDIDATE_ADDED'¶
- LINKING_CANDIDATE_DROPPED = 'LINKING_CANDIDATE_DROPPED'¶
- NO_ACTION = 'NO_ACTION'¶
- class kazu.ontology_preprocessing.curation_utils.OntologyResourceProcessor[source]¶
Bases:
object
An OntologyResourceProcessor is responsible for modifying the set of LinkingCandidate objects produced by a kazu.ontology_preprocessing.base.OntologyParser with any relevant GlobalParserActions and/or OntologyStringResource objects associated with the parser.
This class should be used before LinkingCandidate instances are loaded into the internal database representation.
- __init__(parser_name, entity_class, global_actions, resources, linking_candidates)[source]¶
- Parameters:
parser_name (str) – name of parser to process
entity_class (str) – name of parser entity_class to process (typically as passed to kazu.ontology_preprocessing.base.OntologyParser)
global_actions (GlobalParserActions | None)
resources (list[OntologyStringResource])
linking_candidates (set[LinkingCandidate])
- export_resources_and_final_candidates()[source]¶
Perform any updates required to the linking candidates as specified in the curations/global actions.
The returned OntologyStringResource objects can be used for dictionary-based NER, whereas the returned LinkingCandidate objects can be loaded into the internal database for linking.
- Returns:
- Return type:
- classmethod resource_sort_key(resource)[source]¶
Determines the order resources are processed in.
We use associated_id_sets as a key, so that any overrides will be processed after any original behaviours.
- Parameters:
resource (OntologyStringResource)
- Return type:
- BEHAVIOUR_APPLICATION_ORDER = (OntologyStringBehaviour.ADD_FOR_NER_AND_LINKING, OntologyStringBehaviour.ADD_FOR_LINKING_ONLY, OntologyStringBehaviour.DROP_FOR_LINKING)¶
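The ordering described for resource_sort_key and BEHAVIOUR_APPLICATION_ORDER can be illustrated with a small standalone sketch. It does not use kazu itself: Behaviour, Resource, id_set_override and sort_key are hypothetical stand-ins, and the way the two criteria are combined into one key is an assumption rather than the library's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Behaviour(Enum):
    # Stand-in for OntologyStringBehaviour (only the three members listed above).
    ADD_FOR_NER_AND_LINKING = "ADD_FOR_NER_AND_LINKING"
    ADD_FOR_LINKING_ONLY = "ADD_FOR_LINKING_ONLY"
    DROP_FOR_LINKING = "DROP_FOR_LINKING"


# Mirrors the order given by BEHAVIOUR_APPLICATION_ORDER.
APPLICATION_ORDER = (
    Behaviour.ADD_FOR_NER_AND_LINKING,
    Behaviour.ADD_FOR_LINKING_ONLY,
    Behaviour.DROP_FOR_LINKING,
)


@dataclass(frozen=True)
class Resource:
    # Hypothetical, simplified stand-in for OntologyStringResource.
    text: str
    behaviour: Behaviour
    id_set_override: Optional[frozenset] = None


def sort_key(resource: Resource) -> tuple:
    # Assumption: resources without an ID set override sort first, so that
    # overrides are applied after the original behaviours. Within each group,
    # resources follow the behaviour application order; text breaks ties.
    return (
        resource.id_set_override is not None,
        APPLICATION_ORDER.index(resource.behaviour),
        resource.text,
    )


resources = [
    Resource("b", Behaviour.DROP_FOR_LINKING),
    Resource("a", Behaviour.ADD_FOR_NER_AND_LINKING, frozenset({"ID:1"})),
    Resource("c", Behaviour.ADD_FOR_LINKING_ONLY),
    Resource("d", Behaviour.ADD_FOR_NER_AND_LINKING),
]
ordered = sorted(resources, key=sort_key)
```

Here the resource carrying an override ("a") ends up last, even though its behaviour would otherwise place it first.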
- class kazu.ontology_preprocessing.curation_utils.OntologyResourceSetCompleteReport[source]¶
Bases:
object
Describes the state of the kazu.data.OntologyStringResource objects configured for a given kazu.ontology_preprocessing.base.OntologyParser, such as any conflicts in how string matching is configured, or discrepancies between resources produced by kazu.ontology_preprocessing.autocuration.AutoCurator and their human-curated overrides.
- __init__(intermediate_linking_candidates, final_conflict_report, human_conflict_report=None, merge_report=None)[source]¶
- Parameters:
intermediate_linking_candidates (set[LinkingCandidate])
final_conflict_report (OntologyResourceSetConflictReport)
human_conflict_report (OntologyResourceSetConflictReport | None)
merge_report (OntologyResourceSetMergeReport | None)
- Return type:
None
- final_conflict_report: OntologyResourceSetConflictReport¶
report of resource conflict resolution after merging of autogenerated resources and human curations
- human_conflict_report: OntologyResourceSetConflictReport | None = None¶
report of resource conflict in human curation set, if available
- intermediate_linking_candidates: set[LinkingCandidate]¶
parser linking candidates before they are processed with OntologyStringResource objects
- merge_report: OntologyResourceSetMergeReport | None = None¶
report of result of merging human and autogenerated resources, if available
- class kazu.ontology_preprocessing.curation_utils.OntologyResourceSetConflictReport[source]¶
Bases:
object
OntologyResourceSetConflictReport(clean_resources: set[kazu.data.OntologyStringResource], merged_resources: set[kazu.data.OntologyStringResource], normalisation_conflicts: set[frozenset[kazu.data.OntologyStringResource]], case_conflicts: set[frozenset[kazu.data.OntologyStringResource]])
- __init__(clean_resources, merged_resources, normalisation_conflicts, case_conflicts)[source]¶
- Parameters:
clean_resources (set[OntologyStringResource])
merged_resources (set[OntologyStringResource])
normalisation_conflicts (set[frozenset[OntologyStringResource]])
case_conflicts (set[frozenset[OntologyStringResource]])
- Return type:
None
- case_conflicts: set[frozenset[OntologyStringResource]]¶
Resources that conflict on case
- clean_resources: set[OntologyStringResource]¶
Resources with no conflicts
- merged_resources: set[OntologyStringResource]¶
Resources that can be safely merged without affecting OntologyStringBehaviour. However, they may still conflict on MentionConfidence and/or case sensitivity.
- normalisation_conflicts: set[frozenset[OntologyStringResource]]¶
Resources that conflict on normalisation value
- class kazu.ontology_preprocessing.curation_utils.OntologyResourceSetMergeReport[source]¶
Bases:
object
OntologyResourceSetMergeReport(obsolete_resources: set[kazu.data.OntologyStringResource], effective_resources: set[kazu.data.OntologyStringResource], superfluous_resources: set[kazu.data.OntologyStringResource], resources_with_discrepancies: set[tuple[kazu.data.OntologyStringResource, kazu.data.OntologyStringResource]])
- __init__(obsolete_resources, effective_resources, superfluous_resources, resources_with_discrepancies)[source]¶
- Parameters:
obsolete_resources (set[OntologyStringResource])
effective_resources (set[OntologyStringResource])
superfluous_resources (set[OntologyStringResource])
resources_with_discrepancies (set[tuple[OntologyStringResource, OntologyStringResource]])
- Return type:
None
- effective_resources: set[OntologyStringResource]¶
human and autogenerated resources that are actively in use
- obsolete_resources: set[OntologyStringResource]¶
human resources that no longer match any strings in the underlying parser data
- resources_with_discrepancies: set[tuple[OntologyStringResource, OntologyStringResource]]¶
tuples of (human resource, autogenerated resource) that only partially match on their original forms, suggesting that the underlying parser data has changed in some way since the last upgrade. In this scenario, it’s recommended to replace the human form with the new autogenerated version for consistency
- superfluous_resources: set[OntologyStringResource]¶
human resources that match any strings in the underlying parser data, but result in the same behaviour as the autogenerated resource, and therefore can be eliminated to reduce the management burden of human resources
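The categories in this report can be illustrated with a simplified, standalone sketch. It is not kazu's implementation: Resource, classify_human_resources and the parser_strings argument are hypothetical, behaviour is reduced to a plain string, and resources_with_discrepancies is left out because detecting partial matches needs the original forms of each resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    # Hypothetical, simplified stand-in for OntologyStringResource.
    text: str       # the string the resource applies to
    behaviour: str  # stand-in for OntologyStringBehaviour


def classify_human_resources(human, auto, parser_strings):
    """Split resources into (obsolete, effective, superfluous) sets."""
    auto_by_text = {r.text: r for r in auto}
    obsolete, effective, superfluous = set(), set(), set()
    for h in human:
        if h.text not in parser_strings:
            # No longer matches anything in the parser data.
            obsolete.add(h)
        elif h.text in auto_by_text and auto_by_text[h.text].behaviour == h.behaviour:
            # Same outcome as the autogenerated resource, so the human one adds nothing.
            superfluous.add(h)
        else:
            # The human resource changes the behaviour, so it is actively in use.
            effective.add(h)
    overridden = {h.text for h in effective}
    # Autogenerated resources stay in use unless a human resource overrides them.
    effective |= {a for a in auto if a.text not in overridden}
    return obsolete, effective, superfluous


human = {Resource("aspirin", "ADD"), Resource("old term", "ADD"), Resource("the", "DROP")}
auto = {Resource("aspirin", "ADD"), Resource("the", "ADD")}
obsolete, effective, superfluous = classify_human_resources(
    human, auto, parser_strings={"aspirin", "the"}
)
```

In this example "old term" is obsolete, the human "aspirin" entry is superfluous, and the human "the" entry replaces its autogenerated counterpart.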
- class kazu.ontology_preprocessing.curation_utils.OntologyStringConflictAnalyser[source]¶
Bases:
object
Find and potentially fix conflicting behaviour in a set of kazu.data.OntologyStringResource objects.
- __init__(entity_class, autofix=AutofixStrategy.NONE)[source]¶
- Parameters:
entity_class (str) – entity class that this analyser will handle
autofix (AutofixStrategy) – Should any conflicts be automatically fixed, such that the behaviour is consistent within this set? Note that this does not guarantee that the optimal behaviour for a conflict is preserved.
- autofix_resources(resource_conflicts)[source]¶
Fix conflicts in resources by producing a new set of resources with consistent behaviour.
This ensures that there are no conflicts regarding any of these properties:
the combination of case sensitivity and mention confidence
associated id set
resource behaviour
- Parameters:
resource_conflicts (set[frozenset[OntologyStringResource]])
- Returns:
- Return type:
- static build_synonym_defaultdict(resources)[source]¶
- Parameters:
resources (Iterable[OntologyStringResource])
- Return type:
- static check_for_case_conflicts_across_resources(resources, strict=False)[source]¶
Find conflicts in case sensitivity within a set of resources.
Conflicts can occur when strings differ by case sensitivity, and a case-insensitive synonym will produce a MentionConfidence of equal or higher rank than a case-sensitive one.
- Parameters:
resources (set[OntologyStringResource])
strict (bool) – if True, then the function will return True if there are multiple mention confidences for a given string, regardless of case sensitivity
- Returns:
a set of conflicted subsets, and a set of clean resources.
- Return type:
tuple[set[frozenset[OntologyStringResource]], set[OntologyStringResource]]
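The case conflict rule described above can be sketched in isolation. This is an illustration under simplifying assumptions, not the library's code: Syn is a hypothetical stand-in for a synonym, confidence is an integer rank where a higher value means more confident, and the strict option is not modelled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Syn:
    # Hypothetical stand-in for a synonym with its matching configuration.
    text: str
    case_sensitive: bool
    confidence: int  # stand-in for a MentionConfidence rank; higher is more confident


def find_case_conflicts(syns):
    """Return (conflicting groups, clean synonyms), grouped by lower-cased text."""
    by_lower = defaultdict(set)
    for syn in syns:
        by_lower[syn.text.lower()].add(syn)

    conflicts, clean = set(), set()
    for group in by_lower.values():
        sensitive = [s.confidence for s in group if s.case_sensitive]
        insensitive = [s.confidence for s in group if not s.case_sensitive]
        # Conflict: some case-insensitive synonym is at least as confident
        # as some case-sensitive synonym for the same lower-cased string.
        if sensitive and insensitive and max(insensitive) >= min(sensitive):
            conflicts.add(frozenset(group))
        else:
            clean.update(group)
    return conflicts, clean


a = Syn("ALL", case_sensitive=True, confidence=3)
b = Syn("all", case_sensitive=False, confidence=3)
c = Syn("Hip", case_sensitive=True, confidence=3)
d = Syn("hip", case_sensitive=False, confidence=1)
e = Syn("egfr", case_sensitive=False, confidence=2)
conflicts, clean = find_case_conflicts({a, b, c, d, e})
```

The "ALL"/"all" pair conflicts because the case-insensitive form is just as confident as the case-sensitive one, whereas "Hip"/"hip" is fine because the case-insensitive form ranks lower.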
- check_for_normalised_behaviour_conflicts_and_merge_if_possible(resources)[source]¶
Find behaviour conflicts in the resource set indexed by syn_norm.
If possible, resources will be merged. If not, they will be added to a set of conflicting resources.
- Parameters:
resources (set[OntologyStringResource])
- Returns:
A set of newly created merged resources, a set of resources eliminated through the merge, and a set of resource subsets that conflict and cannot be merged.
- Return type:
tuple[set[OntologyStringResource], set[OntologyStringResource], set[frozenset[OntologyStringResource]]]
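As a rough illustration of indexing by normalised string and separating mergeable groups from conflicting ones, consider the sketch below. The Resource class, the normalise function (lower-casing and collapsing whitespace) and split_by_normalised_behaviour are all hypothetical; the real normalisation and merge logic in kazu may differ, and the sketch only identifies groups rather than building merged resources.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    # Hypothetical, simplified stand-in for OntologyStringResource.
    text: str
    behaviour: str  # stand-in for OntologyStringBehaviour


def normalise(text: str) -> str:
    # Assumed normaliser for illustration only: lower-case, collapse whitespace.
    return " ".join(text.lower().split())


def split_by_normalised_behaviour(resources):
    """Return (mergeable groups keyed by normalised string, conflicting groups)."""
    by_norm = defaultdict(set)
    for resource in resources:
        by_norm[normalise(resource.text)].add(resource)

    mergeable, conflicts = {}, set()
    for norm, group in by_norm.items():
        if len(group) == 1:
            continue  # nothing to merge and nothing to conflict with
        if len({r.behaviour for r in group}) == 1:
            mergeable[norm] = frozenset(group)  # same behaviour: safe to merge
        else:
            conflicts.add(frozenset(group))     # differing behaviour: conflict
    return mergeable, conflicts


r1 = Resource("Breast  Cancer", "ADD_FOR_NER_AND_LINKING")
r2 = Resource("breast cancer", "ADD_FOR_NER_AND_LINKING")
r3 = Resource("IT", "DROP_FOR_LINKING")
r4 = Resource("it", "ADD_FOR_NER_AND_LINKING")
r5 = Resource("egfr", "ADD_FOR_LINKING_ONLY")
mergeable, conflicts = split_by_normalised_behaviour({r1, r2, r3, r4, r5})
```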
- static find_case_conflicts(maybe_good_resources_by_active_syn_lower, strict=False)[source]¶
- Parameters:
maybe_good_resources_by_active_syn_lower (defaultdict[str, set[OntologyStringResource]])
strict (bool)
- Return type:
tuple[set[frozenset[OntologyStringResource]], set[OntologyStringResource]]
- merge_human_and_auto_resources(human_curated_resources, autocurated_resources)[source]¶
Merge a set of human curated resources with a set of automatically curated resources, preferring the human set where possible.
Note that the output is not guaranteed to be conflict free; consider calling verify_resource_set_integrity().
- Parameters:
human_curated_resources (set[OntologyStringResource])
autocurated_resources (set[OntologyStringResource])
- Returns:
- Return type:
- verify_resource_set_integrity(resources)[source]¶
Verify that a set of resources has consistent behaviour.
Conflicts can occur for the following reasons:
If two or more resources normalise to the same string, but have different kazu.data.OntologyStringBehaviour.
If two or more resources normalise to the same value, but have different associated ID sets specified, such that one would override the other.
If two or more resources have conflicting values for case sensitivity and kazu.data.MentionConfidence. E.g. a case-insensitive resource cannot have a higher mention confidence value than a case-sensitive one for the same string.
- Parameters:
resources (set[OntologyStringResource])
- Returns:
- Raises:
CurationError – if one or more resources produce multiple normalised values
- Return type:
- kazu.ontology_preprocessing.curation_utils.batch(iterable, n=1)[source]¶
- Parameters:
iterable (Iterable[OntologyStringResource])
n (int)
- Return type:
- kazu.ontology_preprocessing.curation_utils.dump_ontology_string_resources(resources, path, force=False, split_at=10000)[source]¶
Dump an iterable of kazu.data.OntologyStringResource objects to the file system.
- Parameters:
resources (Iterable[OntologyStringResource]) – resources to dump
path (str | Path) – path to a directory of json lines files that map to kazu.data.OntologyStringResource
force (bool) – override existing directory, if it exists
split_at (int) – number of lines per partition
- Returns:
- Return type:
None
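The combination of batch and split_at suggests writing resources out as partitioned JSON lines files. A minimal standalone sketch of that idea follows, assuming plain dicts in place of OntologyStringResource. The dump_jsonl_partitions name, the part_NNNNN.jsonl file naming, the handling of force and the returned list of paths are all assumptions made for illustration, not the behaviour of the kazu functions.

```python
import json
from itertools import islice
from pathlib import Path


def batch(iterable, n=1):
    """Yield successive lists of at most n items from iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk


def dump_jsonl_partitions(records, path, force=False, split_at=10000):
    """Write JSON-serialisable records into path, split_at lines per file."""
    out_dir = Path(path)
    if out_dir.exists():
        if not force:
            raise FileExistsError(f"{out_dir} already exists; pass force=True to replace it")
        # Assumption: "force" clears previously written partitions first.
        for old in out_dir.glob("*.jsonl"):
            old.unlink()
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for i, chunk in enumerate(batch(records, split_at)):
        part = out_dir / f"part_{i:05d}.jsonl"  # hypothetical naming scheme
        with part.open("w", encoding="utf-8") as f:
            for record in chunk:
                f.write(json.dumps(record) + "\n")
        written.append(part)
    return written
```

Calling dump_jsonl_partitions(records, "out", split_at=2) with five records would produce three files holding two, two and one lines respectively.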
- kazu.ontology_preprocessing.curation_utils.load_global_actions(path)[source]¶
Load an instance of GlobalParserActions from a file path.
- kazu.ontology_preprocessing.curation_utils.load_ontology_string_resources(path)[source]¶
Load kazu.data.OntologyStringResource objects from a file path or directory.
- Parameters:
path (str | Path) – path to a jsonl file or directory of jsonl files that map to kazu.data.OntologyStringResource
- Returns:
- Return type: