kazu.data¶
This module contains the core aspects of the Kazu Data Model.
See the page linked above for a quick introduction to the key concepts.
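Module Attributes
JsonEncodable | Represents a json-encodable object.
AssociatedIdSets | A frozen set of EquivalentIdSet.
CandidatesToMetrics | This type is used whenever we have LinkingCandidates and some metrics for how well they map to a specific Entity.
kazu_json_converter | A cattrs Converter configured for converting Kazu's datamodel into json.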
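Classes
AutoNameEnum | Subclass to create an Enum where values are the names when using enum.auto.
CharSpan | A concept similar to a spaCy Span, except that it is character index based rather than token based.
Document | A container that is the primary input into a kazu.pipeline.Pipeline.
Entity | A kazu.data.Entity is a container for information about a single entity detected within a kazu.data.Section.
EquivalentIdSet | A representation of a set of kb IDs that map to the same synonym and mean the same thing.
GlobalParserActions | Container for all ParserActions.
LinkingCandidate | A LinkingCandidate is a container for a single normalised synonym, and is produced by an OntologyParser implementation.
LinkingMetrics | Metrics for Entity Linking.
Mapping | A mapping is a fully mapped and disambiguated kb concept.
OntologyStringResource | An OntologyStringResource represents the behaviour of a specific LinkingCandidate within an Ontology.
ParserAction | A ParserAction changes the behaviour of a kazu.ontology_preprocessing.base.OntologyParser in a global sense.
Section | A container for text and entities.
Synonym | Synonym(text: str, case_sensitive: bool, mention_confidence: kazu.data.MentionConfidence)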
Exceptions
- class kazu.data.AutoNameEnum[source]¶
Bases:
Enum
Subclass to create an Enum where values are the names when using enum.auto.
Taken from the Python Enum Docs.
This is licensed under Zero-Clause BSD.
Full License
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
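The pattern that AutoNameEnum describes can be sketched as below. This follows the recipe in the Python enum documentation and is a stand-alone re-implementation for illustration, not a copy of Kazu's source; ConfidenceSketch and its members are hypothetical.

```python
from enum import Enum, auto


class AutoNameEnum(Enum):
    """Enum base class whose auto() values are the member names."""

    def _generate_next_value_(name, start, count, last_values):
        # enum.auto() calls this hook; returning the name makes value == name.
        return name


class ConfidenceSketch(AutoNameEnum):
    # Hypothetical members, for illustration only.
    POSSIBLE = auto()
    PROBABLE = auto()


print(ConfidenceSketch.POSSIBLE.value)
```

Because each value equals its name, a member can be looked up from its string form, which is convenient when the enum is written to and read back from json.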
- class kazu.data.CharSpan[source]¶
Bases:
object
A concept similar to a spaCy Span, except that it is character index based rather than token based.
Example: text[char_span.start:char_span.end] will precisely cover the target. This means that text[char_span.end] will give the first character not in the span.
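The slicing convention can be illustrated with a minimal stand-in. CharSpanSketch is a hypothetical dataclass with only start and end fields, used here so the example runs without Kazu installed; it is not the real CharSpan class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharSpanSketch:
    # Hypothetical stand-in for CharSpan: character offsets, end exclusive.
    start: int
    end: int


text = "the patient has metastatic liver cancers"
span = CharSpanSketch(start=16, end=39)

# The slice covers exactly the target string.
covered = text[span.start:span.end]

# The character at span.end is the first one outside the span.
first_outside = text[span.end]

print(covered, "|", first_outside)
```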
- class kazu.data.DisambiguationConfidence[source]¶
Bases:
AutoNameEnum
- AMBIGUOUS = 'AMBIGUOUS'¶
- HIGHLY_LIKELY = 'HIGHLY_LIKELY'¶
- POSSIBLE = 'POSSIBLE'¶
- PROBABLE = 'PROBABLE'¶
- class kazu.data.Document[source]¶
Bases:
object
A container that is the primary input into a kazu.pipeline.Pipeline.
- classmethod create_simple_document(text)[source]¶
Create an instance of Document from a text string.
- to_json(**kwargs)[source]¶
Convert to json string.
- Parameters:
kwargs (Any) – passed through to json.dumps().
- Returns:
- Return type:
- metadata: dict¶
generic metadata
Note that storing objects here that Kazu can’t convert to and from json will cause problems for (de)serialization. See (De)serialization and generic metadata fields for details.
- class kazu.data.Entity[source]¶
Bases:
object
A kazu.data.Entity is a container for information about a single entity detected within a kazu.data.Section.
Within a kazu.data.Entity, the most important fields are Entity.match (the actual string detected), Entity.linking_candidates (candidates for knowledgebase hits) and Entity.mappings, the final product of linked references to the underlying entity.
- __init__(match, entity_class, spans, namespace, mention_confidence=MentionConfidence.HIGHLY_LIKELY, _id=<factory>, mappings=<factory>, metadata=<factory>, linking_candidates=<factory>)[source]¶
- Parameters:
match (str)
entity_class (str)
namespace (str)
mention_confidence (MentionConfidence)
_id (str)
metadata (dict)
linking_candidates (dict[LinkingCandidate, LinkingMetrics])
- Return type:
None
- add_or_update_linking_candidate(candidate, new_metrics)[source]¶
- Parameters:
candidate (LinkingCandidate)
new_metrics (LinkingMetrics)
- Return type:
None
- add_or_update_linking_candidates(candidates)[source]¶
- Parameters:
candidates (dict[LinkingCandidate, LinkingMetrics])
- Return type:
None
- classmethod from_spans(spans, text, join_str='', **kwargs)[source]¶
Create an instance of Entity from a list of character indices. The text of the underlying document is also required, in order to produce a representative match.
- is_completely_overlapped(other)[source]¶
True if all CharSpan instances are completely encompassed by all other CharSpan instances.
- is_partially_overlapped(other)[source]¶
True if only one CharSpan instance is defined in both self and other, and they are partially overlapped.
If multiple CharSpan are defined in both self and other, this becomes pathological: while they may overlap in the technical sense, they may have distinct semantic meaning. For instance, consider the case where we may want to use is_partially_overlapped to select the longest annotation span suggested by some NER system.
case 1:
text: the patient has metastatic liver cancers
entity1: metastatic liver cancer -> [CharSpan(16,39)]
entity2: liver cancers -> [CharSpan(27,40)]
result: is_partially_overlapped -> True (entities are part of same concept)
case 2: non-contiguous entities
text: lung and liver cancer
lung cancer -> [CharSpan(0,4), CharSpan(15,21)]
liver cancer -> [CharSpan(9,21)]
result: is_partially_overlapped -> False (entities are distinct)
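The two cases above can be reproduced with a small sketch. It represents each span as a (start, end) tuple with an exclusive end and treats any shared character as an overlap; both choices are assumptions made for illustration, and Kazu's own boundary handling may differ. The function name is hypothetical.

```python
Span = tuple[int, int]  # (start, end), end exclusive


def is_partially_overlapped_sketch(spans_a: list[Span], spans_b: list[Span]) -> bool:
    # Only meaningful when each entity is defined by exactly one span.
    if len(spans_a) != 1 or len(spans_b) != 1:
        return False
    (a_start, a_end), (b_start, b_end) = spans_a[0], spans_b[0]
    # Two half-open ranges share at least one character.
    return a_start < b_end and b_start < a_end


# case 1: "metastatic liver cancer" vs "liver cancers"
case_1 = is_partially_overlapped_sketch([(16, 39)], [(27, 40)])

# case 2: non-contiguous "lung cancer" vs "liver cancer"
case_2 = is_partially_overlapped_sketch([(0, 4), (15, 21)], [(9, 21)])

print(case_1, case_2)
```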
- linking_candidates: dict[LinkingCandidate, LinkingMetrics]¶
- mention_confidence: MentionConfidence = 100¶
- metadata: dict¶
generic metadata
Note that storing objects here that Kazu can’t convert to and from json will cause problems for (de)serialization. See (De)serialization and generic metadata fields for details.
- class kazu.data.EquivalentIdAggregationStrategy[source]¶
Bases:
AutoNameEnum
- CUSTOM = 'CUSTOM'¶
- MERGED_AS_NON_SYMBOLIC = 'MERGED_AS_NON_SYMBOLIC'¶
- MODIFIED_BY_CURATION = 'MODIFIED_BY_CURATION'¶
- NO_STRATEGY = 'NO_STRATEGY'¶
- RESOLVED_BY_SIMILARITY = 'RESOLVED_BY_SIMILARITY'¶
- RESOLVED_BY_XREF = 'RESOLVED_BY_XREF'¶
- SYNONYM_IS_AMBIGUOUS = 'SYNONYM_IS_AMBIGUOUS'¶
- UNAMBIGUOUS = 'UNAMBIGUOUS'¶
- class kazu.data.EquivalentIdSet[source]¶
Bases:
object
A representation of a set of kb IDs that map to the same synonym and mean the same thing.
- class kazu.data.GlobalParserActions[source]¶
Bases:
object
Container for all ParserActions.
- __init__(actions)[source]¶
- Parameters:
actions (list[ParserAction])
- Return type:
None
- parser_behaviour(parser_name)[source]¶
Generator that yields behaviours for a specific parser, based on the order they are specified in.
- Parameters:
parser_name (str)
- Returns:
- Return type:
- actions: list[ParserAction]¶
- class kazu.data.LinkingCandidate[source]¶
Bases:
object
A LinkingCandidate is a container for a single normalised synonym, and is produced by an OntologyParser implementation.
It may be composed of multiple synonyms that normalise to the same unique string (e.g. “breast cancer” and “Breast Cancer”). The number of associated_id_sets that this synonym maps to is determined by the score_and_group_ids() method of the associated OntologyParser.
- __init__(raw_synonyms, synonym_norm, parser_name, is_symbolic, associated_id_sets, aggregated_by, mapping_types=<factory>)[source]¶
- Parameters:
synonym_norm (str)
parser_name (str)
is_symbolic (bool)
associated_id_sets (frozenset[EquivalentIdSet])
aggregated_by (EquivalentIdAggregationStrategy)
- Return type:
None
- aggregated_by: EquivalentIdAggregationStrategy¶
aggregation strategy, determined by the ontology parser
- associated_id_sets: frozenset[EquivalentIdSet]¶
- class kazu.data.LinkingMetrics[source]¶
Bases:
object
Metrics for Entity Linking.
LinkingMetrics holds data on various quality metrics, for how well a LinkingCandidate maps to a host Entity.
- class kazu.data.Mapping[source]¶
Bases:
object
A mapping is a fully mapped and disambiguated kb concept.
- __init__(default_label, source, parser_name, idx, string_match_strategy, string_match_confidence, disambiguation_confidence=None, disambiguation_strategy=None, xref_source_parser_name=None, metadata=<factory>)[source]¶
- Parameters:
default_label (str)
source (str)
parser_name (str)
idx (str)
string_match_strategy (str)
string_match_confidence (StringMatchConfidence)
disambiguation_confidence (DisambiguationConfidence | None)
disambiguation_strategy (str | None)
xref_source_parser_name (str | None)
metadata (dict)
- Return type:
None
- disambiguation_confidence: DisambiguationConfidence | None = None¶
- metadata: dict¶
generic metadata
Note that storing objects here that Kazu can’t convert to and from json will cause problems for (de)serialization. See (De)serialization and generic metadata fields for details.
- string_match_confidence: StringMatchConfidence¶
- class kazu.data.MentionConfidence[source]¶
Bases:
IntEnum
- HIGHLY_LIKELY = 100¶
- IGNORE = 0¶
- POSSIBLE = 10¶
- PROBABLE = 50¶
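Since MentionConfidence is an IntEnum, its members compare as integers. The sketch below copies the values listed above into a stand-in enum so that it runs without Kazu; the keep_mention helper and its default threshold are hypothetical and only show how such an ordering could be used.

```python
from enum import IntEnum


class MentionConfidenceSketch(IntEnum):
    # Values copied from the listing above; the class itself is a stand-in.
    HIGHLY_LIKELY = 100
    PROBABLE = 50
    POSSIBLE = 10
    IGNORE = 0


def keep_mention(
    confidence: MentionConfidenceSketch,
    threshold: MentionConfidenceSketch = MentionConfidenceSketch.PROBABLE,
) -> bool:
    # Hypothetical filter: keep mentions at or above the threshold.
    return confidence >= threshold


ranked = sorted(MentionConfidenceSketch, reverse=True)
print([member.name for member in ranked])
```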
- class kazu.data.OntologyStringBehaviour[source]¶
Bases:
AutoNameEnum
- ADD_FOR_LINKING_ONLY = 'ADD_FOR_LINKING_ONLY'¶
use the resource only as a linking target. Note, this is not required if the resource is already in the underlying ontology, as all ontology resources are included as linking targets by default (Also see DROP_FOR_LINKING)
- ADD_FOR_NER_AND_LINKING = 'ADD_FOR_NER_AND_LINKING'¶
use the resource for both dictionary based NER and as a linking target.
- DROP_FOR_LINKING = 'DROP_FOR_LINKING'¶
do not use this resource as a linking target. Normally, you would use this for a resource you want to remove from the underlying ontology (e.g. a ‘bad’ synonym). If the resource does not exist, this has no effect.
- class kazu.data.OntologyStringResource[source]¶
Bases:
object
An OntologyStringResource represents the behaviour of a specific LinkingCandidate within an Ontology.
For each LinkingCandidate, a default OntologyStringResource is produced with its behaviour determined by an instance of kazu.ontology_preprocessing.autocuration.AutoCurator and the kazu.ontology_preprocessing.curation_utils.OntologyStringConflictAnalyser.
Note
This is typically handled by the internals of kazu.ontology_preprocessing.base.OntologyParser. However, OntologyStringResources can also be used to override the default behaviour of a parser. See The OntologyParser for a more detailed guide.
The configuration of an OntologyStringResource will affect both NER and Linking aspects of Kazu:
Example 1:
The string ‘ALL’ is highly ambiguous. It might mean several diseases, or simply ‘all’. Therefore, we want to add a curation as follows, so that it will only be used as a linking target and not for dictionary based NER:
    OntologyStringResource(
        original_synonyms=frozenset(
            [
                Synonym(
                    text="ALL",
                    mention_confidence=MentionConfidence.POSSIBLE,
                    case_sensitive=True,
                )
            ]
        ),
        behaviour=OntologyStringBehaviour.ADD_FOR_LINKING_ONLY,
    )
Example 2:
The string ‘LH’ is incorrectly identified as a synonym of the PLOD1 (ENSG00000083444) gene, whereas more often than not, it’s actually an abbreviation of Luteinising Hormone. We therefore want to override the associated_id_sets to LHB (ENSG00000104826, or Luteinising Hormone Subunit Beta).
The OntologyStringResource we therefore want is:
    OntologyStringResource(
        original_synonyms=frozenset(
            [
                Synonym(
                    text="LH",
                    mention_confidence=MentionConfidence.POSSIBLE,
                    case_sensitive=True,
                )
            ]
        ),
        associated_id_sets=frozenset((EquivalentIdSet(("ENSG00000104826", "ENSEMBL")),)),
        behaviour=OntologyStringBehaviour.ADD_FOR_LINKING_ONLY,
    )
Example 3:
A LinkingCandidate has an alternative synonym not referenced in the underlying ontology, and we want to add it.
    OntologyStringResource(
        original_synonyms=frozenset(
            [
                Synonym(
                    text="breast carcinoma",
                    mention_confidence=MentionConfidence.POSSIBLE,
                    case_sensitive=True,
                )
            ]
        ),
        associated_id_sets=frozenset((EquivalentIdSet(("ENSG00000104826", "ENSEMBL")),)),
        behaviour=OntologyStringBehaviour.ADD_FOR_NER_AND_LINKING,
    )
- __init__(original_synonyms, behaviour, alternative_synonyms=<factory>, associated_id_sets=None, _id=<factory>, autocuration_results=None, comment=None)[source]¶
- property additional_to_source: bool¶
True if this resource was created in addition to the source resources defined in the original Ontology.
- alternative_synonyms: frozenset[Synonym]¶
Alternative synonyms generated from the originals by kazu.ontology_preprocessing.synonym_generation.CombinatorialSynonymGenerator.
- associated_id_sets: frozenset[EquivalentIdSet] | None = None¶
If specified, will override the parser defaults for the associated LinkingCandidate, as long as conflicts do not occur.
- autocuration_results: dict[str, str] | None = None¶
results of any decisions by the kazu.ontology_preprocessing.autocuration.AutoCurator
- behaviour: OntologyStringBehaviour¶
The intended behaviour for this resource.
- class kazu.data.ParserAction[source]¶
Bases:
object
A ParserAction changes the behaviour of a kazu.ontology_preprocessing.base.OntologyParser in a global sense.
A ParserAction overrides any default behaviour of the parser, and also any conflicts that may occur with OntologyStringResources.
These actions are useful for eliminating unwanted behaviour. For example, the root of the Human Phenotype Ontology is http://purl.obolibrary.org/obo/HP_0000001, which has a default label of ‘All’. Since this is such a common word, and not very useful in terms of linking, we might want a global action so that this ID is not used anywhere in a Kazu pipeline.
The parser_to_target_id_mappings field should specify the parser name and any affected IDs if required. See ParserBehaviour for the type of actions that are possible.
- __init__(behaviour, parser_to_target_id_mappings=<factory>)[source]¶
- Parameters:
behaviour (ParserBehaviour)
- Return type:
None
- behaviour: ParserBehaviour¶
- class kazu.data.ParserBehaviour[source]¶
Bases:
AutoNameEnum
- DROP_IDS_FROM_PARSER = 'DROP_IDS_FROM_PARSER'¶
completely remove the ids from the parser - i.e. they should never be used anywhere
- class kazu.data.Section[source]¶
Bases:
object
A container for text and entities.
One or more make up a kazu.data.Document.
- metadata: dict¶
generic metadata
Note that storing objects here that Kazu can’t convert to and from json will cause problems for (de)serialization. See (De)serialization and generic metadata fields for details.
- class kazu.data.StringMatchConfidence[source]¶
Bases:
AutoNameEnum
- HIGHLY_LIKELY = 'HIGHLY_LIKELY'¶
- POSSIBLE = 'POSSIBLE'¶
- PROBABLE = 'PROBABLE'¶
- class kazu.data.Synonym[source]¶
Bases:
object
Synonym(text: str, case_sensitive: bool, mention_confidence: kazu.data.MentionConfidence)
- __init__(text, case_sensitive, mention_confidence)[source]¶
- Parameters:
text (str)
case_sensitive (bool)
mention_confidence (MentionConfidence)
- Return type:
None
- mention_confidence: MentionConfidence¶
- kazu.data.AssociatedIdSets¶
A frozen set of EquivalentIdSet
alias of frozenset[EquivalentIdSet]
- kazu.data.CandidatesToMetrics¶
This type is used whenever we have LinkingCandidates and some metrics for how well they map to a specific Entity.
In particular, linking_candidates holds relevant candidates and their metrics, and this type is used in parts of kazu which produce, modify or use this field.
alias of dict[LinkingCandidate, LinkingMetrics]
- kazu.data.JsonEncodable¶
Represents a json-encodable object.
Note that because dict is invariant, there can be issues with using types like dict[str, str] (see further here).
alias of dict[str, JsonEncodable] | list[JsonEncodable] | bool | int | float | str | None
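The invariance issue can be shown with a local copy of the alias. The sketch below redefines JsonEncodable with typing.Union so that it runs on its own; to_json_text and the variable names are hypothetical. The mismatch is only visible to a static type checker, so nothing fails at runtime.

```python
import json
from typing import Union

# Local re-statement of the alias above, for illustration only.
JsonEncodable = Union[
    dict[str, "JsonEncodable"], list["JsonEncodable"], bool, int, float, str, None
]


def to_json_text(obj: JsonEncodable) -> str:
    # Hypothetical helper that accepts any json-encodable value.
    return json.dumps(obj, sort_keys=True)


# Annotated as dict[str, str]: a static type checker may reject passing this
# where dict[str, JsonEncodable] is expected, because dict is invariant in
# its value type.
labels: dict[str, str] = {"drug": "aspirin"}

# Annotating with the wider value type up front avoids the mismatch.
metadata: dict[str, JsonEncodable] = {"drug": "aspirin", "count": 2}

result = to_json_text(metadata)
print(result)
```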
- kazu.data.kazu_json_converter = <cattrs.preconf.json.JsonConverter object>¶
A cattrs Converter configured for converting Kazu’s datamodel into json.
If you are not familiar with cattrs, don’t worry: you can just use methods on the kazu classes like Document.from_dict() and Document.to_json(), and you will likely never need to use or understand kazu_json_converter.
If you are familiar with cattrs, you may prefer to use the structure, unstructure, dumps and loads methods of kazu_json_converter directly.