kazu.data

This module contains the core aspects of the Kazu Data Model.

See the page linked above for a quick introduction to the key concepts.

Module Attributes

JsonEncodable

Represents a json-encodable object.

AssociatedIdSets

A frozen set of EquivalentIdSet

CandidatesToMetrics

This type is used whenever we have LinkingCandidates and some metrics for how well they map to a specific Entity.

kazu_json_converter

A cattrs Converter configured for converting Kazu's datamodel into json.

Classes

AutoNameEnum

Subclass to create an Enum where values are the names when using enum.auto.

CharSpan

A concept similar to a spaCy Span, except that it is based on character indices rather than tokens.

DisambiguationConfidence

Document

A container that is the primary input into a kazu.pipeline.Pipeline.

Entity

A kazu.data.Entity is a container for information about a single entity detected within a kazu.data.Section.

EquivalentIdAggregationStrategy

EquivalentIdSet

A representation of a set of KB IDs that map to the same synonym and mean the same thing.

GlobalParserActions

Container for all ParserActions.

LinkingCandidate

A LinkingCandidate is a container for a single normalised synonym, and is produced by an OntologyParser implementation.

LinkingMetrics

Metrics for Entity Linking.

Mapping

A mapping is a fully mapped and disambiguated kb concept.

MentionConfidence

OntologyStringBehaviour

OntologyStringResource

An OntologyStringResource represents the behaviour of a specific LinkingCandidate within an Ontology.

ParserAction

A ParserAction changes the behaviour of a kazu.ontology_preprocessing.base.OntologyParser in a global sense.

ParserBehaviour

Section

A container for text and entities.

StringMatchConfidence

Synonym

Synonym(text: str, case_sensitive: bool, mention_confidence: kazu.data.MentionConfidence)

Exceptions

exception kazu.data.KazuConfigurationError[source]

Bases: Exception

class kazu.data.AutoNameEnum[source]

Bases: Enum

Subclass to create an Enum where values are the names when using enum.auto.

Taken from the Python Enum Docs.

This is licensed under Zero-Clause BSD.

Full License

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.

THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
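As an illustration of the AutoNameEnum behaviour described above, the following is a minimal sketch of the recipe from the Python enum documentation. The class names AutoNameSketch and ConfidenceSketch are stand-ins invented for this example and are not part of kazu.

```python
from enum import Enum, auto


class AutoNameSketch(Enum):
    """Stand-in for AutoNameEnum: auto() yields the member's own name."""

    @staticmethod
    def _generate_next_value_(name, start, count, last_values):
        # enum.auto() calls this hook; returning `name` makes value == name.
        return name


class ConfidenceSketch(AutoNameSketch):
    # Hypothetical members, named after two DisambiguationConfidence values.
    POSSIBLE = auto()
    PROBABLE = auto()


print(ConfidenceSketch.POSSIBLE.value)
```

Because each value equals its member name, a member can be looked up from its string form, for example ConfidenceSketch("PROBABLE").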

class kazu.data.CharSpan[source]

Bases: object

A concept similar to a spaCy Span, except that it is based on character indices rather than tokens.

Example: text[char_span.start:char_span.end] will precisely cover the target. This means that text[char_span.end] will give the first character not in the span.
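The slicing convention above can be checked with a small sketch. SpanSketch is a hypothetical stand-in for CharSpan that carries only the start and end fields, so the example runs without kazu installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Hypothetical stand-in for CharSpan: character offsets, end exclusive."""

    start: int
    end: int


text = "the patient has metastatic liver cancers"
span = SpanSketch(start=16, end=39)

# Slicing with start/end covers exactly the target characters.
covered = text[span.start:span.end]
# Indexing at `end` gives the first character outside the span.
first_outside = text[span.end]

print(covered)
print(first_outside)
```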

__init__(start, end)[source]
Parameters:
  • start (int)

  • end (int)

Return type:

None

is_completely_overlapped(other)[source]

True if other completely overlaps this span.

Parameters:

other (CharSpan)

Returns:

Return type:

bool

is_partially_overlapped(other)[source]

True if other partially overlaps this span.

Parameters:

other (CharSpan)

Returns:

Return type:

bool
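The two overlap checks can be sketched as follows. This is one plausible reading of the descriptions above, not the library's implementation: OverlapSpanSketch is a hypothetical stand-in, and the real CharSpan may treat boundary cases (such as one span strictly containing the other) differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlapSpanSketch:
    """Hypothetical stand-in for CharSpan with half-open [start, end) offsets."""

    start: int
    end: int

    def is_completely_overlapped(self, other: "OverlapSpanSketch") -> bool:
        # Assumed reading: `other` covers every character of this span.
        return other.start <= self.start and self.end <= other.end

    def is_partially_overlapped(self, other: "OverlapSpanSketch") -> bool:
        # Assumed reading: the two ranges share at least one character.
        return self.start < other.end and other.start < self.end


inner = OverlapSpanSketch(27, 33)
outer = OverlapSpanSketch(16, 39)
print(inner.is_completely_overlapped(outer))
print(outer.is_completely_overlapped(inner))
```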

end: int
start: int
class kazu.data.DisambiguationConfidence[source]

Bases: AutoNameEnum

__new__(value)[source]
AMBIGUOUS = 'AMBIGUOUS'
HIGHLY_LIKELY = 'HIGHLY_LIKELY'
POSSIBLE = 'POSSIBLE'
PROBABLE = 'PROBABLE'
class kazu.data.Document[source]

Bases: object

A container that is the primary input into a kazu.pipeline.Pipeline.

__init__(idx=<factory>, sections=<factory>, metadata=<factory>)[source]
Parameters:
Return type:

None

classmethod create_simple_document(text)[source]

Create an instance of Document from a text string.

Parameters:

text (str)

Returns:

Return type:

Document

static from_dict(document_dict)[source]
Parameters:

document_dict (dict)

Return type:

Document

static from_json(json_str)[source]
Parameters:

json_str (str)

Return type:

Document

classmethod from_named_section_texts(named_sections)[source]
Parameters:

named_sections (dict[str, str])

Return type:

Document

get_entities()[source]

Get all entities in this document.

Return type:

list[Entity]

classmethod simple_document_from_sents(sents)[source]
Parameters:

sents (list[str])

Return type:

Document

to_dict()[source]

Convert the Document to a dict.

Return type:

dict

to_json(**kwargs)[source]

Convert to json string.

Parameters:

kwargs (Any) – passed through to json.dumps().

Returns:

Return type:

str

idx: str

a document identifier

metadata: dict
generic metadata

Note that storing objects here that Kazu can’t convert to and from json will cause problems for (de)serialization. See (De)serialization and generic metadata fields for details.
sections: list[Section]

sections comprising this document

class kazu.data.Entity[source]

Bases: object

A kazu.data.Entity is a container for information about a single entity detected within a kazu.data.Section.

Within a kazu.data.Entity, the most important fields are Entity.match (the actual string detected), Entity.linking_candidates (candidates for knowledgebase hits) and Entity.mappings (the final product of linked references to the underlying entity).

__init__(match, entity_class, spans, namespace, mention_confidence=MentionConfidence.HIGHLY_LIKELY, _id=<factory>, mappings=<factory>, metadata=<factory>, linking_candidates=<factory>)[source]
Parameters:
Return type:

None

add_mapping(mapping)[source]

Deprecated.

Parameters:

mapping (Mapping)

Returns:

Return type:

None

add_or_update_linking_candidate(candidate, new_metrics)[source]
Parameters:
Return type:

None

add_or_update_linking_candidates(candidates)[source]
Parameters:

candidates (dict[LinkingCandidate, LinkingMetrics])

Return type:

None

as_brat()[source]
Returns:

this entity in the third party biomedical nlp Brat format (see the docs, paper, and codebase)

Return type:

str

calc_starts_and_ends()[source]
Return type:

tuple[int, int]

static from_dict(entity_dict)[source]
Parameters:

entity_dict (dict)

Return type:

Entity

classmethod from_spans(spans, text, join_str='', **kwargs)[source]

Create an instance of Entity from a list of character indices. The text of the underlying document is also required, in order to produce a representative match.

Parameters:
Returns:

Return type:

Entity
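To make the role of text and join_str concrete, here is a rough sketch of how a match string could be assembled from several character spans. The helper match_from_spans is hypothetical and only approximates what from_spans is described as doing. The overall start and end it returns are likewise an assumption about how a non-contiguous entity's bounds are derived.

```python
def match_from_spans(spans, text, join_str=""):
    """Hypothetical helper: build a match string plus overall (start, end)."""
    # Each span is a (start, end) pair of character offsets, end exclusive.
    pieces = [text[start:end] for start, end in spans]
    overall_start = min(start for start, _ in spans)
    overall_end = max(end for _, end in spans)
    return join_str.join(pieces), (overall_start, overall_end)


text = "lung and liver cancer"
match, bounds = match_from_spans([(0, 4), (15, 21)], text, join_str=" ")
print(match)
print(bounds)
```

With an empty join_str the pieces would be concatenated directly, so a separator is usually wanted for non-contiguous entities.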

is_completely_overlapped(other)[source]

True if all CharSpan instances are completely encompassed by all other CharSpan instances.

Parameters:

other (Entity)

Returns:

Return type:

bool

is_partially_overlapped(other)[source]

True if only one CharSpan instance is defined in both self and other, and they are partially overlapped.

If multiple CharSpan are defined in both self and other, this becomes pathological, as while they may overlap in the technical sense, they may have distinct semantic meaning. For instance, consider the case where we may want to use is_partially_overlapped to select the longest annotation span suggested by some NER system.

case 1:

text: the patient has metastatic liver cancers
entity1: metastatic liver cancer -> [CharSpan(16, 39)]
entity2: liver cancers -> [CharSpan(27, 40)]

result: is_partially_overlapped -> True (entities are part of the same concept)

case 2: non-contiguous entities

text: lung and liver cancer
entity1: lung cancer -> [CharSpan(0, 4), CharSpan(15, 21)]
entity2: liver cancer -> [CharSpan(9, 21)]

result: is_partially_overlapped -> False (entities are distinct)

Parameters:

other (Entity)

Returns:

Return type:

bool
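The rule described above (only single-span entities can be partially overlapped) can be sketched with plain tuples. The function names are hypothetical, spans are modelled as frozensets of (start, end) pairs rather than CharSpan objects, and the span-level check is an assumed definition of partial overlap.

```python
def spans_partially_overlap(a, b):
    # Assumed definition: two half-open ranges share at least one character.
    return a[0] < b[1] and b[0] < a[1]


def entities_partially_overlap(spans_a, spans_b):
    """Hypothetical sketch of the entity-level rule described above."""
    # Entities with multiple spans are treated as never partially overlapped.
    if len(spans_a) != 1 or len(spans_b) != 1:
        return False
    (a,) = spans_a
    (b,) = spans_b
    return spans_partially_overlap(a, b)


# case 1: "metastatic liver cancer" vs "liver cancers"
case_1 = entities_partially_overlap(frozenset({(16, 39)}), frozenset({(27, 40)}))
# case 2: "lung ... cancer" (two spans) vs "liver cancer"
case_2 = entities_partially_overlap(
    frozenset({(0, 4), (15, 21)}), frozenset({(9, 21)})
)
print(case_1, case_2)
```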

classmethod load_contiguous_entity(start, end, **kwargs)[source]
Parameters:
Return type:

Entity

end: int
entity_class: str
linking_candidates: dict[LinkingCandidate, LinkingMetrics]
mappings: set[Mapping]
match: str

exact text representation

match_norm: str
mention_confidence: MentionConfidence = 100
metadata: dict
generic metadata

Note that storing objects here that Kazu can’t convert to and from json will cause problems for (de)serialization. See (De)serialization and generic metadata fields for details.
namespace: str

namespace of the Step that produced this instance

spans: frozenset[CharSpan]
start: int
class kazu.data.EquivalentIdAggregationStrategy[source]

Bases: AutoNameEnum

__new__(value)[source]
CUSTOM = 'CUSTOM'
MERGED_AS_NON_SYMBOLIC = 'MERGED_AS_NON_SYMBOLIC'
MODIFIED_BY_CURATION = 'MODIFIED_BY_CURATION'
NO_STRATEGY = 'NO_STRATEGY'
RESOLVED_BY_SIMILARITY = 'RESOLVED_BY_SIMILARITY'
RESOLVED_BY_XREF = 'RESOLVED_BY_XREF'
SYNONYM_IS_AMBIGUOUS = 'SYNONYM_IS_AMBIGUOUS'
UNAMBIGUOUS = 'UNAMBIGUOUS'
class kazu.data.EquivalentIdSet[source]

Bases: object

A representation of a set of KB IDs that map to the same synonym and mean the same thing.

__init__(ids_and_source=<factory>)[source]
Parameters:

ids_and_source (frozenset[tuple[str, str]])

Return type:

None

property ids: set[str]
ids_and_source: frozenset[tuple[str, str]]
property sources: set[str]
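The relationship between ids_and_source and the two properties can be illustrated with a stand-in class. IdSetSketch is hypothetical, and the (id, source) ordering of each tuple is an assumption based on the field name. The identifiers are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IdSetSketch:
    """Hypothetical stand-in for EquivalentIdSet."""

    # Assumed layout: each tuple is (id, source).
    ids_and_source: frozenset = field(default_factory=frozenset)

    @property
    def ids(self) -> set:
        return {idx for idx, _source in self.ids_and_source}

    @property
    def sources(self) -> set:
        return {source for _idx, source in self.ids_and_source}


# Made-up identifiers and source names, for illustration only.
id_set = IdSetSketch(
    frozenset({("ID:0001", "SOURCE_A"), ("ID:0002", "SOURCE_A"), ("ID:0003", "SOURCE_B")})
)
print(sorted(id_set.ids))
print(sorted(id_set.sources))
```

Keeping the underlying field a frozenset of tuples makes the object hashable, which is what allows it to be a member of another frozenset such as AssociatedIdSets.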
class kazu.data.GlobalParserActions[source]

Bases: object

Container for all ParserActions.

__init__(actions)[source]
Parameters:

actions (list[ParserAction])

Return type:

None

classmethod from_dict(json_dict)[source]
Parameters:

json_dict (dict)

Return type:

GlobalParserActions

parser_behaviour(parser_name)[source]

Generator that yields behaviours for a specific parser, based on the order they are specified in.

Parameters:

parser_name (str)

Returns:

Return type:

Iterable[ParserAction]

actions: list[ParserAction]
class kazu.data.LinkingCandidate[source]

Bases: object

A LinkingCandidate is a container for a single normalised synonym, and is produced by an OntologyParser implementation.

It may be composed of multiple synonyms that normalise to the same unique string (e.g. “breast cancer” and “Breast Cancer”). The number of associated_id_sets that this synonym maps to is determined by the score_and_group_ids() method of the associated OntologyParser.

__init__(raw_synonyms, synonym_norm, parser_name, is_symbolic, associated_id_sets, aggregated_by, mapping_types=<factory>)[source]
Parameters:
Return type:

None

static from_dict(candidate_dict)[source]
Parameters:

candidate_dict (dict)

Return type:

LinkingCandidate

aggregated_by: EquivalentIdAggregationStrategy

aggregation strategy, determined by the ontology parser

associated_id_sets: frozenset[EquivalentIdSet]
property is_ambiguous: bool
is_symbolic: bool

is the candidate symbolic? Determined by the OntologyParser

mapping_types: frozenset[str]

mapping type metadata

parser_name: str

ontology parser name

raw_synonyms: frozenset[str]

unnormalised synonym strings

synonym_norm: str

normalised form

class kazu.data.LinkingMetrics[source]

Bases: object

Metrics for Entity Linking.

LinkingMetrics holds data on various quality metrics for how well a LinkingCandidate maps to a host Entity.

__init__(search_score=None, embed_score=None, bool_score=None, exact_match=None)[source]
Parameters:
  • search_score (float | None)

  • embed_score (float | None)

  • bool_score (bool | None)

  • exact_match (bool | None)

Return type:

None

static from_dict(metric_dict)[source]
Parameters:

metric_dict (dict)

Return type:

LinkingMetrics

bool_score: bool | None = None
embed_score: float | None = None
exact_match: bool | None = None
search_score: float | None = None
class kazu.data.Mapping[source]

Bases: object

A mapping is a fully mapped and disambiguated kb concept.

__init__(default_label, source, parser_name, idx, string_match_strategy, string_match_confidence, disambiguation_confidence=None, disambiguation_strategy=None, xref_source_parser_name=None, metadata=<factory>)[source]
Parameters:
Return type:

None

static from_dict(mapping_dict)[source]
Parameters:

mapping_dict (dict)

Return type:

Mapping

default_label: str

default label from knowledgebase

disambiguation_confidence: DisambiguationConfidence | None = None
disambiguation_strategy: str | None = None
idx: str

the identifier within the KB

metadata: dict
generic metadata

Note that storing objects here that Kazu can’t convert to and from json will cause problems for (de)serialization. See (De)serialization and generic metadata fields for details.
parser_name: str

the origin of this mapping

source: str

the knowledgebase/database/ontology name

string_match_confidence: StringMatchConfidence
string_match_strategy: str
xref_source_parser_name: str | None = None

source parser name if mapping is an XREF

class kazu.data.MentionConfidence[source]

Bases: IntEnum

__new__(value)[source]
HIGHLY_LIKELY = 100
IGNORE = 0
POSSIBLE = 10
PROBABLE = 50
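Because MentionConfidence is an IntEnum, its members can be compared numerically. The sketch below redefines the same names and values in a stand-in class (MentionConfidenceSketch) so that it runs without kazu. The mention strings and their assigned confidences are invented for illustration.

```python
from enum import IntEnum


class MentionConfidenceSketch(IntEnum):
    """Stand-in with the same names and values as MentionConfidence."""

    HIGHLY_LIKELY = 100
    PROBABLE = 50
    POSSIBLE = 10
    IGNORE = 0


# Made-up mentions paired with made-up confidences.
mentions = [
    ("ALL", MentionConfidenceSketch.POSSIBLE),
    ("breast cancer", MentionConfidenceSketch.HIGHLY_LIKELY),
    ("LH", MentionConfidenceSketch.PROBABLE),
]

# Keep only mentions at or above a chosen threshold.
threshold = MentionConfidenceSketch.PROBABLE
kept = [text for text, confidence in mentions if confidence >= threshold]
print(kept)
```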
class kazu.data.OntologyStringBehaviour[source]

Bases: AutoNameEnum

__new__(value)[source]
ADD_FOR_LINKING_ONLY = 'ADD_FOR_LINKING_ONLY'

use the resource only as a linking target. Note that this is not required if the resource is already in the underlying ontology, as all ontology resources are included as linking targets by default (also see DROP_FOR_LINKING).

ADD_FOR_NER_AND_LINKING = 'ADD_FOR_NER_AND_LINKING'

use the resource for both dictionary based NER and as a linking target.

DROP_FOR_LINKING = 'DROP_FOR_LINKING'

do not use this resource as a linking target. Normally, you would use this for a resource you want to remove from the underlying ontology (e.g. a ‘bad’ synonym). If the resource does not exist, this has no effect.

class kazu.data.OntologyStringResource[source]

Bases: object

An OntologyStringResource represents the behaviour of a specific LinkingCandidate within an Ontology.

For each LinkingCandidate, a default OntologyStringResource is produced with its behaviour determined by an instance of kazu.ontology_preprocessing.autocuration.AutoCurator and the kazu.ontology_preprocessing.curation_utils.OntologyStringConflictAnalyser.

Note

This is typically handled by the internals of kazu.ontology_preprocessing.base.OntologyParser. However, OntologyStringResources can also be used to override the default behaviour of a parser. See The OntologyParser for a more detailed guide.

The configuration of an OntologyStringResource will affect both NER and Linking aspects of Kazu:

Example 1:

The string ‘ALL’ is highly ambiguous. It might mean several diseases, or simply ‘all’. Therefore, we want to add a curation as follows, so that it will only be used as a linking target and not for dictionary based NER:

OntologyStringResource(
    original_synonyms=frozenset(
        [
            Synonym(
                text="ALL",
                mention_confidence=MentionConfidence.POSSIBLE,
                case_sensitive=True,
            )
        ]
    ),
    behaviour=OntologyStringBehaviour.ADD_FOR_LINKING_ONLY,
)

Example 2:

The string ‘LH’ is incorrectly identified as a synonym of the PLOD1 (ENSG00000083444) gene, whereas more often than not, it’s actually an abbreviation of Luteinising Hormone. We therefore want to override the associated_id_sets to LHB (ENSG00000104826, or Luteinising Hormone Subunit Beta).

The OntologyStringResource we therefore want is:

OntologyStringResource(
    original_synonyms=frozenset(
        [
            Synonym(
                text="LH",
                mention_confidence=MentionConfidence.POSSIBLE,
                case_sensitive=True,
            )
        ]
    ),
    associated_id_sets=frozenset((EquivalentIdSet(frozenset([("ENSG00000104826", "ENSEMBL")])),)),
    behaviour=OntologyStringBehaviour.ADD_FOR_LINKING_ONLY,
)

Example 3:

A LinkingCandidate has an alternative synonym not referenced in the underlying ontology, and we want to add it.

OntologyStringResource(
    original_synonyms=frozenset(
        [
            Synonym(
                text="breast carcinoma",
                mention_confidence=MentionConfidence.POSSIBLE,
                case_sensitive=True,
            )
        ]
    ),
    associated_id_sets=frozenset((EquivalentIdSet(frozenset([("ENSG00000104826", "ENSEMBL")])),)),
    behaviour=OntologyStringBehaviour.ADD_FOR_NER_AND_LINKING,
)
__init__(original_synonyms, behaviour, alternative_synonyms=<factory>, associated_id_sets=None, _id=<factory>, autocuration_results=None, comment=None)[source]
Parameters:
Return type:

None

active_ner_synonyms()[source]
Return type:

Iterable[Synonym]

all_strings()[source]
Return type:

Iterable[str]

all_synonyms()[source]
Return type:

Iterable[Synonym]

classmethod from_dict(json_dict)[source]
Parameters:

json_dict (dict)

Return type:

OntologyStringResource

static from_json(json_str)[source]
Parameters:

json_str (str)

Return type:

OntologyStringResource

syn_norm_for_linking(entity_class)[source]
Parameters:

entity_class (str)

Return type:

str

to_dict(preserve_structured_object_id=True)[source]
Parameters:

preserve_structured_object_id (bool)

Return type:

dict[str, Any]

to_json()[source]
Return type:

str

property additional_to_source: bool

True if this resource was created in addition to the source resources defined in the original Ontology.

alternative_synonyms: frozenset[Synonym]

Alternative synonyms generated from the originals by kazu.ontology_preprocessing.synonym_generation.CombinatorialSynonymGenerator.

associated_id_sets: frozenset[EquivalentIdSet] | None = None

If specified, will override the parser defaults for the associated LinkingCandidate, as long as conflicts do not occur

autocuration_results: dict[str, str] | None = None

results of any decisions by the kazu.ontology_preprocessing.autocuration.AutoCurator

behaviour: OntologyStringBehaviour

The intended behaviour for this resource.

comment: str | None = None

human readable comments about this curation decision

original_synonyms: frozenset[Synonym]

Original synonyms, exactly as specified in the source ontology. These should all normalise to the same string.

class kazu.data.ParserAction[source]

Bases: object

A ParserAction changes the behaviour of a kazu.ontology_preprocessing.base.OntologyParser in a global sense.

A ParserAction overrides any default behaviour of the parser, and also any conflicts that may occur with OntologyStringResources.

These actions are useful for eliminating unwanted behaviour. For example, the root of the Mondo ontology is http://purl.obolibrary.org/obo/HP_0000001, which has a default label of ‘All’. Since this is such a common word, and not very useful in terms of linking, we might want a global action so that this ID is not used anywhere in a Kazu pipeline.

The parser_to_target_id_mappings field should specify the parser name and any affected IDs, if required. See ParserBehaviour for the type of actions that are possible.
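The effect of a DROP_IDS_FROM_PARSER action can be pictured with a small sketch that uses the documented shape of parser_to_target_id_mappings (parser name to a set of IDs). The drop_ids helper, the parser name "EXAMPLE_PARSER" and the ID "EXAMPLE:0001" are all hypothetical. How kazu applies the action internally is not shown here.

```python
def drop_ids(parser_name, ids, parser_to_target_id_mappings):
    """Hypothetical helper: remove globally dropped IDs for one parser."""
    to_drop = parser_to_target_id_mappings.get(parser_name, set())
    return [idx for idx in ids if idx not in to_drop]


# Same shape as the documented field: dict[str, set[str]].
parser_to_target_id_mappings = {
    "EXAMPLE_PARSER": {"http://purl.obolibrary.org/obo/HP_0000001"},
}

ids = ["http://purl.obolibrary.org/obo/HP_0000001", "EXAMPLE:0001"]
remaining = drop_ids("EXAMPLE_PARSER", ids, parser_to_target_id_mappings)
untouched = drop_ids("OTHER_PARSER", ids, parser_to_target_id_mappings)
print(remaining)
```

Keying the mapping by parser name means an ID is only dropped for the parsers that list it, so other parsers are unaffected.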

__init__(behaviour, parser_to_target_id_mappings=<factory>)[source]
Parameters:
Return type:

None

classmethod from_dict(json_dict)[source]
Parameters:

json_dict (dict)

Return type:

ParserAction

behaviour: ParserBehaviour
parser_to_target_id_mappings: dict[str, set[str]]
class kazu.data.ParserBehaviour[source]

Bases: AutoNameEnum

__new__(value)[source]
DROP_IDS_FROM_PARSER = 'DROP_IDS_FROM_PARSER'

completely remove the IDs from the parser, i.e. they should never be used anywhere

class kazu.data.Section[source]

Bases: object

A container for text and entities.

One or more make up a kazu.data.Document.

__init__(text, name, metadata=<factory>, entities=<factory>)[source]
Parameters:
Return type:

None

static from_dict(section_dict)[source]
Parameters:

section_dict (dict)

Return type:

Section

entities: list[Entity]

entities detected in this section

metadata: dict
generic metadata

Note that storing objects here that Kazu can’t convert to and from json will cause problems for (de)serialization. See (De)serialization and generic metadata fields for details.
name: str

the name of the section (e.g. abstract, body, header, footer etc)

property sentence_spans: Iterable[CharSpan]
text: str

the text to be processed

class kazu.data.StringMatchConfidence[source]

Bases: AutoNameEnum

__new__(value)[source]
HIGHLY_LIKELY = 'HIGHLY_LIKELY'
POSSIBLE = 'POSSIBLE'
PROBABLE = 'PROBABLE'
class kazu.data.Synonym[source]

Bases: object

Synonym(text: str, case_sensitive: bool, mention_confidence: kazu.data.MentionConfidence)

__init__(text, case_sensitive, mention_confidence)[source]
Parameters:
  • text (str)

  • case_sensitive (bool)

  • mention_confidence (MentionConfidence)

Return type:

None

case_sensitive: bool
mention_confidence: MentionConfidence
text: str
kazu.data.AssociatedIdSets

A frozen set of EquivalentIdSet

alias of frozenset[EquivalentIdSet]

kazu.data.CandidatesToMetrics

This type is used whenever we have LinkingCandidates and some metrics for how well they map to a specific Entity.

In particular, linking_candidates holds relevant candidates and their metrics, and this type is used in parts of kazu which produce, modify or use this field.

alias of dict[LinkingCandidate, LinkingMetrics]
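A dictionary of this shape can be consumed like any other mapping. The sketch below uses hypothetical stand-ins (CandidateSketch, MetricsSketch) that carry only a few of the documented fields, and picks the candidate with the highest search_score. The scores and the selection rule are invented for illustration and are not a description of how kazu ranks candidates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen so instances are hashable dict keys
class CandidateSketch:
    synonym_norm: str


@dataclass
class MetricsSketch:
    search_score: Optional[float] = None
    exact_match: Optional[bool] = None


# Same shape as CandidatesToMetrics: candidate -> metrics. Scores are made up.
candidates_to_metrics = {
    CandidateSketch("BREAST CANCER"): MetricsSketch(search_score=98.0, exact_match=True),
    CandidateSketch("BREAST CARCINOMA"): MetricsSketch(search_score=81.5),
    CandidateSketch("CANCER"): MetricsSketch(),  # no score recorded
}


def best_by_search_score(c2m):
    """Return the candidate with the highest search_score, or None."""
    scored = {c: m for c, m in c2m.items() if m.search_score is not None}
    if not scored:
        return None
    return max(scored, key=lambda c: scored[c].search_score)


best = best_by_search_score(candidates_to_metrics)
print(best.synonym_norm)
```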

kazu.data.JsonEncodable

Represents a json-encodable object.

Note that because dict is invariant, there can be issues with using types like dict[str, str] (see further here).

alias of dict[str, JsonEncodable] | list[JsonEncodable] | bool | int | float | str | None
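The alias above can be mirrored in plain Python to check whether a value, such as the contents of a metadata field, fits the json-encodable shape. JsonEncodableSketch and is_json_encodable are hypothetical names for this sketch. kazu's own (de)serialization may accept or reject values differently.

```python
import json
from typing import Union

# Recursive alias mirroring the documented JsonEncodable shape.
JsonEncodableSketch = Union[
    dict[str, "JsonEncodableSketch"],
    list["JsonEncodableSketch"],
    bool,
    int,
    float,
    str,
    None,
]


def is_json_encodable(obj) -> bool:
    """Runtime check that `obj` matches the alias above."""
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return True
    if isinstance(obj, list):
        return all(is_json_encodable(item) for item in obj)
    if isinstance(obj, dict):
        return all(
            isinstance(key, str) and is_json_encodable(value)
            for key, value in obj.items()
        )
    return False


ok_metadata = {"source": "example", "scores": [1, 2.5, None]}
bad_metadata = {"tags": {"a", "b"}}  # a set is not json-encodable

print(is_json_encodable(ok_metadata), is_json_encodable(bad_metadata))
# Values that pass the check survive a json round trip unchanged.
round_tripped = json.loads(json.dumps(ok_metadata))
```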

kazu.data.kazu_json_converter = <cattrs.preconf.json.JsonConverter object>

A cattrs Converter configured for converting Kazu’s datamodel into json.

If you are not familiar with cattrs, don’t worry: you can just use methods on the kazu classes like Document.from_dict() and Document.to_dict(), and you will likely never need to use or understand kazu_json_converter.

If you are familiar with cattrs, you may prefer to use the structure, unstructure, dumps and loads methods of kazu_json_converter directly.