kazu.ontology_preprocessing.base

Classes

OntologyParser

Parse an ontology (or similar) into a set of outputs suitable for NLP entity linking.

class kazu.ontology_preprocessing.base.OntologyParser[source]

Bases: ABC

Parse an ontology (or similar) into a set of outputs suitable for NLP entity linking.

Implementations should set a class attribute ‘name’ to something suitably representative. The key method is parse_to_dataframe(), which should convert an input source to a dataframe suitable for further processing.

The other important method is find_kb(). This should parse an ID string (if required) and return the underlying source. This is important for composite resources that contain identifiers from different seed sources.

See The OntologyParser for a more detailed guide.

Generally speaking, when parsing a data source, synonyms that are symbolic (as determined by the StringNormalizer) and that refer to more than one ID are more likely to be ambiguous. We therefore assume they refer to distinct concepts (e.g. COX 1 could be ‘ENSG00000095303’ OR ‘ENSG00000198804’), and they will yield multiple instances of EquivalentIdSet. Non-symbolic synonyms (i.e. noun phrases) are far less likely to refer to distinct entities, so we may want to merge the IDs associated with an ambiguous non-symbolic synonym into a single EquivalentIdSet. The result of StringNormalizer.classify_symbolic() forms the is_symbolic parameter to score_and_group_ids().
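
For illustration, a minimal sketch of the classification step. The entity_class keyword argument is an assumption about classify_symbolic()’s signature, and the results shown are the expected behaviour, not a guarantee:

    from kazu.utils.string_normalizer import StringNormalizer

    # acronym-like strings are symbolic, so potentially ambiguous
    StringNormalizer.classify_symbolic("COX 1", entity_class="gene")  # expected: True
    # noun phrases are non-symbolic: candidates for merging into one EquivalentIdSet
    StringNormalizer.classify_symbolic("cyclooxygenase 1", entity_class="gene")  # expected: False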

If the underlying knowledge base contains more than one entity type, multiple parsers should be implemented, each subsetting the source accordingly (e.g. MEDDRA_DISEASE, MEDDRA_DIAGNOSTIC).
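
For example, a hypothetical sketch of such subsetting. MeddraLikeParser, the tsv layout and the semantic_type column are invented for illustration; they are not part of kazu:

    import pandas as pd

    from kazu.ontology_preprocessing.base import OntologyParser


    class MeddraLikeParser(OntologyParser):
        """Keeps only rows of one semantic type from a shared tsv dump."""

        def __init__(self, in_path, entity_class, name, semantic_type, **kwargs):
            super().__init__(in_path=in_path, entity_class=entity_class, name=name, **kwargs)
            self.semantic_type = semantic_type

        def find_kb(self, string):
            # a single underlying source in this toy example
            return "MEDDRA_LIKE"

        def parse_to_dataframe(self):
            df = pd.read_csv(self.in_path, sep="\t")
            return df[df["semantic_type"] == self.semantic_type]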

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional StringSimilarityScorer protocol implementation. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default labels associated with the conflicting IDs. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, as it is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
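
A hypothetical instantiation of the sketch class above, wiring in some of the optional arguments (all values are illustrative):

    disease_parser = MeddraLikeParser(
        in_path="meddra_like.tsv",
        entity_class="disease",
        name="MEDDRA_LIKE_DISEASE",
        semantic_type="disease",
        data_origin="meddra-like demo v1.0",  # provenance, distinct from name
        synonym_merge_threshold=0.75,  # stricter than the 0.7 default
        string_scorer=None,  # with no scorer, conflicts are never merged (not recommended)
    )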

clear_cache()[source]

Clears the disk cache for this parser.

Return type:

None

download_ontology()[source]

Download the ontology to the in_path.

Returns:

Path of downloaded ontology.

Raises:

RuntimeError if no downloader is configured.

Return type:

Path
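
Usage sketch, reusing the hypothetical parser from above (this only succeeds if an OntologyDownloader was passed to the constructor):

    try:
        path = disease_parser.download_ontology()
        print(f"ontology written to {path}")
    except RuntimeError:
        print("no ontology_downloader configured for this parser")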

abstract find_kb(string)[source]

Split an IDX in an implementation-specific way to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

the ontology SOURCE reference for the given IDX

Return type:

str
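
For instance, a sketch of find_kb for a composite resource whose IDs use CURIE-style prefixes (the ID format is an assumption):

    def find_kb(self, string: str) -> str:
        # "MONDO:0005737" -> "MONDO", "HP:0012393" -> "HP"
        return string.split(":", maxsplit=1)[0]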

abstract parse_to_dataframe()[source]

Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame with at least the following columns:

[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]

IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology

Note

It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.

Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.

Return type:

DataFrame
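
A minimal sketch of an implementation, intended as a method on a subclass. The IDs and labels are invented; the column names match all_synonym_column_names and minimum_metadata_column_names listed at the bottom of this page:

    import pandas as pd

    def parse_to_dataframe(self) -> pd.DataFrame:
        records = [
            # the default label is also added as a synonym of itself
            ("MYONT:0001", "paracetamol", "paracetamol", "default_label"),
            ("MYONT:0001", "paracetamol", "acetaminophen", "exactSyn"),
        ]
        return pd.DataFrame(
            records, columns=["idx", "default_label", "syn", "mapping_type"]
        )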

populate_databases(force=False, return_resources=False)[source]

Populate the databases with the results of the parser.

Also calculates the synonym norms associated with any resources (if provided), which can then be used for dictionary-based NER.

Parameters:
  • force (bool) – do not use the cache for the ontology parser

  • return_resources (bool) – should processed resources be returned?

Returns:

the processed resources, if return_resources is True (otherwise None)

Return type:

list[OntologyStringResource] | None
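
Usage sketch (disease_parser is the hypothetical parser from above):

    resources = disease_parser.populate_databases(force=True, return_resources=True)
    if resources is not None:
        print(f"{len(resources)} OntologyStringResources produced")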

populate_metadata_db_and_resolve_string_resources()[source]

Loads the metadata DB and resolves any OntologyStringResources associated with this parser.

Return type:

tuple[dict[str, dict[str, bool | int | float | str]], OntologyResourceSetCompleteReport]

score_and_group_ids(ids_and_source, is_symbolic)[source]

For a given data source, one normalised synonym may map to one or more IDs. In some cases, the IDs may be duplicate/redundant (e.g. there are many ChEMBL IDs for paracetamol). In other cases, the IDs may refer to distinct concepts (e.g. COX 1 could be ‘ENSG00000095303’ OR ‘ENSG00000198804’).

Since synonyms from data sources can be conflated in this manner, we need some way to cluster them into a single LinkingCandidate, which in turn is a container for one or more EquivalentIdSets (depending on whether the concept is ambiguous or not).

The job of score_and_group_ids is to determine how many EquivalentIdSets should be produced for a given set of IDs.

The default algorithm (which can be overridden by concrete parser implementations) works as follows:

  1. If no string_scorer is configured, create an EquivalentIdSet for each ID (strategy NO_STRATEGY - not recommended)

  2. If only one ID is referenced, or the associated normalised synonym string is not symbolic, group the IDs into a single EquivalentIdSet (strategy UNAMBIGUOUS)

  3. Otherwise, compare the default label associated with each ID to every other default label. If the similarity is above self.synonym_merge_threshold, merge the IDs into one EquivalentIdSet; if not, create a new one.

Recommendation: use the SapbertStringSimilarityScorer for comparison. A simplified sketch of this algorithm is shown below the return type.

Important

Any call to this method requires the metadata DB to be populated, as this is the store of DEFAULT_LABEL.

Parameters:
  • ids_and_source (set[tuple[str, str]]) – the IDs to determine appropriate groupings for, together with their associated sources

  • is_symbolic (bool) – is the underlying synonym symbolic?

Returns:

the grouped EquivalentIdSets, and the aggregation strategy that produced them

Return type:

tuple[frozenset[EquivalentIdSet], EquivalentIdAggregationStrategy]
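
The sketch referenced above: a simplified, runnable rendering of the default three-step algorithm. The group_ids function, its toy scorer callable and the default_labels dict are stand-ins; the real method returns a frozenset of EquivalentIdSets plus an EquivalentIdAggregationStrategy rather than plain sets:

    def group_ids(ids, default_labels, is_symbolic, scorer=None, threshold=0.7):
        if scorer is None:
            return [{i} for i in ids]  # step 1: one set per id (NO_STRATEGY)
        if len(ids) == 1 or not is_symbolic:
            return [set(ids)]  # step 2: a single set (UNAMBIGUOUS)
        groups = []  # step 3: greedily merge ids whose default labels are similar
        for idx in ids:
            for group in groups:
                representative = next(iter(group))
                if scorer(default_labels[idx], default_labels[representative]) >= threshold:
                    group.add(idx)
                    break
            else:
                groups.append({idx})
        return groups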

upgrade_ontology_version()[source]

Use when upgrading the version of the underlying ontology data, or when changing the configuration of the AutoCurator.

Generate a report that describes the differences in generated OntologyStringResources between the two versions/configurations. Note that this depends on the existence of a set of autogenerated OntologyStringResources from the previous configuration.

To use this method, simply replace the file/directory of the original ontology version with the new ontology version in the model pack.

Note that calling this method will invalidate the disk cache.

Returns:

Return type:

OntologyUpgradeReport
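
Usage sketch (the report’s attributes are not shown here, as they belong to OntologyUpgradeReport):

    # after swapping the new ontology file/directory into the model pack:
    report = disease_parser.upgrade_ontology_version()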

all_synonym_column_names = ['idx', 'syn', 'mapping_type']
minimum_metadata_column_names = ['default_label', 'data_origin']