The OntologyParser¶
Ontologies are not designed for NLP
—Angus Roberts
Ontologies, or more broadly knowledge bases, are a core component of entity linking. In addition, they can hold a lot of value as a vocabulary source for dictionary-based NER. However, they often need careful handling, as the (uncontextualised) labels and synonyms associated with a given identifier can be noisy and/or overloaded.
For instance, in the MONDO ontology, the abbreviation “OFD” is referenced as has_exact_synonym for osteofibrous dysplasia and orofaciodigital syndrome - i.e. two completely different diseases. Let’s call this scenario 1.
Similarly, the abbreviation “XLOA” is referenced as has_exact_synonym for ocular albinism and X-linked recessive ocular albinism - i.e. two very similar references. Let’s call this scenario 2.
It gets worse… many ontologies make use of identifiers from other ontologies, as well as assigning their own. For instance, “D-TGA” refers to “dextro-looped transposition of the great arteries” and actually has two identifiers associated with it: MONDO_0019443 and HP:0031348 - i.e. the exact same thing, but with different IDs. When we say “we will link to MONDO”, do we mean “only MONDO IDs” or “everything in the MONDO ontology”!? Let’s call this scenario 3.
Anyone familiar with a knowledgebase of any size will know such curation issues are not uncommon. When attempting to find candidates for linking either “OFD”, “XLOA” or “D-TGA” how should we reconcile these scenarios from an NLP perspective? For scenario 1, we can use some context of the underlying text to model which ID in MONDO is more likely. However, for scenario 2, this is very difficult, as the ontology is telling us the abbreviation is valid for both senses. We could arbitrarily choose one (or both). For scenario 3, it would seem like keeping both makes sense. Nevertheless, we need a system that can handle all three scenarios.
Enter the Kazu OntologyParser. The job of the OntologyParser is to transform an ontology or knowledgebase into a set of LinkingCandidates. A LinkingCandidate is a container for a synonym, which understands what set of IDs the synonym may refer to, and whether they refer to a single group of closely related concepts or to multiple separate ones. This is handled by the attribute LinkingCandidate.associated_id_sets. A LinkingCandidate also holds various other pieces of useful information, such as whether the candidate is symbolic (i.e. an abbreviation or some other identifier).
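To make this structure concrete, here is a toy model of the scenarios above, written in plain Python. It is purely illustrative: Kazu’s real LinkingCandidate and EquivalentIdSet classes carry more fields, and the “OFD” IDs shown are placeholders rather than the real MONDO identifiers.

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyEquivalentIdSet:
    # IDs believed to refer to the same underlying concept
    ids: frozenset[str]


@dataclass(frozen=True)
class ToyLinkingCandidate:
    synonym: str
    is_symbolic: bool
    associated_id_sets: frozenset[ToyEquivalentIdSet]


# Scenario 1: "OFD" names two unrelated diseases, so it carries two id sets
# (placeholder ids, not the real MONDO identifiers)
ofd = ToyLinkingCandidate(
    synonym="OFD",
    is_symbolic=True,
    associated_id_sets=frozenset(
        {
            ToyEquivalentIdSet(frozenset({"MONDO_OSTEOFIBROUS_DYSPLASIA"})),
            ToyEquivalentIdSet(frozenset({"MONDO_OROFACIODIGITAL_SYNDROME"})),
        }
    ),
)

# Scenario 3: "D-TGA" has two ids for the exact same concept, so both ids
# live in a single id set
dtga = ToyLinkingCandidate(
    synonym="D-TGA",
    is_symbolic=True,
    associated_id_sets=frozenset(
        {ToyEquivalentIdSet(frozenset({"MONDO_0019443", "HP:0031348"}))}
    ),
)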
How does it work? When an ambiguous candidate is detected in the ontology, the parser must decide whether it should group the confused IDs into the same EquivalentIdSet, or into different ones. The algorithm for doing this works as follows:

1. Use the StringNormalizer to determine whether the candidate is symbolic or not. If it’s not symbolic (i.e. it’s a noun phrase), merge the IDs into a single EquivalentIdSet. The idea here is that noun phrase entities ‘ought’ to be distinct enough that references to the same string across different identifiers refer to the same concept.

   Example: "seborrheic eczema", IDs: "http://purl.obolibrary.org/obo/HP_0001051", "http://purl.obolibrary.org/obo/MONDO_0006608"

   Result: EquivalentIdAggregationStrategy.MERGED_AS_NON_SYMBOLIC
2. If the candidate is symbolic, use the configured string scorer to calculate the similarity of the default labels associated with the different IDs, and, using a predefined threshold, group these IDs into one or more sets. The idea here is that we can use embeddings to check whether, semantically, each ID associated with a confused symbol refers to a concept very similar to that of another ID associated with the symbol, or to something completely different in the knowledgebase. Typically, we use a distilled form of the SapBERT model here, as it’s very good at this.

   Example: "OFD", either osteofibrous dysplasia or orofaciodigital syndrome

   Result: SapBERT similarity: 0.4532. Threshold: 0.70. Decision: split into two instances of EquivalentIdSet.

   Example: "XLOA", either X-linked recessive ocular albinism or ocular albinism

   Result: SapBERT similarity: 0.7426. Threshold: 0.70. Decision: merge into one instance of EquivalentIdSet.
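The symbolic branch of this algorithm can be sketched in a few lines. This is a simplified, single-linkage style illustration rather than Kazu’s exact implementation; the scorer callable stands in for the configured string scorer (e.g. a SapBERT-based model):

from typing import Callable


def group_ids_by_label_similarity(
    id_to_default_label: dict[str, str],
    scorer: Callable[[str, str], float],
    threshold: float = 0.70,
) -> list[set[str]]:
    """Greedily merge IDs whose default labels score above the threshold."""
    groups: list[set[str]] = []
    for idx, label in id_to_default_label.items():
        for group in groups:
            # join the first group containing a sufficiently similar label
            if any(
                scorer(label, id_to_default_label[other]) >= threshold
                for other in group
            ):
                group.add(idx)
                break
        else:
            groups.append({idx})
    return groups


# For "OFD", the labels are dissimilar (0.4532 < 0.70): two groups (split).
# For "XLOA", the labels are similar (0.7426 >= 0.70): one group (merge).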
Naturally, this behaviour may not always be desired. You may want two instances of LinkingCandidate for the synonym “XLOA” (despite the MONDO ontology suggesting this abbreviation is appropriate for either ID), and allow another step to decide which LinkingCandidate is most appropriate. In this case, you can override this behaviour with OntologyParser.score_and_group_ids().
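Conceptually, such an override swaps the similarity-based grouping for a policy of your choosing. The following plain-Python sketch shows a “never merge symbolic candidates” policy; it is deliberately not tied to the actual signature of OntologyParser.score_and_group_ids(), which you should take from the API documentation:

def split_all_symbolic_ids(ids: set[str], is_symbolic: bool) -> list[set[str]]:
    """Toy grouping policy: symbolic candidates yield one id set per ID,
    deferring the choice between them to a later disambiguation step."""
    if is_symbolic:
        return [{idx} for idx in ids]
    # non-symbolic candidates are still merged into a single set
    return [set(ids)]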
Writing a Custom Parser¶
Say you want to make a parser for a new datasource (perhaps for NER or as a new linking target). To do this, you need to write an OntologyParser. Fortunately, this is generally quite easy to do. Let’s take the example of the ChemblOntologyParser.

There are two methods you need to override: OntologyParser.parse_to_dataframe() and OntologyParser.find_kb(). Let’s look at the first of these:
import sqlite3

import pandas as pd

from kazu.ontology_preprocessing.base import (
    OntologyParser,
    DEFAULT_LABEL,
    IDX,
    SYN,
    MAPPING_TYPE,
)


def parse_to_dataframe(self) -> pd.DataFrame:
    """The objective of this method is to create a long, thin pandas dataframe of entities and
    associated metadata.

    We need, at the very least, to extract an ID and a default label. Normally, we'd also be
    looking to extract any synonyms and the type of mapping as well.
    """
    # fortunately, Chembl comes as an sqlite DB,
    # which lends itself very well to this tabular structure
    conn = sqlite3.connect(self.in_path)
    query = f"""\
        SELECT chembl_id AS {IDX}, pref_name AS {DEFAULT_LABEL}, synonyms AS {SYN},
            syn_type AS {MAPPING_TYPE}
        FROM molecule_dictionary AS md
            JOIN molecule_synonyms ms ON md.molregno = ms.molregno
        UNION ALL
        SELECT chembl_id AS {IDX}, pref_name AS {DEFAULT_LABEL}, pref_name AS {SYN},
            'pref_name' AS {MAPPING_TYPE}
        FROM molecule_dictionary
    """
    df = pd.read_sql(query, conn)
    # eliminate anything without a pref_name, as the result will be too big otherwise
    df = df.dropna(subset=[DEFAULT_LABEL])
    df.drop_duplicates(inplace=True)
    return df
Secondly, we need to write the OntologyParser.find_kb() method:
def find_kb(self, string: str) -> str:
    """In our case, this is simple, as everything in the Chembl DB has a chembl identifier.

    Other ontologies may use composite identifiers, e.g. MONDO contains native MONDO_xxxxx
    identifiers as well as HP_xxxxxxx identifiers. In this scenario, we'd need to parse the
    'string' parameter of this method to extract the relevant KB identifier.
    """
    return "CHEMBL"
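For a composite ontology such as MONDO, find_kb() would instead need to extract the source from the identifier itself. A minimal sketch, assuming OBO-style IRIs or CURIEs (exact identifier formats vary by ontology release):

def find_kb(self, string: str) -> str:
    """Extract the source KB from an OBO-style IRI or CURIE.

    e.g. "http://purl.obolibrary.org/obo/MONDO_0019443" -> "MONDO"
         "HP:0031348" -> "HP"
    """
    # take the last path segment of an IRI, then strip the numeric suffix
    last_segment = string.rstrip("/").split("/")[-1]
    return last_segment.replace(":", "_").split("_")[0]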
The full class looks like:
class ChemblOntologyParser(OntologyParser):
    def find_kb(self, string: str) -> str:
        return "CHEMBL"

    def parse_to_dataframe(self) -> pd.DataFrame:
        conn = sqlite3.connect(self.in_path)
        query = f"""\
            SELECT chembl_id AS {IDX}, pref_name AS {DEFAULT_LABEL}, synonyms AS {SYN},
                syn_type AS {MAPPING_TYPE}
            FROM molecule_dictionary AS md
                JOIN molecule_synonyms ms ON md.molregno = ms.molregno
            UNION ALL
            SELECT chembl_id AS {IDX}, pref_name AS {DEFAULT_LABEL}, pref_name AS {SYN},
                'pref_name' AS {MAPPING_TYPE}
            FROM molecule_dictionary
        """
        df = pd.read_sql(query, conn)
        # eliminate anything without a pref_name, as the result will be too big otherwise
        df = df.dropna(subset=[DEFAULT_LABEL])
        df.drop_duplicates(inplace=True)
        return df
Finally, when we want to use our new parser, we need to give it information about what entity class it is associated with:
# We need a string scorer to resolve similar
# and potentially ambiguous synonyms.
# Here, we use a trivial example for brevity.
string_scorer = lambda string_1, string_2: 0.75

parser = ChemblOntologyParser(
    in_path="path to chembl DB goes here",
    # if used in entity linking, entities with class 'drug'
    # will be associated with this parser
    entity_class="drug",
    name="CHEMBL",  # a globally unique name for the parser
    string_scorer=string_scorer,
)
That’s it! The datasource is now ready for integration into Kazu, and can be referenced as a linking target or elsewhere.
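If you want a stand-in scorer that is slightly less trivial than a constant while prototyping, a cheap lexical similarity works, although it is no substitute for an embedding model such as SapBERT:

from difflib import SequenceMatcher


def string_scorer(string_1: str, string_2: str) -> float:
    # character-level similarity in [0, 1]; purely lexical, so it will not
    # capture the semantic similarity an embedding model would
    return SequenceMatcher(None, string_1.lower(), string_2.lower()).ratio()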
Using “OntologyStringResource” for dictionary based matching and/or to modify an Ontology’s behaviour¶
The data sources that Kazu users tend to concern themselves with are often a rich source of nouns that can be accurately used for dictionary-based string matching. Naively, we might think it sufficient to simply take all of the entity labels from an ontology and perform case-insensitive string matching with them. However, unless we have direct control over the ontology, this is rarely the case.
Instead, it’s preferable to curate the ontology, specifying:

1. Strings we want to use from the ontology, and strings we want to ignore.

2. Strings that we want to use for dictionary matching and entity linking, or just entity linking.

3. Whether the case of the string is relevant.

4. How confident one is that a given string match is likely to be a ‘true positive’ entity hit (a sketch of such a curation entry is given after the next list).
In addition, there are the following considerations:

5. Many strings have multiple equally relevant forms/synonyms that aren’t documented in the underlying ontology, but can be automatically generated. How can we ensure we are using those for NER/linking as well?

6. If the ontology is large, it’s probably not practical to review every string - there could be tens of thousands. Therefore, can we employ heuristics to automatically curate some or all of the strings for us?

7. Usually, ontologies are not static. They undergo revisions, in which new strings are added, obsolete ones removed and existing ones changed. Even with autocuration techniques, some manual review will probably be necessary. How can we preserve the work of our previous round of curation when a new version of an ontology is released?

8. Curations can clash! The behaviour of one may interfere with another, similar curation. How can we ensure behaviour is consistent across our set of curations?
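To make points 1-4 concrete, a single curation decision can be modelled roughly as below. This is an illustrative plain-Python model, not Kazu’s actual OntologyStringResource schema (see the API documentation for the real fields):

from dataclasses import dataclass
from enum import Enum, auto


class ToyBehaviour(Enum):
    USE_FOR_NER_AND_LINKING = auto()  # point 2
    USE_FOR_LINKING_ONLY = auto()
    IGNORE = auto()  # point 1: drop the string entirely


@dataclass(frozen=True)
class ToyCuration:
    text: str
    behaviour: ToyBehaviour
    case_sensitive: bool  # point 3
    confidence: str  # point 4, e.g. "probable" or "possible"


# e.g. "WAS" (a gene symbol) is also a common English word: use it for
# linking only, and only when the case matches exactly
was_curation = ToyCuration(
    text="WAS",
    behaviour=ToyBehaviour.USE_FOR_LINKING_ONLY,
    case_sensitive=True,
    confidence="possible",
)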
Note
Prior to Kazu 2.0, the internal curation system of Kazu was cumbersome to use/explain. We recommend upgrading to Kazu 2.0 or later as soon as possible.
Points 1-4 above are handled by the OntologyStringResource and Synonym concepts. Point 5 is handled by the CombinatorialSynonymGenerator class. Point 6 is handled by the AutoCurator class. Point 7 is handled by OntologyParser.upgrade_ontology_version(). Point 8 is handled by the OntologyStringConflictAnalyser and OntologyResourceSetCompleteReport classes (and executed via OntologyParser.populate_metadata_db_and_resolve_string_resources()).
The flow from parsing an ontology to handling its underlying strings is as follows:

1. On first initialisation, the set of LinkingCandidates an ontology produces is converted into a set of OntologyStringResources. This happens via linking_candidates_to_ontology_string_resources().

2. If configured, the CombinatorialSynonymGenerator is executed to generate additional forms for each OntologyStringResource.

3. If configured, the AutoCurator is executed to adjust the default behaviour for each OntologyStringResource. Note that autocuration results can lead to conflicts, which are then “optimistically” resolved to a consistent result via OntologyStringConflictAnalyser.

4. The final set of automatically generated OntologyStringResources is serialised in the model pack. This is required when upgrading to a new version of the ontology, and can also be used as the basis for human curations (supplied via a separate file to the curations_path argument to OntologyParser).

5. The automatically generated set of OntologyStringResources is guaranteed to be consistent. However, it can be difficult to determine whether any additional human curations will cause a conflict. Therefore, the OntologyStringConflictAnalyser will run each time the OntologyParser.populate_databases() method is called (once per Python process, unless force=True). This will throw an exception in the case of conflicts, describing the human curations that need to be adjusted. When the human OntologyStringResources are consistent, they will override their automatically generated equivalents, ensuring the human-curated behaviour takes precedence over the automatically curated version. An OntologyResourceSetCompleteReport can be generated to describe which resources are obsolete/broken/superfluous, via OntologyParser.populate_metadata_db_and_resolve_string_resources().

6. Finally, when upgrading an ontology, OntologyParser.upgrade_ontology_version() is used to generate an OntologyResourceSetCompleteReport, describing the differences between the old and new versions. The results are then used to supplement the existing OntologyStringResources for the new version.
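As an illustration of the kind of conflict checked for in step 5, consider two curations whose normalised forms collide but which disagree on case sensitivity. Reusing the ToyCuration model from earlier, a naive checker (not the actual OntologyStringConflictAnalyser implementation) might flag them like this:

from collections import defaultdict


def find_case_conflicts(curations: list[ToyCuration]) -> list[list[ToyCuration]]:
    """Group curations by a naive normalised form and flag groups whose
    members disagree on case sensitivity."""
    by_norm: dict[str, list[ToyCuration]] = defaultdict(list)
    for curation in curations:
        by_norm[curation.text.lower()].append(curation)
    return [
        group
        for group in by_norm.values()
        if len({c.case_sensitive for c in group}) > 1
    ]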
To assist with the above, Kazu provides a simple Streamlit tool, the Kazu Resource Tool, to help with the curation process.
To explore the other capabilities of the OntologyParser, such as synonym generation and ID filtering, please refer to the API documentation.