kazu.ontology_preprocessing.base¶
Classes

OntologyParser | Parse an ontology (or similar) into a set of outputs suitable for NLP entity linking.
- class kazu.ontology_preprocessing.base.OntologyParser[source]¶
Bases:
ABC
Parse an ontology (or similar) into a set of outputs suitable for NLP entity linking.
Implementations should set a class attribute ‘name’ to something suitably representative. The key method is parse_to_dataframe(), which should convert an input source to a dataframe suitable for further processing.

The other important method is find_kb(). This should parse an ID string (if required) and return the underlying source. This is important for composite resources that contain identifiers from different seed sources. See The OntologyParser for a more detailed guide.

Generally speaking, when parsing a data source, synonyms that are symbolic (as determined by the StringNormalizer) and that refer to more than one id are more likely to be ambiguous. Therefore, we assume they refer to distinct concepts (e.g. COX 1 could be ‘ENSG00000095303’ OR ‘ENSG00000198804’), and thus they will yield multiple instances of EquivalentIdSet. Non-symbolic synonyms (i.e. noun phrases) are far less likely to refer to distinct entities, so we might want to merge the ids associated with non-symbolic ambiguous synonyms into a single EquivalentIdSet. The result of StringNormalizer.classify_symbolic() forms the is_symbolic parameter to score_and_group_ids().

If the underlying knowledgebase contains more than one entity type, multiple parsers should be implemented, subsetting accordingly (e.g. MEDDRA_DISEASE, MEDDRA_DIAGNOSTIC).
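To make the two key methods concrete, below is a minimal sketch of a hypothetical TSV-backed subclass. The class, its input file layout and the exact find_kb signature are illustrative assumptions rather than part of the documented API; the literal column names follow the all_synonym_column_names and minimum_metadata_column_names attributes listed at the bottom of this page, which appear to correspond to the IDX, DEFAULT_LABEL, SYN and MAPPING_TYPE constants referenced in parse_to_dataframe().

```python
import pandas as pd

from kazu.ontology_preprocessing.base import OntologyParser


class MyTsvOntologyParser(OntologyParser):
    """Hypothetical parser for a single-source TSV file with columns: id, label, synonym."""

    def find_kb(self, string: str) -> str:
        # single seed source, so every id maps back to the same knowledgebase
        return "MY_ONTOLOGY"

    def parse_to_dataframe(self) -> pd.DataFrame:
        # assumes OntologyParser.__init__ stores the in_path argument on the instance
        df = pd.read_csv(self.in_path, sep="\t")
        df = df.rename(columns={"id": "idx", "label": "default_label", "synonym": "syn"})
        df["mapping_type"] = "synonym"
        # the implementation is responsible for adding default labels as synonyms
        default_label_rows = df.drop_duplicates("idx").assign(
            syn=lambda frame: frame["default_label"], mapping_type="default_label"
        )
        return pd.concat([df, default_label_rows], ignore_index=True)
```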
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc.)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique.
string_scorer (StringSimilarityScorer | None) – Optional implementation of the StringSimilarityScorer protocol. Used for resolving ambiguous symbolic synonyms via similarity calculation over the default labels associated with the conflicting ids. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – Similarity threshold above which conflicting synonyms are merged into a single EquivalentIdSet. See score_and_group_ids() for further details.
data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source.
synonym_generator (CombinatorialSynonymGenerator | None) – Optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary-based NER matching.
autocurator (AutoCurator | None) – Optional AutoCurator. An AutoCurator contains a series of heuristics that determine the default behaviour for a LinkingCandidate. For example, “ignore any strings shorter than two characters or longer than 50 characters”, or “use case-sensitive matching when the LinkingCandidate is symbolic”.
curations_path (str | Path | None) – Path to a jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – Path to a json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – Optional OntologyDownloader to download the ontology data from a remote source.
- Return type:
None
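As a hedged illustration of how these arguments fit together, the snippet below instantiates the hypothetical parser sketched above. The file path, entity class and data_origin are placeholders; string_scorer is left as None purely to keep the example self-contained, even though (as noted above) this is not recommended in practice.

```python
from pathlib import Path

parser = MyTsvOntologyParser(
    in_path=Path("/path/to/my_ontology.tsv"),  # placeholder path
    entity_class="disease",
    name="MY_ONTOLOGY",            # must be globally unique across parsers
    string_scorer=None,            # not recommended: ambiguous synonyms won't be merged
    synonym_merge_threshold=0.7,
    data_origin="MY_ONTOLOGY v1.0",
)
parser.populate_databases()
```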
- download_ontology()[source]¶
Download the ontology to the in_path.
- Returns:
Path of downloaded ontology.
- Raises:
RuntimeError – if no downloader is configured.
- Return type:
Path
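A brief usage sketch, continuing the example parser from above: if no ontology_downloader was supplied to __init__, the documented RuntimeError is raised.

```python
try:
    downloaded_path = parser.download_ontology()
    print(f"ontology written to {downloaded_path}")
except RuntimeError:
    # no ontology_downloader configured: the data at in_path must be managed manually
    print("no downloader configured for this parser")
```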
- abstract parse_to_dataframe()[source]¶
Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns:

[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]

IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology.

Note
It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.

Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.

- Return type:
DataFrame
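To illustrate the expected ‘long, thin’ shape, the toy frame below is built by hand with one row per (id, synonym) pair. The values are purely illustrative, and the lowercase column names correspond to the column-name constants above.

```python
import pandas as pd

toy_df = pd.DataFrame(
    [
        # the default label is repeated as a synonym, per the note above
        ("ENSG00000095303", "PTGS1", "PTGS1", "default_label"),
        ("ENSG00000095303", "PTGS1", "COX 1", "exactSyn"),
        ("ENSG00000095303", "PTGS1", "prostaglandin-endoperoxide synthase 1", "exactSyn"),
    ],
    columns=["idx", "default_label", "syn", "mapping_type"],
)
```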
- populate_databases(force=False, return_resources=False)[source]¶
Populate the databases with the results of the parser.
Also calculates the synonym norms associated with any resources (if provided), which can then be used for dictionary-based NER.
- Parameters:
force (bool) – if True, repopulate the databases even if a cached version is available
return_resources (bool) – if True, return the processed OntologyStringResources
- Returns:
resources, if requested via return_resources
- Return type:
list[OntologyStringResource] | None
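A brief usage sketch, continuing the example parser from above:

```python
# request the processed resources so they can be inspected
resources = parser.populate_databases(return_resources=True)
if resources is not None:
    print(f"{len(resources)} OntologyStringResources produced")
```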
- populate_metadata_db_and_resolve_string_resources()[source]¶
Loads the metadata DB and resolves any OntologyStringResources associated with this parser.
- score_and_group_ids(ids_and_source, is_symbolic)[source]¶
For a given data source, one normalised synonym may map to one or more ids. In some cases, the ids may be duplicate/redundant (e.g. there are many chembl ids for paracetamol). In other cases, the ids may refer to distinct concepts (e.g. COX 1 could be ‘ENSG00000095303’ OR ‘ENSG00000198804’).

Since synonyms from data sources are confused in such a manner, we need to decide some way to cluster them into a single LinkingCandidate concept, which in turn is a container for one or more EquivalentIdSets (depending on whether the concept is ambiguous or not).

The job of score_and_group_ids is to determine how many EquivalentIdSets should be produced for a given set of ids.

The default algorithm (which can be overridden by concrete parser implementations) works as follows:

1. If no string_scorer is configured, create an EquivalentIdSet for each id (strategy NO_STRATEGY - not recommended).
2. If only one id is referenced, or the associated normalised synonym string is not symbolic, group the ids into a single EquivalentIdSet (strategy UNAMBIGUOUS).
3. Otherwise, compare the default label associated with each id to every other default label. If the similarity is above self.synonym_merge_threshold, merge into one EquivalentIdSet; if not, create a new one.

Recommendation: use the SapbertStringSimilarityScorer for comparison.

Important
Any call to this method requires the metadata DB to be populated, as this is the store of DEFAULT_LABEL.
- Parameters:
ids_and_source – the ids (and their originating sources) to be grouped
is_symbolic (bool) – whether the normalised synonym is symbolic, as determined by StringNormalizer.classify_symbolic()
- Returns:
the grouped EquivalentIdSets and the aggregation strategy used
- Return type:
tuple[frozenset[EquivalentIdSet], EquivalentIdAggregationStrategy]
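The threshold-based merge in step 3 can be pictured with a small standalone sketch. This is an approximation of the documented behaviour for illustration only, not the actual kazu implementation (which returns EquivalentIdSets plus an EquivalentIdAggregationStrategy).

```python
def group_ids_by_label_similarity(default_labels, score, threshold=0.7):
    """Greedy approximation of step 3: ids whose default labels score above the
    threshold end up in the same group (i.e. the same EquivalentIdSet).

    default_labels: mapping of id -> default label (kazu looks these up in the metadata DB)
    score: callable taking two strings and returning a similarity score
    """
    groups: list[set[str]] = []
    for idx, label in default_labels.items():
        for group in groups:
            representative = next(iter(group))
            if score(default_labels[representative], label) >= threshold:
                group.add(idx)
                break
        else:
            # no sufficiently similar group found: start a new one
            groups.append({idx})
    return groups


# e.g. with a trivial exact-match scorer, the two COX 1 genes stay in separate groups
print(group_ids_by_label_similarity(
    {"ENSG00000095303": "PTGS1", "ENSG00000198804": "MT-CO1"},
    score=lambda a, b: 1.0 if a == b else 0.0,
))
```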
- upgrade_ontology_version()[source]¶
Use when upgrading the version of the underlying ontology data, or when changing the configuration of the AutoCurator.

Generates a report that describes the differences in generated OntologyStringResources between the two versions/configurations. Note that this depends on the existence of a set of autogenerated OntologyStringResources from the previous configuration.

To use this method, simply replace the file/directory of the original ontology version with the new ontology version in the model pack.

Note that calling this method will invalidate the disk cache.
- Returns:
a report describing the differences in generated OntologyStringResources between the two versions/configurations
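A hedged sketch of the workflow described above, continuing the example parser; the attributes of the returned report are not documented on this page, so only the call sequence is shown.

```python
# 1. overwrite the original ontology file/directory in the model pack with the new
#    version (i.e. the resource at the parser's in_path), then:
report = parser.upgrade_ontology_version()
# 2. inspect the report for differences in the generated OntologyStringResources,
#    update curations as needed, and rebuild the (now invalidated) disk cache by
#    repopulating the databases
```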
- all_synonym_column_names = ['idx', 'syn', 'mapping_type']¶
- minimum_metadata_column_names = ['default_label', 'data_origin']¶