kazu.ontology_preprocessing.base
Classes
OntologyParser: Parse an ontology (or similar) into a set of outputs suitable for NLP entity linking.
- class kazu.ontology_preprocessing.base.OntologyParser
Bases: ABC
Parse an ontology (or similar) into a set of outputs suitable for NLP entity linking.
Implementations should set a class attribute ‘name’ to something suitably representative. The key method is parse_to_dataframe(), which should convert an input source to a dataframe suitable for further processing.
The other important method is find_kb(). This should parse an ID string (if required) and return the underlying source. This is important for composite resources that contain identifiers from different seed sources.
See The OntologyParser for a more detailed guide.
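As a rough illustration of what find_kb() is described as doing, the sketch below recovers a source name from an ID string. It is a standalone function, not the library's method: the name find_kb_from_prefix, the default_source argument and the assumption that IDs look like PREFIX:local_id are all hypothetical.

```python
def find_kb_from_prefix(id_string: str, default_source: str) -> str:
    """Return the source named by a 'PREFIX:local_id' style ID.

    Hypothetical helper: assumes composite resources mark foreign IDs
    with a prefix, and that unprefixed IDs belong to default_source.
    """
    prefix, sep, _local_id = id_string.partition(":")
    if sep and prefix:
        # A separator was found, so the prefix names the seed source.
        return prefix
    # No prefix: fall back to the parser's own source.
    return default_source


print(find_kb_from_prefix("HP:0001250", "MONDO"))
print(find_kb_from_prefix("0001250", "MONDO"))
```

IDs that are full URIs would need different handling, since this sketch would treat the URI scheme as the prefix.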
Generally speaking, when parsing a data source, synonyms that are symbolic (as determined by the StringNormalizer) and that refer to more than one ID are more likely to be ambiguous. Therefore, we assume they refer to distinct concepts (e.g. COX 1 could be ‘ENSG00000095303’ OR ‘ENSG00000198804’), and thus they will yield multiple instances of EquivalentIdSet. Non-symbolic synonyms (i.e. noun phrases) are far less likely to refer to distinct entities, so we might want to merge the IDs associated with non-symbolic ambiguous synonyms into a single EquivalentIdSet. The result of StringNormalizer.classify_symbolic() forms the is_symbolic parameter to score_and_group_ids().
If the underlying knowledgebase contains more than one entity type, multiple parsers should be implemented, subsetting accordingly (e.g. MEDDRA_DISEASE, MEDDRA_DIAGNOSTIC).
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional implementation of the StringSimilarityScorer protocol. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details
data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”
curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
- Return type:
None
- download_ontology()
Download the ontology to the in_path.
- Returns:
Path of downloaded ontology.
- Raises:
RuntimeError if no downloader is configured.
- Return type:
- abstract parse_to_dataframe()
Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns:
[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]
IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology
Note
It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.
Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.
- Return type:
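To make the ‘long, thin’ layout concrete, here is a minimal sketch of the rows such a frame might hold: one row per (id, synonym) pair, with the default label also emitted as a synonym. It is not the library's implementation. The function name, the input dictionary shape, the example ID and the "label" mapping type are hypothetical, and the column strings assume that IDX, DEFAULT_LABEL, SYN and MAPPING_TYPE hold the lowercase names listed in all_synonym_column_names and minimum_metadata_column_names.

```python
# Assumed string values of the column-name constants.
IDX = "idx"
DEFAULT_LABEL = "default_label"
SYN = "syn"
MAPPING_TYPE = "mapping_type"


def to_long_thin_rows(concepts):
    """Flatten {id: {"label": str, "synonyms": {syn: mapping_type}}}
    into one row per (id, synonym). The input shape is hypothetical."""
    rows = []
    for idx, concept in concepts.items():
        label = concept["label"]
        # The default label is added as a synonym of itself first;
        # "label" is a placeholder mapping type.
        synonyms = {label: "label", **concept["synonyms"]}
        for syn, mapping_type in synonyms.items():
            rows.append(
                {IDX: idx, DEFAULT_LABEL: label, SYN: syn, MAPPING_TYPE: mapping_type}
            )
    return rows


rows = to_long_thin_rows(
    {"EX:0001": {"label": "paracetamol", "synonyms": {"acetaminophen": "exactSyn"}}}
)
for row in rows:
    print(row)
```

Passing rows to pandas.DataFrame(rows) would yield a frame with these four columns; any additional keys would become the ‘extra’ metadata columns mentioned above.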
- populate_databases(force=False, return_resources=False)
Populate the databases with the results of the parser.
Also calculates the synonym norms associated with any resources (if provided), which can then be used for dictionary-based NER.
- Parameters:
- Returns:
resources if required
- Return type:
list[OntologyStringResource] | None
- populate_metadata_db_and_resolve_string_resources()
Loads the metadata DB and resolves any OntologyStringResources associated with this parser.
- score_and_group_ids(ids_and_source, is_symbolic)
For a given data source, one normalised synonym may map to one or more IDs. In some cases, the IDs may be duplicate/redundant (e.g. there are many chembl ids for paracetamol). In other cases, the IDs may refer to distinct concepts (e.g. COX 1 could be ‘ENSG00000095303’ OR ‘ENSG00000198804’).
Since synonyms from data sources are confused in such a manner, we need to decide some way to cluster them into a single LinkingCandidate concept, which in turn is a container for one or more EquivalentIdSet (depending on whether the concept is ambiguous or not).
The job of score_and_group_ids is to determine how many EquivalentIdSets for a given set of ids should be produced.
The default algorithm (which can be overridden by concrete parser implementations) works as follows:
If no string_scorer is configured, create an EquivalentIdSet for each id (strategy NO_STRATEGY - not recommended).
If only one ID is referenced, or the associated normalised synonym string is not symbolic, group the ids into a single EquivalentIdSet (strategy UNAMBIGUOUS).
Otherwise, compare the default label associated with each ID to every other default label. If the similarity is above self.synonym_merge_threshold, merge into one EquivalentIdSet; if not, create a new one.
Recommendation: Use the SapbertStringSimilarityScorer for comparison.
Important
Any call to this method requires the metadata DB to be populated, as this is the store of DEFAULT_LABEL.
- Parameters:
- Returns:
- Return type:
tuple[frozenset[EquivalentIdSet], EquivalentIdAggregationStrategy]
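The three branches of the default algorithm described above can be sketched in plain Python. This is one plausible reading, not the library's code: plain frozensets stand in for EquivalentIdSet, difflib's SequenceMatcher is swapped in for a real StringSimilarityScorer, the labels_by_id input shape is hypothetical, and "SIMILARITY_SKETCH" is a placeholder label because the name of the third strategy is not given here.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in scorer based on character overlap (an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_ids(labels_by_id, is_symbolic, scorer=similarity, threshold=0.7):
    """Group the ids behind one normalised synonym.

    labels_by_id maps each id to its default label (hypothetical shape).
    Returns (list of frozensets of ids, strategy name).
    """
    ids = sorted(labels_by_id)
    if scorer is None:
        # No scorer configured: every id becomes its own group.
        return [frozenset([i]) for i in ids], "NO_STRATEGY"
    if len(ids) == 1 or not is_symbolic:
        # A single id, or a non-symbolic synonym: one group for all ids.
        return [frozenset(ids)], "UNAMBIGUOUS"
    groups = []
    for idx in ids:
        for group in groups:
            # Join the first group holding a sufficiently similar label.
            if any(
                scorer(labels_by_id[idx], labels_by_id[other]) > threshold
                for other in group
            ):
                group.add(idx)
                break
        else:
            # No existing group was similar enough: start a new one.
            groups.append({idx})
    return [frozenset(g) for g in groups], "SIMILARITY_SKETCH"


groups, strategy = group_ids(
    {"ID1": "abc", "ID2": "abc", "ID3": "xyz"}, is_symbolic=True
)
print(groups, strategy)
```

The greedy first-match grouping is a simplification; the order in which ids are visited can change the result when similarities sit close to the threshold.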
- upgrade_ontology_version()
Use when upgrading the version of the underlying ontology data, or when changing the configuration of the AutoCurator.
Generate a report that describes the differences in generated OntologyStringResources between the two versions/configurations. Note that this depends on the existence of a set of autogenerated OntologyStringResources from the previous configuration.
To use this method, simply replace the file/directory of the original ontology version with the new ontology version in the model pack.
Note that calling this method will invalidate the disk cache.
- Returns:
- Return type:
- all_synonym_column_names = ['idx', 'syn', 'mapping_type']
- minimum_metadata_column_names = ['default_label', 'data_origin']