kazu.ontology_preprocessing.parsers

This module consists entirely of implementations of OntologyParser.

Some of these are aimed specifically at a custom format for individual ontologies, like ChemblOntologyParser or MeddraOntologyParser.

Others aim to provide flexibility for a user across a format, such as RDFGraphParser, TabularOntologyParser and JsonLinesOntologyParser.

If you do not find a parser that meets your needs, please see Writing a Custom Parser.
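Whichever parser is used, they all normalise their source into the same "long, thin" table of at least the columns IDX, DEFAULT_LABEL, SYN and MAPPING_TYPE (see parse_to_dataframe below). A minimal pandas sketch of that shape (the literal column-name strings are illustrative; kazu exposes these names as constants):

```python
import pandas as pd

# One row per (id, synonym) pair; the default label is repeated on each row
# and also appears as its own synonym, per the parse_to_dataframe contract.
df = pd.DataFrame(
    [
        ("GO:0008150", "biological_process", "biological_process", "name"),
        ("GO:0008150", "biological_process", "physiological process", "hasExactSynonym"),
    ],
    columns=["IDX", "DEFAULT_LABEL", "SYN", "MAPPING_TYPE"],
)
```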

Classes

ATCDrugClassificationParser

Parser for the ATC Drug classification dataset.

BiologicalProcessGeneOntologyParser

A subclass of GeneOntologyParser that filters to only the biological_process namespace.

CLOOntologyParser

Input is a CLO OWL file https://www.ebi.ac.uk/ols/ontologies/clo.

CLOntologyParser

Input should be a CL OWL file e.g. https://www.ebi.ac.uk/ols/ontologies/cl.

CellosaurusOntologyParser

Input is an OBO file from Cellosaurus, e.g. https://ftp.expasy.org/databases/cellosaurus/cellosaurus.obo.

CellularComponentGeneOntologyParser

A subclass of GeneOntologyParser that filters to only the cellular_component namespace.

ChemblOntologyParser

Input is a directory containing an extracted SQLite dump from Chembl.

ChemblParquetOntologyParser

Input is a parquet file containing an extracted SQLite dump from Chembl.

GeneOntologyParser

A parser for the Gene Ontology.

HGNCGeneFamilyParser

Parse HGNC data and extract only Gene Families as entities.

HGNCGeneOntologyParser

Parse HGNC data and extract individual genes as entities.

HPOntologyParser

JsonLinesOntologyParser

A parser for a jsonlines dataset.

MeddraOntologyParser

Input is an unzipped directory of a Meddra release (note: requires a licence).

MolecularFunctionGeneOntologyParser

A subclass of GeneOntologyParser that filters to only the molecular_function namespace.

MondoOntologyParser

OpenTargetsDiseaseOntologyParser

Parser for OpenTargets Disease release.

OpenTargetsMoleculeOntologyParser

OpenTargetsTargetOntologyParser

Parser for the OT Target dataset.

RDFGraphParser

Parser for RDF files.

SKOSXLGraphParser

Parse SKOS-XL RDF Files.

StatoParser

Parse STATO: input should be an OWL file.

TabularOntologyParser

For already tabulated data.

UberonOntologyParser

Input should be a UBERON OWL file e.g. https://www.ebi.ac.uk/ols/ontologies/uberon.

class kazu.ontology_preprocessing.parsers.ATCDrugClassificationParser[source]

Bases: TabularOntologyParser

Parser for the ATC Drug classification dataset.

This requires a licence from WHO, available at https://www.who.int/tools/atc-ddd-toolkit/atc-classification .

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
Return type:

None

parse_to_dataframe()[source]

Assume input file is already in correct format.

Inherit and override this method if different behaviour is required.

Returns:

Return type:

DataFrame

levels_to_ignore = {'1', '2', '3'}
class kazu.ontology_preprocessing.parsers.BiologicalProcessGeneOntologyParser[source]

Bases: GeneOntologyParser

A subclass of GeneOntologyParser that filters to only the biological_process namespace.

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

Return type:

None

class kazu.ontology_preprocessing.parsers.CLOOntologyParser[source]

Bases: RDFGraphParser

Input is a CLO OWL file https://www.ebi.ac.uk/ols/ontologies/clo.

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

Return type:

None

find_kb(string)[source]

By default, just return the name of the parser.

If more complex behaviour is necessary, write a custom subclass and override this method.

Parameters:

string (str)

Return type:

str

class kazu.ontology_preprocessing.parsers.CLOntologyParser[source]

Bases: RDFGraphParser

Input should be a CL OWL file e.g. https://www.ebi.ac.uk/ols/ontologies/cl.

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, include_entity_patterns=None, exclude_entity_patterns=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

  • include_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)

  • exclude_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)

Return type:

None

find_kb(string)[source]

By default, just return the name of the parser.

If more complex behaviour is necessary, write a custom subclass and override this method.

Parameters:

string (str)

Return type:

str

class kazu.ontology_preprocessing.parsers.CellosaurusOntologyParser[source]

Bases: OntologyParser

Input is an OBO file from Cellosaurus, e.g. https://ftp.expasy.org/databases/cellosaurus/cellosaurus.obo.

find_kb(string)[source]

Split an IDX somehow to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

Return type:

str
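As a sketch of what such an override might do (the real Cellosaurus logic may differ), a CURIE-style IDX can be split on its delimiter to recover the SOURCE prefix:

```python
def find_kb(idx: str) -> str:
    # Illustrative only: take the prefix of a CURIE-style id,
    # e.g. "GO:0008150" -> "GO", "CVCL_0023" -> "CVCL".
    for delimiter in (":", "_"):
        if delimiter in idx:
            return idx.split(delimiter, 1)[0]
    return idx
```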

parse_to_dataframe()[source]

Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns:

[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]

IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology

Note

It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.

Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.

Return type:

DataFrame
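To illustrate the contract above, and in particular the note that implementations must add default labels as synonyms, a hypothetical helper turning (id, label, synonyms) records into the long, thin format might look like this (the column-name strings and the "synonym" mapping type are illustrative):

```python
import pandas as pd

def records_to_long_thin(records):
    """records: iterable of (idx, default_label, synonyms) tuples.
    Emits one row per synonym, plus the default label itself as a synonym,
    per the parse_to_dataframe contract."""
    rows = []
    for idx, label, synonyms in records:
        rows.append((idx, label, label, "name"))  # default label as its own synonym
        for syn in synonyms:
            rows.append((idx, label, syn, "synonym"))
    return pd.DataFrame(rows, columns=["IDX", "DEFAULT_LABEL", "SYN", "MAPPING_TYPE"])

df = records_to_long_thin([("CVCL_0023", "HeLa", ["Hela", "He La"])])
```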

score_and_group_ids(ids_and_source, is_symbolic)[source]

Treat all synonyms as separate cell lines.

Parameters:
Returns:

Return type:

tuple[frozenset[EquivalentIdSet], EquivalentIdAggregationStrategy]

cell_line_re = re.compile('cell line', re.IGNORECASE)
class kazu.ontology_preprocessing.parsers.CellularComponentGeneOntologyParser[source]

Bases: GeneOntologyParser

A subclass of GeneOntologyParser that filters to only the cellular_component namespace.

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

Return type:

None

class kazu.ontology_preprocessing.parsers.ChemblOntologyParser[source]

Bases: OntologyParser

Input is a directory containing an extracted SQLite dump from Chembl.

Deprecated since version 2.1.0: Use kazu.ontology_preprocessing.parsers.ChemblParquetOntologyParser instead. This is deprecated so we don’t have to store a large sqlite database file in resources.

For example, this can be sourced from: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_33/chembl_33_sqlite.tar.gz.

find_kb(string)[source]

Split an IDX somehow to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

Return type:

str

parse_to_dataframe()[source]

Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns:

[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]

IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology

Note

It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.

Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.

Return type:

DataFrame

class kazu.ontology_preprocessing.parsers.ChemblParquetOntologyParser[source]

Bases: OntologyParser

Input is a parquet file containing an extracted SQLite dump from Chembl.

Note

See kazu.ontology_preprocessing.downloads.ChemblParquetOntologyDownloader for how the extraction is performed.

find_kb(string)[source]

Split an IDX somehow to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

Return type:

str

parse_to_dataframe()[source]

Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns:

[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]

IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology

Note

It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.

Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.

Return type:

DataFrame

class kazu.ontology_preprocessing.parsers.GeneOntologyParser[source]

Bases: RDFGraphParser

A parser for the Gene Ontology.

Differences from its parent class RDFGraphParser:

  1. Specify an appropriate uri_regex and synonym_predicates for the Gene Ontology.

  2. Drop entities whose default label contains 'obsolete' - see parse_to_dataframe().

  3. Cache the parsing of the rdf file. Multiple parsers use different parts of the ontology, so caching avoids re-parsing the source file several times - a significant cost, as the file is very large and parsing RDF is expensive.

Subclasses of this class like BiologicalProcessGeneOntologyParser filter to a specific ‘namespace’ within the Gene Ontology. These are present as a convenience and for discoverability. It is straightforward to configure a GeneOntologyParser instance to filter to a namespace without subclassing - see the implementation of BiologicalProcessGeneOntologyParser for details.
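For illustration, a namespace filter of this kind could plausibly be expressed as an include_entity_patterns value of (predicate, node) tuples. The oboInOwl hasOBONamespace predicate is the standard Gene Ontology namespace annotation, but this is a sketch: check the source of BiologicalProcessGeneOntologyParser for the exact arguments it passes.

```python
from rdflib import Literal, URIRef

# Hypothetical filter configuration: keep only entities whose
# oboInOwl:hasOBONamespace annotation is "biological_process".
include_entity_patterns = [
    (
        URIRef("http://www.geneontology.org/formats/oboInOwl#hasOBONamespace"),
        Literal("biological_process"),
    ),
]
```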

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, include_entity_patterns=None, exclude_entity_patterns=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

  • include_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)

  • exclude_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)

Return type:

None

parse_to_dataframe()[source]

A modification of RDFGraphParser.parse_to_dataframe().

The only difference from the overridden method is that this drops entities whose default label contains 'obsolete', as these are no longer relevant for Gene Ontology NER/entity linking.

Return type:

DataFrame

static parse_to_graph(in_path)[source]

Cached version of RDFGraphParser.parse_to_graph().

Cached due to the expense of parsing Gene Ontology from scratch (otherwise we end up doing this 3 times in the public model pack).

Parameters:

in_path (str)

Return type:

Graph

populate_databases(force=False, return_resources=False)[source]

Modified version of the parent populate_databases() to handle caching.

We have custom logic here to clear the cache on parse_to_graph(), because the cached graph is quite large in memory and would otherwise remain in use for the entire runtime of kazu.

Parameters:
Return type:

list[OntologyStringResource] | None

instances: set[str] = set()
instances_in_dbs: set[str] = set()
class kazu.ontology_preprocessing.parsers.HGNCGeneFamilyParser[source]

Bases: OntologyParser

Parse HGNC data and extract only Gene Families as entities.

Input is a json from HGNC. For example, http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/json/hgnc_complete_set.json.

find_kb(string)[source]

Split an IDX somehow to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

Return type:

str

parse_to_dataframe()[source]

Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns:

[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]

IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology

Note

It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.

Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.

Return type:

DataFrame

syn_column_keys = {'Common root gene symbol', 'Family alias'}
class kazu.ontology_preprocessing.parsers.HGNCGeneOntologyParser[source]

Bases: OntologyParser

Parse HGNC data and extract individual genes as entities.

Input is a json from HGNC. For example, http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/json/hgnc_complete_set.json.

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

Return type:

None

find_kb(string)[source]

Split an IDX somehow to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

Return type:

str

parse_to_dataframe()[source]

Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns:

[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]

IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology

Note

It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.

Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.

Return type:

DataFrame

class kazu.ontology_preprocessing.parsers.HPOntologyParser[source]

Bases: RDFGraphParser

find_kb(string)[source]

By default, just return the name of the parser.

If more complex behaviour is necessary, write a custom subclass and override this method.

Parameters:

string (str)

Return type:

str

class kazu.ontology_preprocessing.parsers.JsonLinesOntologyParser[source]

Bases: OntologyParser

A parser for a jsonlines dataset.

Assumes one KB entry (i.e. one JSON object) per line.

This should be subclassed and subclasses must implement json_dict_to_parser_records().

json_dict_to_parser_records(jsons_gen)[source]

For a given input json (represented as a python dict), yield dictionary record(s) compatible with the expected structure of the Ontology Parser superclass.

This means dictionaries should have keys for SYN, MAPPING_TYPE, DEFAULT_LABEL and IDX. All other keys are used as mapping metadata.

Parameters:

jsons_gen (Iterable[dict[str, Any]]) – iterable of python dict representing json objects

Returns:

Return type:

Iterable[dict[str, Any]]
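A sketch of such an implementation for a hypothetical input record shape (the 'id'/'label'/'synonyms' input keys and the "synonym" mapping type are invented for illustration; the output keys mirror the SYN, MAPPING_TYPE, DEFAULT_LABEL and IDX columns described above):

```python
from typing import Any, Iterable

def json_dict_to_parser_records(
    jsons_gen: Iterable[dict[str, Any]]
) -> Iterable[dict[str, Any]]:
    # Hypothetical input shape: {"id": ..., "label": ..., "synonyms": [...]}.
    # Yield one record per synonym, including the default label itself.
    for record in jsons_gen:
        for syn in [record["label"]] + record.get("synonyms", []):
            yield {
                "IDX": record["id"],
                "DEFAULT_LABEL": record["label"],
                "SYN": syn,
                "MAPPING_TYPE": "synonym",
            }

rows = list(
    json_dict_to_parser_records([{"id": "X:1", "label": "foo", "synonyms": ["bar"]}])
)
```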

parse_to_dataframe()[source]

Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns:

[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]

IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology

Note

It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.

Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.

Return type:

DataFrame

read(path)[source]
Parameters:

path (Path)

Return type:

Iterable[dict[str, Any]]

class kazu.ontology_preprocessing.parsers.MeddraOntologyParser[source]

Bases: OntologyParser

Input is an unzipped directory of a Meddra release (note: requires a licence).

This should contain the files ‘mdhier.asc’ and ‘llt.asc’.

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None, exclude_socs=('Surgical and medical procedures', 'Social circumstances', 'Investigations'))[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

  • exclude_socs (Iterable[str])

Return type:

None

find_kb(string)[source]

Split an IDX somehow to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

Return type:

str

parse_to_dataframe()[source]

Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns:

[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]

IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology

Note

It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.

Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.

Return type:

DataFrame

class kazu.ontology_preprocessing.parsers.MolecularFunctionGeneOntologyParser[source]

Bases: GeneOntologyParser

A subclass of GeneOntologyParser that filters to only the molecular_function namespace.

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, and is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary-based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

Return type:

None

class kazu.ontology_preprocessing.parsers.MondoOntologyParser[source]

Bases: OntologyParser

find_kb(string)[source]

Split an IDX somehow to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

Return type:

str
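The kind of IDX splitting this method performs can be sketched as follows. This is a hypothetical implementation, assuming OBO-style IRIs where the fragment before the underscore names the source ontology; the real parser's logic may differ.

```python
def find_kb(string: str) -> str:
    """Hypothetical find_kb: recover the ontology SOURCE from an IDX.

    Assumes OBO-style IRIs such as
    'http://purl.obolibrary.org/obo/MONDO_0005737', where the fragment
    before the underscore names the source ontology.
    """
    fragment = string.rsplit("/", 1)[-1]   # e.g. 'MONDO_0005737'
    return fragment.split("_", 1)[0]       # e.g. 'MONDO'
```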

is_valid_iri(text)[source]
Parameters:

text (str)

Return type:

bool

parse_to_dataframe()[source]

Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns:

[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]

IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology

Note

It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.

Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.

Return type:

DataFrame

class kazu.ontology_preprocessing.parsers.OpenTargetsDiseaseOntologyParser[source]

Bases: JsonLinesOntologyParser

Parser for OpenTargets Disease release.

OpenTargets has a lot of entities in its disease dataset, not all of which are diseases. Here, we use the allowed_therapeutic_areas argument to describe which specific therapeutic areas a given instance of this parser should use. See https://platform-docs.opentargets.org/disease-or-phenotype for more info.

__init__(in_path, entity_class, name, allowed_therapeutic_areas, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • allowed_therapeutic_areas (Iterable[str]) – only diseases from these therapeutic areas are processed (see the class docstring above)

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, and is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary-based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

Return type:

None

find_kb(string)[source]

Split an IDX somehow to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

Return type:

str

json_dict_to_parser_records(jsons_gen)[source]

For a given input json (represented as a python dict), yield dictionary record(s) compatible with the expected structure of the Ontology Parser superclass.

This means dictionaries should have keys for SYN, MAPPING_TYPE, DEFAULT_LABEL and IDX. All other keys are used as mapping metadata.

Parameters:

jsons_gen (Iterable[dict[str, Any]]) – iterable of python dict representing json objects

Returns:

Return type:

Iterable[dict[str, Any]]
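The expansion this method performs can be sketched as a generator that emits one record per synonym. The input field names ('id', 'name', 'synonyms') are invented for illustration; real OpenTargets records have a different schema.

```python
from typing import Any, Iterable

def json_dict_to_parser_records(
    jsons_gen: Iterable[dict[str, Any]],
) -> Iterable[dict[str, Any]]:
    """Illustrative record expansion: one output record per synonym.

    Input field names ('id', 'name', 'synonyms') are hypothetical.
    """
    for record in jsons_gen:
        # the default label must itself be emitted as a synonym
        yield {
            "IDX": record["id"],
            "DEFAULT_LABEL": record["name"],
            "SYN": record["name"],
            "MAPPING_TYPE": "default_label",
        }
        for syn in record.get("synonyms", []):
            yield {
                "IDX": record["id"],
                "DEFAULT_LABEL": record["name"],
                "SYN": syn,
                "MAPPING_TYPE": "hasExactSynonym",
            }
```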

score_and_group_ids(ids_and_source, is_symbolic)[source]

Group disease IDs via cross-reference.

Falls back to superclass implementation if any xrefs are inconsistently described.

Parameters:
Returns:

Return type:

tuple[frozenset[EquivalentIdSet], EquivalentIdAggregationStrategy]
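The grouping idea can be illustrated with a toy helper, in which IDs whose cross-reference sets overlap, directly or transitively, are merged into one group. The helper name and input shape below are invented; this is not kazu's actual implementation, which also handles fallback to the superclass on inconsistent xrefs.

```python
def group_ids_by_xref(xrefs_by_id: dict[str, set[str]]) -> list[frozenset[str]]:
    """Toy illustration of grouping IDs via shared cross-references.

    IDs whose xref sets overlap (directly or transitively) end up in
    the same group.
    """
    groups: list[tuple[set[str], set[str]]] = []  # (ids, union of xrefs)
    for idx, xrefs in xrefs_by_id.items():
        merged_ids, merged_xrefs = {idx}, set(xrefs)
        remaining = []
        for ids, refs in groups:
            if refs & merged_xrefs:
                # transitively merge any group sharing an xref
                merged_ids |= ids
                merged_xrefs |= refs
            else:
                remaining.append((ids, refs))
        remaining.append((merged_ids, merged_xrefs))
        groups = remaining
    return [frozenset(ids) for ids, _ in groups]
```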

DF_XREF_FIELD_NAME = 'dbXRefs'
class kazu.ontology_preprocessing.parsers.OpenTargetsMoleculeOntologyParser[source]

Bases: JsonLinesOntologyParser

find_kb(string)[source]

Split an IDX somehow to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

Return type:

str

json_dict_to_parser_records(jsons_gen)[source]

For a given input json (represented as a python dict), yield dictionary record(s) compatible with the expected structure of the Ontology Parser superclass.

This means dictionaries should have keys for SYN, MAPPING_TYPE, DEFAULT_LABEL and IDX. All other keys are used as mapping metadata.

Parameters:

jsons_gen (Iterable[dict[str, Any]]) – iterable of python dict representing json objects

Returns:

Return type:

Iterable[dict[str, Any]]

class kazu.ontology_preprocessing.parsers.OpenTargetsTargetOntologyParser[source]

Bases: JsonLinesOntologyParser

Parser for the OT Target dataset.

Note

Automatically ignored records

Since there are many thousands of Ensembl IDs that reference uninteresting genomic locations, we will likely never see them in natural language. Therefore, we automatically filter out records that do not have an approved symbol defined. In addition, this class allows you to filter out biotypes you are not interested in via the excluded_biotypes argument of the constructor.

__init__(in_path, entity_class, name, excluded_biotypes=None, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • excluded_biotypes (Iterable[str] | None) – biotypes to exclude, as described in the note above

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, and is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary-based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

Return type:

None

find_kb(string)[source]

Split an IDX somehow to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

Return type:

str

json_dict_to_parser_records(jsons_gen)[source]

For a given input json (represented as a python dict), yield dictionary record(s) compatible with the expected structure of the Ontology Parser superclass.

This means dictionaries should have keys for SYN, MAPPING_TYPE, DEFAULT_LABEL and IDX. All other keys are used as mapping metadata.

Parameters:

jsons_gen (Iterable[dict[str, Any]]) – iterable of python dict representing json objects

Returns:

Return type:

Iterable[dict[str, Any]]

score_and_group_ids(ids_and_source, is_symbolic)[source]

Group Ensembl gene IDs belonging to the same gene.

Note for non-biologists about genes

The concept of a ‘gene’ is complex, and Ensembl gene IDs actually refer to locations on the genome rather than individual genes. In fact, one ‘gene’ can be made up of multiple Ensembl gene IDs; generally speaking, these are exons that produce different isoforms of a given protein.

Parameters:
Returns:

Return type:

tuple[frozenset[EquivalentIdSet], EquivalentIdAggregationStrategy]

ANNOTATION_FIELDS = {'chemicalProbes', 'constraint', 'functionDescriptions', 'go', 'hallmarks', 'pathways', 'safetyLiabilities', 'subcellularLocations', 'targetClass', 'tractability'}
class kazu.ontology_preprocessing.parsers.RDFGraphParser[source]

Bases: OntologyParser

Parser for rdf files.

Supports any RDF format that rdflib.Graph.parse() can infer from the file extension, e.g. .xml , .ttl , .owl , .json. Case of the extension does not matter. This functionality is handled by rdflib.util.guess_format(), but will fall back to attempting to parse as turtle/ttl format in the case of an unknown file extension.

__init__(in_path, entity_class, name, uri_regex, synonym_predicates, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, include_entity_patterns=None, exclude_entity_patterns=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None, label_predicate=rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'))[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, and is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary-based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

  • uri_regex (str | Pattern[str])

  • synonym_predicates (Iterable[Path | Node | str])

  • include_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)

  • exclude_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)

  • label_predicate (Path | Node | str)

Return type:

None

static convert_to_rdflib_ref(pred: Path) → Path[source]
static convert_to_rdflib_ref(pred: Node) → Node
static convert_to_rdflib_ref(pred: str) → URIRef
find_kb(string)[source]

By default, just return the name of the parser.

If more complex behaviour is necessary, write a custom subclass and override this method.

Parameters:

string (str)

Return type:

str

is_valid_iri(text)[source]

Check if input string is a valid IRI for the ontology being parsed.

Uses self._uri_regex to define valid IRIs.

Parameters:

text (str)

Return type:

bool
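The check is a straightforward regex match against the parser's configured pattern. A sketch, using an invented uri_regex of the kind you might pass to RDFGraphParser.__init__ (the real check lives on the instance as self._uri_regex):

```python
import re

# hypothetical uri_regex, of the kind passed to RDFGraphParser.__init__
_uri_regex = re.compile(r"^http://purl\.obolibrary\.org/obo/MONDO_\d+$")

def is_valid_iri(text: str) -> bool:
    """Check an IRI against the parser's uri_regex (illustrative)."""
    return _uri_regex.match(text) is not None
```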

parse_to_dataframe()[source]

Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns:

[IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]

IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology

Note

It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.

Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.

Return type:

DataFrame

static parse_to_graph(in_path)[source]

Parse the given input path using rdflib.

Called by parse_to_dataframe(), this is a separate method to allow overriding to tweak the parsing process, such as adding caching.

Parameters:

in_path (Path)

Return type:

Graph
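The extension-based format detection with turtle fallback described for this class can be mimicked in pure Python. This is a simplified, illustrative stand-in for rdflib.util.guess_format; the real extension-to-format mapping is larger.

```python
# simplified, illustrative subset of rdflib's extension-to-format map
_FORMAT_BY_EXT = {"xml": "xml", "owl": "xml", "ttl": "turtle",
                  "json": "json-ld", "nt": "nt"}

def guess_rdf_format(filename: str) -> str:
    """Guess an rdflib parse format from the file extension.

    Case-insensitive, falling back to turtle for unknown extensions,
    mirroring the behaviour described for RDFGraphParser.
    """
    ext = filename.rsplit(".", 1)[-1].lower()
    return _FORMAT_BY_EXT.get(ext, "turtle")
```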

class kazu.ontology_preprocessing.parsers.SKOSXLGraphParser[source]

Bases: RDFGraphParser

Parse SKOS-XL RDF Files.

Note that this just sets a default label predicate and synonym predicate to SKOS-XL appropriate paths, and then passes through to the parent RDFGraphParser class. This class is just a convenience to make specifying a SKOS-XL parser easier; the same functionality is available via RDFGraphParser directly.

__init__(in_path, entity_class, name, uri_regex, synonym_predicates=(Path(http://www.w3.org/2008/05/skos-xl#altLabel / http://www.w3.org/2008/05/skos-xl#literalForm), ), string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, include_entity_patterns=None, exclude_entity_patterns=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None, label_predicate=Path(http://www.w3.org/2008/05/skos-xl#prefLabel / http://www.w3.org/2008/05/skos-xl#literalForm))[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, and is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary-based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

  • uri_regex (str | Pattern[str])

  • synonym_predicates (Iterable[Path | Node | str])

  • include_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)

  • exclude_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)

  • label_predicate (Path | Node | str)

Return type:

None

class kazu.ontology_preprocessing.parsers.StatoParser[source]

Bases: RDFGraphParser

Parse STATO: input should be an owl file.

Available at e.g. https://www.ebi.ac.uk/ols/ontologies/stato .

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, include_entity_patterns=None, exclude_entity_patterns=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, and is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary-based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

  • include_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)

  • exclude_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)

Return type:

None

find_kb(string)[source]

By default, just return the name of the parser.

If more complex behaviour is necessary, write a custom subclass and override this method.

Parameters:

string (str)

Return type:

str

class kazu.ontology_preprocessing.parsers.TabularOntologyParser[source]

Bases: OntologyParser

For already tabulated data.

This expects in_path to be the path to a file that can be loaded by pandas.read_csv() (e.g. a .csv or .tsv file), and the result should be in the format that is produced by parse_to_dataframe() - see the docs of that method for more details on the format of this dataframe.

Note that this class’s __init__ method takes a **kwargs parameter which is passed through to pandas.read_csv() , giving you a notable degree of flexibility in how exactly the input file is converted into this dataframe. If this becomes too complex to express via **kwargs, it may be worth considering Writing a Custom Parser.

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None, **kwargs)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, and is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary-based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

  • kwargs – passed through to pandas.read_csv()
Return type:

None

find_kb(string)[source]

Split an IDX somehow to find the ontology SOURCE reference.

Parameters:

string (str) – the IDX string to process

Returns:

Return type:

str

parse_to_dataframe()[source]

Assume input file is already in correct format.

Inherit and override this method if different behaviour is required.

Returns:

Return type:

DataFrame
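How a tabular input in the expected shape flows through pandas.read_csv() can be sketched as follows. The TSV content is invented, and sep='\t' stands in for the kind of option you might forward via the constructor's **kwargs.

```python
import io
import pandas as pd

# a toy TSV already in the expected 'long, thin' shape
tsv = (
    "IDX\tDEFAULT_LABEL\tSYN\tMAPPING_TYPE\n"
    "EX:0001\tparacetamol\tparacetamol\tdefault_label\n"
    "EX:0001\tparacetamol\tacetaminophen\texactSyn\n"
)
# TabularOntologyParser forwards its **kwargs to read_csv;
# sep='\t' is an example of such an option
df = pd.read_csv(io.StringIO(tsv), sep="\t")
```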

class kazu.ontology_preprocessing.parsers.UberonOntologyParser[source]

Bases: RDFGraphParser

Input should be a UBERON owl file e.g. https://www.ebi.ac.uk/ols/ontologies/uberon.

__init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]
Parameters:
  • in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)

  • entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.

  • name (str) – A string to represent a parser in the overall pipeline. Should be globally unique

  • string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!

  • synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details

  • data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, and is used to identify the origin of a mapping back to a data source

  • synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary-based NER matching

  • autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”

  • curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.

  • global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.

  • ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.

Return type:

None

find_kb(string)[source]

By default, just return the name of the parser.

If more complex behaviour is necessary, write a custom subclass and override this method.

Parameters:

string (str)

Return type:

str