kazu.ontology_preprocessing.parsers¶
This module consists entirely of implementations of OntologyParser.
Some of these are aimed specifically at the custom format of an individual ontology, like ChemblOntologyParser or MeddraOntologyParser. Others aim to provide flexibility to a user across a format, such as RDFGraphParser, TabularOntologyParser and JsonLinesOntologyParser.
If you do not find a parser that meets your needs, please see Writing a Custom Parser.
Classes
- ATCDrugClassificationParser – Parser for the ATC Drug classification dataset.
- BiologicalProcessGeneOntologyParser – A subclass of GeneOntologyParser that filters to only the biological_process namespace.
- CLOOntologyParser – Input is a CLO Owl file https://www.ebi.ac.uk/ols/ontologies/clo.
- CLOntologyParser – Input should be a CL owl file e.g. https://www.ebi.ac.uk/ols/ontologies/cl.
- CellosaurusOntologyParser – Input is an obo file from cellosaurus, e.g. https://ftp.expasy.org/databases/cellosaurus/cellosaurus.obo.
- CellularComponentGeneOntologyParser – A subclass of GeneOntologyParser that filters to only the cellular_component namespace.
- ChemblOntologyParser – Input is a directory containing an extracted sqlite dump from Chembl.
- ChemblParquetOntologyParser – Input is a parquet file containing an extracted sqlite dump from Chembl.
- GeneOntologyParser – A parser for the Gene Ontology.
- HGNCGeneFamilyParser – Parse HGNC data and extract only Gene Families as entities.
- HGNCGeneOntologyParser – Parse HGNC data and extract individual genes as entities.
- JsonLinesOntologyParser – A parser for a jsonlines dataset.
- MeddraOntologyParser – Input is an unzipped directory of a Meddra release (Note, requires licence).
- MolecularFunctionGeneOntologyParser – A subclass of GeneOntologyParser that filters to only the molecular_function namespace.
- OpenTargetsDiseaseOntologyParser – Parser for the OpenTargets Disease release.
- OpenTargetsTargetOntologyParser – Parser for the OT Target dataset.
- RDFGraphParser – Parser for rdf files.
- SKOSXLGraphParser – Parse SKOS-XL RDF Files.
- StatoParser – Parse stato: input should be an owl file.
- TabularOntologyParser – For already tabulated data.
- UberonOntologyParser – Input should be a UBERON owl file e.g. https://www.ebi.ac.uk/ols/ontologies/uberon.
- class kazu.ontology_preprocessing.parsers.ATCDrugClassificationParser[source]¶
Bases:
TabularOntologyParser
Parser for the ATC Drug classification dataset.
This requires a licence from WHO, available at https://www.who.int/tools/atc-ddd-toolkit/atc-classification .
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
entity_class (str)
name (str)
string_scorer (StringSimilarityScorer | None)
synonym_merge_threshold (float)
data_origin (str)
synonym_generator (CombinatorialSynonymGenerator | None)
autocurator (AutoCurator | None)
global_actions (GlobalParserActions | None)
ontology_downloader (OntologyDownloader | None)
kwargs – passed to pandas.read_csv
- Return type:
None
- parse_to_dataframe()[source]¶
Assume input file is already in correct format.
Inherit and override this method if different behaviour is required.
- Returns:
- Return type:
- levels_to_ignore = {'1', '2', '3'}¶
- class kazu.ontology_preprocessing.parsers.BiologicalProcessGeneOntologyParser[source]¶
Bases:
GeneOntologyParser
A subclass of GeneOntologyParser that filters to only the biological_process namespace.
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details
data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”
curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
- Return type:
None
- class kazu.ontology_preprocessing.parsers.CLOOntologyParser[source]¶
Bases:
RDFGraphParser
Input is a CLO Owl file https://www.ebi.ac.uk/ols/ontologies/clo.
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details
data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”
curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
- Return type:
None
- class kazu.ontology_preprocessing.parsers.CLOntologyParser[source]¶
Bases:
RDFGraphParser
Input should be a CL owl file e.g. https://www.ebi.ac.uk/ols/ontologies/cl.
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, include_entity_patterns=None, exclude_entity_patterns=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details
data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”
curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
include_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)
exclude_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)
- Return type:
None
- class kazu.ontology_preprocessing.parsers.CellosaurusOntologyParser[source]¶
Bases:
OntologyParser
Input is an obo file from cellosaurus, e.g. https://ftp.expasy.org/databases/cellosaurus/cellosaurus.obo.
- parse_to_dataframe()[source]¶
Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns: [IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]
IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology
Note
It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.
Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.
- Return type:
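The ‘long, thin’ contract above can be illustrated with plain pandas. This is a sketch only: the id and labels are invented, and the literal column names stand in for kazu’s IDX, DEFAULT_LABEL, SYN and MAPPING_TYPE constants:

```python
import pandas as pd

# One row per synonym, with the default label also present as a synonym
# of itself - as the Note above requires of implementations.
df = pd.DataFrame(
    {
        "IDX": ["CVCL_XXXX", "CVCL_XXXX"],             # hypothetical ontology id
        "DEFAULT_LABEL": ["some line", "some line"],   # preferred label, repeated per row
        "SYN": ["some line", "a synonym"],             # one synonym per row
        "MAPPING_TYPE": ["name", "synonym"],           # as defined by the source ontology
    }
)
assert list(df.columns) == ["IDX", "DEFAULT_LABEL", "SYN", "MAPPING_TYPE"]
```

Any further columns added to such a frame would end up in the MetadataDatabase as metadata fields for the id.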
- score_and_group_ids(ids_and_source, is_symbolic)[source]¶
Treat all synonyms as separate cell lines.
- Parameters:
- Returns:
- Return type:
tuple[frozenset[EquivalentIdSet], EquivalentIdAggregationStrategy]
- cell_line_re = re.compile('cell line', re.IGNORECASE)¶
- class kazu.ontology_preprocessing.parsers.CellularComponentGeneOntologyParser[source]¶
Bases:
GeneOntologyParser
A subclass of GeneOntologyParser that filters to only the cellular_component namespace.
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details
data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”
curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
- Return type:
None
- class kazu.ontology_preprocessing.parsers.ChemblOntologyParser[source]¶
Bases:
OntologyParser
Input is a directory containing an extracted sqlite dump from Chembl.
Deprecated since version 2.1.0: Use kazu.ontology_preprocessing.parsers.ChemblParquetOntologyParser instead. This is deprecated so we don’t have to store a large sqlite database file in resources.
For example, this can be sourced from: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_33/chembl_33_sqlite.tar.gz.
- parse_to_dataframe()[source]¶
Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns: [IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]
IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology
Note
It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.
Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.
- Return type:
- class kazu.ontology_preprocessing.parsers.ChemblParquetOntologyParser[source]¶
Bases:
OntologyParser
Input is a parquet file containing an extracted sqlite dump from Chembl.
Note
See kazu.ontology_preprocessing.downloads.ChemblParquetOntologyDownloader for how the extraction is performed.
- parse_to_dataframe()[source]¶
Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns: [IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]
IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology
Note
It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.
Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.
- Return type:
- class kazu.ontology_preprocessing.parsers.GeneOntologyParser[source]¶
Bases:
RDFGraphParser
A parser for the Gene Ontology.
Differences from its parent class RDFGraphParser:
Specify an appropriate uri_regex and synonym_predicates for the Gene Ontology.
Drop entities with a default label containing obsolete - see parse_to_dataframe().
Cache the parsing of the rdf file, since we have multiple parsers that use different parts of the ontology. This saves re-parsing the source file multiple times, which is a significant cost as the file is very large and parsing rdf is expensive.
Subclasses of this class like BiologicalProcessGeneOntologyParser filter to a specific ‘namespace’ within the Gene Ontology. These are present as a convenience and for discoverability. It is straightforward to configure a GeneOntologyParser instance to filter to a namespace without subclassing - see the implementation of BiologicalProcessGeneOntologyParser for details.
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, include_entity_patterns=None, exclude_entity_patterns=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details
data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”
curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
include_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)
exclude_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)
- Return type:
None
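The namespace filtering described above (performed by the BiologicalProcessGeneOntologyParser and sibling subclasses) amounts to keeping only terms whose GO namespace matches. A minimal stand-alone sketch of that filtering step, using invented records rather than a real rdf graph:

```python
# Hypothetical GO terms, reduced to plain dicts for illustration.
terms = [
    {"id": "GO:0000001", "label": "some process", "namespace": "biological_process"},
    {"id": "GO:0000002", "label": "some activity", "namespace": "molecular_function"},
]

def filter_to_namespace(terms: list[dict], namespace: str) -> list[dict]:
    """Keep only terms in the requested Gene Ontology namespace."""
    return [t for t in terms if t["namespace"] == namespace]

bp_terms = filter_to_namespace(terms, "biological_process")
assert [t["id"] for t in bp_terms] == ["GO:0000001"]
```

In the real parser this filter is expressed as a pattern over the parsed rdf graph rather than over dicts, which is why a plain GeneOntologyParser instance can be configured to the same effect without subclassing.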
- parse_to_dataframe()[source]¶
A modification of RDFGraphParser.parse_to_dataframe().
The only difference from the overridden method is that this drops entities where the default label contains obsolete, as these are no longer relevant for Gene Ontology NER/Entity Linking.
- Return type:
- static parse_to_graph(in_path)[source]¶
Cached version of RDFGraphParser.parse_to_graph().
Cached due to the expense of parsing Gene Ontology from scratch (otherwise we end up doing this 3 times in the public model pack).
- populate_databases(force=False, return_resources=False)[source]¶
Modified version of OntologyParser.populate_databases() to handle caching.
We have custom logic here to clear the caching on parse_to_graph(), because the size of this cached graph is quite large in memory, and otherwise stays in continued usage throughout the runtime of kazu.
- Parameters:
- Return type:
list[OntologyStringResource] | None
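The caching trade-off described above can be sketched with stdlib tools alone. The function below is hypothetical stand-in code, not kazu’s implementation: it only shows the pattern of caching an expensive parse and explicitly clearing the cache once all consumers are done:

```python
import functools

@functools.cache
def parse_to_graph(in_path: str) -> dict:
    # Stand-in for the expensive rdf parse; cached so several parsers
    # sharing one source file only pay the parsing cost once.
    return {"path": in_path, "triples": []}

g1 = parse_to_graph("go.owl")
g2 = parse_to_graph("go.owl")
assert g1 is g2  # the second call is served from the cache

# Once all parsers have populated their databases, clear the cache so
# the large graph does not stay in memory for the rest of the run.
parse_to_graph.cache_clear()
assert parse_to_graph("go.owl") is not g1
```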
- class kazu.ontology_preprocessing.parsers.HGNCGeneFamilyParser[source]¶
Bases:
OntologyParser
Parse HGNC data and extract only Gene Families as entities.
Input is a json from HGNC. For example, http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/json/hgnc_complete_set.json.
- parse_to_dataframe()[source]¶
Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns: [IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]
IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology
Note
It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.
Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.
- Return type:
- syn_column_keys = {'Common root gene symbol', 'Family alias'}¶
- class kazu.ontology_preprocessing.parsers.HGNCGeneOntologyParser[source]¶
Bases:
OntologyParser
Parse HGNC data and extract individual genes as entities.
Input is a json from HGNC. For example, http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/json/hgnc_complete_set.json.
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single
EquivalentIdSet
. Seescore_and_group_ids()
for further detailsdata_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as is used to identify the origin of a mapping back to a data source
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching
autocurator (AutoCurator | None) – optional
AutoCurator
. An AutoCurator contains a series of heuristics that determines what the default behaviour for aLinkingCandidate
should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”curations_path (str | Path | None) – path to jsonl file of human-curated
OntologyStringResource
s to override the defaults of the parser.global_actions (GlobalParserActions | None) – path to json file of
GlobalParserActions
to apply to the parser.ontology_downloader (OntologyDownloader | None) – optional
OntologyDownloader
to download the ontology data from a remote source.
- Return type:
None
- parse_to_dataframe()[source]¶
Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns: [IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]
IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology
Note
It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.
Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.
- Return type:
- class kazu.ontology_preprocessing.parsers.HPOntologyParser[source]¶
Bases:
RDFGraphParser
- class kazu.ontology_preprocessing.parsers.JsonLinesOntologyParser[source]¶
Bases:
OntologyParser
A parser for a jsonlines dataset.
Assumes one kb entry per line (i.e. json object).
This should be subclassed and subclasses must implement json_dict_to_parser_records().
- json_dict_to_parser_records(jsons_gen)[source]¶
For a given input json (represented as a python dict), yield dictionary record(s) compatible with the expected structure of the Ontology Parser superclass.
This means dictionaries should have keys for SYN, MAPPING_TYPE, DEFAULT_LABEL and IDX. All other keys are used as mapping metadata.
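The transformation a subclass must perform can be sketched without kazu at all. The input field names (id, name, synonyms) are invented for this example; only the four output keys come from the contract above:

```python
import json
from collections.abc import Iterable
from typing import Any

def json_dict_to_parser_records(
    jsons_gen: Iterable[dict[str, Any]],
) -> Iterable[dict[str, Any]]:
    # One kb entry (i.e. one json line) in, one flat record per synonym out.
    for entry in jsons_gen:
        for syn in [entry["name"]] + entry.get("synonyms", []):
            yield {
                "IDX": entry["id"],
                "DEFAULT_LABEL": entry["name"],
                "SYN": syn,
                "MAPPING_TYPE": "name" if syn == entry["name"] else "synonym",
            }

lines = ['{"id": "X:1", "name": "foo", "synonyms": ["bar"]}']
records = list(json_dict_to_parser_records(json.loads(line) for line in lines))
assert len(records) == 2  # the default label itself, plus one synonym
```

Note that the default label is emitted as a synonym of itself, matching the parse_to_dataframe contract.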
- parse_to_dataframe()[source]¶
Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns: [IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]
IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology
Note
It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.
Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.
- Return type:
- class kazu.ontology_preprocessing.parsers.MeddraOntologyParser[source]¶
Bases:
OntologyParser
Input is an unzipped directory of a Meddra release (Note, requires licence).
This should contain the files ‘mdhier.asc’ and ‘llt.asc’.
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None, exclude_socs=('Surgical and medical procedures', 'Social circumstances', 'Investigations'))[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details
data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”
curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
- Return type:
None
- parse_to_dataframe()[source]¶
Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns: [IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]
IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym - e.g. xref, exactSyn etc. Usually defined by the ontology
Note
It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.
Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.
- Return type:
- class kazu.ontology_preprocessing.parsers.MolecularFunctionGeneOntologyParser[source]¶
Bases:
GeneOntologyParser
A subclass of GeneOntologyParser that filters to only the molecular_function namespace.
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold to trigger a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details
data_origin (str) – The origin of this dataset - e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note, this is different from the parser.name, as it is used to identify the origin of a mapping back to a data source
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator. Used to generate synonyms for dictionary based NER matching
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determines what the default behaviour for a LinkingCandidate should be. For example, “Ignore any strings shorter than two characters or longer than 50 characters”, or “use case sensitive matching when the LinkingCandidate is symbolic”
curations_path (str | Path | None) – path to jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
- Return type:
None
- class kazu.ontology_preprocessing.parsers.MondoOntologyParser[source]¶
Bases: OntologyParser
- parse_to_dataframe()[source]¶
Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns: [IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]
IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym, e.g. xref, exactSyn etc. Usually defined by the ontology.
Note
It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.
Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.
- Return type:
pandas.DataFrame
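To make the expected format concrete, here is a minimal sketch of such a dataframe. The values are illustrative, and the literal column-name strings stand in for kazu's IDX, DEFAULT_LABEL, SYN and MAPPING_TYPE constants:

```python
import pandas as pd

# Illustrative values only -- not real ontology records. Note that the
# default label also appears as a synonym of itself, per the contract above.
df = pd.DataFrame(
    {
        "IDX": ["MONDO_0000001", "MONDO_0000001"],
        "DEFAULT_LABEL": ["disease", "disease"],
        "SYN": ["disease", "disorder"],
        "MAPPING_TYPE": ["name", "hasExactSynonym"],
    }
)
```

Any additional columns beyond these four would become metadata fields for the id.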
- class kazu.ontology_preprocessing.parsers.OpenTargetsDiseaseOntologyParser[source]¶
Bases:
JsonLinesOntologyParser
Parser for OpenTargets Disease release.
OpenTargets has a lot of entities in its disease dataset, not all of which are diseases. Here, we use the
allowed_therapeutic_areas
argument to describe which specific therapeutic areas a given instance of this parser should use. See https://platform-docs.opentargets.org/disease-or-phenotype for more info.- __init__(in_path, entity_class, name, allowed_therapeutic_areas, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
entity_class (str)
name (str)
allowed_therapeutic_areas (Iterable[str]) – therapeutic areas to use in this instance. These are IDs in OpenTargets’ format, like MONDO_0024458.
string_scorer (StringSimilarityScorer | None)
synonym_merge_threshold (float)
data_origin (str)
synonym_generator (CombinatorialSynonymGenerator | None)
autocurator (AutoCurator | None)
global_actions (GlobalParserActions | None)
ontology_downloader (OntologyDownloader | None)
- json_dict_to_parser_records(jsons_gen)[source]¶
For a given input json (represented as a python dict), yield dictionary record(s) compatible with the expected structure of the Ontology Parser superclass.
This means dictionaries should have keys for SYN, MAPPING_TYPE, DEFAULT_LABEL and IDX. All other keys are used as mapping metadata.
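A hypothetical override might look like the sketch below. The input field names (id, name, synonyms) and the literal output key strings (standing in for kazu's IDX, DEFAULT_LABEL, SYN and MAPPING_TYPE constants) are illustrative assumptions, not the exact schema:

```python
from collections.abc import Iterable, Iterator
from typing import Any

def disease_json_to_records(
    jsons_gen: Iterable[dict[str, Any]],
) -> Iterator[dict[str, str]]:
    """Yield one record per synonym, in the shape the superclass expects."""
    for json_dict in jsons_gen:
        idx = json_dict["id"]
        default_label = json_dict["name"]
        # the default label must also be yielded as a synonym of itself
        pairs = [(default_label, "name")]
        pairs += [(syn, "synonym") for syn in json_dict.get("synonyms", [])]
        for syn, mapping_type in pairs:
            yield {
                "IDX": idx,
                "DEFAULT_LABEL": default_label,
                "SYN": syn,
                "MAPPING_TYPE": mapping_type,
            }
```

In the real parser, any keys beyond these four would be carried through as mapping metadata.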
- score_and_group_ids(ids_and_source, is_symbolic)[source]¶
Group disease IDs via cross-reference.
Falls back to superclass implementation if any xrefs are inconsistently described.
- Parameters:
- Returns:
- Return type:
tuple[frozenset[EquivalentIdSet], EquivalentIdAggregationStrategy]
- DF_XREF_FIELD_NAME = 'dbXRefs'¶
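The grouping idea can be sketched as follows: IDs that share any cross-reference end up in the same group. This is a simplified, self-contained illustration of the approach (union-find over shared xrefs), not kazu's actual implementation, which also tracks sources and aggregation strategies:

```python
# Sketch: group ontology IDs that share at least one cross-reference.
def group_by_xref(xrefs_by_id: dict[str, set[str]]) -> list[frozenset[str]]:
    parent: dict[str, str] = {idx: idx for idx in xrefs_by_id}

    def find(i: str) -> str:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    first_owner: dict[str, str] = {}  # xref -> first id that carried it
    for idx, xrefs in xrefs_by_id.items():
        for xref in xrefs:
            if xref in first_owner:
                union(idx, first_owner[xref])
            else:
                first_owner[xref] = idx

    groups: dict[str, set[str]] = {}
    for idx in xrefs_by_id:
        groups.setdefault(find(idx), set()).add(idx)
    return [frozenset(group) for group in groups.values()]
```

The "falls back to superclass" behaviour above would apply when the xref data is internally inconsistent, which this sketch does not attempt to detect.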
- class kazu.ontology_preprocessing.parsers.OpenTargetsMoleculeOntologyParser[source]¶
Bases:
JsonLinesOntologyParser
- json_dict_to_parser_records(jsons_gen)[source]¶
For a given input json (represented as a python dict), yield dictionary record(s) compatible with the expected structure of the Ontology Parser superclass.
This means dictionaries should have keys for SYN, MAPPING_TYPE, DEFAULT_LABEL and IDX. All other keys are used as mapping metadata.
- class kazu.ontology_preprocessing.parsers.OpenTargetsTargetOntologyParser[source]¶
Bases:
JsonLinesOntologyParser
Parser for the OT Target dataset.
Note
Automatically ignored records
Since there are many thousands of Ensembl IDs that reference uninteresting genomic locations, we will likely never see them in natural language. Therefore, records that do not have an approved symbol defined are automatically filtered out. In addition, this class lets you filter out biotypes you are not interested in via the excluded_biotypes argument of the constructor.
- __init__(in_path, entity_class, name, excluded_biotypes=None, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
entity_class (str)
name (str)
excluded_biotypes (Iterable[str] | None) – if specified, ignore these biotypes. Note that an empty string “” is a biotype in OT for some reason…
string_scorer (StringSimilarityScorer | None)
synonym_merge_threshold (float)
data_origin (str)
synonym_generator (CombinatorialSynonymGenerator | None)
autocurator (AutoCurator | None)
global_actions (GlobalParserActions | None)
ontology_downloader (OntologyDownloader | None)
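The filtering described in the note above can be sketched as a simple predicate. The field names approvedSymbol and biotype are assumptions based on the OpenTargets target schema; treat this as an illustration rather than kazu's actual code:

```python
from typing import Any

def keep_record(record: dict[str, Any], excluded_biotypes: set[str]) -> bool:
    # Drop records with no approved symbol: such Ensembl IDs rarely
    # appear in natural language text.
    if not record.get("approvedSymbol"):
        return False
    # Drop user-excluded biotypes (remember "" is itself a biotype in OT).
    return record.get("biotype") not in excluded_biotypes
```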
- json_dict_to_parser_records(jsons_gen)[source]¶
For a given input json (represented as a python dict), yield dictionary record(s) compatible with the expected structure of the Ontology Parser superclass.
This means dictionaries should have keys for SYN, MAPPING_TYPE, DEFAULT_LABEL and IDX. All other keys are used as mapping metadata.
- score_and_group_ids(ids_and_source, is_symbolic)[source]¶
Group Ensembl gene IDs belonging to the same gene.
Note for non-biologists about genes
The concept of a ‘gene’ is complex, and Ensembl gene IDs actually refer to locations on the genome rather than individual genes. In fact, one ‘gene’ can be made up of multiple Ensembl gene IDs; generally speaking, these are exons that produce different isoforms of a given protein.
- Parameters:
- Returns:
- Return type:
tuple[frozenset[EquivalentIdSet], EquivalentIdAggregationStrategy]
- ANNOTATION_FIELDS = {'chemicalProbes', 'constraint', 'functionDescriptions', 'go', 'hallmarks', 'pathways', 'safetyLiabilities', 'subcellularLocations', 'targetClass', 'tractability'}¶
- class kazu.ontology_preprocessing.parsers.RDFGraphParser[source]¶
Bases:
OntologyParser
Parser for rdf files.
Supports any RDF format whose type rdflib.Graph.parse() can infer from the file extension, e.g. .xml, .ttl, .owl, .json. The case of the extension does not matter. This functionality is handled by rdflib.util.guess_format(), but the parser will fall back to attempting to parse as turtle/ttl format in the case of an unknown file extension.
- __init__(in_path, entity_class, name, uri_regex, synonym_predicates, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, include_entity_patterns=None, exclude_entity_patterns=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None, label_predicate=rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'))[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold that triggers a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details.
data_origin (str) – the origin of this dataset, e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, as it is used to identify the origin of a mapping back to a data source.
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator, used to generate synonyms for dictionary-based NER matching.
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine the default behaviour for a LinkingCandidate. For example, “ignore any strings shorter than two characters or longer than 50 characters”, or “use case-sensitive matching when the LinkingCandidate is symbolic”.
curations_path (str | Path | None) – path to a jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to a json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
include_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)
exclude_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)
- Return type:
None
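The extension-based dispatch with a turtle fallback, described in the class docstring above, can be sketched as below. The extension-to-format mapping shown is an illustrative assumption; in RDFGraphParser the real work is delegated to rdflib.util.guess_format():

```python
from pathlib import Path

# Illustrative subset of an extension -> rdflib format mapping.
_EXT_TO_FORMAT = {".xml": "xml", ".owl": "xml", ".ttl": "turtle", ".json": "json-ld"}

def guess_rdf_format(in_path: str) -> str:
    ext = Path(in_path).suffix.lower()  # case of the extension does not matter
    # unknown extensions fall back to turtle/ttl
    return _EXT_TO_FORMAT.get(ext, "turtle")
```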
- static convert_to_rdflib_ref(pred: Path) Path [source]¶
- static convert_to_rdflib_ref(pred: Node) Node
- static convert_to_rdflib_ref(pred: str) URIRef
- find_kb(string)[source]¶
By default, just return the name of the parser.
If more complex behaviour is necessary, write a custom subclass and override this method.
- is_valid_iri(text)[source]¶
Check if input string is a valid IRI for the ontology being parsed.
Uses self._uri_regex to define valid IRIs.
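For illustration, a standalone version of this check might look like the following. The regex shown is an assumed example for MONDO-style IRIs, not a pattern shipped with kazu:

```python
import re

# Hypothetical uri_regex, standing in for the parser's self._uri_regex.
uri_regex = re.compile(r"^http://purl\.obolibrary\.org/obo/MONDO_[0-9]+$")

def is_valid_iri(text: str) -> bool:
    # Only strings fully matching the pattern count as in-ontology IRIs.
    return bool(uri_regex.match(text))
```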
- parse_to_dataframe()[source]¶
Implementations should override this method, returning a ‘long, thin’ pandas.DataFrame of at least the following columns: [IDX, DEFAULT_LABEL, SYN, MAPPING_TYPE]
IDX: the ontology id
DEFAULT_LABEL: the preferred label
SYN: a synonym of the concept
MAPPING_TYPE: the type of mapping from default label to synonym, e.g. xref, exactSyn etc. Usually defined by the ontology.
Note
It is the responsibility of the implementation of parse_to_dataframe to add default labels as synonyms.
Any ‘extra’ columns will be added to the MetadataDatabase as metadata fields for the given id in the relevant ontology.
- Return type:
pandas.DataFrame
- static parse_to_graph(in_path)[source]¶
Parse the given input path using rdflib.
Called by parse_to_dataframe(); this is a separate method to allow overriding to tweak the parsing process, such as adding caching (as in parse_to_graph()).
- class kazu.ontology_preprocessing.parsers.SKOSXLGraphParser[source]¶
Bases:
RDFGraphParser
Parse SKOS-XL RDF Files.
Note that this just sets a default label predicate and synonym predicate to SKOS-XL appropriate paths, and then passes through to the parent RDFGraphParser class. This class is just a convenience to make specifying a SKOS-XL parser easier; the same functionality is available via RDFGraphParser directly.
- __init__(in_path, entity_class, name, uri_regex, synonym_predicates=(Path(http://www.w3.org/2008/05/skos-xl#altLabel / http://www.w3.org/2008/05/skos-xl#literalForm), ), string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, include_entity_patterns=None, exclude_entity_patterns=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None, label_predicate=Path(http://www.w3.org/2008/05/skos-xl#prefLabel / http://www.w3.org/2008/05/skos-xl#literalForm))[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold that triggers a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details.
data_origin (str) – the origin of this dataset, e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, as it is used to identify the origin of a mapping back to a data source.
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator, used to generate synonyms for dictionary-based NER matching.
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine the default behaviour for a LinkingCandidate. For example, “ignore any strings shorter than two characters or longer than 50 characters”, or “use case-sensitive matching when the LinkingCandidate is symbolic”.
curations_path (str | Path | None) – path to a jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to a json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
include_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)
exclude_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)
- Return type:
None
- class kazu.ontology_preprocessing.parsers.StatoParser[source]¶
Bases:
RDFGraphParser
Parse STATO: input should be an owl file.
Available at e.g. https://www.ebi.ac.uk/ols/ontologies/stato .
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, include_entity_patterns=None, exclude_entity_patterns=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold that triggers a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details.
data_origin (str) – the origin of this dataset, e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, as it is used to identify the origin of a mapping back to a data source.
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator, used to generate synonyms for dictionary-based NER matching.
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine the default behaviour for a LinkingCandidate. For example, “ignore any strings shorter than two characters or longer than 50 characters”, or “use case-sensitive matching when the LinkingCandidate is symbolic”.
curations_path (str | Path | None) – path to a jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to a json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
include_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)
exclude_entity_patterns (Iterable[tuple[Path | Node | str, Node]] | None)
- Return type:
None
- class kazu.ontology_preprocessing.parsers.TabularOntologyParser[source]¶
Bases:
OntologyParser
For already tabulated data.
This expects in_path to be the path to a file that can be loaded by pandas.read_csv() (e.g. a .csv or .tsv file), and the result to be in the format produced by parse_to_dataframe() - see the docs of that method for more details on the format of this dataframe.
Note that this class’s __init__ method takes a **kwargs parameter which is passed through to pandas.read_csv(), giving you a notable degree of flexibility in how exactly the input file is converted into this dataframe. If the required **kwargs become complex, it may be worth considering Writing a Custom Parser.
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None, **kwargs)[source]¶
- Parameters:
entity_class (str)
name (str)
string_scorer (StringSimilarityScorer | None)
synonym_merge_threshold (float)
data_origin (str)
synonym_generator (CombinatorialSynonymGenerator | None)
autocurator (AutoCurator | None)
global_actions (GlobalParserActions | None)
ontology_downloader (OntologyDownloader | None)
kwargs (Any) – passed to pandas.read_csv
- Return type:
None
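As a sketch of how the **kwargs pass-through works: the keyword arguments you give the parser go straight to pandas.read_csv(). The file content below is invented, and is read from an in-memory buffer rather than in_path for the sake of a self-contained example:

```python
import io

import pandas as pd

# Illustrative TSV content; a real file would be supplied via in_path.
tsv = "IDX\tDEFAULT_LABEL\tSYN\tMAPPING_TYPE\nID:1\tfoo\tFOO\texactSyn\n"

# sep="\t" and dtype=str are the kind of kwargs you might pass through
# to TabularOntologyParser for a tab-separated input file.
df = pd.read_csv(io.StringIO(tsv), sep="\t", dtype=str)
```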
- class kazu.ontology_preprocessing.parsers.UberonOntologyParser[source]¶
Bases:
RDFGraphParser
Input should be a UBERON owl file, e.g. https://www.ebi.ac.uk/ols/ontologies/uberon.
- __init__(in_path, entity_class, name, string_scorer=None, synonym_merge_threshold=0.7, data_origin='unknown', synonym_generator=None, autocurator=None, curations_path=None, global_actions=None, ontology_downloader=None)[source]¶
- Parameters:
in_path (str | Path) – Path to some resource that should be processed (e.g. owl file, db config, tsv etc)
entity_class (str) – The entity class to associate with this parser throughout the pipeline. Also used in the parser when calling StringNormalizer to determine the class-appropriate behaviour.
name (str) – A string to represent a parser in the overall pipeline. Should be globally unique
string_scorer (StringSimilarityScorer | None) – Optional protocol of StringSimilarityScorer. Used for resolving ambiguous symbolic synonyms via similarity calculation of the default label associated with the conflicted labels. If no instance is provided, all synonym conflicts will be assumed to refer to different concepts. This is not recommended!
synonym_merge_threshold (float) – similarity threshold that triggers a merge of conflicted synonyms into a single EquivalentIdSet. See score_and_group_ids() for further details.
data_origin (str) – the origin of this dataset, e.g. HGNC release 2.1, MEDDRA 24.1 etc. Note that this is different from parser.name, as it is used to identify the origin of a mapping back to a data source.
synonym_generator (CombinatorialSynonymGenerator | None) – optional CombinatorialSynonymGenerator, used to generate synonyms for dictionary-based NER matching.
autocurator (AutoCurator | None) – optional AutoCurator. An AutoCurator contains a series of heuristics that determine the default behaviour for a LinkingCandidate. For example, “ignore any strings shorter than two characters or longer than 50 characters”, or “use case-sensitive matching when the LinkingCandidate is symbolic”.
curations_path (str | Path | None) – path to a jsonl file of human-curated OntologyStringResources to override the defaults of the parser.
global_actions (GlobalParserActions | None) – path to a json file of GlobalParserActions to apply to the parser.
ontology_downloader (OntologyDownloader | None) – optional OntologyDownloader to download the ontology data from a remote source.
- Return type:
None