kazu.steps.ner.opsin

Classes

OpsinStep

A Step that calls Opsin (Open Parser for Systematic IUPAC Nomenclature) over py4j.

class kazu.steps.ner.opsin.OpsinStep[source]

Bases: Step

A Step that calls Opsin (Open Parser for Systematic IUPAC Nomenclature) over py4j.

TransformersModelForTokenClassificationNerStep often identifies IUPAC chemical nomenclature strings as Entitys with an entity_class of drug, but these entities fail to map to any of the drug parsers as no synonym is present. This step provides an extra way to resolve these chemical entities.

Opsin produces a SMILES from an IUPAC string and we use rdkit to convert that to a canonical SMILES to allow comparison between entities. This step then produces a Mapping with the canonical SMILES string as the idx.

Adding ${OpsinStep} just after ${MappingStep} in kazu/conf/Pipeline/default.yaml will enable this step.

Attention

To use this step, you will need py4j and rdkit installed. They are not installed as part of the default kazu install because this step isn’t used in the default pipeline.

You can either do:

$ pip install py4j rdkit

Or you can install required dependencies for all steps included in kazu with:

$ pip install kazu[all-steps]

Note

The nature of this functionality is considered experimental and we may split it into two steps in the future, without making a major release. If you are using or are interested in using this step, please open a GitHub issue.

Full details of possible change

In particular, this step does two things:

  1. Adjust incorrect NER boundaries (particularly coming from the TransformersModelForTokenClassificationNerStep)

  2. Link drug entities consisting of IUPAC strings to a canonical SMILES

The second of these aligns closely with the ‘linking’ stage in kazu, along with DictionaryEntityLinkingStep. Ideally, we would wrap the logic of 1. into TransformersModelForTokenClassificationNerStep (as is done with the NonContiguousEntitySplitter) to fix these issues everywhere, and keep 2. as a standalone linking step. However, this would require changes to the MappingLogic, and it may be tricky to decouple 1. and 2.: the boundary adjustment currently relies on running Opsin to be accurate, and we would like to avoid running Opsin twice. This may justify leaving this as a single step.

Examples

Each IUPAC Input is followed by its indented SMILES Output.

IUPAC Input
    SMILES Output

Bicyclo[3.2.1]octane
    C1CC2CCC(C1)C2

2,2'-ethylenedipyridine
    c1ccc(CCc2ccccn2)nc1

Benzo[1",2":3,4;4",5":3',4']dicyclobuta[1,2-b:1',2'-c']difuran
    c1cc2c3cc4c5cocc5c4cc3c2o1

Cyclohexanone ethyl methyl ketal
    CCOC1(OC)CCCCC1

4-[2-(2-chloro-4-fluoroanilino)-5-methylpyrimidin-4-yl]-N-[(1S)-1-(3-chlorophenyl)-2-hydroxyethyl]-1H-pyrrole-2-carboxamide
    Cc1cnc(Nc2ccc(F)cc2Cl)nc1-c1c[nH]c(C(=O)N[C@H](CO)c2cccc(Cl)c2)c1

7-cyclopentyl-5-(4-methoxyphenyl)pyrrolo[2,3-d]pyrimidin-4-amine
    COc1ccc(-c2cn(C3CCCC3)c3ncnc(N)c23)cc1

[(3S,3aS,6R,6aS)-3-nitrooxy-2,3,3a,5,6,6a-hexahydrofuro[3,2-b]furan-6-yl] nitrate (see PubChem)
    O=[N+]([O-])O[C@H]1CO[C@H]2[C@@H]1OC[C@H]2O[N+](=O)[O-]

1,4:3,6-dianhydro-2,5-di-O-Nitro-D-glucitol
    Opsin fails to parse this. As a result, the Step will not produce a Mapping.

Paper:

Daniel M. Lowe, Peter T. Corbett, Peter Murray-Rust, and Robert C. Glen
Chemical Name to Structure: OPSIN, an Open Source Solution
Journal of Chemical Information and Modeling 2011 51 (3), 739-753
Bibtex Citation Details
@article{doi:10.1021/ci100384d,
    author = {Lowe, Daniel M. and Corbett, Peter T. and Murray-Rust, Peter and Glen, Robert C.},
    title = {Chemical Name to Structure: OPSIN, an Open Source Solution},
    journal = {Journal of Chemical Information and Modeling},
    volume = {51},
    number = {3},
    pages = {739-753},
    year = {2011},
    doi = {10.1021/ci100384d},
    note = {PMID: 21384929},
    URL = {https://doi.org/10.1021/ci100384d},
    eprint = {https://doi.org/10.1021/ci100384d}
}

__call__(doc)[source]

Process documents and respond with processed and failed documents.

Note that many steps will be decorated by document_iterating_step() or document_batch_step() which will modify the ‘original’ __call__ function signature to match the expected signature for a step, as the decorators handle the exception/failed documents logic for you.

Parameters:
Returns:

The first element is all the provided docs (now modified by the processing), the second is the docs that failed to (fully) process correctly.

Return type:

tuple[list[Document], list[Document]]
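
To illustrate the return contract described above, here is a minimal sketch of how a decorator could lift a per-document method to the (all docs, failed docs) signature. It is not kazu's implementation of document_iterating_step(); the names Doc, iterating_step and UppercaseStep are hypothetical stand-ins invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Doc:
    """Hypothetical stand-in for kazu's Document."""

    text: str
    entities: list[str] = field(default_factory=list)


def iterating_step(per_doc: Callable[[object, Doc], None]):
    """Illustrative decorator: wrap a per-document method so that it accepts
    many documents and returns (all docs, docs that raised)."""

    def wrapper(self, docs: Iterable[Doc]) -> tuple[list[Doc], list[Doc]]:
        all_docs = list(docs)
        failed: list[Doc] = []
        for doc in all_docs:
            try:
                per_doc(self, doc)  # modifies the document in place
            except Exception:
                failed.append(doc)  # record the failure and keep going
        return all_docs, failed

    return wrapper


class UppercaseStep:
    """Toy step: the author only writes the single-document logic."""

    @iterating_step
    def __call__(self, doc: Doc) -> None:
        if not doc.text:
            raise ValueError("empty document")
        doc.entities.append(doc.text.upper())


docs = [Doc("aspirin"), Doc("")]
processed, failed = UppercaseStep()(docs)
```

Here `processed` contains both documents, while `failed` contains only the empty one, which mirrors the tuple described under Returns.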

__init__(entity_class, opsin_fatjar_path, java_home, condition=None)[source]
Parameters:
  • entity_class (str) – search entities of this class for resolvable IUPAC string

  • opsin_fatjar_path (str) – path to a py4j fatjar, containing OPSIN dependencies

  • java_home (str) – path to installed java runtime

  • condition (Callable[[Document], bool] | None) – an optional callable that decides whether a document is processed. Since OPSIN can be slow, this can be used to skip any documents that don’t contain pre-existing drug entities
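
As a rough illustration of the condition parameter, the sketch below defines a callable that returns True only for documents that already contain a drug entity. FakeEntity and FakeDocument are hypothetical stand-ins so that the example runs without kazu installed, and the get_entities() accessor is an assumption about the Document interface.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEntity:
    """Hypothetical stand-in for kazu's Entity."""

    entity_class: str


@dataclass
class FakeDocument:
    """Hypothetical stand-in for kazu's Document."""

    entities: list[FakeEntity] = field(default_factory=list)

    def get_entities(self) -> list[FakeEntity]:
        # Assumed accessor; check the real Document API before relying on it.
        return self.entities


def has_drug_entities(doc: FakeDocument) -> bool:
    """Return True if the document holds at least one 'drug' entity."""
    return any(ent.entity_class == "drug" for ent in doc.get_entities())
```

A callable of this shape could then be passed as condition=has_drug_entities when constructing the step, so that documents without drug entities are skipped.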

static extendString(ent, section, spaces=0)[source]

Extend a possible IUPAC Entity match to longer plausible strings.

TransformersModelForTokenClassificationNerStep tends to truncate IUPAC matches at the first hyphen. Here, we attempt to extend the match to capture the full IUPAC string.

Parameters:
  • ent (Entity) – the entity we suspect contains (some of) an IUPAC string

  • section (str) – the section the entity is contained in

  • spaces (int) – the maximum number of IUPAC spaces/breaks to ‘extend’ the match through

Returns:

plausible IUPAC strings with section start and end positions, ordered from longest to shortest.

Return type:

Iterable[tuple[str, int, int]]
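
The idea of extending a truncated match can be sketched as follows. This is a simplified illustration under stated assumptions, not the logic kazu uses: the function name extend_candidates is hypothetical, it takes raw character offsets instead of an Entity, and it treats any whitespace as a token boundary.

```python
from typing import Iterator


def extend_candidates(
    section: str, start: int, end: int, spaces: int = 0
) -> Iterator[tuple[str, int, int]]:
    """Yield (text, start, end) candidates around section[start:end],
    ordered from longest to shortest."""
    # Grow the span outwards to the nearest whitespace, so that a match
    # cut off at a hyphen regains the rest of its token.
    left = start
    while left > 0 and not section[left - 1].isspace():
        left -= 1
    right = end
    while right < len(section) and not section[right].isspace():
        right += 1

    # Collect further end positions by stepping through up to `spaces`
    # single spaces, taking one extra token each time.
    ends = [right]
    for _ in range(spaces):
        pos = ends[-1]
        # Stop unless a single space is followed by another token.
        if pos + 1 >= len(section) or section[pos] != " " or section[pos + 1].isspace():
            break
        pos += 1
        while pos < len(section) and not section[pos].isspace():
            pos += 1
        ends.append(pos)

    for stop in reversed(ends):
        yield section[left:stop], left, stop


section = "Cyclohexanone ethyl methyl ketal was added."
for text, s, e in extend_candidates(section, 0, 13, spaces=3):
    print(s, e, text)
```

Yielding the longest candidate first means a caller can stop at the first string that parses, which favours the most complete name.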

parseString(name)[source]

Attempt to parse a potential IUPAC drug name into a Mapping.

Call Opsin with a potential IUPAC drug name to see if it parses. Opsin is fast enough that we can afford to try many potential candidates.

If Opsin parses the string successfully, we convert the resulting SMILES into a canonical SMILES using rdkit, and produce a Mapping with the canonical SMILES as the idx.

If Opsin parsing fails, None is returned. If the log level is set to debug, the error from Opsin for the parsing failure will be logged.

Parameters:

name (str) – the potential IUPAC drug name

Returns:

Return type:

Mapping | None
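
The "try many candidates, return None on failure" behaviour can be sketched as below. This is an illustration only: fake_parse is a hypothetical stand-in for the Opsin call (a lookup of two names from the Examples section), first_parsable is an invented helper, and the real step returns a Mapping rather than a raw SMILES string.

```python
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger(__name__)

# Stand-in for Opsin: two name/SMILES pairs taken from the Examples section.
KNOWN_SMILES = {
    "Bicyclo[3.2.1]octane": "C1CC2CCC(C1)C2",
    "Cyclohexanone ethyl methyl ketal": "CCOC1(OC)CCCCC1",
}


def fake_parse(name: str) -> Optional[str]:
    """Return a SMILES string for a known name, or None (logging at debug)."""
    smiles = KNOWN_SMILES.get(name)
    if smiles is None:
        logger.debug("could not parse %r", name)
    return smiles


def first_parsable(
    candidates: Iterable[tuple[str, int, int]],
    parse: Callable[[str], Optional[str]],
) -> Optional[tuple[str, int, int, str]]:
    """Try each (text, start, end) candidate in order and return the first
    one that parses, together with its parse result."""
    for text, start, end in candidates:
        result = parse(text)
        if result is not None:
            return text, start, end, result
    return None


candidates = [
    ("Cyclohexanone ethyl methyl ketal was", 0, 36),
    ("Cyclohexanone ethyl methyl ketal", 0, 32),
    ("Cyclohexanone", 0, 13),
]
match = first_parsable(candidates, fake_parse)
```

Because the candidates are ordered from longest to shortest, the over-extended first string is rejected and the full name is matched before the shorter fragment is tried.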