kazu.steps.ner.opsin
Classes
OpsinStep: A Step that calls Opsin (Open Parser for Systematic IUPAC Nomenclature) over py4j.
- class kazu.steps.ner.opsin.OpsinStep
Bases: Step
A Step that calls Opsin (Open Parser for Systematic IUPAC Nomenclature) over py4j.
TransformersModelForTokenClassificationNerStep often identifies IUPAC chemical nomenclature strings as Entity objects with an entity_class of drug, but these entities fail to map to any of the drug parsers as no synonym is present. This step provides an extra way to resolve these chemical entities.
Opsin produces a SMILES from an IUPAC string and we use rdkit to convert that to a canonical SMILES to allow comparison between entities. This step then produces a Mapping with the canonical SMILES string as the idx.
Adding ${OpsinStep} just after ${MappingStep} in kazu/conf/Pipeline/default.yaml will enable this step.
Attention
To use this step, you will need py4j and rdkit installed. Neither is installed as part of the default kazu install, because this step isn’t used as part of the default pipeline.
You can either do:
$ pip install py4j rdkit
Or you can install required dependencies for all steps included in kazu with:
$ pip install kazu[all-steps]
Note
The nature of this functionality is considered experimental and we may split it into two steps in the future, without making a major release. If you are using or are interested in using this step, please open a GitHub issue.
Full details of possible change
In particular, this step does two things:
1. Adjust incorrect NER boundaries (particularly those coming from the TransformersModelForTokenClassificationNerStep)
2. Link drug entities consisting of IUPAC strings to a canonical SMILES
The second of these aligns closely with the ‘linking’ stage in kazu, along with DictionaryEntityLinkingStep. We would ideally like to wrap the logic of 1. above into TransformersModelForTokenClassificationNerStep, as is done with the NonContiguousEntitySplitter, to fix these issues everywhere, and have 2. as a standalone linking step. However, this will require changes to the MappingLogic, and it may be tricky to de-couple 1 & 2: the way this step currently adjusts boundaries depends on running Opsin to do so accurately, and we would like to avoid running Opsin twice, which may justify leaving this as a single step.
Examples
| IUPAC Input | SMILES Output |
| --- | --- |
| Bicyclo[3.2.1]octane | C1CC2CCC(C1)C2 |
| 2,2'-ethylenedipyridine | c1ccc(CCc2ccccn2)nc1 |
| Benzo[1'',2'':3,4;4'',5'':3',4']dicyclobuta[1,2-b:1',2'-c']difuran | c1cc2c3cc4c5cocc5c4cc3c2o1 |
| Cyclohexanone ethyl methyl ketal | CCOC1(OC)CCCCC1 |
| 4-[2-(2-chloro-4-fluoroanilino)-5-methylpyrimidin-4-yl]-N-[(1S)-1-(3-chlorophenyl)-2-hydroxyethyl]-1H-pyrrole-2-carboxamide | Cc1cnc(Nc2ccc(F)cc2Cl)nc1-c1c[nH]c(C(=O)N[C@H](CO)c2cccc(Cl)c2)c1 |
| 7-cyclopentyl-5-(4-methoxyphenyl)pyrrolo[2,3-d]pyrimidin-4-amine | COc1ccc(-c2cn(C3CCCC3)c3ncnc(N)c23)cc1 |
| [(3S,3aS,6R,6aS)-3-nitrooxy-2,3,3a,5,6,6a-hexahydrofuro[3,2-b]furan-6-yl] nitrate (see pubchem) | O=[N+]([O-])O[C@H]1CO[C@H]2[C@@H]1OC[C@H]2O[N+](=O)[O-] |
| 1,4:3,6-dianhydro-2,5-di-O-Nitro-D-glucitol | Opsin fails to parse this. As a result, the Step will not produce a Mapping. |
Paper:
Daniel M. Lowe, Peter T. Corbett, Peter Murray-Rust, and Robert C. Glen. Chemical Name to Structure: OPSIN, an Open Source Solution. Journal of Chemical Information and Modeling 2011 51 (3), 739-753. DOI: 10.1021/ci100384d
Bibtex Citation Details
@article{doi:10.1021/ci100384d,
  author = {Lowe, Daniel M. and Corbett, Peter T. and Murray-Rust, Peter and Glen, Robert C.},
  title = {Chemical Name to Structure: OPSIN, an Open Source Solution},
  journal = {Journal of Chemical Information and Modeling},
  volume = {51},
  number = {3},
  pages = {739-753},
  year = {2011},
  doi = {10.1021/ci100384d},
  note = {PMID: 21384929},
  URL = {https://doi.org/10.1021/ci100384d},
  eprint = {https://doi.org/10.1021/ci100384d}
}
- __call__(doc)
Process documents and respond with processed and failed documents.
Note that many steps will be decorated by document_iterating_step() or document_batch_step(), which will modify the ‘original’ __call__ function signature to match the expected signature for a step, as the decorators handle the exception/failed documents logic for you.
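The decorator behaviour described above can be illustrated with a simplified sketch. This is not kazu's implementation: `iterating_step_sketch`, `Doc` and `UppercaseStep` are hypothetical names invented for illustration, and returning every document in the first list (including failed ones) is an assumption of this sketch.

```python
import functools
from typing import Callable, List, Tuple


class Doc:
    """Hypothetical stand-in for kazu's Document; it only holds text here."""

    def __init__(self, text: str) -> None:
        self.text = text


def iterating_step_sketch(per_doc: Callable) -> Callable:
    """Wrap a per-document method so the step accepts and returns document lists."""

    @functools.wraps(per_doc)
    def step_call(self, docs: List[Doc]) -> Tuple[List[Doc], List[Doc]]:
        failed: List[Doc] = []
        for doc in docs:
            try:
                per_doc(self, doc)
            except Exception:
                # Record the failure and carry on with the remaining documents.
                failed.append(doc)
        return docs, failed

    return step_call


class UppercaseStep:
    """Toy step: the author writes only the per-document logic."""

    @iterating_step_sketch
    def __call__(self, doc: Doc) -> None:
        if not doc.text:
            raise ValueError("empty document")
        doc.text = doc.text.upper()


docs = [Doc("aspirin"), Doc("")]
processed, failed = UppercaseStep()(docs)
```

The method is written as if it took one document, but after decoration it is called with a list and returns two lists, which is why the documented signature can differ from the one in the source.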
- __init__(entity_class, opsin_fatjar_path, java_home, condition=None)
- Parameters:
entity_class (str) – search entities of this class for resolvable IUPAC string
opsin_fatjar_path (str) – path to a py4j fatjar, containing OPSIN dependencies
java_home (str) – path to installed java runtime
condition (Callable[[Document], bool] | None) – Since OPSIN can be slow, we can optionally specify a callable, so that any documents that don’t contain pre-existing drug entities are not processed
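A condition callable of the kind described for the last parameter might look like the sketch below. `SketchEntity` and `SketchDocument` are hypothetical stand-ins so that the example runs without kazu installed; how entities are reached on a real kazu Document is an assumption and should be checked against kazu's data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchEntity:
    """Hypothetical stand-in for a kazu Entity."""

    match: str
    entity_class: str


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a kazu Document."""

    entities: List[SketchEntity] = field(default_factory=list)


def has_drug_entities(doc: SketchDocument) -> bool:
    """Return True only if an earlier step already tagged a drug entity."""
    return any(ent.entity_class == "drug" for ent in doc.entities)


with_drug = SketchDocument([SketchEntity("2,2'-ethylenedipyridine", "drug")])
without_drug = SketchDocument([SketchEntity("BRCA1", "gene")])
```

A function with this shape would be passed as `condition=has_drug_entities`, so that documents without drug entities are skipped.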
- static extendString(ent, section, spaces=0)
Extend a possible IUPAC Entity match to longer plausible strings.
TransformersModelForTokenClassificationNerStep tends to truncate IUPAC matches at the first hyphen. Here, we attempt to extend the match to capture the full IUPAC string.
- Parameters:
- Returns:
plausible IUPAC strings with section start and end positions, ordered from longest to shortest.
- Return type:
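To make the idea of extending a truncated match concrete, here is a minimal sketch that works on plain strings and offsets. It is not kazu's algorithm: `extend_candidates` is a hypothetical function, and treating `spaces` as the number of whitespace characters the extension may cross is an assumption made for this example.

```python
from typing import List, Tuple

TRAILING = ".,;:"


def extend_candidates(
    text: str, start: int, end: int, spaces: int = 0
) -> List[Tuple[str, int, int]]:
    """Return (string, start, end) candidates around a match, longest first."""
    # Grow leftwards to the start of the whitespace-delimited token.
    while start > 0 and not text[start - 1].isspace():
        start -= 1

    candidates = set()
    pos = end
    for _ in range(spaces + 1):
        # Grow rightwards to the next whitespace (or the end of the text).
        while pos < len(text) and not text[pos].isspace():
            pos += 1
        stop = pos
        # Trim trailing whitespace and sentence punctuation from the candidate.
        while stop > start and (text[stop - 1].isspace() or text[stop - 1] in TRAILING):
            stop -= 1
        if stop > start:
            candidates.add((text[start:stop], start, stop))
        if pos >= len(text):
            break
        pos += 1  # step over one whitespace character before the next pass

    return sorted(candidates, key=lambda c: c[2] - c[1], reverse=True)


name = "7-cyclopentyl-5-(4-methoxyphenyl)pyrrolo[2,3-d]pyrimidin-4-amine"
sentence = f"We tested {name} in mice."
begin = sentence.index(name)
# Pretend NER matched only the leading "7", stopping at the first hyphen.
no_spaces = extend_candidates(sentence, begin, begin + 1)

ketal = "Cyclohexanone ethyl methyl ketal"
with_spaces = extend_candidates(ketal, 0, len("Cyclohexanone"), spaces=3)
```

Returning the longest candidate first means a caller can try each one in turn and stop at the first string that parses.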
- parseString(name)
Attempt to parse a potential IUPAC drug name into a Mapping.
Call Opsin with a potential IUPAC drug name to see if it parses - Opsin is fast enough that we can afford to try many potential candidates.
If Opsin parses the string successfully, we convert the resulting SMILES into a canonical SMILES using rdkit, and produce a Mapping with the canonical SMILES as the idx.
If Opsin parsing fails, None is returned. If the log level is set to debug, the error from Opsin for the parsing failure will be logged.
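The canonicalisation half of this method can be sketched with rdkit alone. The Opsin call over py4j is deliberately left out, and `canonical_smiles` is a hypothetical helper rather than kazu's code. The import is guarded because rdkit is an optional dependency; returning None when rdkit is missing is a choice made for this sketch only.

```python
import logging
from typing import Optional

try:
    from rdkit import Chem  # optional dependency, not part of the default kazu install
except ImportError:
    Chem = None

logger = logging.getLogger(__name__)


def canonical_smiles(smiles: str) -> Optional[str]:
    """Return rdkit's canonical form of a SMILES string, or None on failure."""
    if Chem is None:
        logger.debug("rdkit is not installed; cannot canonicalise %r", smiles)
        return None
    mol = Chem.MolFromSmiles(smiles)  # returns None if rdkit cannot parse the input
    if mol is None:
        logger.debug("rdkit could not parse %r", smiles)
        return None
    return Chem.MolToSmiles(mol)  # canonical output is rdkit's default


# Two spellings of cyclohexanone ethyl methyl ketal, written from different atoms.
first = canonical_smiles("CCOC1(OC)CCCCC1")
second = canonical_smiles("C1CCCCC1(OC)OCC")
same_molecule = first is not None and first == second
```

Because both spellings reduce to one canonical string, that string can serve as a stable identifier for comparing entities, which is the role the canonical SMILES plays as the idx of the Mapping.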