At a glance: How to use the default Kazu pipeline

For most use cases we’ve encountered, the default configuration should suffice. It will:

Tag the following entity classes with a curated dictionary using the spaCy PhraseMatcher. This uses MemoryEfficientStringMatchingStep
1. gene
2. disease
3. drug
4. cell_line
5. cell_type
6. gene ontology (split into go_bp, go_cc and go_mf)
7. anatomy

Note

This step is limited to string matching only. A full FlashText implementation (i.e. based on tokens) is available via ExplosionStringMatchingStep; however, this uses considerably more memory.

Tag the following entity classes with the TinyBERN2 model (see the EMNLP Kazu paper for more details). This uses TransformersModelForTokenClassificationNerStep
1. gene
2. disease
3. drug
4. cell_line
5. cell_type

Find candidates for linking the entities to knowledge bases according to the YAML schema below. This uses DictionaryEntityLinkingStep

drug:
  - CHEMBL
  - OPENTARGETS_MOLECULE
disease:
  - MONDO
  - OPENTARGETS_DISEASE
gene:
  - OPENTARGETS_TARGET
  - HGNC_GENE_FAMILY
anatomy:
  - UBERON
cell_line:
  - CELLOSAURUS
cell_type:
  - CLO
  - CL
go_bp:
  - BP_GENE_ONTOLOGY
go_mf:
  - MF_GENE_ONTOLOGY
go_cc:
  - CC_GENE_ONTOLOGY

Apply rules to disambiguate certain entity classes and mentions within a document. This uses RulesBasedEntityClassDisambiguationFilterStep
Decide which candidates are appropriate and extract mappings accordingly. This uses MappingStep
Merge overlapping entities (where appropriate). This uses MergeOverlappingEntsStep
Detect abbreviations, and copy appropriate mapping information to the desired spans. This uses AbbreviationFinderStep
Perform some customisable cleanup. This uses CleanupStep
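
As a quick cross-check of the lists above, the following sketch copies the entity classes of the two tagging steps and the linking schema into plain Python data structures. It is illustrative only and does not use Kazu's API: the names DICTIONARY_CLASSES, MODEL_CLASSES, LINKING_SCHEMA and knowledge_bases_for are hypothetical and do not exist in Kazu.

```python
# Entity classes tagged by each NER step, copied from the lists above.
DICTIONARY_CLASSES = {
    "gene", "disease", "drug", "cell_line", "cell_type",
    "go_bp", "go_cc", "go_mf", "anatomy",
}
MODEL_CLASSES = {"gene", "disease", "drug", "cell_line", "cell_type"}

# Entity class -> knowledge bases searched for linking candidates,
# copied from the YAML schema above.
LINKING_SCHEMA = {
    "drug": ["CHEMBL", "OPENTARGETS_MOLECULE"],
    "disease": ["MONDO", "OPENTARGETS_DISEASE"],
    "gene": ["OPENTARGETS_TARGET", "HGNC_GENE_FAMILY"],
    "anatomy": ["UBERON"],
    "cell_line": ["CELLOSAURUS"],
    "cell_type": ["CLO", "CL"],
    "go_bp": ["BP_GENE_ONTOLOGY"],
    "go_mf": ["MF_GENE_ONTOLOGY"],
    "go_cc": ["CC_GENE_ONTOLOGY"],
}


def knowledge_bases_for(entity_class: str) -> list[str]:
    """Return the knowledge bases listed for an entity class (empty if none)."""
    return LINKING_SCHEMA.get(entity_class, [])


# Classes that only the dictionary-based step tags in the default pipeline.
dictionary_only = sorted(DICTIONARY_CLASSES - MODEL_CLASSES)
print(dictionary_only)  # ['anatomy', 'go_bp', 'go_cc', 'go_mf']

# Every tagged class has at least one knowledge base in the schema.
unlinked = (DICTIONARY_CLASSES | MODEL_CLASSES) - set(LINKING_SCHEMA)
print(unlinked)  # set()
```

Read this way, anatomy and the three gene ontology classes are only ever found by the dictionary-based step, while the other five classes can be tagged by both steps.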

All of these steps are customisable via Hydra configuration.
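
The sketch below shows one way the default pipeline might be loaded and run through Hydra's compose API. It rests on several assumptions that are not stated in this section and should be checked against the Kazu API docs for your version: that a model pack is available and its location is given by the KAZU_MODEL_PACK environment variable, that its conf directory holds a top-level config named config with a Pipeline entry, and that Document.create_simple_document and Document.get_entities are available. The helper names model_pack_conf_dir and run_default_pipeline are hypothetical.

```python
import os
from pathlib import Path


def model_pack_conf_dir(env=None) -> Path:
    """Return the Hydra config directory inside the model pack.

    Assumes the model pack location is given by KAZU_MODEL_PACK.
    """
    env = os.environ if env is None else env
    return Path(env["KAZU_MODEL_PACK"]) / "conf"


def run_default_pipeline(text: str):
    """Instantiate the default pipeline and return the entities found in text."""
    # Imported lazily so this module can be loaded without Kazu installed.
    from hydra import compose, initialize_config_dir
    from hydra.utils import instantiate
    from kazu.data import Document

    # Hydra requires config_dir to be an absolute path.
    conf_dir = str(model_pack_conf_dir())
    with initialize_config_dir(version_base=None, config_dir=conf_dir):
        # Assumption: the model pack's top-level config is named "config".
        cfg = compose(config_name="config")

    # Assumption: the composed config exposes the pipeline under "Pipeline".
    pipeline = instantiate(cfg.Pipeline)
    doc = Document.create_simple_document(text)
    pipeline([doc])
    return doc.get_entities()


# Example usage (requires Kazu and a model pack):
# for entity in run_default_pipeline("EGFR mutations are often implicated in lung cancer"):
#     print(entity)
```

Because the pipeline is built from the composed config, individual steps can in principle be swapped or reconfigured by editing or overriding the relevant config entries rather than changing code.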

Note that other steps are available in Kazu which are not used in the default pipeline, such as:

SethStep for tagging mutations with the SETH tagger.
OpsinStep for resolving IUPAC labels with OPSIN.
StanzaStep for high-accuracy sentence segmentation (note that this slows the pipeline down considerably, which is why it’s not enabled by default).
SpacyNerStep for using a generic spaCy pipeline (such as scispacy) for Named Entity Recognition.

Some of these require additional dependencies which are not included in the default installation of Kazu. You can get all of these dependencies with:

$ pip install 'kazu[all-steps]'

Alternatively, you can install only the dependencies required by the steps you actually use; see each step's API docs for details.
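
If you install dependencies selectively, a small check like the one below can confirm whether a given module is importable before you enable a step. This is a generic standard-library sketch, not part of Kazu. The helper name missing_modules is hypothetical, and the module name used in the example (stanza for StanzaStep) is an assumption; the authoritative list of requirements is in each step's API docs.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Assumption: StanzaStep needs the "stanza" package. Check the API docs.
print(missing_modules(["stanza"]))
```

An empty list means every named module was found; any names returned still need to be installed.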