At a glance: How to use the default Kazu pipeline

For most use cases we’ve encountered, the default configuration should suffice. This will:

  1. Tag the following entity classes with a curated dictionary using the spaCy PhraseMatcher. This uses MemoryEfficientStringMatchingStep

    1. gene

    2. disease

    3. drug

    4. cell_line

    5. cell_type

    6. gene ontology (split into go_bp, go_cc and go_mf)

    7. anatomy

Note

This step is limited to string matching only. A full FlashText implementation (i.e. based on tokens) is available via ExplosionStringMatchingStep; however, this uses considerably more memory.

  2. Tag the following entity classes with the TinyBERN2 model (see the EMNLP Kazu paper for more details). This uses TransformersModelForTokenClassificationNerStep

    1. gene

    2. disease

    3. drug

    4. cell_line

    5. cell_type

  3. Find candidates for linking the entities to knowledgebases according to the YAML schema below. This uses DictionaryEntityLinkingStep

    drug:
      - CHEMBL
      - OPENTARGETS_MOLECULE
    disease:
      - MONDO
      - OPENTARGETS_DISEASE
    gene:
      - OPENTARGETS_TARGET
      - HGNC_GENE_FAMILY
    anatomy:
      - UBERON
    cell_line:
      - CELLOSAURUS
    cell_type:
      - CLO
      - CL
    go_bp:
      - BP_GENE_ONTOLOGY
    go_mf:
      - MF_GENE_ONTOLOGY
    go_cc:
      - CC_GENE_ONTOLOGY

  4. Apply rules to disambiguate certain entity classes and mentions within a document using RulesBasedEntityClassDisambiguationFilterStep

  5. Decide which candidates are appropriate and extract mappings accordingly. This uses MappingStep

  6. Merge overlapping entities (where appropriate). This uses MergeOverlappingEntsStep

  7. Detect abbreviations, and copy appropriate mapping information to the desired spans. This uses AbbreviationFinderStep

  8. Perform some customisable cleanup. This uses CleanupStep
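The linking schema in the list above can be read as a plain mapping from entity class to the knowledgebases searched for candidates. The following is a minimal sketch of that reading in plain Python. It does not use Kazu's API: `LINKING_SCHEMA` and `knowledgebases_for` are hypothetical names introduced only for illustration, and the real lookup is performed by DictionaryEntityLinkingStep.

```python
from typing import Dict, List

# Entity class -> knowledgebases searched for linking candidates,
# transcribed by hand from the YAML schema in the list above.
# LINKING_SCHEMA is a hypothetical name, not part of Kazu.
LINKING_SCHEMA: Dict[str, List[str]] = {
    "drug": ["CHEMBL", "OPENTARGETS_MOLECULE"],
    "disease": ["MONDO", "OPENTARGETS_DISEASE"],
    "gene": ["OPENTARGETS_TARGET", "HGNC_GENE_FAMILY"],
    "anatomy": ["UBERON"],
    "cell_line": ["CELLOSAURUS"],
    "cell_type": ["CLO", "CL"],
    "go_bp": ["BP_GENE_ONTOLOGY"],
    "go_mf": ["MF_GENE_ONTOLOGY"],
    "go_cc": ["CC_GENE_ONTOLOGY"],
}


def knowledgebases_for(entity_class: str) -> List[str]:
    """Return the knowledgebases listed for an entity class (hypothetical helper).

    Classes that are absent from the schema yield an empty list.
    """
    return LINKING_SCHEMA.get(entity_class, [])


print(knowledgebases_for("gene"))
print(knowledgebases_for("mutation"))
```

An entity class that does not appear in the schema, such as the made-up `mutation` class here, simply has no knowledgebases to search in this sketch.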

All of these steps are customisable via Hydra configuration.
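To illustrate what configuration-driven customisation means in practice, here is a toy sketch in which an ordered list of step names decides which steps run and in what order. It is not Kazu's implementation and does not use Hydra. `make_step`, `STEP_REGISTRY`, `build_pipeline` and the short step names are all hypothetical, and each toy step only records that it ran.

```python
from typing import Callable, Dict, List

# A "document" here is just a dict holding a history of the steps applied;
# real Kazu documents are richer objects.
Doc = Dict[str, List[str]]
Step = Callable[[Doc], Doc]


def make_step(name: str) -> Step:
    """Build a toy step that only records that it ran."""

    def step(doc: Doc) -> Doc:
        doc["history"].append(name)
        return doc

    return step


# Hypothetical registry mapping short names to toy steps.
STEP_REGISTRY: Dict[str, Step] = {
    name: make_step(name)
    for name in ("string_matching", "ner", "linking", "mapping", "cleanup")
}


def build_pipeline(step_names: List[str]) -> Step:
    """Compose the named steps, in the given order, into one callable."""
    steps = [STEP_REGISTRY[name] for name in step_names]

    def run(doc: Doc) -> Doc:
        for step in steps:
            doc = step(doc)
        return doc

    return run


# Two "configurations": one with every toy step, one with a subset.
default_config = ["string_matching", "ner", "linking", "mapping", "cleanup"]
custom_config = ["ner", "linking", "mapping"]

print(build_pipeline(default_config)({"history": []})["history"])
print(build_pipeline(custom_config)({"history": []})["history"])
```

Changing the list changes the pipeline without touching any step code, which is the general idea behind selecting and ordering steps through configuration.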

Note that other steps are available in Kazu which are not used in the default pipeline, such as:

  • SethStep for tagging mutations with the SETH tagger.

  • OpsinStep for resolving IUPAC labels with OPSIN.

  • StanzaStep for high-accuracy sentence segmentation (note that this slows the pipeline down considerably, which is why it is not included by default).

  • SpacyNerStep for using a generic spaCy pipeline (such as scispaCy) for Named Entity Recognition.

Some of these require additional dependencies that are not included in the default installation of Kazu. You can get all of these dependencies with:

$ pip install 'kazu[all-steps]'

Alternatively, you can install only the dependencies required by the steps you are actually using; see each step's API docs for details.