At a glance: How to use the default Kazu pipeline
For most use cases we’ve encountered, the default configuration should suffice. This will:
Tag the following entity classes with a curated dictionary using the spaCy PhraseMatcher. This uses MemoryEfficientStringMatchingStep:
gene
disease
drug
cell_line
cell_type
gene ontology (split into go_bp, go_cc and go_mf)
anatomy
Note
This step is limited to string matching only. A full FlashText implementation
(i.e. based on tokens) is available via ExplosionStringMatchingStep;
however, it uses considerably more memory.
Tag the following entity classes with the TinyBERN2 model (see the EMNLP Kazu paper for more details). This uses TransformersModelForTokenClassificationNerStep:
gene
disease
drug
cell_line
cell_type
Find candidates for linking the entities to knowledgebases according to the below yaml schema. This uses DictionaryEntityLinkingStep:
drug:
  - CHEMBL
  - OPENTARGETS_MOLECULE
disease:
  - MONDO
  - OPENTARGETS_DISEASE
gene:
  - OPENTARGETS_TARGET
  - HGNC_GENE_FAMILY
anatomy:
  - UBERON
cell_line:
  - CELLOSAURUS
cell_type:
  - CLO
  - CL
go_bp:
  - BP_GENE_ONTOLOGY
go_mf:
  - MF_GENE_ONTOLOGY
go_cc:
  - CC_GENE_ONTOLOGY
Apply rules to disambiguate certain entity classes and mentions within a document using RulesBasedEntityClassDisambiguationFilterStep.
Decide which candidates are appropriate and extract mappings accordingly. This uses MappingStep.
Merge overlapping entities (where appropriate). This uses MergeOverlappingEntsStep.
Detect abbreviations, and copy appropriate mapping information to the desired spans. This uses AbbreviationFinderStep.
Perform some customisable cleanup. This uses CleanupStep.
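As a reading aid for the linking schema above, the sketch below restates it as a plain Python dictionary with a small lookup helper. It is only an illustration of the mapping's shape: the names LINKING_SCHEMA and knowledge_bases_for are hypothetical and are not part of Kazu's API, and the real schema lives in the pipeline's configuration.

```python
from typing import Dict, List

# Illustrative only: the entity class -> knowledge base schema from the text,
# written as a plain dict. LINKING_SCHEMA is a hypothetical name, not a Kazu object.
LINKING_SCHEMA: Dict[str, List[str]] = {
    "drug": ["CHEMBL", "OPENTARGETS_MOLECULE"],
    "disease": ["MONDO", "OPENTARGETS_DISEASE"],
    "gene": ["OPENTARGETS_TARGET", "HGNC_GENE_FAMILY"],
    "anatomy": ["UBERON"],
    "cell_line": ["CELLOSAURUS"],
    "cell_type": ["CLO", "CL"],
    "go_bp": ["BP_GENE_ONTOLOGY"],
    "go_mf": ["MF_GENE_ONTOLOGY"],
    "go_cc": ["CC_GENE_ONTOLOGY"],
}


def knowledge_bases_for(entity_class: str) -> List[str]:
    """Return the knowledge bases listed for an entity class (empty if none)."""
    # Return a copy so callers cannot mutate the schema by accident.
    return list(LINKING_SCHEMA.get(entity_class, []))
```

For example, knowledge_bases_for("cell_type") returns the two ontologies listed for cell types, while an entity class with no entry in the schema yields an empty list.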
All of these steps are customisable via Hydra configuration.
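To make the idea of an ordered, swappable sequence of steps concrete, here is a toy sketch in plain Python. It does not use Kazu at all: ToyDocument, make_dictionary_tagger, merge_overlapping and run_pipeline are hypothetical names, the matcher is a naive substring search rather than the matchers named above, and the "keep the longest span" merge rule is an assumption chosen for illustration. In Kazu itself, the steps and their order come from the Hydra configuration rather than a hand-written list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_class)


@dataclass
class ToyDocument:
    text: str
    entities: List[Span] = field(default_factory=list)


Step = Callable[[ToyDocument], ToyDocument]


def make_dictionary_tagger(terms: Dict[str, str]) -> Step:
    """Build a step that tags every case-insensitive occurrence of each term."""

    def tag(doc: ToyDocument) -> ToyDocument:
        lowered = doc.text.lower()
        for term, entity_class in terms.items():
            needle = term.lower()
            start = lowered.find(needle)
            while start != -1:
                doc.entities.append((start, start + len(needle), entity_class))
                start = lowered.find(needle, start + 1)
        return doc

    return tag


def merge_overlapping(doc: ToyDocument) -> ToyDocument:
    """Among overlapping spans, keep only the longest (an assumed rule)."""
    kept: List[Span] = []
    # Visit the longest spans first so shorter overlapping ones are dropped.
    for span in sorted(doc.entities, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    doc.entities = sorted(kept)
    return doc


def run_pipeline(steps: List[Step], doc: ToyDocument) -> ToyDocument:
    """Apply each step in order; customising means editing the list."""
    for step in steps:
        doc = step(doc)
    return doc
```

A short usage example: with the terms "EGFR" (gene), "lung cancer" (disease) and "cancer" (disease), running the tagger followed by merge_overlapping on "EGFR mutations are often implicated in lung cancer" leaves one gene span and one disease span, because the shorter "cancer" match is absorbed by "lung cancer".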
Note that other steps are available in Kazu which are not used in the default pipeline, such as:
SethStep for tagging mutations with the SETH tagger.
StanzaStep for high accuracy sentence-segmentation (note that this slows the pipeline down considerably, which is why it is not included by default).
SpacyNerStep for using a generic spaCy pipeline (such as scispacy) for Named Entity Recognition.
Some of these require additional dependencies which are not included in the default installation of kazu. You can get all of these dependencies with:
$ pip install 'kazu[all-steps]'
Alternatively, install only the dependencies required by the steps you actually use; see their API docs for details.
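If you install dependencies selectively, a quick check of which modules are importable can save a confusing failure later. The sketch below uses only the standard library. The mapping from step names to module names is an assumption made for illustration (the authoritative requirements are in each step's API docs), and missing_modules is a hypothetical helper, not part of Kazu.

```python
import importlib.util
from typing import Dict, Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed, illustrative mapping; check each step's API docs for the real requirements.
OPTIONAL_STEP_MODULES: Dict[str, List[str]] = {
    "StanzaStep": ["stanza"],
    "SpacyNerStep": ["spacy"],
}

for step_name, modules in OPTIONAL_STEP_MODULES.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{step_name}: {status}")
```

The check only confirms that a module can be located, not that its version is compatible with Kazu.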