At a glance: How to use the default Kazu pipeline
For most use cases we’ve encountered, the default configuration should suffice. The default pipeline will:

1. Tag the following entity classes with a curated dictionary using the FlashText algorithm. This uses MemoryEfficientStringMatchingStep.
   - gene
   - disease
   - drug
   - cell_line
   - cell_type
   - gene ontology (split into go_bp, go_cc and go_mf)
   - anatomy

   Note: This step is limited to string matching only. A full spaCy PhraseMatcher implementation (i.e. based on tokens) is available via ExplosionStringMatchingStep; however, this uses considerably more memory.

2. Tag the following entity classes with the TinyBERN2 model (see the EMNLP Kazu paper for more details). This uses TransformersModelForTokenClassificationNerStep.
   - gene
   - disease
   - drug
   - cell_line
   - cell_type

3. Find candidates for linking the entities to knowledgebases according to the YAML schema below. This uses DictionaryEntityLinkingStep.

      drug:
        - CHEMBL
        - OPENTARGETS_MOLECULE
      disease:
        - MONDO
        - OPENTARGETS_DISEASE
      gene:
        - OPENTARGETS_TARGET
        - HGNC_GENE_FAMILY
      anatomy:
        - UBERON
      cell_line:
        - CELLOSAURUS
      cell_type:
        - CLO
        - CL
      go_bp:
        - BP_GENE_ONTOLOGY
      go_mf:
        - MF_GENE_ONTOLOGY
      go_cc:
        - CC_GENE_ONTOLOGY

4. Apply rules to disambiguate certain entity classes and mentions within a document using RulesBasedEntityClassDisambiguationFilterStep.

5. Decide which candidates are appropriate and extract mappings accordingly. This uses MappingStep.

6. Merge overlapping entities (where appropriate). This uses MergeOverlappingEntsStep.

7. Detect abbreviations, and copy appropriate mapping information to the desired spans. This uses AbbreviationFinderStep.

8. Perform some customisable cleanup. This uses CleanupStep.
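As a rough illustration of the linking schema above, the mapping from entity class to knowledgebases can be pictured as a plain Python dictionary. This is only a sketch for reading the schema: the names `LINKING_SCHEMA` and `knowledgebases_for` are hypothetical and are not part of the Kazu API, and Kazu itself reads this mapping from its configuration rather than from code like this.

```python
# Illustrative only: the linking schema written as a plain dict.
# LINKING_SCHEMA and knowledgebases_for are hypothetical names, not Kazu API.
LINKING_SCHEMA: dict[str, list[str]] = {
    "drug": ["CHEMBL", "OPENTARGETS_MOLECULE"],
    "disease": ["MONDO", "OPENTARGETS_DISEASE"],
    "gene": ["OPENTARGETS_TARGET", "HGNC_GENE_FAMILY"],
    "anatomy": ["UBERON"],
    "cell_line": ["CELLOSAURUS"],
    "cell_type": ["CLO", "CL"],
    "go_bp": ["BP_GENE_ONTOLOGY"],
    "go_mf": ["MF_GENE_ONTOLOGY"],
    "go_cc": ["CC_GENE_ONTOLOGY"],
}


def knowledgebases_for(entity_class: str) -> list[str]:
    """Return the knowledgebases listed for an entity class (empty if none)."""
    return list(LINKING_SCHEMA.get(entity_class, []))


if __name__ == "__main__":
    for entity_class in ("gene", "cell_type", "mutation"):
        print(entity_class, "->", knowledgebases_for(entity_class))
```

An entity class with no entry in the schema (such as the made-up "mutation" class here) simply yields an empty list in this sketch.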
All of these steps are customisable via Hydra configuration.
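To make the idea of a customisable step sequence concrete, here is a small plain-Python picture of the default step order described above and of what swapping one step for another amounts to conceptually. It deliberately does not use Hydra or Kazu: `DEFAULT_STEP_ORDER` and `swap_step` are hypothetical names for illustration, and real customisation is done in the Hydra configuration, whose layout is not shown here.

```python
# Illustrative only: step class names in the order the default pipeline runs them.
# These are plain strings, not Kazu objects; swap_step is a hypothetical helper.
DEFAULT_STEP_ORDER = [
    "MemoryEfficientStringMatchingStep",
    "TransformersModelForTokenClassificationNerStep",
    "DictionaryEntityLinkingStep",
    "RulesBasedEntityClassDisambiguationFilterStep",
    "MappingStep",
    "MergeOverlappingEntsStep",
    "AbbreviationFinderStep",
    "CleanupStep",
]


def swap_step(steps: list[str], old: str, new: str) -> list[str]:
    """Return a copy of steps with the first occurrence of old replaced by new."""
    if old not in steps:
        raise ValueError(f"{old!r} is not in the step list")
    result = list(steps)  # copy, so the default order is left untouched
    result[result.index(old)] = new
    return result


if __name__ == "__main__":
    custom = swap_step(
        DEFAULT_STEP_ORDER,
        "MemoryEfficientStringMatchingStep",
        "ExplosionStringMatchingStep",
    )
    for position, name in enumerate(custom, start=1):
        print(position, name)
```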
Note that other steps are available in Kazu which are not used in the default pipeline, such as:
- SethStep for tagging mutations with the SETH tagger.
- StanzaStep for high-accuracy sentence segmentation (note that this slows the pipeline down considerably, which is why it is not included by default).
- SpacyNerStep for using a generic spaCy pipeline (such as scispacy) for Named Entity Recognition.
Some of these require additional dependencies which are not included in the default installation of kazu. You can get all of these dependencies with:
$ pip install 'kazu[all-steps]'
Alternatively, install only the dependencies required by the steps you actually use; see the API docs for each step for details.
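If you are unsure whether the optional dependencies for a step are present in your environment, a quick check along the following lines may help. This is a generic standard-library sketch, not something Kazu provides: `missing_modules` is a hypothetical helper, and the module names in the usage example are assumptions for illustration only, so consult each step's API docs for its actual requirements.

```python
from importlib.util import find_spec


def missing_modules(module_names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found for import.

    Only top-level names (no dots) are supported in this sketch.
    """
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module names, for illustration only.
    candidates = ["stanza", "spacy"]
    missing = missing_modules(candidates)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All candidate modules were found.")
```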