Quickstart

Installation

Python version 3.9 or higher is required (tested with Python 3.11).
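
As a quick sanity check before installing anything, the interpreter version can be compared against the minimum stated above. This is a minimal sketch using only the standard library; the helper name python_is_supported is made up for illustration and is not part of Kazu.

```python
import sys

# Minimum version taken from the requirement stated above.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum.

    Hypothetical helper for illustration; not part of Kazu.
    """
    current = tuple((version_info or sys.version_info)[:2])
    return current >= minimum


status = "supported" if python_is_supported() else "too old for Kazu"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```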

Installing PyTorch (prerequisite)

Kazu handles PyTorch installation for you where possible. If torch is installable on your platform with:

$ pip install torch

then Kazu will handle it for you.

However, this is only possible on some platforms (e.g. Mac, Windows without a GPU, Linux with a specific version of CUDA).

See the installation instructions on the PyTorch website and select your platform. If the resulting command specifies an --index-url, you will need to run that command yourself, although installing torchvision and torchaudio is not necessary.

For example, at the time of writing, installing PyTorch on Linux without a GPU requires:

$ pip install torch --index-url https://download.pytorch.org/whl/cpu
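
To confirm that the interpreter you intend to use for Kazu can actually see torch, a small standard-library check along these lines may help. It only looks the package up and does not import it. The helper name is_installed is an assumption made for this sketch and is not part of Kazu or PyTorch.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it.

    Hypothetical helper for illustration; not part of Kazu or PyTorch.
    """
    return importlib.util.find_spec(module_name) is not None


if is_installed("torch"):
    print("torch is available in this environment")
else:
    print("torch was not found; install it before (or along with) Kazu")
```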

Installing Kazu

If you have already installed PyTorch, or it is installable on your platform with pip install torch, installing Kazu is as simple as:

$ pip install kazu

If you intend to use mypy on your own codebase, consider installing Kazu using:

$ pip install 'kazu[typed]'

This will pull in typing stubs for Kazu’s dependencies (such as types-requests for Requests) so that mypy has access to as much relevant typing information as possible when type checking your codebase. Otherwise, depending on your mypy config, you may see errors like this when running mypy:

.venv/lib/python3.10/site-packages/kazu/steps/linking/post_processing/xref_manager.py:10: error: Library stubs not installed for "requests" [import]

Model Pack

In order to use the majority of Kazu, you will need a model pack, which contains the pretrained models and knowledge bases/ontologies required by the pipeline. These are available from the release page.

For Kazu to work as expected, you will need to set an environment variable KAZU_MODEL_PACK to the path to your model pack.

On macOS/Linux/Windows Subsystem for Linux (WSL):

$ export KAZU_MODEL_PACK=/Users/me/path/to/kazu_model_pack_public-vCurrent.Version

On Windows:

Using the default Windows CMD shell:

$ set KAZU_MODEL_PACK=C:\Users\me\path\to\kazu_model_pack_public-vCurrent.Version

Using Powershell:

$ $Env:KAZU_MODEL_PACK = 'C:\Users\me\path\to\kazu_model_pack_public-vCurrent.Version'

Default configuration

Kazu has many moving parts, each of which can be configured according to your requirements. Since this can get complicated, we use Hydra to manage different configurations, and provide a ‘default’ configuration that is useful in most circumstances (and is also a good starting point for your own tweaks). This default configuration is located in the ‘conf/’ directory of the model pack.
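
A misconfigured KAZU_MODEL_PACK tends to surface later as a less obvious Hydra error, so it can be worth checking up front that the variable is set and that the model pack contains a ‘conf’ directory. The following is a minimal sketch using only the standard library; the function name find_model_pack_conf and its error messages are assumptions for illustration, not part of Kazu.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_model_pack_conf(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the 'conf' directory inside the model pack, or raise a clear error.

    Hypothetical helper for illustration; not part of Kazu.
    """
    env = os.environ if env is None else env
    raw = env.get("KAZU_MODEL_PACK")
    if not raw:
        raise RuntimeError("KAZU_MODEL_PACK is not set")
    conf_dir = Path(raw) / "conf"
    if not conf_dir.is_dir():
        raise FileNotFoundError(f"no 'conf' directory found at {conf_dir}")
    return conf_dir


try:
    print(f"Using config from {find_model_pack_conf()}")
except (RuntimeError, FileNotFoundError) as err:
    print(f"Model pack problem: {err}")
```

Passing a plain dictionary as env makes the check easy to exercise without changing the real environment.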

Processing your first document

Make sure you’ve installed Kazu correctly as above, and have set the KAZU_MODEL_PACK variable as described in the Model Pack section above.

The code below assumes a standard .py file (or console). If you wish to use a notebook, see the section on using Kazu in a notebook below.

import hydra
from hydra.utils import instantiate

from kazu.data import Document
from kazu.pipeline import Pipeline
from kazu.utils.constants import HYDRA_VERSION_BASE
from pathlib import Path
import os

# the hydra config is kept in the model pack
cdir = Path(os.environ["KAZU_MODEL_PACK"]).joinpath("conf")


@hydra.main(
    version_base=HYDRA_VERSION_BASE, config_path=str(cdir), config_name="config"
)
def kazu_test(cfg):
    pipeline: Pipeline = instantiate(cfg.Pipeline)
    text = "EGFR mutations are often implicated in lung cancer"
    doc = Document.create_simple_document(text)
    pipeline([doc])
    print(f"{doc.sections[0].text}")
    # add other manipulation of the document here or a breakpoint() call
    # for interactive exploration.


if __name__ == "__main__":
    kazu_test()

You can now inspect the doc object, and explore what entities were detected on each section.
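
One way to explore the result is to summarise the detected entities. The sketch below relies only on doc.get_entities() and the match and entity_class attributes, which are used elsewhere in this guide. So that it runs without a model pack, it operates on a stand-in object built with SimpleNamespace; the stand-in and its two entities are invented for illustration and are not real Kazu output. The helper names are likewise hypothetical.

```python
from collections import Counter
from types import SimpleNamespace


def list_matches(doc):
    """Return (matched text, entity class) pairs for every entity in the document."""
    return [(ent.match, ent.entity_class) for ent in doc.get_entities()]


def count_entity_classes(doc):
    """Count how many entities of each class the document contains."""
    return Counter(ent.entity_class for ent in doc.get_entities())


# Stand-in for a processed Document (made-up data, NOT real pipeline output).
# With Kazu installed, pass the doc returned by the pipeline instead.
stand_in_doc = SimpleNamespace(
    get_entities=lambda: [
        SimpleNamespace(match="EGFR", entity_class="gene"),
        SimpleNamespace(match="lung cancer", entity_class="disease"),
    ]
)

for match, entity_class in list_matches(stand_in_doc):
    print(f"{match!r} was tagged as {entity_class}")
print(count_entity_classes(stand_in_doc))
```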

The above code snippet sets up your code as a Hydra application, which gives you a great deal of flexibility to reconfigure many parts of Kazu via command-line overrides. See the Hydra docs for more detail on this.

Using Kazu in a notebook or other non-Hydra application

Sometimes, you will not want your overall application to be a Hydra application, where Hydra handles the command line argument parsing.

In some cases, such as running in a notebook, using Hydra to handle the argument parsing isn’t possible at all (see this Hydra issue for details with a Jupyter notebook).

You may not want Hydra to control command-line argument parsing in other scenarios either. For example, you may wish to build a different command-line experience for your application and have no need for command-line overrides of the Kazu config. Alternatively, you may want to embed Kazu components into another codebase while still letting Hydra manage the configuration of those components.

In these cases, you can instantiate Kazu objects using Hydra without making your whole program a ‘Hydra application’ by using the Hydra Compose API:

from hydra import compose, initialize_config_dir
from hydra.utils import instantiate

from kazu.data import Document
from kazu.pipeline import Pipeline
from kazu.utils.constants import HYDRA_VERSION_BASE
from pathlib import Path
import os

# the hydra config is kept in the model pack
cdir = Path(os.environ["KAZU_MODEL_PACK"]).joinpath("conf")


def kazu_test():
    with initialize_config_dir(version_base=HYDRA_VERSION_BASE, config_dir=str(cdir)):
        cfg = compose(config_name="config")
    pipeline: Pipeline = instantiate(cfg.Pipeline)
    text = "EGFR mutations are often implicated in lung cancer"
    doc = Document.create_simple_document(text)
    pipeline([doc])
    print(f"{doc.sections[0].text}")
    return doc


if __name__ == "__main__":
    doc = kazu_test()

Note that if running the above code in a Jupyter notebook, the if __name__ == "__main__": check is redundant (though it still behaves as expected) and you can just run kazu_test() directly in a cell of the notebook.

Running Steps

Components are wrapped as instances of kazu.steps.step.Step, and a step can be run on its own, outside a full pipeline:

from kazu.data import Document, Entity
from kazu.steps.document_post_processing.abbreviation_finder import (
    AbbreviationFinderStep,
)

# creates a document with a single section
doc = Document.create_simple_document(
    "Epidermal Growth Factor Receptor (EGFR) is a gene."
)
# create an Entity for the span "Epidermal Growth Factor Receptor"
entity = Entity.load_contiguous_entity(
    # start and end are the character indices for the entity
    start=0,
    end=len("Epidermal Growth Factor Receptor"),
    namespace="example",
    entity_class="gene",
    match="Epidermal Growth Factor Receptor",
)

# add it to the document's first (and only) section
doc.sections[0].entities.append(entity)

# create an instance of the AbbreviationFinderStep
step = AbbreviationFinderStep()
# a step may fail to process a document, so it returns two lists:
# all the docs, and just the failures
processed, failed = step([doc])
# check that a new entity has been created, attached to the EGFR span
egfr_entity = next(filter(lambda x: x.match == "EGFR", doc.get_entities()))
assert egfr_entity.entity_class == "gene"
print(egfr_entity.match)
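
Because a step returns both the full list of documents and the list of failures, calling code has to decide what to do when the second list is non-empty. One possible approach, sketched below, is to fail loudly. The wrapper run_step_strict is hypothetical and not part of Kazu; it assumes only that calling a step with a list of documents returns a (processed, failed) pair, as in the example above.

```python
def run_step_strict(step, docs):
    """Run a step and raise if any document failed to process.

    Hypothetical wrapper for illustration; not part of Kazu. It assumes
    only that step(docs) returns a (processed, failed) pair of lists.
    """
    processed, failed = step(docs)
    if failed:
        raise RuntimeError(
            f"{len(failed)} of {len(processed)} document(s) failed "
            f"in {type(step).__name__}"
        )
    return processed
```

With Kazu installed, this could be used as run_step_strict(step, [doc]) in place of the direct step([doc]) call.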