Kazu Data Model

The Kazu data model is built around two concepts: kazu.data.Document and kazu.steps.step.Step. Steps are run over documents, generally returning the original document with additional information added.

A Document is composed of a sequence of kazu.data.Section objects (for instance: title, body). A Section is a container for text and metadata (such as entities detected by an NER step).

from kazu.data import Document, Entity
from kazu.steps.document_post_processing.abbreviation_finder import (
    AbbreviationFinderStep,
)

# creates a document with a single section
doc = Document.create_simple_document(
    "Epidermal Growth Factor Receptor (EGFR) is a gene."
)
# create an Entity for the span "Epidermal Growth Factor Receptor"
entity = Entity.load_contiguous_entity(
    # start and end are the character indices for the entity
    start=0,
    end=len("Epidermal Growth Factor Receptor"),
    namespace="example",
    entity_class="gene",
    match="Epidermal Growth Factor Receptor",
)

# add it to the document's first (and only) section
doc.sections[0].entities.append(entity)

# create an instance of the AbbreviationFinderStep
step = AbbreviationFinderStep()
# a step may fail to process a document, so it returns two lists:
# all the docs, and just the failures
processed, failed = step([doc])
# check that a new entity has been created, attached to the EGFR span
egfr_entity = next(filter(lambda x: x.match == "EGFR", doc.get_entities()))
assert egfr_entity.entity_class == "gene"
print(egfr_entity.match)

For convenience, and to handle additional logging/failure events, Steps can be wrapped in a kazu.pipeline.Pipeline.

For further data model documentation, please see the API docs for kazu.data.Entity, kazu.data.LinkingCandidate etc.

Data serialization and deserialization

Documents are the key containers of data processed by (or to be processed by) Kazu, so the main method for serialization is Document.to_json(), with Document.from_json() for deserialization.

Document, and the other classes that can be stored on a Document, also have a from_dict() method.

Note

Under the hood, Kazu uses cattrs for its (de)serialization, so if you are already familiar with cattrs, you may prefer to use kazu.data.kazu_json_converter directly instead.

(De)serialization and generic metadata fields

Note

This is only relevant to advanced users, who are:

  • Modifying the pipeline or parsers so that they have custom metadata on some of Kazu’s classes

  • Using custom metadata that isn’t JSON-encodable ‘natively’ by Python

  • Wanting to both serialize and deserialize this custom metadata and get back the same structured objects

If this isn’t you, skip this section!

Some of Kazu’s classes carry a generic metadata dictionary. Because this field can hold arbitrary Python objects, it can break the ability to write to JSON and read the data back.

To write to and read from JSON, the keys of the metadata dictionary must be strings, as JSON requires.
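To see why string keys matter, here is a minimal sketch that uses only Python's standard json module rather than Kazu's own converter (which may treat non-string keys differently). The metadata dictionaries are hypothetical examples.

```python
import json

# hypothetical metadata dictionaries, for illustration only
string_keyed = {"source": "pubmed", "year": 2020}
int_keyed = {1: "first", 2: "second"}
tuple_keyed = {("a", "b"): "pair"}

# string keys survive a round trip unchanged
assert json.loads(json.dumps(string_keyed)) == string_keyed

# integer keys are silently converted to strings, so the
# round-tripped dictionary no longer equals the original
round_tripped = json.loads(json.dumps(int_keyed))
print(round_tripped)  # {'1': 'first', '2': 'second'}
print(round_tripped == int_keyed)  # False

# other key types, such as tuples, cannot be encoded at all
try:
    json.dumps(tuple_keyed)
except TypeError as err:
    print(f"cannot encode: {err}")
```

Sticking to string keys from the start avoids both the silent conversion and the outright failure.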

Kazu uses cattrs for its (de)serialization, which means that primitives, enums and Python dataclasses (whose fields are themselves of supported types) work out of the box as values in the metadata dictionary. As a result, all dataclasses and Enums in kazu.data will serialize without errors when stored inside one of these metadata fields.

Unfortunately, deserializing this output yields plain dictionaries representing the relevant class or enum, rather than instances of the class you originally stored:

from kazu.data import Document, Section

doc = Document(
    idx="my_doc_id",
    sections=[Section(text="Some text here", name="my simple section")],
    metadata={
        "another section!": Section(
            text="another somehow related text!", name="this is in metadata"
        )
    },
)
doc_dict = doc.to_dict()
print(Document.from_dict(doc_dict).metadata)

Produces:

{'another section!': {'text': 'another somehow related text!', 'name': 'this is in metadata'}}

However, you can work around this with cattrs by structuring the metadata field first. A quick way of doing this is below (though you could instead set up a cattrs ‘converter’ with custom structuring hooks):

from kazu.data import kazu_json_converter

# continuing from above
doc_dict["metadata"] = kazu_json_converter.structure(
    doc_dict["metadata"], dict[str, Section]
)
reloaded_doc = kazu_json_converter.structure(doc_dict, Document)

print(reloaded_doc == doc)

Produces (as expected):

True