Kazu Data Model¶
The Kazu data model is based around the concepts of kazu.data.Document and
kazu.steps.step.Step. Steps are run over documents, generally returning the
original document with additional information added. Documents are composed of
a sequence of kazu.data.Section objects (for instance: title, body). A Section
is a container for text and metadata (such as entities detected by an NER step).
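For instance, a Document with separate title and body sections can be constructed directly. This is a minimal sketch; the idx and section names below are arbitrary:

from kazu.data import Document, Section

doc = Document(
    idx="example_doc",  # an arbitrary document identifier
    sections=[
        Section(text="A title", name="title"),
        Section(text="The body of the document.", name="body"),
    ],
)

The example below shows a fuller workflow: creating a simple document, attaching an entity to it, and running a step over it.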
from kazu.data import Document, Entity
from kazu.steps.document_post_processing.abbreviation_finder import (
    AbbreviationFinderStep,
)

# creates a document with a single section
doc = Document.create_simple_document(
    "Epidermal Growth Factor Receptor (EGFR) is a gene."
)
# create an Entity for the span "Epidermal Growth Factor Receptor"
entity = Entity.load_contiguous_entity(
    # start and end are the character indices for the entity
    start=0,
    end=len("Epidermal Growth Factor Receptor"),
    namespace="example",
    entity_class="gene",
    match="Epidermal Growth Factor Receptor",
)
# add it to the document's first (and only) section
doc.sections[0].entities.append(entity)
# create an instance of the AbbreviationFinderStep
step = AbbreviationFinderStep()
# a step may fail to process a document, so it returns two lists:
# all the docs, and just the failures
processed, failed = step([doc])
# check that a new entity has been created, attached to the EGFR span
egfr_entity = next(filter(lambda x: x.match == "EGFR", doc.get_entities()))
assert egfr_entity.entity_class == "gene"
print(egfr_entity.match)
For convenience, and to handle additional logging/failure events, Steps can be wrapped in a kazu.pipeline.Pipeline.
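A minimal sketch of wrapping the step above in a pipeline, assuming Pipeline accepts a list of steps and is called with a list of documents (reusing step and doc from the example above):

from kazu.pipeline import Pipeline

# the pipeline runs each of its steps over the documents in turn,
# handling logging and collecting any failures along the way
pipeline = Pipeline(steps=[step])
processed_docs = pipeline([doc])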
For further data model documentation, please see the API docs for kazu.data.Entity, kazu.data.LinkingCandidate, etc.
Data serialization and deserialization¶
As Documents are the key containers of data processed by (or to be processed by)
Kazu, Document.to_json() is the key method here for serialization, and
Document.from_json() for deserialization. Document and other classes that can be
stored on a Document also have a from_dict() method.
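A minimal round-trip sketch, assuming to_json() produces a JSON string that from_json() can read back:

from kazu.data import Document

doc = Document.create_simple_document("EGFR is a gene.")
json_str = doc.to_json()  # serialize the whole document to a JSON string
reloaded_doc = Document.from_json(json_str)  # rebuild an equivalent Document
assert reloaded_doc == doc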
Note
Under the hood, Kazu uses cattrs for its (de)serialization, so if you are already familiar with cattrs, you may prefer to use kazu.data.kazu_json_converter directly instead.
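For instance, kazu_json_converter can be used like any other cattrs converter. This is a hedged sketch, assuming the standard cattrs unstructure/structure API:

from kazu.data import Document, kazu_json_converter

doc = Document.create_simple_document("EGFR is a gene.")
# convert the Document into plain dicts/lists/strings...
doc_dict = kazu_json_converter.unstructure(doc)
# ...and structure it back into a Document instance
reloaded_doc = kazu_json_converter.structure(doc_dict, Document)
assert reloaded_doc == doc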
(De)serialization and generic metadata fields¶
Note
This is only relevant to advanced users, who are:
Modifying the pipeline or parsers so that they have custom metadata on some of Kazu’s classes
Using custom metadata that isn’t json-encodable ‘natively’ by Python
Wanting to both serialize and de-serialize this custom metadata and get back the same structured objects
If this isn’t you, skip this section!
Some of Kazu’s classes allow for a generic metadata dictionary on them. Since
this allows storing arbitrary Python objects in this field, it can potentially
break the ability to write to json and back. In order to write to and read from
json, the keys of the metadata dictionary will need to be strings, as json only
permits string keys.
Kazu uses cattrs for its (de)serialization, which means that primitives, enums
and Python dataclasses (with fields that are themselves supported) are supported
out of the box for serialization as values in the metadata dictionary. As a
result, all dataclasses and Enums in kazu.data will serialize without errors when
stored inside one of these metadata fields.
Unfortunately, deserializing this output will leave the result containing dictionaries representing the relevant class/enum, rather than instances of the same class you originally had:
from kazu.data import Document, Section

doc = Document(
    idx="my_doc_id",
    sections=[Section(text="Some text here", name="my simple section")],
    metadata={
        "another section!": Section(
            text="another somehow related text!", name="this is in metadata"
        )
    },
)
doc_dict = doc.to_dict()
print(Document.from_dict(doc_dict).metadata)
Produces:
{'another section!': {'text': 'another somehow related text!', 'name': 'this is in metadata'}}
However, you can work around this with cattrs by deserializing the metadata field first. A quick way of doing this is below (though you could instead set up a cattrs ‘converter’ with custom structuring hooks):
from kazu.data import kazu_json_converter
# continuing from above
doc_dict["metadata"] = kazu_json_converter.structure(
doc_dict["metadata"], dict[str, Section]
)
reloaded_doc = kazu_json_converter.structure(doc_dict, Document)
print(reloaded_doc == doc)
Produces (as expected):
True