kazu.utils.spacy_object_mapper

Classes

KazuToSpacyObjectMapper

Maps entities and text from a Section to the spaCy data model using basic_spacy_pipeline().

class kazu.utils.spacy_object_mapper.KazuToSpacyObjectMapper

Bases: object

Maps entities and text from a Section to the spaCy data model using basic_spacy_pipeline().

Attention

Providing incomplete entity_classes for your usage (or leaving it blank) can lead to errors that might only occur infrequently when processing the results, and therefore may be difficult to track down.

Therefore, set entity_classes to every entity class whose custom attribute you will read on the spaCy Tokens within the Spans returned by __call__(), whether you read it directly or via spaCy Matcher rules that check these custom attributes.

The specific problem is that if you try to read a spaCy custom attribute that doesn’t exist, you will get an error like:

AttributeError: [E046] Can't retrieve unregistered extension attribute 'drug'.
Did you forget to call the `set_extension` method?

This class uses the provided entity_classes to call set_extension. If the provided entity_classes is incomplete - say, missing "drug" - and you then try to access the drug attribute on a token in the result, you will get this error.
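The failure mode described above can be reproduced with spaCy alone. The following is a minimal sketch, not kazu's implementation: it assumes a blank English pipeline, and read_flag is a hypothetical helper introduced only for illustration. It shows that reading an unregistered attribute raises AttributeError, and that the same read succeeds once set_extension has been called.

```python
import spacy
from spacy.tokens import Token

# A blank pipeline only tokenizes, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("Aspirin reduces fever")
token = doc[0]


def read_flag(token, name):
    """Hypothetical helper: return the custom attribute, or None if unregistered."""
    try:
        return getattr(token._, name)
    except AttributeError:
        # spaCy raises AttributeError (E046) for unregistered extensions.
        return None


# "drug" has not been registered yet, so the read fails.
before = read_flag(token, "drug")

# Register the attribute; guard against double registration, which spaCy rejects.
if not Token.has_extension("drug"):
    Token.set_extension("drug", default=False)

# The same read now succeeds and returns the default value.
after = read_flag(token, "drug")
```

Checking Token.has_extension(name) before reading is one way to make downstream code tolerant of an incomplete entity_classes, at the cost of silently hiding the misconfiguration.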

__call__(section)

Convert a Section into a dictionary mapping each Entity to its spaCy Span.

Parameters:

section (Section)

Return type:

dict[Entity, Span]

__init__(entity_classes={}, set_attributes_incrementally=False)
Parameters:
  • entity_classes (Iterable[str]) – entity classes whose spaCy custom extension attributes the caller intends to access on the result of __call__(). See the note above about the need to take care here.

  • set_attributes_incrementally (bool) –

    whether to set a spaCy custom extension attribute for ‘new’ entity classes found in a Section passed to __call__(). This gives a more consistent result from __call__(), where every Span in the dictionary has the attribute for the relevant Entity’s entity class set to True on all the tokens in the span. However, it makes subtle bugs much more likely, so False is the default. See the note in the class-level docs if you are thinking about turning this on.

entity_classes

A set of entity classes known to this class. Each of these has a spaCy custom extension attribute set. If set_attributes_incrementally is True, this includes the entity_classes passed to __init__ plus every entity class encountered so far while processing Sections passed to __call__().
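The way entity_classes grows when set_attributes_incrementally is True can be illustrated with a small sketch. IncrementalRegistry and its observe method are hypothetical names, not part of kazu, and the behaviour shown for the non-incremental case (leaving unknown classes unregistered) is an assumption drawn from the description above rather than from kazu's source.

```python
from spacy.tokens import Token


class IncrementalRegistry:
    """Hypothetical sketch: tracks entity classes that have a Token extension."""

    def __init__(self, entity_classes=(), set_attributes_incrementally=False):
        self.entity_classes = set(entity_classes)
        self.set_attributes_incrementally = set_attributes_incrementally
        # Classes supplied up front are always registered.
        for name in self.entity_classes:
            self._register(name)

    @staticmethod
    def _register(name):
        # spaCy rejects re-registering an existing extension, so check first.
        if not Token.has_extension(name):
            Token.set_extension(name, default=False)

    def observe(self, entity_class):
        """Return True if an attribute for this class is registered afterwards."""
        if entity_class in self.entity_classes:
            return True
        if self.set_attributes_incrementally:
            # A 'new' class: register it and remember it for later calls.
            self._register(entity_class)
            self.entity_classes.add(entity_class)
            return True
        # Assumption: without incremental mode, unknown classes stay unregistered.
        return False


registry = IncrementalRegistry({"gene"}, set_attributes_incrementally=True)
registry.observe("disease")  # entity_classes now also contains "disease"

strict = IncrementalRegistry({"gene"})
strict.observe("anatomy")  # returns False; "anatomy" is not registered
```

In this sketch the set of registered attributes depends on which inputs have been seen so far, which is the kind of order-dependent behaviour the class-level note warns about.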