kazu.steps.other.merge_overlapping_ents

Classes

MergeOverlappingEntsStep

This step merges overlapping and nested entities.

class kazu.steps.other.merge_overlapping_ents.MergeOverlappingEntsStep[source]

Bases: Step

This step merges overlapping and nested entities.

The final result should not contain any overlapping entities. See the algorithm description below.

__call__(doc)[source]

Process documents and respond with processed and failed documents.

Note that many steps will be decorated by document_iterating_step() or document_batch_step() which will modify the ‘original’ __call__ function signature to match the expected signature for a step, as the decorators handle the exception/failed documents logic for you.

Parameters:
Returns:

The first element is all the provided docs (now modified by the processing), the second is the docs that failed to (fully) process correctly.

Return type:

tuple[list[Document], list[Document]]

__init__(ent_class_preferred_order, ignore_non_contiguous=True)[source]

The algorithm for selecting an entity span is as follows:

  1. group entities by location

    In this context, a location is a span of text represented by a tuple of start and end character indices. A location covers all contiguous and non-contiguous entities that overlap in some way, even if not directly. For example, if A overlaps B but not C, and B overlaps C, then entities A, B and C are all considered to be part of the same location.

  2. sort entities within each location, picking the best according to the following sort logic:

    1. prefer entities with mappings

    2. prefer longest spans

    3. prefer entities as configured by ent_class_preferred_order (see param description below)

    4. prefer entities by level of confidence of entity mention

    5. If all of the above are equal, the preferred entity is selected on the basis of the entity class name (in reverse alphabetical order). Warning: this last sort criterion is arbitrary.
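The sort logic in step 2 can be sketched as a single tuple sort key. This is a minimal illustration, not the Kazu implementation: the `Ent` dataclass, its field names, the integer `confidence` value and the helpers `make_sort_key` and `rank_entities` are all hypothetical stand-ins. The only details taken from the description above are the order of the criteria, the rule that unlisted classes get priority 0, and the reverse alphabetical tie-break.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ent:
    """Hypothetical stand-in for an entity; not the Kazu Entity class."""

    text: str
    start: int
    end: int  # exclusive end char index (assumption)
    entity_class: str
    confidence: int = 0  # higher means more confident (assumption)
    has_mappings: bool = False


def make_sort_key(ent_class_preferred_order):
    # The first class in the list gets the highest priority;
    # classes that are not listed fall back to 0 (the lowest).
    n = len(ent_class_preferred_order)
    priority = {cls: n - i for i, cls in enumerate(ent_class_preferred_order)}

    def key(ent):
        return (
            ent.has_mappings,                   # 1. prefer entities with mappings
            ent.end - ent.start,                # 2. prefer longest spans
            priority.get(ent.entity_class, 0),  # 3. configured class order
            ent.confidence,                     # 4. mention confidence
            ent.entity_class,                   # 5. arbitrary tie-break on class name
        )

    return key


def rank_entities(ents, ent_class_preferred_order):
    # reverse=True puts the largest key first, which also yields the
    # reverse alphabetical order used for the final tie-break.
    return sorted(ents, key=make_sort_key(ent_class_preferred_order), reverse=True)


ents = [
    Ent("BRCA1", 0, 5, "gene", has_mappings=True),
    Ent("BRCA1", 0, 5, "drug", has_mappings=True),
    Ent("BRCA1 mutation", 0, 14, "disease"),
]
ranked = rank_entities(ents, ["gene", "drug"])
preferred, others = ranked[0], ranked[1:]
```

In this sketch the longer `disease` span still ranks last, because having mappings is compared before span length.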

Parameters:
  • ent_class_preferred_order (list[str]) – order of namespaces to prefer. Any partially overlapped entities are eliminated according to this ordering (first = higher priority). If an entity class is not specified, it is assumed to have a priority of 0 (i.e. the lowest)

  • ignore_non_contiguous (bool) – whether non-contiguous entities should be excluded from the merge process

filter_ents_across_class(ents)[source]

Choose the best entities per location.

Parameters:

ents (dict[tuple[int, int], set[Entity]])

Returns:

Return type:

list[Entity]

group_entities_by_location(entities)[source]
Parameters:

entities (list[Entity])

Returns:

dict of locations to set[Entity]

Return type:

dict[tuple[int, int], set[Entity]]
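
The grouping of transitively overlapping entities into locations can be sketched with a sorted sweep over spans. This is an illustration under stated assumptions rather than the actual method body: the function name `group_by_location` is hypothetical, entities are reduced to plain `(start, end, label)` tuples with an exclusive end index, only contiguous spans are handled, and keying each location by its overall `(min start, max end)` is an assumption about how a location is represented.

```python
def group_by_location(spans):
    """Group (start, end, label) tuples into locations.

    Hypothetical sketch: the end index is treated as exclusive and each
    location is keyed by the overall (start, end) of its group.
    """
    locations = {}
    current = []
    cur_start = cur_end = None
    for start, end, label in sorted(spans):
        if current and start < cur_end:
            # Overlaps the running location, directly or via an earlier span.
            current.append(label)
            cur_end = max(cur_end, end)
        else:
            # No overlap: close the previous location and start a new one.
            if current:
                locations[(cur_start, cur_end)] = set(current)
            current = [label]
            cur_start, cur_end = start, end
    if current:
        locations[(cur_start, cur_end)] = set(current)
    return locations


# A overlaps B but not C; B overlaps C; D stands alone.
spans = [(0, 5, "A"), (3, 10, "B"), (8, 12, "C"), (20, 25, "D")]
groups = group_by_location(spans)
# groups == {(0, 12): {"A", "B", "C"}, (20, 25): {"D"}}
```

Because the spans are visited in start order while the running end index is extended, A and C end up in the same location even though they never overlap directly.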

select_preferred_entity(ents)[source]
Parameters:

ents (set[Entity])

Returns:

A tuple of the preferred Entity and a list of the other entities at this location.

Return type:

tuple[Entity, list[Entity]]