kazu.steps.other.merge_overlapping_ents
Classes
MergeOverlappingEntsStep: This step merges overlapping and nested entities.
- class kazu.steps.other.merge_overlapping_ents.MergeOverlappingEntsStep
Bases: Step
This step merges overlapping and nested entities.
The final result should not contain any overlapping entities (see the algorithm description below).
- __call__(doc)
Process documents and respond with processed and failed documents.
Note that many steps will be decorated by document_iterating_step() or document_batch_step(), which will modify the ‘original’ __call__ function signature to match the expected signature for a step, as the decorators handle the exception/failed documents logic for you.
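As a rough illustration of what such a decorator does, the sketch below wraps a per-document method so that it accepts a list of documents and returns both the processed and the failed ones. It is not Kazu's implementation: `Doc`, `iterating_step_sketch` and `UppercaseStep` are hypothetical names invented for this example, and the exact return shape of the real decorators is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Doc:
    """Hypothetical stand-in for a document object (not Kazu's Document)."""

    text: str
    notes: list[str] = field(default_factory=list)


def iterating_step_sketch(
    per_doc_fn: Callable[[object, Doc], None],
) -> Callable[[object, list[Doc]], tuple[list[Doc], list[Doc]]]:
    """Turn a method that handles one doc into one that handles a list.

    The wrapper catches exceptions per document, so the wrapped method
    only has to contain the per-document logic.
    """

    def step_call(self: object, docs: list[Doc]) -> tuple[list[Doc], list[Doc]]:
        failed: list[Doc] = []
        for doc in docs:
            try:
                per_doc_fn(self, doc)
            except Exception as exc:
                # Record the failure on the doc and keep going.
                doc.notes.append(f"failed: {exc!r}")
                failed.append(doc)
        # Assumption: all docs are returned, plus the subset that failed.
        return docs, failed

    return step_call


class UppercaseStep:
    """Toy step: the decorated __call__ is written for a single document."""

    @iterating_step_sketch
    def __call__(self, doc: Doc) -> None:
        if not doc.text:
            raise ValueError("empty document")
        doc.text = doc.text.upper()
```

Calling `UppercaseStep()([Doc("abc"), Doc("")])` in this sketch returns both documents as processed and the empty one as failed, which is why the decorated `__call__` no longer has the signature it was written with.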
- __init__(ent_class_preferred_order, ignore_non_contiguous=True)
The algorithm for selecting an entity span is as follows:
1. Group entities by location.
   In this context, a location is a span of text represented by a (start, end) char index tuple. A location covers all contiguous and non-contiguous entities that overlap in some way, even if not directly. E.g. if A overlaps B but not C, and B overlaps C, then entities A, B and C are all considered to be part of the same location.
2. Sort entities within each location, picking the best according to the following sort logic:
   1. prefer entities with mappings
   2. prefer longest spans
   3. prefer entities as configured by ent_class_preferred_order (see param description below)
   4. prefer entities by level of confidence of entity mention
   If all of the above are equal, the preferred entity is selected on the basis of the entity class name (reverse alphabetically ordered). Warning: this last sort criterion is arbitrary.
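The grouping and sorting steps described above can be sketched in a few lines of plain Python. This is an illustrative reading of the description, not Kazu's code: `Ent`, `group_by_location` and `pick_best` are hypothetical names, confidence is modelled as a plain integer, span ends are assumed to be exclusive, and "reverse alphabetically ordered" is read as "the alphabetically later class name wins".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ent:
    """Hypothetical minimal entity (not Kazu's Entity class)."""

    start: int
    end: int  # exclusive end index (assumption)
    entity_class: str
    has_mappings: bool = False
    confidence: int = 0  # higher = more confident (assumption)


def group_by_location(ents: list[Ent]) -> list[list[Ent]]:
    """Group entities whose spans overlap directly or transitively."""
    groups: list[list[Ent]] = []
    current_end = -1
    for ent in sorted(ents, key=lambda e: (e.start, e.end)):
        if groups and ent.start < current_end:
            # Overlaps the running location: extend it.
            groups[-1].append(ent)
            current_end = max(current_end, ent.end)
        else:
            # No overlap with anything seen so far: start a new location.
            groups.append([ent])
            current_end = ent.end
    return groups


def pick_best(group: list[Ent], class_priority: dict[str, int]) -> Ent:
    """Apply the sort criteria in order; later keys only break ties."""
    return max(
        group,
        key=lambda e: (
            e.has_mappings,                         # 1. mappings first
            e.end - e.start,                        # 2. longest span
            class_priority.get(e.entity_class, 0),  # 3. configured class order
            e.confidence,                           # 4. mention confidence
            e.entity_class,                         # arbitrary final tie-break
        ),
    )


a = Ent(0, 5, "gene")
b = Ent(3, 10, "disease", has_mappings=True)
c = Ent(8, 12, "drug")
d = Ent(20, 25, "gene")
locations = group_by_location([c, a, d, b])
winners = [pick_best(loc, {}) for loc in locations]
```

Here A overlaps B and B overlaps C, so all three share one location even though A and C do not touch; `winners` is `[b, d]` because B is the only entity in its location with mappings.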
- Parameters:
ent_class_preferred_order (list[str]) – order of namespaces to prefer. Any partially overlapped entities are eliminated according to this ordering (first = highest priority). If an entity class is not specified, it is assumed to have a priority of 0 (i.e. the lowest).
ignore_non_contiguous (bool) – should non-contiguous entities be excluded from the merge process?
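One way to picture the ent_class_preferred_order semantics is as a lookup table from class name to integer priority, with unlisted classes defaulting to 0. The helper below is a hypothetical sketch under that assumption; `build_class_priority` and the class names are invented for illustration, and the library may store the ordering differently.

```python
def build_class_priority(ent_class_preferred_order: list[str]) -> dict[str, int]:
    """Map each listed class to a positive priority; first = highest.

    Listed classes get values from len(order) down to 1, so any class
    missing from the list can fall back to 0, the lowest priority.
    """
    n = len(ent_class_preferred_order)
    return {cls: n - i for i, cls in enumerate(ent_class_preferred_order)}


priority = build_class_priority(["drug", "disease", "gene"])
listed = priority.get("drug", 0)       # first in the list, so highest
unlisted = priority.get("anatomy", 0)  # not configured, so lowest (0)
```

Keeping the listed values strictly above 0 means a configured class always outranks an unconfigured one, which matches the "priority of 0" wording above.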