kazu.steps.linking.post_processing.disambiguation.strategies

Classes

AnnotationLevelDisambiguationStrategy

Certain entities are often mentioned by some colloquial name, even if it's technically incorrect.

DefinedElsewhereInDocumentDisambiguationStrategy

DisambiguationStrategy

The job of a DisambiguationStrategy is to produce a Set of EquivalentIdSet.

GildaTfIdfDisambiguationStrategy

PreferDefaultLabelMatchDisambiguationStrategy

Prefer ids where the entity match string is the default label (after normalisation).

PreferNearestEmbeddingToDefaultLabelDisambiguationStrategy

Prefer ids where the entity match string is nearest to the default label (as per the configured StringSimilarityScorer).

TfIdfDisambiguationStrategy

class kazu.steps.linking.post_processing.disambiguation.strategies.AnnotationLevelDisambiguationStrategy[source]

Bases: DisambiguationStrategy

Certain entities are often mentioned by some colloquial name, even if it’s technically incorrect.

In these cases, we may have an annotation_score field in the metadata_db, which indicates how widely studied the entity is. We use this annotation score as a proxy for ‘given a random mention of the entity, how likely is it that the author is referring to instance x vs instance y’. Naturally, this is a pretty unsophisticated disambiguation strategy, so it should generally only be used as a last resort!
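The idea can be illustrated with a minimal sketch. This is not Kazu's implementation: the `IdSet` dataclass, the `annotation_scores` dict and the function name are hypothetical stand-ins for EquivalentIdSet and the metadata_db lookup, and treating missing scores as 0 is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdSet:
    """Hypothetical stand-in for EquivalentIdSet."""

    ids: frozenset


def pick_by_annotation_score(id_sets, annotation_scores):
    """Keep the id set(s) whose best annotation score is the highest."""
    if not id_sets:
        return set()

    def best_score(id_set):
        # Assumption of this sketch: ids without a score count as 0.
        return max((annotation_scores.get(i, 0) for i in id_set.ids), default=0)

    top = max(best_score(s) for s in id_sets)
    # Ties are all kept, so more than one id set may be returned.
    return {s for s in id_sets if best_score(s) == top}


candidates = {IdSet(frozenset({"GENE:1"})), IdSet(frozenset({"GENE:2"}))}
scores = {"GENE:1": 5, "GENE:2": 2}
result = pick_by_annotation_score(candidates, scores)
```

Here `result` contains only the id set for the more heavily annotated `GENE:1`.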

disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]

Select a subset of EquivalentIdSet.

Parameters:
  • id_sets (set[EquivalentIdSet]) – the candidate id sets to disambiguate. The result should be either a plain subset of these, or a set of modified EquivalentIdSet instances derived from them

  • document (Document) – source document

  • parser_name (str) – name of parser that the id_set comes from

  • ent_match (str | None) – matched entity string

  • ent_match_norm (str | None) – normalised version of entity string

Returns:

Return type:

set[EquivalentIdSet]

prepare(document)[source]

Perform any preprocessing required.

Parameters:

document (Document)

Returns:

Return type:

None

class kazu.steps.linking.post_processing.disambiguation.strategies.DefinedElsewhereInDocumentDisambiguationStrategy[source]

Bases: DisambiguationStrategy

  1. look for entities on the document that have mappings

  2. filter the incoming set of EquivalentIdSet based on these mappings
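The two steps above might look roughly like the following sketch. It is one plausible reading only: id sets are modelled as plain frozensets of id strings, and "filter based on these mappings" is assumed to mean "keep id sets that share at least one id with something already mapped in the document". The function name is hypothetical.

```python
def filter_by_document_mappings(id_sets, ids_mapped_in_document):
    """Keep id sets that share at least one id with mappings already on the document."""
    return {s for s in id_sets if s & ids_mapped_in_document}


# Step 1: ids already mapped by other entities in the document.
already_mapped = {"MONDO:0001"}

# Step 2: filter the incoming candidates against those mappings.
candidates = {
    frozenset({"MONDO:0001", "MONDO:0002"}),
    frozenset({"MONDO:0009"}),
}
kept = filter_by_document_mappings(candidates, already_mapped)
```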

__init__(confidence)[source]
Parameters:

confidence (DisambiguationConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.

disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]

Select a subset of EquivalentIdSet.

Parameters:
  • id_sets (set[EquivalentIdSet]) – the candidate id sets to disambiguate. The result should be either a plain subset of these, or a set of modified EquivalentIdSet instances derived from them

  • document (Document) – source document

  • parser_name (str) – name of parser that the id_set comes from

  • ent_match (str | None) – matched entity string

  • ent_match_norm (str | None) – normalised version of entity string

Returns:

Return type:

set[EquivalentIdSet]

prepare(document)[source]

Note, this method can’t be cached, as the state of the document may change between executions.

Parameters:

document (Document)

Returns:

Return type:

None

class kazu.steps.linking.post_processing.disambiguation.strategies.DisambiguationStrategy[source]

Bases: ABC

The job of a DisambiguationStrategy is to produce a Set of EquivalentIdSet.

Warning

The EquivalentIdSets produced needn’t map to those contained within associated_id_sets. This may cause confusing behaviour during debugging.

A prepare() method is available, which can be cached in the event of any duplicated preprocessing work that may be required (see StrategyRunner for the complexities of how MappingStrategy and DisambiguationStrategy are coordinated).
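The split between prepare() and disambiguate() can be sketched with a small self-contained class hierarchy. Everything here is hypothetical (`SketchStrategy`, `KeepMentioned`, documents as plain strings, id sets as strings), and it assumes that calling a strategy runs prepare() and then disambiguate(); it only illustrates why caching the preparation avoids duplicated work.

```python
from abc import ABC, abstractmethod


class SketchStrategy(ABC):
    """Hypothetical mirror of the prepare/disambiguate split; not Kazu's class."""

    def __call__(self, id_sets, document):
        # Assumption: calling the strategy prepares first, then disambiguates.
        self.prepare(document)
        return self.disambiguate(id_sets, document)

    @abstractmethod
    def prepare(self, document): ...

    @abstractmethod
    def disambiguate(self, id_sets, document): ...


class KeepMentioned(SketchStrategy):
    """Toy strategy: keep candidates that literally occur in the document."""

    def __init__(self):
        self._tokens_by_doc = {}
        self.prepare_runs = 0

    def prepare(self, document):
        # Cache per-document preprocessing so repeated calls do no extra work.
        if document not in self._tokens_by_doc:
            self.prepare_runs += 1
            self._tokens_by_doc[document] = frozenset(document.lower().split())

    def disambiguate(self, id_sets, document):
        tokens = self._tokens_by_doc[document]
        return {s for s in id_sets if s.lower() in tokens}


strategy = KeepMentioned()
doc = "BRCA1 is a tumour suppressor"
first = strategy({"BRCA1", "TP53"}, doc)
second = strategy({"BRCA1"}, doc)
```

The second call reuses the cached tokens, so the preprocessing runs once even though the strategy is invoked twice on the same document.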

__call__(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]

Call self as a function.

Return type:

set[EquivalentIdSet]

__init__(confidence)[source]
Parameters:

confidence (DisambiguationConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.

abstract disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]

Select a subset of EquivalentIdSet.

Parameters:
  • id_sets (set[EquivalentIdSet]) – the candidate id sets to disambiguate. The result should be either a plain subset of these, or a set of modified EquivalentIdSet instances derived from them

  • document (Document) – source document

  • parser_name (str) – name of parser that the id_set comes from

  • ent_match (str | None) – matched entity string

  • ent_match_norm (str | None) – normalised version of entity string

Returns:

Return type:

set[EquivalentIdSet]

abstract prepare(document)[source]

Perform any preprocessing required.

Parameters:

document (Document)

Returns:

Return type:

None

class kazu.steps.linking.post_processing.disambiguation.strategies.GildaTfIdfDisambiguationStrategy[source]

Bases: DisambiguationStrategy

__init__(confidence, scorer, context_threshold_delta=0.01)[source]

static cacheable_build_document_representation(scorer, doc)[source]

Static cached method, so we don’t need to recalculate document representation between different instances of this class.

Returns:

Return type:

csr_matrix

disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]

Select a subset of EquivalentIdSet.

Parameters:
  • id_sets (set[EquivalentIdSet]) – the candidate id sets to disambiguate. The result should be either a plain subset of these, or a set of modified EquivalentIdSet instances derived from them

  • document (Document) – source document

  • parser_name (str) – name of parser that the id_set comes from

  • ent_match (str | None) – matched entity string

  • ent_match_norm (str | None) – normalised version of entity string

Returns:

Return type:

set[EquivalentIdSet]

prepare(document)[source]

Build document representations by parser names here, and store in a dict.

This method is cached so we don’t need to call it multiple times per document.

Parameters:

document (Document)

Returns:

Return type:

None

class kazu.steps.linking.post_processing.disambiguation.strategies.PreferDefaultLabelMatchDisambiguationStrategy[source]

Bases: DisambiguationStrategy

Prefer ids where the entity match string is the default label (after normalisation).

Note

This strategy is intended to be used with kazu.steps.linking.post_processing.mapping_strategies.strategies.ExactMatchMappingStrategy with the disambiguation_essential argument set to True.
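As a rough illustration of the preference described above, the sketch below keeps only the ids whose default label equals the normalised match string. The function name, the `default_labels` dict and the toy normaliser (collapse whitespace, uppercase) are assumptions made for the example; Kazu's own normalisation is not reproduced here.

```python
def prefer_default_label(default_labels, ent_match_norm):
    """Return ids whose normalised default label equals the normalised match."""

    def normalise(text):
        # Toy normaliser (assumption): collapse whitespace and uppercase.
        return " ".join(text.split()).upper()

    return {
        idx
        for idx, label in default_labels.items()
        if normalise(label) == ent_match_norm
    }


labels = {"ID:1": "Breast  cancer", "ID:2": "Breast carcinoma"}
chosen = prefer_default_label(labels, "BREAST CANCER")
```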

__init__(confidence)[source]
Parameters:

confidence (DisambiguationConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.

disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]

Select a subset of EquivalentIdSet.

Parameters:
  • id_sets (set[EquivalentIdSet]) – the candidate id sets to disambiguate. The result should be either a plain subset of these, or a set of modified EquivalentIdSet instances derived from them

  • document (Document) – source document

  • parser_name (str) – name of parser that the id_set comes from

  • ent_match (str | None) – matched entity string

  • ent_match_norm (str | None) – normalised version of entity string

Returns:

Return type:

set[EquivalentIdSet]

prepare(document)[source]

Perform any preprocessing required.

Parameters:

document (Document)

Returns:

Return type:

None

class kazu.steps.linking.post_processing.disambiguation.strategies.PreferNearestEmbeddingToDefaultLabelDisambiguationStrategy[source]

Bases: DisambiguationStrategy

Prefer ids where the entity match string is nearest to the default label (as per the configured StringSimilarityScorer).

In the case where multiple IDs share the same nearest embedding distance, all of them will be returned. This can happen if two ids share the same default label.
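The tie behaviour can be seen in a small sketch. It swaps in `difflib.SequenceMatcher` from the standard library as a stand-in for the configured StringSimilarityScorer (which is not an embedding model), and the function name and `default_labels` dict are hypothetical.

```python
from difflib import SequenceMatcher


def nearest_to_default_label(default_labels, ent_match):
    """Return every id whose default label is most similar to the match string."""
    if not default_labels:
        return set()

    def similarity(a, b):
        # Stand-in scorer (assumption): character-level similarity in [0, 1].
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    scores = {idx: similarity(ent_match, label) for idx, label in default_labels.items()}
    best = max(scores.values())
    # Ids that tie on the best score are all returned.
    return {idx for idx, score in scores.items() if score == best}


labels = {"ID:1": "asthma", "ID:2": "asthma", "ID:3": "eczema"}
nearest = nearest_to_default_label(labels, "Asthma")
```

Because `ID:1` and `ID:2` share a default label, both come back.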

__init__(complex_string_scorer, confidence)[source]
Parameters:
  • confidence (DisambiguationConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.

  • complex_string_scorer (StringSimilarityScorer)

disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]

Select a subset of EquivalentIdSet.

Parameters:
  • id_sets (set[EquivalentIdSet]) – the candidate id sets to disambiguate. The result should be either a plain subset of these, or a set of modified EquivalentIdSet instances derived from them

  • document (Document) – source document

  • parser_name (str) – name of parser that the id_set comes from

  • ent_match (str | None) – matched entity string

  • ent_match_norm (str | None) – normalised version of entity string

Returns:

Return type:

set[EquivalentIdSet]

prepare(document)[source]

Perform any preprocessing required.

Parameters:

document (Document)

Returns:

Return type:

None

class kazu.steps.linking.post_processing.disambiguation.strategies.TfIdfDisambiguationStrategy[source]

Bases: DisambiguationStrategy

  1. retrieve all synonyms associated with an EquivalentIdSet, filter out ambiguous ones and build a query matrix with the unambiguous ones.

  2. retrieve a list of all detected entity strings from the document, regardless of source, and build a document representation matrix of these.

  3. perform TF-IDF on the query vs the document, and sort according to the most likely synonym hit from 1.

  4. if the score is above the minimum threshold, create a mapping.
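The four steps above can be sketched in pure Python. This is an illustration under stated assumptions, not the TfIdfScorer used by Kazu: it uses a hand-rolled smoothed TF-IDF with cosine similarity, treats each synonym and entity string as a single token, and all names (`tfidf_vectors`, `disambiguate_by_context`, the example ids) are hypothetical.

```python
import math
from collections import Counter
from typing import Optional


def tfidf_vectors(texts):
    """Smoothed TF-IDF weights for each tokenised text."""
    n = len(texts)
    doc_freq = Counter(tok for toks in texts for tok in set(toks))
    return [
        {
            tok: count * (math.log((1 + n) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in Counter(toks).items()
        }
        for toks in texts
    ]


def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def disambiguate_by_context(synonyms_by_id, document_entities, threshold) -> Optional[str]:
    if not synonyms_by_id:
        return None
    # Step 1: drop synonyms shared by several candidates (ambiguous ones).
    syn_counts = Counter(s for syns in synonyms_by_id.values() for s in set(syns))
    ids = list(synonyms_by_id)
    queries = [[s for s in synonyms_by_id[i] if syn_counts[s] == 1] for i in ids]
    # Step 2: the document is represented by its detected entity strings.
    *query_vecs, doc_vec = tfidf_vectors(queries + [document_entities])
    # Step 3: score each candidate against the document and sort, best first.
    scored = sorted(((cosine(q, doc_vec), i) for q, i in zip(query_vecs, ids)), reverse=True)
    # Step 4: only accept the best candidate if it clears the threshold.
    best_score, best_id = scored[0]
    return best_id if best_score >= threshold else None


synonyms = {
    "ID:lung": ["cancer", "lung", "carcinoma", "pulmonary"],
    "ID:skin": ["cancer", "skin", "melanoma", "dermal"],
}
document_entities = ["pulmonary", "lung", "cisplatin"]
best = disambiguate_by_context(synonyms, document_entities, threshold=0.3)
```

In this example the document mentions lung-related entities, so `ID:lung` is chosen; raising the threshold far enough makes the function return `None` instead of creating a mapping.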

__init__(confidence, scorer, context_threshold=0.7, relevant_aggregation_strategies=None)[source]
Parameters:
  • confidence (DisambiguationConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.

  • scorer (TfIdfScorer) – handles scoring of contexts

  • context_threshold (float) – only consider synonyms above this search threshold

  • relevant_aggregation_strategies (Iterable[EquivalentIdAggregationStrategy] | None) – Only consider these strategies when selecting synonyms from the synonym database, when building a representation. If None, all strategies will be considered

build_id_set_representation(parser_name, id_sets)[source]
Return type:

dict[str, set[EquivalentIdSet]]

static cacheable_build_document_representation(scorer, doc, parsers)[source]

Static cached method, so we don’t need to recalculate document representation between different instances of this class.

Parameters:
  • scorer (TfIdfScorer)

  • doc (Document)

  • parsers (frozenset[str]) – technically this only has to be a hashable iterable of strings, but its elements should also be unique, otherwise duplicate work will be done and thrown away, so pragmatically a frozenset makes sense.

Returns:

Return type:

dict[str, csr_matrix]
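The reason a frozenset is a good fit for the parsers argument can be shown with a small caching sketch. `functools.lru_cache` is used here purely for illustration, since this page does not say which caching mechanism Kazu uses; `build_representation` and its return value are hypothetical.

```python
from functools import lru_cache

calls = []


@lru_cache(maxsize=32)
def build_representation(doc_text, parsers):
    """Hypothetical cached builder keyed on (doc_text, parsers)."""
    calls.append(parsers)  # record each real (uncached) computation
    return tuple((parser, len(doc_text)) for parser in sorted(parsers))


# A frozenset is hashable, ignores ordering and drops duplicates,
# so these two calls share one cache entry.
a = build_representation("some text", frozenset(["MONDO", "CHEMBL"]))
b = build_representation("some text", frozenset(["CHEMBL", "MONDO", "MONDO"]))
```

A list would not be hashable, and a tuple with a different order or duplicated names would produce a separate cache key and repeat the work.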

disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]

Select a subset of EquivalentIdSet.

Parameters:
  • id_sets (set[EquivalentIdSet]) – the candidate id sets to disambiguate. The result should be either a plain subset of these, or a set of modified EquivalentIdSet instances derived from them

  • document (Document) – source document

  • parser_name (str) – name of parser that the id_set comes from

  • ent_match (str | None) – matched entity string

  • ent_match_norm (str | None) – normalised version of entity string

Returns:

Return type:

set[EquivalentIdSet]

prepare(document)[source]

Build document representations by parser names here, and store in a dict. This method is cached so we don’t need to call it multiple times per document.

Parameters:

document (Document)

Returns:

Return type:

None