kazu.steps.linking.post_processing.disambiguation.strategies¶
Classes
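- AnnotationLevelDisambiguationStrategy: Certain entities are often mentioned by some colloquial name, even if it's technically incorrect.
- DisambiguationStrategy: The job of a DisambiguationStrategy is to produce a Set of EquivalentIdSet.
- PreferDefaultLabelMatchDisambiguationStrategy: Prefer ids where the entity match string is the default label (after normalisation).
- PreferNearestEmbeddingToDefaultLabelDisambiguationStrategy: Prefer ids where the entity match string is nearest to the default label (as per the configured StringSimilarityScorer).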
- class kazu.steps.linking.post_processing.disambiguation.strategies.AnnotationLevelDisambiguationStrategy[source]¶
Bases:
DisambiguationStrategy
Certain entities are often mentioned by some colloquial name, even if it’s technically incorrect.
In these cases, we may have an annotation_score field in the metadata_db, as a proxy for how widely studied the entity is. We use this annotation score as a proxy for ‘given a random mention of the entity, how likely is it that the author is referring to instance x vs instance y’. Naturally, this is a pretty unsophisticated disambiguation strategy, so it should generally only be used as a last resort!
- disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]¶
Select a subset of EquivalentIdSet.
- Parameters:
id_sets (set[EquivalentIdSet]) – disambiguation result should be based on these id_sets - either a standard subset, or subset based on modified EquivalentIdSet
document (Document) – source document
parser_name (str) – name of parser that the id_set comes from
ent_match (str | None) – matched entity string
ent_match_norm (str | None) – normalised version of entity string
- Returns:
- Return type:
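The annotation-score idea can be sketched as follows. This is a minimal illustration, not Kazu's implementation: the function `pick_most_annotated`, the `annotation_scores` dict and the use of `frozenset` as a stand-in for `EquivalentIdSet` are all assumptions made for the example.

```python
def pick_most_annotated(id_sets, annotation_scores):
    """Keep the candidate id set(s) whose best annotation score is highest.

    id_sets: set of frozensets of ids (hypothetical stand-in for EquivalentIdSet).
    annotation_scores: dict of id -> score (hypothetical stand-in for the
    annotation_score field in the metadata db). Ids without a score count as 0.
    """
    if not id_sets:
        return set()

    def best_score(id_set):
        return max((annotation_scores.get(idx, 0) for idx in id_set), default=0)

    top = max(best_score(id_set) for id_set in id_sets)
    # Ties are kept: every id set that reaches the top score is returned.
    return {id_set for id_set in id_sets if best_score(id_set) == top}


candidates = {frozenset({"GENE:1"}), frozenset({"GENE:2", "GENE:3"})}
scores = {"GENE:1": 5, "GENE:2": 1, "GENE:3": 2}
chosen = pick_most_annotated(candidates, scores)
```

Because the score only reflects how widely studied an entity is, the sketch ignores the document entirely, which is why a strategy like this is best kept as a last resort.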
- class kazu.steps.linking.post_processing.disambiguation.strategies.DefinedElsewhereInDocumentDisambiguationStrategy[source]¶
Bases:
DisambiguationStrategy
1. Look for entities in the document that have mappings.
2. Filter the incoming set of EquivalentIdSet based on these mappings.
- __init__(confidence)[source]¶
- Parameters:
confidence (DisambiguationConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.
- disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]¶
Select a subset of EquivalentIdSet.
- Parameters:
id_sets (set[EquivalentIdSet]) – disambiguation result should be based on these id_sets - either a standard subset, or subset based on modified EquivalentIdSet
document (Document) – source document
parser_name (str) – name of parser that the id_set comes from
ent_match (str | None) – matched entity string
ent_match_norm (str | None) – normalised version of entity string
- Returns:
- Return type:
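A rough sketch of the filtering step, assuming the ids already mapped elsewhere in the document have been collected into a plain set. The function name, the `ids_mapped_in_document` argument and the `frozenset` stand-in for `EquivalentIdSet` are hypothetical and only serve the illustration.

```python
def filter_by_document_mappings(id_sets, ids_mapped_in_document):
    """Keep candidate id sets that share at least one id with existing mappings.

    id_sets: set of frozensets of ids (hypothetical stand-in for EquivalentIdSet).
    ids_mapped_in_document: set of ids that other entities in the same
    document were already mapped to (assumed to be collected beforehand).
    """
    return {id_set for id_set in id_sets if id_set & ids_mapped_in_document}


candidates = {frozenset({"MONDO:1"}), frozenset({"MONDO:2", "MONDO:3"})}
already_mapped = {"MONDO:3", "MONDO:9"}
kept = filter_by_document_mappings(candidates, already_mapped)
```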
- class kazu.steps.linking.post_processing.disambiguation.strategies.DisambiguationStrategy[source]¶
Bases:
ABC
The job of a DisambiguationStrategy is to produce a Set of EquivalentIdSet.
Warning
The EquivalentIdSets produced needn’t map to those contained within associated_id_sets. This may cause confusing behaviour during debugging.
A prepare() method is available, which can be cached in the event of any duplicated preprocessing work that may be required (see StrategyRunner for the complexities of how MappingStrategy and DisambiguationStrategy are coordinated).
- __call__(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]¶
Call self as a function.
- Parameters:
- Return type:
- __init__(confidence)[source]¶
- Parameters:
confidence (DisambiguationConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.
- abstract disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]¶
Select a subset of EquivalentIdSet.
- Parameters:
id_sets (set[EquivalentIdSet]) – disambiguation result should be based on these id_sets - either a standard subset, or subset based on modified EquivalentIdSet
document (Document) – source document
parser_name (str) – name of parser that the id_set comes from
ent_match (str | None) – matched entity string
ent_match_norm (str | None) – normalised version of entity string
- Returns:
- Return type:
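The shape of this interface can be imitated with a small stand-alone sketch. It does not import Kazu: `SketchStrategy` and `KeepSingletons` are hypothetical classes, `frozenset` stands in for `EquivalentIdSet`, and the assumption that `__call__` simply delegates to `disambiguate` is made for illustration only.

```python
from abc import ABC, abstractmethod


class SketchStrategy(ABC):
    """Hypothetical stand-in that mirrors the documented method signatures."""

    def __init__(self, confidence):
        # A label for human users; it does not influence the algorithm.
        self.confidence = confidence

    @abstractmethod
    def disambiguate(self, id_sets, document, parser_name, ent_match=None, ent_match_norm=None):
        """Return a subset of id_sets."""

    def __call__(self, id_sets, document, parser_name, ent_match=None, ent_match_norm=None):
        # Assumption: calling the strategy delegates straight to disambiguate().
        return self.disambiguate(id_sets, document, parser_name, ent_match, ent_match_norm)


class KeepSingletons(SketchStrategy):
    """Toy subclass: keep only id sets that contain exactly one id."""

    def disambiguate(self, id_sets, document, parser_name, ent_match=None, ent_match_norm=None):
        return {id_set for id_set in id_sets if len(id_set) == 1}


strategy = KeepSingletons(confidence="possible")
result = strategy({frozenset({"a"}), frozenset({"b", "c"})}, document=None, parser_name="demo")
```

Subclasses only need to supply `disambiguate`; instantiating the abstract base directly raises `TypeError`.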
- class kazu.steps.linking.post_processing.disambiguation.strategies.GildaTfIdfDisambiguationStrategy[source]¶
Bases:
DisambiguationStrategy
- __init__(confidence, scorer, context_threshold_delta=0.01)[source]¶
- Parameters:
confidence (DisambiguationConfidence)
scorer (GildaTfIdfScorer)
context_threshold_delta (float) – If the maximum delta between the top two EquivalentIdSets is below this value, assume disambiguation has failed
- static cacheable_build_document_representation(scorer, doc)[source]¶
Static cached method, so we don’t need to recalculate document representation between different instances of this class.
- Parameters:
scorer (GildaTfIdfScorer)
doc (Document)
- Returns:
- Return type:
- disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]¶
Select a subset of EquivalentIdSet.
- Parameters:
id_sets (set[EquivalentIdSet]) – disambiguation result should be based on these id_sets - either a standard subset, or subset based on modified EquivalentIdSet
document (Document) – source document
parser_name (str) – name of parser that the id_set comes from
ent_match (str | None) – matched entity string
ent_match_norm (str | None) – normalised version of entity string
- Returns:
- Return type:
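The role of `context_threshold_delta` can be illustrated with a short sketch. It assumes the context scores have already been computed and are passed in as a dict; the function name `pick_if_clear_winner` and the `frozenset` stand-in for `EquivalentIdSet` are hypothetical, and the real scoring is done by the `GildaTfIdfScorer`.

```python
def pick_if_clear_winner(scored_id_sets, context_threshold_delta=0.01):
    """Return the top-scoring id set only if it clearly beats the runner-up.

    scored_id_sets: dict of id set -> context score (assumed precomputed).
    An empty set signals that disambiguation failed.
    """
    ranked = sorted(scored_id_sets.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return set()
    if len(ranked) > 1:
        delta = ranked[0][1] - ranked[1][1]
        if delta < context_threshold_delta:
            # The top two candidates are too close to call.
            return set()
    return {ranked[0][0]}


a, b = frozenset({"HGNC:1"}), frozenset({"HGNC:2"})
clear = pick_if_clear_winner({a: 0.82, b: 0.40})
too_close = pick_if_clear_winner({a: 0.500, b: 0.495})
```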
- class kazu.steps.linking.post_processing.disambiguation.strategies.PreferDefaultLabelMatchDisambiguationStrategy[source]¶
Bases:
DisambiguationStrategy
Prefer ids where the entity match string is the default label (after normalisation).
Note
This strategy is intended to be used with kazu.steps.linking.post_processing.mapping_strategies.strategies.ExactMatchMappingStrategy with the disambiguation_essential argument set to True.
- __init__(confidence)[source]¶
- Parameters:
confidence (DisambiguationConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.
- disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]¶
Select a subset of EquivalentIdSet.
- Parameters:
id_sets (set[EquivalentIdSet]) – disambiguation result should be based on these id_sets - either a standard subset, or subset based on modified EquivalentIdSet
document (Document) – source document
parser_name (str) – name of parser that the id_set comes from
ent_match (str | None) – matched entity string
ent_match_norm (str | None) – normalised version of entity string
- Returns:
- Return type:
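A minimal sketch of the default-label preference, under stated assumptions: `normalise` is a deliberately simple stand-in for Kazu's string normalisation, `default_labels` is a hypothetical dict of id to default label, and `frozenset` stands in for `EquivalentIdSet`.

```python
def normalise(text):
    """Toy normalisation: lowercase and collapse whitespace (an assumption)."""
    return " ".join(text.lower().split())


def prefer_default_label(id_sets, default_labels, ent_match):
    """Keep id sets containing an id whose default label equals the match."""
    target = normalise(ent_match)
    return {
        id_set
        for id_set in id_sets
        if any(
            idx in default_labels and normalise(default_labels[idx]) == target
            for idx in id_set
        )
    }


candidates = {frozenset({"D:1"}), frozenset({"D:2"})}
labels = {"D:1": "Asthma", "D:2": "Asthma, exercise-induced"}
preferred = prefer_default_label(candidates, labels, "  asthma ")
```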
- class kazu.steps.linking.post_processing.disambiguation.strategies.PreferNearestEmbeddingToDefaultLabelDisambiguationStrategy[source]¶
Bases:
DisambiguationStrategy
Prefer ids where the entity match string is nearest to the default label (as per the configured StringSimilarityScorer).
In the case where multiple IDs share the same nearest embedding distance, multiple IDs will be returned. This can happen if there are two ids that share the same default label.
- __init__(complex_string_scorer, confidence)[source]¶
- Parameters:
confidence (DisambiguationConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.
complex_string_scorer (StringSimilarityScorer)
- disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]¶
Select a subset of EquivalentIdSet.
- Parameters:
id_sets (set[EquivalentIdSet]) – disambiguation result should be based on these id_sets - either a standard subset, or subset based on modified EquivalentIdSet
document (Document) – source document
parser_name (str) – name of parser that the id_set comes from
ent_match (str | None) – matched entity string
ent_match_norm (str | None) – normalised version of entity string
- Returns:
- Return type:
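The tie behaviour described for this class can be shown with a small sketch. It swaps in `difflib.SequenceMatcher` from the standard library as a stand-in for the configured `StringSimilarityScorer`, so the scores are not embedding distances; `nearest_to_default_label` and the `ids_to_labels` dict are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer (not an embedding model): case-insensitive ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def nearest_to_default_label(ids_to_labels, ent_match, scorer=similarity):
    """Return every id whose default label scores highest against the match."""
    if not ids_to_labels:
        return set()
    scores = {idx: scorer(ent_match, label) for idx, label in ids_to_labels.items()}
    best = max(scores.values())
    # Ids sharing the same default label get the same score, so all are returned.
    return {idx for idx, score in scores.items() if score == best}


labels = {"ID:1": "insulin", "ID:2": "insulin", "ID:3": "interleukin"}
winners = nearest_to_default_label(labels, "Insulin")
```

Here `ID:1` and `ID:2` share a default label, so both come back and a later strategy would still have to choose between them.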
- class kazu.steps.linking.post_processing.disambiguation.strategies.TfIdfDisambiguationStrategy[source]¶
Bases:
DisambiguationStrategy
1. Retrieve all synonyms associated with an EquivalentIdSet, filter out ambiguous ones and build a query matrix with the unambiguous ones.
2. Retrieve a list of all detected entity strings from the document, regardless of source, and build a document representation matrix of these.
3. Perform TF-IDF on the query vs the document, and sort according to the most likely synonym hit from 1.
4. If the score is above the minimum threshold, create a mapping.
- __init__(confidence, scorer, context_threshold=0.7, relevant_aggregation_strategies=None)[source]¶
- Parameters:
confidence (DisambiguationConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.
scorer (TfIdfScorer) – handles scoring of contexts
context_threshold (float) – only consider synonyms above this search threshold
relevant_aggregation_strategies (Iterable[EquivalentIdAggregationStrategy] | None) – Only consider these strategies when selecting synonyms from the synonym database, when building a representation. If None, all strategies will be considered
- build_id_set_representation(parser_name, id_sets)[source]¶
- Parameters:
parser_name (str)
id_sets (set[EquivalentIdSet])
- Return type:
- static cacheable_build_document_representation(scorer, doc, parsers)[source]¶
Static cached method, so we don’t need to recalculate document representation between different instances of this class.
- Parameters:
scorer (TfIdfScorer)
doc (Document)
parsers (frozenset[str]) – technically this only has to be a hashable iterable of strings, but its elements should also be unique, otherwise duplicate work will be done and thrown away, so pragmatically a frozenset makes sense.
- Returns:
- Return type:
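Why a hashable, order-independent `parsers` argument suits a cached static method can be seen in the sketch below. It assumes the cache behaves like `functools.lru_cache`, which is not confirmed by this page; `SketchStrategy`, the tuple-based `doc` and the `build_calls` list are hypothetical and exist only for the demonstration.

```python
from functools import lru_cache

build_calls = []  # records every real (uncached) build, for illustration only


class SketchStrategy:
    @staticmethod
    @lru_cache(maxsize=None)
    def cacheable_build_document_representation(doc, parsers):
        # All arguments must be hashable to serve as a cache key.
        build_calls.append(parsers)
        return tuple(text.lower() for parser, text in doc if parser in parsers)


# Hypothetical document: (parser name, entity string) pairs in a hashable tuple.
doc = (
    ("gene_parser", "BRCA1"),
    ("disease_parser", "Breast Cancer"),
    ("gene_parser", "TP53"),
)

# Two different instances, with the parser names given in a different order.
first = SketchStrategy().cacheable_build_document_representation(
    doc, frozenset(["gene_parser", "disease_parser"])
)
second = SketchStrategy().cacheable_build_document_representation(
    doc, frozenset(["disease_parser", "gene_parser"])
)
```

Both frozensets compare and hash equal, so the second call is served from the cache. A tuple of parser names would also be hashable, but a different ordering would produce a separate cache entry and repeat the work.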
- disambiguate(id_sets, document, parser_name, ent_match=None, ent_match_norm=None)[source]¶
Select a subset of EquivalentIdSet.
- Parameters:
id_sets (set[EquivalentIdSet]) – disambiguation result should be based on these id_sets - either a standard subset, or subset based on modified EquivalentIdSet
document (Document) – source document
parser_name (str) – name of parser that the id_set comes from
ent_match (str | None) – matched entity string
ent_match_norm (str | None) – normalised version of entity string
- Returns:
- Return type:
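The overall flow of this strategy (build a representation per candidate, build one for the document, score, then apply a threshold) can be approximated in pure Python. This is a simplified sketch, not the `TfIdfScorer`: it uses word-level tokens, a smoothed idf and cosine similarity, and the names `tfidf_vectors`, `cosine`, `rank_candidates` and the `candidate_synonyms` dict are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one {token: weight} dict per text, using a smoothed idf."""
    tokenised = [Counter(text.lower().split()) for text in texts]
    n = len(texts)
    doc_freq = Counter(tok for counts in tokenised for tok in counts)
    return [
        {tok: tf * (math.log((1 + n) / (1 + doc_freq[tok])) + 1) for tok, tf in counts.items()}
        for counts in tokenised
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank_candidates(candidate_synonyms, document_entities, context_threshold=0.7):
    """Score each candidate's synonyms against the document's entity strings.

    candidate_synonyms: dict of id -> list of unambiguous synonyms (hypothetical).
    Returns (id, score) pairs at or above the threshold, best first.
    """
    ids = list(candidate_synonyms)
    texts = [" ".join(candidate_synonyms[i]) for i in ids] + [" ".join(document_entities)]
    *queries, doc = tfidf_vectors(texts)
    scored = sorted(((cosine(query, doc), i) for query, i in zip(queries, ids)), reverse=True)
    return [(i, score) for score, i in scored if score >= context_threshold]


candidates = {
    "ID:heart": ["cardiac arrest", "heart attack"],
    "ID:network": ["network attack", "cyber attack"],
}
entities_in_document = ["heart attack", "cardiac arrest", "aspirin"]
ranked = rank_candidates(candidates, entities_in_document)
```

With these inputs only the cardiac candidate clears the 0.7 threshold, because the other candidate shares just the word "attack" with the document.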