kazu.steps.linking.post_processing.mapping_strategies.strategies
Classes
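| ExactMatchMappingStrategy | Returns any exact matches. |
| MappingFactory | Factory class to produce mappings. |
| MappingStrategy | A MappingStrategy is responsible for actualising instances of Mapping. |
| StrongMatchMappingStrategy | Sorts CandidatesToMetrics by search score and keeps those scoring within the differential of the best match. |
| StrongMatchWithEmbeddingConfirmationStringMatchingStrategy | Same as parent class, but a complex string scorer with a predefined threshold is used to confirm that the ent_match is broadly similar to one of the candidates attached to the CandidatesToMetrics. |
| SymbolMatchMappingStrategy | Split both query and reference terms by whitespace. |
| SynNormIsSubStringMappingStrategy | For a CandidatesToMetrics, see if any of their synonym_norm values are string matches of the match_norm tokens, based on whitespace tokenisation. |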
- class kazu.steps.linking.post_processing.mapping_strategies.strategies.ExactMatchMappingStrategy
Bases:
MappingStrategy
Returns any exact matches.
- static filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)
Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.
Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).
- Parameters:
ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.
- Returns:
a dict of filtered candidates
- Return type:
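As a rough illustration of what an exact-match filter could look like, the sketch below uses simplified stand-ins for LinkingCandidate and LinkingMetrics. The names `ToyCandidate`, `ToyMetrics`, `filter_exact` and the `exact_match` flag are assumptions made for this example, not kazu's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCandidate:
    """Hypothetical stand-in for LinkingCandidate."""

    synonym_norm: str


@dataclass(frozen=True)
class ToyMetrics:
    """Hypothetical stand-in for LinkingMetrics."""

    exact_match: bool = False


def filter_exact(candidates: dict[ToyCandidate, ToyMetrics]) -> dict[ToyCandidate, ToyMetrics]:
    # Keep only the candidates whose metrics carry the (assumed) exact-match flag.
    return {cand: metrics for cand, metrics in candidates.items() if metrics.exact_match}


candidates = {
    ToyCandidate("EGFR"): ToyMetrics(exact_match=True),
    ToyCandidate("EGFR2"): ToyMetrics(exact_match=False),
}
exact = filter_exact(candidates)
```

Here `exact` contains only the `EGFR` candidate; if several candidates were flagged, all of them would be carried forward for disambiguation.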
- class kazu.steps.linking.post_processing.mapping_strategies.strategies.MappingFactory
Bases:
object
Factory class to produce mappings.
- static create_mapping(parser_name, source, idx, string_match_strategy, string_match_confidence, disambiguation_strategy=None, disambiguation_confidence=None, additional_metadata=None, xref_source_parser_name=None)
- Parameters:
parser_name (str)
source (str)
idx (str)
string_match_strategy (str)
string_match_confidence (StringMatchConfidence)
disambiguation_strategy (str | None)
disambiguation_confidence (DisambiguationConfidence | None)
additional_metadata (dict | None)
xref_source_parser_name (str | None)
- Return type:
- static create_mapping_from_id_set(id_set, parser_name, string_match_strategy, string_match_confidence, disambiguation_strategy, disambiguation_confidence=None, additional_metadata=None)
- Parameters:
id_set (EquivalentIdSet)
parser_name (str)
string_match_strategy (str)
string_match_confidence (StringMatchConfidence)
disambiguation_strategy (str | None)
disambiguation_confidence (DisambiguationConfidence | None)
additional_metadata (dict | None)
- Return type:
- static create_mapping_from_id_sets(id_sets, parser_name, string_match_strategy, string_match_confidence, disambiguation_strategy, disambiguation_confidence=None, additional_metadata=None)
- Parameters:
id_sets (set[EquivalentIdSet])
parser_name (str)
string_match_strategy (str)
string_match_confidence (StringMatchConfidence)
disambiguation_strategy (str | None)
disambiguation_confidence (DisambiguationConfidence | None)
additional_metadata (dict | None)
- Return type:
- class kazu.steps.linking.post_processing.mapping_strategies.strategies.MappingStrategy
Bases:
ABC
A MappingStrategy is responsible for actualising instances of Mapping.
This is performed in two steps:
1. Filter the set of CandidatesToMetrics associated with an Entity down to the most appropriate ones (e.g. based on string similarity).
2. If required, apply any configured DisambiguationStrategy to the filtered instances of EquivalentIdSet.
Selected instances of EquivalentIdSet are converted to Mapping.
- __call__(ent_match, ent_match_norm, document, candidates)
- Parameters:
ent_match (str) – unnormalised NER string match (i.e. Entity.match)
ent_match_norm (str) – normalised NER string match (i.e. Entity.match_norm)
document (Document) – originating document
candidates (dict[LinkingCandidate, LinkingMetrics]) – set of candidates to consider. Note that candidates from different parsers should not be mixed.
- Returns:
- Return type:
- __init__(confidence, disambiguation_strategies=None, disambiguation_essential=False)
- Parameters:
confidence (StringMatchConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.
disambiguation_strategies (list[DisambiguationStrategy] | None) – after filter_candidates() is called, these strategies are triggered if multiple entries of CandidatesToMetrics remain and/or any of them are ambiguous.
disambiguation_essential (bool) – disambiguation strategies MUST deliver a result in order for this strategy to pass.
- disambiguate_if_required(filtered_candidates, document, parser_name, ent_match, ent_match_norm)
Applies disambiguation strategies if configured, and either len(filtered_candidates) > 1 or any of the filtered_candidates are ambiguous. If ids are still ambiguous after all strategies have run, the disambiguation confidence will be DisambiguationConfidence.AMBIGUOUS.
- Parameters:
filtered_candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to disambiguate
document (Document) – originating Document
parser_name (str) – parser name associated with these candidates
ent_match (str) – string of entity to be disambiguated
ent_match_norm (str) – normalised string of entity to be disambiguated
- Returns:
- Return type:
tuple[set[EquivalentIdSet], str | None, DisambiguationConfidence | None]
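The trigger condition and fall-through behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not kazu's implementation: id sets are modelled as plain frozensets, strategies as named callables, the "RESOLVED" label is invented, and returning the not-required marker as the strategy name is a guess.

```python
from typing import Callable, Optional

# Hypothetical: a strategy is a (name, callable) pair; the callable returns
# the subset of id sets it could resolve (empty if it could not decide).
NamedStrategy = tuple[str, Callable[[set], set]]

NOT_REQUIRED = "disambiguation_not_required"


def disambiguate_if_required(
    id_sets: set,
    strategies: list[NamedStrategy],
    is_ambiguous: Callable[[frozenset], bool],
) -> tuple[set, Optional[str], Optional[str]]:
    # Disambiguation is only needed for multiple id sets or an ambiguous one.
    if len(id_sets) <= 1 and not any(is_ambiguous(s) for s in id_sets):
        return id_sets, NOT_REQUIRED, None  # assumed return shape
    for name, strategy in strategies:
        resolved = strategy(id_sets)
        if resolved:
            return resolved, name, "RESOLVED"  # invented confidence label
    # Nothing resolved the ambiguity: carry everything forward as ambiguous.
    return id_sets, None, "AMBIGUOUS"


def has_many_ids(id_set: frozenset) -> bool:
    # Toy ambiguity test: an id set holding more than one id.
    return len(id_set) > 1


def prefer_known_id(id_sets: set) -> set:
    # Toy strategy: keep id sets containing an id assumed to be known from context.
    return {s for s in id_sets if "ID:1" in s}


id_sets = {frozenset({"ID:1"}), frozenset({"ID:2"})}
resolved = disambiguate_if_required(id_sets, [("prefer_known_id", prefer_known_id)], has_many_ids)
```

With no strategies configured, the same call falls through to the final branch and labels the result as ambiguous.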
- abstract filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)
Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.
Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).
- Parameters:
ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.
- Returns:
a dict of filtered candidates
- Return type:
- prepare(document)
Perform any setup that needs to run once per document.
Care should be taken if trying to cache this step, as the Document state is liable to change between executions.
- Parameters:
document (Document)
- Returns:
- Return type:
None
- DISAMBIGUATION_NOT_REQUIRED = 'disambiguation_not_required'
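The division of labour in MappingStrategy (subclasses override filter_candidates, while the base class drives the overall flow) resembles a template-method pattern. The sketch below is a toy analogue only: `ToyMappingStrategy` and `ToyBestScore` are hypothetical names, candidates are reduced to a synonym-to-score dict, and the disambiguation step is left out.

```python
from abc import ABC, abstractmethod


class ToyMappingStrategy(ABC):
    """Hypothetical, heavily simplified analogue of MappingStrategy."""

    @abstractmethod
    def filter_candidates(self, ent_match_norm: str, candidates: dict[str, float]) -> dict[str, float]:
        """Step 1: narrow the candidates (here: synonym -> search score)."""

    def __call__(self, ent_match_norm: str, candidates: dict[str, float]) -> list[dict]:
        filtered = self.filter_candidates(ent_match_norm, candidates)
        # Step 2 (disambiguation) is omitted in this sketch; every surviving
        # candidate is turned directly into a toy "mapping" record.
        return [
            {"match": ent_match_norm, "synonym": synonym, "score": score}
            for synonym, score in filtered.items()
        ]


class ToyBestScore(ToyMappingStrategy):
    def filter_candidates(self, ent_match_norm, candidates):
        # Keep only the candidate(s) sharing the highest search score.
        if not candidates:
            return {}
        best = max(candidates.values())
        return {syn: score for syn, score in candidates.items() if score == best}


mappings = ToyBestScore()("EGFR", {"EGFR": 100.0, "EGFR2": 91.0})
```

A new filtering rule then only requires a new subclass; the calling code stays the same.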
- class kazu.steps.linking.post_processing.mapping_strategies.strategies.StrongMatchMappingStrategy
Bases:
MappingStrategy
1. Sort CandidatesToMetrics by search score to identify the highest scoring match.
2. Query the remaining matches to see whether their scores are at least this best score minus the differential (i.e. there are many close string matches).
- __init__(confidence, disambiguation_strategies=None, disambiguation_essential=False, search_threshold=80.0, symbolic_only=False, differential=2.0)
- Parameters:
confidence (StringMatchConfidence)
disambiguation_strategies (list[DisambiguationStrategy] | None)
disambiguation_essential (bool)
search_threshold (float) – only consider linking candidates above this search threshold
symbolic_only (bool) – only consider candidates that are symbolic
differential (float) – only consider candidates with search scores equal to or greater than the best match's score minus this value
- filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)
Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.
Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).
- Parameters:
ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.
- Returns:
a dict of filtered candidates
- Return type:
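The interaction of search_threshold and differential can be illustrated with a small sketch. It works on a plain synonym-to-score dict rather than kazu's candidate objects, ignores symbolic_only, and treats the threshold as inclusive, which is an assumption; `strong_match_filter` and the scores are made up for the example.

```python
def strong_match_filter(
    scores: dict[str, float],
    search_threshold: float = 80.0,
    differential: float = 2.0,
) -> dict[str, float]:
    # Drop everything below the search threshold (inclusive bound assumed).
    above = {syn: score for syn, score in scores.items() if score >= search_threshold}
    if not above:
        return {}
    best = max(above.values())
    # Keep the best match plus any candidate within `differential` of it.
    return {syn: score for syn, score in above.items() if score >= best - differential}


scores = {
    "neck disease": 95.0,
    "heck disease": 93.5,
    "neck pain": 85.0,
    "knee disease": 70.0,
}
kept = strong_match_filter(scores)
```

With the defaults, the cut-off is 95.0 - 2.0 = 93.0, so `kept` holds "neck disease" and "heck disease": two close string matches that would then go forward for confirmation or disambiguation.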
- class kazu.steps.linking.post_processing.mapping_strategies.strategies.StrongMatchWithEmbeddingConfirmationStringMatchingStrategy
Bases:
StrongMatchMappingStrategy
Same as parent class, but a complex string scorer with a predefined threshold is used to confirm that the ent_match is broadly similar to one of the candidates attached to the CandidatesToMetrics.
Useful for refining non-symbolic close string matches (e.g. “Neck disease” and “Heck disease”).
- __init__(confidence, complex_string_scorer, disambiguation_strategies=None, disambiguation_essential=False, search_threshold=80.0, embedding_threshold=0.6, symbolic_only=False, differential=2.0)
- Parameters:
confidence (StringMatchConfidence)
complex_string_scorer (StringSimilarityScorer) – only consider linking candidates passing this string scorer call
disambiguation_strategies (list[DisambiguationStrategy] | None)
disambiguation_essential (bool)
search_threshold (float) – only consider candidates above this search threshold
embedding_threshold (float) – the Entity.match and one of the LinkingCandidate.raw_synonyms must be above this threshold (according to the complex_string_scorer) for the candidate to be valid
symbolic_only (bool) – only consider candidates that are symbolic
differential (float) – only consider candidates with search scores equal to or greater than the best match's score minus this value
- filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)
Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.
Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).
- Parameters:
ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.
- Returns:
a dict of filtered candidates
- Return type:
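The confirmation step can be pictured as a second filter applied to the close string matches. In the sketch below the scorer is a plain callable standing in for a StringSimilarityScorer, candidates are a dict from a made-up id to its raw synonyms, and the similarity values in `TOY_SIMILARITY` are invented purely for illustration; a real scorer would compute them. Whether the threshold is inclusive is also an assumption.

```python
from typing import Callable

# Invented similarity values, for illustration only.
TOY_SIMILARITY = {
    ("neck disease", "neck disease"): 1.0,
    ("neck disease", "disorder of neck"): 0.8,
    ("neck disease", "heck disease"): 0.3,
}


def toy_scorer(a: str, b: str) -> float:
    # Hypothetical stand-in for a complex string scorer.
    return TOY_SIMILARITY.get((a.lower(), b.lower()), 0.0)


def confirm_with_scorer(
    ent_match: str,
    candidates: dict[str, list[str]],
    scorer: Callable[[str, str], float],
    embedding_threshold: float = 0.6,
) -> list[str]:
    confirmed = []
    for candidate_id, raw_synonyms in candidates.items():
        # A candidate is valid if at least one raw synonym scores highly enough.
        if any(scorer(ent_match, syn) >= embedding_threshold for syn in raw_synonyms):
            confirmed.append(candidate_id)
    return confirmed


candidates = {
    "ID:neck": ["Neck disease", "Disorder of neck"],
    "ID:heck": ["Heck disease"],
}
confirmed = confirm_with_scorer("Neck disease", candidates, toy_scorer)
```

Both candidates would survive a pure string-distance check, but only "ID:neck" passes the scorer threshold here.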
- class kazu.steps.linking.post_processing.mapping_strategies.strategies.SymbolMatchMappingStrategy
Bases:
MappingStrategy
Split both query and reference terms by whitespace.
Select the term with the most splits as the ‘query’. Check all of these tokens (and no more) are within the other term. Useful for symbol matching e.g. “MAP K8” (longest) vs “MAPK8” (shortest).
- classmethod filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)
Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.
Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).
- Parameters:
ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.
- Returns:
a dict of filtered candidates
- Return type:
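One possible reading of the token check described for SymbolMatchMappingStrategy is sketched below. The function `symbols_match` is hypothetical, and the way "and no more" is enforced (consuming each token from the other term and requiring nothing to be left over) is an assumption about the intended behaviour.

```python
def symbols_match(a: str, b: str) -> bool:
    a_tokens, b_tokens = a.split(), b.split()
    # The term with the most whitespace-separated tokens acts as the query.
    if len(a_tokens) >= len(b_tokens):
        query_tokens, reference = a_tokens, "".join(b_tokens)
    else:
        query_tokens, reference = b_tokens, "".join(a_tokens)
    remainder = reference
    for token in query_tokens:
        if token not in remainder:
            return False
        # Consume one occurrence so that each token is only counted once.
        remainder = remainder.replace(token, "", 1)
    # "And no more": nothing of the reference may be left unexplained.
    return remainder == ""


matched = symbols_match("MAP K8", "MAPK8")
extra_chars = symbols_match("MAP K8", "MAPK80")
wrong_token = symbols_match("MAP K9", "MAPK8")
```

Here `matched` is True, while the other two calls return False: one reference has a leftover character and the other lacks a query token.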
- class kazu.steps.linking.post_processing.mapping_strategies.strategies.SynNormIsSubStringMappingStrategy
Bases:
MappingStrategy
For a CandidatesToMetrics, see if any of their synonym_norm values are string matches of the match_norm tokens, based on whitespace tokenisation.
If exactly one element of CandidatesToMetrics matches, prefer it.
Works best on symbolic entities, e.g. “TESTIN gene” -> “TESTIN”.
- __init__(confidence, disambiguation_strategies=None, disambiguation_essential=False, min_syn_norm_len_to_consider=3)
- Parameters:
confidence (StringMatchConfidence)
disambiguation_strategies (list[DisambiguationStrategy] | None)
disambiguation_essential (bool)
min_syn_norm_len_to_consider (int) – only consider elements of CandidatesToMetrics where the length of synonym_norm is equal to or greater than this value.
- filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)
Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.
Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).
- Parameters:
ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.
- Returns:
a dict of filtered candidates
- Return type:
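The token-based check for SynNormIsSubStringMappingStrategy could look roughly like the sketch below. It operates on a plain list of normalised synonyms instead of candidate objects, and returning nothing when zero or several synonyms match is an assumption; `syn_norm_in_tokens` is a hypothetical helper.

```python
def syn_norm_in_tokens(
    ent_match_norm: str,
    synonym_norms: list[str],
    min_syn_norm_len_to_consider: int = 3,
) -> list[str]:
    # Whitespace tokenisation of the normalised entity string.
    tokens = set(ent_match_norm.split())
    hits = [
        syn
        for syn in synonym_norms
        if len(syn) >= min_syn_norm_len_to_consider and syn in tokens
    ]
    # Prefer a single unambiguous hit; otherwise give up (assumed behaviour).
    return hits if len(hits) == 1 else []


single = syn_norm_in_tokens("TESTIN GENE", ["TESTIN", "TES", "GENE EXPRESSION"])
several = syn_norm_in_tokens("TESTIN GENE", ["TESTIN", "GENE"])
```

In the first call only "TESTIN" equals one of the tokens, so it is preferred; in the second, two synonyms match and the sketch returns an empty list.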