kazu.steps.linking.post_processing.mapping_strategies.strategies

Classes

`ExactMatchMappingStrategy`	Returns any exact matches.
`MappingFactory`	Factory class to produce mappings.
`MappingStrategy`	A MappingStrategy is responsible for actualising instances of `Mapping`.
`StrongMatchMappingStrategy`
`StrongMatchWithEmbeddingConfirmationStringMatchingStrategy`	Same as parent class, but a complex string scorer with a predefined threshold is used to confirm that the ent_match is broadly similar to one of the candidates attached to the `CandidatesToMetrics`.
`SymbolMatchMappingStrategy`	Split both query and reference terms by whitespace.
`SynNormIsSubStringMappingStrategy`	For each candidate in a `CandidatesToMetrics`, check whether its .synonym_norm is a string match of any of the match_norm tokens, based on whitespace tokenisation.

class kazu.steps.linking.post_processing.mapping_strategies.strategies.ExactMatchMappingStrategy

Bases: MappingStrategy

Returns any exact matches.

static filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).

Parameters:

ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]
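
The filtering idea can be sketched as follows. This is a minimal illustration, not the library implementation: `Candidate` and `Metrics` are hypothetical stand-ins for LinkingCandidate and LinkingMetrics, and the `exact_match` flag on the metrics object is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for LinkingCandidate."""

    synonym_norm: str


@dataclass(frozen=True)
class Metrics:
    """Hypothetical stand-in for LinkingMetrics; the exact_match flag is assumed."""

    exact_match: bool = False


def filter_exact(candidates: dict[Candidate, Metrics]) -> dict[Candidate, Metrics]:
    # Keep only the candidates whose metrics are flagged as an exact match.
    return {cand: m for cand, m in candidates.items() if m.exact_match}


candidates = {
    Candidate("MAPK8"): Metrics(exact_match=True),
    Candidate("MAPK9"): Metrics(exact_match=False),
}
exact = filter_exact(candidates)
```

Here `exact` keeps only the "MAPK8" entry; if several candidates were flagged, all of them would be carried forward for disambiguation.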

class kazu.steps.linking.post_processing.mapping_strategies.strategies.MappingFactory

Bases: object

Factory class to produce mappings.

static create_mapping(parser_name, source, idx, string_match_strategy, string_match_confidence, disambiguation_strategy=None, disambiguation_confidence=None, additional_metadata=None, xref_source_parser_name=None)

Parameters:

parser_name (str)
source (str)
idx (str)
string_match_strategy (str)
string_match_confidence (StringMatchConfidence)
disambiguation_strategy (str | None)
disambiguation_confidence (DisambiguationConfidence | None)
additional_metadata (dict | None)
xref_source_parser_name (str | None)

Return type:

Mapping

static create_mapping_from_id_set(id_set, parser_name, string_match_strategy, string_match_confidence, disambiguation_strategy, disambiguation_confidence=None, additional_metadata=None)

Parameters:

id_set (EquivalentIdSet)
parser_name (str)
string_match_strategy (str)
string_match_confidence (StringMatchConfidence)
disambiguation_strategy (str | None)
disambiguation_confidence (DisambiguationConfidence | None)
additional_metadata (dict | None)

Return type:

Iterable[Mapping]

static create_mapping_from_id_sets(id_sets, parser_name, string_match_strategy, string_match_confidence, disambiguation_strategy, disambiguation_confidence=None, additional_metadata=None)

Parameters:

id_sets (set[EquivalentIdSet])
parser_name (str)
string_match_strategy (str)
string_match_confidence (StringMatchConfidence)
disambiguation_strategy (str | None)
disambiguation_confidence (DisambiguationConfidence | None)
additional_metadata (dict | None)

Return type:

Iterable[Mapping]

class kazu.steps.linking.post_processing.mapping_strategies.strategies.MappingStrategy

Bases: ABC

A MappingStrategy is responsible for actualising instances of Mapping.

This is performed in two steps:

1. Filter the set of CandidatesToMetrics associated with an Entity down to the most appropriate ones (e.g. based on string similarity).
2. If required, apply any configured DisambiguationStrategy to the filtered instances of EquivalentIdSet.

Selected instances of EquivalentIdSet are converted to Mapping.
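
The filter-then-disambiguate flow can be sketched with plain Python types. Everything here is hypothetical: candidates are modelled as a dict from a synonym string to a set of id strings, and `keep` and `disambiguate` are illustrative callables rather than anything provided by the library.

```python
from typing import Callable, Optional


def run_strategy(
    candidates: dict[str, set[str]],
    keep: Callable[[str], bool],
    disambiguate: Optional[Callable[[set[str]], set[str]]] = None,
) -> list[str]:
    # Step 1: filter the candidates down to the most appropriate ones.
    filtered = {name: ids for name, ids in candidates.items() if keep(name)}
    # Collect the ids attached to the surviving candidates.
    ids = set().union(*filtered.values())
    # Step 2: disambiguate only if configured and more than one id remains.
    if disambiguate is not None and len(ids) > 1:
        ids = disambiguate(ids)
    # The selected ids are converted to (toy) mappings.
    return sorted(f"mapping:{idx}" for idx in ids)


candidates = {"mapk8": {"ID:1"}, "mapk9": {"ID:2"}, "mapk8 gene": {"ID:1", "ID:3"}}
resolved = run_strategy(candidates, lambda n: n.startswith("mapk8"), lambda ids: {min(ids)})
unresolved = run_strategy(candidates, lambda n: n.startswith("mapk8"))
```

Without a disambiguation callable, every id that survives filtering is converted, which mirrors the case where multiple id sets are carried forward.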

__call__(ent_match, ent_match_norm, document, candidates)

Parameters:

ent_match (str) – unnormalised NER string match (i.e. Entity.match)
ent_match_norm (str) – normalised NER string match (i.e. Entity.match_norm)
document (Document) – originating document
candidates (dict[LinkingCandidate, LinkingMetrics]) – set of candidates to consider. Note, candidates from different parsers should not be mixed.

Return type:

Iterable[Mapping]

__init__(confidence, disambiguation_strategies=None, disambiguation_essential=False)

Parameters:

confidence (StringMatchConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.
disambiguation_strategies (list[DisambiguationStrategy] | None) – after filter_candidates() is called, these strategies are triggered if multiple entries of CandidatesToMetrics remain and/or any of them are ambiguous.
disambiguation_essential (bool) – if True, the disambiguation strategies must deliver a result for this strategy to pass.

disambiguate_if_required(filtered_candidates, document, parser_name, ent_match, ent_match_norm)

Applies disambiguation strategies if any are configured and either len(filtered_candidates) > 1 or any of the filtered_candidates are ambiguous. If ids are still ambiguous after all strategies have run, the disambiguation confidence will be DisambiguationConfidence.AMBIGUOUS.

Parameters:

filtered_candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to disambiguate
document (Document) – originating Document
parser_name (str) – parser name associated with these candidates
ent_match (str) – string of entity to be disambiguated
ent_match_norm (str) – normalised string of entity to be disambiguated

Return type:

tuple[set[EquivalentIdSet], str | None, DisambiguationConfidence | None]
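
A rough sketch of the trigger condition and the returned triple, under several assumptions: ids are plain strings, a candidate counts as "ambiguous" when it carries more than one id, a strategy is a hypothetical `(name, confidence, narrowing function)` tuple, and returning the DISAMBIGUATION_NOT_REQUIRED label when nothing needs resolving is a guess at how that constant is used.

```python
from typing import Callable, Optional

AMBIGUOUS = "ambiguous"  # stand-in for DisambiguationConfidence.AMBIGUOUS
DISAMBIGUATION_NOT_REQUIRED = "disambiguation_not_required"

# Hypothetical strategy shape: (name, confidence label, function narrowing a set of ids).
Strategy = tuple[str, str, Callable[[set[str]], set[str]]]


def disambiguate_sketch(
    candidate_ids: dict[str, set[str]],
    strategies: Optional[list[Strategy]],
) -> tuple[set[str], Optional[str], Optional[str]]:
    all_ids = set().union(*candidate_ids.values())
    # Needed if several candidates remain, or if any single candidate is
    # itself ambiguous (assumed here to mean: it carries more than one id).
    required = len(candidate_ids) > 1 or any(
        len(ids) > 1 for ids in candidate_ids.values()
    )
    if not required:
        return all_ids, DISAMBIGUATION_NOT_REQUIRED, None
    for name, confidence, narrow in strategies or []:
        narrowed = narrow(all_ids)
        if len(narrowed) == 1:
            # The first strategy that yields a single id wins.
            return narrowed, name, confidence
    # Nothing resolved the ambiguity: carry everything forward, flagged as ambiguous.
    return all_ids, None, AMBIGUOUS


pick_first: Strategy = ("pick_first", "probable", lambda ids: {min(ids)})
single = disambiguate_sketch({"a": {"ID:1"}}, None)
narrowed = disambiguate_sketch({"a": {"ID:1"}, "b": {"ID:2"}}, [pick_first])
stuck = disambiguate_sketch({"a": {"ID:1", "ID:2"}}, [])
```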

abstract filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Parameters:

ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]

prepare(document)

Perform any setup that needs to run once per document.

Care should be taken if trying to cache this step, as the Document state is liable to change between executions.

Parameters:

document (Document)

Return type:

None

DISAMBIGUATION_NOT_REQUIRED = 'disambiguation_not_required'

class kazu.steps.linking.post_processing.mapping_strategies.strategies.StrongMatchMappingStrategy

Bases: MappingStrategy

1. Sort CandidatesToMetrics by search score to identify the highest scoring match.
2. Check whether any of the remaining matches score higher than this best score minus the differential (i.e. whether there are several close string matches).

__init__(confidence, disambiguation_strategies=None, disambiguation_essential=False, search_threshold=80.0, symbolic_only=False, differential=2.0)

Parameters:

confidence (StringMatchConfidence)
disambiguation_strategies (list[DisambiguationStrategy] | None)
disambiguation_essential (bool)
search_threshold (float) – only consider linking candidates above this search threshold
symbolic_only (bool) – only consider candidates that are symbolic
differential (float) – only consider candidates with search scores greater than or equal to the best match's score minus this value
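
How search_threshold and differential interact can be sketched on a plain dict of scores. This is an illustration only: the scores are invented, candidates are reduced to strings, the symbolic_only option is left out, and treating the search threshold as inclusive is an assumption.

```python
def strong_matches(
    scores: dict[str, float],
    search_threshold: float = 80.0,
    differential: float = 2.0,
) -> dict[str, float]:
    # Drop candidates that do not reach the search threshold (inclusive: assumed).
    above = {name: s for name, s in scores.items() if s >= search_threshold}
    if not above:
        return {}
    # Identify the best score, then keep everything within `differential` of it.
    best = max(above.values())
    return {name: s for name, s in above.items() if s >= best - differential}


scores = {
    "heck disease": 97.0,
    "neck disease": 95.5,
    "back disease": 90.0,
    "disease": 60.0,
}
close = strong_matches(scores)
```

With the defaults, "heck disease" and "neck disease" both survive because they lie within 2.0 of the best score, so more than one candidate may be passed on for disambiguation.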

filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Parameters:

ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]

class kazu.steps.linking.post_processing.mapping_strategies.strategies.StrongMatchWithEmbeddingConfirmationStringMatchingStrategy

Bases: StrongMatchMappingStrategy

Same as parent class, but a complex string scorer with a predefined threshold is used to confirm that the ent_match is broadly similar to one of the candidates attached to the CandidatesToMetrics.

Useful for refining non-symbolic close string matches (e.g. “Neck disease” and “Heck disease”).

__init__(confidence, complex_string_scorer, disambiguation_strategies=None, disambiguation_essential=False, search_threshold=80.0, embedding_threshold=0.6, symbolic_only=False, differential=2.0)

Parameters:

confidence (StringMatchConfidence)
complex_string_scorer (StringSimilarityScorer) – only consider linking candidates passing this string scorer call
disambiguation_strategies (list[DisambiguationStrategy] | None)
disambiguation_essential (bool)
search_threshold (float) – only consider candidates above this search threshold
embedding_threshold (float) – the Entity.match and one of the LinkingCandidate.raw_synonyms must be above this threshold (according to the complex_string_scorer) for the candidate to be valid
symbolic_only (bool) – only consider candidates that are symbolic
differential (float) – only consider candidates with search scores greater than or equal to the best match's score minus this value
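
The confirmation step can be sketched with a pluggable scorer. Note the substitutions: `token_jaccard` is a toy scorer used purely so the example runs, not an embedding model and not the library's StringSimilarityScorer; candidates are modelled as a hypothetical dict from an id to its raw synonyms; and an inclusive threshold is assumed.

```python
from typing import Callable


def token_jaccard(a: str, b: str) -> float:
    """Toy scorer standing in for a real string similarity scorer."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def confirm(
    ent_match: str,
    raw_synonyms: dict[str, list[str]],
    scorer: Callable[[str, str], float],
    embedding_threshold: float = 0.6,
) -> list[str]:
    # Keep a candidate if at least one of its raw synonyms scores at or
    # above the threshold against the entity string.
    return [
        cand
        for cand, syns in raw_synonyms.items()
        if any(scorer(ent_match, syn) >= embedding_threshold for syn in syns)
    ]


raw_synonyms = {
    "ID:neck": ["neck disease", "disease of neck"],
    "ID:heck": ["Heck disease", "Heck's disease"],
}
confirmed = confirm("Neck disease", raw_synonyms, token_jaccard)
```

Both candidates would be close on a character-level search score, but only "ID:neck" passes the second check, which is the kind of refinement the "Neck disease" / "Heck disease" example describes.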

filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Parameters:

ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]

class kazu.steps.linking.post_processing.mapping_strategies.strategies.SymbolMatchMappingStrategy

Bases: MappingStrategy

Split both query and reference terms by whitespace.

Select the term with the most splits as the ‘query’. Check that all of these tokens (and no others) are within the other term. Useful for symbol matching, e.g. “MAP K8” (longest) vs “MAPK8” (shortest).
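
One possible reading of this token check is sketched below. `match_symbols_sketch` is a hypothetical helper written for illustration and may differ from the behaviour of match_symbols: it consumes each query token from the other term and requires that nothing is left over.

```python
def match_symbols_sketch(s1: str, s2: str) -> bool:
    t1, t2 = s1.split(), s2.split()
    # The term with the most whitespace-separated tokens is the query.
    query, other = (t1, t2) if len(t1) >= len(t2) else (t2, t1)
    remainder = "".join(other)
    for token in query:
        if token not in remainder:
            return False
        # Consume the token so that it cannot be counted twice.
        remainder = remainder.replace(token, "", 1)
    # "And no more": the other term must be fully accounted for.
    return remainder == ""


same_symbol = match_symbols_sketch("MAP K8", "MAPK8")
extra_chars = match_symbols_sketch("MAP K8", "MAPK81")
wrong_token = match_symbols_sketch("MAP K9", "MAPK8")
```

The check is symmetric in its arguments, since the query is chosen by token count rather than by position.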

classmethod filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Parameters:

ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]

static match_symbols(s1, s2)

Parameters:

s1 (str)
s2 (str)

Return type:

bool

class kazu.steps.linking.post_processing.mapping_strategies.strategies.SynNormIsSubStringMappingStrategy

Bases: MappingStrategy

For each candidate in a CandidatesToMetrics, check whether its .synonym_norm is a string match of any of the match_norm tokens, based on whitespace tokenisation.

If exactly one element of CandidatesToMetrics matches, prefer it.

Works best on symbolic entities, e.g. “TESTIN gene” -> “TESTIN”.

__init__(confidence, disambiguation_strategies=None, disambiguation_essential=False, min_syn_norm_len_to_consider=3)

Parameters:

confidence (StringMatchConfidence)
disambiguation_strategies (list[DisambiguationStrategy] | None)
disambiguation_essential (bool)
min_syn_norm_len_to_consider (int) – only consider elements of CandidatesToMetrics where the length of synonym_norm is equal to or greater than this value.
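
A minimal sketch of the token test, with candidates reduced to their synonym_norm strings. The function name is hypothetical, and returning nothing when zero or several candidates match is an assumption about what "prefer it" implies for the non-unique case.

```python
def syn_norm_token_matches(
    match_norm: str,
    synonym_norms: list[str],
    min_syn_norm_len_to_consider: int = 3,
) -> list[str]:
    # Tokenise the normalised entity string on whitespace.
    tokens = set(match_norm.split())
    hits = [
        syn
        for syn in synonym_norms
        # Skip synonyms that are too short to be trusted, then test for a token match.
        if len(syn) >= min_syn_norm_len_to_consider and syn in tokens
    ]
    # Prefer the hit only when it is unique (assumed behaviour otherwise: no match).
    return hits if len(hits) == 1 else []


unique = syn_norm_token_matches("TESTIN GENE", ["TESTIN", "TES", "GENETIC"])
too_short = syn_norm_token_matches("A GENE", ["A"])
several = syn_norm_token_matches("TESTIN GENE", ["TESTIN", "GENE"])
```

The length floor is what stops a one-letter synonym such as "A" from matching an incidental token.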

filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Parameters:

ent_match (str) – the raw entity string.
ent_match_norm (str) – normalised version of the entity string.
document (Document) – originating Document.
candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.
parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]