kazu.steps.linking.post_processing.mapping_strategies.strategies

Classes

ExactMatchMappingStrategy

Returns any exact matches.

MappingFactory

Factory class to produce mappings.

MappingStrategy

A MappingStrategy is responsible for actualising instances of Mapping.

StrongMatchMappingStrategy

StrongMatchWithEmbeddingConfirmationStringMatchingStrategy

Same as parent class, but a complex string scorer with a predefined threshold is used to confirm that the ent_match is broadly similar to one of the candidates attached to the CandidatesToMetrics.

SymbolMatchMappingStrategy

Split both query and reference terms by whitespace.

SynNormIsSubStringMappingStrategy

For a CandidatesToMetrics, check whether any .synonym_norm is a string match for one of the match_norm tokens, where tokens are produced by whitespace tokenisation.

class kazu.steps.linking.post_processing.mapping_strategies.strategies.ExactMatchMappingStrategy[source]

Bases: MappingStrategy

Returns any exact matches.

static filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)[source]

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).

Parameters:
  • ent_match (str) – the raw entity string.

  • ent_match_norm (str) – normalised version of the entity string.

  • document (Document) – originating Document.

  • candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.

  • parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]
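
As a rough illustration of exact-match filtering, the sketch below keeps only the dictionary entries whose metrics are flagged as exact. ToyCandidate, ToyMetrics, filter_exact and the exact_match flag are hypothetical stand-ins introduced for this example; they are not the Kazu classes, and the real method may decide exactness differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCandidate:
    """Hypothetical stand-in for LinkingCandidate (frozen, so usable as a dict key)."""

    synonym_norm: str


@dataclass
class ToyMetrics:
    """Hypothetical stand-in for LinkingMetrics; the exact_match flag is an assumption."""

    exact_match: bool = False


def filter_exact(candidates):
    # Keep only the entries whose metrics are flagged as an exact match.
    return {cand: metrics for cand, metrics in candidates.items() if metrics.exact_match}


candidates = {
    ToyCandidate("MAPK8"): ToyMetrics(exact_match=True),
    ToyCandidate("MAPK9"): ToyMetrics(exact_match=False),
}
exact = filter_exact(candidates)
```

The result is still a dict keyed by candidate, matching the documented return shape, so it can be handed on to a disambiguation step unchanged.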

class kazu.steps.linking.post_processing.mapping_strategies.strategies.MappingFactory[source]

Bases: object

Factory class to produce mappings.

static create_mapping(parser_name, source, idx, string_match_strategy, string_match_confidence, disambiguation_strategy=None, disambiguation_confidence=None, additional_metadata=None, xref_source_parser_name=None)[source]
Parameters:
Return type:

Mapping

static create_mapping_from_id_set(id_set, parser_name, string_match_strategy, string_match_confidence, disambiguation_strategy, disambiguation_confidence=None, additional_metadata=None)[source]
Parameters:
Return type:

Iterable[Mapping]

static create_mapping_from_id_sets(id_sets, parser_name, string_match_strategy, string_match_confidence, disambiguation_strategy, disambiguation_confidence=None, additional_metadata=None)[source]
Parameters:
Return type:

Iterable[Mapping]

class kazu.steps.linking.post_processing.mapping_strategies.strategies.MappingStrategy[source]

Bases: ABC

A MappingStrategy is responsible for actualising instances of Mapping.

This is performed in two steps:

  1. Filter the set of CandidatesToMetrics associated with an Entity down to the most appropriate ones (e.g. based on string similarity).

  2. If required, apply any configured DisambiguationStrategy to the filtered instances of EquivalentIdSet.

Selected instances of EquivalentIdSet are converted to Mapping.
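
The two steps can be sketched with a simplified toy class. Everything here is an assumption made for illustration: ToyMappingStrategy, TopScore and prefer_lowest_id are hypothetical names, candidates are reduced to a dict of id to score, and the returned ids stand in for Mapping instances. The sketch also ignores the case where a single surviving candidate is itself ambiguous.

```python
from abc import ABC, abstractmethod


class ToyMappingStrategy(ABC):
    """Hypothetical, simplified analogue of MappingStrategy; not the Kazu implementation."""

    def __init__(self, disambiguators=None):
        # Each disambiguator is a callable taking a set of ids and returning a subset.
        self.disambiguators = disambiguators or []

    @abstractmethod
    def filter_candidates(self, ent_match_norm, candidates):
        """Step 1: return the best subset of candidates (a dict of id -> score here)."""

    def __call__(self, ent_match_norm, candidates):
        filtered = self.filter_candidates(ent_match_norm, candidates)
        ids = set(filtered)
        # Step 2: only disambiguate when more than one id survives filtering.
        if len(ids) > 1:
            for disambiguate in self.disambiguators:
                narrowed = disambiguate(ids)
                if narrowed:
                    ids = narrowed
                    break
        # The selected ids stand in for the Mapping instances the real class yields.
        return sorted(ids)


class TopScore(ToyMappingStrategy):
    def filter_candidates(self, ent_match_norm, candidates):
        if not candidates:
            return {}
        best = max(candidates.values())
        return {idx: score for idx, score in candidates.items() if score == best}


def prefer_lowest_id(ids):
    # Toy disambiguator: keep the lexicographically smallest id.
    return {min(ids)}


strategy = TopScore(disambiguators=[prefer_lowest_id])
mappings = strategy("mapk8", {"ID:2": 95.0, "ID:1": 95.0, "ID:3": 60.0})
```

Keeping filtering abstract and disambiguation in the base class mirrors the split described above: subclasses only decide which candidates survive.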

__call__(ent_match, ent_match_norm, document, candidates)[source]
Parameters:
Returns:

Return type:

Iterable[Mapping]

__init__(confidence, disambiguation_strategies=None, disambiguation_essential=False)[source]
Parameters:
  • confidence (StringMatchConfidence) – the level of confidence that should be assigned to this strategy. This is simply a label for human users, and has no bearing on the actual algorithm.

  • disambiguation_strategies (list[DisambiguationStrategy] | None) – after filter_candidates() is called, these strategies are triggered if multiple entries of CandidatesToMetrics remain, or if any of the remaining entries are ambiguous.

  • disambiguation_essential (bool) – disambiguation strategies MUST deliver a result, in order for this strategy to pass.

disambiguate_if_required(filtered_candidates, document, parser_name, ent_match, ent_match_norm)[source]

Applies the configured disambiguation strategies when either len(filtered_candidates) > 1 or any of the filtered_candidates is ambiguous. If the ids are still ambiguous after all strategies have run, the disambiguation confidence will be DisambiguationConfidence.AMBIGUOUS.

Parameters:
  • filtered_candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to disambiguate

  • document (Document) – originating Document

  • parser_name (str) – parser name associated with these candidates

  • ent_match (str) – string of entity to be disambiguated

  • ent_match_norm (str) – normalised string of entity to be disambiguated

Returns:

Return type:

tuple[set[EquivalentIdSet], str | None, DisambiguationConfidence | None]

abstract filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)[source]

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).

Parameters:
  • ent_match (str) – the raw entity string.

  • ent_match_norm (str) – normalised version of the entity string.

  • document (Document) – originating Document.

  • candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.

  • parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]

prepare(document)[source]

Perform any setup that needs to run once per document.

Care should be taken if trying to cache this step, as the Document state is liable to change between executions.

Parameters:

document (Document)

Returns:

Return type:

None

DISAMBIGUATION_NOT_REQUIRED = 'disambiguation_not_required'

class kazu.steps.linking.post_processing.mapping_strategies.strategies.StrongMatchMappingStrategy[source]

Bases: MappingStrategy

  1. Sort CandidatesToMetrics by search score to identify the highest scoring match.

  2. Check the remaining matches to see whether their scores are at least the best score minus the differential (i.e. whether there are several close string matches).
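
A minimal sketch of this selection rule, assuming candidates are reduced to a dict of name to search score. strong_match_filter is a hypothetical helper, not part of Kazu; its default values mirror the search_threshold and differential defaults in the constructor signature, and the inclusive comparisons are an assumption.

```python
def strong_match_filter(scores, search_threshold=80.0, differential=2.0):
    """scores: dict of candidate name -> search score (a toy stand-in for CandidatesToMetrics)."""
    # Drop anything below the search threshold (inclusive comparison is an assumption).
    eligible = {name: s for name, s in scores.items() if s >= search_threshold}
    if not eligible:
        return {}
    best = max(eligible.values())
    # Keep the best match plus every candidate within `differential` of it.
    return {name: s for name, s in eligible.items() if s >= best - differential}


result = strong_match_filter(
    {"heck disease": 96.0, "neck disease": 94.5, "back pain": 85.0, "unrelated": 40.0}
)
```

In this toy run two candidates survive because their scores sit within the differential of each other, which is the situation where disambiguation would then be needed.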

__init__(confidence, disambiguation_strategies=None, disambiguation_essential=False, search_threshold=80.0, symbolic_only=False, differential=2.0)[source]
Parameters:
  • confidence (StringMatchConfidence)

  • disambiguation_strategies (list[DisambiguationStrategy] | None)

  • disambiguation_essential (bool)

  • search_threshold (float) – only consider linking candidates above this search threshold

  • symbolic_only (bool) – only consider candidates that are symbolic

  • differential (float) – only consider candidates with search scores greater than or equal to the best match score minus this value

filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)[source]

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).

Parameters:
  • ent_match (str) – the raw entity string.

  • ent_match_norm (str) – normalised version of the entity string.

  • document (Document) – originating Document.

  • candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.

  • parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]

class kazu.steps.linking.post_processing.mapping_strategies.strategies.StrongMatchWithEmbeddingConfirmationStringMatchingStrategy[source]

Bases: StrongMatchMappingStrategy

Same as parent class, but a complex string scorer with a predefined threshold is used to confirm that the ent_match is broadly similar to one of the candidates attached to the CandidatesToMetrics.

Useful for refining non-symbolic close string matches (e.g. “Neck disease” and “Heck disease”).
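
The confirmation step might look roughly like the sketch below. confirm_with_scorer, toy_scorer and the hard-coded similarity table are hypothetical and exist only for illustration; a real StringSimilarityScorer would compute its own score, and the inclusive threshold comparison is an assumption.

```python
def confirm_with_scorer(ent_match, candidates, scorer, embedding_threshold=0.6):
    """candidates: dict of candidate id -> list of raw synonyms (a toy stand-in)."""
    confirmed = {}
    for cand_id, raw_synonyms in candidates.items():
        # A candidate survives if any one of its synonyms reaches the threshold.
        if any(scorer(ent_match, syn) >= embedding_threshold for syn in raw_synonyms):
            confirmed[cand_id] = raw_synonyms
    return confirmed


# Toy scorer with made-up similarities; unknown pairs score 0.0.
TOY_SIMILARITY = {
    ("Neck disease", "Neck disease"): 1.0,
    ("Neck disease", "Heck disease"): 0.2,
}


def toy_scorer(a, b):
    return TOY_SIMILARITY.get((a, b), 0.0)


kept = confirm_with_scorer(
    "Neck disease",
    {"ID:neck": ["Neck disease"], "ID:heck": ["Heck disease"]},
    toy_scorer,
)
```

The two strings differ by one character, so a purely character-based search score would rank them closely; the second scorer is what separates them in this sketch.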

__init__(confidence, complex_string_scorer, disambiguation_strategies=None, disambiguation_essential=False, search_threshold=80.0, embedding_threshold=0.6, symbolic_only=False, differential=2.0)[source]
Parameters:
  • confidence (StringMatchConfidence)

  • complex_string_scorer (StringSimilarityScorer) – only consider linking candidates passing this string scorer call

  • disambiguation_strategies (list[DisambiguationStrategy] | None)

  • disambiguation_essential (bool)

  • search_threshold (float) – only consider candidates above this search threshold

  • embedding_threshold (float) – the similarity between the Entity.match and at least one of the LinkingCandidate.raw_synonyms must be above this threshold (according to the complex_string_scorer) for the candidate to be valid

  • symbolic_only (bool) – only consider candidates that are symbolic

  • differential (float) – only consider candidates with search scores greater than or equal to the best match score minus this value

filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)[source]

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).

Parameters:
  • ent_match (str) – the raw entity string.

  • ent_match_norm (str) – normalised version of the entity string.

  • document (Document) – originating Document.

  • candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.

  • parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]

class kazu.steps.linking.post_processing.mapping_strategies.strategies.SymbolMatchMappingStrategy[source]

Bases: MappingStrategy

Split both query and reference terms by whitespace.

Select the term with the most splits as the ‘query’. Check that all of these tokens (and no more) are within the other term. Useful for symbol matching, e.g. “MAP K8” (longest) vs “MAPK8” (shortest).
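
One way to read this rule is sketched below. match_symbols_sketch is a hypothetical function written from the description above, not the body of match_symbols: it takes the tokens of the term with more splits, removes each one from the other term, and requires that nothing is left over.

```python
def match_symbols_sketch(s1, s2):
    tokens1, tokens2 = s1.split(), s2.split()
    # The term with more whitespace-separated tokens supplies the query tokens;
    # the other term is compared with its whitespace removed.
    if len(tokens1) >= len(tokens2):
        query_tokens, reference = tokens1, "".join(tokens2)
    else:
        query_tokens, reference = tokens2, "".join(tokens1)
    # Every query token must be found in what remains of the reference.
    for token in query_tokens:
        if token not in reference:
            return False
        reference = reference.replace(token, "", 1)
    # "And no more": the reference must be fully consumed.
    return reference == ""


same_symbol = match_symbols_sketch("MAP K8", "MAPK8")
extra_chars = match_symbols_sketch("MAP K8", "MAPK81")
```

Argument order does not matter in this sketch, since the query is chosen by token count rather than position.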

classmethod filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)[source]

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).

Parameters:
  • ent_match (str) – the raw entity string.

  • ent_match_norm (str) – normalised version of the entity string.

  • document (Document) – originating Document.

  • candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.

  • parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]

static match_symbols(s1, s2)[source]
Parameters:
Return type:

bool

class kazu.steps.linking.post_processing.mapping_strategies.strategies.SynNormIsSubStringMappingStrategy[source]

Bases: MappingStrategy

For a CandidatesToMetrics, check whether any .synonym_norm is a string match for one of the match_norm tokens, where tokens are produced by whitespace tokenisation.

If exactly one element of CandidatesToMetrics matches, prefer it.

Works best on symbolic entities, e.g. “TESTIN gene” -> “TESTIN”.
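
A minimal sketch of the token check, assuming candidates are reduced to a list of normalised synonym strings. syn_norm_substring_filter is a hypothetical helper; the parameter name min_syn_norm_len_to_consider is borrowed from the constructor signature, and returning nothing when the hit is not unique is an assumption rather than documented behaviour.

```python
def syn_norm_substring_filter(match_norm, synonym_norms, min_syn_norm_len_to_consider=3):
    tokens = set(match_norm.split())
    # A hit is a synonym that is long enough and equal to one of the tokens.
    hits = [
        syn
        for syn in synonym_norms
        if len(syn) >= min_syn_norm_len_to_consider and syn in tokens
    ]
    # Only a unique hit is treated as a confident match in this sketch.
    return hits if len(hits) == 1 else []


preferred = syn_norm_substring_filter("TESTIN GENE", ["TESTIN", "TES", "GENE FAMILY"])
```

The minimum length guards against very short synonyms matching incidental tokens.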

__init__(confidence, disambiguation_strategies=None, disambiguation_essential=False, min_syn_norm_len_to_consider=3)[source]
Parameters:

filter_candidates(ent_match, ent_match_norm, document, candidates, parser_name)[source]

Algorithms should override this method to return the “best” CandidatesToMetrics for a given query string.

Ideally, this will be a dict with a single element. However, it may not be possible to identify a single best match. In this scenario, the id sets of multiple LinkingCandidates will be carried forward for disambiguation (if configured).

Parameters:
  • ent_match (str) – the raw entity string.

  • ent_match_norm (str) – normalised version of the entity string.

  • document (Document) – originating Document.

  • candidates (dict[LinkingCandidate, LinkingMetrics]) – candidates to filter.

  • parser_name (str) – parser name associated with these candidates.

Returns:

a dict of filtered candidates

Return type:

dict[LinkingCandidate, LinkingMetrics]