kazu.ontology_preprocessing.synonym_generation

Classes

CombinatorialSynonymGenerator

For every permutation of modifiers, generate a list of synonyms, then aggregate them at the end.

GreekSymbolSubstitution

NgramHyphenation

Generate hyphenated variants of ngrams.

SeparatorExpansion

SpellingVariationReplacement

Generate additional synonyms using a mapping of (known) synonyms to a list of variations.

StopWordRemover

Remove stopwords from a string.

StringReplacement

SuffixReplacement

Interchange all suffixes within a provided set to produce new synonyms.

SynonymGenerator

TokenListReplacementGenerator

Given lists of tokens, generate an alternative string based upon a query token.

VerbPhraseVariantGenerator

Generate alternative verb phrases based on a list of tense templates, and lemmas matched in a query.

class kazu.ontology_preprocessing.synonym_generation.CombinatorialSynonymGenerator[source]

Bases: object

For every permutation of modifiers, generate a list of synonyms, then aggregate them at the end.

__call__(ontology_resources)[source]

Takes a set of OntologyStringResources, and returns a new set of OntologyStringResources with generated synonyms added as alternative_synonyms.

Parameters:

ontology_resources (set[OntologyStringResource])

Return type:

set[OntologyStringResource]

__init__(synonym_generators)[source]
Parameters:

synonym_generators (Iterable[SynonymGenerator])
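
The permutation-and-aggregate idea can be sketched without kazu itself. The example below is a minimal illustration that works on plain strings rather than OntologyStringResource objects. The names combine, hyphenate and upper are hypothetical, and the real class may chain and filter results differently.

```python
from itertools import permutations
from typing import Callable, Iterable

# Hypothetical stand-in for a generator: maps one string to a set of new strings.
Generator = Callable[[str], set[str]]


def combine(original: str, generators: Iterable[Generator]) -> set[str]:
    """Apply every ordering of the generators and pool the results."""
    generators = list(generators)
    aggregated: set[str] = set()
    for ordering in permutations(generators):
        current = {original}
        for generate in ordering:
            # Each step sees every string produced so far in this ordering.
            produced: set[str] = set()
            for s in current:
                produced |= generate(s)
            current |= produced
        aggregated |= current
    # Report only strings that differ from the input.
    aggregated.discard(original)
    return aggregated


def hyphenate(s: str) -> set[str]:
    return {s.replace(" ", "-")} if " " in s else set()


def upper(s: str) -> set[str]:
    return {s.upper()}


variants = combine("tnf alpha", [hyphenate, upper])
```

With these two toy generators, both orderings reach the same strings; orderings only matter when one generator's output changes what another can match.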

class kazu.ontology_preprocessing.synonym_generation.GreekSymbolSubstitution[source]

Bases: object

ALL_SUBS: dict[str, set[str]] = {'alpha': {'Α', 'α'}, 'beta': {'Β', 'β', 'ϐ'}, 'chi': {'Χ', 'χ'}, 'delta': {'Δ', 'δ'}, 'epsilon': {'Ε', 'ε'}, 'eta': {'Η', 'η'}, 'final sigma': {'ς'}, 'gamma': {'Γ', 'γ'}, 'iota': {'Ι', 'ι'}, 'kappa': {'Κ', 'κ'}, 'lambda': {'Λ', 'λ'}, 'mu': {'Μ', 'μ'}, 'nu': {'Ν', 'ν'}, 'omega': {'Ω', 'ω'}, 'omicron': {'Ο', 'ο'}, 'phi': {'Φ', 'φ', 'ϕ'}, 'pi': {'Π', 'π'}, 'psi': {'Ψ', 'ψ'}, 'rho': {'Ρ', 'ρ'}, 'sigma': {'Σ', 'σ'}, 'tau': {'Τ', 'τ'}, 'theta': {'Θ', 'θ', 'ϴ'}, 'upsilon': {'Υ', 'υ'}, 'xi': {'Ξ', 'ξ'}, 'zeta': {'Ζ', 'ζ'}, 'Α': {'a', 'alpha', 'α'}, 'Β': {'b', 'beta', 'β'}, 'Γ': {'g', 'gamma', 'γ'}, 'Δ': {'d', 'delta', 'δ'}, 'Ε': {'e', 'epsilon', 'ε'}, 'Ζ': {'z', 'zeta', 'ζ'}, 'Η': {'e', 'eta', 'η'}, 'Θ': {'t', 'theta', 'θ'}, 'Ι': {'i', 'iota', 'ι'}, 'Κ': {'k', 'kappa', 'κ'}, 'Λ': {'l', 'lambda', 'λ'}, 'Μ': {'m', 'mu', 'μ'}, 'Ν': {'n', 'nu', 'ν'}, 'Ξ': {'x', 'xi', 'ξ'}, 'Ο': {'o', 'omicron', 'ο'}, 'Π': {'p', 'pi', 'π'}, 'Ρ': {'r', 'rho', 'ρ'}, 'Σ': {'s', 'sigma', 'σ'}, 'Τ': {'t', 'tau', 'τ'}, 'Υ': {'u', 'upsilon', 'υ'}, 'Φ': {'p', 'phi', 'φ'}, 'Χ': {'c', 'chi', 'χ'}, 'Ψ': {'p', 'psi', 'ψ'}, 'Ω': {'o', 'omega', 'ω'}, 'α': {'a', 'alpha', 'Α'}, 'β': {'b', 'beta', 'Β'}, 'γ': {'g', 'gamma', 'Γ'}, 'δ': {'d', 'delta', 'Δ'}, 'ε': {'e', 'epsilon', 'Ε'}, 'ζ': {'z', 'zeta', 'Ζ'}, 'η': {'e', 'eta', 'Η'}, 'θ': {'t', 'theta', 'Θ'}, 'ι': {'i', 'iota', 'Ι'}, 'κ': {'k', 'kappa', 'Κ'}, 'λ': {'l', 'lambda', 'Λ'}, 'μ': {'m', 'mu', 'Μ'}, 'ν': {'n', 'nu', 'Ν'}, 'ξ': {'x', 'xi', 'Ξ'}, 'ο': {'o', 'omicron', 'Ο'}, 'π': {'p', 'pi', 'Π'}, 'ρ': {'r', 'rho', 'Ρ'}, 'ς': {'f', 'final sigma', 'Σ'}, 'σ': {'s', 'sigma', 'Σ'}, 'τ': {'t', 'tau', 'Τ'}, 'υ': {'u', 'upsilon', 'Υ'}, 'φ': {'p', 'phi', 'Φ'}, 'χ': {'c', 'chi', 'Χ'}, 'ψ': {'p', 'psi', 'Ψ'}, 'ω': {'o', 'omega', 'Ω'}, 'ϐ': {'b', 'beta', 'Β'}, 'ϕ': {'p', 'phi', 'Φ'}, 'ϴ': {'t', 'theta', 'θ'}}
greek_letter = 'ω'
lower_greek_letter = 'θ'
spelling = 'omega'
upper_greek_letter = 'Ω'
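
As a rough illustration of how a table shaped like ALL_SUBS might be consumed, the sketch below substitutes each matching key with each of its alternatives. SUBS is a two-entry excerpt, greek_variants is a hypothetical helper, and plain substring replacement is an assumption made for brevity rather than a description of how kazu applies the table.

```python
# Two-entry excerpt of a substitution table, written with escapes so the
# code points are unambiguous: U+0391 is Greek capital alpha, U+03B1 is small alpha.
SUBS: dict[str, set[str]] = {
    "alpha": {"\u0391", "\u03b1"},
    "\u03b1": {"a", "alpha", "\u0391"},
}


def greek_variants(text: str) -> set[str]:
    """Replace every occurrence of each matching key with each alternative."""
    results: set[str] = set()
    for key, alternatives in SUBS.items():
        if key in text:
            for alt in alternatives:
                results.add(text.replace(key, alt))
    return results


variants = greek_variants("TNF-alpha")
```

A naive substring match like this would also fire on keys embedded in longer words (for example 'eta' inside 'beta' with the full table), so a production version would need some form of boundary check.
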
class kazu.ontology_preprocessing.synonym_generation.NgramHyphenation[source]

Bases: SynonymGenerator

Generate hyphenated variants of ngrams.

__init__(ngram=2)[source]
Parameters:

ngram (int)

call(synonym_str)[source]

Implementations should override this method to generate new strings from an input string.

Parameters:

synonym_str (str)

Return type:

set[str]
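
One plausible reading of ngram hyphenation is sketched below: every window of ngram adjacent whitespace-separated tokens is joined with hyphens, one window at a time. The function hyphenate_ngrams is hypothetical and the windowing behaviour is an assumption; the actual class may restrict which strings it hyphenates.

```python
def hyphenate_ngrams(text: str, ngram: int = 2) -> set[str]:
    """Join each window of `ngram` adjacent tokens with hyphens."""
    if ngram < 2:
        # A window of fewer than two tokens has nothing to hyphenate.
        return set()
    tokens = text.split()
    results: set[str] = set()
    for start in range(len(tokens) - ngram + 1):
        window = "-".join(tokens[start : start + ngram])
        rebuilt = tokens[:start] + [window] + tokens[start + ngram :]
        results.add(" ".join(rebuilt))
    return results


variants = hyphenate_ngrams("non small cell")
```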

class kazu.ontology_preprocessing.synonym_generation.SeparatorExpansion[source]

Bases: SynonymGenerator

__init__()[source]
Return type:

None

call(synonym_str)[source]

Implementations should override this method to generate new strings from an input string.

Parameters:

synonym_str (str)

Return type:

set[str]

class kazu.ontology_preprocessing.synonym_generation.SpellingVariationReplacement[source]

Bases: SynonymGenerator

Generate additional synonyms using a mapping of (known) synonyms to a list of variations.

__init__(input_path)[source]
Parameters:

input_path (str | Path)

call(synonym_str)[source]

Implementations should override this method to generate new strings from an input string.

Parameters:

synonym_str (str)

Return type:

set[str]

class kazu.ontology_preprocessing.synonym_generation.StopWordRemover[source]

Bases: SynonymGenerator

Remove stopwords from a string.

classmethod call(synonym_str)[source]

Implementations should override this method to generate new strings from an input string.

Parameters:

synonym_str (str)

Return type:

set[str]

all_stopwords = {'and', 'by', 'caused', 'in', 'involved', 'of', 'the', 'to', 'with'}
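
The stopword set above lends itself to a short sketch. The helper remove_stopwords is hypothetical; whitespace tokenisation, case-insensitive matching, and returning an empty set when nothing changes are all assumptions rather than documented behaviour.

```python
STOPWORDS = {"and", "by", "caused", "in", "involved", "of", "the", "to", "with"}


def remove_stopwords(text: str) -> set[str]:
    """Return the text without stopwords, or an empty set if nothing changes."""
    tokens = text.split()
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    if len(kept) == len(tokens) or not kept:
        # Either no stopword was present or nothing would be left.
        return set()
    return {" ".join(kept)}


variants = remove_stopwords("cancer of the lung")
```
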
class kazu.ontology_preprocessing.synonym_generation.StringReplacement[source]

Bases: SynonymGenerator

__init__(replacement_dict=None, digit_aware_replacement_dict=None, include_greek=True)[source]
Parameters:
call(synonym_str)[source]

Implementations should override this method to generate new strings from an input string.

Parameters:

synonym_str (str)

Return type:

set[str]

GREEK_VARIANT_PREFIX_SUFFIX = {' ', '-', '‐', '‑', '‒', '–', '—', '―', '−'}
class kazu.ontology_preprocessing.synonym_generation.SuffixReplacement[source]

Bases: SynonymGenerator

Interchange all suffixes within a provided set to produce new synonyms.

Note that this is expected to be noisy, and most of the generated synonyms will not be valid words. This class is present as a generation step for high recall, with curation of synonyms expected later.

In particular, note that this also doesn’t check for the longest matching suffix - e.g. for a synonym ‘anaemia’ and the suffixes ‘ia’, ‘a’ and ‘ic’, the new synonyms ‘anaemic’ and ‘anaemiic’ will both be generated.

__init__(suffixes)[source]
Parameters:

suffixes (Iterable[str])

call(synonym_str)[source]

Implementations should override this method to generate new strings from an input string.

Parameters:

synonym_str (str)

Return type:

set[str]
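
The suffix interchange described for SuffixReplacement, including the absence of a longest-match check, can be reproduced in a few lines. This is a minimal sketch with a hypothetical swap_suffixes function, not the library's code.

```python
def swap_suffixes(text: str, suffixes: set[str]) -> set[str]:
    """Replace every matching suffix with every other suffix in the set."""
    results: set[str] = set()
    for old in suffixes:
        if not text.endswith(old):
            continue
        stem = text[: len(text) - len(old)]
        for new in suffixes:
            if new != old:
                results.add(stem + new)
    results.discard(text)
    return results


variants = swap_suffixes("anaemia", {"ia", "a", "ic"})
```

Because both 'ia' and 'a' match the end of 'anaemia', each is swapped independently, which is why the noisy form 'anaemiic' appears alongside 'anaemic'.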

class kazu.ontology_preprocessing.synonym_generation.SynonymGenerator[source]

Bases: ABC

__call__(string_to_mutate)[source]

Takes a string, and returns a set containing the original and generated strings.

Caching prevents re-computation of redundant strings.

Parameters:

string_to_mutate (str)

Return type:

set[str]

abstract call(string_to_mutate)[source]

Implementations should override this method to generate new strings from an input string.

Parameters:

string_to_mutate (str)

Return type:

set[str]
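
The split between __call__ (caching, adding the original string) and call (the actual generation) could look roughly like the sketch below. CachedGenerator and Upper are hypothetical classes, and a per-instance dict is only one assumed way to implement the cache.

```python
from abc import ABC, abstractmethod


class CachedGenerator(ABC):
    """Hypothetical base class: __call__ wraps call() with a per-instance cache."""

    def __init__(self) -> None:
        self._cache: dict[str, frozenset[str]] = {}

    def __call__(self, string_to_mutate: str) -> set[str]:
        if string_to_mutate not in self._cache:
            generated = self.call(string_to_mutate)
            # Store the original together with the generated strings.
            self._cache[string_to_mutate] = frozenset(generated | {string_to_mutate})
        # Hand back a copy so callers cannot mutate the cached value.
        return set(self._cache[string_to_mutate])

    @abstractmethod
    def call(self, string_to_mutate: str) -> set[str]:
        """Generate new strings from the input string."""


class Upper(CachedGenerator):
    """Toy subclass that counts how often call() actually runs."""

    def __init__(self) -> None:
        super().__init__()
        self.calls = 0

    def call(self, string_to_mutate: str) -> set[str]:
        self.calls += 1
        return {string_to_mutate.upper()}


gen = Upper()
first = gen("egfr")
second = gen("egfr")  # served from the cache, call() is not run again
```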

class kazu.ontology_preprocessing.synonym_generation.TokenListReplacementGenerator[source]

Bases: SynonymGenerator

Given lists of tokens, generate an alternative string based upon a query token.

Note, this implementation is pretty basic, and only replaces one token at a time. It’s mainly designed for ontologies like MedDRA, which stretch the definition of an entity somewhat by incorporating verbs (e.g. “increase in AST”).

__init__(token_lists_to_consider)[source]
Parameters:

token_lists_to_consider (list[list[str]]) – if any token from the sublist matches a query string, generate new strings based upon all tokens in this sublist.

call(synonym_str)[source]

Implementations should override this method to generate new strings from an input string.

Parameters:

synonym_str (str)

Return type:

set[str]
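
The one-token-at-a-time replacement can be sketched as follows. The function replace_tokens is hypothetical, and whitespace tokenisation with exact, case-sensitive matching is an assumption.

```python
def replace_tokens(text: str, token_lists: list[list[str]]) -> set[str]:
    """Swap a single matching token for each alternative in its list."""
    tokens = text.split()
    results: set[str] = set()
    for position, token in enumerate(tokens):
        for token_list in token_lists:
            if token not in token_list:
                continue
            for alternative in token_list:
                if alternative != token:
                    rebuilt = tokens[:position] + [alternative] + tokens[position + 1 :]
                    results.add(" ".join(rebuilt))
    return results


variants = replace_tokens("increase in AST", [["increase", "increased", "elevated"]])
```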

class kazu.ontology_preprocessing.synonym_generation.VerbPhraseVariantGenerator[source]

Bases: SynonymGenerator

Generate alternative verb phrases based on a list of tense templates, and lemmas matched in a query.

It’s mainly designed for ontologies like MedDRA, which stretch the definition of an entity somewhat by incorporating verbs (e.g. “increase in AST”).

__init__(tense_templates, lemmas_to_consider, spacy_model_path)[source]
Parameters:
  • tense_templates (list[str]) –

    template expressions to generate, for example:

    ["{NOUN} {TARGET}", "{TARGET} in {NOUN}"]
    

  • lemmas_to_consider (dict[str, list[str]]) –

    a dict of verb lemmas to surface forms to generate, for example:

    {"increase": ["increasing", "increased"], "decrease": ["decreased", "decreasing"]}
    

  • spacy_model_path (str) – path to a serialised spaCy model - must have a lemmatizer component.

call(synonym_str)[source]

Implementations should override this method to generate new strings from an input string.

Parameters:

synonym_str (str)

Return type:

set[str]

NOUN_PLACEHOLDER = 'NOUN'
VERB_PLACEHOLDER = 'TARGET'
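
The template-filling step for VerbPhraseVariantGenerator can be illustrated without spaCy if the noun and the verb lemma are assumed to have been identified already. The sketch below uses str.format with the NOUN and TARGET placeholders; fill_templates is a hypothetical helper, and the parsing that the real class delegates to the spaCy model is deliberately left out.

```python
def fill_templates(
    noun: str,
    lemma: str,
    tense_templates: list[str],
    lemmas_to_consider: dict[str, list[str]],
) -> set[str]:
    """Fill every template with every surface form listed for the lemma."""
    results: set[str] = set()
    for surface_form in lemmas_to_consider.get(lemma, []):
        for template in tense_templates:
            results.add(template.format(NOUN=noun, TARGET=surface_form))
    return results


variants = fill_templates(
    noun="AST",
    lemma="increase",
    tense_templates=["{NOUN} {TARGET}", "{TARGET} in {NOUN}"],
    lemmas_to_consider={
        "increase": ["increasing", "increased"],
        "decrease": ["decreased", "decreasing"],
    },
)
```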