kazu.utils.string_normalizer

Classes

AnatomyStringNormalizer

CompanyStringNormalizer

DefaultStringNormalizer

Normalize a biomedical string for search.

DiseaseStringNormalizer

EntityClassNormalizer

Protocol describing methods a normalizer should implement.

GeneStringNormalizer

GildaUtils

Functions derived from gilda used by (some of) the StringNormalizers.

StringNormalizer

Call custom entity class normalizers, or a default normalizer if none is available.

class kazu.utils.string_normalizer.AnatomyStringNormalizer[source]

Bases: EntityClassNormalizer

static is_symbol_like(original_string)[source]

Determine whether a string is a symbol (e.g. “AD”) or a noun phrase (e.g. “Alzheimer’s Disease”).

Parameters:

original_string (str)

Returns:

Return type:

bool

static normalize_noun_phrase(original_string)[source]

Revert to DefaultStringNormalizer.normalize_noun_phrase().

Parameters:

original_string (str)

Returns:

Return type:

str

static normalize_symbol(original_string)[source]

Revert to DefaultStringNormalizer.normalize_noun_phrase(). Since all anatomy terms are non-symbolic, this method is theoretically superfluous, but it is included anyway.

Parameters:

original_string (str)

Returns:

Return type:

str

class kazu.utils.string_normalizer.CompanyStringNormalizer[source]

Bases: EntityClassNormalizer

static is_symbol_like(original_string)[source]

Determine whether a string is a symbol (e.g. “AD”) or a noun phrase (e.g. “Alzheimer’s Disease”).

Parameters:

original_string (str)

Returns:

Return type:

bool

static normalize_noun_phrase(original_string)[source]

Revert to DefaultStringNormalizer.normalize_noun_phrase().

Parameters:

original_string (str)

Returns:

Return type:

str

static normalize_symbol(original_string)[source]

Just upper case.

Parameters:

original_string (str)

Returns:

Return type:

str

class kazu.utils.string_normalizer.DefaultStringNormalizer[source]

Bases: EntityClassNormalizer

Normalize a biomedical string for search.

Suitable for most use cases.

static depluralize(string)[source]

Apply some depluralisation rules.

Parameters:

string (str)

Returns:

Return type:

str

static handle_lower_case_prefixes(string)[source]

Preserve case only if the first character of a contiguous subsequence is lower case and alphanumeric, and an upper-case character is detected in the rest of that part. Currently unused, as it causes problems when normalising e.g. erbB2, which is a commonly used form of the symbol.

Parameters:

string (str)

Returns:

Return type:

str

static is_symbol_like(original_string)[source]

Checks the ratio of upper-case to lower-case characters, and of numeric to alphabetic characters.

Parameters:

original_string (str)

Returns:

Return type:

bool

static normalize_noun_phrase(original_string)[source]

Method for normalising a noun phrase.

Parameters:

original_string (str)

Returns:

Return type:

str

static normalize_symbol(original_string)[source]

Method for normalising a symbol.

Parameters:

original_string (str)

Returns:

Return type:

str

static remove_non_alphanum(string)[source]

Removes all non-alphanumeric characters.

Parameters:

string (str)

Returns:

Return type:

str

static replace_greek(string)[source]

Replaces Greek characters with their string representation.

Parameters:

string (str)

Returns:

Return type:

str
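
A minimal sketch of the kind of substitution this method describes, assuming a character-by-character lookup in a mapping such as greek_subs. Only a subset of that table is reproduced here, the name replace_greek_sketch is hypothetical, and the library's own method may differ in detail (for example in which table it uses or how it handles case).

```python
# Subset of the greek_subs table listed on this page (illustrative only).
GREEK_SUBS = {"α": "alpha", "β": "beta", "γ": "gamma", "κ": "kappa"}


def replace_greek_sketch(string: str) -> str:
    """Replace each Greek character with its spelled-out name.

    Hypothetical helper: characters without an entry are passed through.
    """
    return "".join(GREEK_SUBS.get(char, char) for char in string)


print(replace_greek_sketch("TNF-α"))
```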

static replace_substrings(original_string)[source]

Replaces a range of other strings that might be confusing to a classifier, such as roman numerals.

Parameters:

original_string (str)

Returns:

Return type:

str
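
To illustrate the substitutions described above, here is a small sketch that applies two of the compiled patterns listed under re_subs on this page. The function name apply_re_subs_sketch and the order of application are assumptions; the library method may also apply the other_subs table, which is not shown.

```python
import re

# Two of the patterns listed under re_subs on this page.
RE_SUBS = {
    # A hyphen becomes a space, unless it directly follows "(" or precedes ")".
    re.compile(r"(?<!\()-(?!\))"): " ",
    # A standalone roman numeral I becomes the digit 1.
    re.compile(r"\sI\s|\sI$"): " 1 ",
}


def apply_re_subs_sketch(string: str) -> str:
    """Apply each pattern in turn (hypothetical helper, not the library code)."""
    for pattern, replacement in RE_SUBS.items():
        string = pattern.sub(replacement, string)
    return string


print(repr(apply_re_subs_sketch("collagen type-I")))
print(repr(apply_re_subs_sketch("(-)-epicatechin")))
```

Note that the replacement strings carry their own padding spaces, so the output of this sketch can end in whitespace.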

static split_on_numbers(string)[source]

Splits a string on numbers, for consistency.

Parameters:

string (str)

Returns:

Return type:

str
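
A sketch of how splitting on numbers could work with the number_split_pattern listed on this page. Joining the pieces with single spaces and dropping empty pieces are assumptions made for illustration, and split_on_numbers_sketch is a hypothetical name.

```python
import re

# Same expression as number_split_pattern on this page; the capturing
# group keeps the digits themselves in the result of split().
NUMBER_SPLIT_PATTERN = re.compile(r"(\d+)")


def split_on_numbers_sketch(string: str) -> str:
    """Separate runs of digits from the surrounding text with spaces."""
    parts = NUMBER_SPLIT_PATTERN.split(string)
    # Assumption: empty or whitespace-only pieces are discarded.
    return " ".join(part.strip() for part in parts if part.strip())


print(split_on_numbers_sketch("IL17A"))
```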

static sub_greek_char_abbreviations(string)[source]

Substitute single-character abbreviations with their spelled-out representation, e.g. A -> ALPHA, B -> BETA.

Parameters:

string (str)

Returns:

Return type:

str
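
The two patterns listed under re_subs_2 on this page suggest how such a substitution might behave. The sketch below applies them in sequence; the helper name and the final strip() (added only to make the output easier to read) are assumptions, not part of the documented API.

```python
import re

# The patterns listed under re_subs_2 on this page: a standalone "A" or "B"
# (bounded by whitespace or by the start/end of the string) is expanded.
RE_SUBS_2 = {
    re.compile(r"\sA\s|\sA$|^A\s"): " ALPHA ",
    re.compile(r"\sB\s|\sB$|^B\s"): " BETA ",
}


def sub_greek_abbreviations_sketch(string: str) -> str:
    """Expand standalone A/B to ALPHA/BETA (hypothetical helper)."""
    for pattern, replacement in RE_SUBS_2.items():
        string = pattern.sub(replacement, string)
    # Assumption: outer whitespace is trimmed for readability.
    return string.strip()


print(sub_greek_abbreviations_sketch("INTERFERON A"))
```

Letters inside a longer token, such as the A in ABL1, are left alone because the patterns require surrounding whitespace or a string boundary.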

allowed_additional_chars = {' ', '(', ')', '+', '-', '‐'}
greek_subs = {'Α': 'alpha', 'Β': 'beta', 'Γ': 'gamma', 'Δ': 'delta', 'Ε': 'epsilon', 'Ζ': 'zeta', 'Η': 'eta', 'Θ': 'theta', 'Ι': 'iota', 'Κ': 'kappa', 'Λ': 'lambda', 'Μ': 'mu', 'Ν': 'nu', 'Ξ': 'xi', 'Ο': 'omicron', 'Π': 'pi', 'Ρ': 'rho', 'Σ': 'sigma', 'Τ': 'tau', 'Υ': 'upsilon', 'Φ': 'phi', 'Χ': 'chi', 'Ψ': 'psi', 'Ω': 'omega', 'α': 'alpha', 'β': 'beta', 'γ': 'gamma', 'δ': 'delta', 'ε': 'epsilon', 'ζ': 'zeta', 'η': 'eta', 'θ': 'theta', 'ι': 'iota', 'κ': 'kappa', 'λ': 'lambda', 'μ': 'mu', 'ν': 'nu', 'ξ': 'xi', 'ο': 'omicron', 'π': 'pi', 'ρ': 'rho', 'ς': 'final sigma', 'σ': 'sigma', 'τ': 'tau', 'υ': 'upsilon', 'φ': 'phi', 'χ': 'chi', 'ψ': 'psi', 'ω': 'omega', 'ϐ': 'beta', 'ϕ': 'phi', 'ϴ': 'theta'}
greek_subs_upper = {'Α': ' ALPHA ', 'Β': ' BETA ', 'Γ': ' GAMMA ', 'Δ': ' DELTA ', 'Ε': ' EPSILON ', 'Ζ': ' ZETA ', 'Η': ' ETA ', 'Θ': ' THETA ', 'Ι': ' IOTA ', 'Κ': ' KAPPA ', 'Λ': ' LAMBDA ', 'Μ': ' MU ', 'Ν': ' NU ', 'Ξ': ' XI ', 'Ο': ' OMICRON ', 'Π': ' PI ', 'Ρ': ' RHO ', 'Σ': ' SIGMA ', 'Τ': ' TAU ', 'Υ': ' UPSILON ', 'Φ': ' PHI ', 'Χ': ' CHI ', 'Ψ': ' PSI ', 'Ω': ' OMEGA ', 'α': ' ALPHA ', 'β': ' BETA ', 'γ': ' GAMMA ', 'δ': ' DELTA ', 'ε': ' EPSILON ', 'ζ': ' ZETA ', 'η': ' ETA ', 'θ': ' THETA ', 'ι': ' IOTA ', 'κ': ' KAPPA ', 'λ': ' LAMBDA ', 'μ': ' MU ', 'ν': ' NU ', 'ξ': ' XI ', 'ο': ' OMICRON ', 'π': ' PI ', 'ρ': ' RHO ', 'ς': ' FINAL SIGMA ', 'σ': ' SIGMA ', 'τ': ' TAU ', 'υ': ' UPSILON ', 'φ': ' PHI ', 'χ': ' CHI ', 'ψ': ' PSI ', 'ω': ' OMEGA ', 'ϐ': ' BETA ', 'ϕ': ' PHI ', 'ϴ': ' THETA '}
number_split_pattern = re.compile('(\\d+)')
other_subs = {'(': ' (', ')': ') ', ',': ' ', '/': ' ', 'II': ' 2 ', 'III': ' 3 ', 'IV': ' 4 ', 'IX': ' 9 ', 'VI': ' 6 ', 'VII': ' 7 ', 'VIII': ' 8 ', 'XI': ' 11 ', 'XII': ' 12 '}
re_subs = {re.compile('(?<!\\()-(?!\\))'): ' ', re.compile('(?<!\\()‐(?!\\))'): ' ', re.compile('\\sI\\s|\\sI$'): ' 1 ', re.compile('\\sV\\s|\\sV$'): ' 5 '}
re_subs_2 = {re.compile('\\sA\\s|\\sA$|^A\\s'): ' ALPHA ', re.compile('\\sB\\s|\\sB$|^B\\s'): ' BETA '}

class kazu.utils.string_normalizer.DiseaseStringNormalizer[source]

Bases: EntityClassNormalizer

static is_symbol_like(original_string)[source]

Determine whether a string is a symbol (e.g. “AD”) or a noun phrase (e.g. “Alzheimer’s Disease”).

Parameters:

original_string (str)

Returns:

Return type:

bool

static normalize_noun_phrase(original_string)[source]

Revert to DefaultStringNormalizer.normalize_noun_phrase().

Parameters:

original_string (str)

Returns:

Return type:

str

static normalize_symbol(original_string)[source]

Revert to DefaultStringNormalizer.normalize_symbol().

Parameters:

original_string (str)

Returns:

Return type:

str

known_disease_short_nouns = {'Flu', 'HIV', 'NSCLC', 'STI', 'flu'}

class kazu.utils.string_normalizer.EntityClassNormalizer[source]

Bases: Protocol

Protocol describing methods a normalizer should implement.

__init__(*args, **kwargs)[source]

static is_symbol_like(original_string)[source]

Determine whether a string is a symbol (e.g. “AD”) or a noun phrase (e.g. “Alzheimer’s Disease”).

Parameters:

original_string (str)

Returns:

Return type:

bool

static normalize_noun_phrase(original_string)[source]

Method for normalising a noun phrase.

Parameters:

original_string (str)

Returns:

Return type:

str

static normalize_symbol(original_string)[source]

Method for normalising a symbol.

Parameters:

original_string (str)

Returns:

Return type:

str

class kazu.utils.string_normalizer.GeneStringNormalizer[source]

Bases: EntityClassNormalizer

static gene_token_classifier(original_string)[source]

Slightly modified version of DefaultStringNormalizer.is_symbol_like(), designed to work on single tokens. Checks whether the casing of the token changes from lower to upper; if so, it is likely to be symbolic (e.g. erbB2).

Parameters:

original_string (str)

Returns:

Return type:

bool

static is_symbol_like(original_string)[source]

A symbol classifier designed to improve recall on natural text, especially for gene symbols. It looks at the ratio of upper-case to lower-case characters, and the ratio of integer to alphabetic characters. If the proportion of upper-case characters or integers is higher, the string is assumed to be a symbol. Also, if the first character is lower case and any subsequent character is upper case, it is probably a symbol (e.g. erbB2).

Parameters:

original_string (str)

Returns:

Return type:

bool
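
The heuristic described above can be sketched for a single token as follows. This is one possible reading of the description, not the library's implementation: the exact comparisons ("more upper-case than lower-case characters", "more digits than letters") and the name looks_symbolic_sketch are assumptions.

```python
def looks_symbolic_sketch(token: str) -> bool:
    """Guess whether a single token is a symbol rather than a word."""
    upper = sum(char.isupper() for char in token)
    lower = sum(char.islower() for char in token)
    digits = sum(char.isdigit() for char in token)

    # Lower-case first character followed by any upper-case character,
    # as in erbB2, is treated as symbolic.
    if token[:1].islower() and any(char.isupper() for char in token[1:]):
        return True

    # Assumed thresholds: upper case outnumbers lower case, or digits
    # outnumber alphabetic characters.
    return upper > lower or digits > (upper + lower)


print(looks_symbolic_sketch("erbB2"), looks_symbolic_sketch("insulin"))
```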

static normalize_noun_phrase(original_string)[source]

Revert to DefaultStringNormalizer.normalize_noun_phrase() for non-symbolic genes.

Parameters:

original_string (str)

Returns:

Return type:

str

static normalize_symbol(original_string)[source]

Unlike other entity classes, gene symbols require special handling because of their highly unusual nature.

Parameters:

original_string (str)

Returns:

Return type:

str

static remove_trailing_s_if_otherwise_capitalised(string)[source]

Frustratingly, some gene symbols are pluralised, like ERBBs. We can’t simply remove a trailing s, as this breaks genuine symbols like ‘MDH-s’ and ‘GASP10ps’. So we only strip the trailing ‘s’ if the character before it is upper case.

Parameters:

string (str)

Returns:

Return type:

str
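
The rule stated above is small enough to sketch directly. The function below is an illustrative reimplementation under a hypothetical name, written only from the description on this page.

```python
def remove_trailing_s_sketch(string: str) -> str:
    """Strip a trailing 's' only when the preceding character is upper case."""
    if len(string) >= 2 and string[-1] == "s" and string[-2].isupper():
        return string[:-1]
    return string


for symbol in ("ERBBs", "MDH-s", "GASP10ps"):
    print(symbol, "->", remove_trailing_s_sketch(symbol))
```

The two symbols named in the description survive unchanged because the character before the final s is a hyphen in one case and a lower-case p in the other.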

gene_name_suffixes = {'an', 'ase', 'gen', 'gon', 'in'}

class kazu.utils.string_normalizer.GildaUtils[source]

Bases: object

Functions derived from gilda used by (some of) the StringNormalizers.

Original Credit:

Paper:

Benjamin M Gyori, Charles Tapley Hoyt, and Albert Steppi. 2022. Gilda: biomedical entity text normalization with machine-learned disambiguation as a service. Bioinformatics Advances. Vbac034.

Bibtex Citation Details
@article{gyori2022gilda,
    author = {Gyori, Benjamin M and Hoyt, Charles Tapley and Steppi, Albert},
    title = "{{Gilda: biomedical entity text normalization with machine-learned disambiguation as a service}}",
    journal = {Bioinformatics Advances},
    year = {2022},
    month = {05},
    issn = {2635-0041},
    doi = {10.1093/bioadv/vbac034},
    url = {https://doi.org/10.1093/bioadv/vbac034},
    note = {vbac034}
}

Licensed under BSD 2-Clause.

Copyright (c) 2019, Benjamin M. Gyori, Harvard Medical School All rights reserved.

Full License

BSD 2-Clause License

Copyright (c) 2019, Benjamin M. Gyori, Harvard Medical School All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

static depluralize(word)[source]

Return the depluralized version of the word, along with a status flag.

Parameters:

word (str) – The word which is to be depluralized.

Returns:

A tuple containing:

  1. The original word, if it is detected to be non-plural, or the depluralized version of the word.

  2. A status flag representing the detected pluralization status of the word: one of non_plural (e.g. BRAF), plural_oes (e.g. mosquitoes), plural_ies (e.g. antibodies), plural_es (e.g. switches), plural_cap_s (e.g. MAPKs), or plural_s (e.g. receptors).

Return type:

tuple[str, str]
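
A simplified sketch of a rule-based depluralizer that returns the same kind of (word, status) tuple. The flag names follow the list above; the specific suffix rules, their order, and the name depluralize_sketch are assumptions chosen so that the documented examples come out as described, and may not match the library's rules.

```python
from typing import Tuple


def depluralize_sketch(word: str) -> Tuple[str, str]:
    """Return (possibly depluralized word, status flag)."""
    if not word.endswith("s"):
        return word, "non_plural"
    if word.endswith("sis"):
        # Assumption: words such as "apoptosis" are treated as singular.
        return word, "non_plural"
    if word.endswith("oes"):
        return word[:-2], "plural_oes"
    if word.endswith("ies"):
        return word[:-3] + "y", "plural_ies"
    if word.endswith(("xes", "ses", "ches", "shes")):
        return word[:-2], "plural_es"
    stem = word[:-1]
    if stem and all(char.isupper() for char in stem):
        # All-caps stem followed by a lower-case s, e.g. MAPKs.
        return stem, "plural_cap_s"
    return stem, "plural_s"


for example in ("BRAF", "mosquitoes", "antibodies", "switches", "MAPKs", "receptors"):
    print(example, depluralize_sketch(example))
```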

classmethod split_on_dashes_or_space(s)[source]

Split input string on a space or any kind of dash.

Parameters:

s (str) – Input string for splitting

Returns:

The resulting split string as a list

Return type:

list[str]
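
A sketch of the split using a character class modelled on DASHES_OR_SPACE_PATTERN from this page. The Unicode escapes are my reading of the dash characters in that pattern (an assumption), and split_on_dashes_or_space_sketch is a hypothetical standalone function rather than the classmethod itself.

```python
import re
from typing import List

# Space plus several dash-like characters, written as escapes:
# hyphen, horizontal bar, minus sign, figure dash, em dash,
# non-breaking hyphen, ASCII hyphen-minus and en dash.
DASHES_OR_SPACE = re.compile(
    "[ \u2010\u2015\u2212\u2012\u2014\u2011\\-\u2013]+"
)


def split_on_dashes_or_space_sketch(s: str) -> List[str]:
    """Split on runs of spaces and/or dashes."""
    return DASHES_OR_SPACE.split(s)


print(split_on_dashes_or_space_sketch("IL-2 receptor"))
```

Because the pattern ends in +, a run such as " - " counts as a single separator and does not produce empty strings in the middle of the result.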

DASHES_OR_SPACE_PATTERN = re.compile('[ ‐―−‒—‑\\-–]+')
PLURAL_CAPS_S_PATTERN: Pattern = regex.Regex('^\\p{Lu}+$', flags=regex.V0)

class kazu.utils.string_normalizer.StringNormalizer[source]

Bases: object

Call custom entity class normalizers, or a default normalizer if none is available.

static classify_symbolic(original_string, entity_class=None)[source]
Parameters:
  • original_string (str)

  • entity_class (str | None)

Return type:

bool

static normalize(original_string, entity_class=None)[source]
Parameters:
  • original_string (str)

  • entity_class (str | None)

Return type:

str

normalizers: dict[str | None, type[EntityClassNormalizer]] = {'anatomy': <class 'kazu.utils.string_normalizer.AnatomyStringNormalizer'>, 'company': <class 'kazu.utils.string_normalizer.CompanyStringNormalizer'>, 'disease': <class 'kazu.utils.string_normalizer.DiseaseStringNormalizer'>, 'gene': <class 'kazu.utils.string_normalizer.GeneStringNormalizer'>}
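
The dispatch described for StringNormalizer can be sketched as a dictionary lookup with a default fallback. Everything below is illustrative: the toy normalizer classes, their rules, and the assumption that normalize() first calls is_symbol_like() and then picks normalize_symbol() or normalize_noun_phrase() are not taken from the library source.

```python
from typing import Dict, Optional, Protocol, Type


class NormalizerSketch(Protocol):
    """Structural stand-in for the EntityClassNormalizer protocol."""

    @staticmethod
    def is_symbol_like(original_string: str) -> bool: ...

    @staticmethod
    def normalize_symbol(original_string: str) -> str: ...

    @staticmethod
    def normalize_noun_phrase(original_string: str) -> str: ...


class ToyDefaultNormalizer:
    @staticmethod
    def is_symbol_like(original_string: str) -> bool:
        return original_string.isupper()

    @staticmethod
    def normalize_symbol(original_string: str) -> str:
        return original_string.strip()

    @staticmethod
    def normalize_noun_phrase(original_string: str) -> str:
        # Toy rule: upper-case and collapse whitespace.
        return " ".join(original_string.upper().split())


class ToyCompanyNormalizer(ToyDefaultNormalizer):
    @staticmethod
    def is_symbol_like(original_string: str) -> bool:
        return len(original_string) <= 4  # hypothetical rule

    @staticmethod
    def normalize_symbol(original_string: str) -> str:
        return original_string.upper()


NORMALIZERS: Dict[Optional[str], Type[NormalizerSketch]] = {
    "company": ToyCompanyNormalizer,
}


def normalize_sketch(original_string: str, entity_class: Optional[str] = None) -> str:
    """Pick a class-specific normalizer, falling back to the default."""
    normalizer = NORMALIZERS.get(entity_class, ToyDefaultNormalizer)
    if normalizer.is_symbol_like(original_string):
        return normalizer.normalize_symbol(original_string)
    return normalizer.normalize_noun_phrase(original_string)


print(normalize_sketch("gsk", "company"))
print(normalize_sketch("breast  cancer"))
```

Using a Protocol means a new entity class only needs to provide the three static methods; it does not have to inherit from a shared base class.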