kazu.utils.string_normalizer¶
Classes
DefaultStringNormalizer | Normalize a biomedical string for search.
EntityClassNormalizer | Protocol describing methods a normalizer should implement.
GildaUtils | Functions derived from gilda used by (some of) the StringNormalizers.
StringNormalizer | Call custom entity class normalizers, or a default normalizer if none is available.
- class kazu.utils.string_normalizer.AnatomyStringNormalizer[source]¶
Bases: EntityClassNormalizer
- static is_symbol_like(original_string)[source]¶
Method to determine whether a string is a symbol (e.g. “AD”) or a noun phrase (e.g. “Alzheimer’s Disease”).
- static normalize_symbol(original_string)[source]¶
Revert to DefaultStringNormalizer.normalize_noun_phrase(). Note: since all anatomy is non-symbolic, this is theoretically superfluous, but we include it anyway.
- class kazu.utils.string_normalizer.CompanyStringNormalizer[source]¶
Bases: EntityClassNormalizer
- static is_symbol_like(original_string)[source]¶
Method to determine whether a string is a symbol (e.g. “AD”) or a noun phrase (e.g. “Alzheimer’s Disease”).
- class kazu.utils.string_normalizer.DefaultStringNormalizer[source]¶
Bases: EntityClassNormalizer
Normalize a biomedical string for search. Suitable for most use cases.
- static handle_lower_case_prefixes(string)[source]¶
Preserve case only if the first character of a contiguous subsequence is lower case and alphanumeric, and an upper-case character is detected in the rest of that part. Currently unused, as it causes problems when normalising strings such as erbB2, which is a commonly used form of the symbol.
- static is_symbol_like(original_string)[source]¶
Checks for ratio of upper to lower case characters, and numeric to alpha characters.
- static replace_substrings(original_string)[source]¶
Replaces a range of other strings that might be confusing to a classifier, such as Roman numerals.
- static sub_greek_char_abbreviations(string)[source]¶
Substitute single characters with their spelled-out Greek letter names, e.g. A -> ALPHA, B -> BETA.
- allowed_additional_chars = {' ', '(', ')', '+', '-', '‐'}¶
- greek_subs = {'Α': 'alpha', 'Β': 'beta', 'Γ': 'gamma', 'Δ': 'delta', 'Ε': 'epsilon', 'Ζ': 'zeta', 'Η': 'eta', 'Θ': 'theta', 'Ι': 'iota', 'Κ': 'kappa', 'Λ': 'lambda', 'Μ': 'mu', 'Ν': 'nu', 'Ξ': 'xi', 'Ο': 'omicron', 'Π': 'pi', 'Ρ': 'rho', 'Σ': 'sigma', 'Τ': 'tau', 'Υ': 'upsilon', 'Φ': 'phi', 'Χ': 'chi', 'Ψ': 'psi', 'Ω': 'omega', 'α': 'alpha', 'β': 'beta', 'γ': 'gamma', 'δ': 'delta', 'ε': 'epsilon', 'ζ': 'zeta', 'η': 'eta', 'θ': 'theta', 'ι': 'iota', 'κ': 'kappa', 'λ': 'lambda', 'μ': 'mu', 'ν': 'nu', 'ξ': 'xi', 'ο': 'omicron', 'π': 'pi', 'ρ': 'rho', 'ς': 'final sigma', 'σ': 'sigma', 'τ': 'tau', 'υ': 'upsilon', 'φ': 'phi', 'χ': 'chi', 'ψ': 'psi', 'ω': 'omega', 'ϐ': 'beta', 'ϕ': 'phi', 'ϴ': 'theta'}¶
- greek_subs_upper = {'Α': ' ALPHA ', 'Β': ' BETA ', 'Γ': ' GAMMA ', 'Δ': ' DELTA ', 'Ε': ' EPSILON ', 'Ζ': ' ZETA ', 'Η': ' ETA ', 'Θ': ' THETA ', 'Ι': ' IOTA ', 'Κ': ' KAPPA ', 'Λ': ' LAMBDA ', 'Μ': ' MU ', 'Ν': ' NU ', 'Ξ': ' XI ', 'Ο': ' OMICRON ', 'Π': ' PI ', 'Ρ': ' RHO ', 'Σ': ' SIGMA ', 'Τ': ' TAU ', 'Υ': ' UPSILON ', 'Φ': ' PHI ', 'Χ': ' CHI ', 'Ψ': ' PSI ', 'Ω': ' OMEGA ', 'α': ' ALPHA ', 'β': ' BETA ', 'γ': ' GAMMA ', 'δ': ' DELTA ', 'ε': ' EPSILON ', 'ζ': ' ZETA ', 'η': ' ETA ', 'θ': ' THETA ', 'ι': ' IOTA ', 'κ': ' KAPPA ', 'λ': ' LAMBDA ', 'μ': ' MU ', 'ν': ' NU ', 'ξ': ' XI ', 'ο': ' OMICRON ', 'π': ' PI ', 'ρ': ' RHO ', 'ς': ' FINAL SIGMA ', 'σ': ' SIGMA ', 'τ': ' TAU ', 'υ': ' UPSILON ', 'φ': ' PHI ', 'χ': ' CHI ', 'ψ': ' PSI ', 'ω': ' OMEGA ', 'ϐ': ' BETA ', 'ϕ': ' PHI ', 'ϴ': ' THETA '}¶
- number_split_pattern = re.compile('(\\d+)')¶
- other_subs = {'(': ' (', ')': ') ', ',': ' ', '/': ' ', 'II': ' 2 ', 'III': ' 3 ', 'IV': ' 4 ', 'IX': ' 9 ', 'VI': ' 6 ', 'VII': ' 7 ', 'VIII': ' 8 ', 'XI': ' 11 ', 'XII': ' 12 '}¶
- re_subs = {re.compile('(?<!\\()-(?!\\))'): ' ', re.compile('(?<!\\()‐(?!\\))'): ' ', re.compile('\\sI\\s|\\sI$'): ' 1 ', re.compile('\\sV\\s|\\sV$'): ' 5 '}¶
- re_subs_2 = {re.compile('\\sA\\s|\\sA$|^A\\s'): ' ALPHA ', re.compile('\\sB\\s|\\sB$|^B\\s'): ' BETA '}¶
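The substitution attributes above can be exercised directly with the standard `re` module. The following is a minimal sketch, not the library's implementation: the patterns are copied from the listing above (ASCII hyphen variant only), while the helper `apply_subs` and the order in which patterns are applied are assumptions made for illustration.

```python
import re

# Patterns copied from the re_subs, re_subs_2 and number_split_pattern
# attributes listed above.
RE_SUBS = {
    # A hyphen that is not preceded by "(" and not followed by ")".
    re.compile(r"(?<!\()-(?!\))"): " ",
    # Standalone Roman numerals I and V.
    re.compile(r"\sI\s|\sI$"): " 1 ",
    re.compile(r"\sV\s|\sV$"): " 5 ",
}
RE_SUBS_2 = {
    # Standalone letters A and B, read as Greek letter abbreviations.
    re.compile(r"\sA\s|\sA$|^A\s"): " ALPHA ",
    re.compile(r"\sB\s|\sB$|^B\s"): " BETA ",
}
NUMBER_SPLIT_PATTERN = re.compile(r"(\d+)")


def apply_subs(text: str, subs: dict) -> str:
    """Hypothetical helper: apply every pattern in dict insertion order."""
    for pattern, replacement in subs.items():
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(repr(apply_subs("collagen type I", RE_SUBS)))
    print(repr(apply_subs("protein kinase B", RE_SUBS_2)))
    # The capturing group keeps the digits as separate list items.
    print(NUMBER_SPLIT_PATTERN.split("erbB2"))
```

Note that the replacements are padded with spaces, so a later whitespace clean-up step is presumably needed to collapse the extra spaces; that step is not shown here.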
- class kazu.utils.string_normalizer.DiseaseStringNormalizer[source]¶
Bases: EntityClassNormalizer
- static is_symbol_like(original_string)[source]¶
Method to determine whether a string is a symbol (e.g. “AD”) or a noun phrase (e.g. “Alzheimer’s Disease”).
- static normalize_symbol(original_string)[source]¶
Revert to DefaultStringNormalizer.normalize_symbol().
- known_disease_short_nouns = {'Flu', 'HIV', 'NSCLC', 'STI', 'flu'}¶
- class kazu.utils.string_normalizer.EntityClassNormalizer[source]¶
Bases: Protocol
Protocol describing methods a normalizer should implement.
- static is_symbol_like(original_string)[source]¶
Method to determine whether a string is a symbol (e.g. “AD”) or a noun phrase (e.g. “Alzheimer’s Disease”).
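To illustrate how a protocol of this shape can be satisfied structurally, here is a small sketch using `typing.Protocol`. The names `NormalizerLike`, `UpperCaseNormalizer` and `normalize` are hypothetical, and including `normalize_symbol` and `normalize_noun_phrase` in the protocol is an assumption based on the methods the concrete normalizers on this page expose.

```python
from __future__ import annotations

from typing import Protocol


class NormalizerLike(Protocol):
    """Hypothetical stand-in for a normalizer protocol."""

    @staticmethod
    def is_symbol_like(original_string: str) -> bool: ...

    @staticmethod
    def normalize_symbol(original_string: str) -> str: ...

    @staticmethod
    def normalize_noun_phrase(original_string: str) -> str: ...


class UpperCaseNormalizer:
    """Toy normalizer: no inheritance needed, only matching methods."""

    @staticmethod
    def is_symbol_like(original_string: str) -> bool:
        return original_string.isupper()

    @staticmethod
    def normalize_symbol(original_string: str) -> str:
        return original_string.strip()

    @staticmethod
    def normalize_noun_phrase(original_string: str) -> str:
        return original_string.strip().lower()


def normalize(normalizer: type[NormalizerLike], text: str) -> str:
    """Route the text to the symbol or noun-phrase branch."""
    if normalizer.is_symbol_like(text):
        return normalizer.normalize_symbol(text)
    return normalizer.normalize_noun_phrase(text)


if __name__ == "__main__":
    print(normalize(UpperCaseNormalizer, "AD"))
    print(normalize(UpperCaseNormalizer, "Alzheimers Disease"))
```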
- class kazu.utils.string_normalizer.GeneStringNormalizer[source]¶
Bases: EntityClassNormalizer
- static gene_token_classifier(original_string)[source]¶
Slightly modified version of DefaultStringNormalizer.is_symbol_like(), designed to work on single tokens. Checks whether the casing of the symbol changes from lower to upper; if so, the token is likely to be symbolic (e.g. erbB2).
- static is_symbol_like(original_string)[source]¶
A symbol classifier designed to improve recall on natural text, especially for gene symbols. It looks at the ratio of upper-case to lower-case characters and the ratio of integer to alphabetic characters. If upper-case characters or integers predominate, the string is assumed to be a symbol. In addition, if the first character is lower case and any subsequent character is upper case, it is probably a symbol (e.g. erbB2).
- static normalize_noun_phrase(original_string)[source]¶
Revert to DefaultStringNormalizer.normalize_noun_phrase() for non-symbolic genes.
- static normalize_symbol(original_string)[source]¶
Contrary to other entity classes, gene symbols require special handling because of their highly unusual nature.
- static remove_trailing_s_if_otherwise_capitalised(string)[source]¶
Frustratingly, some gene symbols are pluralised, like ERBBs. We can’t just remove a trailing ‘s’, as this breaks genuine symbols like ‘MDH-s’ and ‘GASP10ps’. So we only strip the trailing ‘s’ if the character before it is upper case.
- gene_name_suffixes = {'an', 'ase', 'gen', 'gon', 'in'}¶
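The two gene heuristics described above (the symbol classifier and the trailing ‘s’ rule) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the code in kazu: the function names are hypothetical, and the exact thresholds ("more upper-case than lower-case letters", "more digits than letters") are guesses at what "ratio is higher" means.

```python
def looks_like_gene_symbol(text: str) -> bool:
    """Hypothetical heuristic following the description of is_symbol_like."""
    upper = sum(ch.isupper() for ch in text)
    lower = sum(ch.islower() for ch in text)
    digits = sum(ch.isdigit() for ch in text)

    # Assumed thresholds: upper case or digits predominate.
    if upper > lower or digits > (upper + lower):
        return True

    # Lower-case first character followed by any upper case, e.g. erbB2.
    return bool(text) and text[0].islower() and any(
        ch.isupper() for ch in text[1:]
    )


def strip_plural_s(symbol: str) -> str:
    """Drop a trailing 's' only when the preceding character is upper case."""
    if len(symbol) >= 2 and symbol.endswith("s") and symbol[-2].isupper():
        return symbol[:-1]
    return symbol


if __name__ == "__main__":
    for candidate in ["erbB2", "BRCA1", "insulin"]:
        print(candidate, looks_like_gene_symbol(candidate))
    for candidate in ["ERBBs", "MDH-s", "GASP10ps"]:
        print(candidate, strip_plural_s(candidate))
```

Checking only the character immediately before the ‘s’ keeps the rule cheap and leaves ‘MDH-s’ (preceded by a hyphen) and ‘GASP10ps’ (preceded by a lower-case letter) untouched.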
- class kazu.utils.string_normalizer.GildaUtils[source]¶
Bases: object
Functions derived from gilda used by (some of) the StringNormalizers.
Original Credit:
Paper: Benjamin M Gyori, Charles Tapley Hoyt, and Albert Steppi. 2022. Gilda: biomedical entity text normalization with machine-learned disambiguation as a service. Bioinformatics Advances. Vbac034.
Bibtex Citation Details
@article{gyori2022gilda,
    author = {Gyori, Benjamin M and Hoyt, Charles Tapley and Steppi, Albert},
    title = "{{Gilda: biomedical entity text normalization with machine-learned disambiguation as a service}}",
    journal = {Bioinformatics Advances},
    year = {2022},
    month = {05},
    issn = {2635-0041},
    doi = {10.1093/bioadv/vbac034},
    url = {https://doi.org/10.1093/bioadv/vbac034},
    note = {vbac034}
}
Licensed under BSD 2-Clause.
Copyright (c) 2019, Benjamin M. Gyori, Harvard Medical School All rights reserved.
Full License
BSD 2-Clause License
Copyright (c) 2019, Benjamin M. Gyori, Harvard Medical School All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- static depluralize(word)[source]¶
Return the depluralized version of the word, along with a status flag.
- Parameters:
word (str) – The word which is to be depluralized.
- Returns:
A tuple containing:
The original word, if it is detected to be non-plural, or the depluralized version of the word.
A status flag representing the detected pluralization status of the word, with non_plural (e.g. BRAF), plural_oes (e.g. mosquitoes), plural_ies (e.g. antibodies), plural_es (e.g. switches), plural_cap_s (e.g. MAPKs), and plural_s (e.g. receptors).
- Return type:
- DASHES_OR_SPACE_PATTERN = re.compile('[ ‑—–‒−―\\-‐]+')¶
- PLURAL_CAPS_S_PATTERN: Pattern = regex.Regex('^\\p{Lu}+$', flags=regex.V0)¶
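The status flags returned by depluralize can be illustrated with a simplified sketch. The suffix rules below are assumptions chosen so that the documented examples produce the documented flags; they are not taken from the gilda or kazu source, and the function name `depluralize_sketch` is hypothetical. The all-capitals check uses `str.isupper` as a rough stand-in for the `^\p{Lu}+$` pattern shown above.

```python
def depluralize_sketch(word: str) -> tuple[str, str]:
    """Return (possibly depluralized word, status flag). Simplified rules."""
    if len(word) < 2:
        return word, "non_plural"
    if word.endswith("oes"):
        return word[:-2], "plural_oes"        # mosquitoes -> mosquito
    if word.endswith("ies"):
        return word[:-3] + "y", "plural_ies"  # antibodies -> antibody
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2], "plural_es"         # switches -> switch
    if word.endswith("s"):
        stem = word[:-1]
        # Approximates PLURAL_CAPS_S_PATTERN: every character is upper case.
        if all(ch.isupper() for ch in stem):
            return stem, "plural_cap_s"       # MAPKs -> MAPK
        # Assumed guard against words such as "class" or "virus".
        if not word.endswith(("ss", "us", "is")):
            return stem, "plural_s"           # receptors -> receptor
    return word, "non_plural"


if __name__ == "__main__":
    for example in ["BRAF", "mosquitoes", "antibodies",
                    "switches", "MAPKs", "receptors"]:
        print(example, depluralize_sketch(example))
```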
- class kazu.utils.string_normalizer.StringNormalizer[source]¶
Bases: object
Call custom entity class normalizers, or a default normalizer if none is available.
- normalizers: dict[str | None, type[EntityClassNormalizer]] = {'anatomy': <class 'kazu.utils.string_normalizer.AnatomyStringNormalizer'>, 'company': <class 'kazu.utils.string_normalizer.CompanyStringNormalizer'>, 'disease': <class 'kazu.utils.string_normalizer.DiseaseStringNormalizer'>, 'gene': <class 'kazu.utils.string_normalizer.GeneStringNormalizer'>}¶
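The normalizers mapping suggests a lookup by entity class with a fallback to the default normalizer. A minimal sketch of that dispatch pattern, with hypothetical stand-in classes and a hypothetical `pick_normalizer` function (the real StringNormalizer methods are not listed on this page, so none are assumed here):

```python
from __future__ import annotations

from typing import Optional


class DefaultNorm:
    """Stand-in for a default normalizer."""


class GeneNorm:
    """Stand-in for a gene-specific normalizer."""


class DiseaseNorm:
    """Stand-in for a disease-specific normalizer."""


# Mirrors the shape of the normalizers attribute: entity class -> class.
NORMALIZERS: dict[Optional[str], type] = {
    "gene": GeneNorm,
    "disease": DiseaseNorm,
}


def pick_normalizer(entity_class: Optional[str]) -> type:
    """Return the custom normalizer if registered, else the default."""
    return NORMALIZERS.get(entity_class, DefaultNorm)


if __name__ == "__main__":
    print(pick_normalizer("gene").__name__)
    print(pick_normalizer("drug").__name__)
    print(pick_normalizer(None).__name__)
```

Keeping the registry as a plain dict means supporting a new entity class only requires adding one entry, and unknown or missing entity classes fall through to the default.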