kazu.utils.abbreviation_detector

Original Credit:

https://github.com/allenai/scispacy/blob/main/scispacy/abbreviation.py

Licensed under Apache 2.0

Full License Notice

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Paper:

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019.
ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing.
In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 319–327, Florence, Italy.
Association for Computational Linguistics.

Bibtex Citation Details

@inproceedings{neumann-etal-2019-scispacy,
    title = "{S}cispa{C}y: {F}ast and {R}obust {M}odels for {B}iomedical {N}atural {L}anguage {P}rocessing",
    author = "Neumann, Mark  and
    King, Daniel  and
    Beltagy, Iz  and
    Ammar, Waleed",
    booktitle = "Proceedings of the 18th BioNLP Workshop and Shared Task",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-5034",
    doi = "10.18653/v1/W19-5034",
    pages = "319--327",
    eprint = {arXiv:1902.07669},
    abstract = "Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.",
}

Functions

`filter_matches`(section, matcher_output, doc)	From https://github.com/allenai/scispacy/blob/main/scispacy/abbreviation.py.
`find_abbreviation`(long_form_candidate, ...)	From https://github.com/allenai/scispacy/blob/main/scispacy/abbreviation.py.
`short_form_filter`(span)	From https://github.com/allenai/scispacy/blob/main/scispacy/abbreviation.py.

Classes

KazuAbbreviationDetector

Modified version of https://github.com/allenai/scispacy/blob/main/scispacy/abbreviation.py.

class kazu.utils.abbreviation_detector.KazuAbbreviationDetector

Bases: object

Modified version of https://github.com/allenai/scispacy/blob/main/scispacy/abbreviation.py.

See the top of this module for the original implementation credit.

Detects abbreviations using the algorithm in “A simple algorithm for identifying abbreviation definitions in biomedical text.”, (Schwartz & Hearst, 2003).

If an abbreviation is detected, a new instance of Entity is generated, copying information from the originating long span. If the original long span was not an entity, the abbreviation entity is removed. To prevent this removal for specific abbreviations, provide them as a list of strings to exclude_abbrvs. This can be wise for abbreviations that are so common that they are often used without being defined (e.g. ‘NSCLC’). Note, however, that abbreviation detection always takes precedence: if a long form entity is detected, it will always be chosen.
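The precedence described above can be sketched as a small decision function. This is an illustrative sketch only, not the class's actual code: the function name resolve_abbreviation, the long_form_is_entity flag, and the returned labels are hypothetical, and exact string matching against exclude_abbrvs is an assumption.

```python
from typing import Iterable, Optional


def resolve_abbreviation(
    short_form: str,
    long_form_is_entity: bool,
    exclude_abbrvs: Optional[Iterable[str]] = None,
) -> str:
    """Return a label for what happens to a detected abbreviation.

    Hypothetical helper: the labels are illustrative and are not part
    of the kazu API.
    """
    excluded = set(exclude_abbrvs or ())

    # A long form that is an entity always takes precedence: the
    # abbreviation entity copies its information from the long span.
    if long_form_is_entity:
        return "copy_from_long_form"

    # No long form entity, but the abbreviation is on the exclusion
    # list (assumed here to be an exact string match): do not remove it.
    if short_form in excluded:
        return "keep"

    # Otherwise the abbreviation entity is removed.
    return "remove"
```

Under these assumptions, listing ‘NSCLC’ in exclude_abbrvs only matters when its long form was not recognised as an entity.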

__call__(document)

Call self as a function.

Parameters: document (Document)
Return type: None

__init__(namespace, exclude_abbrvs=None)

Parameters:

namespace (str) – the namespace to give any generated entities
exclude_abbrvs (Iterable[str] | None) – detected abbreviations matching this list will not be removed, even if no source entities are found

Return type:

None

load_matcher()

Return type: None

kazu.utils.abbreviation_detector.filter_matches(section, matcher_output, doc)

From https://github.com/allenai/scispacy/blob/main/scispacy/abbreviation.py.

Parameters:

section (Section)
matcher_output (list[tuple[int, int, int]])
doc (Doc)

Returns:

Return type:

list[tuple[Section, Span, Span]]

kazu.utils.abbreviation_detector.find_abbreviation(long_form_candidate, short_form_candidate)

From https://github.com/allenai/scispacy/blob/main/scispacy/abbreviation.py.

Implements the abbreviation detection algorithm in “A simple algorithm for identifying abbreviation definitions in biomedical text.”, (Schwartz & Hearst, 2003). The algorithm works by enumerating the characters in the short form of the abbreviation and checking that they can be matched, in order, against characters in a candidate text for the long form. It also requires that the first letter of the abbreviated form matches the beginning letter of a word.

Parameters:

long_form_candidate (Span) – The spaCy span for the long form candidate of the definition
short_form_candidate (Span) – The spaCy span for the abbreviation candidate

Returns:

A tuple of the short form abbreviation and the span corresponding to the long form expansion; the long form is None if a match is not found

Return type:

tuple[Span, Span | None]
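The character matching described for find_abbreviation can be sketched on plain strings. This is a minimal sketch in the style of the Schwartz & Hearst procedure, not the function documented here: the name find_long_form is hypothetical, and it takes and returns str values, whereas the documented function works on spaCy Span objects.

```python
from typing import Optional


def find_long_form(long_form_candidate: str, short_form: str) -> Optional[str]:
    """Return the part of long_form_candidate that expands short_form,
    or None if the characters cannot be matched.

    Hypothetical string-based helper for illustration only.
    """
    long_index = len(long_form_candidate) - 1
    short_index = len(short_form) - 1

    # Walk both strings from right to left.
    while short_index >= 0:
        current_char = short_form[short_index].lower()

        # Non-alphanumeric characters in the short form are skipped.
        if not current_char.isalnum():
            short_index -= 1
            continue

        # Move left through the candidate until the character matches.
        # For the first character of the short form, the match must
        # also sit at the beginning of a word.
        while (
            long_index >= 0
            and long_form_candidate[long_index].lower() != current_char
        ) or (
            short_index == 0
            and long_index > 0
            and long_form_candidate[long_index - 1].isalnum()
        ):
            long_index -= 1

        # Ran off the start of the candidate: no valid expansion.
        if long_index < 0:
            return None

        long_index -= 1
        short_index -= 1

    # long_index now points one position before the expansion.
    return long_form_candidate[long_index + 1:]


print(find_long_form("patients with non small cell lung cancer", "NSCLC"))
print(find_long_form("lung cancer", "XYZ"))
```

The first call prints the trailing phrase "non small cell lung cancer", because the leading words are not needed to match the short form. The second prints None.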

kazu.utils.abbreviation_detector.short_form_filter(span)

From https://github.com/allenai/scispacy/blob/main/scispacy/abbreviation.py.

Parameters: span (Span)
Returns:
Return type: bool