kazu.steps.linking.post_processing.disambiguation.context_scoring

Functions

create_word_and_char_ngrams(s[, words, chars])

Function to create char and word ngrams.

Classes

`GildaTfIdfScorer`	This class uses a single TFIDF model for 'Gilda-inspired' method of disambiguation.
`TfIdfScorer`	This class manages a set of TFIDF models (via `sklearn.feature_extraction.text.TfidfVectorizer`).

class kazu.steps.linking.post_processing.disambiguation.context_scoring.GildaTfIdfScorer

Bases: object

This class uses a single TFIDF model for a ‘Gilda-inspired’ method of disambiguation. It uses a pretrained TF-IDF model and contextual text mapped to knowledgebase identifiers (such as Wikipedia descriptions of the entity). The sparse matrices of these contexts are then compared with a target matrix by cosine similarity to determine the most likely identifier.

Context matrices are kept in a disk cache until needed, with only a sample held in memory. The size of this in-memory cache can be controlled with the KAZU_GILDA_TFIDF_DISAMBIGUATION_IN_MEMORY_CACHE_SIZE environment variable.
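As a minimal sketch, the cache size could be set from Python before the scorer is first created. The value 500 is an arbitrary illustration, and the assumption that the variable is read when the scorer is instantiated (rather than on every call) is not confirmed by this page.

```python
import os

# Assumption: the variable needs to be set before GildaTfIdfScorer is first
# instantiated. The value 500 is arbitrary and chosen only for illustration.
os.environ["KAZU_GILDA_TFIDF_DISAMBIGUATION_IN_MEMORY_CACHE_SIZE"] = "500"

# Environment variables are always strings, so a consumer has to parse them.
cache_size = int(os.environ["KAZU_GILDA_TFIDF_DISAMBIGUATION_IN_MEMORY_CACHE_SIZE"])
```

The same effect can be had by exporting the variable in the shell that launches the process.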

Caution

If no context is available for an ID, that ID automatically scores 0.0. The downside is that any IDs without a context always appear at the bottom of the rankings.
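The ranking idea, including the zero score for IDs without a context, can be sketched in plain Python. This is not the class's implementation: sparse vectors are modelled here as dicts of term weights rather than sklearn matrices, and the names `cosine`, `rank_ids` and the `ID:*` identifiers are hypothetical.

```python
import math


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_ids(
    target: dict[str, float],
    contexts: dict[str, dict[str, float]],
    candidate_ids: list[str],
) -> list[tuple[str, float]]:
    """Score each candidate against the target, best score first."""
    scored = [
        # An ID with no stored context scores 0.0 without any comparison.
        (idx, cosine(target, contexts[idx]) if idx in contexts else 0.0)
        for idx in candidate_ids
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights standing in for TF-IDF vectors.
target = {"kinase": 0.8, "signalling": 0.6}
contexts = {
    "ID:1": {"kinase": 1.0},
    "ID:2": {"signalling": 0.6, "river": 0.8},
}
ranked = rank_ids(target, contexts, ["ID:3", "ID:2", "ID:1"])
```

Here `ID:3` has no context, so it ends up last regardless of how well its name might otherwise match.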

Original Credit:

https://github.com/indralab/gilda

Paper:

Benjamin M Gyori, Charles Tapley Hoyt, and Albert Steppi. 2022.
Gilda: biomedical entity text normalization with machine-learned disambiguation as a service.
Bioinformatics Advances, vbac034.

Bibtex Citation Details

@article{gyori2022gilda,
    author = {Gyori, Benjamin M and Hoyt, Charles Tapley and Steppi, Albert},
    title = "{{Gilda: biomedical entity text normalization with machine-learned disambiguation as a service}}",
    journal = {Bioinformatics Advances},
    year = {2022},
    month = {05},
    issn = {2635-0041},
    doi = {10.1093/bioadv/vbac034},
    url = {https://doi.org/10.1093/bioadv/vbac034},
    note = {vbac034}
}

It’s a singleton, so that the model can be accessed in multiple locations without the need to load it into memory multiple times.
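One common way to get this behaviour in Python is a metaclass that caches the first instance. The sketch below illustrates the pattern only; `SingletonMeta` and `ExpensiveModel` are hypothetical names, and this page does not say how the class itself implements it.

```python
class SingletonMeta(type):
    """Hypothetical metaclass that hands back one shared instance per class."""

    _instances: dict = {}

    def __call__(cls, *args, **kwargs):
        # Only construct the object the first time the class is called.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class ExpensiveModel(metaclass=SingletonMeta):
    """Stand-in for a class whose constructor loads a large model."""

    load_count = 0

    def __init__(self, model_path: str) -> None:
        type(self).load_count += 1  # counts how often loading really happens
        self.model_path = model_path


first = ExpensiveModel("model.pkl")
second = ExpensiveModel("model.pkl")
```

Both names refer to the same object and the constructor runs once. In this sketch, arguments passed on later calls are silently ignored, which is a known trade-off of the pattern.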

__call__(context_vec, id_sets, parser_name)

Given a context vector, yield the most likely identifiers and their scores from the given set of identifiers.

Parameters:

context_vec (ndarray)
id_sets (set[EquivalentIdSet])
parser_name (str)

Returns:

identifier strings and scores, starting with the string with the best score

Return type:

Iterable[tuple[str, float]]

__init__(contexts_path, model_path)

Parameters:

contexts_path (str) –

json file in the format:

{"<parser name>": {"<idx>": "<context string>"}}

model_path (str) – path to a pretrained sklearn.feature_extraction.text.TfidfVectorizer model
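A file in the contexts format above could be produced as follows. The parser name, identifiers and context strings are invented for illustration, and the file is written to a temporary directory.

```python
import json
import os
import tempfile

# Hypothetical content: one parser name mapping identifiers to context text.
contexts = {
    "EXAMPLE_PARSER": {
        "EX:0001": "a protein kinase involved in cell signalling",
        "EX:0002": "a river in northern Europe",
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    contexts_path = os.path.join(tmp_dir, "contexts.json")
    with open(contexts_path, "w", encoding="utf-8") as f:
        json.dump(contexts, f)
    # Read the file back to confirm it round-trips in the expected shape.
    with open(contexts_path, encoding="utf-8") as f:
        loaded = json.load(f)
```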

class kazu.steps.linking.post_processing.disambiguation.context_scoring.TfIdfScorer

Bases: object

This class manages a set of TFIDF models (via sklearn.feature_extraction.text.TfidfVectorizer).

It’s a singleton, so that the models can be accessed in multiple locations without the need to load them into memory multiple times.

__call__(strings, matrix, parser)

Transform a list of strings with a parser-specific vectorizer and score against a matrix.

Parameters:

strings (list[str])
matrix (ndarray)
parser (str)

Returns:

matching strings and their scores, starting with the best score

Return type:

Iterable[tuple[str, float]]

__init__()

Return type:

None

build_vectorizers()

Return type:

dict[str, TfidfVectorizer]

kazu.steps.linking.post_processing.disambiguation.context_scoring.create_word_and_char_ngrams(s, words=(1, 2), chars=(2, 3))

Function to create char and word ngrams.

Parameters:

s (str) – string to process
words (Iterable[int]) – sizes (n) of the word n-grams to create
chars (Iterable[int]) – sizes (n) of the character n-grams to create

Returns:

list of strings comprising the word and character n-grams

Return type:

list[str]
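To make the signature concrete, here is a rough re-implementation of the idea. It assumes that `words` and `chars` are n-gram sizes, that words are split on whitespace, and that word n-grams are listed before character n-grams. None of this is confirmed by this page, and `word_and_char_ngrams_sketch` is a hypothetical name, not the library function.

```python
from typing import Iterable


def word_and_char_ngrams_sketch(
    s: str, words: Iterable[int] = (1, 2), chars: Iterable[int] = (2, 3)
) -> list[str]:
    """Collect word n-grams, then character n-grams, for each requested size."""
    tokens = s.split()
    ngrams: list[str] = []
    for n in words:
        # Sliding window of n consecutive whitespace-separated tokens.
        ngrams.extend(" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
    for n in chars:
        # Sliding window of n consecutive characters, spaces included.
        ngrams.extend(s[i : i + n] for i in range(len(s) - n + 1))
    return ngrams


result = word_and_char_ngrams_sketch("ab cd")
```

With the default sizes, the input `"ab cd"` produces three word n-grams and seven character n-grams in this sketch. Mixing both kinds lets a TF-IDF model match on whole words as well as on partial spellings.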