kazu.steps.linking.post_processing.disambiguation.context_scoring

Functions

create_word_and_char_ngrams(s[, words, chars])

Function to create char and word ngrams.

Classes

GildaTfIdfScorer

This class uses a single TF-IDF model for a 'Gilda-inspired' method of disambiguation.

TfIdfScorer

This class manages a set of TF-IDF models (via sklearn.feature_extraction.text.TfidfVectorizer).

class kazu.steps.linking.post_processing.disambiguation.context_scoring.GildaTfIdfScorer

Bases: object

This class uses a single TF-IDF model for a ‘Gilda-inspired’ method of disambiguation. It uses a pretrained TF-IDF model and contextual text mapped to knowledge base identifiers (such as Wikipedia descriptions of the entity). The sparse matrices of these contexts are then compared with a target matrix by cosine similarity to determine the most likely identifier.

Context matrices are kept in a disk cache until needed, with only a sample held in memory. The size of this in-memory cache can be controlled with the KAZU_GILDA_TFIDF_DISAMBIGUATION_IN_MEMORY_CACHE_SIZE environment variable.
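A minimal sketch of a size-bounded in-memory layer over a slower disk store. The variable name is taken from the text above; the BoundedCache class, the fake_disk_load helper, the default size of 2, and the least-recently-used eviction policy are all assumptions made for illustration, since the text says only that a sample is held in memory.

```python
import os
from collections import OrderedDict
from typing import Callable


class BoundedCache:
    """Hypothetical in-memory cache that falls back to a disk loader on a miss."""

    def __init__(self, load_from_disk: Callable[[str], object], max_size: int) -> None:
        self._load = load_from_disk
        self._max_size = max_size
        self._items: OrderedDict[str, object] = OrderedDict()

    def get(self, key: str) -> object:
        if key in self._items:
            # Cache hit: mark the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        # Cache miss: read from the slower store and keep the result in memory.
        value = self._load(key)
        self._items[key] = value
        if len(self._items) > self._max_size:
            # Assumed policy: evict the least recently used entry.
            self._items.popitem(last=False)
        return value


disk_reads: list[str] = []


def fake_disk_load(key: str) -> str:
    """Stand-in for reading a context matrix from the disk cache."""
    disk_reads.append(key)
    return f"matrix for {key}"


# Read the size from the environment; the fallback of 2 is arbitrary.
size = int(os.environ.get("KAZU_GILDA_TFIDF_DISAMBIGUATION_IN_MEMORY_CACHE_SIZE", "2"))
cache = BoundedCache(fake_disk_load, max_size=size)
```

A smaller size lowers memory use at the cost of more disk reads for identifiers that are looked up repeatedly.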

Caution

If no context is available, the ID automatically scores 0.0. The downside of this is that any IDs without a context automatically appear at the bottom of any rankings.
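The scoring idea can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the class's implementation: the identifiers, the context texts, and the rank_ids helper are hypothetical, and the vectorizer is fitted on the toy contexts in place of a pretrained model.

```python
from collections.abc import Iterable

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical context texts keyed by knowledge base identifier.
contexts = {
    "ID:1": "tumour suppressor protein that regulates the cell cycle",
    "ID:2": "enzyme involved in glucose metabolism in the liver",
}

# Stand-in for the pretrained model: fitted on the toy contexts only.
vectorizer = TfidfVectorizer()
vectorizer.fit(list(contexts.values()))
context_matrices = {idx: vectorizer.transform([text]) for idx, text in contexts.items()}


def rank_ids(target_text: str, candidate_ids: Iterable[str]) -> list[tuple[str, float]]:
    """Rank candidate identifiers by cosine similarity to the target text."""
    target = vectorizer.transform([target_text])
    scored = []
    for idx in candidate_ids:
        context = context_matrices.get(idx)
        # An identifier with no stored context scores 0.0.
        score = 0.0 if context is None else float(cosine_similarity(target, context)[0, 0])
        scored.append((idx, score))
    # Best score first; identifiers without context sink to the bottom.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_ids("mutations disrupt cell cycle control", ["ID:1", "ID:2", "ID:3"])
```

Here "ID:3" has no context, so it receives 0.0 and cannot outrank any identifier whose context shares terms with the target.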

Original Credit:

https://github.com/indralab/gilda

Paper:

Benjamin M Gyori, Charles Tapley Hoyt, and Albert Steppi. 2022.
Gilda: biomedical entity text normalization with machine-learned disambiguation as a service.
Bioinformatics Advances. vbac034.
BibTeX Citation Details
@article{gyori2022gilda,
    author = {Gyori, Benjamin M and Hoyt, Charles Tapley and Steppi, Albert},
    title = "{{Gilda: biomedical entity text normalization with machine-learned disambiguation as a service}}",
    journal = {Bioinformatics Advances},
    year = {2022},
    month = {05},
    issn = {2635-0041},
    doi = {10.1093/bioadv/vbac034},
    url = {https://doi.org/10.1093/bioadv/vbac034},
    note = {vbac034}
}

It’s a singleton, so that the model can be accessed in multiple locations without the need to load it into memory multiple times.
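One common way to get this behaviour in Python is a metaclass that caches the first instance of each class. The sketch below is a generic illustration; the Singleton metaclass and the ExpensiveScorer class are hypothetical names and are not taken from the library.

```python
from typing import Any


class Singleton(type):
    """Metaclass that builds each class's instance once and then reuses it."""

    _instances: dict[type, Any] = {}

    def __call__(cls, *args: Any, **kwargs: Any) -> Any:
        if cls not in Singleton._instances:
            # First construction: run __init__ and remember the instance.
            Singleton._instances[cls] = super().__call__(*args, **kwargs)
        return Singleton._instances[cls]


class ExpensiveScorer(metaclass=Singleton):
    """Hypothetical stand-in for a scorer whose model is costly to load."""

    load_count = 0

    def __init__(self) -> None:
        # Runs only on the first construction.
        ExpensiveScorer.load_count += 1


first = ExpensiveScorer()
second = ExpensiveScorer()
```

Both names refer to the same object, so the costly load happens once no matter how many call sites construct the scorer.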

__call__(context_vec, id_sets, parser_name)

Given a context vector, yield the most likely identifiers and their score from the given set of identifiers.

Parameters:
Returns:

identifier strings and scores, starting with the string with the best score

Return type:

Iterable[tuple[str, float]]

__init__(contexts_path, model_path)

Parameters:

class kazu.steps.linking.post_processing.disambiguation.context_scoring.TfIdfScorer

Bases: object

This class manages a set of TF-IDF models (via sklearn.feature_extraction.text.TfidfVectorizer).

It’s a singleton, so that the models can be accessed in multiple locations without the need to load them into memory multiple times.

__call__(strings, matrix, parser)

Transform a list of strings with a parser-specific vectorizer and score against a matrix.

Parameters:
Returns:

matching strings and their scores, sorted with the best score first

Return type:

Iterable[tuple[str, float]]
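A rough sketch of this kind of call, assuming cosine similarity as the score. The vectorizers dictionary, the "toy_parser" name, the training strings, and the score_strings helper are hypothetical; only TfidfVectorizer and cosine_similarity are real scikit-learn APIs.

```python
from collections.abc import Iterator

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical set-up: one fitted vectorizer per parser name.
vectorizers = {
    "toy_parser": TfidfVectorizer().fit(
        ["breast cancer", "lung cancer", "heart failure", "kidney failure"]
    )
}


def score_strings(strings: list[str], matrix, parser: str) -> Iterator[tuple[str, float]]:
    """Yield each string with its similarity to `matrix`, best score first."""
    # Transform the candidate strings with the vectorizer for this parser.
    string_matrix = vectorizers[parser].transform(strings)
    # One score per string, since `matrix` holds a single row in this sketch.
    scores = cosine_similarity(string_matrix, matrix).ravel()
    for i in scores.argsort()[::-1]:
        yield strings[i], float(scores[i])


doc_matrix = vectorizers["toy_parser"].transform(["patient presented with lung cancer"])
results = list(score_strings(["heart failure", "lung cancer"], doc_matrix, "toy_parser"))
```

In this toy case "lung cancer" ranks first because it shares both of its terms with the document, while "heart failure" shares none.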

__init__()

build_vectorizers()

Return type:

dict[str, TfidfVectorizer]

kazu.steps.linking.post_processing.disambiguation.context_scoring.create_word_and_char_ngrams(s, words=(1, 2), chars=(2, 3))

Function to create char and word ngrams.

Parameters:
Returns:

list of strings made up of the word and character n-grams

Return type:

list[str]
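To make the output shape concrete, here is a small sketch of combined word and character n-gram generation. It assumes that words and chars are inclusive (min, max) n-gram sizes and that words are split on whitespace; the function name word_and_char_ngrams_sketch is hypothetical, and the library function may differ in ordering or tokenisation.

```python
def word_and_char_ngrams_sketch(
    s: str, words: tuple[int, int] = (1, 2), chars: tuple[int, int] = (2, 3)
) -> list[str]:
    """Assumed behaviour: `words` and `chars` are inclusive (min, max) sizes."""
    tokens = s.split()
    ngrams: list[str] = []
    # Word n-grams: sliding windows over the whitespace-separated tokens.
    for n in range(words[0], words[1] + 1):
        ngrams.extend(" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
    # Character n-grams: sliding windows over the raw string, spaces included.
    for n in range(chars[0], chars[1] + 1):
        ngrams.extend(s[i : i + n] for i in range(len(s) - n + 1))
    return ngrams


example = word_and_char_ngrams_sketch("ab cd", words=(1, 2), chars=(2, 2))
# example == ["ab", "cd", "ab cd", "ab", "b ", " c", "cd"]
```

Mixing both kinds of n-gram lets a TF-IDF model match on whole words as well as on sub-word fragments, which helps with spelling variants.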