kazu.steps.linking.post_processing.disambiguation.context_scoring
Functions
Function to create char and word ngrams.
Classes
GildaTfIdfScorer: This class uses a single TFIDF model for a 'Gilda-inspired' method of disambiguation.
TfIdfScorer: This class manages a set of TFIDF models (via sklearn.feature_extraction.text.TfidfVectorizer).
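To illustrate what creating char and word ngrams means in practice, here is a minimal sketch. The function name `word_and_char_ngrams`, its parameters, its default sizes, and the order of its output are all assumptions made for illustration; they are not taken from the module and may differ from the function it exposes.

```python
def word_and_char_ngrams(text, word_sizes=(1, 2), char_sizes=(2, 3)):
    """Return word n-grams followed by character n-grams of ``text``.

    Hypothetical helper: the name, defaults and output order are assumptions.
    """
    words = text.split()
    ngrams = []
    # Word n-grams: sliding windows over whitespace-separated tokens.
    for n in word_sizes:
        for i in range(len(words) - n + 1):
            ngrams.append(" ".join(words[i:i + n]))
    # Character n-grams: sliding windows over the raw string, spaces included.
    for n in char_sizes:
        for i in range(len(text) - n + 1):
            ngrams.append(text[i:i + n])
    return ngrams


example = word_and_char_ngrams("ab cd", word_sizes=(1, 2), char_sizes=(2,))
```

Mixing word and character n-grams is a common way to make TF-IDF features tolerant of small spelling differences while still capturing whole-word matches.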
- class kazu.steps.linking.post_processing.disambiguation.context_scoring.GildaTfIdfScorer
Bases:
object
This class uses a single TFIDF model for a ‘Gilda-inspired’ method of disambiguation. It uses a pretrained TF-IDF model and contextual text mapped to knowledgebase identifiers (such as Wikipedia descriptions of the entity). The sparse matrices of these contexts are then compared by cosine similarity with a target matrix to determine the most likely identifier.
Context matrices are kept in a disk cache until needed, with only a sample held in memory. The size of this in-memory cache can be controlled with the
KAZU_GILDA_TFIDF_DISAMBIGUATION_IN_MEMORY_CACHE_SIZE
environment variable.
Caution
If no context is available, the ID automatically scores 0.0. The downside of this is that any IDs without a context automatically appear at the bottom of any rankings.
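The scoring idea described above can be sketched in plain Python. This is not the class's implementation: plain dictionaries stand in for the sparse matrices produced by a pretrained TfidfVectorizer, and the helper names (`tfidf_vector`, `cosine`, `rank_ids`), the IDF weights, and the identifiers are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Build a sparse TF-IDF vector (token -> weight) from a token list."""
    counts = Counter(tokens)
    return {tok: count * idf.get(tok, 0.0) for tok, count in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_ids(target_vec, candidate_ids, context_vecs):
    """Score each candidate ID against the target, best first.

    IDs with no stored context score 0.0, mirroring the caution above.
    """
    scored = []
    for idx in candidate_ids:
        context = context_vecs.get(idx)
        score = cosine(target_vec, context) if context is not None else 0.0
        scored.append((idx, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up IDF weights and contexts, for illustration only.
idf = {"kinase": 1.5, "protein": 1.0, "river": 2.0, "water": 1.8}
context_vecs = {
    "ID:1": tfidf_vector("protein kinase protein".split(), idf),
    "ID:2": tfidf_vector("river water".split(), idf),
}
target = tfidf_vector("kinase protein".split(), idf)
ranking = rank_ids(target, ["ID:1", "ID:2", "ID:3"], context_vecs)
```

In this sketch `ID:3` has no context and `ID:2` has an unrelated one, so both score 0.0. A missing context is therefore indistinguishable from a context with no overlap, which is the ranking downside the caution describes.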
Original Credit:
https://github.com/indralab/gilda
Paper:
Benjamin M Gyori, Charles Tapley Hoyt, and Albert Steppi. 2022. Gilda: biomedical entity text normalization with machine-learned disambiguation as a service. Bioinformatics Advances. vbac034.
Bibtex Citation Details
@article{gyori2022gilda,
  author = {Gyori, Benjamin M and Hoyt, Charles Tapley and Steppi, Albert},
  title = "{{Gilda: biomedical entity text normalization with machine-learned disambiguation as a service}}",
  journal = {Bioinformatics Advances},
  year = {2022},
  month = {05},
  issn = {2635-0041},
  doi = {10.1093/bioadv/vbac034},
  url = {https://doi.org/10.1093/bioadv/vbac034},
  note = {vbac034}
}
It’s a singleton, so that the model can be accessed in multiple locations without the need to load it into memory multiple times.
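As a rough illustration of why a singleton avoids repeated loading, the sketch below shows one common way to write one in Python. It is not necessarily how this class does it: the class name `SingletonScorer`, the `load_count` counter, and the `"model.pkl"` path are hypothetical.

```python
class SingletonScorer:
    """Hypothetical stand-in for a scorer that is expensive to construct."""

    _instance = None
    load_count = 0  # counts how often the (pretend) model is loaded

    def __new__(cls, model_path):
        # Build the instance once; later calls return the same object.
        if cls._instance is None:
            instance = super().__new__(cls)
            instance.model_path = model_path
            cls.load_count += 1  # the expensive model load would happen here
            cls._instance = instance
        return cls._instance


first = SingletonScorer("model.pkl")
second = SingletonScorer("model.pkl")
```

Both names refer to the same object, so code in different places can construct the scorer freely while the model is loaded only once.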
- __call__(context_vec, id_sets, parser_name)
Given a context vector, yield the most likely identifiers and their score from the given set of identifiers.
- __init__(contexts_path, model_path)
- Parameters:
contexts_path (str) – JSON file in the format:
{"<parser name>": {"<idx>": "<context string>"}}
model_path (str) – path to a pretrained
sklearn.feature_extraction.text.TfidfVectorizer
model
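For illustration, a contexts file in the format above might be read as follows. The parser names, identifiers, and context strings are invented, and `context_for` is a hypothetical helper rather than part of the class.

```python
import json

# Example payload in the documented shape:
# {"<parser name>": {"<idx>": "<context string>"}}
contexts_json = """
{
  "parser_a": {"ID:1": "a protein kinase involved in signalling"},
  "parser_b": {"ID:9": "a river in northern Europe"}
}
"""

contexts = json.loads(contexts_json)


def context_for(parser_name, idx):
    """Return the context string for an ID, or None when none is stored."""
    return contexts.get(parser_name, {}).get(idx)
```

Keying by parser name first keeps identifiers from different knowledgebases apart even when their ID strings collide.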
- class kazu.steps.linking.post_processing.disambiguation.context_scoring.TfIdfScorer
Bases:
object
This class manages a set of TFIDF models (via
sklearn.feature_extraction.text.TfidfVectorizer
). It’s a singleton, so that the models can be accessed in multiple locations without the need to load them into memory multiple times.
- __call__(strings, matrix, parser)
Transform a list of strings with a parser-specific vectorizer and score against a matrix.