kazu.linking.sapbert.train¶
Functions
Classes
- Candidate – A knowledgebase entry.
- GoldStandardExample – GoldStandardExample(gold_default_label, gold_iri, candidates)
- HFSapbertPairwiseDataset – Dataset used for training SapBert.
- PLSapbertModel
- SapbertDataCollatorWithPadding – Data collator to be used with HFSapbertPairwiseDataset.
- SapbertEvaluationDataManager – Manages the loading/parsing of multiple evaluation datasets.
- SapbertEvaluationDataset – To evaluate a given embedding model, we need a query datasource (i.e. things that need to be linked) and an ontology datasource (i.e. things we need to generate an embedding space for, which can be queried against); each should have three columns.
- class kazu.linking.sapbert.train.Candidate[source]¶
Bases:
NamedTuple
A knowledgebase entry.
- static __new__(_cls, default_label, iri, correct)¶
Create new instance of Candidate(default_label, iri, correct)
- class kazu.linking.sapbert.train.GoldStandardExample[source]¶
Bases:
NamedTuple
GoldStandardExample(gold_default_label, gold_iri, candidates)
- static __new__(_cls, gold_default_label, gold_iri, candidates)¶
Create new instance of GoldStandardExample(gold_default_label, gold_iri, candidates)
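A minimal sketch of constructing these two NamedTuples for evaluation. The labels and IRIs are invented placeholders, and it is assumed here that correct is a boolean flag indicating whether the candidate matches the gold IRI:

```python
# Illustrative only: the IRIs and surface forms below are placeholders.
from kazu.linking.sapbert.train import Candidate, GoldStandardExample

candidates = [
    Candidate(
        default_label="myocardial infarction",
        iri="http://example.org/kb/1",
        correct=True,  # assumed: True when this candidate matches the gold IRI
    ),
    Candidate(
        default_label="heart failure",
        iri="http://example.org/kb/2",
        correct=False,
    ),
]
example = GoldStandardExample(
    gold_default_label="myocardial infarction",
    gold_iri="http://example.org/kb/1",
    candidates=candidates,
)
```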
- class kazu.linking.sapbert.train.HFSapbertPairwiseDataset[source]¶
Bases:
Dataset
Dataset used for training SapBert.
- __init__(encodings_1, encodings_2, labels)[source]¶
- Parameters:
encodings_1 (BatchEncoding) – encodings for example 1
encodings_2 (BatchEncoding) – encodings for example 2
labels (ndarray) – labels, i.e. the knowledgebase identifier for both encodings, as an int
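A hedged sketch of constructing the dataset from two lists of synonyms whose elements pair up by knowledgebase identifier; the tokenizer checkpoint and synonym pairs are illustrative assumptions, not part of kazu:

```python
# Sketch: encodings_1[i] and encodings_2[i] are two surface forms of the
# concept identified by labels[i]. Checkpoint and data are illustrative.
import numpy as np
from transformers import AutoTokenizer
from kazu.linking.sapbert.train import HFSapbertPairwiseDataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encodings_1 = tokenizer(["heart attack", "high blood pressure"], truncation=True)
encodings_2 = tokenizer(["myocardial infarction", "hypertension"], truncation=True)
labels = np.array([0, 1])  # one int knowledgebase identifier per pair

dataset = HFSapbertPairwiseDataset(encodings_1, encodings_2, labels)
```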
- class kazu.linking.sapbert.train.PLSapbertModel[source]¶
Bases:
LightningModule
- __init__(model_name_or_path, sapbert_training_params=None, sapbert_evaluation_manager=None, *args, **kwargs)[source]¶
- Parameters:
model_name_or_path (str) – passed to AutoModel.from_pretrained
sapbert_training_params (SapbertTrainingParams | None) – optional SapbertTrainingParams, only needed if training a model
sapbert_evaluation_manager (SapbertEvaluationDataManager | None) – optional SapbertEvaluationDataManager, only needed if training a model
args (Any) – passed to LightningModule
kwargs (Any) – passed to LightningModule
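For inference only, the training-specific arguments can be omitted. A minimal sketch, assuming the publicly released SapBERT checkpoint; substitute your own model path as appropriate:

```python
from kazu.linking.sapbert.train import PLSapbertModel

# Assumption: the public SapBERT checkpoint from Cambridge LTL on the
# Hugging Face hub; any AutoModel-compatible path works here.
model = PLSapbertModel(
    model_name_or_path="cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
)
```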
- configure_optimizers()[source]¶
Implementation of LightningModule.configure_optimizers.
- evaluate_topk_acc(queries)[source]¶
Get a dictionary of accuracy results at different levels of k (nearest neighbours).
- Parameters:
queries (list[GoldStandardExample])
- forward(batch)[source]¶
For inference.
- Parameters:
batch (BatchEncoding) – standard BERT input, with an additional ‘indices’ key representing the location of the embedding
- get_candidate_dict(np_candidates, golden_iri)[source]¶
Convert rows in a dataframe representing candidate KB entries into a corresponding Candidate per row.
- predict_step(batch, batch_idx, dataloader_idx=None)[source]¶
Implementation of LightningModule.predict_step.
- train_dataloader()[source]¶
Implementation of LightningModule.train_dataloader.
- Return type:
DataLoader | Sequence[DataLoader] | Sequence[Sequence[DataLoader]] | Sequence[Dict[str, DataLoader]] | Dict[str, DataLoader] | Dict[str, Dict[str, DataLoader]] | Dict[str, Sequence[DataLoader]]
- training_step(batch, batch_idx, *args, **kwargs)[source]¶
Implementation of LightningModule.training_step.
- val_dataloader()[source]¶
Implementation of LightningModule.val_dataloader.
- validation_epoch_end(outputs)[source]¶
Lightning override: generates new embeddings for each SapbertEvaluationDataset.ontology_source and queries them with SapbertEvaluationDataset.query_source.
- class kazu.linking.sapbert.train.SapbertDataCollatorWithPadding[source]¶
Bases:
object
Data collator to be used with HFSapbertPairwiseDataset.
- __call__(features)[source]¶
Call self as a function.
- Parameters:
features (list[dict[str, BatchEncoding]])
- __init__(tokenizer, padding=True, max_length=None, pad_to_multiple_of=None)[source]¶
- Parameters:
tokenizer (PreTrainedTokenizerBase)
padding (bool | str | PaddingStrategy)
max_length (int | None)
pad_to_multiple_of (int | None)
- Return type:
None
- padding: bool | str | PaddingStrategy = True¶
- tokenizer: PreTrainedTokenizerBase¶
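A hedged sketch of wiring the collator and HFSapbertPairwiseDataset into a torch DataLoader; the tokenizer checkpoint, synonym pair, and batch size are illustrative assumptions:

```python
import numpy as np
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from kazu.linking.sapbert.train import (
    HFSapbertPairwiseDataset,
    SapbertDataCollatorWithPadding,
)

# Illustrative setup, mirroring the dataset sketch above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encodings_1 = tokenizer(["heart attack"], truncation=True)
encodings_2 = tokenizer(["myocardial infarction"], truncation=True)
dataset = HFSapbertPairwiseDataset(encodings_1, encodings_2, np.array([0]))

# The collator pads each batch dynamically at load time.
collator = SapbertDataCollatorWithPadding(tokenizer=tokenizer, padding=True)
loader = DataLoader(dataset, batch_size=1, collate_fn=collator)
```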
- class kazu.linking.sapbert.train.SapbertEvaluationDataManager[source]¶
Bases:
object
Manages the loading/parsing of multiple evaluation datasets. Each dataset should have two sources: a query source and an ontology source. These are then converted into data loaders, while maintaining a reference to the embedding metadata that should be used for evaluation.
After construction, self.dataset is a dict[dataset_name, SapbertEvaluationDataset].
- class kazu.linking.sapbert.train.SapbertEvaluationDataset[source]¶
Bases:
NamedTuple
To evaluate a given embedding model, we need a query datasource (i.e. things that need to be linked) and an ontology datasource (i.e. things we need to generate an embedding space for, which can be queried against). Each should have three columns:
default_label (text), iri (ontology id) and source (ontology name)
- static __new__(_cls, query_source, ontology_source)¶
Create new instance of SapbertEvaluationDataset(query_source, ontology_source)
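A hedged illustration of the three-column format described above, using pandas; the rows and the source name are invented placeholders:

```python
import pandas as pd
from kazu.linking.sapbert.train import SapbertEvaluationDataset

# Ontology entries to build the embedding space from.
ontology_source = pd.DataFrame(
    {
        "default_label": ["myocardial infarction", "hypertension"],
        "iri": ["http://example.org/kb/1", "http://example.org/kb/2"],
        "source": ["EXAMPLE_KB", "EXAMPLE_KB"],
    }
)
# Mentions to be linked against that space.
query_source = pd.DataFrame(
    {
        "default_label": ["heart attack"],
        "iri": ["http://example.org/kb/1"],
        "source": ["EXAMPLE_KB"],
    }
)
eval_dataset = SapbertEvaluationDataset(
    query_source=query_source, ontology_source=ontology_source
)
```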