kazu.utils.utils
Functions
Return list of char ngrams as a string.
Return list of word ngrams as a single space-separated string.
Convert documents into a BatchEncoding.
Return a map of documents, indexed by order of sections.
For a given entity and a list of docs, find the doc the entity belongs to.
Check if a string is a valid word by checking that its start and end characters are in predefined start/end sets of word boundaries.
Classes
A condition that returns True if a document has any entities whose class is one of the required_entity_classes.
- class kazu.utils.utils.EntityClassFilter
Bases: object
A condition that returns True if a document has any entities whose class is one of the required_entity_classes.
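The condition described above can be sketched as a small callable class. This is a minimal illustration, not the library's implementation: `EntityStub`, `DocumentStub`, and `EntityClassFilterSketch` are hypothetical stand-ins, and the constructor and call signatures are assumptions, since this page does not show them.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class EntityStub:
    """Hypothetical stand-in for an entity; only its class label matters here."""

    entity_class: str


@dataclass
class DocumentStub:
    """Hypothetical stand-in for a document holding a flat list of entities."""

    entities: list[EntityStub] = field(default_factory=list)


class EntityClassFilterSketch:
    """Callable condition: True if any entity's class is in the required set."""

    def __init__(self, required_entity_classes: Iterable[str]):
        # Assumed signature: an iterable of entity class names.
        self.required_entity_classes = set(required_entity_classes)

    def __call__(self, document: DocumentStub) -> bool:
        return any(
            ent.entity_class in self.required_entity_classes
            for ent in document.entities
        )


has_gene_or_drug = EntityClassFilterSketch(["gene", "drug"])
doc_with_gene = DocumentStub([EntityStub("gene"), EntityStub("disease")])
doc_without = DocumentStub([EntityStub("disease")])
```

Calling `has_gene_or_drug(doc_with_gene)` evaluates the condition for one document, so a filter object of this shape could be passed wherever a document predicate is expected.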
- class kazu.utils.utils.Singleton
Bases: type
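This page records only that Singleton derives from type, which makes it a metaclass. Assuming it follows the common singleton-metaclass pattern (an assumption, not something stated here), the idea can be sketched as follows; `SingletonSketch` and `ExpensiveResource` are hypothetical names.

```python
class SingletonSketch(type):
    """Metaclass that caches one instance per class (assumed behaviour)."""

    _instances: dict = {}

    def __call__(cls, *args, **kwargs):
        # Build the instance on first use, then keep returning the cached one.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class ExpensiveResource(metaclass=SingletonSketch):
    """Hypothetical class whose construction should happen only once."""

    def __init__(self, name: str = "default"):
        self.name = name


first = ExpensiveResource("a")
second = ExpensiveResource("b")  # arguments are ignored; the cached instance is returned
```

In this sketch, later constructor arguments are silently dropped, which is a known trade-off of the pattern.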
- kazu.utils.utils.create_word_ngrams(s, n=2)
Return list of word ngrams as a single space-separated string.
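As a rough illustration of word ngrams, the sketch below splits a string on whitespace and joins each run of n consecutive words with a single space. `word_ngrams_sketch` is a hypothetical helper written for this example; the library function may differ in details such as how it splits or handles short inputs.

```python
def word_ngrams_sketch(s: str, n: int = 2) -> list[str]:
    """Return each run of n consecutive words as one space-separated string."""
    words = s.split()
    # range() is empty when there are fewer than n words, giving an empty list.
    return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]


bigrams = word_ngrams_sketch("breast cancer type one")
trigrams = word_ngrams_sketch("breast cancer type one", n=3)
```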
- kazu.utils.utils.documents_to_document_section_batch_encodings_map(docs, tokenizer, stride=128, max_length=512)
Convert documents into a BatchEncoding. Also returns a mapping of int to Section (dict[int, Section]) for the resulting encoding.
- Parameters:
tokenizer (PreTrainedTokenizerBase)
stride (int)
max_length (int)
- Return type:
tuple[BatchEncoding, dict[int, Section]]
- kazu.utils.utils.documents_to_id_section_map(docs)
Return a map of documents, indexed by order of sections.
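One plausible reading of this description is a running integer index over every section of every document, in order. The sketch below shows that reading with hypothetical stand-in classes (`SectionStub`, `DocumentStub`) and a hypothetical function name; it is not the library's code.

```python
from dataclasses import dataclass, field


@dataclass
class SectionStub:
    """Hypothetical stand-in for a document section."""

    text: str


@dataclass
class DocumentStub:
    """Hypothetical stand-in for a document made of ordered sections."""

    sections: list[SectionStub] = field(default_factory=list)


def id_section_map_sketch(docs: list[DocumentStub]) -> dict[int, SectionStub]:
    """Number sections consecutively across all documents, in order."""
    id_section_map: dict[int, SectionStub] = {}
    for doc in docs:
        for section in doc.sections:
            # The next free index equals the number of sections seen so far.
            id_section_map[len(id_section_map)] = section
    return id_section_map


docs = [
    DocumentStub([SectionStub("title A"), SectionStub("body A")]),
    DocumentStub([SectionStub("title B")]),
]
section_map = id_section_map_sketch(docs)
```

A map of this shape lets a batch position be traced back to the section it came from.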
- kazu.utils.utils.find_document_from_entity(docs, entity)
For a given entity and a list of docs, find the doc the entity belongs to.
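The lookup can be pictured as a linear scan over the documents. In this sketch, `EntityStub`, `DocumentStub`, and `find_document_sketch` are hypothetical, and raising an error when no document owns the entity is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EntityStub:
    """Hypothetical stand-in for an entity."""

    match: str


@dataclass
class DocumentStub:
    """Hypothetical stand-in for a document with a flat list of entities."""

    idx: str
    entities: list[EntityStub] = field(default_factory=list)


def find_document_sketch(docs: list[DocumentStub], entity: EntityStub) -> DocumentStub:
    """Return the first document whose entities include the given entity."""
    for doc in docs:
        if entity in doc.entities:
            return doc
    # Assumed failure mode: the page does not say what happens on a miss.
    raise ValueError("entity does not belong to any of the given documents")


egfr = EntityStub("EGFR")
docs = [DocumentStub("doc-1", [EntityStub("TP53")]), DocumentStub("doc-2", [egfr])]
owner = find_document_sketch(docs, egfr)
```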
- kazu.utils.utils.linking_candidates_to_ontology_string_resources(candidates)
- Parameters:
candidates (Iterable[LinkingCandidate])
- Returns:
- Return type: