kazu.utils.utils

Functions

as_path(p)

create_char_ngrams(s[, n])

Return a list of character n-grams, each as a string.

create_word_ngrams(s[, n])

Return a list of word n-grams, each as a single space-separated string.

documents_to_document_section_batch_encodings_map(...)

Convert documents into a BatchEncoding.

documents_to_id_section_map(docs)

Return a map of the documents' sections, keyed by the order in which the sections occur.

find_document_from_entity(docs, entity)

For a given entity and a list of docs, find the doc the entity belongs to.

get_match_entity_class_hash(ent)

linking_candidates_to_ontology_string_resources(...)

word_is_valid(start_char, end_char, starts, ends)

Check whether a string is a valid word by checking that its start and end characters are in the predefined start and end sets of word boundaries.

Classes

EntityClassFilter

A condition that returns True if a document has any entity whose class is one of the required_entity_classes.

Singleton

class kazu.utils.utils.EntityClassFilter

Bases: object

A condition that returns True if a document has any entity whose class is one of the required_entity_classes.

__call__(document)

Return True if the document has any entity whose class is in required_entity_classes.

Parameters:

document (Document)

Return type:

bool

__init__(required_entity_classes)

Parameters:

required_entity_classes (Iterable[str]) – the entity classes to check for, given as strings
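
The behaviour described for EntityClassFilter can be sketched without kazu installed. This is a minimal illustration, not kazu's implementation: StubEntity, StubDocument, EntityClassFilterSketch and the attribute names entity_class and entities are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class StubEntity:
    # Hypothetical stand-in for kazu's Entity; only the class label is modelled.
    entity_class: str


@dataclass
class StubDocument:
    # Hypothetical stand-in for kazu's Document.
    entities: list[StubEntity] = field(default_factory=list)


class EntityClassFilterSketch:
    """Illustrative re-creation of the documented behaviour (an assumption)."""

    def __init__(self, required_entity_classes: Iterable[str]):
        # Store as a set so membership tests are cheap.
        self.required_entity_classes = set(required_entity_classes)

    def __call__(self, document: StubDocument) -> bool:
        # True as soon as one entity has a required class.
        return any(
            ent.entity_class in self.required_entity_classes
            for ent in document.entities
        )


only_drugs = EntityClassFilterSketch(["drug"])
doc_with_drug = StubDocument([StubEntity("gene"), StubEntity("drug")])
doc_without_drug = StubDocument([StubEntity("gene")])
```

Calling only_drugs(doc_with_drug) yields True and only_drugs(doc_without_drug) yields False, which is the kind of condition that can gate whether a later processing step runs on a document.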

class kazu.utils.utils.Singleton

Bases: type

__call__(*args, **kwargs)

Call self as a function.

static clear_all()
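
Singleton derives from type, which suggests it is meant to be used as a metaclass. The sketch below shows the common singleton-metaclass pattern with a clear_all helper. It is an assumption about the general technique, not a copy of kazu's code: SingletonSketch, its _instances cache and the Config class are hypothetical names.

```python
class SingletonSketch(type):
    """A common singleton-metaclass pattern; kazu's Singleton may differ."""

    # Cache of one instance per class that uses this metaclass (assumed design).
    _instances: dict = {}

    def __call__(cls, *args, **kwargs):
        # Construct the instance on the first call only, then reuse it.
        if cls not in SingletonSketch._instances:
            SingletonSketch._instances[cls] = super().__call__(*args, **kwargs)
        return SingletonSketch._instances[cls]

    @staticmethod
    def clear_all() -> None:
        # Forget every cached instance so the next call constructs afresh.
        SingletonSketch._instances.clear()


class Config(metaclass=SingletonSketch):
    def __init__(self, name: str = "default"):
        self.name = name


first = Config("a")
second = Config("b")  # arguments are ignored here: the cached instance is returned
SingletonSketch.clear_all()
third = Config("c")  # a new instance, because the cache was cleared
```

In this sketch, first and second are the same object, while third is a fresh one. A reset hook like clear_all is typically useful in tests, where cached instances would otherwise leak between test cases.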

kazu.utils.utils.as_path(p)

Parameters:

p (str | Path)

Return type:

Path

kazu.utils.utils.create_char_ngrams(s, n=2)

Return a list of character n-grams, each as a string.

Parameters:

s (str)

n (int)

Return type:

list[str]
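
As an illustration of what character n-grams are, here is a minimal sliding-window sketch. The function name char_ngrams_sketch is hypothetical, and kazu's create_char_ngrams may treat the input differently (for example by padding the string), so the exact output of the library function is not claimed here.

```python
def char_ngrams_sketch(s: str, n: int = 2) -> list[str]:
    # Slide a window of width n over the string, one character at a time.
    # A string shorter than n produces an empty list.
    return [s[i : i + n] for i in range(len(s) - n + 1)]


bigrams = char_ngrams_sketch("kazu")        # ['ka', 'az', 'zu']
trigrams = char_ngrams_sketch("kazu", n=3)  # ['kaz', 'azu']
```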

kazu.utils.utils.create_word_ngrams(s, n=2)

Return a list of word n-grams, each as a single space-separated string.

Parameters:

s (str)

n (int)

Return type:

list[str]
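
The same idea at word level can be sketched as follows. The name word_ngrams_sketch is hypothetical, and splitting on whitespace is an assumption; kazu's create_word_ngrams may tokenise differently.

```python
def word_ngrams_sketch(s: str, n: int = 2) -> list[str]:
    # Split on whitespace (an assumption), then join each run of n
    # consecutive words back into one space-separated string.
    words = s.split()
    return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]


bigrams = word_ngrams_sketch("epidermal growth factor receptor")
# ['epidermal growth', 'growth factor', 'factor receptor']
```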

kazu.utils.utils.documents_to_document_section_batch_encodings_map(docs, tokenizer, stride=128, max_length=512)

Convert documents into a BatchEncoding. Also returns a map from integer ID to Section for the resulting encoding.

Parameters:
Return type:

tuple[BatchEncoding, dict[int, Section]]

kazu.utils.utils.documents_to_id_section_map(docs)

Return a map of the documents' sections, keyed by the order in which the sections occur.

Parameters:

docs (list[Document])

Return type:

dict[int, Section]
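
One plausible reading of the dict[int, Section] return value is a running counter over every section of every document. The sketch below illustrates that reading only. StubSection, StubDocument, the sections attribute and id_section_map_sketch are hypothetical, and the zero-based numbering is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class StubSection:
    # Hypothetical stand-in for kazu's Section.
    text: str


@dataclass
class StubDocument:
    # Hypothetical stand-in for kazu's Document.
    sections: list[StubSection] = field(default_factory=list)


def id_section_map_sketch(docs: list[StubDocument]) -> dict[int, StubSection]:
    # Flatten sections in document order, then number them from 0 (assumed).
    all_sections = (section for doc in docs for section in doc.sections)
    return dict(enumerate(all_sections))


docs = [
    StubDocument([StubSection("title A"), StubSection("body A")]),
    StubDocument([StubSection("title B")]),
]
id_to_section = id_section_map_sketch(docs)
```

Here id_to_section has keys 0, 1 and 2, and key 2 maps to the only section of the second document. An integer index of this kind makes it possible to trace a row of a batched model input back to the section it came from.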

kazu.utils.utils.find_document_from_entity(docs, entity)

For a given entity and a list of docs, find the doc the entity belongs to.

Parameters:
Return type:

Document
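
A lookup of this kind can be sketched as a linear search over each document's entities. Everything below is illustrative: the stub classes, the entities attribute, the identity comparison and the ValueError raised when nothing matches are assumptions, not documented behaviour of find_document_from_entity.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class StubEntity:
    # Hypothetical stand-in for kazu's Entity; eq=False keeps identity semantics.
    match: str


@dataclass(eq=False)
class StubDocument:
    # Hypothetical stand-in for kazu's Document.
    name: str
    entities: list[StubEntity] = field(default_factory=list)


def find_document_sketch(docs: list[StubDocument], entity: StubEntity) -> StubDocument:
    # Return the first document that holds this exact entity object.
    for doc in docs:
        if any(ent is entity for ent in doc.entities):
            return doc
    # Assumed failure mode: the real function may behave differently.
    raise ValueError("entity does not belong to any of the given documents")


egfr = StubEntity("EGFR")
aspirin = StubEntity("aspirin")
docs = [StubDocument("doc-1", [egfr]), StubDocument("doc-2", [aspirin])]
owner = find_document_sketch(docs, aspirin)
```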

kazu.utils.utils.get_match_entity_class_hash(ent)

Parameters:

ent (Entity)

Return type:

int

kazu.utils.utils.linking_candidates_to_ontology_string_resources(candidates)

Parameters:

candidates (Iterable[LinkingCandidate])

Return type:

set[OntologyStringResource]

kazu.utils.utils.word_is_valid(start_char, end_char, starts, ends)

Check whether a string is a valid word by checking that its start and end characters are in the predefined start and end sets of word boundaries.

Parameters:
Return type:

bool
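
The boundary check described above can be sketched as two set-membership tests. This assumes that start_char and end_char are character offsets of a candidate match, and that starts and ends hold the offsets at which tokens begin and end (with exclusive end offsets). The parameter types are not stated on this page, so treat the sketch and the name word_is_valid_sketch as hypothetical.

```python
def word_is_valid_sketch(
    start_char: int, end_char: int, starts: set[int], ends: set[int]
) -> bool:
    # A span is accepted only if it begins where some token begins
    # and finishes where some token finishes.
    return start_char in starts and end_char in ends


text = "aspirin use"
token_starts = {0, 8}  # offsets where "aspirin" and "use" begin
token_ends = {7, 11}   # exclusive end offsets (an assumption)

whole_word = word_is_valid_sketch(0, 7, token_starts, token_ends)  # "aspirin"
partial = word_is_valid_sketch(2, 7, token_starts, token_ends)     # "pirin"
```

In this sketch whole_word is True and partial is False, which is how a string matcher can reject hits that start or stop in the middle of a word.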