kazu.utils.utils

Functions

`as_path`(p)
`create_char_ngrams`(s[, n])	Return the char ngrams of a string as a list of strings.
`create_word_ngrams`(s[, n])	Return a list of word ngrams, each as a single space-separated string.
`documents_to_document_section_batch_encodings_map`(...)	Convert documents into a BatchEncoding.
`documents_to_id_section_map`(docs)	Return a map of document sections, indexed by order of sections.
`find_document_from_entity`(docs, entity)	For a given entity and a list of docs, find the doc the entity belongs to.
`get_match_entity_class_hash`(ent)
`linking_candidates_to_ontology_string_resources`(...)
`word_is_valid`(start_char, end_char, starts, ends)	Check whether a span is a valid word by checking that its start and end character offsets are in predefined sets of word-boundary starts and ends.

Classes

`EntityClassFilter`	A condition that returns True if a document has any entity whose class is one of the required_entity_classes.
`Singleton`

class kazu.utils.utils.EntityClassFilter

Bases: object

A condition that returns True if a document has any entity whose class is one of the required_entity_classes.

__call__(document)

Call self as a function.

Parameters: document (Document)
Return type: bool

__init__(required_entity_classes)

Parameters: required_entity_classes (Iterable[str]) – list of str, specifying entity classes to assess
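
To illustrate how a filter of this kind behaves, here is a minimal sketch built on hypothetical stand-in classes (`FakeEntity`, `FakeDocument`, `ClassFilterSketch`) rather than kazu's own `Document` and `Entity`. It assumes that an entity exposes an `entity_class` string and that a document exposes `get_entities()`; the real attribute names and implementation may differ.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class FakeEntity:
    # Hypothetical stand-in for kazu's Entity; only the class label is modelled.
    entity_class: str


@dataclass
class FakeDocument:
    # Hypothetical stand-in for kazu's Document.
    entities: list[FakeEntity] = field(default_factory=list)

    def get_entities(self) -> list[FakeEntity]:
        return self.entities


class ClassFilterSketch:
    """Callable that is True when a document has an entity of a required class."""

    def __init__(self, required_entity_classes: Iterable[str]):
        # Store as a set so membership checks are fast and the iterable is consumed once.
        self.required_entity_classes = set(required_entity_classes)

    def __call__(self, document: FakeDocument) -> bool:
        return any(
            ent.entity_class in self.required_entity_classes
            for ent in document.get_entities()
        )


only_genes = ClassFilterSketch(["gene"])
doc_with_gene = FakeDocument([FakeEntity("drug"), FakeEntity("gene")])
doc_without_gene = FakeDocument([FakeEntity("disease")])
```

In this sketch `only_genes(doc_with_gene)` evaluates to True and `only_genes(doc_without_gene)` to False, so the object can be passed anywhere a document predicate is expected.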

class kazu.utils.utils.Singleton

Bases: type

__call__(*args, **kwargs)

Call self as a function.

Parameters:

args (Any)
kwargs (Any)

Return type:

Any

static clear_all()

Return type: None
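
Since `Singleton` derives from `type`, it is presumably meant to be used as a metaclass. The following is a minimal sketch of the common singleton-metaclass pattern, not kazu's actual code: `SingletonSketch`, its `_instances` cache, and the `Config` class are all hypothetical names introduced for illustration.

```python
from typing import Any


class SingletonSketch(type):
    """Metaclass that caches one instance per class (an assumed implementation)."""

    _instances: dict[type, Any] = {}

    def __call__(cls, *args: Any, **kwargs: Any) -> Any:
        # Build the instance on first use, then always return the cached one.
        if cls not in SingletonSketch._instances:
            SingletonSketch._instances[cls] = super().__call__(*args, **kwargs)
        return SingletonSketch._instances[cls]

    @staticmethod
    def clear_all() -> None:
        # Forget every cached instance so the next call constructs a fresh one.
        SingletonSketch._instances.clear()


class Config(metaclass=SingletonSketch):
    # Hypothetical class used only to demonstrate the metaclass.
    def __init__(self, name: str = "default"):
        self.name = name


first = Config("a")
second = Config("b")  # arguments are ignored here; the cached instance is returned
SingletonSketch.clear_all()
third = Config("c")  # a new instance, because the cache was cleared
```

Under this pattern `first` and `second` are the same object, while `third` is a fresh one. A `clear_all()` of this kind is mainly useful in tests, where state cached by one test should not leak into the next.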

kazu.utils.utils.as_path(p)

Parameters: p (str | Path)
Return type: Path
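
The signature suggests a small normalisation helper that accepts either a string or a `Path`. A minimal sketch of that idea, under the assumption that existing `Path` objects are returned unchanged (`as_path_sketch` is a hypothetical name, not the library function):

```python
from pathlib import Path
from typing import Union


def as_path_sketch(p: Union[str, Path]) -> Path:
    # Leave Path objects untouched; wrap plain strings in a Path.
    return p if isinstance(p, Path) else Path(p)


from_str = as_path_sketch("data/ontologies")
from_path = as_path_sketch(Path("data/ontologies"))
```

Both calls yield equal `Path` objects, so downstream code can rely on `Path` methods regardless of what the caller supplied.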

kazu.utils.utils.create_char_ngrams(s, n=2)

Return the char ngrams of a string as a list of strings.

Parameters:

s (str)
n (int)

Return type:

list[str]
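
To make the expected shape of the result concrete, here is a minimal sketch of character n-gram extraction with a sliding window. `char_ngrams_sketch` is a hypothetical name, and the behaviour for strings shorter than `n` (an empty list) is an assumption rather than documented behaviour.

```python
def char_ngrams_sketch(s: str, n: int = 2) -> list[str]:
    # Slide a window of width n over the string, one character at a time.
    return [s[i : i + n] for i in range(len(s) - n + 1)]


char_bigrams = char_ngrams_sketch("kazu")        # ['ka', 'az', 'zu']
char_trigrams = char_ngrams_sketch("kazu", n=3)  # ['kaz', 'azu']
```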

kazu.utils.utils.create_word_ngrams(s, n=2)

Return a list of word ngrams, each as a single space-separated string.

Parameters:

s (str)
n (int)

Return type:

list[str]
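
The word-level variant can be sketched the same way. This example assumes words are separated by single spaces and that each n-gram is re-joined with a space; `word_ngrams_sketch` is a hypothetical name and the library's tokenisation may differ.

```python
def word_ngrams_sketch(s: str, n: int = 2) -> list[str]:
    words = s.split(" ")
    # Zipping n staggered views of the word list yields each window of n words.
    windows = zip(*(words[i:] for i in range(n)))
    return [" ".join(window) for window in windows]


word_bigrams = word_ngrams_sketch("non small cell lung")
# ['non small', 'small cell', 'cell lung']
word_trigrams = word_ngrams_sketch("non small cell lung", n=3)
# ['non small cell', 'small cell lung']
```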

kazu.utils.utils.documents_to_document_section_batch_encodings_map(docs, tokenizer, stride=128, max_length=512)

Convert documents into a BatchEncoding. Also returns a map from int index to Section for the resulting encoding.

Parameters:

docs (list[Document])
tokenizer (PreTrainedTokenizerBase)
stride (int)
max_length (int)

Return type:

tuple[BatchEncoding, dict[int, Section]]

kazu.utils.utils.documents_to_id_section_map(docs)

Return a map of document sections, indexed by order of sections.

Parameters: docs (list[Document])
Return type: dict[int, Section]
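
A minimal sketch of what such a mapping might look like, using hypothetical stand-ins (`FakeSection`, `FakeDocument`) for kazu's classes. The numbering scheme shown here, consecutive integers starting at 0 and running across all documents in order, is an assumption consistent with the `dict[int, Section]` return type, not confirmed behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSection:
    # Hypothetical stand-in for kazu's Section.
    name: str
    text: str


@dataclass
class FakeDocument:
    # Hypothetical stand-in for kazu's Document.
    sections: list[FakeSection] = field(default_factory=list)


def id_section_map_sketch(docs: list[FakeDocument]) -> dict[int, FakeSection]:
    # Number sections consecutively across all documents, in order.
    all_sections = (section for doc in docs for section in doc.sections)
    return dict(enumerate(all_sections))


docs = [
    FakeDocument([FakeSection("title", "A"), FakeSection("body", "B")]),
    FakeDocument([FakeSection("title", "C")]),
]
mapping = id_section_map_sketch(docs)
```

Here `mapping` has keys 0, 1 and 2, with key 2 pointing at the only section of the second document. A flat index like this lets batched model output be traced back to the section it came from.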

kazu.utils.utils.find_document_from_entity(docs, entity)

For a given entity and a list of docs, find the doc the entity belongs to.

Parameters:

docs (list[Document])
entity (Entity)

Return type:

Document

kazu.utils.utils.get_match_entity_class_hash(ent)

Parameters: ent (Entity)
Return type: int

kazu.utils.utils.linking_candidates_to_ontology_string_resources(candidates)

Parameters: candidates (Iterable[LinkingCandidate])
Return type: set[OntologyStringResource]

kazu.utils.utils.word_is_valid(start_char, end_char, starts, ends)

Check whether a span is a valid word by checking that its start and end character offsets are in predefined sets of word-boundary starts and ends.

Parameters:

start_char (int)
end_char (int)
starts (set[int])
ends (set[int])

Return type:

bool
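
The check described above amounts to two set-membership tests. A minimal sketch, assuming the integers are character offsets into a text and that end offsets are exclusive (`word_is_valid_sketch` and the example offsets are hypothetical):

```python
def word_is_valid_sketch(
    start_char: int, end_char: int, starts: set[int], ends: set[int]
) -> bool:
    # A span is accepted only if it begins and ends on a known word boundary.
    return start_char in starts and end_char in ends


text = "heart attack"
# Assumed convention: start offsets are inclusive, end offsets are exclusive.
starts = {0, 6}   # "heart" starts at 0, "attack" at 6
ends = {5, 12}    # "heart" ends at 5, "attack" at 12

whole_phrase_ok = word_is_valid_sketch(0, 12, starts, ends)  # "heart attack"
partial_word_ok = word_is_valid_sketch(1, 5, starts, ends)   # "eart"
```

With these sets the span covering the whole phrase is accepted, while a span that starts mid-word is rejected. That rejection is the point of a check like this when a matcher proposes spans that cut through tokens.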