Scaling with Ray
Typically, we want to run Kazu over a large number of documents, so we need a framework to handle distributed processing.
Ray is a simple-to-use, actor-style framework that works extremely well for this. In this example,
we demonstrate how Ray can be used to scale Kazu over multiple cores.
Note
Ray can also be used in a multi-node environment for extreme scaling. Please refer to the Ray documentation for details.
Overview
We’ll use the Kazu LLMNERStep with some clean-up actions to build a Kazu pipeline. We’ll then create multiple
Ray actors, each of which instantiates this pipeline, and feed those actors Kazu Documents through a ray.util.queue.Queue.
The actors will process the documents and write the results to another ray.util.queue.Queue. The main process will then
read from this second queue and write the results to disk.
The code for this orchestration is in `scripts/examples/annotate_with_llm.py` and the configuration is in `scripts/examples/conf/annotate_with_llm/default.yaml`.
The script can be executed with:
$ python scripts/examples/annotate_with_llm.py --config-path /<fully qualified>/kazu/scripts/examples/conf hydra.job.chdir=True
Note
You will need to add values for the configuration keys marked ???, such as your input directory, Vertex configuration, etc.
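These values can be set by editing `default.yaml` directly, or supplied as additional Hydra overrides on the command line. The override below is illustrative only: `annotate_with_llm.input_dir` is a hypothetical key name, so check `default.yaml` for the keys it actually defines.
$ python scripts/examples/annotate_with_llm.py --config-path /<fully qualified>/kazu/scripts/examples/conf hydra.job.chdir=True annotate_with_llm.input_dir=/path/to/input/docs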