


Currently provides only sentence segmentation, using a tokenizer trained on the GENIA treebank.

class kazu.steps.other.stanza.StanzaStep

Bases: Step

Currently provides only sentence segmentation, using a tokenizer trained on the GENIA treebank.


To use this step, you need stanza installed. It is not part of the default kazu install, because the default pipeline does not use this step.

You can either do:

$ pip install stanza

Or you can install required dependencies for all steps included in kazu with:

$ pip install 'kazu[all-steps]'
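
Because stanza is optional, code that constructs this step may want to check for it up front. The following is a minimal sketch using only the standard library; the helper name `stanza_available` is hypothetical and not part of kazu.

```python
import importlib.util


def stanza_available() -> bool:
    """Return True if the optional stanza package can be found on the import path."""
    return importlib.util.find_spec("stanza") is not None


if not stanza_available():
    # Point the user at the install command given above.
    print("stanza is not installed; run `pip install stanza` to use StanzaStep")
```

The check only looks the package up and does not import it, so it is cheap to run before building a pipeline.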

Stanza paper:

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Association for Computational Linguistics (ACL) System Demonstrations.

Stanza biomedical and clinical models:

Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D. Manning, Curtis P. Langlotz. 2021. Biomedical and clinical English model packages for the Stanza Python NLP library. Journal of the American Medical Informatics Association.

BibTeX citation details (both papers above):

@inproceedings{qi2020stanza,
    author = {Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.},
    booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
    title = {Stanza: A {Python} Natural Language Processing Toolkit for Many Human Languages},
    url = {https://nlp.stanford.edu/pubs/qi2020stanza.pdf},
    year = {2020}
}

@article{zhang2021biomedical,
    author = {Zhang, Yuhao and Zhang, Yuhui and Qi, Peng and Manning, Christopher D and Langlotz, Curtis P},
    title = "{Biomedical and clinical English model packages for the Stanza Python NLP library}",
    journal = {Journal of the American Medical Informatics Association},
    volume = {28},
    number = {9},
    pages = {1892-1899},
    year = {2021},
    month = {06},
    abstract = "{The study sought to develop and evaluate neural natural language processing (NLP) packages for the
    syntactic analysis and named entity recognition of biomedical and clinical English text. We implement and train
    biomedical and clinical English NLP pipelines by extending the widely used Stanza library originally designed
    for general NLP tasks. Our models are trained with a mix of public datasets such as the CRAFT treebank as well
    as with a private corpus of radiology reports annotated with 5 radiology-domain entities. The resulting
    pipelines are fully based on neural networks, and are able to perform tokenization, part-of-speech tagging,
    lemmatization, dependency parsing, and named entity recognition for both biomedical and clinical text. We
    compare our systems against popular open-source NLP libraries such as CoreNLP and scispaCy, state-of-the-art
    models such as the BioBERT models, and winning systems from the BioNLP CRAFT shared task. For syntactic
    analysis, our systems achieve much better performance compared with the released scispaCy models and CoreNLP
    models retrained on the same treebanks, and are on par with the winning system from the CRAFT shared task. For
    NER, our systems substantially outperform scispaCy, and are better or on par with the state-of-the-art
    performance from BioBERT, while being much more computationally efficient. We introduce biomedical and clinical
    NLP packages built for the Stanza library. These packages offer performance that is similar to the state of the
    art, and are also optimized for ease of use. To facilitate research, we make all our models publicly available.
    We also provide an online demonstration (http://stanza.run/bio).}",
    issn = {1527-974X},
    doi = {10.1093/jamia/ocab090},
    url = {https://doi.org/10.1093/jamia/ocab090},
    eprint = {https://academic.oup.com/jamia/article-pdf/28/9/1892/39731803/ocab090.pdf},
}

Process documents and return both the processed and the failed documents.

Note that many steps are decorated with document_iterating_step() or document_batch_step(), which modify the ‘original’ __call__ function signature to match the expected signature for a step. The decorators handle the exception and failed-document logic for you.


Returns:

The first element is all the provided docs (now modified by the processing); the second is the docs that failed to (fully) process correctly.

Return type:

tuple[list[Document], list[Document]]
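
The call contract described above can be sketched in plain Python. This is an illustration only, not kazu's implementation: `Doc` is a hypothetical stand-in for kazu's Document, `iterating_step` is a hypothetical decorator that only mimics the behaviour described for document_iterating_step(), and `naive_split` is a toy replacement for real sentence segmentation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Doc:
    """Hypothetical stand-in for kazu's Document, for illustration only."""
    text: str
    sentences: list[str] = field(default_factory=list)


def iterating_step(
    per_doc: Callable[[Doc], None],
) -> Callable[[list[Doc]], tuple[list[Doc], list[Doc]]]:
    """Wrap a per-document function so it takes a list and returns (docs, failed)."""
    def step(docs: list[Doc]) -> tuple[list[Doc], list[Doc]]:
        failed: list[Doc] = []
        for doc in docs:
            try:
                per_doc(doc)  # modifies the document in place
            except Exception:
                failed.append(doc)  # record the failure and keep going
        return docs, failed
    return step


@iterating_step
def naive_split(doc: Doc) -> None:
    """Toy segmentation: split on full stops, fail on empty text."""
    if not doc.text:
        raise ValueError("empty document")
    doc.sentences = [s.strip() for s in doc.text.split(".") if s.strip()]


docs = [Doc("BRCA1 is a gene. It is studied widely."), Doc("")]
processed, failed = naive_split(docs)
print(len(processed), len(failed))  # 2 1
```

Note that a failed document still appears in the first element, since that element holds every document that was passed in.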


Parameters:

stanza_pipeline (Pipeline) – The stanza pipeline the step uses for sentence segmentation
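
A pipeline suitable for this parameter might be built as in the sketch below. It assumes stanza's `genia` package for English has already been downloaded (for example with `stanza.download("en", package="genia")`), and the exact arguments kazu's own configuration passes may differ. The helpers `build_genia_pipeline` and `sentence_spans` are hypothetical names, not part of kazu or stanza. `sentence_spans` relies only on the `sentences`, `tokens`, `start_char` and `end_char` attributes of stanza's document objects, so it is demonstrated here on a stand-in object that needs no model.

```python
from types import SimpleNamespace


def build_genia_pipeline():
    """Build a tokenize-only stanza pipeline using the 'genia' package.

    Assumes the model has been downloaded beforehand.
    """
    import stanza  # imported lazily because stanza is an optional dependency

    return stanza.Pipeline(lang="en", package="genia", processors="tokenize")


def sentence_spans(stanza_doc) -> list[tuple[int, int]]:
    """Return (start_char, end_char) offsets for each sentence in a stanza document."""
    return [
        (sent.tokens[0].start_char, sent.tokens[-1].end_char)
        for sent in stanza_doc.sentences
    ]


# Stand-in with the same attribute layout as a stanza document: one sentence, two tokens.
fake_doc = SimpleNamespace(
    sentences=[
        SimpleNamespace(
            tokens=[
                SimpleNamespace(start_char=0, end_char=5),
                SimpleNamespace(start_char=6, end_char=9),
            ]
        )
    ]
)
print(sentence_spans(fake_doc))  # [(0, 9)]
```

With a real pipeline, the same helper would be called as `sentence_spans(build_genia_pipeline()(text))`.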