Runnable

Orchestrate your functions, notebooks, scripts anywhere!!

Runner icons created by Leremy - Flaticon


Example

The data-science-flavored code below is the well-known iris example from scikit-learn.

"""
Example of Logistic regression using scikit-learn
https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html
"""

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.linear_model import LogisticRegression


def load_data():
    # import some data to play with
    iris = datasets.load_iris()
    X = iris.data[:, :2]  # we only take the first two features.
    Y = iris.target

    return X, Y


def model_fit(X: np.ndarray, Y: np.ndarray, C: float = 1e5):
    logreg = LogisticRegression(C=C)
    logreg.fit(X, Y)

    return logreg


def generate_plots(X: np.ndarray, Y: np.ndarray, logreg: LogisticRegression):
    _, ax = plt.subplots(figsize=(4, 3))
    DecisionBoundaryDisplay.from_estimator(
        logreg,
        X,
        cmap=plt.cm.Paired,
        ax=ax,
        response_method="predict",
        plot_method="pcolormesh",
        shading="auto",
        xlabel="Sepal length",
        ylabel="Sepal width",
        eps=0.5,
    )

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors="k", cmap=plt.cm.Paired)

    plt.xticks(())
    plt.yticks(())

    plt.savefig("iris_logistic.png")

    # TODO: What is the right value?
    return 0.6


## Without any orchestration
def main():
    X, Y = load_data()
    logreg = model_fit(X, Y, C=1.0)
    generate_plots(X, Y, logreg)


## With runnable orchestration
def runnable_pipeline():
    # The below code can be anywhere
    from runnable import Catalog, Pipeline, PythonTask, metric, pickled

    # X, Y = load_data()
    load_data_task = PythonTask(
        function=load_data,
        name="load_data",
        returns=[pickled("X"), pickled("Y")],  # (1)
    )

    # logreg = model_fit(X, Y, C=1.0)
    model_fit_task = PythonTask(
        function=model_fit,
        name="model_fit",
        returns=[pickled("logreg")],
    )

    # generate_plots(X, Y, logreg)
    generate_plots_task = PythonTask(
        function=generate_plots,
        name="generate_plots",
        terminate_with_success=True,
        catalog=Catalog(put=["iris_logistic.png"]),  # (2)
        returns=[metric("score")],
    )

    pipeline = Pipeline(
        steps=[load_data_task, model_fit_task, generate_plots_task],  # (3)
    )  # (4)

    pipeline.execute()

    return pipeline


if __name__ == "__main__":
    # main()
    runnable_pipeline()
  1. Return two serialized objects X and Y.
  2. Store the file iris_logistic.png for future reference.
  3. Define the sequence of tasks.
  4. Define a pipeline with the tasks.
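
A minimal sketch of the idea behind annotation 1, assuming values are matched to downstream arguments purely by name. This is a toy, in-memory illustration and not runnable's implementation; `run_steps` and the `(function, return_names)` pair format are hypothetical names introduced here.

```python
import inspect


def run_steps(steps, initial_params=None):
    """Run (function, return_names) pairs in order, passing values by name.

    Hypothetical helper for illustration only; not part of runnable.
    """
    params = dict(initial_params or {})
    for function, return_names in steps:
        # Supply only the arguments the function declares and that are known.
        wanted = inspect.signature(function).parameters
        kwargs = {name: params[name] for name in wanted if name in params}
        result = function(**kwargs)

        if not return_names:
            continue
        # Normalise a single return value so it can be zipped with its name.
        if len(return_names) == 1:
            result = (result,)
        params.update(zip(return_names, result))
    return params


def load_numbers():
    return [1, 2, 3], [4, 5, 6]


def add_pairs(X, Y, scale=1):
    return [scale * (x + y) for x, y in zip(X, Y)]


final = run_steps(
    [(load_numbers, ["X", "Y"]), (add_pairs, ["total"])],
    initial_params={"scale": 2},
)
print(final["total"])  # [10, 14, 18]
```

The domain functions stay plain Python; only the driver knows which names connect them.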

Notebooks and Shell scripts

You can execute notebooks and shell scripts too!!

They can be written just as you normally would, as plain old notebooks and scripts.

The difference between the native driver and runnable orchestration:

-X, Y = load_data()
+load_data_task = PythonTask(
+    function=load_data,
+    name="load_data",
+    returns=[pickled("X"), pickled("Y")],  # (1)
+)

-logreg = model_fit(X, Y, C=1.0)
+model_fit_task = PythonTask(
+    function=model_fit,
+    name="model_fit",
+    returns=[pickled("logreg")],
+)

-generate_plots(X, Y, logreg)
+generate_plots_task = PythonTask(
+    function=generate_plots,
+    name="generate_plots",
+    terminate_with_success=True,
+    catalog=Catalog(put=["iris_logistic.png"]),  # (2)
+)

+pipeline = Pipeline(
+    steps=[load_data_task, model_fit_task, generate_plots_task],  # (3)
+)

  • Domain code remains completely independent of driver code.
  • The driver function has an equivalent and intuitive runnable expression.
  • Reproducible by default: runnable stores metadata about code, data, and config for every execution.
  • The pipeline is runnable in any environment.
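
As a rough illustration of the reproducibility point above, the sketch below records hashes of code and data files together with the config for one execution. It assumes nothing about runnable's actual run log; `build_run_record` and its field names are hypothetical and only show the kind of metadata that makes a run comparable to another.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_run_record(run_id, code_files, data_files, config):
    """Collect code hashes, data hashes, and config for one execution.

    Hypothetical structure for illustration; not runnable's run log format.
    """
    return {
        "run_id": run_id,
        "code": {path.name: file_sha256(path) for path in code_files},
        "data": {path.name: file_sha256(path) for path in data_files},
        "config": dict(config),
    }


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "input.csv"
    data.write_text("a,b\n1,2\n")
    first = build_run_record("run-1", [], [data], {"C": 1.0})
    second = build_run_record("run-2", [], [data], {"C": 1.0})

    # Changing the data changes its recorded hash.
    data.write_text("a,b\n1,3\n")
    changed = build_run_record("run-3", [], [data], {"C": 1.0})

print(json.dumps(first, indent=2))
```

Two runs over identical inputs produce identical `data` entries, while any change to the file shows up as a different hash.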

Why runnable?

There are many orchestration tools. A well-maintained and curated list is available here.

Broadly, they can be classed as native or meta orchestrators.

native orchestrators

  • Focus on resource management, job scheduling, robustness and scalability.
  • Have fewer features for domain (data engineering, data science) activities.
  • Difficult to run locally.
  • Not ideal for quick experimentation or research activities.

meta orchestrators

  • An abstraction over native orchestrators.
  • Oriented towards domain (data engineering, data science) features.
  • Easy to get started and run locally.
  • Ideal for quick experimentation or research activities.

runnable is a meta orchestrator with a simple API, geared towards data engineering and data science projects. It works in conjunction with native orchestrators and is an alternative to kedro or metaflow.

runnable can also function as an SDK for native orchestrators, since it always compiles pipeline definitions to native orchestrators.
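
To give a feel for what "compiling" a pipeline definition might mean, here is a toy sketch that turns an ordered list of step names into a dependency-based spec. The spec layout (`kind`, `tasks`, `depends_on`) is made up for illustration; it does not correspond to any real orchestrator's schema or to runnable's actual output.

```python
def compile_to_spec(step_names):
    """Turn an ordered list of step names into a dependency-based spec.

    Hypothetical format for illustration only.
    """
    tasks = []
    previous = None
    for name in step_names:
        # Each step depends on the one before it; the first has no dependency.
        depends_on = [previous] if previous is not None else []
        tasks.append({"name": name, "depends_on": depends_on})
        previous = name
    return {"kind": "Workflow", "tasks": tasks}


spec = compile_to_spec(["load_data", "model_fit", "generate_plots"])
for task in spec["tasks"]:
    print(task["name"], "<-", task["depends_on"])
```

The same in-memory definition could be rendered to different targets by swapping the function that produces the spec.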


  • Easy to adopt, it's mostly your code


    Your application code remains as it is. Runnable exists outside of it.

    • No APIs, decorators, or any imposed structure.

    Getting started

  • 🏗 Bring your infrastructure


    runnable is not a platform. It works with your platforms.

    • runnable composes pipeline definitions suited to your infrastructure.

    Infrastructure

  • 📝 Reproducibility


    Runnable tracks key information to reproduce the execution. All this happens without any additional code.

    Run Log

  • 🔁 Retry failures


    Debug any failure in your local development environment.

    Retry

  • 🔬 Testing


    Unit test your code and pipelines.

    • mock/patch the steps of the pipeline
    • test your functions as you normally do.

    Test

  • 💔 Move on


    Moving away from runnable is as simple as deleting relevant files.

    • Your application code remains as it is.
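
The testing point above (mock or patch the steps, test functions as usual) can be sketched with the standard library's `unittest.mock`. This does not use runnable's own mocked executor, which is not shown on this page; the `steps` registry and `driver` function are hypothetical stand-ins for a pipeline.

```python
import types
from unittest import mock


def load_data():
    return [[0.0], [1.0]], [0, 1]


def model_fit(X, Y):
    raise RuntimeError("too expensive for a unit test")


# Hypothetical registry of steps that the driver looks up at call time.
steps = types.SimpleNamespace(load_data=load_data, model_fit=model_fit)


def driver():
    X, Y = steps.load_data()
    return steps.model_fit(X, Y)


# Patch only the expensive step; load_data runs for real.
with mock.patch.object(steps, "model_fit", return_value="fake-model") as fake_fit:
    result = driver()

fake_fit.assert_called_once_with([[0.0], [1.0]], [0, 1])
print(result)  # fake-model
```

Because the domain functions are plain Python, they can still be unit tested directly without any pipeline in the way.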

Comparisons

For the purpose of comparisons, consider the following function:

import pandas as pd


def func(x: int, y: pd.DataFrame):
    # Access some data, input.csv
    # do something with the inputs.
    # Write a file called output.csv for downstream steps.
    # return an output.
    z = ...  # placeholder for the computed output
    return z

It takes

  • inputs x (an integer) and y (a pandas DataFrame or any other object),
  • processes input data from input.csv, expected on the local file system,
  • writes a file, output.csv, to the local file system,
  • returns z (a simple datatype or object).

The function is wrapped in runnable as:

from somewhere import func
from runnable import PythonTask, pickled, Catalog

# instruction to get input.csv from catalog at the start of the step.
# and move output.csv to the catalog at the end of the step
catalog = Catalog(get=["input.csv"], put=["output.csv"])

# Call the function, func and expect it to return "z" while moving the files
# It is expected that "x" and "y" are parameters set by some upstream step.
# If the return parameter is an object, use pickled("z")
func_task = PythonTask(name="function", function=func, returns=["z"], catalog=catalog)

Below are implementations in alternative frameworks. These reflect our best understanding of the frameworks; please let us know if there are better implementations.

Along with the observations,

  • We have implemented the MNIST example in PyTorch in multiple frameworks for easier practical comparison.
  • The tutorials are inspired by tutorials of popular frameworks to give a flavor of runnable.

metaflow

The function in metaflow's step would roughly be:

from metaflow import step, conda, FlowSpec

class Flow(FlowSpec):

    @conda(libraries={...})
    @step
    def func_step(self):
        from somewhere import func
        self.z = func(self.x, self.y)

        # Use metaflow.S3 to move files
        # Move to next step.
        ...
  • The APIs of runnable and metaflow are comparable.
  • There is a mechanism for functions to accept/return parameters.
  • Both support parallel branches and arbitrary nesting of pipelines.

The differences:

dependency management:

runnable depends on the activated virtualenv for dependencies, which is natural to Python. Use custom Docker images to provide the same environment in cloud-based executions.

metaflow uses decorators (conda, pypi) to specify dependencies. This has the advantage of abstracting the Docker ecosystem away from the user.

dataflow:

In runnable, data flows between steps via an instruction to glob files on the local disk and present them in the same structure to downstream steps.

metaflow needs a code-based instruction to do so.
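
The glob-and-present behaviour described above can be sketched with the standard library alone. This illustrates copying matched files while preserving their relative paths; it is not runnable's catalog implementation, and `copy_matching` and the directory names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_matching(patterns, source_dir: Path, target_dir: Path):
    """Copy files matching glob patterns, keeping their relative layout.

    Hypothetical helper; returns the relative paths that were copied.
    """
    copied = []
    for pattern in patterns:
        for source in sorted(source_dir.glob(pattern)):
            if not source.is_file():
                continue
            relative = source.relative_to(source_dir)
            target = target_dir / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)
            copied.append(relative.as_posix())
    return copied


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    upstream = root / "upstream"
    catalog = root / "catalog"
    downstream = root / "downstream"
    (upstream / "data").mkdir(parents=True)
    (upstream / "data" / "output.csv").write_text("z\n42\n")
    downstream.mkdir()

    # "put" at the end of the upstream step, "get" at the start of the next.
    put_files = copy_matching(["data/*.csv"], upstream, catalog)
    got_files = copy_matching(["data/*.csv"], catalog, downstream)
    downstream_text = (downstream / "data" / "output.csv").read_text()

print(put_files, got_files)
```

The downstream step sees `data/output.csv` at the same relative path the upstream step wrote it to, so neither function needs to know about the catalog.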

notebooks:

runnable allows notebooks as tasks. Notebooks can take JSON-style inputs and can return Python objects for downstream steps.

metaflow does not support notebooks as tasks.

infrastructure:

runnable, in many ways, is just a transpiler to your chosen infrastructure.

metaflow is a platform with its own specified infrastructure.

modular pipelines

In runnable, the individual branches of parallel and map states are pipelines themselves and can run in isolation. This is not true in metaflow.

unit testing pipelines

runnable pipelines are testable using a mocked executor, where the executables can be mocked/patched. In metaflow, it depends on how the Python function is wrapped in the pipeline.

distributed training

metaflow supports distributed training.

As of now, runnable does not support distributed training, but support is in the works.


kedro

The function in a kedro implementation would roughly be:

Note that any movement of files should happen via the data catalog.

from kedro.pipeline import Pipeline, node, pipeline
from somewhere import func

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=func,
                inputs=["params:x", "y"],
                outputs=["z"],
                name="my_function",
            ),
            ...
        ]
    )
Footprint

kedro has a larger footprint in the domain code because of its configuration files. It is tightly structured and provides a CLI to get started.

Using runnable in a project requires adding a pipeline definition file (in Python or YAML) and an optional configuration file.

dataflow

Kedro requires the data flowing through the pipeline to go via catalog.yaml, which provides a central place to understand the data.

In runnable, the data is presented to the individual tasks as requested by the catalog instruction.

notebooks

Kedro supports notebooks for exploration but not as tasks of the pipeline.

dynamic pipelines

kedro does not support dynamic pipelines or map state.

distributed training

kedro supports distributed training via a plugin.

As of now, runnable does not support distributed training, but support is in the works.