# Why runnable?
There are many orchestration tools; a well-maintained and curated list is available here.
Broadly, they can be classed into native or meta orchestrators.
## native orchestrators
- Focus on resource management, job scheduling, robustness and scalability.
- Offer fewer features for domain (data engineering, data science) activities.
- Difficult to run locally.
- Not ideal for quick experimentation or research activities.
## meta orchestrators
- An abstraction over native orchestrators.
- Oriented towards domain (data engineering, data science) features.
- Easy to get started and run locally.
- Ideal for quick experimentation or research activities.
`runnable` is a meta orchestrator with a simple API, geared towards data engineering and data science projects. It works in conjunction with native orchestrators and is an alternative to kedro or metaflow in its design philosophy.

`runnable` can also function as an SDK for native orchestrators, as it always compiles pipeline definitions to native orchestrator formats.
- **Easy to adopt, it's mostly your code**: Your application code remains as it is; `runnable` exists outside of it. No APIs, decorators, or any imposed structure.
- **Bring your infrastructure**: `runnable` is not a platform; it works with your platforms. `runnable` composes pipeline definitions suited to your infrastructure.
- **Reproducibility**: `runnable` tracks key information to reproduce the execution. All this happens without any additional code.
- **Retry failures**: Debug any failure in your local development environment.
- **Testing**: Unit test your code and pipelines (see the sketch after this list).
    - mock/patch the steps of the pipeline
    - test your functions as you normally do
- **Move on**: Moving away from `runnable` is as simple as deleting the relevant files. Your application code remains as it is.
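As an example of the last point under Testing: since the task is a plain python function, it can be unit tested without any runnable machinery. A minimal sketch, where the inputs and the assertion are illustrative:

```python
import pandas as pd

from somewhere import func


def test_func():
    # Plain call, no orchestration involved; assumes input.csv is present
    # in the working directory, as the function expects.
    y = pd.DataFrame({"value": [1, 2, 3]})
    z = func(3, y)
    assert z is not None
```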
## Comparisons
For the purpose of comparisons, consider the following function:
```python
import pandas as pd


def func(x: int, y: pd.DataFrame):
    # Access some data, input.csv.
    # Do something with the inputs.
    # Write a file called output.csv for downstream steps.
    # Return an output; z stands in for the computed result.
    return z
```
The function:
- takes inputs x (an integer) and y (a pandas dataframe or any other object),
- processes the input data, input.csv, expected on the local filesystem,
- writes a file, output.csv, to the local filesystem,
- returns z (a simple datatype or object).
The function is wrapped in `runnable` as:
```python
from somewhere import func

from runnable import Catalog, PythonTask, pickled

# Instruction to get input.csv from the catalog at the start of the step
# and to move output.csv to the catalog at the end of the step.
catalog = Catalog(get=["input.csv"], put=["output.csv"])

# Call the function func and expect it to return "z" while moving the files.
# "x" and "y" are expected to be parameters set by some upstream step.
# If the return parameter is an object, use pickled("z").
func_task = PythonTask(name="function", function=func, returns=["z"], catalog=catalog)
```
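As the comment notes, a returned object (rather than a simple datatype) is declared with pickled. For instance, if z were a dataframe:

```python
# "z" is serialized via pickle so downstream steps can consume the object.
func_task = PythonTask(
    name="function",
    function=func,
    returns=[pickled("z")],
    catalog=catalog,
)
```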
Below are the implementations in alternative frameworks. They reflect the best of our understanding of those frameworks; please let us know if there are better implementations.

Along with the observations:

- We have implemented an MNIST example in pytorch in multiple frameworks for an easier practical comparison.
- The tutorials are inspired by tutorials of popular frameworks to give a flavor of `runnable`.
### metaflow
The function in metaflow's step would roughly be:
```python
from metaflow import FlowSpec, conda, step


class Flow(FlowSpec):

    @conda(libraries={...})
    @step
    def func_step(self):
        from somewhere import func

        self.z = func(self.x, self.y)
        # Use metaflow.S3 to move files.
        # Move to the next step.
        ...
```
- The APIs of `runnable` and `metaflow` are comparable.
- Both provide a mechanism for functions to accept/return parameters.
- Both support parallel branches and arbitrary nesting of pipelines.
The differences:
#### dependency management

`runnable` depends on the activated virtualenv for dependencies, which is natural to python. Use custom docker images to provide the same environment in cloud-based executions.

`metaflow` uses decorators (conda, pypi) to specify dependencies. This has the advantage of abstracting the user from the docker ecosystem.
#### dataflow

In `runnable`, data flows between steps via an instruction to glob files on the local disk and present them in the same structure to downstream steps. `metaflow` needs a code-based instruction to do so.
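For illustration, the catalog instruction can name glob patterns rather than individual files; the patterns below are illustrative:

```python
from runnable import Catalog

# Get every csv under data/ before the step runs and put everything
# under reports/ back into the catalog afterwards.
catalog = Catalog(get=["data/*.csv"], put=["reports/*"])
```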
#### notebooks

`runnable` allows notebooks as tasks. Notebooks can take JSON-style inputs and can return pythonic objects for downstream steps. `metaflow` does not support notebooks as tasks.
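A minimal sketch of a notebook task, assuming a NotebookTask analogous to PythonTask; the notebook path and return name are illustrative:

```python
from runnable import NotebookTask

# The notebook receives JSON-style parameters set upstream and can
# return pythonic objects (here "summary") for downstream steps.
notebook_task = NotebookTask(
    name="eda",
    notebook="notebooks/eda.ipynb",  # illustrative path
    returns=["summary"],
)
```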
#### infrastructure

`runnable`, in many ways, is just a transpiler to your chosen infrastructure. `metaflow` is a platform with its own specified infrastructure.
#### modular pipelines

In `runnable`, the individual branches of parallel and map states are pipelines themselves and can run in isolation. This is not true in `metaflow`.
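A sketch of this, assuming a Parallel node that takes named branches, each a full Pipeline (the branch names and the tasks task_a, task_b are illustrative):

```python
from runnable import Parallel, Pipeline

# Each branch is a complete pipeline and can also be executed on its own.
branch_a = Pipeline(steps=[task_a])
branch_b = Pipeline(steps=[task_b])

parallel_node = Parallel(
    name="parallel",
    branches={"a": branch_a, "b": branch_b},
    terminate_with_success=True,
)
```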
#### unit testing pipelines

`runnable` pipelines are testable using the mocked executor, where the executables can be mocked/patched. In `metaflow`, it depends on how the python function is wrapped in the pipeline.
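For example, pointing the execution at a configuration that selects the mocked executor exercises the pipeline wiring without running the real steps; the configuration file path and the configuration_file argument below reflect our reading of runnable's configuration mechanism:

```python
from runnable import Pipeline


def test_pipeline_wiring():
    # Under the mocked executor no step body actually runs, so the
    # pipeline structure can be asserted without side effects.
    pipeline = Pipeline(steps=[func_task])
    pipeline.execute(configuration_file="configs/mocked.yaml")
```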
#### distributed training

`metaflow` supports distributed training. As of now, `runnable` does not, but support is in the works.
### kedro
The function in a `kedro` implementation would roughly be as below. Note that any movement of files should happen via the data catalog.
```python
from kedro.pipeline import Pipeline, node, pipeline

from somewhere import func


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=func,
                inputs=["params:x", "y"],
                outputs=["z"],
                name="my_function",
            ),
            ...
        ]
    )
```
#### Footprint

`kedro` has a larger footprint in the domain code through its configuration files. It is tightly structured and provides a CLI to get started. Using `runnable` as part of a project only requires adding a pipeline definition file (in python or yaml) and an optional configuration file.
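As a sketch, an entire python pipeline definition file can be as small as the following; the file name and the Pipeline/execute usage follow our reading of runnable's SDK:

```python
# pipeline.py (illustrative file name)
from somewhere import func

from runnable import Catalog, Pipeline, PythonTask


def main():
    catalog = Catalog(get=["input.csv"], put=["output.csv"])
    func_task = PythonTask(
        name="function",
        function=func,
        returns=["z"],
        catalog=catalog,
        terminate_with_success=True,
    )
    pipeline = Pipeline(steps=[func_task])
    pipeline.execute()
    return pipeline


if __name__ == "__main__":
    main()
```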
dataflow¶
Kedro needs the data flowing through the pipeline via catalog.yaml
which
provides a central place to understand the data.
In runnable
, the data is presented to the individual tasks as
requested by the catalog
instruction.
#### notebooks

`kedro` supports notebooks for exploration but not as tasks of the pipeline.
#### dynamic pipelines

`kedro` does not support dynamic pipelines or a map state.
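For contrast, a sketch of a map state in `runnable`, assuming a Map node that fans a branch pipeline out over an iterable parameter (the parameter names and process_chunk_task are illustrative):

```python
from runnable import Map, Pipeline

# The branch is itself a pipeline; Map runs it once per element of the
# iterable parameter "chunks", presenting each element as "chunk".
branch = Pipeline(steps=[process_chunk_task])

map_node = Map(
    name="map",
    branch=branch,
    iterate_on="chunks",
    iterate_as="chunk",
    terminate_with_success=True,
)
```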
#### distributed training

`kedro` supports distributed training via a plugin. As of now, `runnable` does not, but support is in the works.