Handling Large Datasets¶
Our pipeline is working great with small datasets that fit in memory. But what happens when your dataset is 100GB? Or when preprocessing generates gigabytes of intermediate results? Let's solve this with efficient file-based storage.
The Memory Problem¶
In Chapter 4, we passed data between steps using pickled():
PythonTask(
    function=preprocess_data,
    returns=[pickled("preprocessed_data")]  # All data kept in memory!
)
Problems with this approach:
- Large datasets won't fit in memory
- Pickling/unpickling is slow for big objects (see the rough timing sketch after this list)
- Can't easily inspect intermediate results
- Memory pressure on your system
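To get a feel for the cost, here is a rough standalone timing sketch (not part of the pipeline, and the sizes are only illustrative): pickling a large DataFrame copies the whole object into a second in-memory blob.

import pickle
import time

import numpy as np
import pandas as pd

# ~320 MB of float64 data; adjust the shape to taste
df = pd.DataFrame(np.random.rand(2_000_000, 20))

start = time.perf_counter()
blob = pickle.dumps(df)  # a full serialized copy now also lives in memory
elapsed = time.perf_counter() - start

print(f"Pickled {len(blob) / 1e6:.0f} MB in {elapsed:.2f}s")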
The Solution: Catalog for File Storage¶
Instead of passing data through memory, save it to files and let Runnable manage them:
import pandas as pd

from runnable import Pipeline, PythonTask, Catalog, pickled

def load_data_to_file(data_path="data.csv"):
    """Load data and save it to a file."""
    df = load_data(data_path)  # load_data() is the helper from the earlier chapters
    df.to_csv("dataset.csv", index=False)
    return {"rows": len(df), "columns": len(df.columns)}

# Store the dataset file automatically
PythonTask(
    function=load_data_to_file,
    name="load_data",
    catalog=Catalog(put=["dataset.csv"]),  # Store this file
    returns=[pickled("dataset_info")]      # Only metadata in memory
)
Try it:
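A minimal way to run just this first step on its own - a sketch, assuming load_data() and the imports above live in the same file:

Pipeline(steps=[
    PythonTask(
        function=load_data_to_file,
        name="load_data",
        catalog=Catalog(put=["dataset.csv"]),
        returns=[pickled("dataset_info")],
    )
]).execute()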
How Catalog Works¶
Step 1: Create and Store Files¶
def preprocess_from_file(test_size=0.2, random_state=42):
    # Load from file
    df = pd.read_csv("dataset.csv")

    # Do your preprocessing
    preprocessed = preprocess_data(df, test_size, random_state)

    # Save results to files
    preprocessed['X_train'].to_csv("X_train.csv", index=False)
    preprocessed['X_test'].to_csv("X_test.csv", index=False)
    preprocessed['y_train'].to_csv("y_train.csv", index=False)
    preprocessed['y_test'].to_csv("y_test.csv", index=False)

    return {"train_samples": len(preprocessed['X_train'])}

PythonTask(
    function=preprocess_from_file,
    name="preprocess",
    catalog=Catalog(
        get=["dataset.csv"],  # Get input file
        put=["X_train.csv", "X_test.csv", "y_train.csv", "y_test.csv"]  # Store outputs
    )
)
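For genuinely large tables you may prefer a columnar format over CSV. Here is the same step sketched with Parquet - this assumes pyarrow (or fastparquet) is installed; Parquet files are typically smaller and much faster to read and write:

def preprocess_to_parquet(test_size=0.2, random_state=42):
    df = pd.read_csv("dataset.csv")
    preprocessed = preprocess_data(df, test_size, random_state)
    for name in ("X_train", "X_test", "y_train", "y_test"):
        # pd.DataFrame(...) also turns a label Series into a one-column frame
        pd.DataFrame(preprocessed[name]).to_parquet(f"{name}.parquet", index=False)
    return {"train_samples": len(preprocessed["X_train"])}

PythonTask(
    function=preprocess_to_parquet,
    name="preprocess",
    catalog=Catalog(
        get=["dataset.csv"],
        put=["X_train.parquet", "X_test.parquet", "y_train.parquet", "y_test.parquet"]
    )
)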
Step 2: Retrieve and Use Files¶
def train_from_files(n_estimators=100, random_state=42):
    # Files are automatically available!
    X_train = pd.read_csv("X_train.csv")
    y_train = pd.read_csv("y_train.csv")['target']

    # Train your model
    model = train_model(...)
    return model

PythonTask(
    function=train_from_files,
    name="train",
    catalog=Catalog(get=["X_train.csv", "y_train.csv"])  # Get only what you need
)
Complete File-Based Pipeline¶
Here's the full pipeline using file storage for large data:
def evaluate_from_files(model_data):
    """Load test data from files and evaluate the model."""
    # Load test data from files
    X_test = pd.read_csv("X_test.csv")
    y_test = pd.read_csv("y_test.csv")['target']

    print(f"Evaluating on {len(X_test)} test samples")

    preprocessed_data = {
        'X_train': None,  # Not needed for evaluation
        'y_train': None,
        'X_test': X_test,
        'y_test': y_test
    }

    results = evaluate_model(model_data, preprocessed_data)
    return results
def main():
    """Demonstrate file-based data management with Catalog."""
    print("=" * 50)
    print("Chapter 5: Handling Large Datasets")
    print("=" * 50)

    pipeline = Pipeline(steps=[
        # Load data and store the dataset file
        PythonTask(
            function=load_data_to_file,
            name="load_data",
            catalog=Catalog(put=["dataset.csv"]),
            returns=[pickled("dataset_info")]
        ),
        # Preprocess and store all intermediate files
        PythonTask(
            function=preprocess_from_file,
            name="preprocess",
            catalog=Catalog(
                get=["dataset.csv"],  # Get the dataset
                put=["X_train.csv", "X_test.csv", "y_train.csv", "y_test.csv"]  # Store results
            ),
            returns=[pickled("preprocess_info")]
        ),
        # Train model using files
        PythonTask(
            function=train_from_files,
            name="train",
            catalog=Catalog(get=["X_train.csv", "y_train.csv"]),
            returns=[pickled("model_data")]
        ),
        # Evaluate using files
        PythonTask(
            function=evaluate_from_files,
            name="evaluate",
            catalog=Catalog(get=["X_test.csv", "y_test.csv"]),
            returns=[pickled("evaluation_results")]
        ),
    ])

    pipeline.execute()
    return pipeline

if __name__ == "__main__":
    main()
What You Get with File-Based Storage¶
💾 Handle Large Datasets¶
Your dataset can be bigger than available RAM - only load what you need when you need it:
# Only load training data for training step
X_train = pd.read_csv("X_train.csv") # Maybe 50GB
# X_test isn't loaded - saves memory!
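If even a single file is too big to read at once, you can stream it - a minimal sketch using pandas' chunked CSV reader (file names as in the steps above):

import pandas as pd

total_rows = 0
for chunk in pd.read_csv("X_train.csv", chunksize=1_000_000):
    total_rows += len(chunk)  # work on ~1M rows at a time, then let the chunk go

print(f"Processed {total_rows} rows without holding the full file in memory")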
🔄 Automatic File Management¶
Runnable handles file locations transparently:
put=["file.parquet"]- Stores file safely in.runnable/catalogget=["file.parquet"]- Makes file available in your working directory- Files appear exactly where your code expects them
📦 Inspect Intermediate Results¶
All intermediate files are preserved:
# Check what preprocessing produced
ls .runnable/catalog/
# X_train.csv X_test.csv y_train.csv y_test.csv
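Because they are ordinary files, you can also inspect them from a quick script or notebook - a sketch that assumes the default file-system catalog and the flat layout shown above (your configuration may nest files in per-run subdirectories, so adjust the path accordingly):

import pandas as pd

X_train = pd.read_csv(".runnable/catalog/X_train.csv")
print(X_train.shape)
print(X_train.describe())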
🚀 Resume Without Reloading¶
If training fails, you don't need to reload and re-preprocess your 100GB dataset - the intermediate files are already in the catalog!
🤝 Share Results¶
Team members can reuse your preprocessed data without running expensive preprocessing steps.
When to Use Files vs Memory¶
Use Catalog(put=[...]) for files when:
- Dataset is large (>1GB)
- Preprocessing is expensive
- You want to inspect intermediate results
- Team members need to share data
Use pickled() for memory when:
- Data is small (<100MB)
- Objects are complex (models, configs)
- You need fast passing between steps
Mixing Files and Memory¶
You can use both approaches in the same pipeline:
pipeline = Pipeline(steps=[
    PythonTask(
        function=load_data_to_file,
        catalog=Catalog(put=["dataset.csv"]),  # Large data → file
        returns=[pickled("metadata")]          # Small metadata → memory
    ),
    PythonTask(
        function=train_from_files,
        catalog=Catalog(get=["dataset.csv"]),  # Get large data from file
        returns=[pickled("model")]             # Model usually fits in memory
    )
])
Compare: Memory vs File Storage¶
Memory Passing (Chapters 1-4):
- ❌ Limited by available RAM
- ❌ Slow for large objects
- ❌ Hard to inspect intermediate data
- ✅ Simple for small objects
- ✅ Fast for small data
File Storage (Chapter 5):
- ✅ Handle datasets larger than RAM
- ✅ Efficient for large files
- ✅ Easy to inspect intermediate results
- ✅ Shareable across runs and team members
- ✅ Automatic file management
What's Next?¶
We can now handle large datasets efficiently. But what about saving your trained models and results permanently? What if your teammate wants to use your model without rerunning everything?
Next chapter: We'll add persistent storage for models and results that can be shared across runs and team members.
Next: Sharing Results - Persistent model artifacts and metrics