Skip to content

Tracking Model Performance

You've trained models and stored files (Chapter 5), but how do you track performance over time? How do you compare today's 94% accuracy with yesterday's 91%? Let's add structured metrics tracking for model analytics.

The Metrics Tracking Problem

After running your pipeline, you have files but no performance history:

# Train a model...
pipeline.execute()
# Files are saved (Chapter 5 ✅) but:
# - What was the accuracy?
# - How does it compare to previous runs?
# - Which hyperparameters worked best?
# - Can teammates see model performance trends?

The missing piece: structured performance tracking

  • Metrics buried in output logs
  • No systematic performance comparison
  • Hard to track model improvements over time
  • Team can't easily compare model performance

The Solution: Structured Metrics Tracking

Building on Chapter 5's file storage, let's add metrics tracking for performance analytics:

examples/tutorials/getting-started/06_sharing_results.py
from runnable import Pipeline, PythonTask, Catalog, metric, pickled

def save_evaluation_metrics(evaluation_results):
    """Extract and return metrics for tracking."""
    accuracy = evaluation_results['accuracy']
    report = evaluation_results['classification_report']

    # Return metrics for structured tracking (not just files)
    return (
        accuracy,
        report['weighted avg']['precision'],
        report['weighted avg']['recall'],
        report['weighted avg']['f1-score']
    )

PythonTask(
    function=save_evaluation_metrics,
    name="save_metrics",
    returns=[
        metric("accuracy"),      # ← Tracked in run log for analytics
        metric("precision"),     # ← Not just saved to files
        metric("recall"),
        metric("f1_score")
    ]
)

Try it:

uv run examples/tutorials/getting-started/06_sharing_results.py

Understanding metric() vs pickled() Returns

Key insight: Different return types serve different purposes:

# Chapter 5 approach: File storage
returns=[pickled("model_data")]     # Stores complex objects
catalog=Catalog(put=["file.csv"])   # Stores files

# Chapter 6 approach: Metrics tracking
returns=[metric("accuracy")]        # Tracks performance numbers
# No catalog needed - metrics go to run log

How metric() works:

  1. Structured storage: Metrics stored as key-value pairs in run log
  2. Easy comparison: Query and compare across runs
  3. Analytics ready: Perfect for tracking trends and performance
  4. Lightweight: Just numbers, not large files or objects

When to use what:

  • metric(): Performance numbers, hyperparameters, counts
  • pickled(): Models, complex objects, datasets
  • catalog=Catalog(put=[]): Files, reports, artifacts

Tracking Metrics Over Time

Use metric() to track performance metrics:

def save_evaluation_metrics(evaluation_results):
    """Save metrics for tracking."""
    accuracy = evaluation_results['accuracy']
    report = evaluation_results['classification_report']

    # Return metrics for tracking (metrics don't need files)
    return (
        accuracy,
        report['weighted avg']['precision'],
        report['weighted avg']['recall'],
        report['weighted avg']['f1-score']
    )

PythonTask(
    function=save_evaluation_metrics,
    name="save_metrics",
    returns=[
        metric("accuracy"),      # Tracked in run log automatically
        metric("precision"),     # No catalog needed for metrics
        metric("recall"),
        metric("f1_score")
    ]
)

Metrics are special:

  • Automatically tracked in run logs
  • Easy to compare across runs
  • Can be visualized over time
  • Help identify model improvements

Comparing Metrics Across Runs

The real power: compare performance across different runs and experiments:

# Run different experiments
RUNNABLE_PRM_n_estimators=50 uv run 06_sharing_results.py   # Run 1
RUNNABLE_PRM_n_estimators=100 uv run 06_sharing_results.py  # Run 2
RUNNABLE_PRM_n_estimators=200 uv run 06_sharing_results.py  # Run 3

# Compare metrics across all runs
ls .run_log_store/
# curious-euler-0123/  # n_estimators=50
# happy-tesla-0124/    # n_estimators=100
# wise-darwin-0125/    # n_estimators=200

Each run's metrics are tracked:

# .run_log_store/curious-euler-0123/run_log.json
{
  "run_id": "curious-euler-0123",
  "parameters": {"n_estimators": 50},
  "metrics": {
    "accuracy": 0.8234,
    "precision": 0.8156,
    "recall": 0.8234,
    "f1_score": 0.8189
  }
}

Complete Pipeline with Metrics Tracking

Here's a pipeline that combines Chapter 5's file storage with metrics tracking:

examples/tutorials/getting-started/06_sharing_results.py
        PythonTask(
            function=save_evaluation_metrics,
            name="save_metrics",
            catalog=Catalog(put=["evaluation_report.json", "metrics_summary.json"]),
            returns=[
                metric("accuracy"),
                metric("precision"),
                metric("recall"),
                metric("f1_score")
            ]
        ),

What You Get with Metrics Tracking

📊 Structured Performance Data

Unlike buried logs, metrics are stored in structured format:

# Easy to parse and compare
{
  "run_id": "happy-euler-0123",
  "parameters": {"n_estimators": 100, "test_size": 0.2},
  "metrics": {
    "accuracy": 0.9234,
    "precision": 0.9156
  }
}

🎯 Experiment Comparison

Compare hyperparameters and results side by side:

# Run A: n_estimators=50  → accuracy: 0.8234
# Run B: n_estimators=100 → accuracy: 0.9156
# Run C: n_estimators=200 → accuracy: 0.9234

# Clear winner: Run C with n_estimators=200

📈 Performance History

Compare different runs:

# See all your runs
ls .run_log_store/

# Compare metrics from different runs
cat .run_log_store/run-1/run_log.json | grep accuracy
cat .run_log_store/run-2/run_log.json | grep accuracy

🤝 Team Performance Tracking

  • Compare metrics across team members' experiments
  • Track overall model performance improvements
  • See which hyperparameters work best across the team
  • Build shared knowledge of what approaches work

Where Metrics Are Stored

Metrics live in the run log (not catalog like files from Chapter 5):

.run_log_store/             # Metrics stored here
  ├── run-id-123/
     └── run_log.json      # Contains structured metrics
  └── run-id-124/
      └── run_log.json      # Each run's metrics

# Quick metrics lookup
cat .run_log_store/*/run_log.json | grep -A 10 '"metrics"'

Key difference from Chapter 5: - Files.catalog/ (datasets, models, reports) - Metrics.run_log_store/ (performance numbers)

Compare: Ad-hoc vs Structured Metrics

Ad-hoc Performance Tracking (Chapters 1-5):

  • ❌ Metrics buried in print statements
  • ❌ No systematic comparison across runs
  • ❌ Hard to answer "which experiment was best?"
  • ❌ Team can't easily compare approaches

Structured Metrics Tracking (Chapter 6):

  • ✅ Metrics stored as searchable data
  • ✅ Easy comparison across experiments
  • ✅ Clear performance trends over time
  • ✅ Team collaboration on model performance
  • ✅ Data-driven experiment decisions

Real-World Metrics Use Cases

Hyperparameter Optimization

# Track different hyperparameter combinations
returns=[
    metric("accuracy"),
    metric("n_estimators"),     # Track hyperparameter used
    metric("max_depth"),        # Track another hyperparameter
    metric("training_time")     # Track performance metrics
]

A/B Testing

# Compare two models
returns=[
    metric("model_a_accuracy"),
    metric("model_b_accuracy")
]

Team Leaderboard

# Everyone tracks the same metrics
returns=[
    metric("accuracy"),
    metric("f1_score"),
    metric("training_time"),
    metric("data_scientist")  # Track who ran the experiment
]

# Easy to see: "Who achieved the best accuracy?"
# "Which approach is fastest?"

What's Next?

We have reproducible pipelines, flexible configuration, efficient data handling (Chapter 5), and structured metrics tracking (Chapter 6). But everything is running on your laptop. What about production?

Next chapter: We'll show how the same pipeline runs anywhere - your laptop, containers, or Kubernetes - without code changes.


Next: Running Anywhere - Same code, different environments