Overview

Run TrainJobs locally with different backends and manage them with a common set of operations

The Kubeflow SDK allows you to run TrainJobs on your local machine without deploying to a Kubernetes cluster. This is ideal for:

  • Environments where Kubernetes is not available

  • Learning and educational purposes

  • Quick prototyping and experimentation

  • Development and testing of training scripts

Available Backends

Local Process Backend

Run TrainJobs directly using native Python processes and virtual environments. This is the fastest option for simple, single-node training.

Best for:

  • Environments where Docker/Podman is not available

  • Testing training scripts without container overhead

  • Quick prototyping and development

Learn more about Local Process Backend

Container Backend with Docker

Run distributed TrainJobs in isolated Docker containers with full multi-node support.

Best for:

  • Reproducible containerized environments

  • Distributed training with multiple containers

  • General use cases, especially on macOS/Windows

Learn more about Docker Backend

Container Backend with Podman

Run distributed TrainJobs using Podman, a daemonless container engine with enhanced security.

Best for:

  • Linux servers with systemd integration

  • Rootless containerized training

  • Security-focused environments

Learn more about Podman Backend

Backend Comparison

| Feature       | Local Process          | Docker                      | Podman                   |
|---------------|------------------------|-----------------------------|--------------------------|
| Setup         | No additional software | Docker Desktop/Engine       | Podman installation      |
| Isolation     | Virtual environments   | Full container isolation    | Full container isolation |
| Multi-node    | Not supported          | Supported                   | Supported                |
| Root Required | No                     | Docker group or root        | Rootless supported       |
| Startup Time  | Fast (seconds)         | Medium (container start)    | Medium (container start) |
| Best For      | Quick prototyping      | General use, wide ecosystem | Security, Linux servers  |
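
The comparison can be read as a small decision rule. The sketch below is one hypothetical way to encode it; the function name, its parameters, and the returned labels are illustrative and are not part of the Kubeflow SDK.

```python
def recommend_backend(num_nodes=1, need_isolation=False,
                      has_docker=False, has_podman=False, prefer_rootless=False):
    """Suggest a backend label from the comparison above (illustrative only)."""
    needs_container = num_nodes > 1 or need_isolation
    if not needs_container:
        # Single-node work without isolation: fastest startup, no extra software.
        return "local-process"
    if prefer_rootless and has_podman:
        return "podman"
    if has_docker:
        return "docker"
    if has_podman:
        return "podman"
    # Local Process does not support multi-node, so there is nothing to fall back to.
    raise ValueError("multi-node or isolated training needs Docker or Podman")


print(recommend_backend())                                # single-node prototyping
print(recommend_backend(num_nodes=4, has_docker=True))    # distributed training
```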

Switching Between Backends

All backends use the same TrainerClient interface, making it easy to progress from local development to production deployment. The same training code works across all backends - only the backend configuration changes.
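
The examples in this section pass a function named train_model to CustomTrainer without defining it. A minimal sketch of such a function follows. It is a toy gradient-descent loop using only the standard library, not a real training script. It assumes the function should be self-contained, with imports inside the body, because the function is handed off to another process or container; the return value is only there so the sketch can be checked locally.

```python
def train_model():
    """Fit y = 2x + 1 with plain gradient descent; a stand-in for a real training loop."""
    # Assumption: imports live inside the function so it stays self-contained
    # when it runs in a separate process or container.
    import random

    random.seed(0)
    xs = [i / 10 for i in range(20)]
    ys = [2.0 * x + 1.0 for x in xs]

    w, b, lr = random.random(), random.random(), 0.1
    for _ in range(500):
        # Mean-squared-error gradients for a one-feature linear model.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b

    loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    print(f"final loss={loss:.6f}, w={w:.3f}, b={b:.3f}")
    return w, b, loss
```

Calling train_model() directly is a quick sanity check before handing it to any backend.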

Local Process Backend

Use the Local Process backend for quick local testing:

from kubeflow.trainer import TrainerClient, CustomTrainer, LocalProcessBackendConfig

backend_config = LocalProcessBackendConfig()
client = TrainerClient(backend_config=backend_config)

trainer = CustomTrainer(func=train_model)
job_name = client.train(trainer=trainer)

Container Backend

Use Docker/Podman for multi-node distributed training:

from kubeflow.trainer import TrainerClient, CustomTrainer, ContainerBackendConfig

backend_config = ContainerBackendConfig(
    container_runtime="docker",
)
client = TrainerClient(backend_config=backend_config)

trainer = CustomTrainer(func=train_model, num_nodes=4)
job_name = client.train(trainer=trainer)

Kubernetes Backend

Deploy to a production environment with the Kubernetes backend:

from kubeflow.trainer import TrainerClient, CustomTrainer, KubernetesBackendConfig

backend_config = KubernetesBackendConfig(namespace="kubeflow")
client = TrainerClient(backend_config=backend_config)

trainer = CustomTrainer(func=train_model, num_nodes=4)
job_name = client.train(trainer=trainer)
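
Because only the backend configuration differs between the three examples, the choice can be isolated in one place. The helper below is a sketch under that assumption: make_backend_config and its stage names are hypothetical, while the config classes and their arguments are the ones used in the examples above.

```python
def make_backend_config(stage, container_runtime="docker", namespace="kubeflow"):
    """Return a backend config for 'local', 'container', or 'kubernetes'."""
    stages = ("local", "container", "kubernetes")
    if stage not in stages:
        raise ValueError(f"unknown stage {stage!r}; expected one of {stages}")

    # Imported lazily so the validation above works even without the SDK installed.
    from kubeflow.trainer import (
        ContainerBackendConfig,
        KubernetesBackendConfig,
        LocalProcessBackendConfig,
    )

    if stage == "local":
        return LocalProcessBackendConfig()
    if stage == "container":
        return ContainerBackendConfig(container_runtime=container_runtime)
    return KubernetesBackendConfig(namespace=namespace)
```

With the SDK installed, TrainerClient(backend_config=make_backend_config("local")) would replace the per-backend setup lines, and the rest of the script stays the same.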

Job Management

All backends expose the same job management operations through the TrainerClient interface.

Listing Jobs

jobs = client.list_jobs()

for job in jobs:
    print(f"Job: {job.name}, Status: {job.status}")
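
A listing like this is often reduced to a per-status count. The sketch below relies only on the name and status attributes shown above; the stand-in objects and the status strings are illustrative, and with the SDK you would pass the result of client.list_jobs() instead.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_jobs(jobs):
    """Count jobs per status, using the .status attribute shown above."""
    return Counter(str(job.status) for job in jobs)


# Stand-in objects for illustration; the status values here are placeholders.
example_jobs = [
    SimpleNamespace(name="job-a", status="Complete"),
    SimpleNamespace(name="job-b", status="Running"),
    SimpleNamespace(name="job-c", status="Complete"),
]
summary = summarize_jobs(example_jobs)
for status, count in summary.items():
    print(f"{status}: {count}")
```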

Viewing Logs

# Stream logs from the first node as they are produced
for log_line in client.get_job_logs(job_name, node_index=0, follow=True):
    print(log_line, end='')

# Print the logs of every node in turn
for node_index in range(trainer.num_nodes):
    print(f"\n=== Logs from node {node_index} ===")
    for log_line in client.get_job_logs(job_name, node_index=node_index):
        print(log_line, end='')
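
When logs need to be searched or saved rather than printed, the per-node loop can be wrapped in a helper. This is a sketch that assumes get_job_logs yields text lines and accepts node_index as shown above; FakeClient is a hypothetical stand-in so the example runs without a backend.

```python
def collect_logs(client, job_name, num_nodes):
    """Return {node_index: log text} using the get_job_logs call shown above."""
    logs = {}
    for node_index in range(num_nodes):
        lines = client.get_job_logs(job_name, node_index=node_index)
        logs[node_index] = "".join(lines)
    return logs


class FakeClient:
    """Hypothetical stand-in that yields two log lines per node."""

    def get_job_logs(self, job_name, node_index=0, follow=False):
        yield f"[{job_name}] node {node_index}: step 1\n"
        yield f"[{job_name}] node {node_index}: step 2\n"


logs = collect_logs(FakeClient(), "demo-job", num_nodes=2)
for node_index, text in logs.items():
    print(f"node {node_index}: {len(text.splitlines())} lines")
```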

Waiting for Job Completion

from kubeflow.trainer.constants import constants

job = client.wait_for_job_status(
    job_name,
    status={constants.TRAINJOB_COMPLETE},
    timeout=600
)

print(f"Job completed with status: {job.status}")
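
A wait with a timeout needs a plan for when the timeout expires. The sketch below assumes the client raises the built-in TimeoutError in that case, which should be checked against the SDK version in use. FakeClient and the "Complete" string are placeholders; real code would pass constants.TRAINJOB_COMPLETE as in the example above.

```python
from types import SimpleNamespace


def wait_or_none(client, job_name, statuses, timeout=600):
    """Return the job once it reaches one of `statuses`, or None on timeout."""
    try:
        return client.wait_for_job_status(job_name, status=statuses, timeout=timeout)
    except TimeoutError:
        # Assumption: an expired timeout is signalled with TimeoutError.
        return None


class FakeClient:
    """Hypothetical stand-in that either finishes or times out."""

    def __init__(self, finishes):
        self.finishes = finishes

    def wait_for_job_status(self, name, status, timeout):
        if not self.finishes:
            raise TimeoutError(f"{name} did not reach {status} within {timeout}s")
        return SimpleNamespace(name=name, status="Complete")


done = wait_or_none(FakeClient(finishes=True), "demo-job", {"Complete"}, timeout=5)
stuck = wait_or_none(FakeClient(finishes=False), "demo-job", {"Complete"}, timeout=5)
print(done.status if done else "timed out", "|", stuck.status if stuck else "timed out")
```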

Deleting Jobs

client.delete_job(job_name)

This removes:

  • Job metadata

  • Networks created for the job (Container backends)

  • All containers/processes for the job
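
Since deletion releases containers, processes, and networks, it is worth making sure it runs even when the surrounding code fails. One way is a context manager around train and delete_job. The sketch below uses only those two client calls from this page; managed_job and FakeClient are hypothetical names.

```python
from contextlib import contextmanager


@contextmanager
def managed_job(client, trainer):
    """Submit a job, yield its name, and always delete it afterwards."""
    job_name = client.train(trainer=trainer)
    try:
        yield job_name
    finally:
        # Runs on success and on error, so job resources are not left behind.
        client.delete_job(job_name)


class FakeClient:
    """Hypothetical stand-in that records calls instead of running anything."""

    def __init__(self):
        self.deleted = []

    def train(self, trainer):
        return "demo-job"

    def delete_job(self, name):
        self.deleted.append(name)


fake = FakeClient()
with managed_job(fake, trainer=None) as name:
    print(f"working with {name}")
print(f"deleted: {fake.deleted}")
```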

Working with Runtimes

Runtimes provide pre-configured training environments with specific frameworks and settings.

Listing Available Runtimes

runtimes = client.list_runtimes()
for runtime in runtimes:
    print(f"Runtime: {runtime.name}")

Using a Specific Runtime

job_name = client.train(
    trainer=trainer,
    runtime="torch-distributed"
)
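
Checking that a runtime exists before submitting a job gives a clearer error than a failed submission. The sketch below combines the two calls above; pick_runtime and FakeClient are hypothetical, and only the name attribute of each runtime is assumed.

```python
from types import SimpleNamespace


def pick_runtime(client, preferred):
    """Return `preferred` if the backend lists it, else raise with the available names."""
    available = [runtime.name for runtime in client.list_runtimes()]
    if preferred not in available:
        raise LookupError(f"runtime {preferred!r} not found; available: {available}")
    return preferred


class FakeClient:
    """Hypothetical stand-in exposing a single runtime."""

    def list_runtimes(self):
        return [SimpleNamespace(name="torch-distributed")]


chosen = pick_runtime(FakeClient(), "torch-distributed")
print(f"using runtime: {chosen}")
```

The returned name can then be passed as the runtime argument of client.train.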

Custom Runtime Sources (Container Backends)

By default, the Container Backends load runtimes from the following sources, in order:

  1. GitHub - github://kubeflow/trainer (official runtimes, cached for 24 hours)

  2. Fallback - Built-in default images (e.g., pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime)

You can customize where runtimes are loaded from using the runtime_source configuration:

from kubeflow.trainer import TrainerClient, ContainerBackendConfig, TrainingRuntimeSource

backend_config = ContainerBackendConfig(
    container_runtime="docker",
    runtime_source=TrainingRuntimeSource(sources=[
        "github://kubeflow/trainer",
        "github://myorg/myrepo/path/to/runtimes",
        "https://example.com/custom-runtime.yaml",
        "file:///absolute/path/to/runtime.yaml",
        "/absolute/path/to/runtime.yaml",
    ])
)

client = TrainerClient(backend_config=backend_config)

Source Priority: Sources are checked in order. If a runtime is not found in any source, the system falls back to the default image for the framework.
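
The lookup rule can be pictured as a first-match search with a default. The sketch below is a simplified model of that rule, not the SDK's implementation: each source is represented as a plain dictionary from runtime name to image, and all names other than the default PyTorch image quoted above are made up for illustration.

```python
def resolve_image(runtime_name, framework, sources, default_images):
    """Return the image from the first source that defines the runtime, else the default."""
    for source in sources:
        if runtime_name in source:
            return source[runtime_name]
    # No source defines the runtime: fall back to the framework's default image.
    return default_images[framework]


# Hypothetical source contents, listed in priority order.
sources = [
    {"torch-distributed": "example.com/official-torch:1"},
    {"torch-custom": "myregistry.com/pytorch-custom:latest",
     "torch-distributed": "example.com/shadowed-torch:1"},
]
default_images = {"torch": "pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime"}

print(resolve_image("torch-distributed", "torch", sources, default_images))
print(resolve_image("torch-custom", "torch", sources, default_images))
print(resolve_image("torch-unknown", "torch", sources, default_images))
```

The first call shows that an earlier source wins when two sources define the same runtime name.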

Runtime YAML Example:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-custom
  labels:
    trainer.kubeflow.org/framework: torch
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: myregistry.com/pytorch-custom:latest

Switching Between Container Backends

The unified Container Backend API makes it easy to switch between Docker and Podman:

# Use Docker explicitly
backend_config = ContainerBackendConfig(
    container_runtime="docker",
)

# Use Podman explicitly
backend_config = ContainerBackendConfig(
    container_runtime="podman",
)

# Let the SDK pick whichever runtime is available
backend_config = ContainerBackendConfig(
    container_runtime=None,  # Auto-detect (tries Docker first, then Podman)
)
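
The auto-detect order can be approximated outside the SDK, for example to log which engine a script is likely to use. The sketch below only checks whether a docker or podman executable is on PATH; the SDK's own detection may differ (it could, for instance, check that the engine is actually reachable), so treat this as an approximation.

```python
import shutil


def detect_container_runtime(which=shutil.which):
    """Return 'docker' or 'podman' in that order of preference, or None if neither is found."""
    for runtime in ("docker", "podman"):
        if which(runtime):
            return runtime
    return None


print(f"detected runtime: {detect_container_runtime()}")
```

The which parameter exists so the lookup can be replaced in tests; by default it is shutil.which.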

Key Points:

  • Moving from the Local Process backend to a Container backend and then to Kubernetes lets you test locally first, validate with containers, then deploy to production

  • Only the backend configuration import and instantiation change

  • Job management operations (list_jobs(), get_job_logs(), delete_job()) work the same across all backends

  • Your training function (func=train_model) doesn’t change

Next Steps

Choose the backend that best fits your needs: