Local Process Backend¶
How to run TrainJobs with native Python processes and virtual environments
Overview¶
The Local Process Backend allows you to execute TrainJobs directly on your local machine using native Python processes and virtual environments. This backend is ideal for:
Quick prototyping and development
Testing training scripts without container overhead
Environments where Docker/Podman is not available
Debugging training code locally
The Local Process Backend creates isolated Python virtual environments for each TrainJob, automatically installs required dependencies, and manages the lifecycle of training processes in background threads with real-time log streaming.
Note
Only single-node training is currently supported.
Prerequisites¶
Python 3.9 or later
pip (Python package installer)
Kubeflow SDK: Install the base package:
pip install kubeflow
Sufficient disk space for virtual environments
Required Python packages for your training framework (e.g., PyTorch, TensorFlow)
Basic Example¶
Here’s a simple example using the Local Process Backend:
from kubeflow.trainer import CustomTrainer, TrainerClient, LocalProcessBackendConfig
def train_model():
import torch
import time
print("Starting training...")
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(5):
loss = torch.nn.functional.mse_loss(
model(torch.randn(32, 10)),
torch.randn(32, 1)
)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"Epoch {epoch + 1}/5, Loss: {loss.item():.4f}")
print("Training completed!")
backend_config = LocalProcessBackendConfig(
cleanup_venv=True
)
client = TrainerClient(backend_config=backend_config)
trainer = CustomTrainer(
func=train_model,
num_nodes=1,
)
job_name = client.train(trainer=trainer)
print(f"TrainJob started: {job_name}")
job = client.wait_for_job_status(
job_name,
)
print(f"Job completed with status: {job.status}")
Configuration Options¶
LocalProcessBackendConfig¶
The LocalProcessBackendConfig class provides configuration options for the Local Process Backend:
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Whether to automatically remove virtual environments after job completion. Set to |
Example:
from kubeflow.trainer import LocalProcessBackendConfig
backend_config = LocalProcessBackendConfig(
cleanup_venv=False
)
Working with Runtimes¶
The Local Process Backend has a fixed set of built-in runtimes (unlike Container Backends which load runtimes from external sources).
Supported Runtimes¶
Runtime |
Framework |
Description |
Packages |
|---|---|---|---|
|
PyTorch |
PyTorch training with torchrun |
|
Job Management¶
For common job management operations (listing jobs, viewing logs, deleting jobs), see the Job Management section in the overview.
Checking Job Status¶
The Local Process Backend also supports checking detailed job status:
job = client.get_job(job_name)
print(f"Status: {job.status}")
print(f"Created: {job.created}")
print(f"Completed: {job.completed}")
Advanced Usage¶
Custom Training with Dependencies¶
You can specify additional packages to install in the training environment:
from kubeflow.trainer import CustomTrainer, TrainerClient, LocalProcessBackendConfig
def train_with_dependencies():
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
print("Training with scikit-learn...")
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X, y)
print(f"Model accuracy: {clf.score(X, y):.2f}")
backend_config = LocalProcessBackendConfig()
client = TrainerClient(backend_config=backend_config)
trainer = CustomTrainer(
func=train_with_dependencies,
packages_to_install=["numpy", "pandas", "scikit-learn"],
pip_index_urls=["https://pypi.org/simple"]
)
job_name = client.train(trainer=trainer)
Environment Variables¶
Pass custom environment variables to your TrainJob:
trainer = CustomTrainer(
func=train_model,
env={
"CUDA_VISIBLE_DEVICES": "0",
"OMP_NUM_THREADS": "4",
"CUSTOM_VAR": "value"
}
)
Debugging Failed Jobs¶
When cleanup_venv=False, you can inspect the virtual environment after job failure:
backend_config = LocalProcessBackendConfig(cleanup_venv=False)
client = TrainerClient(backend_config=backend_config)
trainer = CustomTrainer(func=train_model)
job_name = client.train(trainer=trainer)
try:
job = client.wait_for_job_status(
job_name,
status={constants.TRAINJOB_COMPLETE, constants.TRAINJOB_FAILED},
timeout=300
)
if job.status == constants.TRAINJOB_FAILED:
print("Job failed. Logs:")
for log_line in client.get_job_logs(job_name):
print(log_line, end='')
print(f"\nVirtual environment preserved for job: {job_name}")
except Exception as e:
print(f"Error: {e}")
finally:
client.delete_job(job_name)
Architecture¶
The Local Process Backend operates by orchestrating native OS processes. It bypasses container runtimes like Docker, instead managing the lifecycle of your training script through isolated Python Virtual Environments (venvs).
How It Works¶
Understanding the internal workflow helps with debugging and optimization:
1. Job Creation¶
A unique job name is generated (e.g.,
a1b2c3d4e5f)A temporary directory is created:
/tmp/a1b2c3d4e5f_xyz/A Python virtual environment is set up with isolation
2. Environment Setup¶
python -m venv --without-pip /tmp/a1b2c3d4e5f_xyz/
source /tmp/a1b2c3d4e5f_xyz/bin/activate
python -m ensurepip --upgrade --default-pip
3. Package Dependency Resolution¶
The backend implements intelligent package dependency management:
Package Sources:
When you use a runtime (e.g., torch-distributed), the backend installs packages from two sources:
Runtime packages: Built-in packages defined by the runtime itself
For Local Process Backend, runtimes are hardcoded in the SDK (see Supported Runtimes)
Example: The
torch-distributedruntime automatically includestorchNote: Unlike Container Backends which load runtimes from external sources (GitHub/YAML files), Local Process Backend uses a fixed set of runtimes
Trainer packages: Additional packages you specify in your
CustomTrainerSpecified via the
packages_to_installparameterExample:
packages_to_install=["pandas", "scikit-learn"]
Dependency Resolution Rules:
When packages from both sources are combined:
Trainer Override: If you specify a package in
packages_to_installthat also exists in the runtime, your version takes precedenceCase-Insensitive Matching: Package names are normalized (PEP 503)
Duplicate Detection: Prevents duplicate packages in trainer dependencies
Order Preservation: Runtime packages come first, then trainer packages (except where overridden)
Example:
# Runtime packages: ["torch==1.9.0", "numpy"]
# Trainer packages: ["torch==2.0.0", "scipy"]
# Result: ["numpy", "torch==2.0.0", "scipy"]
# torch==2.0.0 from trainer overrides torch==1.9.0 from runtime
4. Training Code Preparation¶
The training function source code is extracted and written to a Python file:
# Written to: /tmp/a1b2c3d4e5f_xyz/train_a1b2c3d4e5f.py
def train_model():
import torch
print("Starting PyTorch training...")
# ... your code ...
train_model() # Auto-generated function call
5. Execution¶
For PyTorch framework:
/tmp/a1b2c3d4e5f_xyz/bin/torchrun train_a1b2c3d4e5f.py
For other frameworks:
/tmp/a1b2c3d4e5f_xyz/bin/python train_a1b2c3d4e5f.py
6. Cleanup (Optional)¶
When cleanup_venv=True:
rm -rf /tmp/a1b2c3d4e5f_xyz/
Best Practices¶
Perfect For¶
Local Development: Developing and testing training code
Experimentation: Quick iteration on training algorithms
CI/CD Pipelines: Automated testing of TrainJobs
Educational Use: Learning distributed training concepts
Prototyping: Validating ideas before cluster deployment
Not Suitable For¶
Production Training: Large-scale distributed TrainJobs
Multi-Node Training: Training across multiple machines
Resource Management: Fine-grained GPU/memory allocation
Long-Running Jobs: TrainJobs that run for days/weeks
High Availability: Mission-critical training pipelines
General Best Practices¶
Use for Development: The Local Process Backend is best suited for development and testing. For production workloads, consider using the Container Backend or Kubernetes Backend.
Clean Up Resources: Set
cleanup_venv=True(default) to avoid filling disk space with virtual environments.Test Before Containerizing: Use the Local Process Backend to quickly validate your training code before moving to containerized environments.
Monitor Resource Usage: Since jobs run directly on your machine, monitor CPU, memory, and disk usage to avoid resource exhaustion.
Specify Dependencies Explicitly: Use
packages_to_installto ensure all required packages are installed in the isolated environment.
Troubleshooting¶
Common Issues¶
Error |
Cause |
Solution |
|---|---|---|
|
Using BuiltinTrainer |
Use CustomTrainer instead |
|
Invalid runtime name |
Use |
|
Missing Python |
Install Python or ensure it’s in PATH |
|
Job doesn’t exist |
Check job name spelling |
Job Status Flow¶
Created → Running → Complete
↓ ↓
Failed ← Failed
Virtual Environment Creation Fails¶
Problem: Error creating virtual environment.
Solution: Ensure you have sufficient disk space and that Python’s venv module is installed:
python -m venv --help
Package Installation Errors¶
Problem: Required packages fail to install in the virtual environment.
Solution:
Check your internet connection
Verify that package names are correct
Use
packages_to_installto explicitly specify packagesEnsure pip is up to date:
pip install --upgrade pip
Jobs Not Cleaning Up¶
Problem: Virtual environments remain after job completion.
Solution: Verify that cleanup_venv=True in your config, or manually delete jobs:
client.delete_job(job_name)
Permission Errors¶
Problem: Permission denied when creating virtual environments.
Solution: Ensure you have write permissions to the temp directory. On Unix-like systems, check:
ls -ld /tmp
Debug Mode¶
Enable debug logging for detailed execution information:
import logging
logging.basicConfig(level=logging.DEBUG)
backend_config = LocalProcessBackendConfig(cleanup_venv=False)
client = TrainerClient(backend_config=backend_config)
Limitations¶
Single Machine Only: Only runs on local machine, no distributed training across multiple nodes. The
num_nodesparameter is ignored.CustomTrainer Only: Does not support BuiltinTrainer configurations.
No GPU Scheduling: Cannot manage GPU allocation across multiple jobs.
Process Isolation: Jobs are isolated by virtual environment, not containers.
Limited Scaling: Not suitable for large-scale production training.
System Dependencies: Training code must be compatible with your local Python environment and operating system.
Switching Between Backends¶
For information about switching between Local Process, Container (Docker/Podman), and Kubernetes backends, see the Switching Between Backends section in the overview.
Next Steps¶
Try the MNIST example notebook for a complete end-to-end example
Learn about the Container Backend with Docker for containerized training
Learn about the Container Backend with Podman for rootless containerized training