Podman Backend¶
Overview¶
The Container Backend with Podman runs distributed TrainJobs in isolated containers using Podman, a daemonless container engine. The backend combines Podman-specific advantages over Docker with features common to the container backends:
Daemonless Architecture: No background daemon required, reducing attack surface
Rootless Containers: Run containers without root privileges for enhanced security
Full Container Isolation: Each training process runs in its own container with isolated filesystem, network, and resources
Multi-Node Support: Run distributed training across multiple containers with automatic DNS-enabled networking
Docker Compatibility: Compatible with Docker images and Docker CLI syntax
systemd Integration: The Podman API socket can be managed as a systemd user unit (podman.socket)
The Podman backend uses the same adapter pattern as Docker, providing a unified interface for container operations.
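The adapter idea can be illustrated with a small sketch. Everything here is hypothetical: ContainerAdapter, DockerAdapter, PodmanAdapter, and make_adapter are names invented for illustration and are not the SDK's actual classes. The sketch only shows how one interface can hide which container CLI is used.

```python
from abc import ABC, abstractmethod


class ContainerAdapter(ABC):
    """Hypothetical unified interface; not the SDK's real class."""

    @abstractmethod
    def run_argv(self, image: str, name: str) -> list:
        """Return the CLI argument list that would start a detached container."""


class DockerAdapter(ContainerAdapter):
    def run_argv(self, image: str, name: str) -> list:
        return ["docker", "run", "--detach", "--name", name, image]


class PodmanAdapter(ContainerAdapter):
    def run_argv(self, image: str, name: str) -> list:
        # Podman accepts Docker-style CLI syntax, so only the binary differs.
        return ["podman", "run", "--detach", "--name", name, image]


def make_adapter(runtime: str) -> ContainerAdapter:
    """Pick an adapter by runtime name; callers never branch on the runtime again."""
    adapters = {"docker": DockerAdapter, "podman": PodmanAdapter}
    if runtime not in adapters:
        raise ValueError(f"unsupported runtime: {runtime!r}")
    return adapters[runtime]()
```

Code that calls make_adapter("podman").run_argv(...) does not need to know which engine is underneath, which is the property the adapter pattern is meant to provide.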
Prerequisites¶
Required Software & Initial Setup¶
Podman 3.0+: Install Podman for your platform by following the Podman installation instructions
Kubeflow SDK: Install with Podman support:
pip install "kubeflow[podman]"
Verify Installation¶
podman version
podman ps
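The same check can be scripted, for instance before creating a TrainerClient. The following is a minimal sketch, assuming that podman --version prints a line such as "podman version 4.9.3". The helper names parse_podman_version and podman_is_recent_enough are hypothetical and not part of the Kubeflow SDK.

```python
import re
import shutil
import subprocess

MIN_VERSION = (3, 0)  # minimum version listed in the prerequisites


def parse_podman_version(output: str) -> tuple:
    """Extract the numeric version from text such as 'podman version 4.9.3'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def podman_is_recent_enough() -> bool:
    """Return True if a podman binary on PATH reports at least MIN_VERSION."""
    if shutil.which("podman") is None:
        return False
    result = subprocess.run(
        ["podman", "--version"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return False
    try:
        return parse_podman_version(result.stdout)[:2] >= MIN_VERSION
    except ValueError:
        return False
```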
Custom Socket Location (Optional)¶
By default, Podman uses different socket locations than Docker. You can specify a custom socket:
# Start Podman with custom socket (macOS/Linux)
podman system service --time=0 unix:///tmp/podman.sock
# Or use systemd (Linux)
systemctl --user enable --now podman.socket
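When choosing a value for container_host on Linux, it can help to list the socket locations Podman conventionally uses. The sketch below assumes the usual defaults: a rootless socket under the user's runtime directory (XDG_RUNTIME_DIR, typically /run/user/<uid>) and a rootful socket at /run/podman/podman.sock. Both helper functions are hypothetical and not part of the SDK.

```python
def default_podman_socket_urls(uid, xdg_runtime_dir=None):
    """Return candidate socket URLs for a user, most specific first (Linux defaults)."""
    rootful = "unix:///run/podman/podman.sock"
    if uid == 0:
        return [rootful]
    runtime_dir = xdg_runtime_dir or f"/run/user/{uid}"
    rootless = f"unix://{runtime_dir}/podman/podman.sock"
    return [rootless, rootful]


def socket_path_from_url(url):
    """Strip the unix:// scheme so the path can be checked on disk."""
    prefix = "unix://"
    if not url.startswith(prefix):
        raise ValueError(f"not a unix socket URL: {url!r}")
    return url[len(prefix):]
```

A caller could pass os.getuid() and os.environ.get("XDG_RUNTIME_DIR") to the first function and test each resulting path with os.path.exists before handing the URL to ContainerBackendConfig.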
Basic Example¶
Here’s a simple example using the Podman Container Backend:
from kubeflow.trainer import CustomTrainer, TrainerClient, ContainerBackendConfig


def train_model():
    """Simple training function."""
    import torch
    import os

    rank = int(os.environ.get('RANK', '0'))
    world_size = int(os.environ.get('WORLD_SIZE', '1'))
    print(f"Training on rank {rank}/{world_size}")

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(5):
        loss = torch.nn.functional.mse_loss(
            model(torch.randn(32, 10)),
            torch.randn(32, 1)
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"[Rank {rank}] Epoch {epoch + 1}/5, Loss: {loss.item():.4f}")

    print(f"[Rank {rank}] Training completed!")


backend_config = ContainerBackendConfig(
    container_runtime="podman",
    pull_policy="IfNotPresent",
    auto_remove=True
)
client = TrainerClient(backend_config=backend_config)

trainer = CustomTrainer(
    func=train_model,
    num_nodes=2
)

job_name = client.train(trainer=trainer)
print(f"TrainJob started: {job_name}")

job = client.wait_for_job_status(job_name)
print(f"Job completed with status: {job.status}")
Configuration Options¶
ContainerBackendConfig for Podman¶
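| Parameter | Type | Default | Description |
|---|---|---|---|
| container_runtime | | | Force a specific runtime, for example "podman". |
| pull_policy | | | Image pull policy, for example "IfNotPresent" or "Always". |
| auto_remove | | | Automatically remove containers and networks after job completion or deletion. Set to False to keep them for debugging. |
| container_host | | | Override the Podman socket URL (e.g., unix:///tmp/podman.sock). |
| | | GitHub sources | Configuration for training runtime sources. See the Working with Runtimes section in the overview. |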
Configuration Examples¶
Basic Podman Configuration¶
backend_config = ContainerBackendConfig(
    container_runtime="podman",
)
Custom Socket Location¶
# macOS with Podman machine
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    container_host="unix:///tmp/podman.sock"
)

# Linux rootless (user-specific socket)
import os

uid = os.getuid()
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    container_host=f"unix:///run/user/{uid}/podman/podman.sock"
)
Always Pull Latest Image¶
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    pull_policy="Always"
)
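The two pull policies used on this page can be read as a simple decision rule. The sketch below is illustrative only: should_pull is a hypothetical helper, not an SDK function, and the backend may accept further policy values that are not shown on this page.

```python
def should_pull(pull_policy: str, image_present: bool) -> bool:
    """Decide whether an image pull is needed before starting a container."""
    if pull_policy == "Always":
        # Always contact the registry, even if a local copy exists.
        return True
    if pull_policy == "IfNotPresent":
        # Reuse the local image when there is one.
        return not image_present
    raise ValueError(f"policy not covered by this sketch: {pull_policy!r}")
```

With "IfNotPresent", repeated local runs avoid registry traffic. "Always" trades that speed for picking up a re-pushed tag.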
Keep Containers for Debugging¶
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    auto_remove=False
)
Architecture¶
The Container Backend with Podman uses a local orchestration layer to manage TrainJobs within Podman containers. Because jobs run in containers locally, the environment on your machine stays close to that of production Kubernetes clusters.
Multi-Node Distributed Training¶
The Podman backend automatically sets up networking and environment variables for distributed training:
from kubeflow.trainer import CustomTrainer, TrainerClient, ContainerBackendConfig


def distributed_train():
    """PyTorch distributed training example."""
    import os
    import torch
    import torch.distributed as dist

    # RANK and WORLD_SIZE are set by the backend for each container
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    print(f"Initializing process group: rank={rank}, world_size={world_size}")

    dist.init_process_group(
        backend='gloo',
        rank=rank,
        world_size=world_size
    )

    model = torch.nn.Linear(10, 1)
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)

    for epoch in range(5):
        # Training step omitted; run forward and backward passes on ddp_model here
        print(f"[Rank {rank}] Training epoch {epoch + 1}")

    dist.destroy_process_group()
    print(f"[Rank {rank}] Training complete")


backend_config = ContainerBackendConfig(
    container_runtime="podman",
)
client = TrainerClient(backend_config=backend_config)

trainer = CustomTrainer(
    func=distributed_train,
    num_nodes=4
)

job_name = client.train(trainer=trainer)
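Because init_process_group in the example relies on environment variables, a training function can fail early with a clear message if any are missing. The sketch below assumes PyTorch's environment-variable initialization, which reads RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. This page only states that the backend sets the first three, so treating MASTER_PORT as backend-provided is an assumption. Both helpers are hypothetical.

```python
REQUIRED_VARS = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_dist_env(environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


def validate_rank(environ):
    """Parse RANK and WORLD_SIZE and check that 0 <= rank < world_size."""
    rank = int(environ["RANK"])
    world_size = int(environ["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return rank, world_size
```

Inside a training function, calling missing_dist_env(os.environ) before dist.init_process_group turns a hang or an opaque error into an explicit list of what is missing.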
Podman-Specific Networking¶
Podman creates networks with DNS enabled by default, allowing containers to resolve each other by hostname. For MASTER_ADDR, however, the backend uses the IP address of the rank-0 container rather than its hostname, so that the first connection does not depend on DNS being ready.
IP Address Resolution¶
The Podman backend automatically retrieves the IP address of the rank-0 container using podman inspect:
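# The index function is needed because network names such as <job-name>-net contain hyphens
podman inspect --format '{{(index .NetworkSettings.Networks "<network-name>").IPAddress}}' <container-name>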
This IP address is then set as MASTER_ADDR for all nodes in the job, ensuring that:
Communication works even if DNS resolution has timing issues
The master address is available immediately when containers start
Distributed training frameworks (PyTorch, TensorFlow) can connect reliably
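The same lookup can be done on the JSON that podman inspect prints without a format string. The sketch below assumes the output is a JSON array whose first element has a NetworkSettings.Networks mapping keyed by network name, as in the template above. The function ip_from_inspect is a hypothetical helper, not the backend's implementation.

```python
import json


def ip_from_inspect(inspect_json: str, network_name: str) -> str:
    """Return a container's IP address on the given network from inspect output."""
    containers = json.loads(inspect_json)
    if not containers:
        raise ValueError("inspect output contains no containers")
    networks = (containers[0].get("NetworkSettings") or {}).get("Networks") or {}
    ip_address = (networks.get(network_name) or {}).get("IPAddress", "")
    if not ip_address:
        # An empty address usually means the container is not running yet.
        raise LookupError(f"no IP address on network {network_name!r}")
    return ip_address
```

Raising on an empty address matters here: a container that has been created but not started can report an empty IPAddress, and passing that on as MASTER_ADDR would fail later in a less obvious way.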
DNS Resolution¶
While DNS is enabled and containers can resolve each other by hostname, the backend uses IP addresses for reliability:
def test_networking():
    import os
    import socket
    import subprocess

    rank = int(os.environ['RANK'])
    master_addr = os.environ['MASTER_ADDR']
    hostname = socket.gethostname()
    print(f"Rank {rank}: My hostname is {hostname}")
    print(f"Rank {rank}: Master address (IP): {master_addr}")

    if rank == 0:
        # Resolve this container's own hostname to its IP address
        print(f"Rank {rank}: My IP address: {socket.gethostbyname(hostname)}")
    else:
        # Requires the ping utility to be present in the container image
        result = subprocess.run(['ping', '-c', '1', master_addr], capture_output=True)
        print(f"Ping to master IP: {result.returncode == 0}")
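Not every image ships ping, so a TCP connection attempt can be a more portable reachability check. The following is a minimal sketch using only the standard library. The function wait_for_tcp is a hypothetical helper, and the port to probe (for example the value of MASTER_PORT, if your job sets it) is an assumption about your setup.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Retry a TCP connection until it succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on host:port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A non-zero rank could call wait_for_tcp(os.environ['MASTER_ADDR'], port) before initializing the process group to distinguish "master not up yet" from a networking problem.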
Network Architecture¶
For a job with num_nodes=3, the Podman backend:
Creates a dedicated network, <job-name>-net, with DNS enabled
Launches the rank-0 container and waits for it to be running
Inspects the rank-0 container to get its IP address
Sets MASTER_ADDR to this IP for all containers
Launches the remaining containers (rank 1, 2, …) connected to the same network
This approach combines the benefits of DNS (hostname resolution) with the reliability of IP addresses for critical communication paths.
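The resulting per-container layout can be sketched as plain data. This is not the backend's code: build_node_specs is a hypothetical helper, the default port 29500 is an assumption (it is PyTorch's conventional default), and the sketch covers only the naming and environment layout, not the launch ordering.

```python
def build_node_specs(job_name: str, num_nodes: int, master_ip: str, master_port: int = 29500):
    """Describe each node container: its name, network, and distributed env vars."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return [
        {
            "name": f"{job_name}-node-{rank}",
            "network": f"{job_name}-net",
            "env": {
                "RANK": str(rank),
                "WORLD_SIZE": str(num_nodes),
                # Every node points at the rank-0 container's IP address.
                "MASTER_ADDR": master_ip,
                "MASTER_PORT": str(master_port),
            },
        }
        for rank in range(num_nodes)
    ]
```

For num_nodes=3 this yields three entries named <job-name>-node-0 through <job-name>-node-2, all sharing one network and one MASTER_ADDR and differing only in RANK.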
Job Management¶
For common job management operations (listing jobs, viewing logs, deleting jobs), see the Job Management section in the overview.
Inspecting Containers with Podman CLI¶
When auto_remove=False, you can inspect containers:
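# List all containers that belong to a job
podman ps -a --filter "label=kubeflow.org/job-name=<job-name>"
# Inspect a specific node container
podman inspect <job-name>-node-0
# View its logs
podman logs <job-name>-node-0
# Open a shell inside it (only works while the container is running)
podman exec -it <job-name>-node-0 /bin/bash
# Target a custom socket
podman --url unix:///tmp/podman.sock logs <job-name>-node-0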
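If you run these commands often, a small wrapper can assemble them for a given job. The sketch below only builds argument lists that mirror the commands above. The function podman_debug_commands is hypothetical, and the container naming (<job-name>-node-<index>) is taken from the examples on this page.

```python
def podman_debug_commands(job_name: str, node_index: int = 0, socket_url: str = None):
    """Build argv lists for listing, inspecting, and reading logs of a job's containers."""
    base = ["podman"]
    if socket_url:
        # Same effect as the --url example above.
        base += ["--url", socket_url]
    container = f"{job_name}-node-{node_index}"
    return {
        "list": base + ["ps", "-a", "--filter", f"label=kubeflow.org/job-name={job_name}"],
        "inspect": base + ["inspect", container],
        "logs": base + ["logs", container],
    }
```

Each value can be passed directly to subprocess.run. Using argument lists rather than shell strings avoids quoting problems in job names.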
Working with Runtimes¶
For information about using runtimes and custom runtime sources, see the Working with Runtimes section in the overview.
Troubleshooting¶
Podman Service Not Running (macOS)¶
Error: ConnectionRefusedError: [Errno 61] Connection refused
Solution:
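# Check whether a Podman machine exists and is running
podman machine list
# Start the machine if it is stopped
podman machine start
# Or restart it if it is running but unresponsive
podman machine stop
podman machine start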
Socket Not Found (Linux)¶
Error: FileNotFoundError: [Errno 2] No such file or directory: '/run/user/1000/podman/podman.sock'
Solution:
systemctl --user start podman.socket
systemctl --user enable podman.socket
ls -la /run/user/$(id -u)/podman/podman.sock
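The ls check can also be done from Python, which makes it possible to tell a missing socket apart from a path that exists but is not a socket. This is a minimal sketch using the standard library. The function describe_socket_path is a hypothetical helper.

```python
import os
import stat


def describe_socket_path(path: str) -> str:
    """Classify a path as 'socket', 'not a socket', 'missing', or 'permission denied'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "permission denied"
    return "socket" if stat.S_ISSOCK(mode) else "not a socket"
```

A result of "missing" matches the FileNotFoundError above and points at podman.socket not being started. "not a socket" suggests a stale regular file or directory at that location.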
Permission Denied (Rootless)¶
Error: Error: container_linux.go:380: starting container process caused...
Solution:
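# Raise the user namespace limit on the running system
sudo sysctl -w user.max_user_namespaces=15000
# Persist the limit across reboots
echo "user.max_user_namespaces=15000" | sudo tee -a /etc/sysctl.conf
# Allocate subordinate UID and GID ranges to your user
sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $USER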
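Before changing anything, you can check whether your user already has subordinate ID ranges. The sketch below assumes the standard /etc/subuid and /etc/subgid line format, name:start:count. The function subid_ranges is a hypothetical helper, and it matches on the user name only, although these files may also key entries by numeric ID.

```python
def subid_ranges(file_text: str, user: str):
    """Return (start, count) pairs allocated to a user in subuid/subgid text."""
    ranges = []
    for line in file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(":")
        if len(parts) != 3 or parts[0] != user:
            continue
        ranges.append((int(parts[1]), int(parts[2])))
    return ranges
```

The range 100000-165535 from the usermod command above corresponds to the entry <user>:100000:65536. An empty result for your user name is a common cause of rootless start failures.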
DNS Resolution Issues¶
Error: Containers cannot resolve each other’s hostnames
Solution:
# Check that DNS is enabled on the job network (expect "dns_enabled": true)
podman network inspect <job-name>-net | grep dns_enabled
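Parsing the JSON is less brittle than grep when the check is scripted. The sketch below assumes that podman network inspect prints a JSON array whose first element carries a boolean dns_enabled field, which is the field the grep above looks for. Older Podman versions with a different network stack may use a different layout. The function network_dns_enabled is a hypothetical helper.

```python
import json


def network_dns_enabled(inspect_json: str) -> bool:
    """Return the dns_enabled flag from network inspect output (False if absent)."""
    networks = json.loads(inspect_json)
    if not networks:
        raise ValueError("inspect output contains no networks")
    return bool(networks[0].get("dns_enabled", False))
```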
Containers Not Removed¶
Problem: Containers remain after job completion
Solution:
# Enable automatic cleanup for future jobs
backend_config = ContainerBackendConfig(
    container_runtime="podman",
    auto_remove=True
)

# Delete an existing job explicitly
client.delete_job(job_name)
Or:
podman rm -f $(podman ps -aq --filter "label=kubeflow.org/job-name=<job-name>")
Next Steps¶
Try the MNIST example notebook for a complete end-to-end example
Learn about the Container Backend with Docker for Docker-specific features
Learn about the Local Process Backend for non-containerized local execution