User Guides

Documentation for AI practitioners and ML engineers using Kubeflow Trainer for distributed training.

The guides in this section cover running distributed training workloads on supported ML frameworks, data caching and fine-tuning, TrainJob lifecycle configuration, and local execution.


Distributed Training Frameworks

PyTorch

Distributed PyTorch training with FSDP, DDP, and more

PyTorch Guide

PyTorch on AMD ROCm

PyTorch distributed training on AMD ROCm GPUs

PyTorch on AMD ROCm Guide

JAX

Distributed JAX training with jax.distributed

JAX Guide

JAX on TPU

JAX distributed training on Google Cloud TPUs

JAX on TPU Guide

DeepSpeed

Large-scale training with DeepSpeed ZeRO optimization

DeepSpeed Guide

XGBoost

Distributed XGBoost training on Kubernetes

XGBoost Guide

Megatron

Megatron-Core with tensor parallelism for large transformers

Megatron Guide

MLX

Training on Apple Silicon with the MLX framework

MLX Guide

Flux

HPC workloads with Flux Framework integration

Flux Guide

Data and Fine-Tuning

Distributed Data Cache

High-performance distributed data caching for training

Distributed Data Cache

Builtin Trainers

Pre-built training workflows (TorchTune and more)

Builtin Trainer Guide

Job Lifecycle

Configure TrainJob Lifecycle

Active deadlines and suspend/resume for TrainJobs

Configure TrainJob Lifecycle

Local Development

Local Execution Overview

Run TrainJobs locally before deploying to Kubernetes

Execute TrainJobs Locally

Docker Backend

Execute training jobs in Docker containers locally

Docker Backend

Podman Backend

Execute training jobs with Podman (Docker alternative)

Podman Backend

Process Backend

Run training jobs as local processes for quick iteration

Local Process Backend