User Guides
Guides for AI practitioners and ML engineers who use Kubeflow Trainer to run distributed training workloads across a range of ML frameworks.
Distributed Training Frameworks
Distributed PyTorch training with FSDP, DDP, and more
PyTorch distributed training on AMD ROCm GPUs
Distributed JAX training with jax.distributed
JAX distributed training on Google Cloud TPUs
Large-scale training with DeepSpeed ZeRO optimization
Distributed XGBoost training on Kubernetes
Megatron-Core with Tensor Parallelism for large transformers
Training on Apple Silicon with MLX framework
HPC workloads with Flux Framework integration
Data and Fine-Tuning
High-performance distributed data caching for training
Pre-built training workflows (TorchTune and more)
Job Lifecycle
Active deadlines and suspend/resume for TrainJobs
Local Development
Run TrainJobs locally before deploying to Kubernetes
Execute training jobs in Docker containers locally
Execute training jobs with Podman (Docker alternative)
Run training jobs as local processes for quick iteration