Overview

Kubeflow Trainer is a Kubernetes-native distributed AI platform for scalable large language model (LLM) fine-tuning and training of AI models across a wide range of frameworks, including PyTorch, MLX, HuggingFace, DeepSpeed, JAX, XGBoost, and more.

What is Kubeflow Trainer?

Kubeflow Trainer brings MPI to Kubernetes so that multi-node, multi-GPU distributed jobs can run on high-performance computing (HPC) clusters. It integrates with the Cloud Native AI ecosystem through projects such as:

Kueue for topology-aware scheduling and multi-cluster dispatch
JobSet and LeaderWorkerSet for orchestration
Coscheduling for gang scheduling with the Kubernetes scheduler
Volcano for batch scheduling
YuniKorn for resource optimization
KAI Scheduler for GPU-aware gang scheduling
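Several of the schedulers listed above provide gang scheduling, which means the pods of a distributed job are admitted all together or not at all, so a job never holds GPUs while waiting for missing workers. The following is a minimal sketch of that all-or-nothing check, not the algorithm any of these schedulers actually implements. The function name `gang_place`, the node names, and the greedy placement order are illustrative assumptions.

```python
from typing import Dict, Optional


def gang_place(
    free_gpus: Dict[str, int], num_pods: int, gpus_per_pod: int
) -> Optional[Dict[str, int]]:
    """Return a pods-per-node placement if every pod fits, otherwise None.

    Hypothetical helper for illustration only: real schedulers also weigh
    priorities, queues, topology, and preemption.
    """
    if gpus_per_pod <= 0 or num_pods < 0:
        raise ValueError("gpus_per_pod must be positive and num_pods non-negative")

    placement: Dict[str, int] = {}
    remaining = num_pods
    # Visit the nodes with the most free GPUs first.
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        fit = min(free // gpus_per_pod, remaining)
        if fit > 0:
            placement[node] = fit
            remaining -= fit

    # All-or-nothing: a partial placement is rejected instead of being started.
    return placement if remaining == 0 else None


# Three 2-GPU workers fit on these two (hypothetical) nodes; four do not.
fits = gang_place({"node-a": 4, "node-b": 2}, num_pods=3, gpus_per_pod=2)
rejected = gang_place({"node-a": 4, "node-b": 2}, num_pods=4, gpus_per_pod=2)
print(fits, rejected)
```

Returning `None` rather than a partial placement is the point of the sketch: without it, two large jobs could each grab half of the cluster and deadlock while waiting for the other half.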

The platform also provides a distributed data cache, built on Apache Arrow and Apache DataFusion, that streams tensors to GPU nodes without intermediate copies to keep data loading from stalling training.

Figure: Kubeflow Trainer Tech Stack

Who is This For?

Kubeflow Trainer documentation is organized around three key personas:

Figure: User Personas

AI Practitioners

ML engineers and data scientists who use the Kubeflow Python SDK and TrainJob APIs to train and fine-tune models at scale.

What you’ll find:

Training guides for PyTorch, JAX, DeepSpeed, MLX
LLM fine-tuning blueprints with TorchTune
Local execution backends for development

Platform Administrators

DevOps engineers and cluster operators who deploy and manage Kubeflow Trainer on Kubernetes clusters.

What you’ll find:

Installation and configuration guides
Runtime and policy management
Integration with schedulers (Kueue, Volcano)
Extension framework architecture

Contributors

Open source developers who want to contribute to the Kubeflow Trainer project.

What you’ll find:

Architecture documentation
Development workflow
Contributing guidelines
Community resources

Why Use Kubeflow Trainer?

Simple, Scalable, and Built for LLM Fine-Tuning

Describe a training run with a single Kubernetes custom resource, TrainJob, regardless of which supported framework you use. Scale from a single-GPU workload to large multi-node distributed training with minimal code changes.
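To make the single-resource idea concrete, the sketch below builds a TrainJob manifest as a plain Python dictionary and serializes it to JSON, which `kubectl apply -f` accepts alongside YAML. The API version `trainer.kubeflow.org/v1alpha1`, the runtime name `torch-distributed`, the container image, and the helper `build_trainjob` are assumptions made for illustration; check them against the TrainJob API reference for the release you have installed.

```python
import json
from typing import Any, Dict, List


def build_trainjob(
    name: str, runtime: str, num_nodes: int, image: str, command: List[str]
) -> Dict[str, Any]:
    """Build a TrainJob manifest (hypothetical helper, field names assumed)."""
    return {
        # Assumed API group/version; confirm against your installed CRD.
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            # The runtime supplies the framework-specific job template.
            "runtimeRef": {"name": runtime},
            "trainer": {
                "numNodes": num_nodes,
                "image": image,
                "command": command,
            },
        },
    }


# Placeholder image and command, not taken from the Kubeflow documentation.
single_node = build_trainjob(
    "demo", "torch-distributed", 1, "example.com/train:latest", ["python", "train.py"]
)
multi_node = build_trainjob(
    "demo", "torch-distributed", 8, "example.com/train:latest", ["python", "train.py"]
)

print(json.dumps(multi_node, indent=2))
```

In this sketch the single-node and eight-node manifests differ only in `numNodes`, which is one way to read the "minimal code changes" claim: the training script and image stay the same while the resource description changes.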

Extensible and Portable

Run anywhere: public clouds, on-premises, or hybrid environments. The plugin-based architecture allows custom ML policies, runtimes, and schedulers to be added without modifying the core platform.

Distributed AI Data Caching

Optimize data loading with Apache Arrow and Apache DataFusion for high-performance, zero-copy tensor streaming. The distributed cache is designed to shorten training time by removing data loading bottlenecks.

LLM Fine-Tuning Blueprints

Pre-built templates for generative AI fine-tuning with TorchTune support popular models such as Llama and Qwen. Configuration-driven workflows reduce boilerplate code.

Optimized GPU Efficiency

Data streaming and caching keep GPUs busy, which helps reduce training cost and time. Kubeflow Trainer also supports sharded training of large models with PyTorch FSDP and DeepSpeed ZeRO.
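A rough way to see why FSDP and ZeRO matter for GPU efficiency is to estimate how much memory the model states take per GPU. The sketch below uses the accounting from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer states. It ignores activations, buffers, and fragmentation, and the function name `model_state_gb` is an illustrative assumption rather than part of any library.

```python
def model_state_gb(params_billion: float, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU model-state memory in GB (1e9 bytes).

    stage 0: plain data parallelism, every GPU holds everything
    stage 1: optimizer states sharded across GPUs
    stage 2: optimizer states and gradients sharded
    stage 3: optimizer states, gradients, and weights sharded
             (roughly what FSDP full sharding does)
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    # Bytes per parameter for mixed-precision Adam (ZeRO paper accounting).
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        weights /= num_gpus

    # Billions of parameters times bytes per parameter gives GB directly.
    return params_billion * (weights + grads + optim)


# A 7.5B-parameter model on 64 GPUs, one line per sharding stage.
for stage in range(4):
    print(stage, round(model_state_gb(7.5, 64, stage), 2))
```

The estimate covers model states only, so actual per-GPU usage will be higher once activations and communication buffers are included.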

Native Kubernetes Integrations

Improve GPU utilization and coordinate scheduling for large-scale AI workloads. Kubeflow Trainer integrates with Kubernetes ecosystem projects such as Kueue, Coscheduling, Volcano, and YuniKorn.

Figure: AI Lifecycle with Kubeflow Trainer