Kubeflow Trainer
The Kubernetes-native platform for distributed AI training and LLM fine-tuning at any scale.
What is Kubeflow Trainer?
Kubeflow Trainer is a Kubernetes-native platform for distributed AI model training and LLM fine-tuning. It provides a single TrainJob CRD and a unified Python SDK across PyTorch, JAX, DeepSpeed, MLX, HuggingFace, Megatron, and XGBoost. Train locally with Docker or scale to multi-node GPU clusters on any Kubernetes environment — without changing your code. It features distributed data caching with Apache Arrow and Apache DataFusion for zero-copy tensor streaming directly to GPU nodes.
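As a rough illustration of the single-CRD idea, a TrainJob can be pictured as a small manifest that names a runtime and a node count. The sketch below builds one as a plain Python dict using only the standard library. The API version `trainer.kubeflow.org/v1alpha1`, the field names `runtimeRef` and `numNodes`, and the runtime name `torch-distributed` are assumptions that should be checked against the CRD installed on your cluster; the helper `build_trainjob` is hypothetical and not part of the SDK.

```python
import json


def build_trainjob(name: str, runtime: str, num_nodes: int) -> dict:
    """Return a TrainJob-shaped manifest as a plain dict (illustrative only)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return {
        # Assumed API group/version; verify against your cluster's CRD.
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            # Assumed field: refers to a built-in or custom runtime by name.
            "runtimeRef": {"name": runtime},
            # Assumed field: number of training nodes to request.
            "trainer": {"numNodes": num_nodes},
        },
    }


manifest = build_trainjob("demo-job", "torch-distributed", num_nodes=4)
print(json.dumps(manifest, indent=2))
```

Under these assumptions, only the runtime name and the node count change between frameworks and cluster sizes, which is roughly what "a single TrainJob CRD" implies.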
Why Trainer?
Multi-Framework
One API for PyTorch, JAX, DeepSpeed, MLX, HuggingFace, Megatron, XGBoost, and more. Swap frameworks without rewriting orchestration code.
Distributed Training
Scale from a single GPU to multi-node clusters. Automatic setup of DDP, FSDP, parameter servers, and gang-scheduling across nodes.
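To make "automatic setup" a little more concrete: launchers such as torchrun describe each worker's place in the job through the environment variables `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`. The sketch below assumes a runtime that follows the same convention (whether a given Trainer runtime sets exactly these variables is an assumption) and shows how a training script might read them, falling back to single-process defaults. `DistEnv` and `read_dist_env` are hypothetical names used only for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    rank: int         # global index of this worker
    local_rank: int   # index of this worker on its own node
    world_size: int   # total number of workers in the job
    master_addr: str  # rendezvous host
    master_port: int  # rendezvous port

    @property
    def is_main(self) -> bool:
        # By convention, rank 0 handles logging and checkpointing.
        return self.rank == 0


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read torchrun-style variables, defaulting to a single local process."""
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", "29500")),
    )


# Simulate worker 3 of 8 without touching the real environment.
worker = read_dist_env({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "8"})
print(worker)
```

Because the defaults describe a one-process job, the same script can run unchanged on a laptop and on a multi-node cluster.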
Local & Cloud
Develop and test locally with Docker or Podman, then deploy the same TrainJob to any Kubernetes cluster — zero code changes needed.
LLM Fine-Tuning
First-class support for LoRA, QLoRA, and full fine-tuning via TorchTune. Bring your own HuggingFace model and dataset URIs.
Extensible Runtimes
Use built-in TrainingRuntimes or build your own. Plugin architecture lets platform teams customize scheduling, networking, and resource management.
Production Ready
Used in production by organizations across the CNCF ecosystem; see ADOPTERS.md for the current list. Backed by the Kubeflow community with enterprise support.
Supported Frameworks
Documentation
Browse guides by role — from your first TrainJob to production deployment and contribution.
Learn about Kubeflow Trainer, who it's for, and why you should use it
Getting Started
Installation, first TrainJob, and quickstart tutorials
User Guides
Documentation for AI practitioners and ML engineers using Kubeflow Trainer
Operator Guides
Documentation for platform administrators deploying and managing Kubeflow Trainer
Contributor Guides
Architecture, development workflow, and how to extend Kubeflow Trainer
Legacy Kubeflow Training Operator (v1)
Kubeflow Training Operator v1 documentation: archived guides, installation, and migration to v2
Train in 5 Lines
from kubeflow.trainer import TrainerClient, CustomTrainer
# my_train_func is your own training function, defined elsewhere
client = TrainerClient()
trainer = CustomTrainer(func=my_train_func, num_nodes=4)
client.train(trainer=trainer)
The same code runs locally with Docker or on any Kubernetes cluster. See the full quickstart.
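For context, `my_train_func` in the snippet above is an ordinary Python function. Here is a minimal sketch of one, using a toy gradient-descent loop in pure Python so that it runs without any ML framework. It assumes the function is shipped to each node on its own and therefore keeps its imports inside the function body; that is an assumption to confirm in the quickstart. The toy data and hyperparameters are invented for illustration.

```python
def my_train_func(steps: int = 200, lr: float = 0.05) -> float:
    """Fit y = w * x on toy data with plain gradient descent; return w."""
    # Imports live inside the function so it stays self-contained if it is
    # serialized and run on another machine (assumption, see lead-in).
    import random

    rng = random.Random(0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    ys = [3.0 * x for x in xs]  # the ground-truth weight is 3.0

    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w


# Run once locally as a smoke test before handing the function to a trainer.
learned_w = my_train_func()
print(f"learned weight: {learned_w:.3f}")
```

Calling the function directly like this is a cheap way to catch bugs before passing it as `func=my_train_func` to `CustomTrainer`.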
Join the Community
We are an open and welcoming community of developers, data scientists, and organizations — backed by the Cloud Native Computing Foundation.