Open Source · CNCF Project

Kubeflow Trainer

The Kubernetes-native platform for distributed AI training and LLM fine-tuning at any scale.

Deploy anywhere you run Kubernetes — or train locally with Docker.

What is Kubeflow Trainer?

Kubeflow Trainer is a Kubernetes-native platform for distributed AI model training and LLM fine-tuning. It provides a single TrainJob CRD and a unified Python SDK across PyTorch, JAX, DeepSpeed, MLX, HuggingFace, Megatron, and XGBoost. Train locally with Docker or scale to multi-node GPU clusters on any Kubernetes environment — without changing your code. It features distributed data caching with Apache Arrow and Apache DataFusion for zero-copy tensor streaming directly to GPU nodes.
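To make the "single TrainJob CRD" concrete, here is a minimal sketch that builds a TrainJob manifest as a plain Python dict. The API group and version (`trainer.kubeflow.org/v1alpha1`), the field names (`runtimeRef`, `trainer.numNodes`), and the runtime name `torch-distributed` are assumptions for illustration; check them against the CRD installed in your cluster. The helper `build_trainjob` is hypothetical and not part of the SDK.

```python
import json


def build_trainjob(name: str, runtime: str, num_nodes: int) -> dict:
    """Return a TrainJob manifest as a dict (field names are assumptions)."""
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",  # assumed group/version
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            # Points at a TrainingRuntime that defines how the job is run.
            "runtimeRef": {"name": runtime},
            # Number of training nodes requested for this job.
            "trainer": {"numNodes": num_nodes},
        },
    }


manifest = build_trainjob("demo-job", "torch-distributed", num_nodes=4)
print(json.dumps(manifest, indent=2))
```

In practice the Python SDK is expected to create this object for you, so a hand-written manifest like this is mainly useful for understanding what gets submitted.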

Why Trainer?

Multi-Framework

One API for PyTorch, JAX, DeepSpeed, MLX, HuggingFace, Megatron, XGBoost, and more. Swap frameworks without rewriting orchestration code.

Distributed Training

Scale from a single GPU to multi-node clusters. Automatic setup of DDP, FSDP, parameter servers, and gang-scheduling across nodes.
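As a rough illustration of what data-parallel training implies for your data, the sketch below splits a dataset by worker rank so that each node sees a disjoint slice. It is a simplified stand-in for what framework samplers do (no shuffling or padding), and `shard_indices` is a hypothetical helper, not a Trainer or PyTorch API.

```python
def shard_indices(num_samples: int, rank: int, world_size: int) -> list[int]:
    """Strided split: worker r gets samples r, r + world_size, r + 2 * world_size, ..."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(rank, num_samples, world_size))


# Simulate 4 workers sharing a dataset of 10 samples.
shards = [shard_indices(10, rank, world_size=4) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(f"worker {rank}: {shard}")
```

Every sample lands on exactly one worker, which is the property that lets gradients from the workers be averaged as if they came from one large batch.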

Local & Cloud

Develop and test locally with Docker or Podman, then deploy the same TrainJob to any Kubernetes cluster — zero code changes needed.

LLM Fine-Tuning

First-class support for LoRA, QLoRA, and full fine-tuning via TorchTune. Bring your own HuggingFace model and dataset URIs.
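To see why LoRA is attractive compared with full fine-tuning, the sketch below counts trainable parameters for one weight matrix under the standard LoRA formulation, where a frozen `d_out x d_in` weight is augmented with two low-rank factors of shapes `d_out x r` and `r x d_in`. The dimensions and rank are illustrative values, not taken from any particular model or TorchTune configuration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


d_out = d_in = 4096  # illustrative layer size
rank = 8             # illustrative LoRA rank

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={rank}): {lora:,} trainable parameters ({lora / full:.4%} of full)")
```

For this layer the adapter trains 65,536 parameters instead of 16,777,216, which is 1/256 of the full count.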

Extensible Runtimes

Use built-in TrainingRuntimes or build your own. Plugin architecture lets platform teams customize scheduling, networking, and resource management.

Production Ready

Used in production by organizations across the CNCF ecosystem (see ADOPTERS.md for the list). Backed by the Kubeflow community with enterprise support.

Supported Frameworks

PyTorch
JAX
DeepSpeed
MLX
HuggingFace
Megatron
XGBoost
TorchTune

Documentation

Browse guides by role — from your first TrainJob to production deployment and contribution.

Train in a Few Lines

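```python
from kubeflow.trainer import TrainerClient, CustomTrainer

# my_train_func is your own training function, defined elsewhere.
client = TrainerClient()
trainer = CustomTrainer(func=my_train_func, num_nodes=4)

# Submit the training job across 4 nodes.
client.train(trainer=trainer)
```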

The same code runs locally with Docker or on any Kubernetes cluster. See the full quickstart →
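The snippet above leaves `my_train_func` undefined. A minimal sketch of what such a function could look like follows. It assumes the function is shipped to the workers as source, so imports are kept inside the function body, and that the launcher exposes torchrun-style `RANK` and `WORLD_SIZE` environment variables. Both are assumptions to verify against the quickstart, and the return value is only there to make the sketch easy to check.

```python
def my_train_func():
    # Keep imports inside the function so it stays self-contained
    # when it is executed on a remote worker (assumption).
    import os

    # torchrun-style variables; fall back to a single-process setup.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    print(f"worker {rank} of {world_size} starting")

    # ... build the model, load this worker's data, run the training loop ...

    return rank, world_size


# Calling it directly is a quick local smoke test before submitting a job.
result = my_train_func()
```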

Join the Community

We are an open and welcoming community of developers, data scientists, and organizations — backed by the Cloud Native Computing Foundation.