Open Source · CNCF Project

Kubeflow Trainer

The Kubernetes-native platform for distributed AI training and LLM fine-tuning at any scale.

Deploy anywhere you run Kubernetes — or train locally with Docker.

What is Kubeflow Trainer?

Kubeflow Trainer is a Kubernetes-native platform for distributed AI model training and LLM fine-tuning. It provides a single TrainJob CRD and a unified Python SDK across PyTorch, JAX, DeepSpeed, MLX, HuggingFace, Megatron, and XGBoost. Train locally with Docker or scale to multi-node GPU clusters on any Kubernetes environment — without changing your code. It features distributed data caching with Apache Arrow and Apache DataFusion for zero-copy tensor streaming directly to GPU nodes.

Why Trainer?

Multi-Framework

One API for PyTorch, JAX, DeepSpeed, MLX, HuggingFace, Megatron, XGBoost, and more. Swap frameworks without rewriting orchestration code.

Distributed Training

Scale from a single GPU to multi-node clusters. Automatic setup of DDP, FSDP, parameter servers, and gang-scheduling across nodes.
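As a rough illustration of what data-parallel training involves on each worker, the sketch below reads the RANK and WORLD_SIZE environment variables that launchers such as torchrun set, then picks the slice of a dataset one worker would process. The helpers `read_dist_env` and `shard_indices` are hypothetical names used only for this example and are not part of Kubeflow Trainer. In real jobs a framework sampler (for example PyTorch's DistributedSampler) normally does this for you.

```python
import os


def read_dist_env(env=None):
    """Return (rank, world_size) from launcher-style environment variables.

    Falls back to a single-process setup when the variables are absent.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world size {world_size}")
    return rank, world_size


def shard_indices(num_samples, rank, world_size):
    """Assign sample indices to one worker in strided (round-robin) order."""
    return list(range(rank, num_samples, world_size))


if __name__ == "__main__":
    # Simulate worker 1 of 4 instead of relying on a real launcher.
    rank, world_size = read_dist_env({"RANK": "1", "WORLD_SIZE": "4"})
    print(shard_indices(10, rank, world_size))  # [1, 5, 9]
```

Every worker runs the same code and differs only in its rank, which is why the same training function can run on one process or on many.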

Local & Cloud

Develop and test locally with Docker or Podman, then deploy the same TrainJob to any Kubernetes cluster — zero code changes needed.

LLM Fine-Tuning

First-class support for LoRA, QLoRA, and full fine-tuning via TorchTune. Bring your own HuggingFace model and dataset URIs.
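To give a sense of why LoRA is cheaper than full fine-tuning: LoRA freezes a weight matrix of shape d × k and trains two low-rank factors of shapes d × r and r × k in its place. The sketch below only does that arithmetic. `lora_trainable_params` is a hypothetical helper, and the 4096 × 4096 layer with rank 8 is an illustrative assumption, not a Kubeflow Trainer or TorchTune default.

```python
def full_params(d, k):
    """Number of weights in a dense d x k matrix (all trained in full fine-tuning)."""
    return d * k


def lora_trainable_params(d, k, r):
    """Number of trained weights when the update is factored as (d x r) @ (r x k)."""
    if not 0 < r <= min(d, k):
        raise ValueError("rank r must satisfy 0 < r <= min(d, k)")
    return r * (d + k)


if __name__ == "__main__":
    d = k = 4096  # assumed layer size, for illustration only
    r = 8         # assumed LoRA rank
    full = full_params(d, k)
    lora = lora_trainable_params(d, k, r)
    print(full, lora, f"{lora / full:.2%}")  # 16777216 65536 0.39%
```

QLoRA applies the same low-rank idea on top of a quantized, frozen base model, so the trainable parameter count is computed the same way.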

Extensible Runtimes

Use built-in TrainingRuntimes or build your own. Plugin architecture lets platform teams customize scheduling, networking, and resource management.

Production Ready

Used in production by organizations across the CNCF ecosystem; see ADOPTERS.md for the current list. Backed by the Kubeflow community with enterprise support.

Supported Frameworks

PyTorch

JAX

DeepSpeed

MLX

HuggingFace

Megatron

XGBoost

TorchTune

Documentation

Browse guides by role — from your first TrainJob to production deployment and contribution.

Overview

Learn about Kubeflow Trainer, who it's for, and why you should use it

Getting Started

Installation, first TrainJob, and quickstart tutorials

User Guides

Documentation for AI practitioners and ML engineers using Kubeflow Trainer

Operator Guides

Documentation for platform administrators deploying and managing Kubeflow Trainer

Contributor Guides

Architecture, development workflow, and how to extend Kubeflow Trainer

Legacy Kubeflow Training Operator (v1)

Kubeflow Training Operator v1 documentation — archived guides, installation, and migration to v2

Train in 5 Lines

```python
from kubeflow.trainer import TrainerClient, CustomTrainer

# my_train_func is your own training function, defined elsewhere.
client = TrainerClient()
trainer = CustomTrainer(func=my_train_func, num_nodes=4)

client.train(trainer=trainer)
```

The same code runs locally with Docker or on any Kubernetes cluster. See the full quickstart →
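The snippet above passes `my_train_func` without showing it. A minimal sketch of what such a function might look like for PyTorch follows. It assumes the runtime starts each worker with torchrun-style environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, LOCAL_RANK), and it keeps its imports inside the function on the assumption that the function body is shipped to the training nodes and executed there. The toy linear model and random data are placeholders, not anything Kubeflow Trainer provides.

```python
def my_train_func():
    # Imports are local so the function is self-contained when run remotely.
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Pick NCCL on GPU nodes and Gloo on CPU-only nodes.
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    # Toy model wrapped in DDP so gradients are averaged across workers.
    model = torch.nn.Linear(10, 1).to(device)
    model = DDP(model, device_ids=[local_rank] if use_cuda else None)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for step in range(100):
        # Random placeholder batch: the target is the sum of the features.
        x = torch.randn(32, 10, device=device)
        y = x.sum(dim=1, keepdim=True)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if dist.get_rank() == 0 and step % 20 == 0:
            print(f"step={step} loss={loss.item():.4f}")

    dist.destroy_process_group()
```

Nothing in the function refers to the number of nodes; that is set when the job is submitted, so the same function can be reused at different scales.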

Join the Community

We are an open and welcoming community of developers, data scientists, and organizations — backed by the Cloud Native Computing Foundation.

GitHub

Star, fork, and contribute

Slack

#kubeflow-trainer on CNCF Slack

CNCF Slack

Join the CNCF Slack workspace

Mailing List

kubeflow-discuss

Meeting Notes

Trainer & Katib community calls

Recordings

Watch past community meetings

Community Calendar

View all Kubeflow meetings

Add to Calendar

Subscribe via iCal

Blog

Latest news and tutorials