Overview

Kubeflow Trainer is a Kubernetes-native distributed AI platform for scalable large language model (LLM) fine-tuning and training of AI models across a wide range of frameworks, including PyTorch, MLX, HuggingFace, DeepSpeed, JAX, XGBoost, and more.

What is Kubeflow Trainer?

Kubeflow Trainer brings MPI to Kubernetes for multi-node, multi-GPU distributed jobs across HPC clusters. It also integrates with Cloud Native AI ecosystem projects such as Kueue and Volcano.

The platform features distributed data caching using Apache Arrow and Apache DataFusion for zero-copy tensor streaming directly to GPU nodes, maximizing training performance.

[Figure: Kubeflow Trainer tech stack]

Who is This For?

Kubeflow Trainer documentation is organized around three key personas:

[Figure: User personas]

AI Practitioners

ML engineers and data scientists who use the Kubeflow Python SDK and TrainJob APIs to train and fine-tune models at scale.

What you’ll find:

  • Training guides for PyTorch, JAX, DeepSpeed, MLX

  • LLM fine-tuning blueprints with TorchTune

  • Local execution backends for development
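As a rough illustration of the SDK workflow described above, the sketch below defines a toy training function and a helper that would submit it as a TrainJob. It assumes the Kubeflow Python SDK is installed and exposes `TrainerClient`, `CustomTrainer`, `get_runtime`, and `train` under `kubeflow.trainer`, and that a runtime named `torch-distributed` exists on the cluster; these names may differ between SDK versions. The helper `submit` and the toy `train_fn` are hypothetical and exist only for this example.

```python
def train_fn():
    """Toy training loop standing in for real PyTorch code.

    Fits y = w * x to points on the line y = 2x with plain gradient
    descent, so it runs anywhere without extra dependencies.
    """
    w = 0.0
    data = [(x, 2.0 * x) for x in range(1, 6)]
    for _ in range(200):
        # Mean gradient of the squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= 0.01 * grad
    print(f"learned weight: {w:.3f}")
    return w


def submit(num_nodes: int = 2) -> str:
    """Submit train_fn as a TrainJob (needs the SDK and cluster access).

    Assumption: the SDK import path and method names below match the
    installed version; the runtime name is also an assumption.
    """
    from kubeflow.trainer import CustomTrainer, TrainerClient  # lazy import

    client = TrainerClient()
    return client.train(
        runtime=client.get_runtime("torch-distributed"),
        trainer=CustomTrainer(func=train_fn, num_nodes=num_nodes),
    )


if __name__ == "__main__":
    # Local smoke test only; call submit() when a cluster is available.
    train_fn()
```

Running the function locally first is a cheap way to catch bugs before the same function is sent to a cluster.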

Platform Administrators

DevOps engineers and cluster operators who deploy and manage Kubeflow Trainer on Kubernetes clusters.

What you’ll find:

  • Installation and configuration guides

  • Runtime and policy management

  • Integration with schedulers (Kueue, Volcano)

  • Extension framework architecture

Contributors

Open source developers who want to contribute to the Kubeflow Trainer project.

What you’ll find:

  • Architecture documentation

  • Development workflow

  • Contributing guidelines

  • Community resources

Why Use Kubeflow Trainer?

Simple, Scalable, and Built for LLM Fine-Tuning

Train models through a single Kubernetes custom resource, TrainJob, across any supported framework. Scale from single-GPU workloads to large multi-node distributed training with minimal code changes.
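To make the "single resource, minimal changes" idea concrete, here is a minimal sketch that builds a TrainJob manifest as a Python dictionary. The API version `trainer.kubeflow.org/v1alpha1` and the field names (`runtimeRef`, `trainer.numNodes`, `trainer.resourcesPerNode`) are assumptions about the TrainJob schema and should be checked against the installed CRD; the image name, job name, and the helper `build_trainjob` are hypothetical.

```python
import json


def build_trainjob(name, image, command, num_nodes=1, gpus_per_node=1,
                   runtime="torch-distributed"):
    """Return a TrainJob manifest as a plain dict (schema is an assumption)."""
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            # Reference to a pre-installed training runtime.
            "runtimeRef": {"name": runtime},
            "trainer": {
                "image": image,
                "command": command,
                "numNodes": num_nodes,
                "resourcesPerNode": {
                    "limits": {"nvidia.com/gpu": str(gpus_per_node)},
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image and entry point, for illustration only.
    args = ("demo-job", "docker.io/example/train:latest",
            ["python", "train.py"])
    single = build_trainjob(*args, num_nodes=1)
    multi = build_trainjob(*args, num_nodes=8)
    # The two manifests differ only in spec.trainer.numNodes.
    print(json.dumps(multi, indent=2))
```

In this sketch, moving from one node to eight changes a single field; the JSON output could be applied with `kubectl apply -f -`.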

Extensible and Portable

Run anywhere: public clouds, on-premises, or hybrid environments. The plugin-based architecture allows custom ML policies, runtimes, and schedulers to be added without modifying the core platform.

Distributed AI Data Caching

Optimize data loading with Apache Arrow and Apache DataFusion for high-performance, zero-copy tensor streaming. The distributed cache shortens training time by reducing data loading bottlenecks.

LLM Fine-Tuning Blueprints

Pre-built templates for generative AI fine-tuning with TorchTune, supporting popular models like Llama and Qwen. Configuration-driven workflows eliminate boilerplate code.

Optimized GPU Efficiency

Data streaming and caching keep GPUs busy, which reduces training cost and time. Kubeflow Trainer also supports efficient model parallelism with PyTorch FSDP and DeepSpeed ZeRO.
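The memory benefit of sharded training can be estimated with simple arithmetic. The sketch below follows the accounting commonly used for ZeRO with mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. It is an approximation only: it ignores activations, buffers, and fragmentation, and the function name `zero_memory_gb` is hypothetical. Fully sharded FSDP behaves roughly like stage 3 in this model.

```python
def zero_memory_gb(params: float, gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in GB for ZeRO stages 0-3.

    Assumes mixed-precision Adam: 2 + 2 + 12 bytes per parameter.
    Stage 0 is plain data parallelism with no sharding.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2.0 * params   # fp16 parameters
    grads = 2.0 * params     # fp16 gradients
    optim = 12.0 * params    # fp32 master weights, momentum, variance
    if stage >= 1:
        optim /= gpus        # stage 1 shards optimizer state
    if stage >= 2:
        grads /= gpus        # stage 2 also shards gradients
    if stage >= 3:
        weights /= gpus      # stage 3 also shards parameters
    return (weights + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_memory_gb(7.5e9, gpus=64, stage=stage)
        print(f"stage {stage}: {gb:.2f} GB per GPU")
```

For a 7.5-billion-parameter model on 64 GPUs, this estimate drops from 120 GB per GPU without sharding to under 2 GB with full sharding.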

Native Kubernetes Integrations

Kubeflow Trainer integrates with Kubernetes ecosystem projects such as Kueue, Coscheduling, Volcano, and YuniKorn to provide coordinated scheduling and high GPU utilization for large-scale AI workloads.
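As one example of such an integration, Kueue decides which queue admits a job from the `kueue.x-k8s.io/queue-name` label on the job's metadata. The sketch below adds that label to a TrainJob manifest held as a Python dictionary. It assumes the cluster runs a Kueue version with TrainJob support enabled; the queue name `team-a-queue`, the minimal manifest, and the helper `assign_to_queue` are hypothetical.

```python
import copy

# Label Kueue reads to route a job to a LocalQueue.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def assign_to_queue(manifest: dict, local_queue: str) -> dict:
    """Return a copy of the manifest labeled for the given LocalQueue."""
    queued = copy.deepcopy(manifest)  # leave the caller's dict untouched
    labels = queued.setdefault("metadata", {}).setdefault("labels", {})
    labels[QUEUE_LABEL] = local_queue
    return queued


if __name__ == "__main__":
    # Minimal manifest for illustration; field names are assumptions.
    job = {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": "demo-job"},
        "spec": {"runtimeRef": {"name": "torch-distributed"}},
    }
    print(assign_to_queue(job, "team-a-queue")["metadata"])
```

Keeping the queue assignment in a label means the training spec itself does not change when a job moves between queues.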

AI Lifecycle with Kubeflow Trainer

Learn More

An introduction to Kubeflow Trainer was presented at KubeCon + CloudNativeCon 2024; a recording of the talk is available.

Next Steps

Ready to get started? Run your first Kubeflow TrainJob by following the Getting Started guide.

Getting Started

Install Kubeflow Trainer and run your first distributed training job

User Guides

Learn how to train with PyTorch, JAX, DeepSpeed, MLX, and more

Operator Guides

Deploy and manage Kubeflow Trainer in production

Examples Repository

Explore complete training examples on GitHub

https://github.com/kubeflow/trainer/tree/master/examples