Architecture

The Training Operator Architecture

Old Version

This page is about Kubeflow Training Operator V1. For the latest information, check the Kubeflow Trainer V2 documentation.

Follow this guide for migrating to Kubeflow Trainer V2.

What is the Training Operator Architecture?

The original design was drafted in April 2021 and is available here for reference. The goal was to provide a unified Kubernetes operator that supports multiple machine learning and deep learning frameworks. This was achieved with a “Frontend” operator that decomposes a job into configurable Kubernetes components (e.g., Role, PodTemplate, Fault-Tolerance), watches all Role Custom Resources, and manages pod performance. The dedicated “Backend” operator was never implemented; its responsibilities were instead consolidated into the “Frontend” operator.
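To make the decomposition more concrete, the sketch below builds a PyTorchJob manifest as a plain Python dictionary. It is only an illustration, not the operator's own code: the helper name `build_pytorchjob`, the job name, and the image are hypothetical. Mapping "Role" to the replica type, "PodTemplate" to `template`, and "Fault-Tolerance" to `restartPolicy` is one reading of the design, not something the design document states.

```python
import json


def build_pytorchjob(name, image, workers, restart_policy="OnFailure"):
    """Return a PyTorchJob manifest as a plain dict (hypothetical helper)."""

    def replica_spec(replicas):
        # One entry per role: how many pods, how to react to failures,
        # and the pod template each replica is created from.
        return {
            "replicas": replicas,
            "restartPolicy": restart_policy,
            "template": {
                "spec": {
                    "containers": [{"name": "pytorch", "image": image}]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1),
                "Worker": replica_spec(workers),
            }
        },
    }


if __name__ == "__main__":
    # Placeholder job name and image, for illustration only.
    manifest = build_pytorchjob("demo-job", "example.com/train:latest", workers=2)
    print(json.dumps(manifest, indent=2))
```

The resulting dictionary can be serialized to YAML or JSON and submitted like any other Kubernetes resource. Other frameworks use a different `kind` and replica types.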

The benefits of this approach were:

A Single Source of Truth (SSOT) for other Kubeflow components to interact with
Simpler Kubeflow releases
Unlocked production-grade features such as manifests and metadata support
Shared testing and release infrastructure

The V1 Training Operator architecture is shown in the diagram below:

The diagram displays PyTorchJob and its configured communication methods, but each framework can have its own approach(es) to communicating across pods. Additionally, each framework can have its own set of configurable resources.

As a concrete example, PyTorch has several communication backends available; see the source code documentation for the full list.
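As a rough sketch of how a training script might pick a backend, the example below uses only the standard library. The helper names `choose_backend` and `rendezvous_from_env` are hypothetical, as are the fallback defaults. The backend names `nccl` and `gloo`, and the variables `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE`, are the ones PyTorch's `env://` initialization uses. Picking `nccl` for GPU training and `gloo` for CPU training is a common rule of thumb rather than a requirement.

```python
import os


def choose_backend(use_gpu: bool) -> str:
    """Pick a PyTorch communication backend name (rule of thumb only)."""
    return "nccl" if use_gpu else "gloo"


def rendezvous_from_env(env=None):
    """Read the rendezvous settings used by PyTorch's env:// init method.

    The fallback values are assumptions for local, single-process runs.
    """
    env = os.environ if env is None else env
    return {
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "rank": int(env.get("RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }


if __name__ == "__main__":
    backend = choose_backend(use_gpu=False)
    settings = rendezvous_from_env()
    # In a real training script, the backend name would be passed on, e.g.:
    #   torch.distributed.init_process_group(backend=backend)
    print(backend, settings)
```

Keeping the choice in a small function makes it easy to override per job, for example when a framework or cluster supports only one of the backends.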