KAI Scheduler¶
This guide describes how to enable gang scheduling and advanced resource management with the NVIDIA KAI Scheduler in Kubeflow Trainer.
By integrating KAI Scheduler, you ensure “all-or-nothing” scheduling for distributed training jobs: a job starts only when all of its requested GPU resources are available at the same time. This prevents the resource deadlock in multi-node training where several partially scheduled jobs each hold some GPUs while waiting for the rest.
Prerequisites¶
Install KAI Scheduler: Follow the KAI Installation Guide to set up the scheduler and the podgrouper service in your Kubernetes cluster.
Define a Queue: KAI uses queues to manage resources. Ensure you have a KAI Queue created (e.g., training-queue) or use the default-queue created during installation.
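As a rough illustration of the queue prerequisite, a Queue manifest might look like the sketch below. It assumes the scheduling.run.ai/v2 Queue API shown in the upstream KAI quickstart; the parent queue name and the quota values are placeholders, so check them against the CRD installed in your cluster before applying anything.

```yaml
# Minimal sketch of a KAI Queue, not a definitive manifest.
# Assumption: the Queue CRD is served as scheduling.run.ai/v2.
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: training-queue
spec:
  parentQueue: default        # hypothetical parent; use a queue that exists in your cluster
  resources:
    cpu:
      quota: -1               # -1 is used upstream to mean "no fixed quota"
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
```

A job is then routed to this queue by setting the kai.scheduler/queue label to training-queue.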
Enable KAI Plugin¶
KAI scheduling can be enabled by setting the schedulerName to kai-scheduler in the pod template
of your TrainingRuntime or ClusterTrainingRuntime specification.
Note
KAI integrates externally via its PodGrouper component, which monitors pods requesting the kai-scheduler.
Example: ClusterTrainingRuntime with KAI¶
You can enforce KAI scheduling at the runtime level. This ensures that every job using this runtime automatically utilizes KAI gang-scheduling.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: pytorch-kai-runtime
spec:
  mlPolicy:
    torch:
      numNodes: 1
  template:
    spec:
      schedulerName: kai-scheduler
      containers:
        - name: train
          image: pytorch/pytorch:latest
Example: TrainJob with KAI¶
Once your runtime is created, you can submit a TrainJob that references it. You can also add the
kai.scheduler/queue label to your job to route it to a specific resource queue in KAI.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-kai-job
  labels:
    kai.scheduler/queue: "prod-queue"  # KAI Scheduler uses this to route the job
spec:
  runtimeRef:
    name: pytorch-kai-runtime
  trainer:
    numNodes: 4
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1
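If the runtime is a namespaced TrainingRuntime rather than a ClusterTrainingRuntime, the reference most likely needs its kind spelled out, because runtimeRef is understood to default to ClusterTrainingRuntime. The sketch below assumes a hypothetical TrainingRuntime named pytorch-kai-runtime-ns in a hypothetical namespace team-a, and a queue named training-queue; none of these names come from the examples above.

```yaml
# Sketch: TrainJob referencing a namespaced TrainingRuntime.
# Runtime name, namespace, and queue are hypothetical placeholders.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-kai-job-ns
  namespace: team-a                     # must match the TrainingRuntime's namespace
  labels:
    kai.scheduler/queue: "training-queue"
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: TrainingRuntime               # explicit, since the default kind is cluster-scoped
    name: pytorch-kai-runtime-ns
  trainer:
    numNodes: 2
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1
```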
How it Works¶
When a TrainJob is created using a runtime configured with the kai-scheduler:
Metadata Propagation: The Trainer Operator applies the necessary labels and annotations to the underlying JobSet.
Pod Grouping: The KAI podgrouper component detects the training pods via the OwnerReference chain and automatically creates a KAI PodGroup resource.
Gang Scheduling: The KAI Scheduler identifies the PodGroup and ensures all replicas (workers) are scheduled at once on nodes assigned to the specified queue.
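To make the steps above concrete, the PodGroup generated for the four-node pytorch-kai-job might resemble the sketch below. This object is created by the podgrouper, not applied by hand. The API version and the minMember and queue fields are assumptions based on the upstream KAI PodGroup CRD, the object name is hypothetical, and whether one PodGroup covers the whole JobSet or one is created per replicated job may depend on the KAI version.

```yaml
# Illustrative only: shape of a podgrouper-generated PodGroup.
# Assumptions: apiVersion scheduling.run.ai/v2alpha2, spec.minMember, spec.queue.
apiVersion: scheduling.run.ai/v2alpha2
kind: PodGroup
metadata:
  name: pg-pytorch-kai-job    # hypothetical; the real name is generated
  namespace: default
spec:
  minMember: 4                # gang size: all 4 workers are placed, or none are
  queue: prod-queue           # derived from the kai.scheduler/queue label
```

With a gang size of 4, the scheduler binds no worker pod until all four fit at once, which is the "all-or-nothing" behavior described at the top of this guide.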