LLM Fine-Tuning with Training Operator

How Training Operator performs fine-tuning on Kubernetes

Old Version

This page is about Kubeflow Training Operator V1. For the latest information, check the Kubeflow Trainer V2 documentation.

Follow this guide for migrating to Kubeflow Trainer V2.

This page shows how Training Operator implements the API to fine-tune LLMs.

Architecture

The following diagram shows how the train Python API works:

Fine-Tune API for LLMs

Once a user executes the train API, Training Operator creates a PyTorchJob with the appropriate resources to fine-tune the LLM.

A storage initializer InitContainer is added to PyTorchJob worker 0 to download the pre-trained model and dataset with the provided parameters.

A PVC with ReadOnlyMany access mode is attached to each PyTorchJob worker to distribute the model and dataset across Pods. Note: Your Kubernetes cluster must support volumes with ReadOnlyMany access mode; otherwise, you can use a single PyTorchJob worker.

Every PyTorchJob worker runs an LLM Trainer that fine-tunes the model using the provided parameters.
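
The layout described above can be modeled with a small, self-contained sketch. It does not call Kubernetes or the Training Operator SDK. The `WorkerPod` class, the `plan_workers` function, and the container and PVC names are hypothetical and exist only to illustrate which worker gets the storage initializer and which workers mount the shared volume.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkerPod:
    """Simplified, hypothetical view of one PyTorchJob worker Pod."""
    index: int
    init_containers: List[str] = field(default_factory=list)
    containers: List[str] = field(default_factory=list)
    volume_claims: List[str] = field(default_factory=list)


def plan_workers(num_workers: int, pvc_name: str = "model-and-dataset") -> List[WorkerPod]:
    """Return the Pod layout described in the text (names are illustrative)."""
    pods = []
    for index in range(num_workers):
        # Every worker runs the trainer and mounts the shared PVC.
        pod = WorkerPod(
            index=index,
            containers=["llm-trainer"],
            volume_claims=[pvc_name],
        )
        # Only worker 0 gets the storage initializer InitContainer,
        # which downloads the model and dataset once for all workers.
        if index == 0:
            pod.init_containers.append("storage-initializer")
        pods.append(pod)
    return pods


if __name__ == "__main__":
    for pod in plan_workers(3):
        print(pod)
```

With a single worker, only worker 0 exists, so the download and the training happen in the same Pod and the volume does not need to be shared across Pods.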

Training Operator implements the train API with these pre-created components:

Model Provider

The model provider downloads the pre-trained model. Currently, Training Operator supports the HuggingFace model provider, which downloads models from HuggingFace Hub.

You can implement your own model provider by using this abstract base class.
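
As a rough illustration, a custom model provider might look like the sketch below. To stay self-contained it defines a local stand-in base class instead of importing the real one, so the class name `ModelProvider` and the method names `load_config` and `download_model_and_tokenizer` are assumptions; check the actual abstract base class before relying on them. `LocalDirModelProvider` is a hypothetical provider that copies a model from a local directory rather than downloading it.

```python
import json
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class ModelProvider(ABC):
    """Local stand-in for the abstract base class (method names are assumptions)."""

    @abstractmethod
    def load_config(self, serialised_args: str) -> None:
        """Parse the provider parameters passed as a JSON string."""

    @abstractmethod
    def download_model_and_tokenizer(self) -> None:
        """Place the model files on the shared volume."""


class LocalDirModelProvider(ModelProvider):
    """Hypothetical provider that copies a model from a local directory."""

    def load_config(self, serialised_args: str) -> None:
        args = json.loads(serialised_args)
        self.source_dir = Path(args["source_dir"])
        self.target_dir = Path(args["target_dir"])

    def download_model_and_tokenizer(self) -> None:
        # Copy every file (weights, config, tokenizer) into the target directory.
        shutil.copytree(self.source_dir, self.target_dir, dirs_exist_ok=True)
```

Keeping configuration parsing separate from the download step lets the storage initializer validate parameters before it starts any transfer.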

Dataset Provider

The dataset provider downloads the dataset. Currently, Training Operator supports AWS S3 and HuggingFace dataset providers.

You can implement your own dataset provider by using this abstract base class.
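
A custom dataset provider can follow the same pattern. The sketch below again uses a local stand-in base class, so `DatasetProvider`, `load_config`, and `download_dataset` are assumed names rather than the confirmed interface. `JsonLinesDatasetProvider` and its `max_records` parameter are hypothetical; the provider copies the first records of a JSON Lines file into the target directory.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path


class DatasetProvider(ABC):
    """Local stand-in for the abstract base class (method names are assumptions)."""

    @abstractmethod
    def load_config(self, serialised_args: str) -> None:
        """Parse the provider parameters passed as a JSON string."""

    @abstractmethod
    def download_dataset(self) -> None:
        """Place the dataset files on the shared volume."""


class JsonLinesDatasetProvider(DatasetProvider):
    """Hypothetical provider that keeps the first N records of a JSON Lines file."""

    def load_config(self, serialised_args: str) -> None:
        args = json.loads(serialised_args)
        self.source_file = Path(args["source_file"])
        self.target_dir = Path(args["target_dir"])
        self.max_records = int(args.get("max_records", 1000))

    def download_dataset(self) -> None:
        self.target_dir.mkdir(parents=True, exist_ok=True)
        out_path = self.target_dir / "train.jsonl"
        with self.source_file.open() as src, out_path.open("w") as dst:
            for count, line in enumerate(src):
                # Stop once the requested number of records has been written.
                if count >= self.max_records:
                    break
                dst.write(line)
```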

LLM Trainer

The trainer implements the training loop to fine-tune the LLM. Currently, Training Operator supports the HuggingFace trainer to fine-tune LLMs.

You can implement your own trainer for other ML use cases such as image classification, voice recognition, etc.