How to Fine-Tune LLMs with Kubeflow

Overview of the LLM fine-tuning API in the Training Operator

Old Version

This page is about Kubeflow Training Operator V1. For the latest information, check the Kubeflow Trainer V2 documentation.

Follow this guide for migrating to Kubeflow Trainer V2.

This page describes how to use the train API from the Training Python SDK, which simplifies fine-tuning LLMs with distributed PyTorchJob workers.

If you want to learn more about how the fine-tuning API fits in the Kubeflow ecosystem, head to the explanation guide.

Prerequisites

You need to install the Training Python SDK with fine-tuning support to run this API.

How to use the Fine-Tuning API?

You need to provide the following parameters to use the train API:

  • Number of PyTorch workers and resources per worker.

  • Trainer parameters.

  • Dataset parameters.

  • Pre-trained model parameters.

For example, you can use the train API to fine-tune the BERT model using the Yelp Review dataset from HuggingFace Hub with the code below:

import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    # BERT model URI and type of Transformer to train it.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use 3000 samples from Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:3000]",
    ),
    # Specify HuggingFace Trainer parameters. In this example, we will skip evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            eval_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
        ),
        # Set LoRA config to reduce number of trainable model parameters.
        lora_config=LoraConfig(
            r=8,
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    num_workers=4,  # nnodes parameter for torchrun command.
    num_procs_per_worker=2,  # nproc-per-node parameter for torchrun command.
    resources_per_worker={
        "gpu": 2,
        "cpu": 5,
        "memory": "10G",
    },
)

After you execute train, the Training Operator will orchestrate the appropriate PyTorchJob resources to fine-tune the LLM.
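The comments in the call above tie num_workers and num_procs_per_worker to the nnodes and nproc-per-node options of torchrun. As a rough sketch of the arithmetic this implies, the snippet below derives the total process count and GPU request for the example values. The helper name summarize_topology is hypothetical and not part of the SDK.

```python
def summarize_topology(num_workers: int, num_procs_per_worker: int, gpus_per_worker: int) -> dict:
    """Hypothetical helper: derive totals from the train() sizing arguments."""
    if min(num_workers, num_procs_per_worker) < 1:
        raise ValueError("num_workers and num_procs_per_worker must be >= 1")
    return {
        # torchrun starts nnodes * nproc-per-node processes in total.
        "world_size": num_workers * num_procs_per_worker,
        # resources_per_worker is requested once for every worker.
        "total_gpus": num_workers * gpus_per_worker,
    }


# Values taken from the fine-tune-bert example above.
topology = summarize_topology(num_workers=4, num_procs_per_worker=2, gpus_per_worker=2)
print(topology)
```

With these values, the job runs 8 training processes and requests 8 GPUs across the cluster, so it is worth checking that the cluster can schedule that many before submitting.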

Dataset and Model Parameter Classes

HuggingFaceModelParams

Description

The HuggingFaceModelParams dataclass holds configuration parameters for initializing Hugging Face models with validation checks.

| Attribute | Type | Description |
| --- | --- | --- |
| model_uri | str | URI or path to the Hugging Face model (must not be empty). |
| transformer_type | TRANSFORMER_TYPES | Specifies the model type for various NLP/ML tasks. |
| access_token | Optional[str] (default: None) | Token for accessing private models on Hugging Face. |
| num_labels | Optional[int] (default: None) | Number of output labels (used for classification tasks). |

Supported Transformer Types (TRANSFORMER_TYPES)

| Model Type | Task |
| --- | --- |
| AutoModelForSequenceClassification | Text classification |
| AutoModelForTokenClassification | Named entity recognition |
| AutoModelForQuestionAnswering | Question answering |
| AutoModelForCausalLM | Text generation (causal) |
| AutoModelForMaskedLM | Masked language modeling |
| AutoModelForImageClassification | Image classification |

Example Usage

from transformers import AutoModelForSequenceClassification
from kubeflow.storage_initializer.hugging_face import HuggingFaceModelParams

params = HuggingFaceModelParams(
    model_uri="bert-base-uncased",
    transformer_type=AutoModelForSequenceClassification,
    access_token="huggingface_access_token",
    num_labels=2  # For binary classification
)
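The description above mentions validation checks, and the attribute list says model_uri must not be empty. A minimal sketch of how such a check can be written with a dataclass follows. ModelParamsSketch is an illustrative stand-in that only mirrors the documented attributes; it is not the SDK class, and the SDK's own checks may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelParamsSketch:
    """Illustrative stand-in for the documented attributes, not the SDK class."""

    model_uri: str
    access_token: Optional[str] = None
    num_labels: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject a missing or empty URI at construction time.
        if not self.model_uri:
            raise ValueError("model_uri cannot be empty.")


try:
    ModelParamsSketch(model_uri="")
except ValueError as err:
    print(f"rejected: {err}")
```

Validating in __post_init__ surfaces a bad configuration on the client, before any job is submitted.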

HuggingFaceDatasetParams

Description

The HuggingFaceDatasetParams class holds configuration parameters for loading datasets from Hugging Face with validation checks.

| Attribute | Type | Description |
| --- | --- | --- |
| repo_id | str | Identifier of the dataset repository on Hugging Face (must not be empty). |
| access_token | Optional[str] (default: None) | Token for accessing private datasets on Hugging Face. |
| split | Optional[str] (default: None) | Dataset split to load (e.g., "train", "test"). |

Example Usage

from kubeflow.storage_initializer.hugging_face import HuggingFaceDatasetParams

dataset_params = HuggingFaceDatasetParams(
    repo_id="imdb",            # Public dataset repository ID on Hugging Face
    split="train",             # Dataset split to load
    access_token=None          # Not needed for public datasets
)
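For private datasets, access_token has to be supplied. One possible way to avoid hard-coding it is to read it from the environment, as sketched below. The helper read_hf_token and the default variable name HF_TOKEN are assumptions made for illustration; neither comes from this page.

```python
import os
from typing import Optional


def read_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token from the environment, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


# Keyword arguments matching the documented HuggingFaceDatasetParams attributes.
dataset_kwargs = {
    "repo_id": "imdb",
    "split": "train",
    "access_token": read_hf_token(),  # stays None for public datasets
}
print(sorted(dataset_kwargs))
```

The dictionary can then be unpacked into the parameter class, for example HuggingFaceDatasetParams(**dataset_kwargs).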

HuggingFaceTrainerParams

Description

The HuggingFaceTrainerParams class is used to define parameters for the training process in the Hugging Face framework. It includes the training arguments and LoRA configuration to optimize model training.

| Parameter | Type | Description |
| --- | --- | --- |
| training_parameters | transformers.TrainingArguments | Contains the training arguments like learning rate, epochs, batch size, etc. |
| lora_config | LoraConfig | LoRA configuration to reduce the number of trainable parameters in the model. |

Example Usage

from transformers import TrainingArguments
from peft import LoraConfig
from kubeflow.storage_initializer.hugging_face import HuggingFaceTrainerParams

trainer_params = HuggingFaceTrainerParams(
    training_parameters=TrainingArguments(
        output_dir="results",
        learning_rate=2e-5,
        num_train_epochs=3,
        per_device_train_batch_size=8,
    ),
    lora_config=LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        bias="none",
    ),
)
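To give a sense of how LoRA reduces the number of trainable parameters, the sketch below counts the parameters for a single weight matrix. LoRA freezes the original matrix and trains two low-rank factors, so the trainable count per adapted matrix is r * (d_out + d_in). The 768 x 768 shape is an assumption (the size of a BERT-base attention projection), and the alpha / r scaling is the standard LoRA formulation rather than anything stated on this page.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


d = 768  # assumed hidden size of one BERT-base attention projection
full = d * d  # parameters trained if the whole matrix were fine-tuned
lora = lora_trainable_params(d, d, r=8)
scaling = 16 / 8  # lora_alpha / r, applied to the low-rank update

print(full, lora, scaling)  # 589824 12288 2.0
```

For this one matrix, r=8 trains about 2% of the parameters that full fine-tuning would.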

S3DatasetParams

Description

The S3DatasetParams class is used for loading datasets from S3-compatible object storage. It includes validation checks to ensure proper configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| endpoint_url | str | URL of the S3-compatible storage service. |
| bucket_name | str | Name of the S3 bucket containing the dataset. |
| file_key | str | Key (path) to the dataset file within the bucket. |
| region_name | str, optional | The AWS region of the S3 bucket (optional). |
| access_key | str, optional | The access key for authentication with S3 (optional). |
| secret_key | str, optional | The secret key for authentication with S3 (optional). |

Implementation Details

The S3DatasetParams class includes validation checks to ensure required parameters are provided and the endpoint URL is valid. The actual dataset download is handled by the S3 class which uses boto3 to interact with the S3-compatible storage.
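The kind of validation described here can be sketched with the standard library alone. The function check_s3_settings below is hypothetical: it shows one plausible way to require the mandatory fields and accept only http(s) endpoints. The SDK's actual rules may differ.

```python
from urllib.parse import urlparse


def check_s3_settings(endpoint_url: str, bucket_name: str, file_key: str) -> None:
    """Hypothetical checks; the SDK's own validation may differ."""
    required = {
        "endpoint_url": endpoint_url,
        "bucket_name": bucket_name,
        "file_key": file_key,
    }
    for name, value in required.items():
        if not value:
            raise ValueError(f"{name} must not be empty")

    # A usable endpoint needs an http(s) scheme and a host part.
    parsed = urlparse(endpoint_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"endpoint_url is not a valid http(s) URL: {endpoint_url!r}")


# Passes silently for a well-formed configuration.
check_s3_settings("https://s3.amazonaws.com", "my-dataset-bucket", "datasets/train.csv")
```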

Example Usage

from kubeflow.storage_initializer.s3 import S3DatasetParams

s3_params = S3DatasetParams(
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="my-dataset-bucket",
    file_key="datasets/train.csv",
    region_name="us-west-2",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY"
)

Using custom images with Fine-Tuning API

Platform engineers can customize the storage initializer and trainer images by setting the STORAGE_INITIALIZER_IMAGE and TRAINER_TRANSFORMER_IMAGE environment variables before executing the train command.

For example, set the environment variables in your Python code before calling train:

import os

from kubeflow.training import TrainingClient

...
# Set the custom images before calling train.
os.environ['STORAGE_INITIALIZER_IMAGE'] = 'docker.io/<username>/<custom-storage-initializer_image>'
os.environ['TRAINER_TRANSFORMER_IMAGE'] = 'docker.io/<username>/<custom-trainer_transformer_image>'

TrainingClient().train(...)
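If the overrides should apply to a single train call only, one option is to set them temporarily and restore the previous values afterwards. The sketch below assumes the SDK reads the variables when train is called, as the text above implies. The temporary_env helper and the registry.example.com image names are hypothetical.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def temporary_env(overrides: Dict[str, str]) -> Iterator[None]:
    """Set environment variables inside the with-block, then restore the old values."""
    previous = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old in previous.items():
            if old is None:
                os.environ.pop(key, None)  # variable did not exist before
            else:
                os.environ[key] = old


# Hypothetical image names, for illustration only.
images = {
    "STORAGE_INITIALIZER_IMAGE": "registry.example.com/team/storage-initializer:dev",
    "TRAINER_TRANSFORMER_IMAGE": "registry.example.com/team/trainer:dev",
}
with temporary_env(images):
    # TrainingClient().train(...) would be called here.
    print(os.environ["STORAGE_INITIALIZER_IMAGE"])
```

Restoring the environment in a finally block keeps later jobs in the same process on the default images, even if the train call raises.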

Next Steps

Run the example to fine-tune the TinyLlama LLM.

Check this example to compare the create_job and train Python APIs for fine-tuning the BERT LLM.

Understand the architecture behind the train API.