Prometheus Monitoring¶

Prometheus Metrics for the Training Operator

Old Version

This page is about Kubeflow Training Operator V1, for the latest information check the Kubeflow Trainer V2 documentation.

Follow this guide for migrating to Kubeflow Trainer V2.

This guide explains how to monitor Kubeflow training jobs using Prometheus metrics. The Training Operator exposes these metrics, providing essential insights into the status of distributed machine learning workloads.

Note

Metrics are only generated in response to specific events. For example, job creation metrics will only appear after a job has been created. If a metric is not visible, it may be because the corresponding event has not occurred yet.

Prometheus Metrics for Training Operator¶

The Training Operator includes a built-in /metrics endpoint exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.

Configuring Metrics Port¶

By default, metrics are exposed on port 8080 and can be scraped from any IP address.

If you want to change the default port for metrics exporting and limit which IP address can scrape the metrics, simply add the metrics-bind-address argument.

For example:

# deployment.yaml for the Training Operator
spec:
    containers:
    - command:
        - /manager
        image: kubeflow/training-operator
        name: training-operator
        ports:
        - containerPort: 8080
        - containerPort: 9443
            name: webhook-server
            protocol: TCP
        args:
        - "--metrics-bind-address=192.168.1.100:8082"

Explanation:

--metrics-bind-address=192.168.1.100:8082 specifies that metrics are now available on port 8082, restricted to the IP address 192.168.1.100. Alternatively, you can bind the metrics to all interfaces by using 0.0.0.0:8082.

Accessing the Metrics¶

The method to access these metrics may vary depending on your Kubernetes setup and environment. For example, use the following command for local environments:

kubectl port-forward -n kubeflow deployment/training-operator 8080:8080

Then you’ll see metrics in this format via http://localhost:8080/metrics:

# HELP training_operator_jobs_created_total Counts number of jobs created
# TYPE training_operator_jobs_created_total counter
training_operator_jobs_created_total{framework="tensorflow",job_namespace="kubeflow"} 7

List of Job Metrics¶

Metric name	Description	Labels
`training_operator_jobs_created_total`	Total number of jobs created	`namespace`, `framework`
`training_operator_jobs_deleted_total`	Total number of jobs deleted	`namespace`, `framework`
`training_operator_jobs_successful_total`	Total number of successful jobs	`namespace`, `framework`
`training_operator_jobs_failed_total`	Total number of failed jobs	`namespace`, `framework`
`training_operator_jobs_restarted_total`	Total number of restarted jobs	`namespace`, `framework`

Labels information can be interpreted as follows:

Label name	Description
`namespace`	The Kubernetes namespace where the job is running
`framework`	The machine learning framework used (e.g. TensorFlow, PyTorch)