Observability and Metrics Guide

Enable metrics for the Remediator Agent

The Remediator Agent exposes Prometheus metrics via the controller-runtime metrics endpoint. Use this guide to:

  • Understand key metrics and labels
  • Collect via OpenTelemetry Collector (recommended) and export to Grafana Cloud or any Prometheus-compatible backend
  • Quickly test locally via port-forward + curl

Prerequisites

  • Kubernetes cluster with the agent deployed in namespace go-agent-remediator-system (default from kustomize).
  • OpenTelemetry Collector (OTel Collector) installed in-cluster (deployment or daemonset). If not installed, see the example manifest below.
  • Optional: Grafana Cloud account to visualize and store metrics.
    • Grafana Cloud stack (region), OTLP endpoint, and an API token with metrics:write (and optionally traces/logs if you use them later).
    • Example OTLP endpoint: https://otlp-gateway-<region>.grafana.net/otlp

Key metrics

  • remediator_reconciles_total (counter), labels: result="success|error"
  • remediator_reconcile_duration_seconds (histogram), labels: result="success|error"
  • violations_active (gauge), labels: cluster, application, severity
  • remediation_plans_generated_total (counter), labels: cluster, application
  • actions_executed_total (counter), labels: type, status="success|error"
  • pr_opened_total (counter), labels: repo, application, cluster
  • pr_merged_total (counter) and pr_merge_latency_seconds (histogram) are defined but not yet emitted; their emission hooks are pending.

Use the OTel Collector to scrape the agent’s metrics (Prometheus receiver) and export to your destination (OTLP to Grafana Cloud shown below).

Create a Secret with your Grafana Cloud OTLP API key and region:

apiVersion: v1
kind: Secret
metadata:
  name: otel-grafana-cloud
  namespace: go-agent-remediator-system
stringData:
  GRAFANA_CLOUD_REGION: "<your-region>"            # zone slug from your stack's OTLP endpoint, e.g., prod-us-east-0
  GRAFANA_CLOUD_OTLP_API_KEY: "<your-otlp-api-key>" # token with metrics:write
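
If you would rather not write the token into a manifest file, the same Secret can be created imperatively. The following is a minimal sketch, assuming kubectl is already pointed at the target cluster. The function name create_grafana_secret is hypothetical; the Secret name, namespace, and keys are the ones from the manifest above.

```shell
# Hypothetical helper: creates or updates the otel-grafana-cloud Secret.
#   $1 = value for GRAFANA_CLOUD_REGION
#   $2 = value for GRAFANA_CLOUD_OTLP_API_KEY
create_grafana_secret() {
  # --dry-run=client renders the Secret locally without contacting the API
  # server; piping it to "kubectl apply" makes the command safe to re-run,
  # for example after rotating the token.
  kubectl -n go-agent-remediator-system create secret generic otel-grafana-cloud \
    --from-literal=GRAFANA_CLOUD_REGION="$1" \
    --from-literal=GRAFANA_CLOUD_OTLP_API_KEY="$2" \
    --dry-run=client -o yaml |
    kubectl apply -f -
}
```

Call it as create_grafana_secret "<your-region>" "<your-otlp-api-key>". Note that the token may still end up in your shell history unless you read it from a file or a prompt.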

Create the OTel Collector ConfigMap. It defines a Prometheus scrape job for the agent and an OTLP/HTTP exporter for Grafana Cloud. The exporter reads GRAFANA_CLOUD_REGION and GRAFANA_CLOUD_OTLP_API_KEY from the Collector’s environment, which the Deployment in the next step populates from the Secret:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: go-agent-remediator-system
data:
  config.yaml: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: go-agent-remediator
              scheme: https
              tls_config:
                insecure_skip_verify: true        # For dev; use proper certs in prod
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: endpoints
                  namespaces:
                    names: ["go-agent-remediator-system"]   # namespace where the agent runs
              relabel_configs:
                - action: keep
                  source_labels: [__meta_kubernetes_service_label_control_plane]
                  regex: controller-manager
                - action: keep
                  source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
                  regex: go-agent-remediator
                - action: keep
                  source_labels: [__meta_kubernetes_endpoint_port_name]
                  regex: https
    processors:
      batch: {}
    exporters:
      otlphttp/grafana:
        # Grafana Cloud OTLP gateway
        endpoint: "https://otlp-gateway-${GRAFANA_CLOUD_REGION}.grafana.net/otlp"
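        headers:
          # Assumes the gateway accepts this token as a Bearer credential. Grafana Cloud's
          # OTLP documentation describes Basic auth (base64 of "<instance-id>:<token>");
          # switch schemes if the gateway answers 401.
          Authorization: "Bearer ${GRAFANA_CLOUD_OTLP_API_KEY}"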
        tls:
          insecure_skip_verify: false
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [otlphttp/grafana]

Deploy the OTel Collector (single replica example):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: go-agent-remediator-system
spec:
  replicas: 1
  selector:
    matchLabels: { app: otel-collector }
  template:
    metadata:
      labels: { app: otel-collector }
    spec:
      serviceAccountName: go-agent-remediator-controller-manager
      containers:
        - name: otelcol
          image: otel/opentelemetry-collector:0.104.0
          args: ["--config=/conf/config.yaml"]
          env:
            - name: GRAFANA_CLOUD_REGION
              valueFrom:
                secretKeyRef: { name: otel-grafana-cloud, key: GRAFANA_CLOUD_REGION }
            - name: GRAFANA_CLOUD_OTLP_API_KEY
              valueFrom:
                secretKeyRef: { name: otel-grafana-cloud, key: GRAFANA_CLOUD_OTLP_API_KEY }
          volumeMounts:
            - name: config
              mountPath: /conf
          ports:
            - name: metrics
              containerPort: 8888
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
            items:
              - key: config.yaml
                path: config.yaml

Notes:

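  • The scrape job uses Kubernetes service discovery (role: endpoints) and keeps only targets whose Service carries the labels control-plane=controller-manager and app.kubernetes.io/name=go-agent-remediator and whose port is named https.
  • The bearer token is the ServiceAccount token mounted into the Collector pod; it satisfies the metrics endpoint’s authn/authz filter. The example reuses the controller manager’s ServiceAccount, which must also be allowed to list and watch Services, Endpoints, and Pods in the scraped namespace for service discovery to work.
  • For production, give the metrics endpoint a certificate the Collector can verify and remove insecure_skip_verify from the scrape job.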
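
To see what the relabel rules in the scrape job can match, you can list the Services that carry both labels together with their port names. This is a sketch only: the function name list_scrape_candidates is hypothetical, and the default namespace is the one from Prerequisites. Pass a different namespace if your scrape job lists another one under namespaces.names.

```shell
# Hypothetical helper: lists Services, with their port names, that carry the
# two labels the scrape job keeps on.
#   $1 = namespace to inspect (defaults to go-agent-remediator-system)
list_scrape_candidates() {
  # The Prometheus meta label ..._service_label_app_kubernetes_io_name maps to
  # the Kubernetes label app.kubernetes.io/name.
  kubectl -n "${1:-go-agent-remediator-system}" get svc \
    -l control-plane=controller-manager,app.kubernetes.io/name=go-agent-remediator \
    -o custom-columns='NAME:.metadata.name,PORTS:.spec.ports[*].name'
}
```

At least one Service should be listed, and its PORTS column should include https. If the list is empty, or no port has that name, the keep rules drop every target and the Collector exports nothing for the agent.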

Alternative: Prometheus Operator

If you prefer Prometheus Operator, enable config/prometheus/monitor.yaml and make sure your Prometheus instance selects the agent’s Service and namespace. The OTel Collector pattern above remains the recommended default because it can export to multiple backends and can later carry traces and logs as well.

Quick local test via port-forward + curl

Secure metrics are enabled by default, so requests must use HTTPS and include a valid bearer token. For a quick local test, skip certificate verification.

  1. Port-forward the controller manager Deployment:
kubectl -n go-agent-remediator-system port-forward deploy/go-agent-remediator-controller-manager 8443:8443
  2. Request a token for a ServiceAccount with permission to view metrics (e.g., the controller manager SA). kubectl create token requires Kubernetes 1.24 or later; from that version on, clusters no longer auto-create the token Secrets that older instructions read from .secrets[0]:
SA=go-agent-remediator-controller-manager
NS=go-agent-remediator-system
TOKEN=$(kubectl -n "$NS" create token "$SA")
  3. Curl the metrics endpoint (HTTPS, insecure):
curl -k -H "Authorization: Bearer $TOKEN" https://localhost:8443/metrics

If you prefer HTTP without TLS (dev only), run the manager with --metrics-secure=false and bind to :8080, then:

kubectl -n go-agent-remediator-system port-forward deploy/go-agent-remediator-controller-manager 8080:8080
curl http://localhost:8080/metrics
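
The raw output is long, because controller-runtime serves its own runtime and controller metrics on the same endpoint. A small filter makes the agent’s metrics easier to spot. This is a sketch: the function name agent_metrics is hypothetical, and the prefixes are taken from the Key metrics list above.

```shell
# Hypothetical helper: reads Prometheus text exposition on stdin and prints
# only sample lines whose metric name starts with one of the agent's
# documented prefixes. "# HELP" and "# TYPE" comment lines are dropped,
# because they begin with "#".
agent_metrics() {
  grep -E '^(remediator_reconcile|violations_active|remediation_plans_generated_total|actions_executed_total|pr_opened_total)'
}
```

For example, pipe either curl command into it: curl http://localhost:8080/metrics | agent_metrics. Counters with labels only show up after their first increment, so an empty result on a fresh deployment does not necessarily indicate a problem.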

Example Grafana panels (PromQL)

  • Reconcile success ratio (1h):
sum(rate(remediator_reconciles_total{result="success"}[1h]))
/
sum(rate(remediator_reconciles_total[1h]))
  • Reconcile latency p95 (1h):
histogram_quantile(0.95,
  sum by (le) (rate(remediator_reconcile_duration_seconds_bucket[1h]))
)
  • Active violations by severity:
sum by (severity) (violations_active)
  • Plans generated (rate, by application):
sum by (application) (rate(remediation_plans_generated_total[1h]))
  • Actions success/failure (rate, by type):
sum by (type, status) (rate(actions_executed_total[1h]))
  • PRs opened (last 24h, top 5 repos):
topk(5, sum by (repo) (increase(pr_opened_total[24h])))

Troubleshooting

  • Empty metrics: confirm metrics Service exists, OTel Collector is running, and the scrape job selects the correct Service/port.
  • 403/401 when curling: include a valid bearer token with access to the metrics endpoint.
  • TLS errors: use -k (insecure) for quick testing, or configure proper certs for the metrics endpoint and OTel Collector.
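
For the 401/403 case, it can help to ask the API server directly whether a ServiceAccount is allowed to read the endpoint. The sketch below assumes the metrics endpoint authorizes callers against the /metrics non-resource URL, as the authn/authz filter described earlier typically does, and that your own user may impersonate ServiceAccounts. The function name can_read_metrics is hypothetical.

```shell
# Hypothetical helper: asks the API server whether a ServiceAccount may GET
# the /metrics non-resource URL. kubectl prints "yes" or "no".
#   $1 = namespace of the ServiceAccount
#   $2 = name of the ServiceAccount
can_read_metrics() {
  kubectl auth can-i get /metrics --as="system:serviceaccount:${1}:${2}"
}
```

For example: can_read_metrics go-agent-remediator-system go-agent-remediator-controller-manager. An answer of "no" points to a missing ClusterRole or binding that grants get on /metrics, not to a problem with the token itself.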