Monitoring
The platform exposes operational metrics independently of the customer-facing Activity/Usage views. This page covers what to scrape, what to alert on, and where to look when something looks off.
For the customer-facing operator view (workloads, threads, usage in the Console), see Administer → Monitoring.
Metrics
Every platform service exposes Prometheus metrics on :8080/metrics (or the port set in the chart). Standard sets:
- gRPC server metrics — request rate, error rate, latency by method (via the go-grpc-prometheus middleware most services use).
- HTTP server metrics — Gateway and LLM Proxy expose RED metrics.
- DB metrics — open connections, idle connections, query latency.
- Custom service metrics — see each service's metrics page in its repo.
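To spot-check what a service exposes before Prometheus scrapes it, the text exposition format can be parsed with a few lines of Python. This is a minimal sketch, not a full parser: the function names `fetch_metrics` and `parse_metrics` are illustrative, the URL in the usage note is an assumption, and HELP/TYPE comments and unparseable lines are skipped.

```python
import re
import urllib.request
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]

# Matches sample lines such as: name{label="v",other="w"} 12.5 [timestamp]
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# Matches one label pair inside the braces, allowing escaped characters.
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw exposition text from a /metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> Dict[Tuple[str, Labels], float]:
    """Map (metric name, sorted label pairs) to the sample value."""
    samples: Dict[Tuple[str, Labels], float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # skip anything this sketch does not understand
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(_LABEL.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples
```

For example, `parse_metrics(fetch_metrics("http://localhost:8080/metrics"))` would return the current samples of a service reachable on that address (for instance through a port-forward); the host is an assumption, the port and path are the defaults named above.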
Per-service charts ship ServiceMonitor resources for Prometheus Operator behind a value flag. With kube-prometheus-stack installed, set the flag in the relevant <service>_values block in stacks/platform/main.tf (look for monitoring or serviceMonitor) and re-apply the platform stack:
terraform -chdir=stacks/platform apply
Argo CD reconciles each service and the ServiceMonitor resources appear. Metrics start showing up in Prometheus within a few scrape intervals.
Health checks
Each service exposes:
- /healthz — liveness. Returns 200 if the process is up.
- /readyz — readiness. Returns 200 only when DB connections and dependent services are reachable.
Kubernetes uses /healthz for the liveness probe and /readyz for the readiness probe.
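The two endpoints can also be probed by hand when a pod looks unhealthy. The sketch below mirrors how Kubernetes interprets HTTP probes (any status from 200 to 399 counts as success). The names `probe`, `pod_state`, and `check_service` are illustrative, and the base URL is an assumption because this page does not state which port serves the health endpoints.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code, or 0 if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def _ok(status: int) -> bool:
    # Kubernetes HTTP probes succeed on any status in [200, 400).
    return 200 <= status < 400


def pod_state(live_status: int, ready_status: int) -> str:
    """Summarize what the two probe results mean for the pod."""
    if not _ok(live_status):
        return "restart"    # a failed liveness probe restarts the container
    if not _ok(ready_status):
        return "not-ready"  # the pod stops receiving Service traffic
    return "serving"


def check_service(base_url: str) -> str:
    """Probe both endpoints of one service (base_url is an assumption)."""
    return pod_state(probe(base_url + "/healthz"), probe(base_url + "/readyz"))
```

A result of "not-ready" with a healthy liveness probe usually points at a dependency (database or downstream service) rather than at the process itself.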
Suggested alerts
| Alert | Trigger | What it usually means |
|---|---|---|
| Gateway error rate > 1% | gateway_grpc_server_handled_total{code!="OK"} over 5m | A downstream service is failing or an authz bug landed. |
| Orchestrator no recent reconciliation | orchestrator_reconcile_last_success_timestamp older than 60s | Orchestrator stuck. Check leader election. |
| Workload start failure rate > 5% | `workload_failures_total{reason=~"start_failed\|config_invalid"}` | |
| Tracing ingest queue depth | tracing_ingest_queue_depth > 10000 | Tracing writer can't keep up. |
| Authorization check p95 > 100ms | authorization_check_latency_seconds_bucket | OpenFGA database hot. |
| Postgres connection saturation > 80% | per-service DB metrics | Increase the pool size or add connection pooling (PgBouncer). |
| LLM Proxy 429 rate > 1% | llm_proxy_upstream_status_code{status="429"} | Provider rate-limited. Adjust per-provider concurrency. |
| Notifications subscriber count cliff | notifications_active_subscribers drops sharply | UI fleet disconnected. Check ingress. |
| OpenZiti router down | OpenZiti's own metrics | Workloads can't reach Gateway / LLM Proxy. |
These are starting points — tune to your environment.
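Several of the triggers above are ratios of counter increases over a window (Gateway error rate, workload start failures, LLM Proxy 429 rate). In Prometheus this is normally written as a ratio of `rate(...[5m])` expressions. The sketch below shows the same arithmetic in Python so a threshold can be sanity-checked against two raw scrapes. The function names and the sample numbers are illustrative, and a counter reset is handled in a simplified way.

```python
from typing import Dict


def counter_increase(start: float, end: float) -> float:
    """Increase of a counter between two scrapes.

    If the counter went backwards, assume the process restarted and the
    counter began again from zero (a simplification of reset handling).
    """
    return end - start if end >= start else end


def error_ratio(
    start: Dict[str, float],
    end: Dict[str, float],
    ok_code: str = "OK",
) -> float:
    """Fraction of requests in the window whose code is not ok_code.

    start and end map a status code label to the counter value at the
    beginning and at the end of the window.
    """
    total = 0.0
    errors = 0.0
    for code, end_value in end.items():
        increase = counter_increase(start.get(code, 0.0), end_value)
        total += increase
        if code != ok_code:
            errors += increase
    return errors / total if total > 0 else 0.0


if __name__ == "__main__":
    # Made-up scrapes taken five minutes apart.
    before = {"OK": 1000.0, "Internal": 5.0}
    after = {"OK": 1980.0, "Internal": 25.0}
    ratio = error_ratio(before, after)
    print(f"error ratio {ratio:.2%}, alert fires: {ratio > 0.01}")
```

With the made-up scrapes, 20 of 1000 requests failed in the window, so a 1% threshold would fire.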
Dashboards
The platform charts ship Grafana dashboards under charts/platform/grafana/. Import them into your Grafana:
- Platform overview — request rate / error rate per service.
- Agent workloads — concurrent workloads, start failures, idle-timeout rate, per-runner distribution.
- LLM usage — token rate, cache hit rate, per-model error rate.
- Tracing throughput — spans/sec, queue depth, DB write latency.
- OpenFGA — check rate, ListObjects rate, tuple growth.
Log aggregation
Send service logs to your central log store (Elasticsearch, Loki, CloudWatch, Splunk). Each service emits structured JSON logs with at minimum:
- service
- level
- time
- msg
- identity_id (when authenticated)
- method / path
- code / status
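When building tooling that has to emit or replay logs in the same shape, the minimum fields listed above can be produced with a small formatter. This is a sketch of the log shape only, not the services' actual logging code: the class name `JsonFormatter` is illustrative, and the timestamp format (ISO 8601, UTC) and lowercase level names are assumptions.

```python
import json
import logging
from datetime import datetime, timezone

# Optional per-request fields; each is emitted only when present.
_OPTIONAL_FIELDS = ("identity_id", "method", "path", "code", "status")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "service": self.service,
            "level": record.levelname.lower(),
            "time": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "msg": record.getMessage(),
        }
        # Fields passed through `extra=` become attributes on the record.
        for field in _OPTIONAL_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("gateway"))
    log = logging.getLogger("example")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("request handled", extra={"path": "/healthz", "status": 200})
```

Keeping one JSON object per line lets the log store index fields such as `service` and `identity_id` without custom parsing rules.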
For privileged decisions (cluster admin grants, deletions, configuration changes), services emit dedicated audit events. See Logging & audit.
Tracing the platform itself
The platform's own internal traces are emitted alongside agent traces. Filter by service.name=<platform-service> in your trace explorer. This is useful for debugging Gateway-to-service latency, slow OpenFGA checks, or stuck reconciliation loops.
Capacity tracking
Track these over time to predict scaling needs:
- Concurrent agent workloads — drives runner sizing.
- Daily LLM token spend — drives provider quota and budget.
- Database size growth — Tracing and Threads grow fastest.
- Per-organization growth — number of orgs, users per org, agents per org.
A monthly capacity review against these numbers surfaces scaling problems before they turn into incidents.
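For the growth figures above, a straight-line fit over recent samples gives a rough time-to-limit estimate for the review. The sketch below assumes equally spaced daily samples and linear growth, which may not hold for Tracing or Threads tables. The function name `days_until_full` and the numbers in the example are illustrative.

```python
from typing import Optional, Sequence


def days_until_full(
    sizes_gb: Sequence[float], capacity_gb: float
) -> Optional[float]:
    """Estimate days until capacity is reached from daily size samples.

    sizes_gb holds one sample per day, oldest first. Returns None when
    there are too few samples or the fitted trend is not growing.
    """
    n = len(sizes_gb)
    if n < 2:
        return None
    # Ordinary least-squares slope with x = 0, 1, ..., n-1 (days).
    mean_x = (n - 1) / 2
    mean_y = sum(sizes_gb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(sizes_gb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var  # GB per day
    if slope <= 0:
        return None
    remaining = max(capacity_gb - sizes_gb[-1], 0.0)
    return remaining / slope


if __name__ == "__main__":
    # Made-up samples: the database grows about 2 GB per day.
    print(days_until_full([100.0, 102.0, 104.0, 106.0], capacity_gb=126.0))
```

The same function works for any of the tracked quantities (token spend against quota, concurrent workloads against runner capacity) as long as the samples are evenly spaced.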
SLOs
Reasonable starting targets:
| SLO | Target |
|---|---|
| Gateway request availability | 99.9% |
| Workload start success rate | 99% |
| Chat message round-trip (user → agent reply) p95 | < 30s (model-dependent) |
| Tracing ingest delay | < 5s |
| Console page load p95 | < 2s |
Track these via Prometheus rules and review monthly.
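An availability target translates directly into an error budget, which makes the monthly review concrete. The sketch below shows the arithmetic for a time-based and a request-based budget. The function names, the 30-day window, and the request counts are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the target allows in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the request-based error budget still unspent.

    1.0 means nothing is spent, 0.0 means the budget is exhausted, and
    negative values mean the SLO has been missed for the window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 99.9% over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # Made-up traffic: 250 failures out of 1,000,000 requests.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

For the 99.9% Gateway target this works out to about 43.2 minutes per 30 days, and for the 99% workload start target to 1 failed start per 100 attempts.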
Related
- Architecture overview
- Scaling
- Logging & audit
- Administer → Monitoring — customer-facing view.