Metrics & Monitoring

E.D.D.I exposes comprehensive metrics via Micrometer in Prometheus format, covering conversations, tool execution, caching, rate limiting, cost tracking, multi-agent group discussions, scheduled triggers, tenant quotas, audit integrity, and JVM internals.

Quick Start — Grafana Dashboard

E.D.D.I ships with a pre-built Operations Command Center dashboard (45 panels, 9 rows) that auto-provisions into Grafana.

Enable Monitoring

# Docker Compose
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d

# Or via the install wizard
./install.sh --with-monitoring     # Linux / macOS
./install.ps1 -WithMonitoring      # Windows
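
Once the stack is running, the following endpoints are available:

| Service | URL | Credentials |
| --- | --- | --- |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | |
| Metrics | http://localhost:7070/q/metrics | |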

The dashboard appears automatically as the Grafana home page. Anonymous viewer access is enabled by default.
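To confirm that the metrics endpoint is serving data before opening Grafana, the raw Prometheus text output can be inspected directly. The following is a minimal sketch, not part of E.D.D.I itself: the helper names `fetch_metrics_text` and `parse_samples` are hypothetical, and the parser is deliberately simplified (it skips comment lines and ignores optional timestamps).

```python
from urllib.request import urlopen

# Metrics endpoint from the table above.
METRICS_URL = "http://localhost:7070/q/metrics"


def fetch_metrics_text(url: str = METRICS_URL, timeout: float = 5.0) -> str:
    """Download the raw Prometheus text exposition (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_samples(text: str) -> dict[str, float]:
    """Map each sample line ('name{labels} value') to its float value.

    Simplified: '# HELP' / '# TYPE' comments and blank lines are skipped,
    and any trailing timestamp is ignored.
    """
    samples: dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Keep the label set as part of the key; label values may contain spaces.
            end = line.rindex("}") + 1
            key, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            key, rest = parts[0], parts[1:]
        if rest:
            samples[key] = float(rest[0])
    return samples
```

With the monitoring stack running, `parse_samples(fetch_metrics_text())` returns a dictionary that can be filtered for keys starting with `eddi_`.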

Dashboard Sections

| Row | Title | Key Panels |
| --- | --- | --- |
| KPI Strip | (always visible) | Uptime, Agents Deployed, Active Conversations, Messages/sec, Tool Success %, Cache Hit %, Error Rate, Cost/hr |
| Row 1 | Platform Overview & HTTP Traffic | Request rate by status (2xx/4xx/5xx), latency P50/P95/P99, CPU usage, top 10 slowest endpoints |
| Row 2 | Conversations | Start/end/processing rate, processing duration percentiles, active gauge, undo/redo, start vs load latency |
| Row 3 | Tool Execution Engine | Success vs failure rate, per-tool execution duration, parallel execution stats, cached/rate-limited breakdown |
| Row 4 | Tool Cache Performance | Hit rate %, hits vs misses, cache size, get/put duration |
| Row 5 | Rate Limiting & Cost | Allowed vs denied, denied by tool, total cost, budget exceeded events, cost accumulation, cost by tool |
| Row 6 | Multi-Agent Group Discussions | Started vs failed, failure rate gauge, discussion duration |
| Row 7 | Scheduled Triggers | Poll/fire/failed, fire duration, claim conflicts, dead-lettered |
| Row 8 | Tenant Quotas & Audit | Quota allowed vs denied, denied by type, audit entries dropped, tenant usage |
| Row 9 | JVM & Infrastructure | Heap/non-heap memory, threads, GC, MongoDB pool, PostgreSQL Agroal pool, NATS messaging |

Database-agnostic: Row 9 includes panels for both MongoDB (mongodb_driver_pool_*) and PostgreSQL (agroal_*). Whichever backend is active shows data; the other gracefully shows "No data".


Metrics Reference

All metrics are accessible at /q/metrics. Micrometer uses dot notation (e.g., eddi.tool.cache.hits); the Prometheus registry exports these names in underscore notation and appends a _total suffix to counters (e.g., eddi_tool_cache_hits_total).
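The naming rule can be illustrated with a short sketch. The function `to_prometheus_name` is a hypothetical helper written for this page, and it covers only the two rules stated above (dots become underscores, counters gain `_total`); the real registry applies further conventions, such as unit suffixes on timers, that are not modeled here.

```python
def to_prometheus_name(micrometer_name: str, is_counter: bool = False) -> str:
    """Approximate the Prometheus name for a Micrometer meter (simplified)."""
    # Rule 1: dot notation becomes underscore notation.
    name = micrometer_name.replace(".", "_")
    # Rule 2: counters get a _total suffix unless they already have one.
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name


# The example from the paragraph above.
print(to_prometheus_name("eddi.tool.cache.hits", is_counter=True))
```

This is useful when translating a meter name seen in application code into the name to type into a Prometheus or Grafana query.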

Conversation Metrics

Tool Execution Metrics

All execution metrics support a tool label for per-tool breakdown:

Tool Cache Metrics

Rate Limiting Metrics

Per-tool and aggregate queries:

Cost Tracking Metrics

Per-tool breakdown:

Group Discussion Metrics

Scheduled Trigger Metrics

Tenant Quota Metrics

Quota denied counters include type and tenant labels:

Audit Ledger Metrics

Deployed Agents

NATS Messaging Metrics

These metrics are only emitted when the NATS messaging profile is active. With in-memory messaging, the corresponding panels show no data.

JVM & HTTP Server (auto-exposed)

Standard Micrometer metrics for Quarkus:

Database Connection Pool (auto-exposed)

MongoDB (when eddi.datastore.type=mongo):

PostgreSQL / Agroal (when eddi.datastore.type=postgres):


Prometheus Alerts

Sample Alert Rules


REST API Endpoints

E.D.D.I also exposes tool metrics via REST:


Monitoring Best Practices

Key Metrics to Watch

| Metric | Target | Why |
| --- | --- | --- |
| Cache Hit Rate | > 70% | Below this, tool calls are mostly un-cached → higher latency & cost |
| Tool Success Rate | > 95% | Dropping below indicates tool integration issues |
| P95 Latency | < 2s | Conversation responsiveness depends on tool speed |
| Cost Per Request | < $0.001 | Runaway costs indicate misconfigured tools or abuse |
| Audit Drops | = 0 | Any non-zero value is a compliance incident |
| Error Rate (HTTP 5xx) | < 1% | Proxy for overall platform health |
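The targets above can also be checked programmatically, for example in a smoke test after deployment. This is a minimal sketch under stated assumptions: the dictionary keys (`cache_hit_rate`, `audit_drops`, and so on) and the function `find_violations` are hypothetical names chosen for illustration, percentages are expressed as fractions, and the observed values must be computed elsewhere (for instance from Prometheus queries).

```python
import operator

# Targets from the table above: key -> (comparison that must hold, target value).
# Keys are illustrative names, not E.D.D.I metric names.
TARGETS = {
    "cache_hit_rate": (operator.gt, 0.70),
    "tool_success_rate": (operator.gt, 0.95),
    "p95_latency_seconds": (operator.lt, 2.0),
    "cost_per_request_usd": (operator.lt, 0.001),
    "audit_drops": (operator.eq, 0),
    "http_5xx_error_rate": (operator.lt, 0.01),
}


def find_violations(observed: dict[str, float]) -> list[str]:
    """Return the keys whose observed value misses its target.

    Keys absent from `observed` are skipped rather than reported.
    """
    violations = []
    for key, (meets_target, target) in TARGETS.items():
        if key in observed and not meets_target(observed[key], target):
            violations.append(key)
    return violations


# Example: healthy cache and latency, but three dropped audit entries.
print(find_violations({"cache_hit_rate": 0.82, "p95_latency_seconds": 1.2, "audit_drops": 3}))
```

Keeping the thresholds in one table-like structure makes it easy to keep the check in sync with the targets documented here.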

Key PromQL Queries

Cache Hit Rate:
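As a hedged sketch, the hit rate could be expressed as hits divided by hits plus misses and evaluated through the Prometheus HTTP API (`/api/v1/query`). The counter `eddi_tool_cache_hits_total` appears in the Metrics Reference; the miss counter name `eddi_tool_cache_misses_total`, the 5-minute window, and the helper `instant_query_url` are assumptions made for illustration and should be checked against the actual `/q/metrics` output.

```python
from urllib.parse import urlencode

# Prometheus address from the Quick Start table.
PROMETHEUS_URL = "http://localhost:9090"

# ASSUMPTION: the miss counter is named eddi_tool_cache_misses_total.
CACHE_HIT_RATE_QUERY = (
    "sum(rate(eddi_tool_cache_hits_total[5m])) / "
    "(sum(rate(eddi_tool_cache_hits_total[5m])) + "
    "sum(rate(eddi_tool_cache_misses_total[5m])))"
)


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for a Prometheus instant query (hypothetical helper)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


print(instant_query_url(CACHE_HIT_RATE_QUERY))
```

Fetching the printed URL returns a JSON document whose `data.result` field holds the evaluated ratio, provided both counters exist under the assumed names.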

Tool Success Rate:

P95 Conversation Processing Latency:

Cost Per Hour:


Additional Resources
