Observability — Audit Logging, Tracing & Metrics
Purpose
Full visibility into what the system is doing, who did what, and where things went wrong. Four layers:
- Audit Logging — durable record of all user and system actions (who did what and when)
- Distributed Tracing — OpenTelemetry traces correlating requests across services
- Metrics & Dashboards — Grafana dashboards for system health, queue depth, cost, error rates
- Log Aggregation — Grafana Loki for structured log storage and querying
Related: Database (audit_logs and event_logs collections) | Infrastructure (Grafana + Loki in Docker Compose) | Tech Stack (OpenTelemetry in stack table)
Audit Logging
What Gets Logged
Every meaningful action in the system produces an audit log entry. The audit log is permanent — it is never deleted or truncated.
User actions:
- Approving / rejecting activities, strategies, context files
- Creating, updating, deleting tenants, users, skills, agent configs
- Enabling / disabling agents
- Plan changes, payment events
- Logging in / out (session events)
Agent / system actions:
- Activity started, completed, failed
- Goal progress updated
- Deliverable published or sent
- Strategy auto-activated
- Monthly period created
Audit Log Schema
See Database: audit_logs for the full MongoDB document schema.
Key fields: actorType, actorId, actorName, action, resourceType, resourceId, description, before, after, traceId.
Writing Audit Logs
All audit log writes go through a single auditLog() function in packages/observability/:
import { auditLog } from '@dmagency/observability';

await auditLog({
  tenantId: activity.tenantId,
  actorType: 'human',
  actorId: session.user.refId,
  actorName: session.user.name,
  action: 'approved',
  resourceType: 'activity',
  resourceId: activity.refId,
  resourceName: activity.name,
  description: `${session.user.name} approved activity "${activity.name}"`,
  before: { status: 'awaiting_approval' },
  after: { status: 'approved' },
  traceId: context.traceId,
});

This function:
- Writes to the MongoDB audit_logs collection
- Emits an OpenTelemetry span event (so it appears in the trace waterfall)
- Returns immediately (fire-and-forget, does not block the request)
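
The three behaviours listed above can be sketched as follows. This is a minimal illustration, not the package's actual code: `createAuditLog`, `AuditEntrySketch`, `AuditSinkSketch`, and the `onError` hook are hypothetical names, and the MongoDB write and span-event emission are assumed to be injected as functions so the sketch runs without a database.

```typescript
// Hypothetical entry shape; field names follow the "Key fields" list above.
interface AuditEntrySketch {
  tenantId: string;
  actorType: string;
  actorId: string;
  actorName: string;
  action: string;
  resourceType: string;
  resourceId: string;
  description: string;
  before?: Record<string, unknown>;
  after?: Record<string, unknown>;
  traceId?: string;
}

// Assumed seams: in the real package these would wrap the MongoDB driver
// and the OpenTelemetry API.
interface AuditSinkSketch {
  insert(doc: AuditEntrySketch & { createdAt: Date }): Promise<void>;
  addSpanEvent(name: string, attributes: Record<string, string>): void;
}

function createAuditLog(sink: AuditSinkSketch, onError: (err: unknown) => void) {
  return function auditLog(entry: AuditEntrySketch): Promise<void> {
    const doc = { ...entry, createdAt: new Date() };
    try {
      // Cheap and synchronous, so it is emitted inline.
      sink.addSpanEvent('audit.' + entry.action, {
        resourceType: entry.resourceType,
        resourceId: entry.resourceId,
      });
      // Start the write without awaiting it; a rejection is reported, never thrown.
      sink.insert(doc).catch(onError);
    } catch (err) {
      // Synchronous failures must not break the request either.
      onError(err);
    }
    return Promise.resolve();
  };
}
```

Because the write is not awaited, a failed insert never reaches the caller. In this sketch the `onError` hook is the only place such a failure is visible, so it should at least log the error.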
Audit Log Visibility by App
| App | Scope | Permission |
|---|---|---|
| Dashboard | Own tenant only | Admin: full audit log. Member: own actions only. |
| DM Portal | Tenants assigned to reviewer | All human + agent actions for assigned tenants |
| Manage | All tenants + platform-level | Super admin only |
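
The scoping rules in the table could be enforced by deriving a query filter from the app and the viewer. The sketch below is one possible shape, assuming a MongoDB-style filter object. The role names (`admin`, `member`, `reviewer`, `super_admin`), the `AuditViewer` type, and `auditScopeFilter` are hypothetical and not taken from the codebase.

```typescript
type AuditApp = 'dashboard' | 'dm-portal' | 'manage';

// Hypothetical viewer shape; role names are assumptions for illustration.
interface AuditViewer {
  userId: string;
  role: 'admin' | 'member' | 'reviewer' | 'super_admin';
  tenantId?: string;            // set for Dashboard users
  assignedTenantIds?: string[]; // set for DM Portal reviewers
}

// Returns a MongoDB-style filter for audit_logs, or null when access is denied.
function auditScopeFilter(app: AuditApp, viewer: AuditViewer): Record<string, unknown> | null {
  if (app === 'dashboard') {
    if (!viewer.tenantId) return null;
    // Admin: full audit log for the tenant.
    if (viewer.role === 'admin') return { tenantId: viewer.tenantId };
    // Member: own actions only.
    if (viewer.role === 'member') return { tenantId: viewer.tenantId, actorId: viewer.userId };
    return null;
  }
  if (app === 'dm-portal') {
    const assigned = viewer.assignedTenantIds || [];
    if (viewer.role !== 'reviewer' || assigned.length === 0) return null;
    // All human and agent actions, restricted to assigned tenants.
    return { tenantId: { $in: assigned } };
  }
  // Manage: all tenants plus platform-level events, super admin only.
  return viewer.role === 'super_admin' ? {} : null;
}
```

Returning `null` rather than an empty filter for denied access keeps "no restriction" (`{}`) and "no access" clearly distinct at the call site.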
Distributed Tracing (OpenTelemetry)
What is traced
Every inbound API request starts an OpenTelemetry trace. Spans are created for:
- API route handlers
- Database queries (Prisma auto-instrumented via @prisma/instrumentation)
- BullMQ job processing
- Agent adapter dispatch calls
- Phone-home callback handling
- External API calls (SEMrush, GA4, Razorpay, SendGrid)
Setup
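// packages/observability/src/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { PrismaInstrumentation } from '@prisma/instrumentation';

export const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME || 'dmagency-api',
  // No explicit url: the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT (Grafana Tempo or Jaeger)
  // and appends /v1/traces. A url passed here is used as-is, without that suffix.
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [
    getNodeAutoInstrumentations(),
    // Prisma is not covered by the auto-instrumentations bundle, so register it explicitly.
    new PrismaInstrumentation(),
  ],
});

sdk.start();

Trace ID propagation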
The traceId from the active OpenTelemetry context is included in:
- audit_logs.traceId: links an audit event to its trace
- event_logs.traceId: links an activity event to its trace
- activity_runs.traceId: links an agent run to its trace
- llm_calls.traceId: links LLM cost to its trace
This means: from a single audit log entry, you can jump to the full distributed trace that caused it.
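
Since the jump from a record to its trace only works when the stored value is a real trace ID, it may be worth validating before writing. OpenTelemetry trace IDs are 16 bytes rendered as 32 lowercase hex characters, and the all-zero value is reserved as invalid. The helper names below (`isValidTraceId`, `stampTraceId`) are hypothetical, and the sketch assumes records are plain objects.

```typescript
const TRACE_ID_PATTERN = /^[0-9a-f]{32}$/;
// 32 zeros: the reserved "invalid" trace ID.
const ALL_ZERO_TRACE_ID = '0000000000000000' + '0000000000000000';

function isValidTraceId(id: string | undefined): boolean {
  return id !== undefined && TRACE_ID_PATTERN.test(id) && id !== ALL_ZERO_TRACE_ID;
}

// Returns a copy of the record with traceId attached only when it is valid,
// so audit_logs, event_logs, activity_runs and llm_calls never store a dangling link.
function stampTraceId(
  record: Record<string, unknown>,
  traceId: string | undefined,
): Record<string, unknown> {
  if (!isValidTraceId(traceId)) return { ...record };
  return { ...record, traceId };
}
```

Omitting the field instead of storing an empty or zero ID lets the UI hide the trace link for records written outside a traced request.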
Log Aggregation (Grafana Loki)
All structured logs from all services are shipped to Grafana Loki.
Logging format
All services use pino for structured JSON logging:
import pino from 'pino';
import { trace } from '@opentelemetry/api';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: {
    service: process.env.OTEL_SERVICE_NAME,
    env: process.env.NODE_ENV,
  },
  // Add trace context to every log line
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { traceId, spanId };
  },
});

Log shipping to Loki
Log shipping uses Promtail (deployed as a sidecar in Docker Compose / Coolify):
# promtail-config.yaml
scrape_configs:
  - job_name: dmagency
    static_configs:
      - targets: [localhost]
        labels:
          app: dmagency
          __path__: /var/log/dmagency/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
            traceId: traceId
      - labels:
          level:
          service:
          traceId:

Log labels (for Loki filtering)
| Label | Values | Use |
|---|---|---|
| service | api, dashboard, dm-portal, manage | Filter by service |
| level | debug, info, warn, error | Filter by severity |
| tenantId | {uuid} | Filter by tenant (added to log context on authenticated requests) |
| traceId | {hex} | Correlate with traces |
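
These labels map directly onto a LogQL stream selector. The sketch below builds one from an optional set of label values, always including the static `app="dmagency"` label from the Promtail config. `buildLogQLSelector`, `escapeLogQLValue`, and `LokiLabelFilter` are hypothetical helper names, and the sketch assumes every label in the table is actually indexed by Loki.

```typescript
// Labels from the table above; every field is optional.
interface LokiLabelFilter {
  service?: string;
  level?: string;
  tenantId?: string;
  traceId?: string;
}

// Escape backslashes and double quotes so a value is safe inside a LogQL string literal.
function escapeLogQLValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
}

// Builds a stream selector such as {app="dmagency", service="api", level="error"}.
function buildLogQLSelector(filter: LokiLabelFilter): string {
  const keys: Array<keyof LokiLabelFilter> = ['service', 'level', 'tenantId', 'traceId'];
  const matchers: string[] = ['app="dmagency"'];
  keys.forEach((key) => {
    const value = filter[key];
    if (value !== undefined && value !== '') {
      matchers.push(key + '="' + escapeLogQLValue(value) + '"');
    }
  });
  return '{' + matchers.join(', ') + '}';
}
```

For example, `buildLogQLSelector({ service: 'api', level: 'error' })` yields a selector that can be pasted into Grafana Explore to list API errors.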
Grafana Dashboards
System Health Dashboard
- Service uptime: API, Dashboard, DM Portal, Manage — green/red status
- Error rate: HTTP 5xx per service (last 5 min rolling)
- Request latency: p50 / p95 / p99 per route
- Queue depth: BullMQ jobs waiting per agent role (all tenants aggregated)
- Dead letter queue: count of failed jobs in DLQ — alert if > 0
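
To make the panel definitions concrete, the following sketch shows how the latency percentiles and the rolling 5xx error rate could be computed from raw samples. In practice Grafana would query these from a metrics backend; the functions here (`latencyPercentile`, `errorRate5xx`) and the `RequestSample` shape are hypothetical and only illustrate the arithmetic, using the nearest-rank percentile definition.

```typescript
// Nearest-rank percentile: the smallest sample that at least p percent of samples do not exceed.
function latencyPercentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error('no latency samples');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Hypothetical request record; only the fields needed here.
interface RequestSample {
  timestampMs: number;
  status: number;
}

// Share of requests in the last windowMs (default 5 minutes) that returned HTTP 5xx.
function errorRate5xx(
  requests: RequestSample[],
  nowMs: number,
  windowMs: number = 5 * 60 * 1000,
): number {
  const recent = requests.filter(
    (r) => r.timestampMs > nowMs - windowMs && r.timestampMs <= nowMs,
  );
  if (recent.length === 0) return 0;
  const errors = recent.filter((r) => r.status >= 500 && r.status <= 599).length;
  return errors / recent.length;
}
```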
LLM Cost Dashboard
- Daily/monthly spend: by model, by tenant, by agent role
- Token usage: input vs. output tokens per model
- Cost cap alerts: tenants approaching their monthly spend cap
- Model distribution: % of calls per provider (Claude / OpenAI / Ollama)
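
The spend breakdowns and cap alerts above reduce to a group-by sum and a threshold check. The sketch below assumes a simplified projection of llm_calls records. The field names (`agentRole`, `cost`), the helper names, and the 80% alert threshold are assumptions for illustration, not values from the schema.

```typescript
// Hypothetical projection of an llm_calls record; field names are assumptions.
interface LlmCallCost {
  tenantId: string;
  model: string;
  agentRole: string;
  cost: number; // in the platform's billing currency
}

type CostDimension = 'tenantId' | 'model' | 'agentRole';

// Total spend grouped by one dimension (by model, by tenant, or by agent role).
function sumCostBy(calls: LlmCallCost[], dimension: CostDimension): Record<string, number> {
  const totals: Record<string, number> = {};
  calls.forEach((call) => {
    const key = call[dimension];
    totals[key] = (totals[key] || 0) + call.cost;
  });
  return totals;
}

// Tenants whose spend has reached the given fraction of their monthly cap.
function tenantsNearCap(
  spend: Record<string, number>,
  caps: Record<string, number>,
  threshold: number = 0.8, // assumed alert threshold
): string[] {
  return Object.keys(spend)
    .filter((tenantId) => {
      const cap = caps[tenantId];
      return cap !== undefined && cap > 0 && spend[tenantId] / cap >= threshold;
    })
    .sort();
}
```

Tenants without a configured cap are skipped rather than treated as over budget, which avoids false alerts for plans with no spend limit.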
Activity Pipeline Dashboard
- Activities by status: created / in_progress / awaiting_approval / completed / failed (last 24h)
- HITL backlog: count of activities awaiting human approval (by tenant)
- Average pipeline duration: time from period start to deliverable completion
- Failure rate: by agent role and deliverable type
Tenant Health Dashboard (Manage App)
- Active tenants: by plan
- Tenant activity volume: activities per day per tenant (top 10)
- Tenants at risk: approaching spend cap, pipeline overdue, recurring tasks not configured
Audit Log Screens (In-App)
Dashboard — Audit Log (/settings/audit-log)
Audience: Tenant admin | Scope: Own tenant only
┌─────────────────────────────────────────────────────────────┐
│ Audit Log │
│ Filter: [All actions ▾] [All users ▾] [Last 30 days ▾] │
│ │
│ Apr 1 09:15 👤 Sarah approved "Q2 Blog Topic List" │
│ Apr 1 08:55 🤖 SEO Specialist completed "Research Topics" │
│ Mar 31 17:02 👤 James updated Agent Config — Copywriter │
│ Mar 31 16:44 👤 Sarah approved Strategy v2 │
│ ... │
└─────────────────────────────────────────────────────────────┘
Clicking a row shows: full description, before/after state, resource link, trace ID link.
DM Portal — Audit Log (/audit)
Audience: DM reviewer / power user | Scope: All assigned tenants
Same layout as Dashboard audit log, with an additional Tenant column and tenant filter. Shows agent actions across all tenants in the reviewer’s portfolio.
Manage — Platform Audit Log (/audit)
Audience: Super admin | Scope: All tenants + platform-level events
Additional filters: tenant, actor type, resource type. Used for compliance investigations and support escalations.
Package Location
packages/observability/
observability/
├── src/
│ ├── tracing.ts # OpenTelemetry SDK setup
│ ├── logger.ts # Pino logger with trace context mixin
│ ├── audit.ts # auditLog() function — writes to MongoDB audit_logs
│ └── index.ts
└── package.json
Infrastructure
Added to Docker Compose (local development):
- Grafana — dashboards and alerting (port 3100)
- Loki — log storage (port 3101)
- Promtail — log shipping sidecar
- Grafana Tempo (optional) — distributed trace storage
In production (Coolify), Grafana + Loki are deployed as separate services and secured behind Traefik with basic auth.
See Infrastructure for the full Docker Compose additions.
Open Questions
- Trace backend: Grafana Tempo vs. Jaeger for storing OpenTelemetry traces. Tempo integrates natively with Grafana but requires more setup. Jaeger is simpler. Decision needed before Phase 4.
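- Audit log access control: Can a tenant member (non-admin) see the audit log at all, or is it admin-only? The Dashboard currently defaults to admin-only, which conflicts with the visibility table above (Member: own actions only); the two need to be reconciled.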
- Log retention: Loki logs — how long to retain? 30 days for dev, 90 days for prod? Cost vs. compliance trade-off.