Observability — Audit Logging, Tracing & Metrics

Purpose

Full visibility into what the system is doing, who did what, and where things went wrong. Four layers:

  1. Audit Logging — durable record of all user and system actions (who did what and when)
  2. Distributed Tracing — OpenTelemetry traces correlating requests across services
  3. Metrics & Dashboards — Grafana dashboards for system health, queue depth, cost, error rates
  4. Log Aggregation — Grafana Loki for structured log storage and querying

Related: Database: audit_logs and event_logs collections | Infrastructure: Grafana + Loki in Docker Compose | Tech Stack: OpenTelemetry in stack table


Audit Logging

What Gets Logged

Every meaningful action in the system produces an audit log entry. The audit log is permanent — it is never deleted or truncated.

User actions:

  • Approving / rejecting activities, strategies, context files
  • Creating, updating, deleting tenants, users, skills, agent configs
  • Enabling / disabling agents
  • Plan changes, payment events
  • Logging in / out (session events)

Agent / system actions:

  • Activity started, completed, failed
  • Goal progress updated
  • Deliverable published or sent
  • Strategy auto-activated
  • Monthly period created

Audit Log Schema

See Database: audit_logs for the full MongoDB document schema.

Key fields: actorType, actorId, actorName, action, resourceType, resourceId, description, before, after, traceId.
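As a rough illustration of how these fields fit together, the sketch below types an audit entry and derives which keys changed between before and after. The interface, the 'agent' and 'system' actor values, and the changedKeys helper are assumptions made for illustration; the authoritative schema is the one in the Database document.

```typescript
// Hypothetical TypeScript shape inferred from the key fields listed above.
// Only 'human' appears in this document; 'agent' and 'system' are assumed.
type ActorType = 'human' | 'agent' | 'system';

interface AuditLogEntry {
  tenantId: string;
  actorType: ActorType;
  actorId: string;
  actorName: string;
  action: string;
  resourceType: string;
  resourceId: string;
  resourceName?: string;
  description: string;
  before?: Record<string, unknown>;
  after?: Record<string, unknown>;
  traceId?: string;
}

// Illustrative helper: list top-level keys whose values differ between
// the before and after snapshots (shallow comparison only).
function changedKeys(entry: AuditLogEntry): string[] {
  const before = entry.before ?? {};
  const after = entry.after ?? {};
  const keys = Object.keys(before).concat(Object.keys(after));
  return keys
    .filter((k, i) => keys.indexOf(k) === i && before[k] !== after[k])
    .sort();
}

const exampleEntry: AuditLogEntry = {
  tenantId: 'tenant-1',
  actorType: 'human',
  actorId: 'user-42',
  actorName: 'Sarah',
  action: 'approved',
  resourceType: 'activity',
  resourceId: 'act-7',
  description: 'Sarah approved activity "Q2 Blog Topic List"',
  before: { status: 'awaiting_approval' },
  after: { status: 'approved' },
};

const exampleChanged = changedKeys(exampleEntry);
```

A helper like this could drive the before/after view in the audit log detail screen, although the real UI may compute the diff differently.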

Writing Audit Logs

All audit log writes go through a single auditLog() function in packages/observability/:

    import { auditLog } from '@dmagency/observability';

    await auditLog({
      tenantId: activity.tenantId,
      actorType: 'human',
      actorId: session.user.refId,
      actorName: session.user.name,
      action: 'approved',
      resourceType: 'activity',
      resourceId: activity.refId,
      resourceName: activity.name,
      description: `${session.user.name} approved activity "${activity.name}"`,
      before: { status: 'awaiting_approval' },
      after: { status: 'approved' },
      traceId: context.traceId,
    });

This function:

  1. Writes to MongoDB audit_logs collection
  2. Emits an OpenTelemetry span event (so it appears in trace waterfall)
  3. Returns immediately (fire-and-forget, does not block the request)
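The fire-and-forget behaviour described in the steps above can be sketched without any database or tracing dependency. In this minimal version the two sinks stand in for the MongoDB write and the span event; createAuditLogger, AuditSink, and the onError callback are hypothetical names, not the package's actual API.

```typescript
// Hypothetical sink: in the real package this would be the MongoDB insert
// or the OpenTelemetry span event.
type AuditSink = (entry: Record<string, unknown>) => Promise<void>;

function createAuditLogger(
  sinks: AuditSink[],
  onError: (err: unknown) => void,
): (entry: Record<string, unknown>) => void {
  return function auditLog(entry: Record<string, unknown>): void {
    const stamped = { ...entry, createdAt: new Date().toISOString() };
    for (const sink of sinks) {
      // Schedule the write on a microtask and do not wait for it.
      // Both synchronous throws and rejected promises end up in onError.
      Promise.resolve()
        .then(() => sink(stamped))
        .catch(onError);
    }
  };
}

// Demo with in-memory sinks: one succeeds, one fails.
const written: Record<string, unknown>[] = [];
const sinkErrors: unknown[] = [];

const demoAuditLog = createAuditLogger(
  [
    async (entry) => {
      written.push(entry);
    },
    async () => {
      throw new Error('span event failed');
    },
  ],
  (err) => {
    sinkErrors.push(err);
  },
);

demoAuditLog({ action: 'approved', resourceType: 'activity' });

// Nothing has been written yet at this point: the call returned first.
const writtenAtReturn = written.length;
```

Because the caller never sees a failed write, the error callback is the only place a lost audit entry becomes visible, so in practice it would need to log or alert rather than discard the error.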

Audit Log Visibility by App

| App | Scope | Permission |
|---|---|---|
| Dashboard | Own tenant only | Admin: full audit log. Member: own actions only. |
| DM Portal | Tenants assigned to reviewer | All human + agent actions for assigned tenants |
| Manage | All tenants + platform-level | Super admin only |
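The visibility rules in this table could be expressed as a single predicate. The sketch below is one possible reading, with hypothetical role names and a null tenantId standing for platform-level events; member access in the Dashboard follows the table, which the open question at the end of this page may still change.

```typescript
type App = 'dashboard' | 'dm-portal' | 'manage';
type Role = 'admin' | 'member' | 'reviewer' | 'super_admin'; // assumed role names

interface Viewer {
  app: App;
  userId: string;
  role: Role;
  tenantId?: string;            // set for Dashboard users
  assignedTenantIds?: string[]; // set for DM Portal reviewers
}

interface VisibleEntry {
  tenantId: string | null; // null = platform-level event (assumption)
  actorId: string;
}

function canViewEntry(viewer: Viewer, entry: VisibleEntry): boolean {
  switch (viewer.app) {
    case 'dashboard':
      // Own tenant only; members see only their own actions.
      if (entry.tenantId === null || entry.tenantId !== viewer.tenantId) return false;
      if (viewer.role === 'admin') return true;
      return viewer.role === 'member' && entry.actorId === viewer.userId;
    case 'dm-portal':
      // Human and agent actions for tenants assigned to the reviewer.
      return (
        viewer.role === 'reviewer' &&
        entry.tenantId !== null &&
        (viewer.assignedTenantIds ?? []).indexOf(entry.tenantId) !== -1
      );
    case 'manage':
      // All tenants plus platform-level events, super admin only.
      return viewer.role === 'super_admin';
    default:
      return false;
  }
}

const tenantAdmin: Viewer = { app: 'dashboard', userId: 'u1', role: 'admin', tenantId: 't1' };
const tenantMember: Viewer = { app: 'dashboard', userId: 'u2', role: 'member', tenantId: 't1' };
const reviewer: Viewer = { app: 'dm-portal', userId: 'r1', role: 'reviewer', assignedTenantIds: ['t1'] };
const superAdmin: Viewer = { app: 'manage', userId: 's1', role: 'super_admin' };
```

In a real service the same rule would more likely be applied as a query filter than as a per-row check, so that entries a viewer may not see are never loaded.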

Distributed Tracing (OpenTelemetry)

What is traced

Every inbound API request starts an OpenTelemetry trace. Spans are created for:

  • API route handlers
  • Database queries (Prisma auto-instrumented via @prisma/instrumentation)
  • BullMQ job processing
  • Agent adapter dispatch calls
  • Phone-home callback handling
  • External API calls (SEMrush, GA4, Razorpay, SendGrid)

Setup

    // packages/observability/src/tracing.ts
    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
    import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

    export const sdk = new NodeSDK({
      serviceName: process.env.OTEL_SERVICE_NAME || 'dmagency-api',
      traceExporter: new OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT, // Grafana Tempo or Jaeger
      }),
      instrumentations: [getNodeAutoInstrumentations()],
    });

    sdk.start();

Trace ID propagation

The traceId from the active OpenTelemetry context is included in:

  • audit_logs.traceId — link audit event to trace
  • event_logs.traceId — link activity event to trace
  • activity_runs.traceId — link agent run to trace
  • llm_calls.traceId — link LLM cost to trace

This means: from a single audit log entry, you can jump to the full distributed trace that caused it.
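To make the shared identifier concrete, the sketch below extracts a trace ID from a W3C traceparent header and stamps it on two records. In the running system the OpenTelemetry SDK does this extraction and the ID is read from the active context, so the parser and the record shapes here are purely illustrative.

```typescript
// W3C Trace Context format: version-traceid-parentid-flags, lowercase hex.
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function traceIdFromTraceparent(header: string | undefined): string | undefined {
  if (!header) return undefined;
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return undefined;
  const version = match[1];
  const traceId = match[2];
  // Version "ff" and an all-zero trace ID are invalid per the specification.
  if (version === 'ff' || /^0+$/.test(traceId)) return undefined;
  return traceId;
}

// Example header value taken from the W3C Trace Context specification.
const incomingHeader = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const requestTraceId = traceIdFromTraceparent(incomingHeader);

// The same ID is written onto every record produced while handling the request
// (hypothetical, trimmed-down record shapes).
const auditRecord = { action: 'approved', traceId: requestTraceId };
const llmCallRecord = { model: 'example-model', traceId: requestTraceId };
```

Since both records carry the same value, a lookup by traceId in any of the four collections leads back to the same trace.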


Log Aggregation (Grafana Loki)

All structured logs from all services are shipped to Grafana Loki.

Logging format

All services use pino for structured JSON logging:

    import pino from 'pino';
    import { trace } from '@opentelemetry/api';

    export const logger = pino({
      level: process.env.LOG_LEVEL || 'info',
      base: {
        service: process.env.OTEL_SERVICE_NAME,
        env: process.env.NODE_ENV,
      },
      // Add trace context to every log line
      mixin() {
        const span = trace.getActiveSpan();
        if (!span) return {};
        const { traceId, spanId } = span.spanContext();
        return { traceId, spanId };
      },
    });

Log shipping to Loki

Log shipping uses Promtail (deployed as a sidecar in Docker Compose / Coolify):

    # promtail-config.yaml
    scrape_configs:
      - job_name: dmagency
        static_configs:
          - targets: [localhost]
            labels:
              app: dmagency
              __path__: /var/log/dmagency/*.log
        pipeline_stages:
          - json:
              expressions:
                level: level
                service: service
                traceId: traceId
          - labels:
              level:
              service:
              traceId:

Log labels (for Loki filtering)

| Label | Values | Use |
|---|---|---|
| service | api, dashboard, dm-portal, manage | Filter by service |
| level | debug, info, warn, error | Filter by severity |
| tenantId | {uuid} | Filter by tenant (added to log context on authenticated requests) |
| traceId | {hex} | Correlate with traces |
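As an example of how these labels might be used, the sketch below builds a LogQL query string from a filter object. It assumes that app, service, level, and traceId are stream labels (as in the Promtail config) and that tenantId lives in the JSON log body and is matched after a json parser stage; buildLogQL and LogFilter are hypothetical names.

```typescript
interface LogFilter {
  service?: string;
  level?: string;
  traceId?: string;
  tenantId?: string;
}

// Escape backslashes and double quotes for use inside a LogQL string literal.
function escapeLogQL(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
}

function buildLogQL(filter: LogFilter): string {
  // Stream selector: labels attached by Promtail.
  const selector: string[] = ['app="dmagency"'];
  if (filter.service) selector.push('service="' + escapeLogQL(filter.service) + '"');
  if (filter.level) selector.push('level="' + escapeLogQL(filter.level) + '"');
  if (filter.traceId) selector.push('traceId="' + escapeLogQL(filter.traceId) + '"');

  let query = '{' + selector.join(', ') + '}';

  // tenantId is assumed to be a field in the JSON log line, not a stream label.
  if (filter.tenantId) {
    query += ' | json | tenantId="' + escapeLogQL(filter.tenantId) + '"';
  }
  return query;
}

const apiErrorsForTenant = buildLogQL({ service: 'api', level: 'error', tenantId: 't-123' });
// {app="dmagency", service="api", level="error"} | json | tenantId="t-123"
```

The resulting string can be pasted into Grafana Explore with the Loki data source selected.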

Grafana Dashboards

System Health Dashboard

  • Service uptime: API, Dashboard, DM Portal, Manage — green/red status
  • Error rate: HTTP 5xx per service (last 5 min rolling)
  • Request latency: p50 / p95 / p99 per route
  • Queue depth: BullMQ jobs waiting per agent role (all tenants aggregated)
  • Dead letter queue: count of failed jobs in DLQ — alert if > 0
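The error-rate and latency panels would normally be computed by the metrics backend, but the underlying arithmetic is simple enough to sketch. The example below assumes a hypothetical RequestSample record, a 5-minute window, and the nearest-rank percentile definition.

```typescript
interface RequestSample {
  timestampMs: number;
  status: number;
  durationMs: number;
}

// Fraction of requests in the trailing window that returned HTTP 5xx.
function errorRate(samples: RequestSample[], nowMs: number, windowMs: number = 5 * 60 * 1000): number {
  const recent = samples.filter((s) => s.timestampMs > nowMs - windowMs && s.timestampMs <= nowMs);
  if (recent.length === 0) return 0;
  const errors = recent.filter((s) => s.status >= 500 && s.status <= 599).length;
  return errors / recent.length;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

const nowMs = 600000;
const samples: RequestSample[] = [
  { timestampMs: 590000, status: 200, durationMs: 120 },
  { timestampMs: 580000, status: 500, durationMs: 80 },
  { timestampMs: 500000, status: 200, durationMs: 200 },
  { timestampMs: 350000, status: 503, durationMs: 95 },
  { timestampMs: 100000, status: 500, durationMs: 1500 }, // outside the 5-minute window
];

const rate = errorRate(samples, nowMs); // 2 errors out of 4 recent requests = 0.5
const durations = samples.map((s) => s.durationMs);
const p50 = percentile(durations, 50); // 120
const p95 = percentile(durations, 95); // 1500
```

With only five samples p95 and p99 both land on the slowest request, which is why tail percentiles are only meaningful over larger windows.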

LLM Cost Dashboard

  • Daily/monthly spend: by model, by tenant, by agent role
  • Token usage: input vs. output tokens per model
  • Cost cap alerts: tenants approaching their monthly spend cap
  • Model distribution: % of calls per provider (Claude / OpenAI / Ollama)
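The cost cap alert in the list above could be driven by a check like the following. The TenantSpend shape and the 80% threshold are assumptions for illustration; this document does not specify when a tenant counts as approaching its cap.

```typescript
interface TenantSpend {
  tenantId: string;
  monthToDateUsd: number;
  monthlyCapUsd: number;
}

interface CapAlert {
  tenantId: string;
  usedFraction: number;
}

// Return tenants whose month-to-date spend is at or above the threshold
// fraction of their cap, highest usage first. Tenants without a positive
// cap are skipped.
function tenantsApproachingCap(spend: TenantSpend[], threshold: number = 0.8): CapAlert[] {
  return spend
    .filter((s) => s.monthlyCapUsd > 0)
    .map((s) => ({ tenantId: s.tenantId, usedFraction: s.monthToDateUsd / s.monthlyCapUsd }))
    .filter((a) => a.usedFraction >= threshold)
    .sort((a, b) => b.usedFraction - a.usedFraction);
}

const capAlerts = tenantsApproachingCap([
  { tenantId: 'tenant-a', monthToDateUsd: 45, monthlyCapUsd: 50 },   // 90% used
  { tenantId: 'tenant-b', monthToDateUsd: 10, monthlyCapUsd: 100 },  // 10% used
  { tenantId: 'tenant-c', monthToDateUsd: 120, monthlyCapUsd: 100 }, // over cap
]);
```

Here tenant-c is listed first because it is already over its cap, followed by tenant-a; tenant-b is not reported.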

Activity Pipeline Dashboard

  • Activities by status: created / in_progress / awaiting_approval / completed / failed (last 24h)
  • HITL backlog: count of activities awaiting human approval (by tenant)
  • Average pipeline duration: time from period start to deliverable completion
  • Failure rate: by agent role and deliverable type

Tenant Health Dashboard (Manage App)

  • Active tenants: by plan
  • Tenant activity volume: activities per day per tenant (top 10)
  • Tenants at risk: approaching spend cap, pipeline overdue, recurring tasks not configured

Audit Log Screens (In-App)

Dashboard — Audit Log (/settings/audit-log)

Audience: Tenant admin
Scope: Own tenant only

┌───────────────────────────────────────────────────────────────┐
│ Audit Log                                                     │
│ Filter: [All actions ▾] [All users ▾] [Last 30 days ▾]        │
│                                                               │
│ Apr 1   09:15  👤 Sarah approved "Q2 Blog Topic List"         │
│ Apr 1   08:55  🤖 SEO Specialist completed "Research Topics"  │
│ Mar 31  17:02  👤 James updated Agent Config - Copywriter     │
│ Mar 31  16:44  👤 Sarah approved Strategy v2                  │
│ ...                                                           │
└───────────────────────────────────────────────────────────────┘

Clicking a row shows: full description, before/after state, resource link, trace ID link.

DM Portal — Audit Log (/audit)

Audience: DM reviewer / power user
Scope: All assigned tenants

Same layout as Dashboard audit log, with an additional Tenant column and tenant filter. Shows agent actions across all tenants in the reviewer’s portfolio.

Manage — Platform Audit Log (/audit)

Audience: Super admin
Scope: All tenants + platform-level events

Additional filters: tenant, actor type, resource type. Used for compliance investigations and support escalations.


Package Location

packages/observability/

    observability/
    ├── src/
    │   ├── tracing.ts   # OpenTelemetry SDK setup
    │   ├── logger.ts    # Pino logger with trace context mixin
    │   ├── audit.ts     # auditLog() function, writes to MongoDB audit_logs
    │   └── index.ts
    └── package.json

Infrastructure

Added to Docker Compose (local development):

  • Grafana — dashboards and alerting (port 3100)
  • Loki — log storage (port 3101)
  • Promtail — log shipping sidecar
  • Grafana Tempo (optional) — distributed trace storage

In production (Coolify), Grafana + Loki are deployed as separate services and secured behind Traefik with basic auth.

See Infrastructure for the full Docker Compose additions.


Open Questions

  • Trace backend: Grafana Tempo vs. Jaeger for storing OpenTelemetry traces. Tempo integrates natively with Grafana but requires more setup. Jaeger is simpler. Decision needed before Phase 4.
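  • Audit log access control: Can a tenant member (non-admin) see the audit log at all, or is it admin-only? The visibility table lists members as seeing their own actions only, while the Dashboard currently defaults to admin-only; the two need to be reconciled.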
  • Log retention: Loki logs — how long to retain? 30 days for dev, 90 days for prod? Cost vs. compliance trade-off.

© 2026 Leadmetrics — Internal use only