Observability — Audit Logging, Tracing & Metrics

Purpose

Full visibility into what the system is doing, who did what, and where things went wrong. Four layers:

Audit Logging — durable record of all user and system actions (who did what and when)
Distributed Tracing — OpenTelemetry traces correlating requests across services
Metrics & Dashboards — Grafana dashboards for system health, queue depth, cost, error rates
Log Aggregation — Grafana Loki for structured log storage and querying

Related: Database — audit_logs and event_logs collections | Infrastructure — Grafana + Loki in Docker Compose | Tech Stack — OpenTelemetry in stack table

Audit Logging

What Gets Logged

Every meaningful action in the system produces an audit log entry. The audit log is permanent — it is never deleted or truncated.

User actions:

Approving / rejecting activities, strategies, context files
Creating, updating, deleting tenants, users, skills, agent configs
Enabling / disabling agents
Plan changes, payment events
Logging in / out (session events)

Agent / system actions:

Activity started, completed, failed
Goal progress updated
Deliverable published or sent
Strategy auto-activated
Monthly period created

Audit Log Schema

See Database: audit_logs for the full MongoDB document schema.

Key fields: tenantId, actorType, actorId, actorName, action, resourceType, resourceId, resourceName, description, before, after, traceId.

Writing Audit Logs

All audit log writes go through a single auditLog() function in packages/observability/:


import { auditLog } from '@dmagency/observability';
 
await auditLog({
  tenantId:     activity.tenantId,
  actorType:    'human',
  actorId:      session.user.refId,
  actorName:    session.user.name,
  action:       'approved',
  resourceType: 'activity',
  resourceId:   activity.refId,
  resourceName: activity.name,
  description:  `${session.user.name} approved activity "${activity.name}"`,
  before:       { status: 'awaiting_approval' },
  after:        { status: 'approved' },
  traceId:      context.traceId,
});

This function:

Writes to MongoDB audit_logs collection
Emits an OpenTelemetry span event (so it appears in trace waterfall)
Returns immediately (fire-and-forget, does not block the request)
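The three behaviours listed above could be combined roughly as follows. This is a minimal sketch, not the actual contents of audit.ts: createAuditLog, AuditStore, SpanLike, the createdAt field, and the audit_log event name are all hypothetical. The store and span are injected as small interfaces that stand in for the MongoDB collection and the active OpenTelemetry span.

```typescript
// Hypothetical entry shape, mirroring the fields used in the example above.
interface AuditEntry {
  tenantId: string;
  actorType: string;
  actorId: string;
  actorName: string;
  action: string;
  resourceType: string;
  resourceId: string;
  resourceName?: string;
  description: string;
  before?: Record<string, unknown>;
  after?: Record<string, unknown>;
  traceId?: string;
}

// Stand-in for the MongoDB audit_logs collection (only insertOne is needed).
interface AuditStore {
  insertOne(doc: AuditEntry & { createdAt: Date }): Promise<unknown>;
}

// Stand-in for an OpenTelemetry span (only addEvent is needed).
interface SpanLike {
  addEvent(name: string, attributes?: Record<string, string>): void;
}

function createAuditLog(
  store: AuditStore,
  getActiveSpan: () => SpanLike | undefined,
  onError: (err: unknown) => void = (err) => console.error('audit write failed', err),
) {
  return function auditLog(entry: AuditEntry): Promise<void> {
    // 1. Attach a span event so the action shows up in the trace waterfall.
    getActiveSpan()?.addEvent('audit_log', {
      action: entry.action,
      resourceType: entry.resourceType,
      resourceId: entry.resourceId,
    });

    // 2. Start the write but do not await it (fire-and-forget).
    //    createdAt is an assumed field, not taken from the schema.
    try {
      store.insertOne({ ...entry, createdAt: new Date() }).catch(onError);
    } catch (err) {
      onError(err);
    }

    // 3. Resolve immediately so callers that `await auditLog(...)` are not blocked.
    return Promise.resolve();
  };
}
```

Because failures never reach the caller, a fire-and-forget write needs its own error path (onError here); without one, a failed insert would silently drop an audit entry.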

Audit Log Visibility by App

App	Scope	Permission
Dashboard	Own tenant only	Admin: full audit log. Member: own actions only.
DM Portal	Tenants assigned to reviewer	All human + agent actions for assigned tenants
Manage	All tenants + platform-level	Super admin only
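One way to enforce the table above is to derive a query filter from the viewer before reading audit_logs. The sketch below is illustrative only: the Viewer type, the auditScopeFilter name, and the MongoDB-style filter shape are assumptions, and it takes the table at face value (members see only their own actions).

```typescript
// Hypothetical description of who is asking, one variant per app.
type Viewer =
  | { app: 'dashboard'; role: 'admin' | 'member'; tenantId: string; userId: string }
  | { app: 'dm-portal'; assignedTenantIds: string[] }
  | { app: 'manage'; isSuperAdmin: boolean };

// MongoDB-style filter, or null when the viewer has no access at all.
type AuditFilter = Record<string, unknown> | null;

function auditScopeFilter(viewer: Viewer): AuditFilter {
  switch (viewer.app) {
    case 'dashboard':
      // Admin: the tenant's full audit log. Member: own actions only.
      return viewer.role === 'admin'
        ? { tenantId: viewer.tenantId }
        : { tenantId: viewer.tenantId, actorId: viewer.userId };
    case 'dm-portal':
      // All human + agent actions, restricted to the reviewer's assigned tenants.
      return { tenantId: { $in: viewer.assignedTenantIds } };
    case 'manage':
      // Super admin only: an empty filter means all tenants and platform events.
      return viewer.isSuperAdmin ? {} : null;
  }
}
```

Returning null rather than an empty filter for a non-super-admin keeps "no access" distinct from "no restriction".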

Distributed Tracing (OpenTelemetry)

What is traced

Every inbound API request starts an OpenTelemetry trace. Spans are created for:

API route handlers
Database queries (Prisma auto-instrumented via @prisma/instrumentation)
BullMQ job processing
Agent adapter dispatch calls
Phone-home callback handling
External API calls (SEMrush, GA4, Razorpay, SendGrid)

Setup


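// packages/observability/src/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { PrismaInstrumentation } from '@prisma/instrumentation';

export const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME || 'dmagency-api',
  // No explicit url: the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT itself and
  // appends /v1/traces. A url passed here would be used as-is, without that path.
  traceExporter: new OTLPTraceExporter(),  // Grafana Tempo or Jaeger
  instrumentations: [
    getNodeAutoInstrumentations(),
    // Prisma is not part of the auto-instrumentations bundle, so register it explicitly
    new PrismaInstrumentation(),
  ],
});

// Start the SDK before the instrumented modules are loaded
sdk.start();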

Trace ID propagation

The traceId from the active OpenTelemetry context is included in:

audit_logs.traceId — link audit event to trace
event_logs.traceId — link activity event to trace
activity_runs.traceId — link agent run to trace
llm_calls.traceId — link LLM cost to trace

This means: from a single audit log entry, you can jump to the full distributed trace that caused it.
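Between services, OpenTelemetry's default propagator carries the trace ID in the W3C traceparent HTTP header, formatted as version-traceId-spanId-flags. The sketch below shows what that header contains and where the hex traceId stored in the collections above comes from. parseTraceparent and TraceContext are hypothetical names, and only the four-field format is handled; in the services themselves the SDK does this parsing.

```typescript
interface TraceContext {
  traceId: string;   // 32 lowercase hex characters
  spanId: string;    // 16 lowercase hex characters
  sampled: boolean;  // lowest bit of the flags byte
}

// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;

  const [, version, traceId, spanId, flags] = match;

  // Version ff and all-zero IDs are invalid under the W3C Trace Context spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;

  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```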

Log Aggregation (Grafana Loki)

All structured logs from all services are shipped to Grafana Loki.

Logging format

All services use pino for structured JSON logging:


import pino from 'pino';
import { trace } from '@opentelemetry/api';
 
export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: {
    service: process.env.OTEL_SERVICE_NAME,
    env:     process.env.NODE_ENV,
  },
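  // Emit the level as a string ("info") instead of pino's default number (30),
  // so log lines carry the same level names used for filtering in Loki
  formatters: {
    level(label) {
      return { level: label };
    },
  },
  // Add trace context to every log line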
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { traceId, spanId };
  },
});

Log shipping to Loki

Log shipping uses Promtail (deployed as a sidecar in Docker Compose / Coolify):


# promtail-config.yaml
scrape_configs:
  - job_name: dmagency
    static_configs:
      - targets: [localhost]
        labels:
          app: dmagency
          __path__: /var/log/dmagency/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
            tenantId: tenantId
            traceId: traceId
      - labels:
          level:
          service:
          tenantId:
          traceId:

Log labels (for Loki filtering)

Label	Values	Use
`service`	`api`, `dashboard`, `dm-portal`, `manage`	Filter by service
`level`	`debug`, `info`, `warn`, `error`	Filter by severity
`tenantId`	`{uuid}`	Filter by tenant (added to log context on authenticated requests)
`traceId`	`{hex}`	Correlate with traces
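These labels map directly onto a LogQL stream selector. As a rough illustration, a helper like the one below could build the selector used in a Grafana Explore link or an in-app "view logs" button. buildLogQL and LogQuery are hypothetical names, and the fixed app="dmagency" matcher assumes the static label from the Promtail config above.

```typescript
// Labels from the table above, in the order they appear in the selector.
const LOG_LABELS = ['service', 'level', 'tenantId', 'traceId'] as const;

type LogQuery = Partial<Record<(typeof LOG_LABELS)[number], string>>;

// Escape backslashes and double quotes so the value is a valid LogQL string.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
}

function buildLogQL(query: LogQuery): string {
  // A selector needs at least one matcher; the static app label guarantees that.
  const matchers = ['app="dmagency"'];
  for (const label of LOG_LABELS) {
    const value = query[label];
    if (value !== undefined) {
      matchers.push(`${label}="${escapeLabelValue(value)}"`);
    }
  }
  return `{${matchers.join(', ')}}`;
}
```

For example, buildLogQL({ service: 'api', level: 'error' }) produces {app="dmagency", service="api", level="error"}.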

Grafana Dashboards

System Health Dashboard

Service uptime: API, Dashboard, DM Portal, Manage — green/red status
Error rate: HTTP 5xx per service (last 5 min rolling)
Request latency: p50 / p95 / p99 per route
Queue depth: BullMQ jobs waiting per agent role (all tenants aggregated)
Dead letter queue: count of failed jobs in DLQ — alert if > 0
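For reference, the p50 / p95 / p99 figures on the latency panel are percentiles of the request-duration distribution. In practice Grafana computes them in the data source from histogram metrics; the sketch below only illustrates the definition, using the nearest-rank method on raw samples. The function names are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError('no samples');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');

  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Summary for one route, given its request durations in milliseconds.
function latencySummary(durationsMs: number[]): { p50: number; p95: number; p99: number } {
  return {
    p50: percentile(durationsMs, 50),
    p95: percentile(durationsMs, 95),
    p99: percentile(durationsMs, 99),
  };
}
```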

LLM Cost Dashboard

Daily/monthly spend: by model, by tenant, by agent role
Token usage: input vs. output tokens per model
Cost cap alerts: tenants approaching their monthly spend cap
Model distribution: % of calls per provider (Claude / OpenAI / Ollama)
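The spend breakdowns and cap alerts above reduce to grouping llm_calls records by one dimension and comparing tenant totals against a cap. A minimal sketch follows. The LlmCall field names, the 80% alert threshold, and both function names are assumptions for illustration, not the actual llm_calls schema.

```typescript
// Hypothetical shape of one llm_calls record.
interface LlmCall {
  tenantId: string;
  model: string;
  agentRole: string;
  inputTokens: number;
  outputTokens: number;
  cost: number; // in the platform's billing currency
}

// Total spend grouped by one dimension (tenant, model, or agent role).
function spendBy(calls: LlmCall[], key: 'tenantId' | 'model' | 'agentRole'): Map<string, number> {
  const totals = new Map<string, number>();
  for (const call of calls) {
    totals.set(call[key], (totals.get(call[key]) ?? 0) + call.cost);
  }
  return totals;
}

// Tenants whose spend has reached `threshold` (a fraction) of their monthly cap.
function tenantsNearCap(
  calls: LlmCall[],
  monthlyCaps: Map<string, number>,
  threshold = 0.8, // assumed alert level
): string[] {
  const nearCap: string[] = [];
  spendBy(calls, 'tenantId').forEach((total, tenantId) => {
    const cap = monthlyCaps.get(tenantId);
    if (cap !== undefined && total >= cap * threshold) nearCap.push(tenantId);
  });
  return nearCap;
}
```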

Activity Pipeline Dashboard

Activities by status: created / in_progress / awaiting_approval / completed / failed (last 24h)
HITL backlog: count of activities awaiting human approval (by tenant)
Average pipeline duration: time from period start to deliverable completion
Failure rate: by agent role and deliverable type

Tenant Health Dashboard (Manage App)

Active tenants: by plan
Tenant activity volume: activities per day per tenant (top 10)
Tenants at risk: approaching spend cap, pipeline overdue, recurring tasks not configured

Audit Log Screens (In-App)

Dashboard — Audit Log (`/settings/audit-log`)

Audience: Tenant admin | Scope: Own tenant only


┌─────────────────────────────────────────────────────────────┐
│  Audit Log                                                  │
│  Filter: [All actions ▾] [All users ▾] [Last 30 days ▾]   │
│                                                             │
│  Apr 1 09:15  👤 Sarah approved "Q2 Blog Topic List"        │
│  Apr 1 08:55  🤖 SEO Specialist completed "Research Topics" │
│  Mar 31 17:02 👤 James updated Agent Config — Copywriter    │
│  Mar 31 16:44 👤 Sarah approved Strategy v2                 │
│  ...                                                        │
└─────────────────────────────────────────────────────────────┘

Clicking a row shows: full description, before/after state, resource link, trace ID link.

DM Portal — Audit Log (`/audit`)

Audience: DM reviewer / power user | Scope: All assigned tenants

Same layout as the Dashboard audit log, with an additional Tenant column and tenant filter. Shows human and agent actions across all tenants in the reviewer’s portfolio.

Manage — Platform Audit Log (`/audit`)

Audience: Super admin | Scope: All tenants + platform-level events

Additional filters: tenant, actor type, resource type. Used for compliance investigations and support escalations.

Package Location

packages/observability/


observability/
├── src/
│   ├── tracing.ts       # OpenTelemetry SDK setup
│   ├── logger.ts        # Pino logger with trace context mixin
│   ├── audit.ts         # auditLog() function — writes to MongoDB audit_logs
│   └── index.ts
└── package.json

Infrastructure

Added to Docker Compose (local development):

Grafana — dashboards and alerting (port 3100)
Loki — log storage (port 3101)
Promtail — log shipping sidecar
Grafana Tempo (optional) — distributed trace storage

In production (Coolify), Grafana + Loki are deployed as separate services and secured behind Traefik with basic auth.

See Infrastructure for the full Docker Compose additions.

Open Questions

Trace backend: Grafana Tempo vs. Jaeger for storing OpenTelemetry traces. Tempo integrates natively with Grafana but requires more setup. Jaeger is simpler. Decision needed before Phase 4.
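Audit log access control: Can a tenant member (non-admin) see the audit log at all, or is it admin-only? The visibility table gives members access to their own actions, while the Dashboard screen currently defaults to admin-only. The two need to be reconciled.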
Log retention: How long should Loki retain logs? A starting proposal is 30 days for dev and 90 days for prod; the choice is a cost vs. compliance trade-off.