Architecture Review v1 — Leadmetrics v3
Date: 2026-04-25 Reviewer: Claude Code Scope: Full monorepo — API, portals, worker servers, packages, database, real-time, security
Historical note: This review was written when Better Auth was still in use for the Dashboard. Better Auth was removed in late April 2026 — all three portals now use a unified Fastify JWT system (HS256) via
@leadmetrics/middleware. The dual-auth risk (R-xxx) identified in this review is now resolved.
How to read this document: Every identified risk or gap is followed immediately by a Fix block with concrete implementation steps, affected files, and code examples. The Summary table cross-references each risk by ID (R-1 … R-21).
1. System Topology
Leadmetrics v3 is a multi-tenant AI marketing automation platform delivered as a pnpm monorepo with five distinct compute layers:
| Layer | Technology | Process count |
|---|---|---|
| Portals | Next.js 14 (App Router) | 4 apps (dashboard :3000, manage :3001, dm :3002, knowledgebase :3004) |
| REST API | Fastify 4 | 1 process (:3003) |
| Agent workers | BullMQ + Claude/Codex/Gemini | 1 monolithic process (44 workers) |
| Support servers | BullMQ | 6 dedicated processes (billing, notifications, ragengine, reporting, scheduler, search-indexer) |
| Mobile | React Native (Expo) | 1 app (dashboard-mobile) |
Infrastructure dependencies: PostgreSQL (primary DB), MongoDB (audit logs), Redis (queues, pub/sub, presence cache), Typesense (full-text search), Qdrant (vector search / RAG).
Workspace layout (pnpm-workspace.yaml):
apps/* — Next.js portals + Fastify API
apps/servers/* — Background server processes
packages/* — Shared libraries (db, ui, common, queue, billing, …)
packages/adapters/* — LLM adapters (claude-local, codex-local, gemini-local)
packages/providers/* — 20+ third-party integrations
2. API Layer
2.1 Router Organization
All routes live in apps/api/src/routers/. The 60+ route files are organized into versioned namespaces registered in apps/api/src/index.ts:
| Prefix | Domain | Example routes |
|---|---|---|
| /auth/v1 | Auth | login, register, refresh, profile |
| /admin/v1 | Superadmin | tenants, users, billing, audit, templates |
| /tenant/v1 | Tenant-scoped | blog, social, calendar, insights, channels |
| /dm/v1 | DM reviewer | activities, blog, social, strategy, contacts |
| /media/v1 | Media | image search, asset upload |
| /aichat/v1 | AI Chat | LangGraph conversation |
| /mobile/v1 | Mobile API | client-facing mobile endpoints |
| /pg/v1 | PG callbacks | payment gateway webhooks |
Assessment: Domain grouping is clear and consistent. The /v1 prefix on all namespaces is good practice. However, brandVoiceRouter appears in app.ts (test entry) but is missing from index.ts (production) — see R-3.
2.2 Entry Point Split (app.ts vs index.ts) — R-3
There are two entry points:
- index.ts — production server; registers Helmet, Swagger/OpenAPI, Socket.IO, MongoDB, rate limiting
- app.ts — test server; lighter setup, no Swagger, no Socket.IO, no MongoDB
The split is intentional for fast test startup but causes silent drift. Any router registered in index.ts but absent from app.ts will 404 in integration tests — and because no test was written for the missing route, the suite passes silently instead of failing loudly. This has burned the team before (feedback_api_app_ts_registration.md).
Fix (R-3): Extract router registration into a shared function consumed by both entry points.
```ts
// apps/api/src/lib/register-routers.ts (NEW FILE)
import type { FastifyInstance } from "fastify";
import { authRouter } from "../routers/auth";
import { adminRouter } from "../routers/admin";
import { tenantRouter } from "../routers/tenant";
import { dmRouter } from "../routers/dm";
import { brandVoiceRouter } from "../routers/brand-voice";
// … all other routers

export async function registerRouters(fastify: FastifyInstance): Promise<void> {
  await fastify.register(authRouter, { prefix: "/auth/v1" });
  await fastify.register(adminRouter, { prefix: "/admin/v1" });
  await fastify.register(tenantRouter, { prefix: "/tenant/v1" });
  await fastify.register(dmRouter, { prefix: "/dm/v1" });
  await fastify.register(brandVoiceRouter, { prefix: "/tenant/v1/brand-voice" });
  // … all other routers
}
```

Both index.ts and app.ts then call await registerRouters(fastify) — one source of truth, zero drift.
2.3 Authentication Guards — R-1
Auth is opt-in per route handler. There is no global preHandler that rejects unauthenticated requests. Each handler explicitly calls one of:
- requireTenantUser(req, reply) — verifies JWT, returns { sub, tenantId, role, appAccess }
- requireSuperAdmin(req, reply) — verifies JWT, asserts role === "super_admin"
- requireDMAccess(req, reply) — verifies JWT, asserts role in ["reviewer", "admin", "super_admin"]
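For illustration, the current opt-in pattern inside a typical handler looks roughly like this (route and model names are assumptions, not actual codebase paths):

```ts
// Hypothetical route showing the current opt-in guard pattern.
fastify.get("/posts/:id", async (req, reply) => {
  // If this call is forgotten, the route is publicly accessible.
  const user = await requireTenantUser(req, reply);
  if (!user) return; // guard has already sent the 401 reply

  const { id } = req.params as { id: string };
  return db.blogPost.findFirst({ where: { id, tenantId: user.tenantId } });
});
```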
Risk (R-1): A new route that forgets the guard call is publicly accessible — the TypeScript compiler does not enforce guard presence. With 60+ route files this is a latent exposure risk.
Fix (R-1): Register auth guards as Fastify preHandler hooks at the router/plugin level, not inside individual handlers.

```ts
// apps/api/src/routers/dm/index.ts
import { FastifyPluginAsync } from "fastify";
import { requireDMAccess } from "../../lib/auth";

export const dmRouter: FastifyPluginAsync = async (fastify) => {
  // Single guard covers every route registered in this plugin scope.
  fastify.addHook("preHandler", requireDMAccess);

  await fastify.register(dmActivitiesRouter);
  await fastify.register(dmBlogRouter);
  // … all dm sub-routers — all protected automatically
};
```

Apply the same pattern to tenantRouter (→ requireTenantUser) and adminRouter (→ requireSuperAdmin). Per-handler guard calls become unnecessary and can be removed, eliminating 60+ call sites. Any new route added to the plugin automatically inherits the guard.
2.4 Error Handling
The centralized error handler in lib/fastify-setup.ts normalizes all errors to { error: { code, message } }. Per-route custom errors use apiError(reply, status, code, message), maintaining the same shape throughout.
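For reference, a helper with that contract can be as small as the following sketch (the actual implementation may differ):

```ts
// Minimal sketch of the apiError contract described above.
import type { FastifyReply } from "fastify";

export function apiError(reply: FastifyReply, status: number, code: string, message: string) {
  return reply.status(status).send({ error: { code, message } });
}
```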
Gap — R-18: The rate-limit plugin throws a non-Error object, which is forwarded verbatim by the !(err instanceof Error) branch of the error handler. The resulting 429 body is { statusCode, error, message } — a different shape from every other API error. Client code must handle two shapes.
Fix (R-18): Intercept rate-limit errors in the error handler and normalize them:
```ts
// apps/api/src/lib/fastify-setup.ts — inside registerErrorHandler
fastify.setErrorHandler((err, _req, reply) => {
  if (err.validation) {
    return reply.status(400).send({ error: { code: "VALIDATION_ERROR", message: err.message } });
  }

  const status = (err as any).statusCode ?? 500;

  // Normalize @fastify/rate-limit (and other plugin) errors to the standard shape.
  if (!(err instanceof Error)) {
    return reply.status(status).send({
      error: { code: "RATE_LIMITED", message: (err as any).message ?? "Too many requests" },
    });
  }

  if (status >= 500) fastify.log.error({ err }, "[api] internal server error");
  return reply.status(status).send({ error: { code: "INTERNAL_ERROR", message: err.message } });
});
```
Gap: Fastify’s validation errors expose internal schema field paths in the message. Add a sanitization step for production environments:
Fix: Strip internal field paths from validation error messages before they reach the client:
```ts
if (err.validation) {
  const safe = process.env.NODE_ENV === "production" ? "Request validation failed" : err.message;
  return reply.status(400).send({ error: { code: "VALIDATION_ERROR", message: safe } });
}
```
2.5 Request/Response Configuration
From index.ts:52-58:
- connectionTimeout: 10_000 — protects against slow-loris ✓
- requestTimeout: 30_000 — may terminate streaming SSE responses from /aichat/v1/chat
- bodyLimit: 512KB — reasonable; file uploads bypass via @fastify/multipart ✓
- trustProxy — reads from env, supports CIDR notation ✓
Gap: The 30s requestTimeout will cut off long-running SSE streams. BullMQ jobs are async so job execution is unaffected, but any endpoint that holds an open connection (SSE, long-poll) will be terminated.
Fix: Disable the timeout per-route for streaming endpoints:
```ts
// apps/api/src/routers/ai-chat.ts
fastify.get("/stream", {
  config: { rawBody: true },
  // Override the global 30s timeout for this SSE endpoint.
  // BullMQ job timeout is independently controlled via lockDuration.
  handler: async (req, reply) => {
    reply.raw.setTimeout(0); // no timeout for this connection
    // … SSE logic
  },
});
```
2.6 OpenAPI / Swagger — incomplete schemas
Swagger is registered in production (index.ts) but not in app.ts (acceptable — docs don’t need to run in tests).
Gap: Not all routes define JSON Schema for their responses. Routes without a schema are invisible in the generated OpenAPI spec. Additionally, response schemas without additionalProperties: true silently strip fields from serialized responses.
Fix: Enforce schema coverage via ESLint:
```js
// eslint.config.mjs — add rule
"no-restricted-syntax": ["warn", {
  selector: "CallExpression[callee.property.name='get'][arguments.length=1]",
  message: "Fastify GET routes must define a schema object as the second argument.",
}]
```

For existing response schemas, add a global default:

```ts
// apps/api/src/lib/fastify-setup.ts
fastify.addSchema({
  $id: "defaultResponse",
  type: "object",
  additionalProperties: true, // prevents field stripping on undocumented responses
});
```
3. Queue and Worker Architecture
3.1 Queue Design and Missing Dead-Letter Queue — R-6
Queues use the double-underscore naming convention (agent__{role}) and share a sensible default retry policy:
- attempts: 4
- backoff: exponential, 5s initial
- removeOnComplete: { count: 100 }
- removeOnFail: { count: 50 }

Risk (R-6): There is no dead-letter queue (DLQ). Jobs that exhaust all 4 attempts are marked “failed” in BullMQ and written to AgentRun, but:
- No one is alerted when a job permanently fails
- With removeOnFail: { count: 50 }, only the 50 most recent failed payloads are retained — older ones are silently discarded
- Failed jobs cannot be replayed without manual Redis intervention
Fix (R-6): Register a global failed event handler on the agent server that enqueues an internal DM-team notification when a job exhausts all attempts:

```ts
// apps/servers/agents/src/index.ts — add after worker start calls
import { Queue, QueueEvents } from "bullmq";
import { enqueueNotification } from "@leadmetrics/queue";

const CRITICAL_QUEUES = [
  "agent__blog-writer",
  "agent__strategy-writer",
  "agent__setup",
  "agent__social-post-writer",
];

for (const queueName of CRITICAL_QUEUES) {
  const events = new QueueEvents(queueName, { connection: redis });
  const queue = new Queue(queueName, { connection: redis });

  events.on("failed", async ({ jobId, failedReason }) => {
    // Only alert on final failure (not intermediate retries).
    const job = await queue.getJob(jobId);
    if (!job || job.attemptsMade < (job.opts.attempts ?? 4)) return;

    await enqueueNotification({
      channel: "email",
      recipient: { email: process.env.ALERTS_EMAIL ?? "ops@leadmetrics.ai" },
      template: "agent_failure",
      data: {
        queueName,
        jobId,
        tenantId: job.data.tenantId,
        reason: failedReason,
        payload: JSON.stringify(job.data).slice(0, 500),
      },
    });
  });
}
```

Also increase removeOnFail to keep evidence longer:

```ts
// packages/queue/src/queues.ts — update defaultJobOptions
removeOnFail: { count: 500, age: 7 * 24 * 60 * 60 }, // keep 500 entries or 7 days
```
3.2 Deduplication Strategy
Two deduplication patterns:
- Static jobId + TTL — for idempotent setup operations ✓
- Timestamp jobId — for revision chains that must always run ✓
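A sketch of how the two patterns map onto BullMQ add() options (queue handles and payload fields are assumptions):

```ts
// Pattern 1 — static jobId: while a job with this id exists, duplicate adds are no-ops.
// removeOnComplete's age option gives the dedup window its TTL.
await setupQueue.add("setup", { tenantId }, { jobId: `setup:${tenantId}` });

// Pattern 2 — timestamp jobId: every revision request yields a distinct, always-run job.
await blogQueue.add(
  "revision",
  { tenantId, activityId },
  { jobId: `revision:${activityId}:${Date.now()}` }
);
```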
Assessment: Well-designed. The two-pattern approach correctly distinguishes idempotent from non-idempotent operations. No action needed.
3.3 Agent Worker Server — Monolithic Process — R-8
All 44 workers run in a single Node.js process. Each holds its own Worker instance sharing the same process memory and event loop.
Strengths: Simple deployment, clean graceful shutdown, maxStalledCount: 2 for Windows cold-start.
Risk (R-8): A memory leak in one worker (e.g., large context accumulating in blog-writer) OOM-kills all 44 workers simultaneously. CPU-bound workers (e.g., website-crawler) can starve I/O-bound workers in the same event loop. The sequential shutdown drains one worker at a time — on a busy server with 5-min Claude jobs, SIGTERM-to-exit could exceed Kubernetes’ terminationGracePeriodSeconds.
Fix (R-8): Split the monolith into two processes grouped by resource profile:
Process A — server-agents-llm (long-running, CPU/token-intensive):
setup, strategy-writer, strategy, activity, blog-writer, blog-faq-writer, social-post-writer, social-post-designer, landing-page-writer, content-repurposer, gbp-post-writer, custom-report-writer, backlink-outreach-writer, seo-optimizer, campaign-brief-writer, linkedin-ads-writer, review-campaign-writer, meta-ads-optimizer, linkedin-ads-optimizer, brand-narrative-analyst

Process B — server-agents-util (short-running, I/O-bound):
keyword-researcher, content-brief-writer, social-calendar-planner, email-writer, google-ads-writer, meta-ads-writer, report-writer, topic-researcher, research-note-writer, review-response-writer, site-auditor, backlink-researcher, website-crawler, ads-analyst, anomaly-detector, all insight workers, ai-visibility-monitor, ai-visibility-seeder, gsc-keywords-snapshot, github-source-sync, newsletter-sender, search-term-classifier, campaign-optimizer-runner, seo-outreach-optimizer, review-campaign-optimizer

This split means a runaway LLM job cannot starve search-sync or notification workers. Separate Docker containers also allow independent scaling (server-agents-llm needs more memory; server-agents-util needs more CPU cores).

To handle the sequential shutdown problem, drain workers in parallel:

```ts
// Replace sequential awaits with a parallel drain
await Promise.all([
  stopBlogWriterWorkers(),
  stopStrategyWriterWorkers(),
  stopActivityWorkers(),
  // … all workers
]);
```
3.4 Job Payload Typing — No Runtime Validation — R-13 (queue variant)
Job payload types are statically typed in packages/queue/src/types.ts. This is good for compile-time safety.
Gap: There is no runtime validation at enqueue or dequeue time. A caller that passes a malformed payload (e.g., tenantId: undefined) gets past TypeScript if the calling code has loose types, and the worker discovers the issue deep into execution — after credits are reserved and AgentRun is created.
Fix: Add Zod schema validation inside each enqueueXxx function:

```ts
// packages/queue/src/schemas.ts (NEW FILE)
import { z } from "zod";

export const ActivityJobDataSchema = z.object({
  tenantId: z.string().uuid(),
  activityId: z.string().uuid(),
  wakeReason: z.enum(["initial", "rejection", "revision"]).optional(),
});

// packages/queue/src/queues.ts — inside enqueueBlogWriter
export async function enqueueBlogWriter(data: ActivityJobData) {
  ActivityJobDataSchema.parse(data); // throws ZodError on invalid payload
  const queue = getQueue("agent__blog-writer");
  // … rest of enqueue logic
}
```

The same Zod schema can be reused as a TypeScript type via z.infer<>, removing the separate interface.
3.5 Event Flow — Excess Redis Connections
agent-events.ts creates its own IORedis connection (lines 52-63), separate from the BullMQ connection managed by packages/queue. Each worker process holds at minimum 2 Redis connections: one for BullMQ and one for event publishing.
Fix: Accept the Redis client as a parameter in publishAgentEvent instead of creating a private connection. Callers pass the same IORedis instance used by BullMQ:

```ts
// packages/agents/src/agent-events.ts
let _publisher: IORedis | null = null;

export function setPublisher(client: IORedis): void {
  _publisher = client;
}

function getPublisher(): IORedis {
  if (!_publisher) {
    // Lazy fallback for tests that don't call setPublisher()
    _publisher = new IORedis(process.env.REDIS_URL ?? "redis://localhost:6379", {
      lazyConnect: true,
      maxRetriesPerRequest: 3,
      enableOfflineQueue: false,
    });
  }
  return _publisher;
}

// apps/servers/agents/src/index.ts — after Redis connects
import { setPublisher } from "@leadmetrics/agents/src/agent-events";
setPublisher(redis); // reuse the same IORedis instance
```

This reduces the agent server from 2N to N+1 Redis connections: one shared publisher plus BullMQ's own per-queue connections, which BullMQ manages itself.
4. Authentication Architecture
4.1 Three Auth Systems in One Platform
| System | Used by | Token | Storage |
|---|---|---|---|
| JWT (Fastify) | API, DM, Manage | 15-min access token + 7-day refresh token | HTTP-only cookie |
| Better Auth | Dashboard user records | Session-based | Session table in PostgreSQL |
| Next.js middleware | All portals | Reads the JWT cookie | No additional storage |
Users are stored in the Better Auth schema (User, Session, Account), but all API access uses JWT tokens issued by the Fastify auth route.
Risk: The two systems can drift. A Better Auth session invalidated server-side still allows API access until the JWT expires (15 min).
Fix: When Better Auth invalidates a session (e.g., password change, logout from all devices), also write the current JWT jti (JWT ID) to a Redis revocation set:

```ts
// apps/api/src/routers/auth.ts — inside the logout/invalidate handler
import { redis } from "../plugins/redis";

// Add jti to the revocation set with a TTL matching token expiry
await redis.setex(`revoked_token:${payload.jti}`, 15 * 60, "1");

// apps/api/src/lib/auth.ts — inside requireTenantUser
const revoked = await redis.exists(`revoked_token:${payload.jti}`);
if (revoked) return apiError(reply, 401, "UNAUTHORIZED", "Token has been revoked.");
```

The Redis key expires automatically when the token would have expired anyway, so there is no unbounded revocation-list growth. This is a low-overhead addition requiring only one Redis call per request.
4.2 Token Claims
```ts
interface AccessTokenPayload {
  sub: string;          // user id
  tenantId?: string;    // absent for super_admin
  role: string;
  appAccess: string[];
}
```

The optional tenantId means every route author must handle the super_admin special case. This is a footgun for new contributors.
Fix: Make the distinction explicit by using two separate token interfaces, one for tenant users and one for super admins:
```ts
// packages/middleware/src/jwt.ts
interface TenantUserPayload {
  sub: string;
  tenantId: string;
  role: "member" | "admin" | "reviewer";
  appAccess: string[];
}

interface SuperAdminPayload {
  sub: string;
  tenantId?: never;
  role: "super_admin";
  appAccess: string[];
}

type AccessTokenPayload = TenantUserPayload | SuperAdminPayload;
```

requireTenantUser() returns TenantUserPayload (TypeScript narrows to a guaranteed tenantId), while requireSuperAdmin() returns SuperAdminPayload. New route authors get compile-time feedback if they mix them up.
4.3 Token Refresh (DM Portal) — R-7
The DM portal middleware performs a synchronous call to POST /auth/v1/refresh on every request when the access token is expired. With a 15-minute token lifetime, a DM reviewer active for 2 hours triggers ~8 refresh calls. Crucially, the refresh is inline in Next.js middleware — it adds the API round-trip latency to every page navigation after expiry.
Risk (R-7): If the API is unreachable (rolling deploy, Redis restart), all DM portal navigations redirect to /login, logging out active users.
Fix (R-7) — two-part:
Part 1: Extend the DM portal access token TTL from 15 min to 60 min. The DM portal has read-heavy usage (review, approve) and lower security sensitivity than the dashboard (no billing, no settings changes). This reduces refresh calls by 4× with no architectural change.

```ts
// apps/api/src/routers/auth.ts — inside the DM login flow
const dmAccessToken = signAccessToken(payload, "1h"); // was "15m"
```

Part 2: Make the middleware resilient to API outages — allow the request through rather than redirecting when the refresh call fails:

```ts
// apps/dm/src/middleware.ts
try {
  const refreshed = await fetch(`${API_URL}/auth/v1/refresh`, { ... });
  if (refreshed.ok) { /* set new cookie */ }
  // If refresh failed with 401/403, redirect to login.
  else if (refreshed.status === 401 || refreshed.status === 403) {
    return NextResponse.redirect(new URL("/login", request.url));
  }
  // For 5xx / network errors, allow the request through.
  // The next API call from the page will return 401 and trigger client-side logout.
} catch {
  // API unreachable — let the request proceed; client-side auth handles 401s.
}
```
4.4 No Token Revocation — R-5
Neither access tokens nor refresh tokens can be revoked before expiry:
- Compromised access token → valid up to 15 min
- Compromised refresh token → valid up to 7 days
- Revoking a user requires waiting for token expiry or rotating JWT_SECRET globally
Fix (R-5): Redis-backed token revocation (same as the §4.1 fix, extended to refresh tokens):

```ts
// packages/middleware/src/jwt.ts — add jti claim at issue time
import { v4 as uuidv4 } from "uuid";

export function signAccessToken(payload: AccessTokenPayload, expiresIn = "15m") {
  return jwt.sign({ ...payload, jti: uuidv4() }, process.env.JWT_SECRET!, { expiresIn });
}

export function signRefreshToken(userId: string, expiresIn = "7d") {
  return jwt.sign({ sub: userId, jti: uuidv4() }, process.env.JWT_REFRESH_SECRET!, { expiresIn });
}

// apps/api/src/routers/auth.ts — revoke on logout
export const logoutRoute: FastifyPluginAsync = async (fastify) => {
  fastify.post("/logout", async (req, reply) => {
    const payload = await requireTenantUser(req, reply);
    const redis = fastify.redis;

    // Revoke both tokens
    await redis.setex(`revoked:${payload.jti}`, 15 * 60, "1"); // access token TTL
    const refreshCookie = req.cookies["refresh_token"];
    if (refreshCookie) {
      const rp = jwt.decode(refreshCookie) as any;
      if (rp?.jti) await redis.setex(`revoked:${rp.jti}`, 7 * 24 * 60 * 60, "1");
    }

    reply.clearCookie("access_token").clearCookie("refresh_token");
    return reply.send({ ok: true });
  });
};

// apps/api/src/lib/auth.ts — check revocation in requireTenantUser
const revoked = await fastify.redis.exists(`revoked:${payload.jti}`);
if (revoked) return apiError(reply, 401, "UNAUTHORIZED", "Token revoked.");
```

For a “revoke all sessions” event (password change, security incident), store a per-user revoke_before timestamp and reject all tokens issued before that time — no need to enumerate every active token.
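A sketch of that variant (key name and TTL are assumptions):

```ts
// Sketch — "revoke all sessions" via a per-user revoke_before timestamp.
// On password change / security incident:
await redis.set(`revoke_before:${userId}`, Date.now(), "EX", 7 * 24 * 60 * 60);

// In requireTenantUser, alongside the jti check (JWT iat is in seconds):
const revokeBefore = await redis.get(`revoke_before:${payload.sub}`);
if (revokeBefore && payload.iat * 1000 < Number(revokeBefore)) {
  return apiError(reply, 401, "UNAUTHORIZED", "Session revoked.");
}
```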
5. Database Layer
5.1 Schema Scale
The Prisma schema (packages/db/prisma/schema.prisma) is 3,058 lines covering 100+ models, all in one shared PostgreSQL database. Tenant isolation uses tenantId FKs on every tenant-scoped table.
Assessment: Monolithic schema is correct at this scale. Separate DB per tenant would complicate migrations. No action needed here.
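For illustration, a hypothetical tenant-scoped model showing the convention (field names are assumptions):

```prisma
// Hypothetical tenant-scoped model — every such table carries the tenantId FK.
model BlogPost {
  id        String   @id @default(uuid())
  tenantId  String
  tenant    Tenant   @relation(fields: [tenantId], references: [id])
  title     String
  status    String
  createdAt DateTime @default(now())

  @@index([tenantId]) // keeps per-tenant row filtering cheap
}
```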
5.2 Client Singleton
```ts
// packages/db/src/index.ts
const globalForPrisma = globalThis as unknown as { prisma: PrismaClient };
export const db = globalForPrisma.prisma || new PrismaClient({ ... });
if (process.env.NODE_ENV !== "production") globalForPrisma.prisma = db;
```

This is the standard Next.js hot-reload singleton pattern. Correct — no action needed.
Gap — R-9: No explicit Prisma connection pool configuration. Prisma’s default pool size is num_physical_cpus × 2 + 1. The agents server running 44 workers and the API server each create a Prisma client; concurrent DB queries can exhaust the pool, causing query queuing under load.
Fix (R-9): Set connection pool size explicitly in DATABASE_URL for each server:

```bash
# .env / deployment env vars

# API server (handles HTTP requests — moderate concurrency)
DATABASE_URL="postgresql://user:pass@host/db?connection_limit=20&pool_timeout=10"

# Agent server (44 workers, mostly async — needs a larger pool)
DATABASE_URL="postgresql://user:pass@host/db?connection_limit=40&pool_timeout=30"

# Next.js portals (server components + actions — lower concurrency)
DATABASE_URL="postgresql://user:pass@host/db?connection_limit=10&pool_timeout=10"
```

Monitor pool wait time via Prisma’s query_duration metric. If pool_timeout errors appear in logs, increase connection_limit or add a PgBouncer connection pooler in front of PostgreSQL.
5.3 Dual Database Strategy
| Database | Purpose | Package |
|---|---|---|
| PostgreSQL | All application data | @leadmetrics/db |
| MongoDB | Audit logs, activity logs | @leadmetrics/nosqldb |
Using MongoDB exclusively for append-only logs is a reasonable separation.
Risk (R-4): agent-events.ts uses require("@leadmetrics/db") dynamically inside async functions (lines 92, 153). This pattern:
- Bypasses TypeScript’s module resolution — no compile-time error if the import path changes
- Fires at runtime on the hot path of every agent job
- Is silently swallowed by an empty catch {} — so if Prisma client generation is missing, agent run records simply stop persisting with no developer-visible error
Fix (R-4): Break the circular dependency structurally by accepting the Prisma client via a one-time setup call rather than a lazy require:

```ts
// packages/agents/src/agent-events.ts
import type { PrismaClient } from "@leadmetrics/db";

let _db: PrismaClient | null = null;

export function initAgentEvents(db: PrismaClient): void {
  _db = db;
}

export async function publishAgentEvent(event: AgentEvent): Promise<void> {
  // ... Redis publish logic (unchanged) ...

  if (_db) {
    // No dynamic require — static import, TypeScript-safe, visible to ESLint
    if (event.type === "agent:started") {
      await _db.agentRun.create({ data: { ... } });
    }
    // ...
  } else {
    // Loud, developer-visible warning if initAgentEvents was never called
    console.warn("[agent-events] DB not initialized — run records will not be persisted");
  }
}

// apps/servers/agents/src/index.ts
import { db } from "@leadmetrics/db";
import { initAgentEvents } from "@leadmetrics/agents/src/agent-events";

initAgentEvents(db); // called once at server startup
```

This makes the missing-Prisma case visible at startup rather than silently failing per-job.
6. Multi-Tenancy
6.1 Isolation Model
All tenant data is row-filtered on a shared PostgreSQL database. Every tenant-scoped table carries a tenantId FK. Routes extract tenantId from the JWT claim and pass it to every Prisma query.
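Concretely, the discipline-based pattern every handler follows today looks like this (names representative):

```ts
// Representative handler — the manual tenantId filter that §6.2 proposes to make structural.
const { tenantId } = await requireTenantUser(req, reply);

const posts = await db.blogPost.findMany({
  // Omitting tenantId here would silently return every tenant's rows (see R-2).
  where: { tenantId, status: "published" },
});
```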
6.2 No ORM-Level Isolation Enforcement — R-2
Prisma does not enforce row-level isolation natively. Any query without an explicit tenantId filter returns data from all tenants. Current mitigations are entirely developer-discipline:
- requireTenantUser() guard (opt-in — can be forgotten)
- Code review (human — not guaranteed at scale)
- Integration tests (cover the happy path — do not test cross-tenant leakage)
Risk (R-2): This is a high-severity architectural risk. A single missing where: { tenantId } silently exposes every tenant’s data to the request caller.
Fix (R-2): Create a Prisma Client Extension that appends tenantId to every query on tenant-scoped models:

```ts
// packages/db/src/tenant-client.ts (NEW FILE)
import { db } from "./index";

// Models that are NOT tenant-scoped (global tables — no tenantId filter).
// Note: in client extensions, `model` is the PascalCase model name.
const GLOBAL_MODELS = new Set(["Tenant", "User", "AgentConfig", "Skill", "RepurposingTemplate"]);

export function createTenantClient(tenantId: string) {
  return db.$extends({
    query: {
      $allModels: {
        async $allOperations({ model, operation, args, query }) {
          if (GLOBAL_MODELS.has(model)) return query(args);

          // Inject tenantId into WHERE for reads and bulk writes.
          if (["findMany", "findFirst", "count", "aggregate"].includes(operation)) {
            (args as any).where = { ...(args as any).where, tenantId };
          }
          if (operation === "create") {
            (args as any).data = { ...(args as any).data, tenantId };
          }
          if (["updateMany", "deleteMany"].includes(operation)) {
            (args as any).where = { ...(args as any).where, tenantId };
          }
          // findUnique/update/delete take unique-field WHEREs; injecting tenantId
          // there requires a compound unique (e.g., @@unique([id, tenantId])) or
          // remapping those calls to findFirst/updateMany/deleteMany.
          return query(args);
        },
      },
    },
  });
}

// Export type for use in function signatures
export type TenantClient = ReturnType<typeof createTenantClient>;
```

Migration strategy — roll out incrementally per router:

```ts
// apps/api/src/routers/blog.ts
const { tenantId } = await requireTenantUser(req, reply);
const tdb = createTenantClient(tenantId); // use tdb instead of db for all queries

const post = await tdb.blogPost.findFirst({ where: { id: params.id } });
// tenantId is injected automatically — cross-tenant leakage is now structurally impossible
```

Super-admin routes continue to use the raw db client since they need cross-tenant access.
6.3 Super Admin Cross-Tenant Access
Super-admin routes access all tenant data by explicit tenantId URL parameter, not JWT claim. Protected by requireSuperAdmin(). This is correct — no action needed.
7. Real-Time Architecture
7.1 Event Delivery Path
```
Worker process                        API process                        Browser
     │                                    │                                 │
     ├─ PUBLISH agent_events:T ──►        │                                 │
     │        [Redis pub/sub]             ├─ psubscribe()                   │
     │                                    │   (single sub conn)             │
     │                                    ├─ emit "agent:event" ────────►   │ Socket.IO
     │                                    │   to tenant:T room              │
     ├─ SETEX agent_live:T:role ──►       │   (Redis key, 30m)              │
     └─ PUBLISH notifications:T ──►       ├─ emit "notification:new" ──►    │
```

The Redis-backed Socket.IO adapter enables horizontal scaling correctly. The agent_live:* Redis cache provides page-reload recovery for in-progress jobs. Both are sound design choices.
7.2 Redis Connection Count
The API process holds 4 Redis connections: redisPlugin, Socket.IO pubClient, subClient, and eventsSub. Four is a reasonable count, and all are correctly closed in onClose.
7.3 Socket.IO Room Authentication — R-11
The Socket.IO namespace join logic in socket/middleware/auth.ts was not fully audited.
Risk (R-11): If the JWT is verified at handshake time but room join is not validated against the token’s tenantId, a client could call socket.join("tenant:other-tenant-id") and receive another tenant’s agent events.
Fix (R-11): Enforce room membership at join time in the namespace handler:

```ts
// apps/api/src/socket/namespaces/tenant.ts
export function registerTenantNamespace(namespace: Namespace, pubClient: IORedis) {
  namespace.use(async (socket, next) => {
    // Verify the JWT at handshake time
    const token =
      socket.handshake.auth?.token ??
      socket.handshake.headers?.authorization?.split(" ")[1];
    if (!token) return next(new Error("Authentication required"));
    try {
      const payload = verifyAccessToken(token);
      socket.data.tenantId = payload.tenantId;
      socket.data.role = payload.role;
      next();
    } catch {
      next(new Error("Invalid token"));
    }
  });

  namespace.on("connection", (socket) => {
    const { tenantId, role } = socket.data;

    if (tenantId) {
      // Only auto-join the room matching the token's tenantId
      socket.join(`tenant:${tenantId}`);
    }
    if (role === "super_admin" || role === "reviewer") {
      socket.join("dm:live");
    }

    // Reject explicit room-join requests for other tenants
    socket.on("join:room", (roomName: string) => {
      if (roomName === `tenant:${tenantId}`) socket.join(roomName);
      else socket.emit("error", { code: "FORBIDDEN", message: "Cannot join this room" });
    });
  });
}
```
7.4 Reconnection and Client-Side Resilience
Redis pub/sub is fire-and-forget — events published while a client is disconnected are lost. Notifications are persisted to the Notification table (clients can fetch on reconnect). The agent_live:* cache (30min TTL) recovers in-progress job state.
Gap: No explicit Socket.IO reconnection handler. A client that reconnects mid-job must poll agent_live:* or the AgentRun table to recover current state.
Fix: Add a client-side handler on the Socket.IO connect event (which also fires on every reconnect) that polls the live progress key:

```ts
// apps/dashboard/src/hooks/useAgentEvents.ts
socket.on("connect", async () => {
  // On any (re)connect, poll the live progress cache for current agent state
  const live = await fetch("/api/agent/live-status").then((r) => r.json());
  if (live.running) setAgentProgress(live);
});
```

Add a matching API route that reads agent_live:{tenantId}:{role} from Redis and returns the cached progress object.
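A sketch of such a route, assuming the key format from §7.1 and the fastify.redis decoration used elsewhere in this review (route path is an assumption):

```ts
// Sketch — route path and response shape are assumptions.
fastify.get("/tenant/v1/agent/live-status", async (req, reply) => {
  const user = await requireTenantUser(req, reply);
  if (!user) return;

  // Workers SETEX agent_live:{tenantId}:{role} with a 30-min TTL (§7.1).
  // KEYS is acceptable for a small per-tenant keyspace; prefer SCAN at scale.
  const keys = await fastify.redis.keys(`agent_live:${user.tenantId}:*`);
  if (keys.length === 0) return reply.send({ running: false });

  const values = await fastify.redis.mget(keys);
  return reply.send({
    running: true,
    jobs: values.filter((v): v is string => v !== null).map((v) => JSON.parse(v)),
  });
});
```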
8. Package Boundaries and Dependency Graph
8.1 Critical Boundary: Worker Code in Next.js
packages/agents/src/index.ts deliberately does not bulk-export worker start/stop functions to prevent BullMQ from entering the Next.js webpack bundle. This is correctly implemented and must be preserved.
Risk: A developer who naively adds a worker export and imports it in a Next.js server action ships BullMQ + IORedis to the browser. There is no automated guard against this.
Fix: Add a bundle-size CI check and an ESLint rule:

```js
// eslint.config.mjs
"no-restricted-imports": ["error", {
  paths: [{
    name: "@leadmetrics/agents/src/workers",
    message: "Worker files must not be imported in Next.js apps — they pull BullMQ into the browser bundle.",
  }],
}]
```

Additionally, add "server-only" at the top of each worker file so Next.js throws a build error if it is imported from a client context:

```ts
// packages/agents/src/workers/blog-writer.worker.ts
import "server-only"; // throws at build time if imported into a browser bundle
```
8.2 Dynamic Require Pattern — R-19
Two files use require("@leadmetrics/db") dynamically:
packages/agents/src/agent-events.ts:92,153packages/agents/src/workers/insights/insight-worker-base.ts:33
Risk (R-19): Dynamic requires bypass TypeScript analysis, ESLint import resolution, and tree-shaking. Module path renames cause silent runtime failures.
Fix (R-19): Resolved as part of the R-4 fix (§5.3). Once initAgentEvents(db) is introduced, the dynamic require in agent-events.ts is eliminated entirely. For insight-worker-base.ts, import db statically at the top — the circular dependency that forced the dynamic require is broken once agent-events.ts accepts db as a parameter:

```ts
// packages/agents/src/workers/insights/insight-worker-base.ts
import { db } from "@leadmetrics/db"; // static import, no dynamic require
```
8.3 Provider Package Proliferation — R-20
There are 20+ provider packages under packages/providers/*. Each wraps a third-party SDK. Isolation is clean, but build overhead is real — Turborepo must build all 20+ packages before any dependent app.
Risk (R-20): A provider imported at the top level of a shared package that is also used by Next.js apps could pull large SDK bundles (e.g., googleapis, @aws-sdk/*) into the browser bundle.
Fix (R-20): Add "server-only" imports to all provider packages:

```ts
// packages/providers/google/src/index.ts
import "server-only"; // Next.js will throw at build time if this is imported client-side
```

Additionally, run a bundle analyzer in CI to catch regressions:

```bash
# package.json (dashboard app)
"analyze": "ANALYZE=true next build"
```

```js
// apps/dashboard/next.config.mjs
import withBundleAnalyzer from "@next/bundle-analyzer";
export default withBundleAnalyzer({ enabled: process.env.ANALYZE === "true" })({ ... });
```

Run pnpm analyze in CI on PRs that touch packages/providers/* or packages/common.
9. Observability and Error Handling
9.1 Logging — No Correlation IDs — R-12
Structured JSON logging is in place via @leadmetrics/logger across all services. Good.
Gap (R-12): No request correlation IDs. A user request enters the API, enqueues a BullMQ job, and is processed by a worker across three separate log streams. Correlating these by timestamp + tenantId + runId is manual and error-prone.
Fix (R-12): Generate a requestId at the API boundary and thread it through the job payload:

```ts
// apps/api/src/index.ts — honor an inbound X-Request-Id or generate one.
// genReqId runs at request creation, so the id is bound into the request logger.
import { randomUUID } from "crypto";

const fastify = Fastify({
  genReqId: (req) => (req.headers["x-request-id"] as string) ?? randomUUID(),
});

fastify.addHook("onSend", async (req, reply) => {
  reply.header("x-request-id", req.id);
});

// packages/queue/src/types.ts — add to the base job data type
interface BaseJobData {
  tenantId: string;
  requestId?: string; // threaded from the API request
}

// In route handlers — pass requestId when enqueueing
await enqueueBlogWriter({ tenantId, activityId, requestId: req.id });

// In workers — include it in all log calls
log.info({ tenantId, activityId, requestId: data.requestId }, "Blog writer started");
```

With requestId in every log line across API + queue + worker, a single grep requestId=abc-123 correlates the full lifecycle.
9.2 No Distributed Tracing — R-15
Individual logs are structured but there is no cross-service trace graph.
Fix (R-15): Add OpenTelemetry SDK instrumentation:

```bash
pnpm add @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http
```

```ts
// apps/api/src/tracing.ts (NEW FILE — import before all other modules)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "leadmetrics-api",
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

The auto-instrumentations cover Fastify, Prisma, IORedis, and HTTP calls automatically. Add the same setup to each server process. Use Jaeger, Grafana Tempo, or Datadog APM as the trace backend.
9.3 No Metrics Collection — R-16
There is no way to measure API latency percentiles, queue depth, credit consumption rate, or error rates.
Fix (R-16): Add Prometheus metrics via fastify-metrics:

```ts
// apps/api/src/index.ts
import metrics from "fastify-metrics";

await fastify.register(metrics, {
  endpoint: "/metrics", // scraped by Prometheus
  defaultMetrics: { enabled: true },
  routeMetrics: {
    enabled: true,
    groupStatusCodes: false,
    overrides: { histogram: { buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] } },
  },
});

// Custom business metrics (prom-client Counter is a class — note the `new`)
const creditCounter = new fastify.metrics.client.Counter({
  name: "leadmetrics_credits_consumed_total",
  help: "Total credits consumed",
  labelNames: ["tenantId", "creditType"],
});

// Call from consumeCredits()
creditCounter.inc({ tenantId, creditType });
```

Add BullMQ queue-depth metrics by polling queue.getJobCounts() on an interval and exposing them via a /metrics/queues endpoint (see the sketch below). Visualize in Grafana.
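A sketch of that queue-depth polling — the watched queue list, metric name, and 30s interval are assumptions:

```ts
// Sketch: poll BullMQ job counts into a Prometheus gauge.
import { Queue } from "bullmq";

const queueDepth = new fastify.metrics.client.Gauge({
  name: "leadmetrics_queue_jobs",
  help: "BullMQ job counts by queue and state",
  labelNames: ["queue", "state"],
});

const watched = ["agent__blog-writer", "agent__strategy-writer"].map(
  (name) => new Queue(name, { connection: redis })
);

setInterval(async () => {
  for (const queue of watched) {
    const counts = await queue.getJobCounts("waiting", "active", "failed", "delayed");
    for (const [state, count] of Object.entries(counts)) {
      queueDepth.set({ queue: queue.name, state }, count);
    }
  }
}, 30_000);
```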
9.4 Health Checks — Incomplete Readiness — R-10
The /health/ready endpoint only pings PostgreSQL. Redis, MongoDB, and Typesense are not checked.
Risk (R-10): A deployment where Redis is down but PostgreSQL is healthy passes the readiness check. The API reports ready but real-time events fail silently and all job enqueueing fails.
Fix (R-10): Expand the readiness check to cover all critical dependencies:
```ts
// apps/api/src/index.ts — replace the /health/ready handler
fastify.get("/health/ready", async (_req, reply) => {
  const checks: Record<string, "ok" | "error"> = {};

  // PostgreSQL
  try { await db.$queryRaw`SELECT 1`; checks.postgres = "ok"; }
  catch { checks.postgres = "error"; }

  // Redis
  try { await fastify.redis.ping(); checks.redis = "ok"; }
  catch { checks.redis = "error"; }

  // MongoDB
  try { await mongoose.connection.db.admin().ping(); checks.mongo = "ok"; }
  catch { checks.mongo = "error"; }

  // Typesense (optional — not on the critical path for most requests)
  try { await typesenseClient.health.retrieve(); checks.typesense = "ok"; }
  catch { checks.typesense = "error"; }

  const healthy = Object.values(checks).every((v) => v === "ok");
  return reply.status(healthy ? 200 : 503).send({ status: healthy ? "ok" : "degraded", checks });
});
```
9.5 Job Failure Alerting — covered under R-6 (§3.1)
9.6 Unhandled API Process Errors
The agent server registers unhandledRejection and uncaughtException handlers (correct). The API server relies solely on Fastify’s route-level error handler.
Risk: Since Node.js 15, an unhandled promise rejection terminates the process by default. An uncaught rejection in the API outside a route handler (e.g., in plugin startup or a background interval) therefore crashes the whole API with no structured log of the cause.
Fix: Add the same global handlers to the API server:
```ts
// apps/api/src/index.ts — inside bootstrap(), after fastify.listen()
process.on("unhandledRejection", (reason) => {
  fastify.log.error({ reason }, "Unhandled promise rejection in API process");
  // Do NOT exit — Fastify's route handlers catch route-level errors;
  // only exit for truly unrecoverable state.
});

process.on("uncaughtException", (err) => {
  fastify.log.error({ err }, "Uncaught exception in API process — shutting down");
  process.exit(1);
});
```
10. Security
10.1 Transport and Headers
@fastify/helmet is registered with CSP disabled (correct for a JSON API) and all other defaults active: HSTS, X-Frame-Options, X-Content-Type-Options, Referrer-Policy. CORS is env-driven. This is appropriate.
Risk: The origin === "null" allowance in CORS is intentional for OAuth (LinkedIn), but any HTML page loaded from file:// or a sandboxed iframe can send credentialed requests with null origin. In production (HTTPS-only), this is low risk.
Fix (documentation): Add a code comment and document this as an explicit trade-off:
```ts
// apps/api/src/lib/fastify-setup.ts
if (!origin || origin === "null" || allowedOrigins.has(origin)) {
  // "null" origin: OAuth providers (LinkedIn, Google) redirect back to the app
  // with Origin: null when navigating cross-scheme (e.g., HTTPS → HTTP).
  // In production this is safe because the browser only sends null-origin
  // requests from HTTPS origins and our cookies are SameSite=Lax.
  cb(null, true);
}
```
10.2 Secret Management — R-17
Secrets are managed exclusively via environment variables with no vault, no rotation mechanism, and no secret scanning in CI.
Risk (R-17): A leaked DATABASE_URL + encryption key would expose all OAuth tokens stored in the Account model. A leaked JWT_SECRET allows forging arbitrary tokens. A leaked INTERNAL_API_SECRET allows calling billing endpoints from outside.
Fix (R-17) — three layers:
Layer 1 — Secret scanning in CI (immediate, zero code change):

```yaml
# .github/workflows/security.yml
- name: Scan for secrets
  uses: trufflesecurity/trufflehog@main
  with:
    path: ./
    base: ${{ github.event.repository.default_branch }}
    head: HEAD
    extra_args: --only-verified
```

Layer 2 — Rotation-ready secret naming (near-term): Add a JWT_SECRET_VERSION env var and sign tokens with ${JWT_SECRET_VERSION}:${JWT_SECRET}. Verification attempts the current version and falls back to JWT_SECRET_PREV for a 24h rotation window. This allows zero-downtime secret rotation without invalidating all live tokens.
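A sketch of the version-prefixed verification described in Layer 2 (the helper name is an assumption; env vars are from the text):

```ts
// Sketch — version-prefixed signing secrets with a rotation fallback window.
import jwt from "jsonwebtoken";

const CURRENT = `${process.env.JWT_SECRET_VERSION}:${process.env.JWT_SECRET}`;
// JWT_SECRET_PREV holds the previous version-prefixed secret during the 24h window.
const PREVIOUS = process.env.JWT_SECRET_PREV;

export function verifyWithRotation(token: string) {
  try {
    return jwt.verify(token, CURRENT);
  } catch (err) {
    if (PREVIOUS) return jwt.verify(token, PREVIOUS); // tokens signed pre-rotation
    throw err;
  }
}
```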
Layer 3 — Secrets manager (medium-term): Integrate DigitalOcean Secrets or HashiCorp Vault:

```ts
// packages/config/src/secrets.ts
export async function loadSecrets(): Promise<void> {
  if (process.env.VAULT_ADDR) {
    const vault = new VaultClient({ endpoint: process.env.VAULT_ADDR, token: process.env.VAULT_TOKEN });
    const secrets = await vault.read("secret/leadmetrics");
    process.env.JWT_SECRET = secrets.data.jwt_secret;
    process.env.DATABASE_URL = secrets.data.database_url;
    // … all other secrets
  }
  // Otherwise fall back to env vars (dev / staging)
}
```
10.3 Input Validation — R-13
Not all routes define Fastify JSON Schemas. Routes without schemas accept any input shape.
Fix (R-13): Enforce schema presence via ESLint and add a Fastify plugin that flags unschema’d routes:

```ts
// apps/api/src/plugins/strict-schema.ts (NEW FILE)
import fp from "fastify-plugin";

export default fp(async (fastify) => {
  fastify.addHook("preValidation", async (req) => {
    // In production, log a warning for any route without a schema so we know to add one
    if (!req.routeOptions.schema && process.env.NODE_ENV === "production") {
      req.log.warn({ url: req.url, method: req.method }, "Route has no schema — add one");
    }
  });
});
```

For SQL injection via $queryRaw, enforce tagged-template usage via ESLint (a tagged template is a TaggedTemplateExpression node, so any plain CallExpression form indicates string interpolation):

```js
// eslint.config.mjs
"no-restricted-syntax": ["error", {
  selector: "CallExpression[callee.property.name='$queryRaw']",
  message: "$queryRaw must use tagged template literals (db.$queryRaw`...`), not string interpolation.",
}]
```
10.4 Rate Limiting — Auth Endpoints Unprotected — R-14
@fastify/rate-limit applies a global 300 req/min limit. Auth endpoints (/auth/v1/login, /auth/v1/register, /auth/v1/refresh) share this limit with all other traffic — an attacker can spend the entire 300 req/min budget on login attempts without tripping any auth-specific throttle.
Fix (R-14): Register tighter per-route rate limits on the auth endpoints:

```ts
// apps/api/src/routers/auth.ts
fastify.post("/login", {
  config: {
    rateLimit: {
      max: 10, // 10 attempts per window
      timeWindow: "1 minute",
      keyGenerator: (req) => `login:${req.ip}`, // per-IP limit
      errorResponseBuilder: () => ({
        error: { code: "RATE_LIMITED", message: "Too many login attempts. Try again in 1 minute." },
      }),
    },
  },
  handler: loginHandler,
});

fastify.post("/register", {
  config: { rateLimit: { max: 5, timeWindow: "1 minute" } },
  handler: registerHandler,
});

fastify.post("/refresh", {
  config: { rateLimit: { max: 30, timeWindow: "1 minute" } }, // higher — the DM portal refreshes every 15 min
  handler: refreshHandler,
});
```

Also add account lockout after N consecutive failures by storing a per-email failure counter in Redis:

```ts
const key = `login_fails:${email}`;
const fails = await redis.incr(key);
await redis.expire(key, 15 * 60); // reset the counter after 15 min
if (fails > 10) return apiError(reply, 429, "ACCOUNT_LOCKED", "Account temporarily locked.");
```
11. Build and CI/CD
11.1 Turborepo Configuration — R-21
turbo.json defines build, dev, lint, typecheck, and test variants. The build task correctly uses dependsOn: ["^build"].
Gap (R-21): test:unit has dependsOn: ["^build"], which forces a full monorepo build before unit tests can run. For a developer iterating on a single worker file, this adds minutes of unnecessary build time.
Fix (R-21):

```json
// turbo.json
{
  "tasks": {
    "test:unit": {
      "dependsOn": [], // unit tests have no build deps — pure function tests
      "cache": false
    },
    "test:integration": {
      "dependsOn": ["^build"], // integration tests need compiled packages
      "cache": false
    },
    "clean": { "cache": false }
  }
}
```

Add a clean script to each package’s package.json:

```json
"scripts": { "clean": "rm -rf .next dist .turbo" }
```

Then turbo run clean clears all build artifacts across the monorepo in one command.
11.2 Test Architecture
Four tiers: test:unit (Vitest), test:integration (Vitest + test DB), test:e2e (Playwright + full stack), test:coverage. The distinct tiers are good practice.
Gap: No end-to-end test for the BullMQ job lifecycle: enqueue → worker picks up → output saved → event published. Workers are unit-tested for pure functions but the BullMQ lifecycle itself (retry, backoff, deduplication, failure) is not exercised in CI.
Fix: Add a “worker integration” test tier that spins up a real BullMQ worker against the test Redis instance:
```ts
// packages/agents/src/workers/__tests__/blog-writer.integration.test.ts
import { describe, it, expect, beforeAll, afterAll } from "vitest";
import { Queue, QueueEvents, Worker } from "bullmq";
import { processBlogWriterJob } from "../blog-writer.worker";
// testRedis, TEST_TENANT, TEST_ACTIVITY, and db are assumed test fixtures.

describe("blog-writer BullMQ lifecycle", () => {
  let queue: Queue;
  let worker: Worker;
  let events: QueueEvents;

  beforeAll(async () => {
    queue = new Queue("agent__blog-writer", { connection: testRedis });
    worker = new Worker("agent__blog-writer", processBlogWriterJob, { connection: testRedis });
    events = new QueueEvents("agent__blog-writer", { connection: testRedis });
    await events.waitUntilReady();
  });

  afterAll(async () => {
    await worker.close();
    await events.close();
    await queue.close();
  });

  it("completes a job and writes BlogPost to DB", async () => {
    const job = await queue.add("test", { tenantId: TEST_TENANT, activityId: TEST_ACTIVITY });
    await job.waitUntilFinished(events);

    const post = await db.blogPost.findFirst({ where: { activityId: TEST_ACTIVITY } });
    expect(post?.status).toBe("dm_review");
  });
});
```
12. Summary: Strengths and Risks
Strengths
| # | Strength |
|---|---|
| S-1 | Clear domain-driven router organization — versioned prefixes, consistent naming |
| S-2 | BullMQ queue-per-role design — correct tenant isolation via payload, not queue name |
| S-3 | Deliberate package boundaries — agents package prevents worker code from entering Next.js bundles |
| S-4 | Centralized error handler — uniform { error: { code, message } } shape across all routes |
| S-5 | Redis-backed Socket.IO — multi-instance-safe real-time events |
| S-6 | Live progress caching — agent_live:* Redis key survives page reload |
| S-7 | Graceful shutdown — agent server drains all workers before exit |
| S-8 | Dual deduplication strategy — static jobId for idempotent ops, timestamp jobId for revisions |
| S-9 | Structured logging — JSON logs with service field across all processes |
| S-10 | Split health checks — liveness + readiness endpoints with DB ping |
Risk Register
| ID | Risk | Severity | Location | Fix section |
|---|---|---|---|---|
| R-1 | No default-deny auth — forgotten guard = public endpoint | 🔴 High | All route handlers | §2.3 |
| R-2 | No ORM-level tenant isolation — missing where: { tenantId } = data leak | 🔴 High | All Prisma queries | §6.2 |
| R-3 | app.ts vs index.ts router drift — missing routes 404 silently in tests | 🔴 High | apps/api/src/app.ts | §2.2 |
| R-4 | Dynamic require swallows Prisma errors — agent run records silently drop | 🔴 High | packages/agents/src/agent-events.ts:92 | §5.3 |
| R-5 | No token revocation — compromised refresh token valid 7 days | 🔴 High | JWT auth system | §4.4 |
| R-6 | No DLQ / permanent failure alerting — job failures are silent | 🟡 Medium | packages/queue/src/queues.ts | §3.1 |
| R-7 | DM portal synchronous token refresh — API outage logs out all DM users | 🟡 Medium | apps/dm/src/middleware.ts | §4.3 |
| R-8 | Single agent process for 44 workers — one OOM kills everything | 🟡 Medium | apps/servers/agents/src/index.ts | §3.3 |
| R-9 | No Prisma connection pool config — default pool queues under load | 🟡 Medium | packages/db/src/index.ts | §5.2 |
| R-10 | Readiness check misses Redis/MongoDB — partial outage reports healthy | 🟡 Medium | apps/api/src/index.ts:191 | §9.4 |
| R-11 | Socket.IO room join auth not audited — potential cross-tenant event leakage | 🟡 Medium | apps/api/src/socket/ | §7.3 |
| R-12 | No request correlation IDs — impossible to trace request across API + worker | 🟡 Medium | Entire stack | §9.1 |
| R-13 | Missing input schemas on some routes — arbitrary shapes reach the DB | 🟡 Medium | apps/api/src/routers/* | §10.3 |
| R-14 | Auth rate limit too permissive — brute-force on /auth/v1/login not throttled | 🟡 Medium | apps/api/src/plugins/rate-limit.ts | §10.4 |
| R-15 | No distributed tracing (OpenTelemetry) | 🟢 Low | All services | §9.2 |
| R-16 | No metrics collection (Prometheus) | 🟢 Low | All services | §9.3 |
| R-17 | No secret rotation strategy or scanning | 🟢 Low | Ops/infra | §10.2 |
| R-18 | Rate-limit 429 has different error shape than other API errors | 🟢 Low | lib/fastify-setup.ts | §2.4 |
| R-19 | Dynamic require in agent-events bypasses TypeScript analysis | 🟢 Low | agent-events.ts | §8.2 |
| R-20 | Provider packages may leak into Next.js bundles — no bundle analysis in CI | 🟢 Low | packages/providers/* | §8.3 |
| R-21 | test:unit depends on full build — slow local iteration | 🟢 Low | turbo.json | §11.1 |
13. Prioritized Fix Roadmap
Sprint 1 — Critical (block a security incident)
| Task | Risk | Effort |
|---|---|---|
| Extract registerRouters() shared function for app.ts + index.ts | R-3 | 1h |
| Add preHandler auth guards at router-plugin level | R-1 | 2h |
| Audit Socket.IO room join and enforce tenantId validation | R-11 | 2h |
| Add jti + Redis revocation to JWT sign/verify | R-5 | 3h |
| Replace dynamic require in agent-events.ts with initAgentEvents(db) | R-4 | 2h |
Sprint 2 — High Impact (prevent data leakage and silent failures)
| Task | Risk | Effort |
|---|---|---|
| Implement createTenantClient() Prisma extension, roll out to dm/ routes first | R-2 | 1 day |
| Extend DM access token TTL to 60 min + make middleware resilient to API outage | R-7 | 2h |
| Add per-route rate limits for /auth/v1/login, /register, /refresh | R-14 | 1h |
| Expand /health/ready to check Redis + MongoDB + Typesense | R-10 | 1h |
| Add QueueEvents failed listener with DM-team email notification | R-6 | 2h |
| Add requestId to all log lines and BullMQ job payloads | R-12 | 3h |
Sprint 3 — Medium Term (scalability and observability)
| Task | Risk | Effort |
|---|---|---|
| Configure explicit Prisma pool sizes per service | R-9 | 30min |
| Split agent server into server-agents-llm + server-agents-util | R-8 | 1 day |
| Add Zod validation to all enqueueXxx call sites | R-13 (queue) | 3h |
| Add OpenTelemetry auto-instrumentation to API + agent server | R-15 | 1 day |
| Add Prometheus metrics (fastify-metrics + queue depth polling) | R-16 | 1 day |
Sprint 4 — Polish (reduce developer friction)
| Task | Risk | Effort |
|---|---|---|
| Normalize rate-limit 429 response shape in error handler | R-18 | 30min |
Add "server-only" to all provider packages | R-20 | 1h |
Add truffleHog secret scanning to CI | R-17 | 30min |
Remove dependsOn: ["^build"] from test:unit in turbo.json | R-21 | 5min |
| Add BullMQ lifecycle integration test | §11.2 | 3h |
All file references verified against the live codebase as of 2026-04-25.