Architecture Review v1 — Leadmetrics v3
Date: 2026-04-25 Reviewer: Claude Code Scope: Full monorepo — API, portals, worker servers, packages, database, real-time, security
Historical note: This review was written when Better Auth was still in use for the Dashboard. Better Auth was removed in late April 2026 — all three portals now use a unified Fastify JWT system (HS256) via
@leadmetrics/middleware. The dual-auth risk (R-xxx) identified in this review is now resolved.
How to read this document: Every identified risk or gap is followed immediately by a Fix block with concrete implementation steps, affected files, and code examples. The Summary table cross-references each risk by ID (R-1 … R-21).
1. System Topology
Leadmetrics v3 is a multi-tenant AI marketing automation platform delivered as a pnpm monorepo with five distinct compute layers:
| Layer | Technology | Process count |
|---|---|---|
| Portals | Next.js 14 (App Router) | 4 apps (dashboard :3000, manage :3001, dm :3002, knowledgebase :3004) |
| REST API | Fastify 4 | 1 process (:3003) |
| Agent workers | BullMQ + Claude/Codex/Gemini | 1 monolithic process (44 workers) |
| Support servers | BullMQ | 6 dedicated processes (billing, notifications, ragengine, reporting, scheduler, search-indexer) |
| Mobile | React Native (Expo) | 1 app (dashboard-mobile) |
Infrastructure dependencies: PostgreSQL (primary DB), MongoDB (audit logs), Redis (queues, pub/sub, presence cache), Typesense (full-text search), Qdrant (vector search / RAG).
Workspace layout (pnpm-workspace.yaml):
apps/* — Next.js portals + Fastify API
apps/servers/* — Background server processes
packages/* — Shared libraries (db, ui, common, queue, billing, …)
packages/adapters/* — LLM adapters (claude-local, codex-local, gemini-local)
packages/providers/* — 20+ third-party integrations
2. API Layer
2.1 Router Organization
All routes live in apps/api/src/routers/. The 60+ route files are organized into versioned namespaces registered in apps/api/src/index.ts:
| Prefix | Domain | Example routes |
|---|---|---|
| /auth/v1 | Auth | login, register, refresh, profile |
| /admin/v1 | Superadmin | tenants, users, billing, audit, templates |
| /tenant/v1 | Tenant-scoped | blog, social, calendar, insights, channels |
| /dm/v1 | DM reviewer | activities, blog, social, strategy, contacts |
| /media/v1 | Media | image search, asset upload |
| /aichat/v1 | AI Chat | LangGraph conversation |
| /mobile/v1 | Mobile API | client-facing mobile endpoints |
| /pg/v1 | PG callbacks | payment gateway webhooks |
Assessment: Domain grouping is clear and consistent. The /v1 prefix on all namespaces is good practice. However, brandVoiceRouter appears in app.ts (test entry) but is missing from index.ts (production) — see R-3.
2.2 Entry Point Split (app.ts vs index.ts) — R-3
There are two entry points:
- index.ts — production server; registers Helmet, Swagger/OpenAPI, Socket.IO, MongoDB, rate limiting
- app.ts — test server; lighter setup, no Swagger, no Socket.IO, no MongoDB
The split is intentional for fast test startup but causes silent drift. Any router registered in index.ts but absent from app.ts will 404 in integration tests — and because no test was written for the missing route, the suite passes silently instead of failing loudly. This has burned the team before (feedback_api_app_ts_registration.md).
Fix (R-3): Extract router registration into a shared function consumed by both entry points.
```ts
// apps/api/src/lib/register-routers.ts (NEW FILE)
import type { FastifyInstance } from "fastify";
import { authRouter } from "../routers/auth";
import { adminRouter } from "../routers/admin";
import { tenantRouter } from "../routers/tenant";
import { dmRouter } from "../routers/dm";
import { brandVoiceRouter } from "../routers/brand-voice";
// … all other routers

export async function registerRouters(fastify: FastifyInstance): Promise<void> {
  await fastify.register(authRouter, { prefix: "/auth/v1" });
  await fastify.register(adminRouter, { prefix: "/admin/v1" });
  await fastify.register(tenantRouter, { prefix: "/tenant/v1" });
  await fastify.register(dmRouter, { prefix: "/dm/v1" });
  await fastify.register(brandVoiceRouter, { prefix: "/tenant/v1/brand-voice" });
  // … all other routers
}
```

Both index.ts and app.ts then call await registerRouters(fastify) — one source of truth, zero drift.
2.3 Authentication Guards — R-1
Auth is opt-in per route handler. There is no global preHandler that rejects unauthenticated requests. Each handler explicitly calls one of:
- requireTenantUser(req, reply) — verifies JWT, returns { sub, tenantId, role, appAccess }
- requireSuperAdmin(req, reply) — verifies JWT, asserts role === "super_admin"
- requireDMAccess(req, reply) — verifies JWT, asserts role in ["reviewer", "admin", "super_admin"]
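For illustration, the current opt-in pattern inside a typical handler looks roughly like this (route and model names are assumptions, not actual codebase paths):

```ts
// Hypothetical route showing the current opt-in guard pattern.
fastify.get("/posts/:id", async (req, reply) => {
  // If this call is forgotten, the route is publicly accessible.
  const user = await requireTenantUser(req, reply);
  if (!user) return; // guard has already sent the 401 reply

  const { id } = req.params as { id: string };
  return db.blogPost.findFirst({ where: { id, tenantId: user.tenantId } });
});
```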
Risk (R-1): A new route that forgets the guard call is publicly accessible — the TypeScript compiler does not enforce guard presence. With 60+ route files this is a latent exposure risk.
Fix (R-1): Register auth guards as Fastify preHandler hooks at the router/plugin level, not inside individual handlers.

```ts
// apps/api/src/routers/dm/index.ts
import { FastifyPluginAsync } from "fastify";
import { requireDMAccess } from "../../lib/auth";

export const dmRouter: FastifyPluginAsync = async (fastify) => {
  // Single guard covers every route registered in this plugin scope.
  fastify.addHook("preHandler", requireDMAccess);

  await fastify.register(dmActivitiesRouter);
  await fastify.register(dmBlogRouter);
  // … all dm sub-routers — all protected automatically
};
```

Apply the same pattern to tenantRouter (→ requireTenantUser) and adminRouter (→ requireSuperAdmin). Per-handler guard calls become unnecessary and can be removed, eliminating 60+ call sites. Any new route added to the plugin automatically inherits the guard.
2.4 Error Handling
The centralized error handler in lib/fastify-setup.ts normalizes all errors to { error: { code, message } }. Per-route custom errors use apiError(reply, status, code, message), maintaining the same shape throughout.
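For reference, a helper with that contract can be as small as the following sketch (the actual implementation may differ):

```ts
// Minimal sketch of the apiError contract described above.
import type { FastifyReply } from "fastify";

export function apiError(reply: FastifyReply, status: number, code: string, message: string) {
  return reply.status(status).send({ error: { code, message } });
}
```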
Gap — R-18: The rate-limit plugin throws a non-Error object, which is forwarded verbatim by the !(err instanceof Error) branch of the error handler. The resulting 429 body is { statusCode, error, message } — a different shape from every other API error. Client code must handle two shapes.
Fix (R-18): Intercept rate-limit errors in the error handler and normalize them:
```ts
// apps/api/src/lib/fastify-setup.ts — inside registerErrorHandler
fastify.setErrorHandler((err, _req, reply) => {
  if (err.validation) {
    return reply.status(400).send({ error: { code: "VALIDATION_ERROR", message: err.message } });
  }

  const status = (err as any).statusCode ?? 500;

  // Normalize @fastify/rate-limit (and other plugin) errors to the standard shape.
  if (!(err instanceof Error)) {
    return reply.status(status).send({
      error: { code: "RATE_LIMITED", message: (err as any).message ?? "Too many requests" },
    });
  }

  if (status >= 500) fastify.log.error({ err }, "[api] internal server error");
  return reply.status(status).send({ error: { code: "INTERNAL_ERROR", message: err.message } });
});
```
Gap: Fastify’s validation errors expose internal schema field paths in the message. Add a sanitization step for production environments:
Fix: Strip internal field paths from validation error messages before they reach the client:
```ts
if (err.validation) {
  const safe = process.env.NODE_ENV === "production" ? "Request validation failed" : err.message;
  return reply.status(400).send({ error: { code: "VALIDATION_ERROR", message: safe } });
}
```
2.5 Request/Response Configuration
From index.ts:52-58:
- connectionTimeout: 10_000 — protects against slow-loris ✓
- requestTimeout: 30_000 — may terminate streaming SSE responses from /aichat/v1/chat
- bodyLimit: 512KB — reasonable; file uploads bypass via @fastify/multipart ✓
- trustProxy — reads from env, supports CIDR notation ✓
Gap: The 30s requestTimeout will cut off long-running SSE streams. BullMQ jobs are async so job execution is unaffected, but any endpoint that holds an open connection (SSE, long-poll) will be terminated.
Fix: Disable the timeout per-route for streaming endpoints:
```ts
// apps/api/src/routers/ai-chat.ts
fastify.get("/stream", {
  config: { rawBody: true },
  // Override the global 30s timeout for this SSE endpoint.
  // BullMQ job timeout is independently controlled via lockDuration.
  handler: async (req, reply) => {
    reply.raw.setTimeout(0); // no timeout for this connection
    // … SSE logic
  },
});
```
2.6 OpenAPI / Swagger — incomplete schemas
Swagger is registered in production (index.ts) but not in app.ts (acceptable — docs don’t need to run in tests).
Gap: Not all routes define JSON Schema for their responses. Routes without a schema are invisible in the generated OpenAPI spec. Additionally, response schemas without additionalProperties: true silently strip fields from serialized responses.
Fix: Enforce schema coverage via ESLint:
```js
// eslint.config.mjs — add rule
"no-restricted-syntax": ["warn", {
  selector: "CallExpression[callee.property.name='get'][arguments.length=1]",
  message: "Fastify GET routes must define a schema object as the second argument.",
}]
```

For existing response schemas, add a global default:

```ts
// apps/api/src/lib/fastify-setup.ts
fastify.addSchema({
  $id: "defaultResponse",
  type: "object",
  additionalProperties: true, // prevents field stripping on undocumented responses
});
```
3. Queue and Worker Architecture
3.1 Queue Design and Missing Dead-Letter Queue — R-6
Queues use the double-underscore naming convention (agent__{role}) and share a sensible default retry policy:
- attempts: 4
- backoff: exponential, 5s initial
- removeOnComplete: { count: 100 }
- removeOnFail: { count: 50 }

Risk (R-6): There is no dead-letter queue (DLQ). Jobs that exhaust all 4 attempts are marked “failed” in BullMQ and written to AgentRun, but:
- No one is alerted when a job permanently fails
- With removeOnFail: { count: 50 }, only the 50 most recent failed payloads are retained — older ones are silently discarded
- Failed jobs cannot be replayed without manual Redis intervention
Fix (R-6): Register a global failed event handler on the agent server that enqueues an internal DM-team notification when a job exhausts all attempts:

```ts
// apps/servers/agents/src/index.ts — add after worker start calls
import { Queue, QueueEvents } from "bullmq";
import { enqueueNotification } from "@leadmetrics/queue";

const CRITICAL_QUEUES = [
  "agent__blog-writer",
  "agent__strategy-writer",
  "agent__setup",
  "agent__social-post-writer",
];

for (const queueName of CRITICAL_QUEUES) {
  const events = new QueueEvents(queueName, { connection: redis });
  const queue = new Queue(queueName, { connection: redis });

  events.on("failed", async ({ jobId, failedReason }) => {
    // Only alert on final failure (not intermediate retries).
    const job = await queue.getJob(jobId);
    if (!job || job.attemptsMade < (job.opts.attempts ?? 4)) return;

    await enqueueNotification({
      channel: "email",
      recipient: { email: process.env.ALERTS_EMAIL ?? "ops@leadmetrics.ai" },
      template: "agent_failure",
      data: {
        queueName,
        jobId,
        tenantId: job.data.tenantId,
        reason: failedReason,
        payload: JSON.stringify(job.data).slice(0, 500),
      },
    });
  });
}
```

Also increase removeOnFail to keep evidence longer:

```ts
// packages/queue/src/queues.ts — update defaultJobOptions
removeOnFail: { count: 500, age: 7 * 24 * 60 * 60 }, // keep 500 entries or 7 days
```
3.2 Deduplication Strategy
Two deduplication patterns:
- Static jobId + TTL — for idempotent setup operations ✓
- Timestamp jobId — for revision chains that must always run ✓
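A sketch of how the two patterns map onto BullMQ add() options (queue handles and payload fields are assumptions):

```ts
// Pattern 1 — static jobId: while a job with this id exists, duplicate adds are no-ops.
// removeOnComplete's age option gives the dedup window its TTL.
await setupQueue.add("setup", { tenantId }, { jobId: `setup:${tenantId}` });

// Pattern 2 — timestamp jobId: every revision request yields a distinct, always-run job.
await blogQueue.add(
  "revision",
  { tenantId, activityId },
  { jobId: `revision:${activityId}:${Date.now()}` }
);
```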
Assessment: Well-designed. The two-pattern approach correctly distinguishes idempotent from non-idempotent operations. No action needed.
3.3 Agent Worker Server — Monolithic Process — R-8
All 44 workers run in a single Node.js process. Each holds its own Worker instance sharing the same process memory and event loop.
Strengths: Simple deployment, clean graceful shutdown, maxStalledCount: 2 for Windows cold-start.
Risk (R-8): A memory leak in one worker (e.g., large context accumulating in blog-writer) OOM-kills all 44 workers simultaneously. CPU-bound workers (e.g., website-crawler) can starve I/O-bound workers in the same event loop. The sequential shutdown drains one worker at a time — on a busy server with 5-min Claude jobs, SIGTERM-to-exit could exceed Kubernetes’ terminationGracePeriodSeconds.
Fix (R-8): Split the monolith into two processes grouped by resource profile:
Process A — server-agents-llm (long-running, CPU/token-intensive):
setup, strategy-writer, strategy, activity, blog-writer, blog-faq-writer, social-post-writer, social-post-designer, landing-page-writer, content-repurposer, gbp-post-writer, custom-report-writer, backlink-outreach-writer, seo-optimizer, campaign-brief-writer, linkedin-ads-writer, review-campaign-writer, meta-ads-optimizer, linkedin-ads-optimizer, brand-narrative-analyst

Process B — server-agents-util (short-running, I/O-bound):
keyword-researcher, content-brief-writer, social-calendar-planner, email-writer, google-ads-writer, meta-ads-writer, report-writer, topic-researcher, research-note-writer, review-response-writer, site-auditor, backlink-researcher, website-crawler, ads-analyst, anomaly-detector, all insight workers, ai-visibility-monitor, ai-visibility-seeder, gsc-keywords-snapshot, github-source-sync, newsletter-sender, search-term-classifier, campaign-optimizer-runner, seo-outreach-optimizer, review-campaign-optimizer

This split means a runaway LLM job cannot starve search-sync or notification workers. Separate Docker containers also allow independent scaling (server-agents-llm needs more memory; server-agents-util needs more CPU cores).

To handle the sequential shutdown problem, drain workers in parallel:

```ts
// Replace sequential awaits with a parallel drain
await Promise.all([
  stopBlogWriterWorkers(),
  stopStrategyWriterWorkers(),
  stopActivityWorkers(),
  // … all workers
]);
```
3.4 Job Payload Typing — No Runtime Validation — R-13 (queue variant)
Job payload types are statically typed in packages/queue/src/types.ts. This is good for compile-time safety.
Gap: There is no runtime validation at enqueue or dequeue time. A caller that passes a malformed payload (e.g., tenantId: undefined) gets past TypeScript if the calling code has loose types, and the worker discovers the issue deep into execution — after credits are reserved and AgentRun is created.
Fix: Add Zod schema validation inside each enqueueXxx function:

```ts
// packages/queue/src/schemas.ts (NEW FILE)
import { z } from "zod";

export const ActivityJobDataSchema = z.object({
  tenantId: z.string().uuid(),
  activityId: z.string().uuid(),
  wakeReason: z.enum(["initial", "rejection", "revision"]).optional(),
});

// packages/queue/src/queues.ts — inside enqueueBlogWriter
export async function enqueueBlogWriter(data: ActivityJobData) {
  ActivityJobDataSchema.parse(data); // throws ZodError on invalid payload
  const queue = getQueue("agent__blog-writer");
  // … rest of enqueue logic
}
```

The same Zod schema can be reused as a TypeScript type via z.infer<>, removing the separate interface.
3.5 Event Flow — Excess Redis Connections
agent-events.ts creates its own IORedis connection (lines 52-63), separate from the BullMQ connection managed by packages/queue. Each worker process holds at minimum 2 Redis connections: one for BullMQ and one for event publishing.
Fix: Accept the Redis client as a parameter in publishAgentEvent instead of creating a private connection. Callers pass the same IORedis instance used by BullMQ:

```ts
// packages/agents/src/agent-events.ts
let _publisher: IORedis | null = null;

export function setPublisher(client: IORedis): void {
  _publisher = client;
}

function getPublisher(): IORedis {
  if (!_publisher) {
    // Lazy fallback for tests that don't call setPublisher()
    _publisher = new IORedis(process.env.REDIS_URL ?? "redis://localhost:6379", {
      lazyConnect: true,
      maxRetriesPerRequest: 3,
      enableOfflineQueue: false,
    });
  }
  return _publisher;
}

// apps/servers/agents/src/index.ts — after Redis connects
import { setPublisher } from "@leadmetrics/agents/src/agent-events";
setPublisher(redis); // reuse the same IORedis instance
```

This reduces the agent server from 2N to N+1 Redis connections: one shared publisher plus BullMQ's own per-queue connections, which BullMQ manages itself.
4. Authentication Architecture
4.1 Three Auth Systems in One Platform
| System | Used by | Token | Storage |
|---|---|---|---|
| JWT (Fastify) | API, DM, Manage | 15-min access token + 7-day refresh token | HTTP-only cookie |
| Better Auth | Dashboard user records | Session-based | Session table in PostgreSQL |
| Next.js middleware | All portals | Reads the JWT cookie | No additional storage |
Users are stored in the Better Auth schema (User, Session, Account), but all API access uses JWT tokens issued by the Fastify auth route.
Risk: The two systems can drift. A Better Auth session invalidated server-side still allows API access until the JWT expires (15 min).
Fix: When Better Auth invalidates a session (e.g., password change, logout from all devices), also write the current JWT jti (JWT ID) to a Redis revocation set:

```ts
// apps/api/src/routers/auth.ts — inside the logout/invalidate handler
import { redis } from "../plugins/redis";

// Add jti to the revocation set with a TTL matching token expiry
await redis.setex(`revoked_token:${payload.jti}`, 15 * 60, "1");

// apps/api/src/lib/auth.ts — inside requireTenantUser
const revoked = await redis.exists(`revoked_token:${payload.jti}`);
if (revoked) return apiError(reply, 401, "UNAUTHORIZED", "Token has been revoked.");
```

The Redis key expires automatically when the token would have expired anyway, so there is no unbounded revocation-list growth. This is a low-overhead addition requiring only one Redis call per request.
4.2 Token Claims
```ts
interface AccessTokenPayload {
  sub: string;          // user id
  tenantId?: string;    // absent for super_admin
  role: string;
  appAccess: string[];
}
```

The optional tenantId means every route author must handle the super_admin special case. This is a footgun for new contributors.
Fix: Make the distinction explicit by using two separate token interfaces, one for tenant users and one for super admins:
```ts
// packages/middleware/src/jwt.ts
interface TenantUserPayload {
  sub: string;
  tenantId: string;
  role: "member" | "admin" | "reviewer";
  appAccess: string[];
}

interface SuperAdminPayload {
  sub: string;
  tenantId?: never;
  role: "super_admin";
  appAccess: string[];
}

type AccessTokenPayload = TenantUserPayload | SuperAdminPayload;
```

requireTenantUser() returns TenantUserPayload (TypeScript narrows to a guaranteed tenantId), while requireSuperAdmin() returns SuperAdminPayload. New route authors get compile-time feedback if they mix them up.
4.3 Token Refresh (DM Portal) — R-7
The DM portal middleware performs a synchronous call to POST /auth/v1/refresh on every request when the access token is expired. With a 15-minute token lifetime, a DM reviewer active for 2 hours triggers ~8 refresh calls. Crucially, the refresh is inline in Next.js middleware — it adds the API round-trip latency to every page navigation after expiry.
Risk (R-7): If the API is unreachable (rolling deploy, Redis restart), all DM portal navigations redirect to /login, logging out active users.
Fix (R-7) — two-part:
Part 1: Extend the DM portal access token TTL from 15 min to 60 min. The DM portal has read-heavy usage (review, approve) and lower security sensitivity than the dashboard (no billing, no settings changes). This reduces refresh calls by 4× with no architectural change.

```ts
// apps/api/src/routers/auth.ts — inside the DM login flow
const dmAccessToken = signAccessToken(payload, "1h"); // was "15m"
```

Part 2: Make the middleware resilient to API outages — allow the request through rather than redirecting when the refresh call fails:

```ts
// apps/dm/src/middleware.ts
try {
  const refreshed = await fetch(`${API_URL}/auth/v1/refresh`, { ... });
  if (refreshed.ok) { /* set new cookie */ }
  // If refresh failed with 401/403, redirect to login.
  else if (refreshed.status === 401 || refreshed.status === 403) {
    return NextResponse.redirect(new URL("/login", request.url));
  }
  // For 5xx / network errors, allow the request through.
  // The next API call from the page will return 401 and trigger client-side logout.
} catch {
  // API unreachable — let the request proceed; client-side auth handles 401s.
}
```
4.4 No Token Revocation — R-5
Neither access tokens nor refresh tokens can be revoked before expiry:
- Compromised access token → valid up to 15 min
- Compromised refresh token → valid up to 7 days
- Revoking a user requires waiting for token expiry or rotating JWT_SECRET globally
Fix (R-5): Redis-backed token revocation (same as the §4.1 fix, extended to refresh tokens):

```ts
// packages/middleware/src/jwt.ts — add jti claim at issue time
import { v4 as uuidv4 } from "uuid";

export function signAccessToken(payload: AccessTokenPayload, expiresIn = "15m") {
  return jwt.sign({ ...payload, jti: uuidv4() }, process.env.JWT_SECRET!, { expiresIn });
}

export function signRefreshToken(userId: string, expiresIn = "7d") {
  return jwt.sign({ sub: userId, jti: uuidv4() }, process.env.JWT_REFRESH_SECRET!, { expiresIn });
}

// apps/api/src/routers/auth.ts — revoke on logout
export const logoutRoute: FastifyPluginAsync = async (fastify) => {
  fastify.post("/logout", async (req, reply) => {
    const payload = await requireTenantUser(req, reply);
    const redis = fastify.redis;

    // Revoke both tokens
    await redis.setex(`revoked:${payload.jti}`, 15 * 60, "1"); // access token TTL
    const refreshCookie = req.cookies["refresh_token"];
    if (refreshCookie) {
      const rp = jwt.decode(refreshCookie) as any;
      if (rp?.jti) await redis.setex(`revoked:${rp.jti}`, 7 * 24 * 60 * 60, "1");
    }

    reply.clearCookie("access_token").clearCookie("refresh_token");
    return reply.send({ ok: true });
  });
};

// apps/api/src/lib/auth.ts — check revocation in requireTenantUser
const revoked = await fastify.redis.exists(`revoked:${payload.jti}`);
if (revoked) return apiError(reply, 401, "UNAUTHORIZED", "Token revoked.");
```

For a “revoke all sessions” event (password change, security incident), store a per-user revoke_before timestamp and reject all tokens issued before that time — no need to enumerate every active token.
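A sketch of that variant (key name and TTL are assumptions):

```ts
// Sketch — "revoke all sessions" via a per-user revoke_before timestamp.
// On password change / security incident:
await redis.set(`revoke_before:${userId}`, Date.now(), "EX", 7 * 24 * 60 * 60);

// In requireTenantUser, alongside the jti check (JWT iat is in seconds):
const revokeBefore = await redis.get(`revoke_before:${payload.sub}`);
if (revokeBefore && payload.iat * 1000 < Number(revokeBefore)) {
  return apiError(reply, 401, "UNAUTHORIZED", "Session revoked.");
}
```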
5. Database Layer
5.1 Schema Scale
The Prisma schema (packages/db/prisma/schema.prisma) is 3,058 lines covering 100+ models, all in one shared PostgreSQL database. Tenant isolation uses tenantId FKs on every tenant-scoped table.
Assessment: Monolithic schema is correct at this scale. Separate DB per tenant would complicate migrations. No action needed here.
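For illustration, a hypothetical tenant-scoped model showing the convention (field names are assumptions):

```prisma
// Hypothetical tenant-scoped model — every such table carries the tenantId FK.
model BlogPost {
  id        String   @id @default(uuid())
  tenantId  String
  tenant    Tenant   @relation(fields: [tenantId], references: [id])
  title     String
  status    String
  createdAt DateTime @default(now())

  @@index([tenantId]) // keeps per-tenant row filtering cheap
}
```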
5.2 Client Singleton
```ts
// packages/db/src/index.ts
const globalForPrisma = globalThis as unknown as { prisma: PrismaClient };
export const db = globalForPrisma.prisma || new PrismaClient({ ... });
if (process.env.NODE_ENV !== "production") globalForPrisma.prisma = db;
```

This is the standard Next.js hot-reload singleton pattern. Correct — no action needed.
Gap — R-9: No explicit Prisma connection pool configuration. Prisma’s default pool size is num_physical_cpus × 2 + 1. The agents server running 44 workers and the API server each create a Prisma client; concurrent DB queries can exhaust the pool, causing query queuing under load.
Fix (R-9): Set connection pool size explicitly in DATABASE_URL for each server:

```bash
# .env / deployment env vars

# API server (handles HTTP requests — moderate concurrency)
DATABASE_URL="postgresql://user:pass@host/db?connection_limit=20&pool_timeout=10"

# Agent server (44 workers, mostly async — needs a larger pool)
DATABASE_URL="postgresql://user:pass@host/db?connection_limit=40&pool_timeout=30"

# Next.js portals (server components + actions — lower concurrency)
DATABASE_URL="postgresql://user:pass@host/db?connection_limit=10&pool_timeout=10"
```

Monitor pool wait time via Prisma’s query_duration metric. If pool_timeout errors appear in logs, increase connection_limit or add a PgBouncer connection pooler in front of PostgreSQL.
5.3 Dual Database Strategy
| Database | Purpose | Package |
|---|---|---|
| PostgreSQL | All application data | @leadmetrics/db |
| MongoDB | Audit logs, activity logs | @leadmetrics/nosqldb |
Using MongoDB exclusively for append-only logs is a reasonable separation.
Risk (R-4): agent-events.ts uses require("@leadmetrics/db") dynamically inside async functions (lines 92, 153). This pattern:
- Bypasses TypeScript’s module resolution — no compile-time error if the import path changes
- Fires at runtime on the hot path of every agent job
- Is silently swallowed by an empty catch {} — so if Prisma client generation is missing, agent run records simply stop persisting with no developer-visible error
Fix (R-4): Break the circular dependency structurally by accepting the Prisma client via a one-time setup call rather than a lazy require:

```ts
// packages/agents/src/agent-events.ts
import type { PrismaClient } from "@leadmetrics/db";

let _db: PrismaClient | null = null;

export function initAgentEvents(db: PrismaClient): void {
  _db = db;
}

export async function publishAgentEvent(event: AgentEvent): Promise<void> {
  // ... Redis publish logic (unchanged) ...

  if (_db) {
    // No dynamic require — static import, TypeScript-safe, visible to ESLint
    if (event.type === "agent:started") {
      await _db.agentRun.create({ data: { ... } });
    }
    // ...
  } else {
    // Loud, developer-visible warning if initAgentEvents was never called
    console.warn("[agent-events] DB not initialized — run records will not be persisted");
  }
}

// apps/servers/agents/src/index.ts
import { db } from "@leadmetrics/db";
import { initAgentEvents } from "@leadmetrics/agents/src/agent-events";

initAgentEvents(db); // called once at server startup
```

This makes the missing-Prisma case visible at startup rather than silently failing per-job.
6. Multi-Tenancy
6.1 Isolation Model
All tenant data is row-filtered on a shared PostgreSQL database. Every tenant-scoped table carries a tenantId FK. Routes extract tenantId from the JWT claim and pass it to every Prisma query.
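Concretely, the discipline-based pattern every handler follows today looks like this (names representative):

```ts
// Representative handler — the manual tenantId filter that §6.2 proposes to make structural.
const { tenantId } = await requireTenantUser(req, reply);

const posts = await db.blogPost.findMany({
  // Omitting tenantId here would silently return every tenant's rows (see R-2).
  where: { tenantId, status: "published" },
});
```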
6.2 No ORM-Level Isolation Enforcement — R-2
Prisma does not enforce row-level isolation natively. Any query without an explicit tenantId filter returns data from all tenants. Current mitigations are entirely developer-discipline:
- requireTenantUser() guard (opt-in — can be forgotten)
- Code review (human — not guaranteed at scale)
- Integration tests (cover the happy path — do not test cross-tenant leakage)
Risk (R-2): This is a high-severity architectural risk. A single missing where: { tenantId } silently exposes every tenant’s data to the request caller.
Fix (R-2): Create a Prisma Client Extension that appends tenantId to every query on tenant-scoped models:

```ts
// packages/db/src/tenant-client.ts (NEW FILE)
import { db } from "./index";

// Models that are NOT tenant-scoped (global tables — no tenantId filter).
// Note: in client extensions, `model` is the PascalCase model name.
const GLOBAL_MODELS = new Set(["Tenant", "User", "AgentConfig", "Skill", "RepurposingTemplate"]);

export function createTenantClient(tenantId: string) {
  return db.$extends({
    query: {
      $allModels: {
        async $allOperations({ model, operation, args, query }) {
          if (GLOBAL_MODELS.has(model)) return query(args);

          // Inject tenantId into WHERE for reads and bulk writes.
          if (["findMany", "findFirst", "count", "aggregate"].includes(operation)) {
            (args as any).where = { ...(args as any).where, tenantId };
          }
          if (operation === "create") {
            (args as any).data = { ...(args as any).data, tenantId };
          }
          if (["updateMany", "deleteMany"].includes(operation)) {
            (args as any).where = { ...(args as any).where, tenantId };
          }
          // findUnique/update/delete take unique-field WHEREs; injecting tenantId
          // there requires a compound unique (e.g., @@unique([id, tenantId])) or
          // remapping those calls to findFirst/updateMany/deleteMany.
          return query(args);
        },
      },
    },
  });
}

// Export type for use in function signatures
export type TenantClient = ReturnType<typeof createTenantClient>;
```

Migration strategy — roll out incrementally per router:

```ts
// apps/api/src/routers/blog.ts
const { tenantId } = await requireTenantUser(req, reply);
const tdb = createTenantClient(tenantId); // use tdb instead of db for all queries

const post = await tdb.blogPost.findFirst({ where: { id: params.id } });
// tenantId is injected automatically — cross-tenant leakage is now structurally impossible
```

Super-admin routes continue to use the raw db client since they need cross-tenant access.
6.3 Super Admin Cross-Tenant Access
Super-admin routes access all tenant data by explicit tenantId URL parameter, not JWT claim. Protected by requireSuperAdmin(). This is correct — no action needed.
7. Real-Time Architecture
7.1 Event Delivery Path
```
Worker process                        API process                        Browser
     │                                    │                                 │
     ├─ PUBLISH agent_events:T ──►        │                                 │
     │        [Redis pub/sub]             ├─ psubscribe()                   │
     │                                    │   (single sub conn)             │
     │                                    ├─ emit "agent:event" ────────►   │ Socket.IO
     │                                    │   to tenant:T room              │
     ├─ SETEX agent_live:T:role ──►       │   (Redis key, 30m)              │
     └─ PUBLISH notifications:T ──►       ├─ emit "notification:new" ──►    │
```

The Redis-backed Socket.IO adapter enables horizontal scaling correctly. The agent_live:* Redis cache provides page-reload recovery for in-progress jobs. Both are sound design choices.
7.2 Redis Connection Count
The API process holds 4 Redis connections: redisPlugin, Socket.IO pubClient, subClient, and eventsSub. Four is a reasonable count, and all are correctly closed in onClose.
7.3 Socket.IO Room Authentication — R-11
The Socket.IO namespace join logic in socket/middleware/auth.ts was not fully audited.
Risk (R-11): If the JWT is verified at handshake time but room join is not validated against the token’s tenantId, a client could call socket.join("tenant:other-tenant-id") and receive another tenant’s agent events.
Fix (R-11): Enforce room membership at join time in the namespace handler:

```ts
// apps/api/src/socket/namespaces/tenant.ts
export function registerTenantNamespace(namespace: Namespace, pubClient: IORedis) {
  namespace.use(async (socket, next) => {
    // Verify the JWT at handshake time
    const token =
      socket.handshake.auth?.token ??
      socket.handshake.headers?.authorization?.split(" ")[1];
    if (!token) return next(new Error("Authentication required"));
    try {
      const payload = verifyAccessToken(token);
      socket.data.tenantId = payload.tenantId;
      socket.data.role = payload.role;
      next();
    } catch {
      next(new Error("Invalid token"));
    }
  });

  namespace.on("connection", (socket) => {
    const { tenantId, role } = socket.data;

    if (tenantId) {
      // Only auto-join the room matching the token's tenantId
      socket.join(`tenant:${tenantId}`);
    }
    if (role === "super_admin" || role === "reviewer") {
      socket.join("dm:live");
    }

    // Reject explicit room-join requests for other tenants
    socket.on("join:room", (roomName: string) => {
      if (roomName === `tenant:${tenantId}`) socket.join(roomName);
      else socket.emit("error", { code: "FORBIDDEN", message: "Cannot join this room" });
    });
  });
}
```
7.4 Reconnection and Client-Side Resilience
Redis pub/sub is fire-and-forget — events published while a client is disconnected are lost. Notifications are persisted to the Notification table (clients can fetch on reconnect). The agent_live:* cache (30min TTL) recovers in-progress job state.
Gap: No explicit Socket.IO reconnection handler. A client that reconnects mid-job must poll agent_live:* or the AgentRun table to recover current state.
Fix: Add a client-side handler on the Socket.IO connect event (which also fires on every reconnect) that polls the live progress key:

```ts
// apps/dashboard/src/hooks/useAgentEvents.ts
socket.on("connect", async () => {
  // On any (re)connect, poll the live progress cache for current agent state
  const live = await fetch("/api/agent/live-status").then((r) => r.json());
  if (live.running) setAgentProgress(live);
});
```

Add a matching API route that reads agent_live:{tenantId}:{role} from Redis and returns the cached progress object.
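A sketch of such a route, assuming the key format from §7.1 and the fastify.redis decoration used elsewhere in this review (route path is an assumption):

```ts
// Sketch — route path and response shape are assumptions.
fastify.get("/tenant/v1/agent/live-status", async (req, reply) => {
  const user = await requireTenantUser(req, reply);
  if (!user) return;

  // Workers SETEX agent_live:{tenantId}:{role} with a 30-min TTL (§7.1).
  // KEYS is acceptable for a small per-tenant keyspace; prefer SCAN at scale.
  const keys = await fastify.redis.keys(`agent_live:${user.tenantId}:*`);
  if (keys.length === 0) return reply.send({ running: false });

  const values = await fastify.redis.mget(keys);
  return reply.send({
    running: true,
    jobs: values.filter((v): v is string => v !== null).map((v) => JSON.parse(v)),
  });
});
```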
8. Package Boundaries and Dependency Graph
8.1 Critical Boundary: Worker Code in Next.js
packages/agents/src/index.ts deliberately does not bulk-export worker start/stop functions to prevent BullMQ from entering the Next.js webpack bundle. This is correctly implemented and must be preserved.
Risk: A developer who naively adds a worker export and imports it in a Next.js server action ships BullMQ + IORedis to the browser. There is no automated guard against this.
Fix: Add a bundle-size CI check and an ESLint rule:

```js
// eslint.config.mjs
"no-restricted-imports": ["error", {
  paths: [{
    name: "@leadmetrics/agents/src/workers",
    message: "Worker files must not be imported in Next.js apps — they pull BullMQ into the browser bundle.",
  }],
}]
```

Additionally, add "server-only" at the top of each worker file so Next.js throws a build error if it is imported from a client context:

```ts
// packages/agents/src/workers/blog-writer.worker.ts
import "server-only"; // throws at build time if imported into a browser bundle
```
8.2 Dynamic Require Pattern — R-19
Two files use require("@leadmetrics/db") dynamically:
packages/agents/src/agent-events.ts:92,153packages/agents/src/workers/insights/insight-worker-base.ts:33
Risk (R-19): Dynamic requires bypass TypeScript analysis, ESLint import resolution, and tree-shaking. Module path renames cause silent runtime failures.
Fix (R-19): Resolved as part of the R-4 fix (§5.3). Once initAgentEvents(db) is introduced, the dynamic require in agent-events.ts is eliminated entirely. For insight-worker-base.ts, import db statically at the top — the circular dependency that forced the dynamic require is broken once agent-events.ts accepts db as a parameter:

```ts
// packages/agents/src/workers/insights/insight-worker-base.ts
import { db } from "@leadmetrics/db"; // static import, no dynamic require
```
8.3 Provider Package Proliferation — R-20
There are 20+ provider packages under packages/providers/*. Each wraps a third-party SDK. Isolation is clean, but build overhead is real — Turborepo must build all 20+ packages before any dependent app.
Risk (R-20): A provider imported at the top level of a shared package that is also used by Next.js apps could pull large SDK bundles (e.g., googleapis, @aws-sdk/*) into the browser bundle.
Fix (R-20): Add "server-only" imports to all provider packages:

```ts
// packages/providers/google/src/index.ts
import "server-only"; // Next.js will throw at build time if this is imported client-side
```

Additionally, run a bundle analyzer in CI to catch regressions:

```bash
# package.json (dashboard app)
"analyze": "ANALYZE=true next build"
```

```js
// apps/dashboard/next.config.mjs
import withBundleAnalyzer from "@next/bundle-analyzer";
export default withBundleAnalyzer({ enabled: process.env.ANALYZE === "true" })({ ... });
```

Run pnpm analyze in CI on PRs that touch packages/providers/* or packages/common.
9. Observability and Error Handling
9.1 Logging — No Correlation IDs — R-12
Structured JSON logging is in place via @leadmetrics/logger across all services. Good.
Gap (R-12): No request correlation IDs. A user request enters the API, enqueues a BullMQ job, and is processed by a worker across three separate log streams. Correlating these by timestamp + tenantId + runId is manual and error-prone.
Fix (R-12): Generate a requestId at the API boundary and thread it through the job payload:

```ts
// apps/api/src/index.ts — honor an inbound X-Request-Id or generate one.
// genReqId runs at request creation, so the id is bound into the request logger.
import { randomUUID } from "crypto";

const fastify = Fastify({
  genReqId: (req) => (req.headers["x-request-id"] as string) ?? randomUUID(),
});

fastify.addHook("onSend", async (req, reply) => {
  reply.header("x-request-id", req.id);
});

// packages/queue/src/types.ts — add to the base job data type
interface BaseJobData {
  tenantId: string;
  requestId?: string; // threaded from the API request
}

// In route handlers — pass requestId when enqueueing
await enqueueBlogWriter({ tenantId, activityId, requestId: req.id });

// In workers — include it in all log calls
log.info({ tenantId, activityId, requestId: data.requestId }, "Blog writer started");
```

With requestId in every log line across API + queue + worker, a single grep requestId=abc-123 correlates the full lifecycle.
9.2 No Distributed Tracing — R-15
Individual logs are structured but there is no cross-service trace graph.
Fix (R-15): Add OpenTelemetry SDK instrumentation:

```bash
pnpm add @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http
```

```ts
// apps/api/src/tracing.ts (NEW FILE — import before all other modules)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "leadmetrics-api",
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

The auto-instrumentations cover Fastify, Prisma, IORedis, and HTTP calls automatically. Add the same setup to each server process. Use Jaeger, Grafana Tempo, or Datadog APM as the trace backend.
9.3 No Metrics Collection — R-16
There is no way to measure API latency percentiles, queue depth, credit consumption rate, or error rates.
Fix (R-16): Add Prometheus metrics via fastify-metrics:

```ts
// apps/api/src/index.ts
import metrics from "fastify-metrics";

await fastify.register(metrics, {
  endpoint: "/metrics", // scraped by Prometheus
  defaultMetrics: { enabled: true },
  routeMetrics: {
    enabled: true,
    groupStatusCodes: false,
    overrides: { histogram: { buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] } },
  },
});

// Custom business metrics (prom-client Counter is a class — note the `new`)
const creditCounter = new fastify.metrics.client.Counter({
  name: "leadmetrics_credits_consumed_total",
  help: "Total credits consumed",
  labelNames: ["tenantId", "creditType"],
});

// Call from consumeCredits()
creditCounter.inc({ tenantId, creditType });
```

Add BullMQ queue-depth metrics by polling queue.getJobCounts() on an interval and exposing them via a /metrics/queues endpoint (see the sketch below). Visualize in Grafana.
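A sketch of that queue-depth polling — the watched queue list, metric name, and 30s interval are assumptions:

```ts
// Sketch: poll BullMQ job counts into a Prometheus gauge.
import { Queue } from "bullmq";

const queueDepth = new fastify.metrics.client.Gauge({
  name: "leadmetrics_queue_jobs",
  help: "BullMQ job counts by queue and state",
  labelNames: ["queue", "state"],
});

const watched = ["agent__blog-writer", "agent__strategy-writer"].map(
  (name) => new Queue(name, { connection: redis })
);

setInterval(async () => {
  for (const queue of watched) {
    const counts = await queue.getJobCounts("waiting", "active", "failed", "delayed");
    for (const [state, count] of Object.entries(counts)) {
      queueDepth.set({ queue: queue.name, state }, count);
    }
  }
}, 30_000);
```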
9.4 Health Checks — Incomplete Readiness — R-10
The /health/ready endpoint only pings PostgreSQL. Redis, MongoDB, and Typesense are not checked.
Risk (R-10): A deployment where Redis is down but PostgreSQL is healthy passes the readiness check. The API reports ready but real-time events fail silently and all job enqueueing fails.
Fix (R-10): Expand the readiness check to cover all critical dependencies:
```ts
// apps/api/src/index.ts — replace the /health/ready handler
fastify.get("/health/ready", async (_req, reply) => {
  const checks: Record<string, "ok" | "error"> = {};

  // PostgreSQL
  try { await db.$queryRaw`SELECT 1`; checks.postgres = "ok"; }
  catch { checks.postgres = "error"; }

  // Redis
  try { await fastify.redis.ping(); checks.redis = "ok"; }
  catch { checks.redis = "error"; }

  // MongoDB
  try { await mongoose.connection.db.admin().ping(); checks.mongo = "ok"; }
  catch { checks.mongo = "error"; }

  // Typesense (optional — not on the critical path for most requests)
  try { await typesenseClient.health.retrieve(); checks.typesense = "ok"; }
  catch { checks.typesense = "error"; }

  const healthy = Object.values(checks).every((v) => v === "ok");
  return reply.status(healthy ? 200 : 503).send({ status: healthy ? "ok" : "degraded", checks });
});
```
9.5 Job Failure Alerting — covered under R-6 (§3.1)
9.6 Unhandled API Process Errors
The agent server registers unhandledRejection and uncaughtException handlers (correct). The API server relies solely on Fastify’s route-level error handler.
Risk: Since Node.js 15, an unhandled promise rejection terminates the process by default. An uncaught rejection in the API outside a route handler (e.g., in plugin startup or a background interval) therefore crashes the whole API with no structured log of the cause.
Fix: Add the same global handlers to the API server:
```ts
// apps/api/src/index.ts — inside bootstrap(), after fastify.listen()
process.on("unhandledRejection", (reason) => {
  fastify.log.error({ reason }, "Unhandled promise rejection in API process");
  // Do NOT exit — Fastify's route handlers catch route-level errors;
  // only exit for truly unrecoverable state.
});

process.on("uncaughtException", (err) => {
  fastify.log.error({ err }, "Uncaught exception in API process — shutting down");
  process.exit(1);
});
```
10. Security
10.1 Transport and Headers
@fastify/helmet is registered with CSP disabled (correct for a JSON API) and all other defaults active: HSTS, X-Frame-Options, X-Content-Type-Options, Referrer-Policy. CORS is env-driven. This is appropriate.
Risk: The origin === "null" allowance in CORS is intentional for OAuth (LinkedIn), but any HTML page loaded from file:// or a sandboxed iframe can send credentialed requests with null origin. In production (HTTPS-only), this is low risk.
Fix (documentation): Add a code comment and document this as an explicit trade-off:
```ts
// apps/api/src/lib/fastify-setup.ts
if (!origin || origin === "null" || allowedOrigins.has(origin)) {
  // "null" origin: OAuth providers (LinkedIn, Google) redirect back to the app
  // with Origin: null when navigating cross-scheme (e.g., HTTPS → HTTP).
  // In production this is safe because the browser only sends null-origin
  // requests from HTTPS origins and our cookies are SameSite=Lax.
  cb(null, true);
}
```
10.2 Secret Management — R-17
Secrets are managed exclusively via environment variables with no vault, no rotation mechanism, and no secret scanning in CI.
Risk (R-17): A leaked DATABASE_URL + encryption key would expose all OAuth tokens stored in the Account model. A leaked JWT_SECRET allows forging arbitrary tokens. A leaked INTERNAL_API_SECRET allows calling billing endpoints from outside.
Fix (R-17) — three layers:
Layer 1 — Secret scanning in CI (immediate, zero code change):

```yaml
# .github/workflows/security.yml
- name: Scan for secrets
  uses: trufflesecurity/trufflehog@main
  with:
    path: ./
    base: ${{ github.event.repository.default_branch }}
    head: HEAD
    extra_args: --only-verified
```

Layer 2 — Rotation-ready secret naming (near-term): Add a JWT_SECRET_VERSION env var and sign tokens with ${JWT_SECRET_VERSION}:${JWT_SECRET}. Verification attempts the current version and falls back to JWT_SECRET_PREV for a 24h rotation window. This allows zero-downtime secret rotation without invalidating all live tokens.
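A sketch of the version-prefixed verification described in Layer 2 (the helper name is an assumption; env vars are from the text):

```ts
// Sketch — version-prefixed signing secrets with a rotation fallback window.
import jwt from "jsonwebtoken";

const CURRENT = `${process.env.JWT_SECRET_VERSION}:${process.env.JWT_SECRET}`;
// JWT_SECRET_PREV holds the previous version-prefixed secret during the 24h window.
const PREVIOUS = process.env.JWT_SECRET_PREV;

export function verifyWithRotation(token: string) {
  try {
    return jwt.verify(token, CURRENT);
  } catch (err) {
    if (PREVIOUS) return jwt.verify(token, PREVIOUS); // tokens signed pre-rotation
    throw err;
  }
}
```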
Layer 3 — Secrets manager (medium-term): Integrate DigitalOcean Secrets or HashiCorp Vault:

```ts
// packages/config/src/secrets.ts
export async function loadSecrets(): Promise<void> {
  if (process.env.VAULT_ADDR) {
    const vault = new VaultClient({ endpoint: process.env.VAULT_ADDR, token: process.env.VAULT_TOKEN });
    const secrets = await vault.read("secret/leadmetrics");
    process.env.JWT_SECRET = secrets.data.jwt_secret;
    process.env.DATABASE_URL = secrets.data.database_url;
    // … all other secrets
  }
  // Otherwise fall back to env vars (dev / staging)
}
```
10.3 Input Validation — R-13
Not all routes define Fastify JSON Schemas. Routes without schemas accept any input shape.
Fix (R-13): Enforce schema presence via ESLint and add a Fastify plugin that flags unschema’d routes:

```ts
// apps/api/src/plugins/strict-schema.ts (NEW FILE)
import fp from "fastify-plugin";

export default fp(async (fastify) => {
  fastify.addHook("preValidation", async (req) => {
    // In production, log a warning for any route without a schema so we know to add one
    if (!req.routeOptions.schema && process.env.NODE_ENV === "production") {
      req.log.warn({ url: req.url, method: req.method }, "Route has no schema — add one");
    }
  });
});
```

For SQL injection via $queryRaw, enforce tagged-template usage via ESLint (a tagged template is a TaggedTemplateExpression node, so any plain CallExpression form indicates string interpolation):

```js
// eslint.config.mjs
"no-restricted-syntax": ["error", {
  selector: "CallExpression[callee.property.name='$queryRaw']",
  message: "$queryRaw must use tagged template literals (db.$queryRaw`...`), not string interpolation.",
}]
```
10.4 Rate Limiting — Auth Endpoints Unprotected — R-14
@fastify/rate-limit applies a global 300 req/min limit. Auth endpoints (/auth/v1/login, /auth/v1/register, /auth/v1/refresh) share this limit with all other traffic — an attacker can spend the entire 300 req/min budget on login attempts without tripping any auth-specific throttle.
Fix (R-14): Register tighter per-route rate limits on the auth endpoints:

```ts
// apps/api/src/routers/auth.ts
fastify.post("/login", {
  config: {
    rateLimit: {
      max: 10, // 10 attempts per window
      timeWindow: "1 minute",
      keyGenerator: (req) => `login:${req.ip}`, // per-IP limit
      errorResponseBuilder: () => ({
        error: { code: "RATE_LIMITED", message: "Too many login attempts. Try again in 1 minute." },
      }),
    },
  },
  handler: loginHandler,
});

fastify.post("/register", {
  config: { rateLimit: { max: 5, timeWindow: "1 minute" } },
  handler: registerHandler,
});

fastify.post("/refresh", {
  config: { rateLimit: { max: 30, timeWindow: "1 minute" } }, // higher — the DM portal refreshes every 15 min
  handler: refreshHandler,
});
```

Also add account lockout after N consecutive failures by storing a per-email failure counter in Redis:

```ts
const key = `login_fails:${email}`;
const fails = await redis.incr(key);
await redis.expire(key, 15 * 60); // reset the counter after 15 min
if (fails > 10) return apiError(reply, 429, "ACCOUNT_LOCKED", "Account temporarily locked.");
```
11. Build and CI/CD
11.1 Turborepo Configuration — R-21
turbo.json defines build, dev, lint, typecheck, and test variants. The build task correctly uses dependsOn: ["^build"].
Gap (R-21): test:unit has dependsOn: ["^build"], which forces a full monorepo build before unit tests can run. For a developer iterating on a single worker file, this adds minutes of unnecessary build time.
Fix (R-21):

```json
// turbo.json
{
  "tasks": {
    "test:unit": {
      "dependsOn": [], // unit tests have no build deps — pure function tests
      "cache": false
    },
    "test:integration": {
      "dependsOn": ["^build"], // integration tests need compiled packages
      "cache": false
    },
    "clean": { "cache": false }
  }
}
```

Add a clean script to each package’s package.json:

```json
"scripts": { "clean": "rm -rf .next dist .turbo" }
```

Then turbo run clean clears all build artifacts across the monorepo in one command.
11.2 Test Architecture
Four tiers: test:unit (Vitest), test:integration (Vitest + test DB), test:e2e (Playwright + full stack), test:coverage. The distinct tiers are good practice.
Gap: No end-to-end test for the BullMQ job lifecycle: enqueue → worker picks up → output saved → event published. Workers are unit-tested for pure functions but the BullMQ lifecycle itself (retry, backoff, deduplication, failure) is not exercised in CI.
Fix: Add a “worker integration” test tier that spins up a real BullMQ worker against the test Redis instance:
```ts
// packages/agents/src/workers/__tests__/blog-writer.integration.test.ts
import { describe, it, expect, beforeAll, afterAll } from "vitest";
import { Queue, QueueEvents, Worker } from "bullmq";
import { processBlogWriterJob } from "../blog-writer.worker";
// testRedis, TEST_TENANT, TEST_ACTIVITY, and db are assumed test fixtures.

describe("blog-writer BullMQ lifecycle", () => {
  let queue: Queue;
  let worker: Worker;
  let events: QueueEvents;

  beforeAll(async () => {
    queue = new Queue("agent__blog-writer", { connection: testRedis });
    worker = new Worker("agent__blog-writer", processBlogWriterJob, { connection: testRedis });
    events = new QueueEvents("agent__blog-writer", { connection: testRedis });
    await events.waitUntilReady();
  });

  afterAll(async () => {
    await worker.close();
    await events.close();
    await queue.close();
  });

  it("completes a job and writes BlogPost to DB", async () => {
    const job = await queue.add("test", { tenantId: TEST_TENANT, activityId: TEST_ACTIVITY });
    await job.waitUntilFinished(events);

    const post = await db.blogPost.findFirst({ where: { activityId: TEST_ACTIVITY } });
    expect(post?.status).toBe("dm_review");
  });
});
```
12. Summary: Strengths and Risks
Strengths
| # | Strength |
|---|---|
| S-1 | Clear domain-driven router organization — versioned prefixes, consistent naming |
| S-2 | BullMQ queue-per-role design — correct tenant isolation via payload, not queue name |
| S-3 | Deliberate package boundaries — agents package prevents worker code from entering Next.js bundles |
| S-4 | Centralized error handler — uniform { error: { code, message } } shape across all routes |
| S-5 | Redis-backed Socket.IO — multi-instance-safe real-time events |
| S-6 | Live progress caching — agent_live:* Redis key survives page reload |
| S-7 | Graceful shutdown — agent server drains all workers before exit |
| S-8 | Dual deduplication strategy — static jobId for idempotent ops, timestamp jobId for revisions |
| S-9 | Structured logging — JSON logs with service field across all processes |
| S-10 | Split health checks — liveness + readiness endpoints with DB ping |
Risk Register
| ID | Risk | Severity | Location | Fix section |
|---|---|---|---|---|
| R-1 | No default-deny auth — forgotten guard = public endpoint | 🔴 High | All route handlers | §2.3 |
| R-2 | No ORM-level tenant isolation — missing where: { tenantId } = data leak | 🔴 High | All Prisma queries | §6.2 |
| R-3 | app.ts vs index.ts router drift — missing routes 404 silently in tests | 🔴 High | apps/api/src/app.ts | §2.2 |
| R-4 | Dynamic require swallows Prisma errors — agent run records silently drop | 🔴 High | packages/agents/src/agent-events.ts:92 | §5.3 |
| R-5 | No token revocation — compromised refresh token valid 7 days | 🔴 High | JWT auth system | §4.4 |
| R-6 | No DLQ / permanent failure alerting — job failures are silent | 🟡 Medium | packages/queue/src/queues.ts | §3.1 |
| R-7 | DM portal synchronous token refresh — API outage logs out all DM users | 🟡 Medium | apps/dm/src/middleware.ts | §4.3 |
| R-8 | Single agent process for 44 workers — one OOM kills everything | 🟡 Medium | apps/servers/agents/src/index.ts | §3.3 |
| R-9 | No Prisma connection pool config — default pool queues under load | 🟡 Medium | packages/db/src/index.ts | §5.2 |
| R-10 | Readiness check misses Redis/MongoDB — partial outage reports healthy | 🟡 Medium | apps/api/src/index.ts:191 | §9.4 |
| R-11 | Socket.IO room join auth not audited — potential cross-tenant event leakage | 🟡 Medium | apps/api/src/socket/ | §7.3 |
| R-12 | No request correlation IDs — impossible to trace request across API + worker | 🟡 Medium | Entire stack | §9.1 |
| R-13 | Missing input schemas on some routes — arbitrary shapes reach the DB | 🟡 Medium | apps/api/src/routers/* | §10.3 |
| R-14 | Auth rate limit too permissive — brute-force on /auth/v1/login not throttled | 🟡 Medium | apps/api/src/plugins/rate-limit.ts | §10.4 |
| R-15 | No distributed tracing (OpenTelemetry) | 🟢 Low | All services | §9.2 |
| R-16 | No metrics collection (Prometheus) | 🟢 Low | All services | §9.3 |
| R-17 | No secret rotation strategy or scanning | 🟢 Low | Ops/infra | §10.2 |
| R-18 | Rate-limit 429 has different error shape than other API errors | 🟢 Low | lib/fastify-setup.ts | §2.4 |
| R-19 | Dynamic require in agent-events bypasses TypeScript analysis | 🟢 Low | agent-events.ts | §8.2 |
| R-20 | Provider packages may leak into Next.js bundles — no bundle analysis in CI | 🟢 Low | packages/providers/* | §8.3 |
| R-21 | test:unit depends on full build — slow local iteration | 🟢 Low | turbo.json | §11.1 |
13. Prioritized Fix Roadmap
Sprint 1 — Critical (block a security incident)
| Task | Risk | Effort |
|---|---|---|
| Extract registerRouters() shared function for app.ts + index.ts | R-3 | 1h |
| Add preHandler auth guards at router-plugin level | R-1 | 2h |
| Audit Socket.IO room join and enforce tenantId validation | R-11 | 2h |
| Add jti + Redis revocation to JWT sign/verify | R-5 | 3h |
| Replace dynamic require in agent-events.ts with initAgentEvents(db) | R-4 | 2h |
Sprint 2 — High Impact (prevent data leakage and silent failures)
| Task | Risk | Effort |
|---|---|---|
| Implement createTenantClient() Prisma extension, roll out to dm/ routes first | R-2 | 1 day |
| Extend DM access token TTL to 60 min + make middleware resilient to API outage | R-7 | 2h |
| Add per-route rate limits for /auth/v1/login, /register, /refresh | R-14 | 1h |
| Expand /health/ready to check Redis + MongoDB + Typesense | R-10 | 1h |
| Add QueueEvents failed listener with DM-team email notification | R-6 | 2h |
| Add requestId to all log lines and BullMQ job payloads | R-12 | 3h |
Sprint 3 — Medium Term (scalability and observability)
| Task | Risk | Effort |
|---|---|---|
| Configure explicit Prisma pool sizes per service | R-9 | 30min |
| Split agent server into server-agents-llm + server-agents-util | R-8 | 1 day |
| Add Zod validation to all enqueueXxx call sites | R-13 (queue) | 3h |
| Add OpenTelemetry auto-instrumentation to API + agent server | R-15 | 1 day |
| Add Prometheus metrics (fastify-metrics + queue depth polling) | R-16 | 1 day |
Sprint 4 — Polish (reduce developer friction)
| Task | Risk | Effort |
|---|---|---|
| Normalize rate-limit 429 response shape in error handler | R-18 | 30min |
Add "server-only" to all provider packages | R-20 | 1h |
Add truffleHog secret scanning to CI | R-17 | 30min |
Remove dependsOn: ["^build"] from test:unit in turbo.json | R-21 | 5min |
| Add BullMQ lifecycle integration test | §11.2 | 3h |
All file references verified against the live codebase as of 2026-04-25.