API Gateway
Status: [To Build] · Pattern: Gateway layer within Fastify (Option A)
All requests to the Fastify API pass through a gateway layer before reaching any surface router. The gateway owns every cross-cutting concern — authentication, tenant resolution, rate limiting, throttling, request logging, and audit — so that none of those concerns leak into the individual surface routers.
Why a Gateway Layer
Previously, auth, rate limiting, and logging were scattered across a shared common/ module imported by each router. That approach had three problems:
- Cross-cutting logic duplicated or inconsistently applied across surfaces
- Adding a new concern (e.g. audit logging) required touching every router
- No single place to enforce policy changes globally
The gateway layer is a Fastify plugin registered before all routers. Every request — regardless of surface — passes through the same ordered chain of hooks. Routers only see a fully-authenticated, rate-checked, logged request.
Request Lifecycle
Every request flows through this chain in order:
Inbound request (from Traefik)
                    │
                    ▼
┌─────────────────────────────────────────────────────────┐
│  GATEWAY LAYER                                          │
│                                                         │
│  1.  Request ID         assign unique trace ID          │
│  2.  IP Extraction      resolve real IP behind Traefik  │
│  3.  Authentication     validate JWT or API key         │
│  4.  Tenant Resolution  resolve + attach tenantId       │
│  5.  Rate Limiting      sliding window check (Redis)    │
│  6.  Throttling         per-surface concurrency cap     │
│  7.  Request Log        structured log entry (pre)      │
│  8.  Audit Pre-hook     snapshot state (write ops only) │
│                                                         │
└───────────────────┬─────────────────────────────────────┘
                    │ forward
                    ▼
┌─────────────────────────────────────────────────────────┐
│  SURFACE ROUTER                                         │
│  /dashboard/v1   /dm/v1   /admin/v1                     │
│  /mobile/v1   /cli/v1   /auth/v1   /agent/v1            │
└───────────────────┬─────────────────────────────────────┘
                    │ response
                    ▼
┌─────────────────────────────────────────────────────────┐
│  GATEWAY LAYER (post)                                   │
│                                                         │
│  9.  Response Log       status, duration, size          │
│  10. Audit Post-hook    write audit record (write ops)  │
│                                                         │
└─────────────────────────────────────────────────────────┘
                    │
                    ▼
Response returned to client
Gateway Responsibilities
1. Request ID
Every request is assigned a ULID requestId before any other processing. It is:
- Attached to the Fastify request object (req.requestId)
- Included in every log line for this request
- Returned in the response header: X-Request-Id: <ulid>
- Used to correlate logs across services (passed to BullMQ jobs, agent callbacks, MongoDB writes)
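// gateway/request-id.ts
import { ulid } from 'ulid';

fastify.addHook('onRequest', async (req, reply) => {
  req.requestId = ulid();
  // Bind the ID to the request logger so every log line carries it.
  req.log = req.log.child({ requestId: req.requestId });
  // Echo the ID back to the caller in the response header.
  reply.header('X-Request-Id', req.requestId);
});
2. IP Extraction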
The API sits behind Traefik. The real client IP is in the X-Forwarded-For header. The gateway normalises this to req.clientIp (used by rate limiting and audit logging).
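// Trust only the first IP in the chain (set by Traefik)
const xff = req.headers['x-forwarded-for'];
// Handle both header shapes, strip whitespace, and fall back to the socket IP if empty.
const firstHop = (Array.isArray(xff) ? xff[0] : xff)?.split(',')[0]?.trim();
req.clientIp = firstHop || req.ip;
3. Authentication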
Two credential types are accepted:
| Credential | Format | Used by |
|---|---|---|
| JWT (access token) | Authorization: Bearer <jwt> | Dashboard, DM Portal, Manage, Mobile, agent callbacks (task-scoped JWT) |
| API key | Authorization: ApiKey <key> | CLI, external integrations |
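The two header formats in the table can be told apart with a small parser. This is a minimal sketch: the function name `parseAuthorization` and the `Credential` shape are illustrative assumptions, not part of the documented gateway.

```typescript
// Hypothetical helper: name and return shape are assumptions for illustration.
type Credential =
  | { kind: "jwt"; token: string }
  | { kind: "api_key"; key: string };

function parseAuthorization(header: string | undefined): Credential | null {
  if (!header) return null; // missing header: caller responds 401 UNAUTHORIZED
  // Expect exactly "<scheme> <value>" with no embedded whitespace in the value.
  const match = /^(\S+)\s+(\S+)$/.exec(header.trim());
  if (!match) return null;
  const scheme = match[1] ?? "";
  const value = match[2] ?? "";
  switch (scheme.toLowerCase()) {
    case "bearer":
      return { kind: "jwt", token: value };
    case "apikey":
      return { kind: "api_key", key: value };
    default:
      return null; // any other scheme (e.g. Basic) is not accepted
  }
}
```

Returning `null` for both a missing and an unrecognised header keeps the decision about which error code to send in the calling hook.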
JWT validation flow:
- Decode header; reject if malformed
- Verify signature against JWT_SECRET
- Check exp; reject if expired (return 401 with WWW-Authenticate: Bearer error="expired_token")
- Attach decoded payload to req.principal
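The flow above can be sketched with Node's built-in crypto module. The sketch assumes tokens are signed with HS256, which the text implies by naming a shared JWT_SECRET but does not state. The names `verifyJwt`, `signJwt`, and `JwtResult` are hypothetical, and a real gateway would more likely rely on a maintained JWT library.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type JwtResult =
  | { ok: true; payload: Record<string, unknown> }
  | { ok: false; status: 401; code: "INVALID_TOKEN" | "TOKEN_EXPIRED" };

// Decode one base64url JSON segment; null if it is not a JSON object.
function decodePart(part: string): Record<string, unknown> | null {
  try {
    const value: unknown = JSON.parse(Buffer.from(part, "base64url").toString("utf8"));
    return typeof value === "object" && value !== null && !Array.isArray(value)
      ? (value as Record<string, unknown>)
      : null;
  } catch {
    return null;
  }
}

// Assumption: HS256 only. nowSec is passed in so the check is deterministic.
function verifyJwt(token: string, secret: string, nowSec: number): JwtResult {
  const invalid: JwtResult = { ok: false, status: 401, code: "INVALID_TOKEN" };
  const parts = token.split(".");
  if (parts.length !== 3) return invalid;
  const [h, p, s] = parts as [string, string, string];

  // 1. Decode header; reject if malformed or an unexpected algorithm.
  const header = decodePart(h);
  if (header === null || header["alg"] !== "HS256") return invalid;

  // 2. Verify the signature in constant time.
  const expected = createHmac("sha256", secret).update(`${h}.${p}`).digest();
  const actual = Buffer.from(s, "base64url");
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) return invalid;

  // 3. Check exp (seconds since epoch); the token is expired once now >= exp.
  const payload = decodePart(p);
  if (payload === null) return invalid;
  const exp = payload["exp"];
  if (typeof exp !== "number") return invalid;
  if (exp <= nowSec) return { ok: false, status: 401, code: "TOKEN_EXPIRED" };

  // 4. The caller attaches the payload to req.principal.
  return { ok: true, payload };
}

// Demo helper only: mints an HS256 token that verifyJwt accepts.
function signJwt(payload: Record<string, unknown>, secret: string): string {
  const enc = (v: unknown): string => Buffer.from(JSON.stringify(v)).toString("base64url");
  const body = `${enc({ alg: "HS256", typ: "JWT" })}.${enc(payload)}`;
  const sig = createHmac("sha256", secret).update(body).digest("base64url");
  return `${body}.${sig}`;
}
```

Pinning the accepted algorithm before verifying avoids trusting whatever `alg` the token itself declares.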
API key validation flow:
- Look up key hash in PostgreSQL api_keys table
- Check is_active and expires_at
- Load associated user + role + tenant scopes
- Attach to req.principal in the same shape as JWT principal
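A minimal sketch of the first two steps follows. It assumes keys are stored as hex SHA-256 digests and uses a synchronous lookup callback as a stand-in for the PostgreSQL query. The row fields other than is_active and expires_at, and the names `hashApiKey` and `validateApiKey`, are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumed row shape; only is_active and expires_at are named in the text.
interface ApiKeyRow {
  ref_id: string;
  key_hash: string;
  is_active: boolean;
  expires_at: Date | null; // null is assumed to mean "never expires"
  user_id: string;
}

// Assumption: the table stores a hex SHA-256 digest, never the raw key.
const hashApiKey = (key: string): string =>
  createHash("sha256").update(key).digest("hex");

// Stand-in for a query against api_keys; a real lookup would be async.
type KeyLookup = (keyHash: string) => ApiKeyRow | undefined;

function validateApiKey(key: string, lookup: KeyLookup, now: Date): ApiKeyRow | null {
  const row = lookup(hashApiKey(key));
  if (!row || !row.is_active) return null; // not found or revoked
  if (row.expires_at !== null && row.expires_at.getTime() <= now.getTime()) return null;
  return row; // caller loads user, role, and tenant scopes next
}
```

A `null` result corresponds to the 401 INVALID_API_KEY case in the gateway error table.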
Agent callbacks use a short-lived task-scoped JWT (issued per run, 30-min TTL, signed with JWT_SECRET). The gateway validates these the same way as user JWTs — the sub is the runId, not a userId.
Unauthenticated paths (bypass auth hook):
- POST /auth/v1/login
- POST /auth/v1/refresh
- GET /health
interface GatewayPrincipal {
id: string; // user ref_id or runId (agent)
type: 'human' | 'agent' | 'api_key';
role: 'admin' | 'member' | 'reviewer' | 'super_admin' | 'agent';
tenantId?: string; // absent for super_admin and cross-tenant reviewers
appAccess: string[]; // which surfaces this principal can reach
keyId?: string; // api_keys.ref_id (api_key type only)
}
4. Tenant Resolution
After auth, the gateway resolves and validates the tenant context:
| Surface | Source of tenantId |
|---|---|
| /dashboard/v1 | JWT tenantId field |
| /mobile/v1 | JWT tenantId field |
| /dm/v1 | Optional ?tenantId= query param; validated against reviewer’s assigned tenants |
| /admin/v1 | Optional path param :tenantId; no restriction (super_admin only) |
| /cli/v1 | Optional ?tenantId= or X-Tenant-Id header; validated against principal’s access |
| /agent/v1 | From run record in PostgreSQL (looked up by runId) |
The resolved tenant record is attached to req.tenant. All downstream route handlers use req.tenant.id — they never look it up themselves.
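The per-surface rules in the table can be expressed as one pure function that only selects and authorises the tenant ID. This is a sketch under assumptions: `TenantInputs`, `allowedTenantIds`, and `resolveTenantId` are hypothetical names, and loading the tenant record (including the TENANT_NOT_FOUND case) is left to the caller.

```typescript
type Surface = "dashboard" | "mobile" | "dm" | "admin" | "cli" | "agent";

// Hypothetical input bag: every field name here is an assumption.
interface TenantInputs {
  jwtTenantId?: string;       // tenantId field from the JWT
  queryTenantId?: string;     // ?tenantId=
  headerTenantId?: string;    // X-Tenant-Id
  pathTenantId?: string;      // :tenantId path param
  runTenantId?: string;       // tenant of the run record (agent surface)
  allowedTenantIds: string[]; // tenants the principal may act on
}

type TenantResult =
  | { ok: true; tenantId: string | null } // null: no tenant context requested
  | { ok: false; status: 403; code: "FORBIDDEN" };

function resolveTenantId(surface: Surface, input: TenantInputs): TenantResult {
  switch (surface) {
    case "dashboard":
    case "mobile":
      return { ok: true, tenantId: input.jwtTenantId ?? null };
    case "dm":
    case "cli": {
      // Only the CLI surface accepts the X-Tenant-Id header as a fallback.
      const requested =
        input.queryTenantId ?? (surface === "cli" ? input.headerTenantId : undefined);
      if (requested === undefined) return { ok: true, tenantId: null };
      return input.allowedTenantIds.includes(requested)
        ? { ok: true, tenantId: requested }
        : { ok: false, status: 403, code: "FORBIDDEN" };
    }
    case "admin":
      // No restriction: the surface itself is limited to super_admin.
      return { ok: true, tenantId: input.pathTenantId ?? null };
    case "agent":
      return { ok: true, tenantId: input.runTenantId ?? null };
  }
}
```

Keeping this step free of I/O makes the access rules easy to unit test separately from the database lookup that attaches req.tenant.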
5. Rate Limiting
Sliding window rate limits enforced per (tenantId, userId) key in Redis. Limits are configured per surface:
| Surface | Requests / minute | Burst allowance |
|---|---|---|
| /dashboard/v1 | 300 | +60 (20% burst) |
| /dm/v1 | 600 | +120 |
| /admin/v1 | 120 | +24 |
| /mobile/v1 | 200 | +40 |
| /cli/v1 | 1,200 | +240 (scripts may send bursts) |
| /agent/v1 | 60 per runId | n/a |
On every request, the gateway:
- Increments the Redis counter for rate:{surface}:{tenantId}:{userId}
- Sets TTL of 60s on first write
- If count > limit: return 429 with Retry-After and X-RateLimit-* headers
- Otherwise: attach remaining count to response headers
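The steps above can be sketched with an in-memory map standing in for Redis. Note that an increment plus a 60-second TTL behaves as a fixed-window counter; a true sliding window would need extra state, such as per-request timestamps. The class name `FixedWindowLimiter` is hypothetical, and how the burst allowance is applied is not specified in the text, so the caller passes the effective limit.

```typescript
interface RateDecision {
  allowed: boolean;
  headers: Record<string, string>;
}

// In-memory stand-in for Redis INCR + EXPIRE; single-process only.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { count: number; resetAtSec: number }>();
  private readonly windowSec: number;

  constructor(windowSec = 60) {
    this.windowSec = windowSec;
  }

  check(key: string, limit: number, nowSec: number): RateDecision {
    let w = this.windows.get(key);
    // First write, or the previous window's TTL has elapsed: start a new window.
    if (!w || nowSec >= w.resetAtSec) {
      w = { count: 0, resetAtSec: nowSec + this.windowSec };
      this.windows.set(key, w);
    }
    w.count += 1;
    const allowed = w.count <= limit;
    const headers: Record<string, string> = {
      "X-RateLimit-Limit": String(limit),
      "X-RateLimit-Remaining": String(Math.max(0, limit - w.count)),
      "X-RateLimit-Reset": String(w.resetAtSec),
    };
    // Retry-After is only sent with a 429.
    if (!allowed) headers["Retry-After"] = String(w.resetAtSec - nowSec);
    return { allowed, headers };
  }
}

// Key layout follows the text: rate:{surface}:{tenantId}:{userId}
const rateKey = (surface: string, tenantId: string, userId: string): string =>
  `rate:${surface}:${tenantId}:${userId}`;
```

With Redis, the increment and the first-write TTL should run atomically (for example in one script or transaction) so a crash between them cannot leave a counter without an expiry.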
X-RateLimit-Limit: 300
X-RateLimit-Remaining: 247
X-RateLimit-Reset: 1704067261 # Unix timestamp when window resets
Retry-After: 14 # seconds (only on 429)
6. Throttling
Rate limiting counts total requests over time. Throttling caps the number of concurrent requests per user on each surface, so that a single client cannot saturate the API with slow, long-running requests (e.g. long-lived SSE connections, file uploads).
| Surface | Max concurrent per user |
|---|---|
| /dashboard/v1 | 10 |
| /dm/v1 | 20 |
| /admin/v1 | 5 |
| /mobile/v1 | 8 |
| /cli/v1 | 15 |
| SSE connections | 5 per user (shared across surfaces) |
Implemented with a Redis counter incremented on request start, decremented on response end (including on error/disconnect).
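The increment-on-start, decrement-on-end pattern can be sketched with an in-memory counter in place of Redis. `ConcurrencyGate`, `withSlot`, and the key format are hypothetical; the sketch is synchronous for brevity, whereas a real hook would release the slot when the response finishes or the connection closes.

```typescript
// In-memory stand-in for a Redis counter; single-process only.
class ConcurrencyGate {
  private readonly inFlight = new Map<string, number>();

  // Takes a slot and returns true, or returns false if the cap is reached.
  tryAcquire(key: string, max: number): boolean {
    const current = this.inFlight.get(key) ?? 0;
    if (current >= max) return false;
    this.inFlight.set(key, current + 1);
    return true;
  }

  // Must run exactly once per successful tryAcquire, including on errors.
  release(key: string): void {
    const current = this.inFlight.get(key) ?? 0;
    if (current <= 1) this.inFlight.delete(key);
    else this.inFlight.set(key, current - 1);
  }

  active(key: string): number {
    return this.inFlight.get(key) ?? 0;
  }
}

// Runs work inside a slot; the finally block frees the slot even if work throws.
function withSlot<T>(
  gate: ConcurrencyGate,
  key: string,
  max: number,
  work: () => T,
): T | "THROTTLED" {
  if (!gate.tryAcquire(key, max)) return "THROTTLED";
  try {
    return work();
  } finally {
    gate.release(key);
  }
}
```

The `finally` block is the important part: a slot that is never released permanently lowers that user's effective cap.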
7. Request Logging
Every request is logged as a structured JSON entry before the route handler runs:
{
"level": "info",
"time": "2026-04-04T09:00:00.000Z",
"requestId": "01ARZ3NDEKTSV4RRFFQ69G5FAV",
"method": "POST",
"path": "/dm/v1/approvals/01ARZ.../resolve",
"surface": "dm",
"userId": "01BRZ...",
"tenantId": "01CRZ...",
"clientIp": "203.0.113.5",
"userAgent": "Leadmetrics-CLI/1.0.0"
}
And after the handler completes:
{
"level": "info",
"requestId": "01ARZ3NDEKTSV4RRFFQ69G5FAV",
"status": 200,
"durationMs": 42,
"responseBytes": 318
}
Logs are written via pino (already in the stack) and shipped to Grafana Loki.
8 & 10. Audit Logging
The gateway writes an audit record for every state-changing operation (POST, PUT, PATCH, DELETE). Read-only GETs are not audited (they are covered by request logs).
Pre-hook (step 8): Before the handler runs, for update/delete operations, the gateway fetches the current state of the resource and attaches it to req.auditBefore.
Post-hook (step 10): After the handler returns successfully, the gateway writes an audit_logs record to PostgreSQL:
interface AuditLog {
id: string; // ULID
requestId: string; // from step 1
tenantId: string | null;
actorId: string; // user ref_id or 'agent:{runId}'
actorType: 'human' | 'agent' | 'api_key';
impersonating: string | null; // set if super_admin is impersonating
surface: string; // 'dashboard' | 'dm' | 'admin' | 'mobile' | 'cli' | 'agent'
method: string; // HTTP method
path: string; // full request path
action: string; // semantic label, e.g. 'approval.resolve'
resourceType: string; // e.g. 'approval', 'activity', 'tenant'
resourceId: string; // ULID of the affected resource
before: JsonValue | null; // state before change (nullable)
after: JsonValue | null; // state after change (nullable)
status: number; // HTTP response status
durationMs: number;
createdAt: Date;
}
The action label is set by the route handler via req.setAuditAction('approval.resolve'). If not set, the gateway derives it from the method + path (e.g. POST /dm/v1/approvals/:id/resolve → dm.approvals.resolve).
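One way the fallback derivation could work is sketched below. Only the single example in the text is given, so the rules here are assumptions: version segments and path parameters are dropped, a literal segment that directly follows a parameter is treated as the verb, and otherwise a verb is taken from the HTTP method. The names `isAudited` and `deriveAuditAction` are hypothetical.

```typescript
// Write methods are the audited ones; the verb names are an assumption.
const AUDITED_METHODS = new Map<string, string>([
  ["POST", "create"],
  ["PUT", "update"],
  ["PATCH", "update"],
  ["DELETE", "delete"],
]);

const isAudited = (method: string): boolean =>
  AUDITED_METHODS.has(method.toUpperCase());

function deriveAuditAction(method: string, routePattern: string): string {
  const segments = routePattern.split("/").filter((s) => s.length > 0);
  const isParam = (s: string): boolean => s.startsWith(":");
  const isVersion = (s: string): boolean => /^v\d+$/.test(s);

  // Keep only literal, non-version segments: /dm/v1/approvals/:id -> dm, approvals
  const names = segments.filter((s) => !isParam(s) && !isVersion(s));

  const last = segments[segments.length - 1];
  const beforeLast = segments[segments.length - 2];
  // A literal right after a param (".../:id/resolve") already names the action.
  const endsWithVerb =
    last !== undefined && !isParam(last) &&
    beforeLast !== undefined && isParam(beforeLast);
  if (endsWithVerb) return names.join(".");

  // Otherwise fall back to a verb derived from the HTTP method.
  const verb = AUDITED_METHODS.get(method.toUpperCase()) ?? method.toLowerCase();
  return [...names, verb].join(".");
}
```

Deriving from the route pattern rather than the concrete URL keeps resource IDs out of the action label.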
Impersonation flagging: When a super_admin is impersonating a tenant, impersonating is set to the target tenantId. This makes every action traceable even across impersonation sessions.
Package Structure
apps/api/src/
├── gateway/
│ ├── index.ts # Fastify plugin — registers hooks in order
│ ├── request-id.ts # Step 1 — assign ULID request ID
│ ├── ip.ts # Step 2 — resolve real client IP
│ ├── auth.ts # Step 3 — JWT + API key validation
│ ├── tenant.ts # Step 4 — tenant resolution + attachment
│ ├── rate-limit.ts # Step 5 — sliding window Redis rate limiter
│ ├── throttle.ts # Step 6 — concurrent request cap
│ ├── logger.ts # Step 7 + 9 — request + response logs
│ └── audit.ts # Step 8 + 10 — audit pre/post hooks
│
├── routers/
│ ├── auth/ # /auth/v1
│ ├── dashboard/ # /dashboard/v1
│ ├── dm/ # /dm/v1
│ ├── admin/ # /admin/v1
│ ├── mobile/ # /mobile/v1
│ ├── cli/ # /cli/v1
│ └── agent/ # /agent/v1
│
├── common/ # Shared non-gateway utilities
│ ├── pagination.ts
│ ├── error.ts
│ └── sse.ts
│
└── index.ts # App bootstrap: register gateway plugin, then routers
Registration order in index.ts:
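// 1. Register gateway (must be first, before any routers).
//    The plugin must be wrapped with fastify-plugin; otherwise Fastify's
//    encapsulation keeps its hooks from applying to the sibling routers.
await fastify.register(gatewayPlugin);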
// 2. Register surface routers (gateway hooks already in place)
await fastify.register(authRouter, { prefix: '/auth/v1' });
await fastify.register(dashboardRouter, { prefix: '/dashboard/v1' });
await fastify.register(dmRouter, { prefix: '/dm/v1' });
await fastify.register(adminRouter, { prefix: '/admin/v1' });
await fastify.register(mobileRouter, { prefix: '/mobile/v1' });
await fastify.register(cliRouter, { prefix: '/cli/v1' });
await fastify.register(agentRouter, { prefix: '/agent/v1' });
Surface Access Matrix
The gateway enforces this access matrix before forwarding to any router:
| Surface prefix | Allowed roles | Allowed credential types |
|---|---|---|
| /auth/v1 | Public (login/refresh endpoints) | None required |
| /dashboard/v1 | admin, member | JWT |
| /dm/v1 | reviewer, super_admin | JWT |
| /admin/v1 | super_admin | JWT |
| /mobile/v1 | admin, member | JWT |
| /cli/v1 | reviewer, super_admin | API key, JWT |
| /agent/v1 | agent | Task-scoped JWT |
Any mismatch returns 403 Forbidden before the router is reached.
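The matrix lends itself to a declarative lookup. The sketch below assumes the credential column maps onto the `type` field of GatewayPrincipal (JWT as "human", API key as "api_key", task-scoped JWT as "agent") and denies unknown prefixes. `ACCESS_MATRIX` and `isAllowed` are hypothetical names.

```typescript
type Role = "admin" | "member" | "reviewer" | "super_admin" | "agent";
type PrincipalType = "human" | "agent" | "api_key";

interface Principal {
  role: Role;
  type: PrincipalType;
}

interface SurfaceRule {
  roles: readonly Role[];
  types: readonly PrincipalType[];
}

// Assumed mapping: JWT -> "human", API key -> "api_key", task-scoped JWT -> "agent".
const ACCESS_MATRIX: Readonly<Record<string, SurfaceRule | "public">> = {
  "/auth/v1": "public",
  "/dashboard/v1": { roles: ["admin", "member"], types: ["human"] },
  "/dm/v1": { roles: ["reviewer", "super_admin"], types: ["human"] },
  "/admin/v1": { roles: ["super_admin"], types: ["human"] },
  "/mobile/v1": { roles: ["admin", "member"], types: ["human"] },
  "/cli/v1": { roles: ["reviewer", "super_admin"], types: ["api_key", "human"] },
  "/agent/v1": { roles: ["agent"], types: ["agent"] },
};

function isAllowed(path: string, principal: Principal | null): boolean {
  // Match on whole path segments so "/dm/v10" does not match "/dm/v1".
  const prefix = Object.keys(ACCESS_MATRIX).find(
    (p) => path === p || path.startsWith(p + "/"),
  );
  if (prefix === undefined) return false; // unknown surface: deny (assumption)
  const rule = ACCESS_MATRIX[prefix];
  if (rule === undefined) return false;
  if (rule === "public") return true;
  if (principal === null) return false;
  return rule.roles.includes(principal.role) && rule.types.includes(principal.type);
}
```

A `false` result corresponds to the 403 Forbidden response described above.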
Error Responses from the Gateway
Gateway errors use the same ApiError envelope as surface routers:
| Step | Error condition | Status | Code |
|---|---|---|---|
| Auth | Missing Authorization header | 401 | UNAUTHORIZED |
| Auth | JWT signature invalid | 401 | INVALID_TOKEN |
| Auth | JWT expired | 401 | TOKEN_EXPIRED |
| Auth | API key not found / inactive | 401 | INVALID_API_KEY |
| Tenant | tenantId in JWT not found in DB | 401 | TENANT_NOT_FOUND |
| Tenant | Reviewer not assigned to requested tenant | 403 | FORBIDDEN |
| Surface access | Role not allowed on this surface | 403 | FORBIDDEN |
| Rate limit | Request count exceeded | 429 | RATE_LIMITED |
| Throttle | Concurrent request cap exceeded | 429 | THROTTLED |
Future Upgrade — Option B: Separate Gateway Service
Future story — not in scope for the current build.
As tenant volume and traffic grow, the gateway layer can be extracted into a standalone gateway service that proxies to separately-deployed surface services. The upgrade path is:
- Extract gateway plugin into a standalone Fastify app (apps/gateway)
- Split surface routers into separate deployable services (apps/api-dashboard, apps/api-dm, apps/api-admin, etc.), each on its own port
- Gateway proxies using @fastify/http-proxy, routing by URL prefix to the correct downstream service after running all gateway hooks
- Each service removes its own auth/rate-limit middleware (gateway now owns this exclusively)
- Coolify deploys gateway + each service as separate containers with a private Docker network between them; only the gateway is exposed to Traefik
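The prefix-based routing in the third step amounts to a lookup table from URL prefix to downstream service. The sketch below lists only the three services named above; the hostnames, ports, and the names `UPSTREAMS` and `resolveUpstream` are hypothetical. With @fastify/http-proxy, each entry would roughly correspond to one plugin registration using its `upstream` and `prefix` options.

```typescript
// Hostnames and ports are placeholders, not values from the design.
const UPSTREAMS: ReadonlyArray<{ prefix: string; upstream: string }> = [
  { prefix: "/dashboard/v1", upstream: "http://api-dashboard:3001" },
  { prefix: "/dm/v1", upstream: "http://api-dm:3002" },
  { prefix: "/admin/v1", upstream: "http://api-admin:3003" },
];

// Returns the downstream base URL for a request path, or null if no surface matches.
function resolveUpstream(path: string): string | null {
  const hit = UPSTREAMS.find(
    ({ prefix }) => path === prefix || path.startsWith(prefix + "/"),
  );
  return hit ? hit.upstream : null;
}
```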
Benefits of Option B:
- Individual surfaces can scale independently (e.g. api-dm gets more replicas during peak review hours)
- Surface services can be deployed separately without taking down the whole API
- Gateway becomes a true choke point — circuit breaking, retries, and observability all in one place
Pre-conditions before migration:
- Each surface must be independently testable
- API contracts between gateway and services must be stable
- Per-service latency must be monitored, to identify where scaling is actually needed