Gap 13: No Cost Circuit Breaker

Problem

AgentRun.costUsd is tracked per run but there is no alerting, rate limiting, or automatic circuit-breaking if costs spike. A runaway agent — one stuck in a tool-call loop, processing an unexpectedly large document, or retrying repeatedly — would exhaust a tenant’s credit budget or the platform’s API quota before any human notices.

The current BullMQ retry configuration (4 attempts, exponential backoff from 5s) means a failing job can run four full attempts without any cost or quality check between them.
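
The retry exposure can be made concrete with a little arithmetic. The sketch below assumes the backoff delay doubles on each retry starting from the configured base, and that every attempt costs roughly the same. `retrySchedule` and `worstCaseRetrySpendUsd` are hypothetical helper names used only for illustration, not existing code.

```typescript
// Hypothetical helpers for reasoning about retry exposure; not part of the codebase.

/** Delay in ms before each retry, assuming the delay doubles from baseDelayMs. */
function retrySchedule(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  // With N attempts there are N - 1 retries.
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * Math.pow(2, retry - 1));
  }
  return delays;
}

/** Upper bound on spend for one job if every attempt incurs the per-attempt cost. */
function worstCaseRetrySpendUsd(attempts: number, costPerAttemptUsd: number): number {
  return attempts * costPerAttemptUsd;
}
```

With 4 attempts and a 5 s base this gives waits of 5 s, 10 s and 20 s, so a failing job burns through all of its attempts after about 35 seconds of backoff. At $0.20 per attempt the job can cost up to $0.80.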

Concrete failure scenarios

Tool call loop: An agent calls search_knowledge.js in a loop (see Gap 3 — hallucination). Each loop iteration uses input tokens. By the time the 900-second timeout kills the job, it has consumed 3× the expected token count.
Large document processing: A tenant uploads a 200-page PDF to their knowledge base. The context-file-writer ingests the entire thing into its prompt (no token budget — see Gap 8). One run costs $4 instead of the expected $0.40.
Retry storm: A transient error causes the blog-writer to fail at the last step. BullMQ retries until all 4 attempts are used. Each attempt rebuilds the full prompt and calls the adapter. 4 failed runs at $0.20 each = $0.80 wasted.
Tenant credit exhaustion: A tenant with 100 credits remaining has 8 background jobs enqueued. All 8 start simultaneously. Credits run out mid-batch; some jobs complete, some fail, some are partially completed.

What to Build

1. Per-run cost cap in AgentConfig

Add a maximum cost threshold per agent role:


model AgentConfig {
  // existing fields
  maxCostUsdPerRun  Float?   // null = no limit; e.g. 2.0 for expensive agents
  maxTokensPerRun   Int?     // hard token limit at adapter level
}

Default caps by role (suggested):

Agent role	maxCostUsdPerRun	maxTokensPerRun
blog-writer	$1.00	50,000
strategy-writer	$3.00	100,000
context-file-writer	$2.00	80,000
social-post-writer	$0.20	10,000
insight workers	$0.50	20,000
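
One way to carry the suggested defaults into code is a lookup keyed by role, used when seeding AgentConfig rows. This is a sketch only: `RunCaps`, `DEFAULT_RUN_CAPS`, `defaultCapsFor` and the `insight-worker` key are hypothetical names, and unknown roles fall back to null (no limit), matching the schema comment above.

```typescript
// Suggested per-role defaults from the table; all names here are illustrative.
interface RunCaps {
  maxCostUsdPerRun: number | null; // null = no limit
  maxTokensPerRun: number | null;  // null = no limit
}

const DEFAULT_RUN_CAPS: Record<string, RunCaps> = {
  "blog-writer": { maxCostUsdPerRun: 1.0, maxTokensPerRun: 50000 },
  "strategy-writer": { maxCostUsdPerRun: 3.0, maxTokensPerRun: 100000 },
  "context-file-writer": { maxCostUsdPerRun: 2.0, maxTokensPerRun: 80000 },
  "social-post-writer": { maxCostUsdPerRun: 0.2, maxTokensPerRun: 10000 },
  // Hypothetical key standing in for the group of insight workers.
  "insight-worker": { maxCostUsdPerRun: 0.5, maxTokensPerRun: 20000 },
};

/** Returns the suggested caps for a role, or no limits if the role is unknown. */
function defaultCapsFor(role: string): RunCaps {
  const caps = DEFAULT_RUN_CAPS[role];
  return caps !== undefined ? caps : { maxCostUsdPerRun: null, maxTokensPerRun: null };
}
```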

2. Running cost estimator in adapter progress callback

The adapter already has a progress callback that fires on tool_use events and increments toolsUsed. Extend it to accumulate a running cost estimate from usage events and abort the run once the cap is exceeded:


// In setup.worker.ts / blog-writer.worker.ts
// Assumes abortController is the AbortController whose signal the adapter observes.

let runningCostEstimate = 0;
const MAX_COST = agentConfig.maxCostUsdPerRun ?? Infinity; // null = no limit

const progressCallback = (event: AdapterProgressEvent) => {
  if (event.type === "usage") {
    runningCostEstimate += estimateCost(
      agentConfig.model,
      event.inputTokens ?? 0,
      event.outputTokens ?? 0
    );

    // Abort only once; further usage events may still arrive while the adapter shuts down.
    if (runningCostEstimate > MAX_COST && !abortController.signal.aborted) {
      abortController.abort(`Cost limit exceeded: $${runningCostEstimate.toFixed(4)} > $${MAX_COST}`);
    }
  }
};

When the abort fires, the job fails with a structured cost_limit_exceeded error (see Gap — structured errors).
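
The callback above calls estimateCost, which is not defined in this document. A minimal sketch follows, assuming prices are expressed per million tokens and looked up by model name. The model names and prices are placeholders, not real rates, and `ModelPricing`, `MODEL_PRICING` and `FALLBACK_PRICING` are hypothetical.

```typescript
// Placeholder prices in USD per one million tokens. These are NOT real rates;
// real values would come from the provider's price list or a platform setting.
interface ModelPricing {
  inputUsdPerMTok: number;
  outputUsdPerMTok: number;
}

const MODEL_PRICING: Record<string, ModelPricing> = {
  "example-large": { inputUsdPerMTok: 10, outputUsdPerMTok: 30 },
  "example-small": { inputUsdPerMTok: 1, outputUsdPerMTok: 3 },
};

// Unknown models use the most expensive placeholder rate so the cap errs on the safe side.
const FALLBACK_PRICING: ModelPricing = { inputUsdPerMTok: 10, outputUsdPerMTok: 30 };

/** Estimated cost in USD for one usage event. */
function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const known = MODEL_PRICING[model];
  const pricing = known !== undefined ? known : FALLBACK_PRICING;
  return (
    (inputTokens * pricing.inputUsdPerMTok + outputTokens * pricing.outputUsdPerMTok) /
    1000000
  );
}
```

Because the result is an estimate, the authoritative figure remains AgentRun.costUsd recorded for the run. The running total only needs to be accurate enough to stop a run that is clearly over its cap.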

3. Tenant-level credit pre-check before enqueue

Before adding any job to the queue, check that the tenant has sufficient credits to cover the estimated cost:


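// packages/queue/src/enqueue-guard.ts

import type { Queue } from "bullmq";

// Require 1.5× the estimate so a run that overshoots does not overdraw the tenant.
const CREDIT_SAFETY_MARGIN = 1.5;

// getCreditBalance and InsufficientCreditsError are project helpers;
// the balance is assumed to be expressed in USD, like the estimate.
export async function enqueueWithCreditCheck(
  queue: Queue,
  jobName: string,
  jobData: unknown,
  opts: { estimatedCostUsd: number; tenantId: string; priority?: number }
) {
  const balance = await getCreditBalance(opts.tenantId);
  const required = opts.estimatedCostUsd * CREDIT_SAFETY_MARGIN;

  if (balance < required) {
    throw new InsufficientCreditsError(
      `Tenant ${opts.tenantId} has $${balance} in credits but the job needs ~$${required} (estimate $${opts.estimatedCostUsd} plus 50% margin)`
    );
  }

  return queue.add(jobName, jobData, { priority: opts.priority });
}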
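
The pre-check reads the balance without holding anything back, so in the scenario where 8 jobs are enqueued together, every job could pass the check against the same remaining credits. One possible refinement is to reserve the estimate at enqueue time and settle it when the run ends. The sketch below uses a hypothetical in-memory `CreditLedger` purely to show the bookkeeping; a real version would need an atomic update in the database.

```typescript
// Hypothetical in-memory ledger; illustrates reservation bookkeeping only.
class CreditLedger {
  private balanceUsd: number;
  private reservedUsd = 0;

  constructor(initialBalanceUsd: number) {
    this.balanceUsd = initialBalanceUsd;
  }

  /** Balance not yet promised to an enqueued or running job. */
  availableUsd(): number {
    return this.balanceUsd - this.reservedUsd;
  }

  /** Reserve funds for a job; returns false if the unreserved balance is too low. */
  tryReserve(amountUsd: number): boolean {
    if (this.availableUsd() < amountUsd) return false;
    this.reservedUsd += amountUsd;
    return true;
  }

  /** Settle a finished run: drop its reservation and charge what it actually spent. */
  settle(reservedAmountUsd: number, actualCostUsd: number): void {
    this.reservedUsd -= reservedAmountUsd;
    this.balanceUsd -= actualCostUsd;
  }
}
```

With a balance of 100 and eight jobs reserving 15 each, only the first six reservations succeed. The remaining two are rejected at enqueue time instead of failing mid-run.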

4. Platform-level daily cost circuit breaker

A global circuit breaker that trips if the platform’s total LLM spend exceeds a daily threshold:


// packages/agents/src/lib/cost-circuit-breaker.ts

const DAILY_PLATFORM_LIMIT_USD = 500; // configurable via PlatformSetting

export async function checkPlatformCircuitBreaker(): Promise<void> {
  // Only finished runs are summed, so the cost of in-flight runs is not counted yet.
  // startOfDay is assumed to be an imported date helper (e.g. from date-fns).
  const todaySpend = await db.agentRun.aggregate({
    _sum: { costUsd: true },
    where: {
      startedAt: { gte: startOfDay(new Date()) },
      status: { in: ["completed", "failed"] },
    },
  });

  // _sum.costUsd is null when no runs match.
  const spend = todaySpend._sum.costUsd ?? 0;

  if (spend > DAILY_PLATFORM_LIMIT_USD) {
    // Trip the breaker: pause all LOW and BACKGROUND priority queues
    await pauseLowPriorityQueues();
    await sendAdminAlert({
      type: "cost_circuit_breaker_tripped",
      message: `Platform daily spend $${spend.toFixed(2)} exceeded limit $${DAILY_PLATFORM_LIMIT_USD}`,
    });
    throw new CircuitBreakerError(`Platform daily cost limit reached: $${spend.toFixed(2)}`);
  }
}

Only pause LOW/BACKGROUND queues — CRITICAL and HIGH priority jobs (user-facing, rejection re-runs) are allowed to continue.
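
As written, checkPlatformCircuitBreaker throws for every caller once the limit is passed, and it would send a fresh alert on each call. A sketch of how the decision could be split into a latch (alert once) and a per-priority admission check follows. `JobPriority`, `BreakerState`, `evaluateBreaker` and `shouldAdmit` are hypothetical names, and the priority values are only the four levels named in this section.

```typescript
// Hypothetical decision helpers; the four priority levels are those named in this section.
type JobPriority = "CRITICAL" | "HIGH" | "LOW" | "BACKGROUND";

interface BreakerState {
  tripped: boolean;
}

/**
 * Computes the next breaker state from today's spend.
 * justTripped is true only on the transition from not tripped to tripped,
 * so the caller can pause queues and alert exactly once.
 */
function evaluateBreaker(
  previous: BreakerState,
  spendUsd: number,
  limitUsd: number
): { state: BreakerState; justTripped: boolean } {
  const overLimit = spendUsd > limitUsd;
  return { state: { tripped: overLimit }, justTripped: overLimit && !previous.tripped };
}

/** CRITICAL and HIGH jobs are always admitted; others only while the breaker is closed. */
function shouldAdmit(priority: JobPriority, state: BreakerState): boolean {
  if (!state.tripped) return true;
  return priority === "CRITICAL" || priority === "HIGH";
}
```

In this sketch the breaker resets on its own when the daily spend figure drops back under the limit, for example at the start of the next day. Whether reset should instead require an admin action is a design choice this document does not settle.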

5. Cost anomaly alerting

Send an alert when any single run costs more than 3× the rolling average for that agent role:


const avgCost = await getAvgRunCost(agentRole, 30); // 30-day rolling average

// Skip roles with no cost history: an average of 0 would flag every run
// and divide by zero in the message below.
if (avgCost > 0 && completedRun.costUsd > avgCost * 3) {
  await sendAdminAlert({
    type: "cost_anomaly",
    agentRole,
    runId: completedRun.id,
    tenantId: completedRun.tenantId,
    costUsd: completedRun.costUsd,
    avgCostUsd: avgCost,
    message: `Run cost $${completedRun.costUsd.toFixed(4)} is ${(completedRun.costUsd / avgCost).toFixed(1)}× the 30-day average`,
  });
}
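
The anomaly test itself can be isolated as a pure function, which makes the threshold easy to unit-test. The sketch below assumes the recent run costs for the role are already loaded into an array. `isCostAnomaly` and its `minSamples` guard are hypothetical additions: with only a handful of runs the average is too noisy to alert on.

```typescript
/**
 * Returns true when costUsd exceeds `factor` times the mean of recent run costs.
 * Hypothetical helper: returns false when there are too few samples or the mean is 0.
 */
function isCostAnomaly(
  costUsd: number,
  recentCostsUsd: number[],
  factor = 3,
  minSamples = 5
): boolean {
  if (recentCostsUsd.length < minSamples) return false;
  const mean = recentCostsUsd.reduce((sum, c) => sum + c, 0) / recentCostsUsd.length;
  return mean > 0 && costUsd > mean * factor;
}
```

For the large-document scenario above, a $4 run against a history of $0.40 runs is flagged, while a $0.50 run is not.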

6. Expose circuit breaker status in Execution Queue dashboard

The /dashboards/execution-queue page should show:

Today’s platform spend vs. daily limit (progress bar)
Which queues are currently paused (if circuit breaker tripped)
Per-tenant spend ranking (top 10 spenders today)
Anomalous runs flagged in the last 24 hours
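
The per-tenant ranking could be derived from today's runs as sketched below. `RunCost` is a hypothetical slice of AgentRun (only tenantId and costUsd), and `topSpenders` is an illustrative name. In practice the same result would more likely come from a grouped database query than from in-memory aggregation.

```typescript
// Hypothetical slice of AgentRun containing only the fields the ranking needs.
interface RunCost {
  tenantId: string;
  costUsd: number;
}

/** Sums cost per tenant and returns the highest spenders first. */
function topSpenders(
  runs: RunCost[],
  limit = 10
): Array<{ tenantId: string; totalUsd: number }> {
  const totals = new Map<string, number>();
  for (const run of runs) {
    totals.set(run.tenantId, (totals.get(run.tenantId) ?? 0) + run.costUsd);
  }
  return Array.from(totals, ([tenantId, totalUsd]) => ({ tenantId, totalUsd }))
    .sort((a, b) => b.totalUsd - a.totalUsd)
    .slice(0, limit);
}
```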

Files to Change

packages/db/prisma/schema.prisma — add maxCostUsdPerRun, maxTokensPerRun to AgentConfig
New file: packages/agents/src/lib/cost-circuit-breaker.ts
New file: packages/queue/src/enqueue-guard.ts
packages/agents/src/workers/blog-writer.worker.ts — add running cost tracker to progress callback
packages/agents/src/workers/setup.worker.ts — same
apps/api/src/routers/admin/agents.ts — expose cost cap settings in PUT endpoint
apps/dashboard/src/app/(dashboard)/dashboards/execution-queue/ — circuit breaker status panel

Related Gaps

Gap 7: Priority queue differentiation (the circuit breaker only pauses LOW/BACKGROUND queues, which requires priority to be set)
Gap 8: Context window management (token limits at prompt-build time prevent runaway costs before the run starts)
Gap 10: Dynamic model routing (routing to cheaper models reduces cost before the circuit breaker is needed)