Gap 13: No Cost Circuit Breaker
Problem
AgentRun.costUsd is tracked per run but there is no alerting, rate limiting, or automatic circuit-breaking if costs spike. A runaway agent — one stuck in a tool-call loop, processing an unexpectedly large document, or retrying repeatedly — would exhaust a tenant’s credit budget or the platform’s API quota before any human notices.
The current BullMQ retry configuration (4 attempts, exponential backoff from 5s) means a failing job can run four full attempts without any cost or quality check between them.
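To make the exposure concrete, the retry settings above can be turned into a worst-case estimate. The sketch below is illustrative only: the helper names are hypothetical, the $0.20 per-attempt cost is an example figure, and it assumes the backoff delay doubles on each retry starting from the 5s base.

```typescript
// Hypothetical helpers; the numbers mirror the retry settings described above.
interface RetryPolicy {
  attempts: number;    // total attempts, including the first run
  baseDelayMs: number; // delay before the first retry
}

// Delay before each retry, ASSUMING the delay doubles on every retry
// (baseDelay * 2^retryIndex). This is an assumption about the backoff strategy.
function backoffDelaysMs(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.attempts - 1; retry++) {
    delays.push(policy.baseDelayMs * Math.pow(2, retry));
  }
  return delays;
}

// Worst case: every attempt runs to the end and is billed in full,
// because nothing checks cost between attempts.
function worstCaseCostUsd(policy: RetryPolicy, costPerAttemptUsd: number): number {
  return policy.attempts * costPerAttemptUsd;
}

const policy: RetryPolicy = { attempts: 4, baseDelayMs: 5000 };
const delays = backoffDelaysMs(policy);
const worstCase = worstCaseCostUsd(policy, 0.2); // example per-attempt cost
console.log(delays, worstCase.toFixed(2));
```

Under these assumptions the three retries wait 5s, 10s, and 20s, so a failing job can spend four times its normal cost in well under a minute of queue delay.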
Concrete failure scenarios
- Tool call loop: An agent calls search_knowledge.js in a loop (see Gap 3: hallucination). Each loop iteration uses input tokens. By the time the 900-second timeout kills the job, it has consumed 3× the expected token count.
- Large document processing: A tenant uploads a 200-page PDF to their knowledge base. The context-file-writer ingests the entire thing into its prompt (no token budget; see Gap 8). One run costs $4 instead of the expected $0.40.
- Retry storm: A transient error causes the blog-writer to fail at the last step. BullMQ runs the job up to 4 times, and each attempt rebuilds the full prompt and calls the adapter. 4 failed runs at $0.20 each = $0.80 wasted.
- Tenant credit exhaustion: A tenant with 100 credits remaining has 8 background jobs enqueued. All 8 start simultaneously. Credits run out mid-batch; some jobs complete, some fail, some are partially completed.
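The credit exhaustion scenario is a check-then-act race: each job sees the same balance before any of them has spent anything. The sketch below illustrates it with an in-memory ledger. The CreditLedger class and the 20-credit job cost are hypothetical, and a real implementation would need the reservation to be atomic in the database.

```typescript
// In-memory stand-in for a tenant credit ledger (hypothetical).
class CreditLedger {
  private balance: number;
  private reserved = 0;

  constructor(balance: number) {
    this.balance = balance;
  }

  // Naive check: looks only at the balance and reserves nothing.
  canAfford(cost: number): boolean {
    return this.balance >= cost;
  }

  // Check and reserve in one step: succeeds only if the credits that are
  // not already reserved cover the cost.
  tryReserve(cost: number): boolean {
    if (this.balance - this.reserved < cost) return false;
    this.reserved += cost;
    return true;
  }
}

// 8 jobs at an assumed 20 credits each against a balance of 100 credits.
const jobCosts: number[] = new Array<number>(8).fill(20);

const naive = new CreditLedger(100);
const naiveAdmitted = jobCosts.filter((cost) => naive.canAfford(cost)).length;

const guarded = new CreditLedger(100);
const guardedAdmitted = jobCosts.filter((cost) => guarded.tryReserve(cost)).length;

console.log({ naiveAdmitted, guardedAdmitted });
```

With the naive check all 8 jobs are admitted even though only 5 can be paid for; with reservations the last 3 are rejected up front instead of failing mid-batch.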
What to Build
1. Per-run cost cap in AgentConfig
Add a maximum cost threshold per agent role:
model AgentConfig {
  // existing fields
  maxCostUsdPerRun Float? // null = no limit; e.g. 2.0 for expensive agents
  maxTokensPerRun  Int?   // hard token limit at adapter level
}

Default caps by role (suggested):
| Agent role | maxCostUsdPerRun | maxTokensPerRun |
|---|---|---|
| blog-writer | $1.00 | 50,000 |
| strategy-writer | $3.00 | 100,000 |
| context-file-writer | $2.00 | 80,000 |
| social-post-writer | $0.20 | 10,000 |
| insight workers | $0.50 | 20,000 |
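One possible way to carry these defaults in code is a lookup keyed by role, with per-tenant or per-agent overrides layered on top. This is a sketch, not the project's actual configuration layer: resolveCaps, DEFAULT_CAPS, and the "insight-worker" key (standing in for the insight workers group) are hypothetical names.

```typescript
interface CostCaps {
  maxCostUsdPerRun: number | null; // null = no limit
  maxTokensPerRun: number | null;  // null = no limit
}

// Suggested defaults from the table above.
const DEFAULT_CAPS: Record<string, CostCaps> = {
  "blog-writer": { maxCostUsdPerRun: 1.0, maxTokensPerRun: 50000 },
  "strategy-writer": { maxCostUsdPerRun: 3.0, maxTokensPerRun: 100000 },
  "context-file-writer": { maxCostUsdPerRun: 2.0, maxTokensPerRun: 80000 },
  "social-post-writer": { maxCostUsdPerRun: 0.2, maxTokensPerRun: 10000 },
  "insight-worker": { maxCostUsdPerRun: 0.5, maxTokensPerRun: 20000 },
};

const NO_LIMIT: CostCaps = { maxCostUsdPerRun: null, maxTokensPerRun: null };

// An override of `undefined` means "use the default"; an explicit `null`
// means "no limit", matching the nullable schema fields.
function resolveCaps(role: string, override: Partial<CostCaps> = {}): CostCaps {
  const defaults = DEFAULT_CAPS[role] ?? NO_LIMIT;
  return {
    maxCostUsdPerRun:
      override.maxCostUsdPerRun !== undefined
        ? override.maxCostUsdPerRun
        : defaults.maxCostUsdPerRun,
    maxTokensPerRun:
      override.maxTokensPerRun !== undefined
        ? override.maxTokensPerRun
        : defaults.maxTokensPerRun,
  };
}

const blogCaps = resolveCaps("blog-writer");
const unknownCaps = resolveCaps("some-new-role");
const uncapped = resolveCaps("blog-writer", { maxCostUsdPerRun: null });
console.log(blogCaps, unknownCaps, uncapped);
```

Falling back to "no limit" for unknown roles keeps the existing behaviour; a stricter design could fall back to the lowest cap instead.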
2. Running cost estimator in adapter progress callback
The adapter already has a progress callback that fires on tool_use events and increments toolsUsed. Extend it to track a running cost estimate:
// In setup.worker.ts / blog-writer.worker.ts
// abortController is assumed to be the AbortController whose signal is passed to the adapter call.
let runningCostEstimate = 0;
const MAX_COST = agentConfig.maxCostUsdPerRun ?? Infinity; // null = no limit

const progressCallback = (event: AdapterProgressEvent) => {
  if (event.type === "usage") {
    // Accumulate the estimated cost of this usage event
    runningCostEstimate += estimateCost(
      agentConfig.model,
      event.inputTokens ?? 0,
      event.outputTokens ?? 0
    );
    if (runningCostEstimate > MAX_COST) {
      // Signal the adapter to abort
      abortController.abort(`Cost limit exceeded: $${runningCostEstimate.toFixed(4)} > $${MAX_COST}`);
    }
  }
};

When the abort fires, the job fails with a structured cost_limit_exceeded error (see the structured errors gap).
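The callback above relies on an estimateCost helper that is not shown. A minimal sketch of one follows, together with a small guard that aborts once the cap is crossed. The model names and per-million-token prices are placeholders rather than real provider pricing, and createCostGuard is a hypothetical name.

```typescript
// Placeholder prices in USD per million tokens; real values depend on the
// provider and model and should come from configuration.
interface ModelPrice {
  inputUsdPerMTok: number;
  outputUsdPerMTok: number;
}

const PRICES: Record<string, ModelPrice> = {
  "example-large": { inputUsdPerMTok: 3, outputUsdPerMTok: 15 },
  "example-small": { inputUsdPerMTok: 0.25, outputUsdPerMTok: 1.25 },
};

function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  if (!price) throw new Error(`No price configured for model "${model}"`);
  return (
    (inputTokens * price.inputUsdPerMTok + outputTokens * price.outputUsdPerMTok) / 1_000_000
  );
}

// Returns a usage handler that accumulates estimated cost and aborts through
// the given controller the first time the cap is exceeded.
function createCostGuard(model: string, maxCostUsd: number, controller: AbortController) {
  let runningCostUsd = 0;
  return (inputTokens: number, outputTokens: number): number => {
    runningCostUsd += estimateCost(model, inputTokens, outputTokens);
    if (runningCostUsd > maxCostUsd && !controller.signal.aborted) {
      controller.abort(
        new Error(`cost_limit_exceeded: $${runningCostUsd.toFixed(4)} > $${maxCostUsd}`)
      );
    }
    return runningCostUsd;
  };
}

const controller = new AbortController();
const onUsage = createCostGuard("example-large", 0.05, controller);

const afterFirst = onUsage(10000, 500);   // still under the $0.05 cap
const abortedAfterFirst = controller.signal.aborted;
const afterSecond = onUsage(10000, 500);  // crosses the cap and aborts
const abortedAfterSecond = controller.signal.aborted;
console.log({ afterFirst, abortedAfterFirst, afterSecond, abortedAfterSecond });
```

Because usage is only known after the provider reports it, the estimate lags the real spend by one event; the cap bounds the overshoot rather than eliminating it.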
3. Tenant-level credit pre-check before enqueue
Before adding any job to the queue, check that the tenant has sufficient credits to cover the estimated cost:
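// packages/queue/src/enqueue-guard.ts
import { Queue } from "bullmq";

const SAFETY_MARGIN = 1.5; // require 50% headroom over the estimate

export async function enqueueWithCreditCheck(
  queue: Queue,
  jobName: string,
  jobData: unknown,
  opts: { estimatedCostUsd: number; tenantId: string; priority?: number }
) {
  // Note: this is a check only. It does not reserve credits, so jobs enqueued
  // concurrently can each pass it against the same balance.
  const balance = await getCreditBalance(opts.tenantId);
  const required = opts.estimatedCostUsd * SAFETY_MARGIN;
  if (balance < required) {
    throw new InsufficientCreditsError(
      `Tenant ${opts.tenantId} has $${balance} in credits but job requires ~$${required.toFixed(2)} (estimate $${opts.estimatedCostUsd} plus 50% margin)`
    );
  }
  return queue.add(jobName, jobData, { priority: opts.priority });
}

4. Platform-level daily cost circuit breaker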
A global circuit breaker that trips if the platform’s total LLM spend exceeds a daily threshold:
// packages/agents/src/lib/cost-circuit-breaker.ts
import { startOfDay } from "date-fns"; // assumed source of startOfDay

const DAILY_PLATFORM_LIMIT_USD = 500; // configurable via PlatformSetting

export async function checkPlatformCircuitBreaker(): Promise<void> {
  // Sum the cost of all runs that started today and have finished
  const todaySpend = await db.agentRun.aggregate({
    _sum: { costUsd: true },
    where: {
      startedAt: { gte: startOfDay(new Date()) },
      status: { in: ["completed", "failed"] },
    },
  });
  const spend = todaySpend._sum.costUsd ?? 0;
  if (spend > DAILY_PLATFORM_LIMIT_USD) {
    // Trip the breaker: pause all LOW and BACKGROUND priority queues
    await pauseLowPriorityQueues();
    await sendAdminAlert({
      type: "cost_circuit_breaker_tripped",
      message: `Platform daily spend $${spend.toFixed(2)} exceeded limit $${DAILY_PLATFORM_LIMIT_USD}`,
    });
    throw new CircuitBreakerError(`Platform daily cost limit reached: $${spend.toFixed(2)}`);
  }
}

Only pause LOW/BACKGROUND queues. CRITICAL and HIGH priority jobs (user-facing, rejection re-runs) are allowed to continue.
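The priority rule can be isolated as a pure decision function, which makes it easy to test without a queue backend. The sketch below assumes each queue carries one of the four priority labels named above; the queue names and the queuesToPause and mayRun helpers are hypothetical.

```typescript
type Priority = "CRITICAL" | "HIGH" | "LOW" | "BACKGROUND";

interface QueueInfo {
  name: string;
  priority: Priority;
}

// Only these priorities are paused when the breaker trips.
const PAUSABLE: ReadonlySet<Priority> = new Set<Priority>(["LOW", "BACKGROUND"]);

// Returns the names of the queues to pause for the given spend and limit.
function queuesToPause(spendUsd: number, limitUsd: number, queues: QueueInfo[]): string[] {
  if (spendUsd <= limitUsd) return [];
  return queues.filter((q) => PAUSABLE.has(q.priority)).map((q) => q.name);
}

// Whether a job of the given priority may still run while the breaker is tripped.
function mayRun(priority: Priority, breakerTripped: boolean): boolean {
  return !breakerTripped || !PAUSABLE.has(priority);
}

// Hypothetical queue names for illustration.
const queues: QueueInfo[] = [
  { name: "rejection-rerun", priority: "CRITICAL" },
  { name: "blog-writer", priority: "HIGH" },
  { name: "insights", priority: "LOW" },
  { name: "context-refresh", priority: "BACKGROUND" },
];

const pausedUnderLimit = queuesToPause(120, 500, queues);
const pausedOverLimit = queuesToPause(512.4, 500, queues);
console.log(pausedUnderLimit, pausedOverLimit, mayRun("HIGH", true), mayRun("LOW", true));
```

Keeping the decision separate from the side effects (pausing, alerting) also lets the dashboard reuse the same rule to show which queues would be affected.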
5. Cost anomaly alerting
Send an alert when any single run costs more than 3× the rolling average for that agent role:
const avgCost = await getAvgRunCost(agentRole, 30); // 30-day rolling average
// Skip roles with no history: an average of 0 would flag every run and divide by zero below.
if (avgCost > 0 && completedRun.costUsd > avgCost * 3) {
  await sendAdminAlert({
    type: "cost_anomaly",
    agentRole,
    runId: completedRun.id,
    tenantId: completedRun.tenantId,
    costUsd: completedRun.costUsd,
    avgCostUsd: avgCost,
    message: `Run cost $${completedRun.costUsd.toFixed(4)} is ${(completedRun.costUsd / avgCost).toFixed(1)}× the 30-day average`,
  });
}

6. Expose circuit breaker status in Execution Queue dashboard
The /dashboards/execution-queue page should show:
- Today’s platform spend vs. daily limit (progress bar)
- Which queues are currently paused (if circuit breaker tripped)
- Per-tenant spend ranking (top 10 spenders today)
- Anomalous runs flagged in the last 24 hours
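As a rough illustration of the data behind such a panel, the sketch below derives the spend figure, the top spenders, and the anomalous runs from a list of today's runs. The RunSummary and DashboardStatus shapes and the buildStatus function are hypothetical, and the list of paused queues is left out because it would come from the queue backend.

```typescript
interface RunSummary {
  id: string;
  tenantId: string;
  agentRole: string;
  costUsd: number;
}

interface DashboardStatus {
  spendUsd: number;
  limitUsd: number;
  percentOfLimit: number; // capped at 100 for the progress bar
  topTenants: { tenantId: string; spendUsd: number }[];
  anomalousRunIds: string[];
}

function buildStatus(
  runs: RunSummary[],
  limitUsd: number,
  avgCostByRole: Record<string, number>,
  topN = 10
): DashboardStatus {
  const byTenant = new Map<string, number>();
  let spendUsd = 0;
  for (const run of runs) {
    spendUsd += run.costUsd;
    byTenant.set(run.tenantId, (byTenant.get(run.tenantId) ?? 0) + run.costUsd);
  }

  // Rank tenants by today's spend, highest first.
  const topTenants = Array.from(byTenant, ([tenantId, spend]) => ({ tenantId, spendUsd: spend }))
    .sort((a, b) => b.spendUsd - a.spendUsd)
    .slice(0, topN);

  // Same 3x rule as the anomaly alert; roles without a positive average are skipped.
  const anomalousRunIds = runs
    .filter((run) => {
      const avg = avgCostByRole[run.agentRole];
      return avg !== undefined && avg > 0 && run.costUsd > avg * 3;
    })
    .map((run) => run.id);

  const percentOfLimit = Math.min(100, (spendUsd / limitUsd) * 100);
  return { spendUsd, limitUsd, percentOfLimit, topTenants, anomalousRunIds };
}

const status = buildStatus(
  [
    { id: "r1", tenantId: "t1", agentRole: "blog-writer", costUsd: 0.5 },
    { id: "r2", tenantId: "t2", agentRole: "blog-writer", costUsd: 4 },
    { id: "r3", tenantId: "t1", agentRole: "social-post-writer", costUsd: 0.25 },
  ],
  500,
  { "blog-writer": 0.5, "social-post-writer": 0.25 }
);
console.log(status);
```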
Files to Change
- packages/db/prisma/schema.prisma: add maxCostUsdPerRun, maxTokensPerRun to AgentConfig
- New file: packages/agents/src/lib/cost-circuit-breaker.ts
- New file: packages/queue/src/enqueue-guard.ts
- packages/agents/src/workers/blog-writer.worker.ts: add running cost tracker to progress callback
- packages/agents/src/workers/setup.worker.ts: same
- apps/api/src/routers/admin/agents.ts: expose cost cap settings in PUT endpoint
- apps/dashboard/src/app/(dashboard)/dashboards/execution-queue/: circuit breaker status panel
Related
- Gap 7: Priority queue differentiation (circuit breaker only pauses LOW/BACKGROUND — requires priority to be set)
- Gap 8: Context window management (token limits at prompt-build time prevent runaway costs before the run starts)
- Gap 10: Dynamic model routing (routing to cheaper models reduces cost before circuit breaker is needed)