Scalability Design
Target Scale
The system must be capable of:
- Millions of tasks/activities per day across all tenants
- Thousands of concurrent tenants
- Thousands of simultaneous agent executions
- Linear horizontal scaling — add more workers, handle more load, no architectural changes
Why the Current Architecture Scales
Control plane ≠ agent execution
Because the control plane does not run agents — it only dispatches and receives callbacks — the control plane’s load is proportional to orchestration overhead, not LLM compute time. A single control plane cluster can dispatch millions of tasks to external agent runtimes.
Agent runtimes (Claude API, OpenAI API, Ollama workers) scale independently and are not bottlenecked by the control plane.
BullMQ is horizontally scalable by design
Adding more worker processes to any agent queue increases throughput proportionally. Workers are stateless — they read from Redis, execute a task, write results to the DB, and exit.
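As a minimal sketch of what one such stateless worker process looks like (the queue name `agent__copywriter` follows the shared-queue scheme below; `runCopywriterTask` is a hypothetical stand-in for the real adapter dispatch):

```typescript
import { Worker } from 'bullmq';
import IORedis from 'ioredis';

// Hypothetical adapter call; the real worker dispatches to an agent adapter and
// the adapter (or its callback) writes results to the DB.
declare function runCopywriterTask(data: unknown): Promise<unknown>;

// BullMQ requires maxRetriesPerRequest: null on its blocking connections.
const connection = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });

// Stateless: read a job from Redis, execute it, return. Nothing is kept in memory
// between jobs, so adding containers adds throughput linearly.
new Worker(
  'agent__copywriter',
  async (job) => runCopywriterTask(job.data),
  { connection, concurrency: 4 } // 4 concurrent tasks per process, as in the example below
);
```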
10 copywriter worker processes × 4 concurrency each = 40 concurrent copywriter tasks
Scale to 100 workers = 400 concurrent copywriter tasks
No code changes required

Shared queues with per-tenant priority and rate limiting
All tenants share the same agent__{agentRole} queue. A surge from one tenant does not starve others because:
- Priority: jobs carry a numeric priority; higher-priority (urgent) jobs are dequeued before normal and batch jobs, regardless of which tenant enqueued them.
- Per-tenant rate limiter: BullMQ's rate limiter is keyed on `tenantId`, capping how many jobs a single tenant can have active at once:
```typescript
// Tenant-level concurrency cap enforced at the worker level
limiter: { max: tenantConcurrencyLimit, duration: 1000, groupKey: job.data.tenantId }
```

This replaces the old `{tenantId}:agent__{agentRole}` per-tenant queue scheme and matches the existing `rag__ingestion` queue pattern.
Scaling Strategy by Layer
1. API Layer (Fastify)
Bottlenecks: HTTP connection handling, JSON serialisation, DB query throughput.
Scaling approach:
- Horizontal: run N Fastify instances behind a load balancer (Coolify/Traefik handles this)
- SSE connections: each SSE connection is long-lived; use Redis pub/sub to fan out events to all Fastify instances so any instance can serve any client’s SSE stream
```typescript
// SSE uses Redis pub/sub: any Fastify instance can serve any tenant's stream
redisSubscriber.subscribe(`tenant:${tenantId}:events`);
redisSubscriber.on('message', (channel, event) => {
  // Forward to all SSE clients connected to this instance for this tenant
  sseClients.get(tenantId)?.forEach(client => client.write(`data: ${event}\n\n`));
});
```

Target: 1 Fastify instance handles ~5,000 req/s. 10 instances = 50,000 req/s. Scale out as needed.
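The publishing side is symmetric; a minimal sketch, assuming workers emit JSON events on the same per-tenant channel (the `emitTaskEvent` name is illustrative):

```typescript
import IORedis from 'ioredis';

const redisPublisher = new IORedis(process.env.REDIS_URL!);

// Workers publish task lifecycle events to the tenant's channel; every Fastify
// instance subscribed to that channel relays them to its own connected SSE clients.
export async function emitTaskEvent(tenantId: string, event: Record<string, unknown>) {
  await redisPublisher.publish(`tenant:${tenantId}:events`, JSON.stringify(event));
}

// e.g. emitTaskEvent('tenant-123', { type: 'task.completed', taskId: 'task-456' });
```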
2. BullMQ Workers
Bottlenecks: Agent adapter dispatch rate, Redis command rate.
Scaling approach:
- Each worker type is a separate horizontally-scalable process
- Worker instances are stateless Docker containers — Coolify scales them
- Redis command rate: Redis 7 handles ~1M commands/second on a single node; at scale, use Redis Cluster
```bash
docker service scale api-workers=50   # 50 worker containers
```

Target throughput per worker container (4 vCPU):
| Agent type | Concurrency | Tasks/hour/container |
|---|---|---|
| Activity Planner | 2 | ~20 (heavy LLM calls) |
| Copywriter | 4 | ~120 (varies by length) |
| SEO Specialist | 3 | ~60 (API calls bottleneck) |
| Social Media Manager | 4 | ~120 |
| Paid Ads Manager | 2 | ~30 (API rate limits) |
| Data Analyst | 3 | ~60 |
| Content Researcher | 2 | ~20 (Playwright CPU) |
Scale to 1M tasks/day:
- 1M tasks/day = ~41,666 tasks/hour = ~694 tasks/minute
- Mix varies by campaign type; assume 300 copywriter + 200 SEO + 194 others
- Copywriter: 300/hour needs 3 containers (at a conservative ~100 tasks/hour each, vs. the ~120 peak in the table above)
- At peak (10× burst): 30 copywriter containers
- Total: ~50–100 worker containers for 1M tasks/day
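The sizing arithmetic above can be captured in a small helper; a sketch (the inputs are whatever throughput figures you trust, not measurements of this system):

```typescript
// Containers needed for one agent type, given expected load, per-container throughput,
// and a burst multiplier (e.g. 10 for a 10x peak).
function containersNeeded(
  tasksPerHour: number,
  tasksPerHourPerContainer: number,
  burstFactor = 1,
): number {
  return Math.ceil((tasksPerHour * burstFactor) / tasksPerHourPerContainer);
}

containersNeeded(300, 100);     // => 3  (steady-state copywriter containers)
containersNeeded(300, 100, 10); // => 30 (copywriter containers at a 10x burst)
```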
3. Databases
PostgreSQL (relational, transactional data)
Bottlenecks: Write throughput on high-frequency tables (llm_calls, task_runs), read throughput on cost aggregation queries.
Scaling approach:
- Connection pooling: PgBouncer in front of PostgreSQL; workers connect to PgBouncer, not directly to Postgres. Cap at 100 real connections regardless of worker count.
- Read replicas: Cost aggregation and reporting queries hit a read replica; writes go to primary
- Partitioning: `llm_calls` and `task_runs` partitioned by `created_at` (monthly) so old data doesn't slow down current-month queries
- Index strategy: critical indexes already defined; add `BRIN` indexes on time-series tables
```sql
-- Partition llm_calls by month for scale
CREATE TABLE llm_calls (
  ...
) PARTITION BY RANGE (created_at);

CREATE TABLE llm_calls_2026_04 PARTITION OF llm_calls
  FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
```

Target: PostgreSQL 16 on 8 vCPU/32GB RAM handles ~50K writes/second. Sufficient for millions of tasks/day.
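As a sketch of the PgBouncer bullet above, assuming node-postgres (`pg`) and PgBouncer on its default port 6432; the database name and table columns shown are illustrative:

```typescript
import { Pool } from 'pg';

// Workers connect to PgBouncer, never directly to Postgres. Each worker process keeps
// a small pool; PgBouncer multiplexes all of them onto at most 100 real connections.
const pool = new Pool({
  host: process.env.PGBOUNCER_HOST, // PgBouncer, not the Postgres primary
  port: 6432,
  database: 'control_plane',        // illustrative database name
  max: 5,                           // per-process cap keeps total client count low
});

export async function recordLlmCall(taskId: string, totalTokens: number, costUsd: number) {
  // Illustrative column names; the real llm_calls schema is defined elsewhere.
  await pool.query(
    'INSERT INTO llm_calls (task_id, total_tokens, cost_usd) VALUES ($1, $2, $3)',
    [taskId, totalTokens, costUsd],
  );
}
```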
MongoDB (document store, append-heavy)
Bottlenecks: Deliverable content writes, event log append rate.
Scaling approach:
- Per-tenant database provides natural sharding — each tenant’s deliverable writes go to their own database
- TTL indexes on streaming buffers and short-lived event logs auto-cleanup without manual purging
- Replica set: 3-node replica set for durability; read from secondaries for log queries
```typescript
// TTL: streaming output buffer expires after 24h (UI caches what it needs)
deliverableStreamSchema.index({ createdAt: 1 }, { expireAfterSeconds: 86_400 });

// TTL: raw agent output logs expire after 90 days
agentOutputSchema.index({ createdAt: 1 }, { expireAfterSeconds: 7_776_000 });
```

Target: MongoDB 7 on 8 vCPU/32GB RAM handles ~100K writes/second on append-only collections.
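A minimal sketch of the per-tenant database routing mentioned above, assuming Mongoose and a `tenant_{tenantId}` database-naming convention (both the convention and the model shown are illustrative):

```typescript
import mongoose, { Schema } from 'mongoose';

// Illustrative schema; the real deliverable schema lives with the worker code.
const deliverableSchema = new Schema({ taskId: String, content: String }, { timestamps: true });

// One shared connection to the replica set; useDb() returns a lightweight handle
// scoped to the tenant's own database, so each tenant's writes stay in their namespace.
const conn = await mongoose.createConnection(process.env.MONGO_URL!).asPromise();

export function deliverableModel(tenantId: string) {
  return conn.useDb(`tenant_${tenantId}`, { useCache: true }).model('Deliverable', deliverableSchema);
}

// e.g. await deliverableModel('acme').create({ taskId: 'task-456', content: '...' });
```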
Redis
Bottlenecks: Queue command rate at very high task throughput.
Scaling approach:
- At < 100K tasks/day: single Redis instance is sufficient
- At > 100K tasks/day: Redis Cluster (3 primary shards + 3 replicas)
- Queue keys are tenant-scoped so shard routing is predictable
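When the Redis Cluster step is reached, the worker connection swaps out roughly like this (a sketch using ioredis; hostnames are placeholders):

```typescript
import { Cluster } from 'ioredis';

// 3 primaries + 3 replicas; ioredis discovers the full topology from any seed node.
const connection = new Cluster(
  [
    { host: 'redis-1', port: 6379 },
    { host: 'redis-2', port: 6379 },
    { host: 'redis-3', port: 6379 },
  ],
  { redisOptions: { maxRetriesPerRequest: null } }, // BullMQ workers still need this
);
```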
4. LLM API Rate Limits
LLM APIs are the most likely real-world bottleneck, not our infrastructure.
Anthropic Claude:
- Tier 4 (enterprise): ~40,000,000 input tokens/minute
- At ~2,000 tokens/task average: ~20,000 Claude tasks/minute = 28.8M tasks/day
- This exceeds our target → rate limit handling is a queue concern, not a scale concern
OpenAI:
- Tier 5 (enterprise): 30M tokens/minute
- Similar ceiling to Claude
Mitigations:
- Per-tenant token bucket rate limiting prevents one tenant from consuming all quota
- Automatic fallback to secondary provider if primary hits rate limit
- Ollama workers act as an overflow valve for classification/routing tasks
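A sketch of the provider-fallback mitigation, assuming rate-limit errors surface with a 429 status (`callClaude`/`callOpenAI` stand in for the real adapter calls):

```typescript
// Placeholders for the real provider adapter calls.
declare function callClaude(prompt: string): Promise<string>;
declare function callOpenAI(prompt: string): Promise<string>;

async function completeWithFallback(prompt: string): Promise<string> {
  try {
    return await callClaude(prompt);   // primary provider
  } catch (err: any) {
    if (err?.status === 429) {         // provider signalled a rate limit
      return await callOpenAI(prompt); // secondary provider
    }
    throw err;                         // anything else is a real failure
  }
}
```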
The per-tenant token bucket from the first mitigation bullet:

```typescript
import Bottleneck from 'bottleneck';

// Per-tenant, per-provider rate limiter
const rateLimiter = new Bottleneck({
  reservoir: tenant.plan === 'enterprise' ? 500_000 : 100_000, // tokens/min
  reservoirRefreshAmount: tenant.plan === 'enterprise' ? 500_000 : 100_000,
  reservoirRefreshInterval: 60_000,
});
```

Tenant-Aware Agent Allocation
Not all tenants get all agents. Agent availability is based on plan and tenant-specific configuration. A small client might only have Copywriter + SEO Specialist enabled; an enterprise client gets all 7 agents plus custom agents.
Tenant Agent Plan Matrix
| Plan | Available Agents | Max Concurrent per Agent |
|---|---|---|
| Free | Copywriter, SEO Specialist | 1 each |
| Pro | Copywriter, SEO, Social, Analyst | 2 each |
| Agency | All 7 standard agents | 4 each |
| Enterprise | All 7 + custom agents | Configurable |
Configuring Agents per Tenant
Super admins configure agent availability per tenant in the Manage app (/tenants/[id] → Config tab):
```typescript
interface TenantAgentAllocation {
  tenantId: string;
  agentRole: AgentRole;
  isEnabled: boolean;
  maxConcurrent: number;         // max simultaneous tasks for this agent for this tenant
  monthlyTaskLimit?: number;     // optional: cap at N tasks/month for cost control
  modelOverride?: string;        // tenant-specific model (e.g. force GPT-4o-mini for budget clients)
  adapterOverride?: AdapterType; // tenant-specific adapter (e.g. webhook for enterprise)
}
```

This is the basis for the per-tenant agent selection screen in both the Manage app and the tenant Dashboard.
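A sketch of how this allocation could be enforced at dispatch time (`getAllocation` and `countActiveTasks` are hypothetical lookups):

```typescript
// Hypothetical lookups against the allocation config and the task_runs table.
declare function getAllocation(tenantId: string, agentRole: AgentRole): Promise<TenantAgentAllocation | null>;
declare function countActiveTasks(tenantId: string, agentRole: AgentRole): Promise<number>;

async function canDispatch(tenantId: string, agentRole: AgentRole): Promise<boolean> {
  const allocation = await getAllocation(tenantId, agentRole);
  if (!allocation?.isEnabled) return false;     // agent not enabled for this tenant/plan
  const active = await countActiveTasks(tenantId, agentRole);
  return active < allocation.maxConcurrent;     // respect the per-tenant concurrency cap
}
```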
Priority Queuing at Scale
At high throughput, queue depth can grow large. Priority ensures high-value and urgent work is processed first:
Priority levels:
- 1 = Urgent (user explicitly flagged as urgent)
- 5 = Normal
- 10 = Batch / background

Activity Planner decomposition tasks always get priority 1 (fast feedback to the tenant that their campaign is “started”).
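For reference, a sketch of attaching a priority when enqueueing a decomposition task (queue name and payload shape are illustrative):

```typescript
import { Queue } from 'bullmq';
import IORedis from 'ioredis';

const connection = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });
const plannerQueue = new Queue('agent__activity_planner', { connection });

// BullMQ treats lower numbers as higher priority, so decomposition jobs jump the line.
await plannerQueue.add(
  'decompose-campaign',
  { tenantId: 'acme', campaignId: 'camp_123' },
  { priority: 1 },
);
```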
Observability at Scale
At millions of tasks/day, debugging requires structured observability:
Metrics (Prometheus + Grafana)
- Queue depth per tenant per agent type
- Task completion rate and error rate; task duration P50/P95/P99
- LLM token throughput per provider
- DB query latency
- Redis memory usage
Distributed tracing (OpenTelemetry)
- Every task execution is a trace: enqueue → dispatch → agent callback → validate → store
- Trace ID flows from the API request through BullMQ job → adapter → callback
- Allows “why is this campaign slow?” investigations at scale
Structured logging (pino)
- All log lines include: `tenantId`, `campaignId`, `taskId`, `agentRole`, `traceId`
- Shipped to Grafana Loki or Datadog
- Queryable: “show all errors for tenant Acme in the last 1 hour”
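A sketch of binding those fields once per task with pino child loggers (the values shown are illustrative):

```typescript
import pino from 'pino';

const baseLogger = pino();

// Bind the tenant/task context once; every subsequent line carries these fields, so
// Loki/Datadog queries like "all errors for tenant Acme in the last hour" just work.
const taskLogger = baseLogger.child({
  tenantId: 'acme',
  campaignId: 'camp_123',
  taskId: 'task_456',
  agentRole: 'copywriter',
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
});

taskLogger.info('task dispatched');
taskLogger.error({ err: new Error('LLM call timed out') }, 'task failed');
```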
Capacity Planning Summary
| Scale | API instances | Worker containers | PostgreSQL | MongoDB | Redis |
|---|---|---|---|---|---|
| < 10K tasks/day | 1 | 5 | 1 instance | 1 instance | 1 instance |
| 100K tasks/day | 2 | 20 | 1 + 1 read replica | replica set | 1 instance |
| 1M tasks/day | 5 | 100 | 1 + 2 read replicas + PgBouncer | replica set | Redis Cluster |
| 10M tasks/day | 20+ | 500+ | Citus (sharded) | Atlas (sharded) | Redis Cluster |