Scalability Design
Target Scale
The system must be capable of:
- Millions of tasks/activities per day across all tenants
- Thousands of concurrent tenants
- Thousands of simultaneous agent executions
- Linear horizontal scaling — add more workers, handle more load, no architectural changes
Why the Current Architecture Scales
Control plane ≠ agent execution
Because the control plane does not run agents — it only dispatches and receives callbacks — the control plane’s load is proportional to orchestration overhead, not LLM compute time. A single control plane cluster can dispatch millions of tasks to external agent runtimes.
Agent runtimes (Claude API, OpenAI API, Ollama workers) scale independently and are not bottlenecked by the control plane.
BullMQ is horizontally scalable by design
Adding more worker processes to any agent queue increases throughput proportionally. Workers are stateless — they read from Redis, execute a task, write results to the DB, and exit.
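As a minimal sketch of what one such stateless worker process looks like (the queue name `agent__copywriter` follows the shared-queue scheme below; `runCopywriterTask` is a hypothetical stand-in for the real adapter dispatch):

```typescript
import { Worker } from 'bullmq';
import IORedis from 'ioredis';

// Hypothetical adapter call; the real worker dispatches to an agent adapter and
// the adapter (or its callback) writes results to the DB.
declare function runCopywriterTask(data: unknown): Promise<unknown>;

// BullMQ requires maxRetriesPerRequest: null on its blocking connections.
const connection = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });

// Stateless: read a job from Redis, execute it, return. Nothing is kept in memory
// between jobs, so adding containers adds throughput linearly.
new Worker(
  'agent__copywriter',
  async (job) => runCopywriterTask(job.data),
  { connection, concurrency: 4 } // 4 concurrent tasks per process, as in the example below
);
```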
10 copywriter worker processes × 4 concurrency each = 40 concurrent copywriter tasks
Scale to 100 workers = 400 concurrent copywriter tasks
No code changes required

Shared queues with per-tenant priority and rate limiting
All tenants share the same agent__{agentRole} queue. A surge from one tenant does not starve others because:
- Priority: jobs carry a numeric priority; higher-priority (urgent) jobs are dequeued before normal and batch jobs, regardless of which tenant enqueued them.
- Per-tenant rate limiter: BullMQ's rate limiter is keyed on `tenantId`, capping how many jobs a single tenant can have active at once:
```typescript
// Tenant-level concurrency cap enforced at the worker level
limiter: { max: tenantConcurrencyLimit, duration: 1000, groupKey: job.data.tenantId }
```

This replaces the old `{tenantId}:agent__{agentRole}` per-tenant queue scheme and matches the existing `rag__ingestion` queue pattern.
Scaling Strategy by Layer
1. API Layer (Fastify)
Bottlenecks: HTTP connection handling, JSON serialisation, DB query throughput.
Scaling approach:
- Horizontal: run N Fastify instances behind a load balancer (Coolify/Traefik handles this)
- SSE connections: each SSE connection is long-lived; use Redis pub/sub to fan out events to all Fastify instances so any instance can serve any client’s SSE stream
```typescript
// SSE uses Redis pub/sub: any Fastify instance can serve any tenant's stream
redisSubscriber.subscribe(`tenant:${tenantId}:events`);
redisSubscriber.on('message', (channel, event) => {
  // Forward to all SSE clients connected to this instance for this tenant
  sseClients.get(tenantId)?.forEach(client => client.write(`data: ${event}\n\n`));
});
```

Target: 1 Fastify instance handles ~5,000 req/s. 10 instances = 50,000 req/s. Scale out as needed.
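The publishing side is symmetric; a minimal sketch, assuming workers emit JSON events on the same per-tenant channel (the `emitTaskEvent` name is illustrative):

```typescript
import IORedis from 'ioredis';

const redisPublisher = new IORedis(process.env.REDIS_URL!);

// Workers publish task lifecycle events to the tenant's channel; every Fastify
// instance subscribed to that channel relays them to its own connected SSE clients.
export async function emitTaskEvent(tenantId: string, event: Record<string, unknown>) {
  await redisPublisher.publish(`tenant:${tenantId}:events`, JSON.stringify(event));
}

// e.g. emitTaskEvent('tenant-123', { type: 'task.completed', taskId: 'task-456' });
```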
2. BullMQ Workers
Bottlenecks: Agent adapter dispatch rate, Redis command rate.
Scaling approach:
- Each worker type is a separate horizontally-scalable process
- Worker instances are stateless Docker containers — Coolify scales them
- Redis command rate: Redis 7 handles ~1M commands/second on a single node; at scale, use Redis Cluster
```bash
docker service scale api-workers=50   # 50 worker containers
```

Target throughput per worker container (4 vCPU):
| Agent type | Concurrency | Tasks/hour/container |
|---|---|---|
| Activity Planner | 2 | ~20 (heavy LLM calls) |
| Copywriter | 4 | ~120 (varies by length) |
| SEO Specialist | 3 | ~60 (API calls bottleneck) |
| Social Media Manager | 4 | ~120 |
| Paid Ads Manager | 2 | ~30 (API rate limits) |
| Data Analyst | 3 | ~60 |
| Content Researcher | 2 | ~20 (Playwright CPU) |
Scale to 1M tasks/day:
- 1M tasks/day = ~41,666 tasks/hour = ~694 tasks/minute
- Mix varies by campaign type; assume 300 copywriter + 200 SEO + 194 others
- Copywriter: 300/hour needs 3 containers (at a conservative ~100 tasks/hour each, vs. the ~120 peak in the table above)
- At peak (10× burst): 30 copywriter containers
- Total: ~50–100 worker containers for 1M tasks/day
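The sizing arithmetic above can be captured in a small helper; a sketch (the inputs are whatever throughput figures you trust, not measurements of this system):

```typescript
// Containers needed for one agent type, given expected load, per-container throughput,
// and a burst multiplier (e.g. 10 for a 10x peak).
function containersNeeded(
  tasksPerHour: number,
  tasksPerHourPerContainer: number,
  burstFactor = 1,
): number {
  return Math.ceil((tasksPerHour * burstFactor) / tasksPerHourPerContainer);
}

containersNeeded(300, 100);     // => 3  (steady-state copywriter containers)
containersNeeded(300, 100, 10); // => 30 (copywriter containers at a 10x burst)
```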
3. Databases
PostgreSQL (relational, transactional data)
Bottlenecks: Write throughput on high-frequency tables (llm_calls, task_runs), read throughput on cost aggregation queries.
Scaling approach:
- Connection pooling: PgBouncer in front of PostgreSQL; workers connect to PgBouncer, not directly to Postgres. Cap at 100 real connections regardless of worker count.
- Read replicas: Cost aggregation and reporting queries hit a read replica; writes go to primary
- Partitioning: `llm_calls` and `task_runs` partitioned by `created_at` (monthly) so old data doesn't slow down current-month queries
- Index strategy: critical indexes already defined; add `BRIN` indexes on time-series tables
```sql
-- Partition llm_calls by month for scale
CREATE TABLE llm_calls (
  ...
) PARTITION BY RANGE (created_at);

CREATE TABLE llm_calls_2026_04 PARTITION OF llm_calls
  FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
```

Target: PostgreSQL 16 on 8 vCPU/32GB RAM handles ~50K writes/second. Sufficient for millions of tasks/day.
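As a sketch of the PgBouncer bullet above, assuming node-postgres (`pg`) and PgBouncer on its default port 6432; the database name and table columns shown are illustrative:

```typescript
import { Pool } from 'pg';

// Workers connect to PgBouncer, never directly to Postgres. Each worker process keeps
// a small pool; PgBouncer multiplexes all of them onto at most 100 real connections.
const pool = new Pool({
  host: process.env.PGBOUNCER_HOST, // PgBouncer, not the Postgres primary
  port: 6432,
  database: 'control_plane',        // illustrative database name
  max: 5,                           // per-process cap keeps total client count low
});

export async function recordLlmCall(taskId: string, totalTokens: number, costUsd: number) {
  // Illustrative column names; the real llm_calls schema is defined elsewhere.
  await pool.query(
    'INSERT INTO llm_calls (task_id, total_tokens, cost_usd) VALUES ($1, $2, $3)',
    [taskId, totalTokens, costUsd],
  );
}
```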
MongoDB (document store, append-heavy)
Bottlenecks: Deliverable content writes, event log append rate.
Scaling approach:
- Per-tenant database provides natural sharding — each tenant’s deliverable writes go to their own database
- TTL indexes on streaming buffers and short-lived event logs auto-cleanup without manual purging
- Replica set: 3-node replica set for durability; read from secondaries for log queries
```typescript
// TTL: streaming output buffer expires after 24h (UI caches what it needs)
deliverableStreamSchema.index({ createdAt: 1 }, { expireAfterSeconds: 86_400 });

// TTL: raw agent output logs expire after 90 days
agentOutputSchema.index({ createdAt: 1 }, { expireAfterSeconds: 7_776_000 });
```

Target: MongoDB 7 on 8 vCPU/32GB RAM handles ~100K writes/second on append-only collections.
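A minimal sketch of the per-tenant database routing mentioned above, assuming Mongoose and a `tenant_{tenantId}` database-naming convention (both the convention and the model shown are illustrative):

```typescript
import mongoose, { Schema } from 'mongoose';

// Illustrative schema; the real deliverable schema lives with the worker code.
const deliverableSchema = new Schema({ taskId: String, content: String }, { timestamps: true });

// One shared connection to the replica set; useDb() returns a lightweight handle
// scoped to the tenant's own database, so each tenant's writes stay in their namespace.
const conn = await mongoose.createConnection(process.env.MONGO_URL!).asPromise();

export function deliverableModel(tenantId: string) {
  return conn.useDb(`tenant_${tenantId}`, { useCache: true }).model('Deliverable', deliverableSchema);
}

// e.g. await deliverableModel('acme').create({ taskId: 'task-456', content: '...' });
```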
Redis
Bottlenecks: Queue command rate at very high task throughput.
Scaling approach:
- At < 100K tasks/day: single Redis instance is sufficient
- At > 100K tasks/day: Redis Cluster (3 primary shards + 3 replicas)
- Queue keys are tenant-scoped so shard routing is predictable
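When the Redis Cluster step is reached, the worker connection swaps out roughly like this (a sketch using ioredis; hostnames are placeholders):

```typescript
import { Cluster } from 'ioredis';

// 3 primaries + 3 replicas; ioredis discovers the full topology from any seed node.
const connection = new Cluster(
  [
    { host: 'redis-1', port: 6379 },
    { host: 'redis-2', port: 6379 },
    { host: 'redis-3', port: 6379 },
  ],
  { redisOptions: { maxRetriesPerRequest: null } }, // BullMQ workers still need this
);
```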
4. LLM API Rate Limits
LLM APIs are the most likely real-world bottleneck, not our infrastructure.
Anthropic Claude:
- Tier 4 (enterprise): ~40,000,000 input tokens/minute
- At ~2,000 tokens/task average: ~20,000 Claude tasks/minute = 28.8M tasks/day
- This exceeds our target → rate limit handling is a queue concern, not a scale concern
OpenAI:
- Tier 5 (enterprise): 30M tokens/minute
- Similar ceiling to Claude
Mitigations:
- Per-tenant token bucket rate limiting prevents one tenant from consuming all quota
- Automatic fallback to secondary provider if primary hits rate limit
- Ollama workers act as an overflow valve for classification/routing tasks
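A sketch of the provider-fallback mitigation, assuming rate-limit errors surface with a 429 status (`callClaude`/`callOpenAI` stand in for the real adapter calls):

```typescript
// Placeholders for the real provider adapter calls.
declare function callClaude(prompt: string): Promise<string>;
declare function callOpenAI(prompt: string): Promise<string>;

async function completeWithFallback(prompt: string): Promise<string> {
  try {
    return await callClaude(prompt);   // primary provider
  } catch (err: any) {
    if (err?.status === 429) {         // provider signalled a rate limit
      return await callOpenAI(prompt); // secondary provider
    }
    throw err;                         // anything else is a real failure
  }
}
```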
The per-tenant token bucket from the first mitigation bullet:

```typescript
import Bottleneck from 'bottleneck';

// Per-tenant, per-provider rate limiter
const rateLimiter = new Bottleneck({
  reservoir: tenant.plan === 'enterprise' ? 500_000 : 100_000, // tokens/min
  reservoirRefreshAmount: tenant.plan === 'enterprise' ? 500_000 : 100_000,
  reservoirRefreshInterval: 60_000,
});
```

Tenant-Aware Agent Allocation
Not all tenants get all agents. Agent availability is based on plan and tenant-specific configuration. A small client might only have Copywriter + SEO Specialist enabled; an enterprise client gets all 7 agents plus custom agents.
Tenant Agent Plan Matrix
| Plan | Available Agents | Max Concurrent per Agent |
|---|---|---|
| Free | Copywriter, SEO Specialist | 1 each |
| Pro | Copywriter, SEO, Social, Analyst | 2 each |
| Agency | All 7 standard agents | 4 each |
| Enterprise | All 7 + custom agents | Configurable |
Configuring Agents per Tenant
Super admins configure agent availability per tenant in the Manage app (/tenants/[id] → Config tab):
```typescript
interface TenantAgentAllocation {
  tenantId: string;
  agentRole: AgentRole;
  isEnabled: boolean;
  maxConcurrent: number;         // max simultaneous tasks for this agent for this tenant
  monthlyTaskLimit?: number;     // optional: cap at N tasks/month for cost control
  modelOverride?: string;        // tenant-specific model (e.g. force GPT-4o-mini for budget clients)
  adapterOverride?: AdapterType; // tenant-specific adapter (e.g. webhook for enterprise)
}
```

This is the basis for the per-tenant agent selection screen in both the Manage app and the tenant Dashboard.
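A sketch of how this allocation could be enforced at dispatch time (`getAllocation` and `countActiveTasks` are hypothetical lookups):

```typescript
// Hypothetical lookups against the allocation config and the task_runs table.
declare function getAllocation(tenantId: string, agentRole: AgentRole): Promise<TenantAgentAllocation | null>;
declare function countActiveTasks(tenantId: string, agentRole: AgentRole): Promise<number>;

async function canDispatch(tenantId: string, agentRole: AgentRole): Promise<boolean> {
  const allocation = await getAllocation(tenantId, agentRole);
  if (!allocation?.isEnabled) return false;     // agent not enabled for this tenant/plan
  const active = await countActiveTasks(tenantId, agentRole);
  return active < allocation.maxConcurrent;     // respect the per-tenant concurrency cap
}
```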
Priority Queuing at Scale
At high throughput, queue depth can grow large. Priority ensures high-value and urgent work is processed first:
Priority levels:
- 1 = Urgent (user explicitly flagged as urgent)
- 5 = Normal
- 10 = Batch / background

Activity Planner decomposition tasks always get priority 1 (fast feedback to the tenant that their campaign is “started”).
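For reference, a sketch of attaching a priority when enqueueing a decomposition task (queue name and payload shape are illustrative):

```typescript
import { Queue } from 'bullmq';
import IORedis from 'ioredis';

const connection = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });
const plannerQueue = new Queue('agent__activity_planner', { connection });

// BullMQ treats lower numbers as higher priority, so decomposition jobs jump the line.
await plannerQueue.add(
  'decompose-campaign',
  { tenantId: 'acme', campaignId: 'camp_123' },
  { priority: 1 },
);
```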
Observability at Scale
At millions of tasks/day, debugging requires structured observability:
Metrics (Prometheus + Grafana)
- Queue depth per tenant per agent type
- Task completion rate and error rate; task duration P50/P95/P99
- LLM token throughput per provider
- DB query latency
- Redis memory usage
Distributed tracing (OpenTelemetry)
- Every task execution is a trace: enqueue → dispatch → agent callback → validate → store
- Trace ID flows from the API request through BullMQ job → adapter → callback
- Allows “why is this campaign slow?” investigations at scale
Structured logging (pino)
- All log lines include: `tenantId`, `campaignId`, `taskId`, `agentRole`, `traceId`
- Shipped to Grafana Loki or Datadog
- Queryable: “show all errors for tenant Acme in the last 1 hour”
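A sketch of binding those fields once per task with pino child loggers (the values shown are illustrative):

```typescript
import pino from 'pino';

const baseLogger = pino();

// Bind the tenant/task context once; every subsequent line carries these fields, so
// Loki/Datadog queries like "all errors for tenant Acme in the last hour" just work.
const taskLogger = baseLogger.child({
  tenantId: 'acme',
  campaignId: 'camp_123',
  taskId: 'task_456',
  agentRole: 'copywriter',
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
});

taskLogger.info('task dispatched');
taskLogger.error({ err: new Error('LLM call timed out') }, 'task failed');
```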
Capacity Planning Summary
| Scale | API instances | Worker containers | PostgreSQL | MongoDB | Redis |
|---|---|---|---|---|---|
| < 10K tasks/day | 1 | 5 | 1 instance | 1 instance | 1 instance |
| 100K tasks/day | 2 | 20 | 1 + 1 read replica | replica set | 1 instance |
| 1M tasks/day | 5 | 100 | 1 + 2 read replicas + PgBouncer | replica set | Redis Cluster |
| 10M tasks/day | 20+ | 500+ | Citus (sharded) | Atlas (sharded) | Redis Cluster |