
Scalability Design

Target Scale

The system must be capable of:

  • Millions of tasks/activities per day across all tenants
  • Thousands of concurrent tenants
  • Thousands of simultaneous agent executions
  • Linear horizontal scaling — add more workers, handle more load, no architectural changes

Why the Current Architecture Scales

Control plane ≠ agent execution

Because the control plane does not run agents — it only dispatches and receives callbacks — the control plane’s load is proportional to orchestration overhead, not LLM compute time. A single control plane cluster can dispatch millions of tasks to external agent runtimes.

Agent runtimes (Claude API, OpenAI API, Ollama workers) scale independently and are not bottlenecked by the control plane.

BullMQ is horizontally scalable by design

Adding more worker processes to any agent queue increases throughput proportionally. Workers are stateless — they read from Redis, execute a task, write results to the DB, and exit.

```text
10 copywriter worker processes × 4 concurrency each = 40 concurrent copywriter tasks
Scale to 100 workers                                = 400 concurrent copywriter tasks
No code changes required
```

Shared queues with per-tenant priority and rate limiting

All tenants share the same agent__{agentRole} queue. A surge from one tenant does not starve others because:

  1. Priority — jobs carry a numeric priority; urgent/normal jobs from any tenant are dequeued first regardless of tenant.
  2. Per-tenant rate limiter — BullMQ’s rate limiter is keyed on tenantId, capping how many jobs a single tenant can have active at once:
```typescript
// Tenant-level concurrency cap enforced at the worker level
limiter: {
  max: tenantConcurrencyLimit,
  duration: 1000,
  groupKey: job.data.tenantId
}
```

This replaces the old {tenantId}:agent__{agentRole} per-tenant queue scheme and matches the existing rag__ingestion queue pattern.


Scaling Strategy by Layer

1. API Layer (Fastify)

Bottlenecks: HTTP connection handling, JSON serialisation, DB query throughput.

Scaling approach:

  • Horizontal: run N Fastify instances behind a load balancer (Coolify/Traefik handles this)
  • SSE connections: each SSE connection is long-lived; use Redis pub/sub to fan out events to all Fastify instances so any instance can serve any client’s SSE stream
```typescript
// SSE uses Redis pub/sub — any Fastify instance can serve any tenant's stream
redisSubscriber.subscribe(`tenant:${tenantId}:events`);
redisSubscriber.on('message', (channel, event) => {
  // Forward to all SSE clients connected to this instance for this tenant
  sseClients.get(tenantId)?.forEach(client => client.write(`data: ${event}\n\n`));
});
```
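The `data: ${event}\n\n` framing matters: the SSE spec requires each payload line to be prefixed with `data:` and each event to end with a blank line. A minimal frame formatter (illustrative helper, not part of the existing codebase) makes the rule explicit, including the multi-line case:

```typescript
// Formats a payload as a Server-Sent Events frame.
// Multi-line payloads must be split across multiple `data:` lines per the SSE spec.
function sseFrame(event: string): string {
  return event
    .split('\n')
    .map(line => `data: ${line}`)
    .join('\n') + '\n\n';
}
```

For example, `sseFrame('a\nb')` produces `data: a\ndata: b\n\n`, which browsers reassemble into the single event payload `a\nb`.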

Target: 1 Fastify instance handles ~5,000 req/s. 10 instances = 50,000 req/s. Scale out as needed.

2. BullMQ Workers

Bottlenecks: Agent adapter dispatch rate, Redis command rate.

Scaling approach:

  • Each worker type is a separate horizontally-scalable process
  • Worker instances are stateless Docker containers — Coolify scales them
  • Redis command rate: Redis 7 handles ~1M commands/second on a single node; at scale, use Redis Cluster
```bash
docker service scale api-workers=50   # 50 worker containers
```

Target throughput per worker container (4 vCPU):

| Agent type | Concurrency | Tasks/hour/container |
|---|---|---|
| Activity Planner | 2 | ~20 (heavy LLM calls) |
| Copywriter | 4 | ~120 (varies by length) |
| SEO Specialist | 3 | ~60 (API calls bottleneck) |
| Social Media Manager | 4 | ~120 |
| Paid Ads Manager | 2 | ~30 (API rate limits) |
| Data Analyst | 3 | ~60 |
| Content Researcher | 2 | ~20 (Playwright CPU) |

Scale to 1M tasks/day:

  • 1M tasks/day = ~41,666 tasks/hour = ~694 tasks/minute
  • Mix varies by campaign type; assume ~300 copywriter + ~200 SEO + ~194 other tasks per minute
  • Copywriter: 300/minute = 18,000/hour; at ~120 tasks/hour/container, that is ~150 containers
  • Bursts above the sustained rate queue up and drain; autoscale on queue depth rather than provisioning for a 10× peak
  • Total: ~500 worker containers at a sustained 1M tasks/day (copywriter ~150, SEO ~200, others ~150)
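The container math above is a straight division, which is worth encoding once so every capacity estimate uses the same rounding. `containersNeeded` is a hypothetical helper, not existing code:

```typescript
// Containers needed to sustain a given hourly task rate, given the
// per-container throughput from the table above. Rounds up: a fractional
// container still requires a whole one.
function containersNeeded(tasksPerHour: number, tasksPerHourPerContainer: number): number {
  return Math.ceil(tasksPerHour / tasksPerHourPerContainer);
}
```

For example, `containersNeeded(300, 100)` returns 3.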

3. Databases

PostgreSQL (relational, transactional data)

Bottlenecks: Write throughput on high-frequency tables (llm_calls, task_runs), read throughput on cost aggregation queries.

Scaling approach:

  • Connection pooling: PgBouncer in front of PostgreSQL; workers connect to PgBouncer, not directly to Postgres. Cap at 100 real connections regardless of worker count.
  • Read replicas: Cost aggregation and reporting queries hit a read replica; writes go to primary
  • Partitioning: llm_calls and task_runs partitioned by created_at (monthly) so old data doesn’t slow down current-month queries
  • Index strategy: Critical indexes already defined; add BRIN indexes on time-series tables
```sql
-- Partition llm_calls by month for scale
CREATE TABLE llm_calls ( ... ) PARTITION BY RANGE (created_at);

CREATE TABLE llm_calls_2026_04 PARTITION OF llm_calls
  FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
```

Target: PostgreSQL 16 on 8 vCPU/32GB RAM handles ~50K writes/second. Sufficient for millions of tasks/day.

MongoDB (document store, append-heavy)

Bottlenecks: Deliverable content writes, event log append rate.

Scaling approach:

  • Per-tenant database provides natural sharding — each tenant’s deliverable writes go to their own database
  • TTL indexes on streaming buffers and short-lived event logs auto-cleanup without manual purging
  • Replica set: 3-node replica set for durability; read from secondaries for log queries
```typescript
// TTL: streaming output buffer expires after 24h (UI caches what it needs)
deliverableStreamSchema.index({ createdAt: 1 }, { expireAfterSeconds: 86_400 });

// TTL: raw agent output logs expire after 90 days
agentOutputSchema.index({ createdAt: 1 }, { expireAfterSeconds: 7_776_000 });
```

Target: MongoDB 7 on 8 vCPU/32GB RAM handles ~100K writes/second on append-only collections.

Redis

Bottlenecks: Queue command rate at very high task throughput.

Scaling approach:

  • At < 100K tasks/day: single Redis instance is sufficient
  • At > 100K tasks/day: Redis Cluster (3 primary shards + 3 replicas)
  • Queue keys are tenant-scoped so shard routing is predictable
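Predictable shard routing in Redis Cluster comes from hash tags: Redis hashes only the substring inside `{…}` when computing the key's slot, so every key sharing a tag lands on the same shard. A sketch of how tenant-scoped keys could be built (the exact key scheme here is an assumption for illustration):

```typescript
// Illustrative key scheme: the `{tenantId}` hash tag makes Redis Cluster hash
// only the tag, so all of a tenant's keys map to the same hash slot — and
// multi-key operations (MULTI, Lua) across them remain legal in cluster mode.
function tenantKey(tenantId: string, suffix: string): string {
  return `{${tenantId}}:${suffix}`;
}
```

For example, `tenantKey('acme', 'events')` yields `{acme}:events`.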

4. LLM API Rate Limits

LLM APIs are the most likely real-world bottleneck, not our infrastructure.

Anthropic Claude:

  • Tier 4 (enterprise): ~40,000,000 input tokens/minute
  • At ~2,000 tokens/task average: ~20,000 Claude tasks/minute = 28.8M tasks/day
  • This exceeds our target → rate limit handling is a queue concern, not a scale concern

OpenAI:

  • Tier 5 (enterprise): 30M tokens/minute
  • Similar ceiling to Claude
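The ceiling arithmetic for both providers is the same conversion; `dailyTaskCeiling` is a hypothetical helper that makes it checkable (2,000 tokens/task is the average assumed above):

```typescript
// Daily task ceiling implied by a provider's tokens-per-minute quota.
function dailyTaskCeiling(tokensPerMinute: number, avgTokensPerTask: number): number {
  const tasksPerMinute = tokensPerMinute / avgTokensPerTask;
  return tasksPerMinute * 60 * 24; // minutes per day
}
```

`dailyTaskCeiling(40_000_000, 2_000)` gives 28.8M tasks/day, matching the Claude figure above.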

Mitigations:

  • Per-tenant token bucket rate limiting prevents one tenant from consuming all quota
  • Automatic fallback to secondary provider if primary hits rate limit
  • Ollama workers act as an overflow valve for classification/routing tasks
```typescript
// Per-tenant, per-provider rate limiter
const rateLimiter = new Bottleneck({
  reservoir: tenant.plan === 'enterprise' ? 500_000 : 100_000, // tokens/min
  reservoirRefreshAmount: tenant.plan === 'enterprise' ? 500_000 : 100_000,
  reservoirRefreshInterval: 60_000,
});
```
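The second mitigation, provider fallback, can be sketched as a wrapper that retries a rate-limited call against the secondary provider. `RateLimitError` and the provider call shape are illustrative stand-ins, not real SDK types:

```typescript
// Sketch: try the primary LLM provider; on a rate-limit error only,
// retry once against the secondary provider. Other errors surface to the task run.
class RateLimitError extends Error {}

type ProviderCall = (prompt: string) => Promise<string>;

async function callWithFallback(
  primary: ProviderCall,
  secondary: ProviderCall,
  prompt: string,
): Promise<string> {
  try {
    return await primary(prompt);
  } catch (err) {
    if (err instanceof RateLimitError) return secondary(prompt);
    throw err; // non-rate-limit failures are not masked by fallback
  }
}
```

A real implementation would also record which provider served the call, so cost attribution in llm_calls stays accurate.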

Tenant-Aware Agent Allocation

Not all tenants get all agents. Agent availability is based on plan and tenant-specific configuration. A small client might only have Copywriter + SEO Specialist enabled; an enterprise client gets all 7 agents plus custom agents.

Tenant Agent Plan Matrix

| Plan | Available Agents | Max Concurrent per Agent |
|---|---|---|
| Free | Copywriter, SEO Specialist | 1 each |
| Pro | Copywriter, SEO, Social, Analyst | 2 each |
| Agency | All 7 standard agents | 4 each |
| Enterprise | All 7 + custom agents | Configurable |

Configuring Agents per Tenant

Super admins configure agent availability per tenant in the Manage app (/tenants/[id] → Config tab):

```typescript
interface TenantAgentAllocation {
  tenantId: string;
  agentRole: AgentRole;
  isEnabled: boolean;
  maxConcurrent: number;         // max simultaneous tasks for this agent for this tenant
  monthlyTaskLimit?: number;     // optional: cap at N tasks/month for cost control
  modelOverride?: string;        // tenant-specific model (e.g. force GPT-4o-mini for budget clients)
  adapterOverride?: AdapterType; // tenant-specific adapter (e.g. webhook for enterprise)
}
```

This is the basis for the per-tenant agent selection screen in both the Manage app and the tenant Dashboard.
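Enforcement happens before enqueueing: a dispatch is allowed only if the agent is enabled and both the concurrency and optional monthly caps have headroom. A sketch of that gate (the usage counters are assumed to come from Redis/Postgres; field names follow the allocation interface):

```typescript
// Pre-dispatch check against a tenant's agent allocation.
interface AgentAllocation {
  isEnabled: boolean;
  maxConcurrent: number;
  monthlyTaskLimit?: number; // undefined = unlimited
}

function canDispatch(
  alloc: AgentAllocation,
  activeTasks: number,     // current in-flight tasks for this tenant+agent
  tasksThisMonth: number,  // completed + in-flight this calendar month
): boolean {
  if (!alloc.isEnabled) return false;
  if (activeTasks >= alloc.maxConcurrent) return false;
  if (alloc.monthlyTaskLimit !== undefined && tasksThisMonth >= alloc.monthlyTaskLimit) return false;
  return true;
}
```

Rejected dispatches should surface to the tenant Dashboard as a quota message rather than failing silently.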


Priority Queuing at Scale

At high throughput, queue depth can grow large. Priority ensures high-value and urgent work is processed first:

Priority levels:

```text
 1 = Urgent (user explicitly flagged as urgent)
 5 = Normal
10 = Batch / background
```

Activity Planner decomposition tasks always get priority 1 (fast feedback to the tenant that their campaign is “started”).
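The mapping from task attributes to a numeric priority can live in one function so every enqueue path agrees (the flag names here are illustrative, not existing fields). BullMQ treats lower numbers as higher priority, which matches the levels above:

```typescript
// Maps task attributes to the numeric job priority (lower = dequeued first).
function jobPriority(opts: {
  isPlannerDecomposition?: boolean; // Activity Planner breakdown tasks
  isUrgent?: boolean;               // user explicitly flagged as urgent
  isBatch?: boolean;                // background / bulk work
}): number {
  if (opts.isPlannerDecomposition || opts.isUrgent) return 1; // urgent
  if (opts.isBatch) return 10;                                // batch / background
  return 5;                                                   // normal
}
```

The computed value would then be passed as `priority` in the job options when adding to the agent queue.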


Observability at Scale

At millions of tasks/day, debugging requires structured observability:

Metrics (Prometheus + Grafana)

  • Queue depth per tenant per agent type
  • Task completion rate / error rate (P50, P95, P99 durations)
  • LLM token throughput per provider
  • DB query latency
  • Redis memory usage

Distributed tracing (OpenTelemetry)

  • Every task execution is a trace: enqueue → dispatch → agent callback → validate → store
  • Trace ID flows from the API request through BullMQ job → adapter → callback
  • Allows “why is this campaign slow?” investigations at scale
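Propagation is the whole trick: the trace id must be minted once at the API edge and carried, never regenerated, through job data and callbacks. A minimal sketch (using `randomUUID` as a stand-in for a real OpenTelemetry trace id):

```typescript
import { randomUUID } from 'node:crypto';

// Job payload that threads one trace id through enqueue → dispatch →
// agent callback → validate → store, so the hops stitch into a single trace.
interface TaskJobData {
  tenantId: string;
  taskId: string;
  traceId: string; // propagated downstream, never regenerated
}

function withTrace(
  data: Omit<TaskJobData, 'traceId'>,
  incomingTraceId?: string, // from the API request's trace context, if any
): TaskJobData {
  return { ...data, traceId: incomingTraceId ?? randomUUID() };
}
```

Every downstream stage reads `traceId` from the job data and attaches it to its own spans and log lines.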

Structured logging (pino)

  • All log lines include: tenantId, campaignId, taskId, agentRole, traceId
  • Shipped to Grafana Loki or Datadog
  • Queryable: “show all errors for tenant Acme in the last 1 hour”
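The correlation fields above are what make those Loki/Datadog queries possible, so they belong in every line by construction. A plain sketch of the line shape (in the real service this would be a pino child logger carrying these bindings, not a hand-rolled function):

```typescript
// A pino-style structured log line carrying the required correlation fields.
interface LogContext {
  tenantId: string;
  campaignId: string;
  taskId: string;
  agentRole: string;
  traceId: string;
}

function logLine(ctx: LogContext, level: 'info' | 'warn' | 'error', msg: string): string {
  // One JSON object per line — the format Loki/Datadog pipelines expect.
  return JSON.stringify({ level, time: Date.now(), ...ctx, msg });
}
```

Binding the context once (pino's `logger.child(ctx)`) rather than repeating it at each call site keeps the fields from drifting.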

Capacity Planning Summary

| Scale | API instances | Worker containers | PostgreSQL | MongoDB | Redis |
|---|---|---|---|---|---|
| < 10K tasks/day | 1 | 5 | 1 instance | 1 instance | 1 instance |
| 100K tasks/day | 2 | ~50 | 1 + 1 read replica | replica set | 1 instance |
| 1M tasks/day | 5 | ~500 | 1 + 2 read replicas + PgBouncer | replica set | Redis Cluster |
| 10M tasks/day | 20+ | 5,000+ | Citus (sharded) | Atlas (sharded) | Redis Cluster |

Worker counts scale linearly with throughput per the per-container rates in the worker table above.

© 2026 Leadmetrics — Internal use only