Activity Planner: Slow Runtime (~8 min) — Root Cause Investigation
Status: Open
Severity: Medium-High — job completes but takes ~8 min; can exceed 16 min with internal retry, causing BullMQ lock expiry and duplicate execution
File: packages/agents/src/workers/activity.worker.ts
Observed: 2026-05-06 00:20 → 00:28 (7 min 41 sec Claude execution)
Timeline from Logs
00:20:23 Activity planner job started
00:20:23 Activity planner prompt built — invoking Claude (instant — DB queries fast)
00:28:04 Claude execution finished (7 min 41 sec later)
00:28:04 JSON parsed successfully
00:28:04 requireActivityApproval=true — period set to dm_review
00:28:04 Activity planner job completed

100% of the runtime is Claude execution. All DB queries, JSON parsing, and writes are instant.
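As a quick check on the timeline arithmetic, the elapsed time between the two log timestamps can be recomputed. This is a minimal sketch; `toSeconds` and `formatDuration` are hypothetical helpers written for this note, not functions from the worker.

```typescript
// Parse an "HH:MM:SS" log timestamp into seconds since midnight.
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Format a duration in seconds as "M min S sec".
function formatDuration(totalSec: number): string {
  return `${Math.floor(totalSec / 60)} min ${totalSec % 60} sec`;
}

// Timestamps copied from the log timeline above.
const claudeInvokedAt = toSeconds("00:20:23");
const claudeFinishedAt = toSeconds("00:28:04");
const claudeElapsedSec = claudeFinishedAt - claudeInvokedAt;

console.log(formatDuration(claudeElapsedSec)); // 7 min 41 sec
```

The job start and the Claude invocation share the same second in the logs, so at log resolution the Claude call accounts for the whole job duration.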
Root Cause Detail (from code investigation)
Finding 1: CLAUDE.md in skills dir MANDATES RAG searches (biggest hidden cause)
createSkillsDir() writes a CLAUDE.md into the skills directory that says:
## IMPORTANT: Call this tool before writing any content
Before generating your output, you MUST call search_knowledge to retrieve relevant context
from the tenant's indexed documents. This is mandatory — do not skip this step.
Step 1: Search brand docs first
Step 2: Search for the specific topic
Step 3: Check published content to avoid duplication

The activity planner passes skillsDir (lines 427 and 448 in activity.worker.ts) but has no
allowedTools: [] restriction. So Claude reads CLAUDE.md, obeys the “MUST call”
instruction, and executes 3+ search_knowledge Bash tool calls before generating any JSON.
Each tool call = one turn = one full API round-trip to Anthropic. That is 3 extra turns before a single output token is generated.
The deliverable planner never passes skillsDir at all, so Claude never sees this file.
The activity planner is the only pure-JSON planner that passes skillsDir. This CLAUDE.md
was written for content-generating agents (blog writer, social post writer) — not for
planning agents that do structured JSON generation.
Finding 2: Missing allowedTools: [] and maxTurnsPerRun
The execute config (lines 441-453):
config: {
  cwd,
  model: "claude-sonnet-4-6",
  dangerouslySkipPermissions: true,
  timeoutSec: 900,
  // NO allowedTools
  // NO maxTurnsPerRun
}

Without allowedTools: [], Claude Code CLI runs with full tool access. Claude can and
does use Bash to run search_knowledge.js. Without maxTurnsPerRun, there is no cap on
the number of turns: tool calls plus continuation turns can inflate the run to 5-8 turns total.
Finding 3: Internal retry loop can double the runtime
Lines 431-508: The worker has an internal retry loop (MAX_INTERNAL_RETRIES = 2). If JSON
parsing fails on attempt 1 (which happens when tool call output gets interleaved with
Claude’s JSON response), the entire execute() is called again — a second full invocation.
The parse failures are caused by the tool-use pollution in Finding 1. Fix tool use,
fix the retries. But if the retry does fire, runtime becomes 16+ min, which exceeds
the 16 min lockDuration and causes BullMQ lock expiry + duplicate execution.
Finding 4: Full client context injected verbatim (30-50K chars)
contextText at line 399 is the entire ClientContext.content blob — same issue as the
deliverable planner. The structured inputs (goalsTable, templatesTable) are already
compact because they are built directly from DB fields. The only fat in the prompt is the
context file.
The activity planner needs from the context: business overview and target audience (for
inputHints content direction). It does NOT need brand voice details, competitor analysis,
key page breakdowns, or technical website details.
Finding 5: lockDuration too tight and gets worse with retry
lockDuration: 960_000 = 16 min (only 1 min buffer over 900s timeout). With tool-use
inflation + internal retry, actual runtime easily exceeds 16 min, causing BullMQ lock
expiry and duplicate execution. The deliverable planner uses a 3 min buffer (1,080,000).
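The lock arithmetic in Findings 3 and 5 can be made explicit with a small budget check. This is only a sketch: `LockBudget` and its helpers are hypothetical names, `maxAttempts: 2` assumes the retry loop allows two full execute() invocations, and the deliverable planner's 900 s timeout is inferred from its stated 3 min buffer rather than read from its code.

```typescript
interface LockBudget {
  timeoutSec: number;     // per-invocation Claude timeout
  lockDurationMs: number; // BullMQ lockDuration
  maxAttempts: number;    // full execute() invocations the worker may make
}

// Headroom between a single timed-out invocation and the lock expiring.
function lockBufferMs(b: LockBudget): number {
  return b.lockDurationMs - b.timeoutSec * 1000;
}

// Upper bound on runtime if every attempt runs to its timeout.
function worstCaseRuntimeMs(b: LockBudget): number {
  return b.timeoutSec * 1000 * b.maxAttempts;
}

// True when even the worst case finishes before the lock expires.
function fitsInLock(b: LockBudget): boolean {
  return worstCaseRuntimeMs(b) <= b.lockDurationMs;
}

// Values from this report; maxAttempts is an assumption (see lead-in).
const activityPlanner: LockBudget = { timeoutSec: 900, lockDurationMs: 960_000, maxAttempts: 2 };
const deliverablePlanner: LockBudget = { timeoutSec: 900, lockDurationMs: 1_080_000, maxAttempts: 1 };

console.log(lockBufferMs(activityPlanner) / 60_000);    // 1 (minute of buffer)
console.log(lockBufferMs(deliverablePlanner) / 60_000); // 3 (minutes of buffer)
console.log(fitsInLock(activityPlanner));               // false
```

Under these assumptions, widening the buffer alone does not cover a second full invocation; the lock only holds if the retry is rare or the per-attempt timeout is reduced.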
Comparison: Activity vs Deliverable Planner Config
| Setting | Deliverable Planner | Activity Planner |
|---|---|---|
| allowedTools | [] ✅ | missing ❌ |
| maxTurnsPerRun | 3 | missing ❌ |
| skillsDir | not passed ✅ | passed; CLAUDE.md mandates RAG ❌❌ |
| lockDuration | 1,080,000 (3 min buffer) | 960,000 (1 min buffer) ❌ |
| Internal retry | no | yes — up to 2× invocations ❌ |
| Estimated turn count | 2-3 | 5-8 (3 RAG + 2-3 output) |
| Observed runtime | ~7.5 min | ~8 min (can be 16+ with retry) |
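The configuration differences in the table could be caught mechanically. The following is a sketch of such a check, assuming a simplified, hypothetical `PlannerRunOptions` shape; in the real worker skillsDir is passed separately from the config object, so the flattening here is for illustration only.

```typescript
// Hypothetical, flattened view of the options relevant to a pure-JSON planner.
interface PlannerRunOptions {
  allowedTools?: string[];
  maxTurnsPerRun?: number;
  skillsDir?: string;
}

// Returns a list of problems; an empty list means the options look safe
// for a planner that should only emit JSON and never call tools.
function auditPureJsonPlanner(opts: PlannerRunOptions): string[] {
  const problems: string[] = [];
  if (opts.allowedTools === undefined) {
    problems.push("allowedTools is missing: full tool access is enabled");
  } else if (opts.allowedTools.length > 0) {
    problems.push("allowedTools is not empty: " + opts.allowedTools.join(", "));
  }
  if (opts.maxTurnsPerRun === undefined) {
    problems.push("maxTurnsPerRun is missing: turn count is uncapped");
  }
  if (opts.skillsDir !== undefined) {
    problems.push("skillsDir is passed: CLAUDE.md instructions will be read");
  }
  return problems;
}

// The two planners as described in the comparison table.
// The skillsDir value below is a placeholder path, not the real one.
const activityOpts: PlannerRunOptions = { skillsDir: "/tmp/skills-placeholder" };
const deliverableOpts: PlannerRunOptions = { allowedTools: [], maxTurnsPerRun: 3 };

console.log(auditPureJsonPlanner(activityOpts).length);    // 3
console.log(auditPureJsonPlanner(deliverableOpts).length); // 0
```

A check like this could run in a unit test so that a new planner worker cannot silently regress to full tool access.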
Proposed Fixes
Fix 1: Add allowedTools: [], remove skillsDir, add maxTurnsPerRun: 2 (immediate, biggest gain)
const execute = await getClaudeAdapter(); // skillsDir removed: not needed
const result = await execute({
  config: {
    cwd,
    model: "claude-sonnet-4-6",
    dangerouslySkipPermissions: true,
    allowedTools: [], // pure JSON generation, no tools needed
    maxTurnsPerRun: 2, // 1 turn for output + 1 continuation buffer
    timeoutSec: 600, // 10 min (generous; tools can no longer inflate this)
  },
  prompt: activePrompt,
  // skillsDir removed
  agentId: "activity-planner",
  ...
});

Removing skillsDir means Claude never sees the CLAUDE.md that mandates RAG searches. Adding
allowedTools: [] is a belt-and-suspenders guard. Expected impact: removes 3+ tool-call
turns and saves 4-6 min. With no tool use, the first-attempt JSON parse should succeed, making
the internal retry loop a true last resort rather than a regular occurrence.
Fix 2: Increase lockDuration buffer (safety fix, zero risk)
lockDuration: 1_080_000, // 18 min (current 15 min timeout + 3 min buffer, matches deliverable planner)

Fix 3: Extract planning-relevant sections from client context (shared with deliverable planner)
Same approach as the deliverable planner Option 1 — parse the context markdown by #
headings at prompt-build time and include only: overview, goals, audience, channels.
Estimated impact: reduces context from ~30-50K chars to ~4-6K chars, saves 1-2 min per turn.
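A minimal sketch of the heading-based extraction follows. `extractSections` is a hypothetical helper, and the keyword list assumes the context file's heading titles contain words like "overview" or "audience"; the real heading names in ClientContext.content would need to be checked first.

```typescript
// Keep only the markdown sections whose heading title contains one of the
// wanted keywords (case-insensitive). Subsections nested under a kept
// section are kept with it.
function extractSections(markdown: string, wanted: string[]): string {
  const wantedLower = wanted.map((w) => w.toLowerCase());
  const kept: string[] = [];
  let keepLevel = 0; // 0 = skipping; otherwise the heading level being kept

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      const level = match[1].length;
      const title = match[2].trim().toLowerCase();
      // Re-evaluate only at headings that close the current section
      // (same or shallower level), or when nothing is being kept.
      if (keepLevel === 0 || level <= keepLevel) {
        keepLevel = wantedLower.some((w) => title.includes(w)) ? level : 0;
      }
    }
    if (keepLevel > 0) kept.push(line);
  }
  return kept.join("\n").trim();
}

// Keywords taken from the list in Fix 3.
const PLANNING_SECTIONS = ["overview", "goals", "audience", "channels"];

// Made-up sample context, for illustration only.
const sampleContext = [
  "## Business Overview",
  "We sell bikes.",
  "## Brand Voice",
  "Playful.",
  "## Target Audience",
  "Commuters.",
  "### Segments",
  "Urban.",
  "## Competitor Analysis",
  "Acme.",
].join("\n");

console.log(extractSections(sampleContext, PLANNING_SECTIONS));
```

One known limitation of this sketch: a line starting with `#` inside a fenced code block in the context file would be treated as a heading. If the context files contain code blocks, a real markdown parser would be the safer choice.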
Fix 4: Switch model to claude-haiku-4-5 (optional, after Fix 1 validated)
The task is pure structured JSON generation from known templates (templates → activities). Haiku is expected to perform equivalently at ~5x the speed.
model: "claude-haiku-4-5",
timeoutSec: 300,
lockDuration: 480_000,

Recommended Fix Order
- Fix 1 (allowedTools + remove skillsDir + maxTurnsPerRun) — zero quality risk, expected 4-6 min gain, also eliminates the retry-loop double-invocation risk.
- Fix 2 (lockDuration buffer) — safety fix, zero risk.
- Fix 3 (context section extraction) — share implementation with deliverable planner.
- Fix 4 (Haiku) — after validating Fix 1+3 on a live run.