Channel Action Suggester: Claude Subprocess Exits with Code 1
Status: ✅ Fixed (2026-05-06) — concurrency: 3 → 1
Severity: Medium — affected jobs fail immediately; BullMQ retries eventually succeed, so data is not lost but credits are wasted and latency increases
File: packages/agents/src/workers/insights/channel-action-suggester.worker.ts
Symptom
After channel insights complete, the action suggester job starts and immediately fails (~400ms) with no meaningful error:
ERROR [channel-action-suggester] Claude adapter returned failure
error: "Process exited with code 1"
connectedChannelId: cmotrtt7n0009w13o1uunhhna
ERROR [channel-action-suggester] Channel action suggester job failed
jobId: channel-action-suggester__<channelId>__<ts>
Error: Process exited with code 1
  at Worker.processJob (channel-action-suggester.worker.ts:232)

The failure is always near-instant (~400ms–1.5s), far too fast for Claude to have processed anything. Retries succeed once no concurrent job is running.
Root Cause
The execute.ts adapter (packages/adapters/claude-local/src/server/execute.ts) uses a .agent.pid file in the worker’s cwd to detect and kill orphaned Claude subprocesses left over from a server restart:
// Kill any orphaned subprocess from a previous run (server restart scenario).
const pidFile = path.join(config.cwd, ".agent.pid");
// reads old PID → taskkill /F /T /PID <old> → writes current PID

All channel-action-suggester jobs share the same cwd:
os.tmpdir()/leadmetrics-agents/insights/channel-action-suggester

With concurrency: 3, BullMQ runs multiple jobs simultaneously. When Job B starts while Job A is still running:
- Job B reads .agent.pid and finds Job A's PID
- Job B issues taskkill /F /T /PID <A>, killing Job A's Claude process
- Job A's Claude process exits with code 1 and empty stderr
- Job A's execute.ts resolves { success: false, error: "Process exited with code 1" }
- Job A throws, and BullMQ schedules a retry
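The orphan-cleanup logic behind this race can be sketched as follows. This is a simplified reconstruction from the report, not the actual execute.ts code: the function name `claimPidFile` and the exact error handling are assumptions, but the read → taskkill → write sequence follows the comment quoted above.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import { execSync } from "node:child_process";

// Sketch of the .agent.pid handshake in execute.ts (names hypothetical).
// Returns the PID that was killed, or null if no pid file existed.
function claimPidFile(cwd: string, currentPid: number): number | null {
  const pidFile = path.join(cwd, ".agent.pid");
  let killedPid: number | null = null;
  if (fs.existsSync(pidFile)) {
    const oldPid = Number(fs.readFileSync(pidFile, "utf8").trim());
    if (!Number.isNaN(oldPid)) {
      // Intended for orphans left by a server restart. With a shared cwd,
      // however, the "old" PID may belong to a *live* sibling job, which
      // then dies with exit code 1 and empty stderr.
      try {
        execSync(`taskkill /F /T /PID ${oldPid}`);
      } catch {
        // Process already gone (the actual orphan case) — ignore.
      }
      killedPid = oldPid;
    }
  }
  fs.writeFileSync(pidFile, String(currentPid));
  return killedPid;
}
```

Because every concurrent job both reads and rewrites the same file, any job that starts second always "claims" the file and kills whoever wrote it first.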
This was confirmed in live logs (2026-05-06):
13:34:48 Job A starts (Claude running, ~1 min)
13:35:48 Job B starts ← concurrent; kills Job A via pidFile
13:35:48 Job A FAILS "Process exited with code 1" (416ms after B started)
13:38:28 Job D starts
13:39:07 Job E starts ← same pattern; kills Job D
13:39:08 Job D FAILS "Process exited with code 1"

Why insight workers don't have this problem: each insight worker type has its own agentRole (e.g. "facebook-insights", "gsc-insights"), so each gets its own cwd and its own pid file. At any given moment, typically only one job per channel type is running.
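The per-role cwd derivation presumably looks something like the sketch below. `workerCwd` is the helper name used later in this report; its body here is an assumption inferred from the tmpdir path shown above.

```typescript
import * as path from "node:path";
import * as os from "node:os";

// Hypothetical reconstruction of workerCwd: each agentRole gets its own
// directory under the OS temp dir, and therefore its own .agent.pid file.
function workerCwd(agentRole: string): string {
  return path.join(os.tmpdir(), "leadmetrics-agents", "insights", agentRole);
}
```

Distinct roles never share a pid file, which is why only the shared-role, concurrency-3 action suggester hits the race.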
Previous Incorrect Fix (2026-05-05)
An earlier fix added maxStalledCount: 2, maxTurnsPerRun: 3, and lockDuration: 300_000. These helped with a separate stall-retry issue but did not address the concurrent-kill race condition. The "Process exited with code 1" error continued appearing in the 2026-05-06 E2E session.
Fix Applied (2026-05-06)
Set CONCURRENCY from 3 to 1:
// packages/agents/src/workers/insights/channel-action-suggester.worker.ts
const CONCURRENCY = 1; // was 3

With concurrency: 1, BullMQ runs only one job at a time for this worker. The pid-file mechanism was designed for exactly this model: one Claude process per agent role at a time. Jobs queue up and execute serially, which is acceptable for a background suggestion-generation task.
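For reference, the relevant BullMQ worker options reduce to the fragment below. The surrounding worker registration is not shown here, and this object is a sketch assembled from the values mentioned in this report, not a copy of the source:

```typescript
// Hypothetical options object for the channel-action-suggester worker.
// The only change in this fix is concurrency, dropped from 3 to 1.
const workerOptions = {
  concurrency: 1,        // was 3 — serializes jobs so the shared .agent.pid is never contested
  lockDuration: 300_000, // 5 min lock per serial job (retained from the 2026-05-05 fix)
  maxStalledCount: 2,    // retained as a stall safety net
};
```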
Why Not Use Per-Execution cwd?
An alternative of using path.join(workerCwd(AGENT_ROLE), runId) as the cwd would isolate concurrent jobs but would break orphan cleanup: after a server restart, the new job would use a new runId-derived directory and never find the orphaned Claude process from the previous run.
Notes
- lockDuration: 300_000 (5 min) remains, giving each serial job enough lock time
- maxStalledCount: 2 remains as a safety net for the Windows DLL cold start on the first Claude spawn
- If throughput becomes a bottleneck, the proper fix is to scope the pid file per connectedChannelId (one file per channel rather than one per agent role), which would allow concurrent jobs across different channels without interference
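That future per-channel scoping could be as small as the sketch below. The helper name `pidFilePath` and the file-naming scheme are hypothetical; only the idea of keying the pid file on connectedChannelId comes from the note above.

```typescript
import * as path from "node:path";

// Hypothetical per-channel pid-file path: jobs for different channels get
// distinct files, so concurrent cross-channel jobs cannot kill each other,
// while restart-orphan cleanup still works within a single channel.
function pidFilePath(cwd: string, connectedChannelId: string): string {
  return path.join(cwd, `.agent-${connectedChannelId}.pid`);
}
```

Two jobs for the same channel would still contest one file, so per-channel scoping would also need concurrency limits keyed on the channel (e.g. a per-channel job group) rather than a worker-wide concurrency of 1.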