Channel Action Suggester: Claude Subprocess Exits with Code 1
Status: ✅ Fixed (2026-05-06) — concurrency: 3 → 1
Severity: Medium — affected jobs fail immediately; BullMQ retries eventually succeed, so data is not lost but credits are wasted and latency increases
File: packages/agents/src/workers/insights/channel-action-suggester.worker.ts
Symptom
After channel insights complete, the action suggester job starts and immediately fails (~400ms) with no meaningful error:
ERROR [channel-action-suggester] Claude adapter returned failure
error: "Process exited with code 1"
connectedChannelId: cmotrtt7n0009w13o1uunhhna
ERROR [channel-action-suggester] Channel action suggester job failed
jobId: channel-action-suggester__<channelId>__<ts>
Error: Process exited with code 1
  at Worker.processJob (channel-action-suggester.worker.ts:232)

The failure is always near-instant (~400ms–1.5s), far too fast for Claude to have processed anything. Retries succeed once no concurrent job is running.
Root Cause
The execute.ts adapter (packages/adapters/claude-local/src/server/execute.ts) uses a .agent.pid file in the worker’s cwd to detect and kill orphaned Claude subprocesses left over from a server restart:
// Kill any orphaned subprocess from a previous run (server restart scenario).
const pidFile = path.join(config.cwd, ".agent.pid");
// reads old PID → taskkill /F /T /PID <old> → writes current PID

All channel-action-suggester jobs share the same cwd:
os.tmpdir()/leadmetrics-agents/insights/channel-action-suggester

With concurrency: 3, BullMQ runs multiple jobs simultaneously. When Job B starts while Job A is still running:
- Job B reads .agent.pid and finds Job A's PID
- Job B issues taskkill /F /T /PID <A>, killing Job A's Claude process
- Job A's Claude process exits with code 1 and empty stderr
- Job A's execute.ts resolves { success: false, error: "Process exited with code 1" }
- Job A throws, and BullMQ schedules a retry
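The orphan-cleanup logic behind this race can be sketched as follows. This is a simplified reconstruction from the report, not the actual execute.ts code: the function name `claimPidFile` and the exact error handling are assumptions, but the read → taskkill → write sequence follows the comment quoted above.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import { execSync } from "node:child_process";

// Sketch of the .agent.pid handshake in execute.ts (names hypothetical).
// Returns the PID that was killed, or null if no pid file existed.
function claimPidFile(cwd: string, currentPid: number): number | null {
  const pidFile = path.join(cwd, ".agent.pid");
  let killedPid: number | null = null;
  if (fs.existsSync(pidFile)) {
    const oldPid = Number(fs.readFileSync(pidFile, "utf8").trim());
    if (!Number.isNaN(oldPid)) {
      // Intended for orphans left by a server restart. With a shared cwd,
      // however, the "old" PID may belong to a *live* sibling job, which
      // then dies with exit code 1 and empty stderr.
      try {
        execSync(`taskkill /F /T /PID ${oldPid}`);
      } catch {
        // Process already gone (the actual orphan case) — ignore.
      }
      killedPid = oldPid;
    }
  }
  fs.writeFileSync(pidFile, String(currentPid));
  return killedPid;
}
```

Because every concurrent job both reads and rewrites the same file, any job that starts second always "claims" the file and kills whoever wrote it first.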
This was confirmed in live logs (2026-05-06):
13:34:48 Job A starts (Claude running, ~1 min)
13:35:48 Job B starts ← concurrent; kills Job A via pidFile
13:35:48 Job A FAILS "Process exited with code 1" (416ms after B started)
13:38:28 Job D starts
13:39:07 Job E starts ← same pattern; kills Job D
13:39:08 Job D FAILS "Process exited with code 1"

Why insight workers don't have this problem: each insight worker type has its own agentRole (e.g. "facebook-insights", "gsc-insights"), so each gets its own cwd and its own pid file. At any given moment, typically only one job per channel type is running.
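The per-role cwd derivation presumably looks something like the sketch below. `workerCwd` is the helper name used later in this report; its body here is an assumption inferred from the tmpdir path shown above.

```typescript
import * as path from "node:path";
import * as os from "node:os";

// Hypothetical reconstruction of workerCwd: each agentRole gets its own
// directory under the OS temp dir, and therefore its own .agent.pid file.
function workerCwd(agentRole: string): string {
  return path.join(os.tmpdir(), "leadmetrics-agents", "insights", agentRole);
}
```

Distinct roles never share a pid file, which is why only the shared-role, concurrency-3 action suggester hits the race.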
Previous Incorrect Fix (2026-05-05)
An earlier fix added maxStalledCount: 2, maxTurnsPerRun: 3, and lockDuration: 300_000. These helped with a separate stall-retry issue but did not address the concurrent-kill race condition. The "Process exited with code 1" error continued appearing in the 2026-05-06 E2E session.
Fix Applied (2026-05-06)
Set CONCURRENCY from 3 to 1:
// packages/agents/src/workers/insights/channel-action-suggester.worker.ts
const CONCURRENCY = 1; // was 3

With concurrency: 1, BullMQ runs only one job at a time for this worker. The pid-file mechanism was designed for exactly this model: one Claude process per agent role at a time. Jobs queue up and execute serially, which is acceptable for a background suggestion-generation task.
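For reference, the relevant BullMQ worker options reduce to the fragment below. The surrounding worker registration is not shown here, and this object is a sketch assembled from the values mentioned in this report, not a copy of the source:

```typescript
// Hypothetical options object for the channel-action-suggester worker.
// The only change in this fix is concurrency, dropped from 3 to 1.
const workerOptions = {
  concurrency: 1,        // was 3 — serializes jobs so the shared .agent.pid is never contested
  lockDuration: 300_000, // 5 min lock per serial job (retained from the 2026-05-05 fix)
  maxStalledCount: 2,    // retained as a stall safety net
};
```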
Why Not Use Per-Execution cwd?
An alternative of using path.join(workerCwd(AGENT_ROLE), runId) as the cwd would isolate concurrent jobs but would break orphan cleanup: after a server restart, the new job would use a new runId-derived directory and never find the orphaned Claude process from the previous run.
Notes
- lockDuration: 300_000 (5 min) remains, giving each serial job enough lock time
- maxStalledCount: 2 remains as a safety net for the Windows DLL cold start on the first Claude spawn
- If throughput becomes a bottleneck, the proper fix is to scope the pid file per connectedChannelId (one file per channel rather than one per agent role), which would allow concurrent jobs across different channels without interference
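That future per-channel scoping could be as small as the sketch below. The helper name `pidFilePath` and the file-naming scheme are hypothetical; only the idea of keying the pid file on connectedChannelId comes from the note above.

```typescript
import * as path from "node:path";

// Hypothetical per-channel pid-file path: jobs for different channels get
// distinct files, so concurrent cross-channel jobs cannot kill each other,
// while restart-orphan cleanup still works within a single channel.
function pidFilePath(cwd: string, connectedChannelId: string): string {
  return path.join(cwd, `.agent-${connectedChannelId}.pid`);
}
```

Two jobs for the same channel would still contest one file, so per-channel scoping would also need concurrency limits keyed on the channel (e.g. a per-channel job group) rather than a worker-wide concurrency of 1.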