
E2E Run #1 — Issue Log

Date: 2026-05-06
Tenant: Leadmetrics (cmotpjzgi0004w1dg5ipql5rd)
Pipeline stage reached: Activities approved → Content workers running


Summary

#  | Worker                   | Severity    | Status | Title
1  | activity-worker          | 🔴 Critical | Open   | JSON output shape inconsistency — bare array vs wrapper object
2  | activity-worker          | 🔴 Critical | Open   | Retry prompt missing activity field schema — Claude drifts on field names
3  | activity-worker          | 🔴 Critical | Open   | No pre-write validation — agentQueue/label null causes Prisma crash
4  | activity-worker          | 🔴 Critical | Open   | Claude generates 60+ activities for 12 templates — output truncation
5  | website-crawler          | 🔴 Critical | Open   | concurrency: 3 + shared cwd per tenant — .agent.pid race kills jobs
6  | backlink-outreach-writer | 🔴 Critical | Open   | concurrency: 3 + shared cwd per tenant — same .agent.pid race
7  | multiple workers         | 🟡 Medium   | Open   | BullMQ lock expiry on long-running jobs — “Missing key” on moveToFinished
8  | all content workers      | 🟡 Medium   | Open   | RAG context fetch 404 — Azure AI Search index missing for tenant
9  | keyword-researcher       | 🟡 Medium   | Open   | Invalid JSON output on first job — raw text saved, no structured groups
10 | notifications            | 🟡 Medium   | Open   | 8 of 11 email template slugs missing from seed — SendGrid 400 on all pipeline emails
11 | all workers              | 🟢 Low      | Open   | taskkill error output leaks into agents log

Issue Details


#1 — Activity planner: Bare array vs wrapper object

Worker: packages/agents/src/workers/activity.worker.ts
Severity: 🔴 Critical — causes retry on every run, doubles cost and runtime

Symptom:

WARN [activity-worker] JSON parse failed on attempt 1 — retrying err: "Error: 'activities' field is not an array"

Root cause:
Claude returns a top-level JSON array [{...}, {...}] on first attempt instead of the expected wrapper {"activities": [...]}. The extractJson<T> helper finds the first { character inside the array and parses that single element as the root object. The root object has no activities key, so the schema check fails.

extractJson limitation: It uses brace-depth matching starting from the first { found. It cannot handle top-level arrays — it will always find the { inside the first array element and parse that as the root.

Impact: Every run incurs at least one internal retry. Each retry is a full ~7-minute Claude invocation. Observed across outer retry attempts too — Claude consistently produces arrays on first attempt.
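A minimal sketch of a shape-tolerant extractor (the function name and bracket heuristic here are illustrative, not the current extractJson<T> helper): accept either root shape and normalize to the wrapper.

// Hypothetical: accepts a top-level array or the expected
// {"activities": [...]} wrapper, and normalizes to the wrapper shape.
function extractActivities(raw: string): { activities: unknown[] } {
  // Start at the first "[" or "{", whichever comes first; a leading "["
  // means the model emitted a bare array root.
  const starts = [raw.indexOf("["), raw.indexOf("{")].filter((i) => i !== -1);
  if (starts.length === 0) throw new Error("no JSON found in output");
  const start = Math.min(...starts);
  const end = raw.lastIndexOf(raw[start] === "[" ? "]" : "}");
  if (end <= start) throw new Error("unbalanced JSON in output");

  const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
  if (Array.isArray(parsed)) return { activities: parsed }; // bare array root
  const obj = parsed as { activities?: unknown };
  if (Array.isArray(obj.activities)) return { activities: obj.activities };
  throw new Error("'activities' field is not an array");
}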


#2 — Activity planner: Retry prompt missing activity field schema

Worker: packages/agents/src/workers/activity.worker.ts
Severity: 🔴 Critical — causes PrismaClientValidationError on retry attempts

Symptom:

ERROR [activity-worker] Activity planner job failed PrismaClientValidationError: Argument `agentQueue` is missing. data: { deliverableType: "keyword_research", label: undefined, ... }

Root cause:
The retry prompt (line ~437) only includes the top-level wrapper schema { "activities": [...] }. It does not re-state the full ActivitySpec field schema. On retries, Claude uses different field names:

Correct               | Claude used on retry
label                 | title
inputHints            | inputs
dueDate               | scheduledDate
agentQueue (required) | omitted for invented type "keyword_research"

Impact: Retry attempt writes to Prisma with label: undefined and missing agentQueue → crash.
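One way to harden the retry path is to restate the full per-activity schema, not just the wrapper. A sketch of such a prompt fragment, with field names taken from the drift table above (the constant name and wording are illustrative):

const RETRY_SCHEMA = `
Return ONLY a JSON object of this exact shape:
{
  "activities": [{
    "tempId": "act_1",
    "label": "...",            // REQUIRED, not "title"
    "deliverableType": "...",  // one of the 12 provided templates only
    "agentQueue": "...",       // REQUIRED, never omit
    "inputHints": ["..."],     // not "inputs"
    "dueDate": "YYYY-MM-DD"    // not "scheduledDate"
  }]
}`;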


#3 — Activity planner: No pre-write field validation

Worker: packages/agents/src/workers/activity.worker.ts:539
Severity: 🔴 Critical — unguarded crash to DB

Symptom: Same PrismaClientValidationError as #2.

Root cause:
Activity.agentQueue String and Activity.label String are non-nullable in the Prisma schema. The worker passes Claude’s output directly into db.activity.create() with no runtime check that required fields are present and non-undefined. Any time Claude omits or misnames a field the write throws immediately.

Fix approach: Validate each ActivitySpec before the db.$transaction — filter out or throw on records where agentQueue or label is undefined/empty.
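A minimal pre-write guard, assuming an ActivitySpec shape with the fields above (the type and error handling are illustrative):

interface ActivitySpec {
  label?: string;
  agentQueue?: string;
  [key: string]: unknown;
}

// Reject any spec missing the non-nullable columns before db.$transaction,
// so a bad parse surfaces as a retryable error instead of a Prisma crash.
function assertWritable(specs: ActivitySpec[]): void {
  const bad = specs.filter((s) => !s.label?.trim() || !s.agentQueue?.trim());
  if (bad.length > 0) {
    throw new Error(`${bad.length} activity spec(s) missing required label/agentQueue`);
  }
}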


#4 — Activity planner: Claude generates 60+ activities

Worker: packages/agents/src/workers/activity.worker.ts
Severity: 🔴 Critical — output truncation causes parse failure

Symptom:

INFO [activity-worker] Claude execution finished outputChars: 7067 outputPreview: "on editorial and niche link opportunities\"}},{\"tempId\":\"act_64\",..." err: "Error: 'activities' field is not an array"

Root cause:
The prompt has 12 deliverable templates, but Claude generates 64+ activities (one per topic/keyword variant, not per template). The output grows to tens of thousands of characters. The Claude adapter captures only the tail of the output stream (7067 chars), so extractJson receives a fragment starting mid-array, around act_55 to act_64. Parsing the fragment finds the first { inside the truncated array → same wrapper failure as #1.

Observed: act_64 visible in outputPreview; the full output likely exceeded 60,000 chars.

Fix approach: Cap activities per deliverable type in the prompt (e.g. “max 2 per template”) and add an explicit count limit instruction.
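A sketch of that constraint as a prompt fragment (the constant name and wording are illustrative):

// Hypothetical prompt addition: bound the output so it fits the adapter's
// capture window instead of fanning out one activity per keyword variant.
const COUNT_CONSTRAINT = `
Generate AT MOST 2 activities per deliverable template (24 activities maximum).
Do NOT create a separate activity per topic or keyword variant.`;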


#5 — Website crawler: .agent.pid race with concurrency: 3

Worker: packages/agents/src/workers/website-crawler.worker.ts:277
Severity: 🔴 Critical — concurrent crawlers kill each other

Symptom:

WARN [website-crawler-worker] Crawl failed — prospect skipped domain: "g2.com" error: "Process exited with code 1"
WARN [website-crawler-worker] Could not parse crawl output — skipping prospect domain: "medianama.com"

Root cause:
Identical to the channel-action-suggester bug (fixed 2026-05-06). The execute.ts adapter writes a .agent.pid file in cwd on each Claude spawn. On the next spawn in the same cwd, it reads this file and issues taskkill /F /T /PID <old>.

The crawler uses:

const cwd = path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE);

All crawls for the same tenant share one cwd. With concurrency: 3, when jobs B and C start while job A is running:

  • B reads A’s .agent.pid → taskkill kills A → A exits code 1
  • C reads B’s .agent.pid → taskkill kills B → B exits code 1 or produces partial output

The “Could not parse crawl output” variant occurs when Claude is killed mid-output — it exits with code 0 but stdout is truncated/garbled.

Observed: g2.com, yourstory.com, medianama.com, digitalvidya.com (first batch), nasscom.in, lighthouseinsights.in, entrepreneur.com all failed on first attempt. Several succeeded on retries when running serially.

Code comment (incorrect): Line 277 says // crawl up to 3 domains in parallel per server — the pid-file design prevents this.

Fix options:

  1. concurrency: 1 — correct but serializes all crawls (18 prospects × ~30s = ~9 min per batch)
  2. Scope cwd by backlinkId (e.g. path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE, backlinkId)) — allows true concurrency without pid collision, but breaks server-restart orphan cleanup for that worker; see the sketch below
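A sketch of option 2, reusing the worker's existing constants (AGENT_ROLE is declared here only to keep the snippet self-contained):

import os from "node:os";
import path from "node:path";

const AGENT_ROLE = "website-crawler"; // as in the worker

// Hypothetical per-job cwd: scoping by backlinkId gives each concurrent crawl
// its own .agent.pid file, so spawns can no longer kill each other.
function crawlCwd(tenantId: string, backlinkId: string): string {
  return path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE, backlinkId);
}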

#6 — Outreach writer: .agent.pid race with concurrency: 3

Worker: packages/agents/src/workers/backlink-outreach-writer.worker.ts:316
Severity: 🔴 Critical — same root cause as #5

Symptom:

ERROR [backlink-outreach-writer-worker] Outreach writer job failed Error: Process exited with code 1

Root cause:
Same .agent.pid shared-cwd + concurrency:3 pattern.

const cwd = path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE);
concurrency: 3

Outreach writers are queued immediately when a crawler completes. Since the crawler runs with concurrency:3 (even if effectively serialized by the pid-file), multiple crawlers complete in quick succession and each queues an outreach writer job. Those outreach writers then start concurrently and kill each other.

Observed: 9 outreach writer failures during this session. All retried and eventually succeeded (BullMQ exponential backoff).

Pattern note: Any worker using path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE) with concurrency > 1 has this issue. The architectural rule (see project_claude_adapter_pid_file.md memory) applies: one Claude process per cwd at a time.


#7 — BullMQ lock expiry on long-running jobs

Workers affected: setup (client-researcher, competitor-researcher, context-file-writer), insight workers, deliverable-planner, strategy-writer, activity-planner, ai-visibility-seeder, opportunity-matcher
Severity: 🟡 Medium — jobs complete but are marked failed; credits wasted on retries

Symptom:

Error: could not renew lock for job deliverable-planner__...
Error: Missing key for job deliverable-planner__.... moveToFinished
Error: Missing key for job deliverable-planner__.... moveToDelayed

Root cause:
BullMQ’s stall checker runs periodically and reclaims the lock key of any job that hasn’t renewed it within lockDuration. Long-running jobs (deliverable planner: ~14 min, activity planner: ~8 min, strategy writer: ~7+ min) exceed their lockDuration. When the worker finishes and calls moveToFinished, the key is gone; the job is effectively orphaned in BullMQ’s state and is retried.

Impact: Data is written correctly (the Claude output was processed), but BullMQ retries the job anyway. The retry re-runs Claude from scratch, producing duplicate writes wherever dedup logic is missing and wasting credit spend.

Fix approach: Increase lockDuration and lockRenewTime proportionally for each worker based on observed P95 execution time. E.g. deliverable planner: lockDuration: 1_200_000 (20 min).
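A sketch of those worker options using BullMQ's lockDuration / lockRenewTime (the connection and processor are placeholders; the values are the example above, not tuned P95 figures):

import { Worker, type Job } from "bullmq";

// Placeholder for the existing processor.
async function processDeliverablePlan(job: Job): Promise<void> { /* ... */ }

const worker = new Worker("deliverable-planner", processDeliverablePlan, {
  connection: { host: "localhost", port: 6379 }, // placeholder
  lockDuration: 1_200_000, // 20 min: must exceed the longest observed run (~14 min)
  lockRenewTime: 300_000,  // renew every 5 min, well inside lockDuration
});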


#8 — RAG context fetch: 404 on all content workers

Workers affected: strategy-writer, content-worker (all agent roles), landing-page-writer
Severity: 🟡 Medium — non-fatal but all content generated without client website context

Symptom:

WARN [content-worker] RAG context fetch failed — proceeding without it agentRole: "keyword-researcher"
NotFoundError: 404 Resource not found apim-request-id: ... status: 404

Root cause:
The content workers call search() from @leadmetrics/feature-search to fetch relevant website content from an Azure AI Search index. The index for this tenant does not exist — either:

  1. The website crawl data was never ingested into the RAG index (prior ragengine-missing-openai-key issue may have caused this), or
  2. The Azure AI Search deployment/index name for this environment is misconfigured

The 404 comes from Azure OpenAI’s APIM gateway when the embedder (embedder.ts:63) calls the embeddings endpoint for a query vector.

Impact: All content workers proceed without RAG context. Blog posts, landing pages, emails, ads are generated using only the context injected directly in the prompt (strategy/goals/client context) — not from crawled website content.

Related: ragengine-missing-openai-key.md (✅ fixed) — the RAG engine env vars were added but the index may not have been back-filled after the fix.
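A quick diagnostic to separate the two hypotheses, using @azure/search-documents (the endpoint, key, and index name are placeholders):

import { SearchIndexClient, AzureKeyCredential } from "@azure/search-documents";

const client = new SearchIndexClient(
  process.env.AZURE_SEARCH_ENDPOINT!,
  new AzureKeyCredential(process.env.AZURE_SEARCH_KEY!)
);

// If the tenant index exists, the 404 points at the embeddings deployment on
// the APIM gateway (cause 2); if not, the crawl data was never ingested (cause 1).
async function checkIndex(indexName: string): Promise<void> {
  try {
    await client.getIndex(indexName);
    console.log(`index "${indexName}" exists; suspect the embeddings deployment name`);
  } catch {
    console.log(`index "${indexName}" missing; crawl data was never ingested`);
  }
}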


#9 — Keyword researcher: Invalid JSON output (non-retried)

Worker: packages/agents/src/workers/keyword-researcher.worker.ts (via content.worker.ts)
Severity: 🟡 Medium — keyword data silently missing, no retry

Symptom:

WARN [keyword-researcher-worker] keyword-researcher output is not valid JSON — saving raw as outputPayload only activityId: "cmotu004h00eaw1pk0q5mmacw"

Root cause:
Claude occasionally returns keyword research output as prose, markdown-fenced JSON (```json ... ```), or a mix of explanation + JSON. The keyword researcher worker’s JSON parser fails. The fallback saves the raw output string to outputPayload on the activity record and marks the job complete (no error, no retry).

Impact: Structured keyword groups are not saved to the DB for that activity. Downstream tasks that depend on keyword groups receive nothing. The first keyword-researcher job (activity cmotu004h00eaw1pk0q5mmacw) silently produced no usable output; the second job for a different activity succeeded.

Fix approach: Strip markdown fences before parsing (same pattern as channel-action-suggester does for its output). If still invalid, retry rather than silently saving raw text.
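A minimal fence-stripping step, modeled on the channel-action-suggester pattern mentioned above (this version is illustrative):

// Pull the payload out of ```json fences, or fall back to the outermost braces
// when the model mixes prose and JSON.
function stripFences(raw: string): string {
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fenced) return fenced[1].trim();
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  return start !== -1 && end > start ? raw.slice(start, end + 1) : raw.trim();
}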


#10 — Email templates: 8 of 11 slugs missing from seed

Service: packages/db/src/seed.ts, packages/agents/src/workers/
Severity: 🟡 Medium — DMs never receive pipeline email notifications

Symptom:

WARN [notifications:email-loader] Email template not found — using slug as subject fallback slug: "activity_pipeline_dm_review" ERROR [notifications:email] Email job failed 400 Bad Request "The content value must be a string at least one character in length." field: "content.0.value"

Root cause:
The seed file (packages/db/src/seed.ts) defines only 3 of the 11 email template slugs used by the agent workers. When a template slug is not found in the DB, the email loader falls back to using the slug itself as the subject with an empty HTML/text body. SendGrid rejects this with a 400.

Seeded vs missing:

Slug                        | Worker                     | Seeded?
context_ready               | setup.worker               | ✅
social_post_published       | social-publisher           | ✅
performance_report_ready    | performance-report-writer  | ✅
deliverable_plan_ready      | strategy.worker            | ❌
strategy_ready              | strategy-writer            | ❌
activity_pipeline_dm_review | activity.worker            | ❌
activity_pipeline_ready     | activity.worker            | ❌
activity_pipeline_failed    | activity.worker            | ❌
channel_suggestions_ready   | channel-action-suggester   | ❌
report-ready                | custom-report-writer       | ❌
seo-rank-alert              | gsc-keywords-snapshot      | ❌

Observed failures this session: deliverable_plan_ready (from strategy.worker after deliverable plan created) and activity_pipeline_dm_review (from activity.worker after activities set to dm_review).

Impact: All pipeline milestone email notifications silently fail. DMs only receive email for onboarding context completion and social post / performance report events.
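A sketch of the missing seed entries; the slugs match the table above, while subjects and bodies are placeholders (SendGrid only requires the content to be non-empty):

// Hypothetical additions to packages/db/src/seed.ts.
const missingEmailTemplates = [
  { slug: "deliverable_plan_ready",      subject: "Your deliverable plan is ready",  html: "<p>…</p>" },
  { slug: "strategy_ready",              subject: "Your strategy is ready",          html: "<p>…</p>" },
  { slug: "activity_pipeline_dm_review", subject: "Activities awaiting your review", html: "<p>…</p>" },
  { slug: "activity_pipeline_ready",     subject: "Your activity pipeline is ready", html: "<p>…</p>" },
  { slug: "activity_pipeline_failed",    subject: "Activity pipeline failed",        html: "<p>…</p>" },
  { slug: "channel_suggestions_ready",   subject: "New channel suggestions",         html: "<p>…</p>" },
  { slug: "report-ready",                subject: "Your report is ready",            html: "<p>…</p>" },
  { slug: "seo-rank-alert",              subject: "SEO rank alert",                  html: "<p>…</p>" },
];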


#11 — taskkill error output leaks into agents log

Service: packages/adapters/claude-local/src/server/execute.ts
Severity: 🟢 Low — cosmetic noise only

Symptom:

ERROR: The process with PID 5776 (child process of PID 2948) could not be terminated. Reason: The operation attempted is not supported.
ERROR: The process with PID 2948 (child process of PID 30672) could not be terminated. Reason: There is no running instance of the task.

Root cause:
execute.ts runs taskkill /F /T /PID <oldPid> to clean up orphaned Claude subprocesses. When the old PID has already exited (normal case: previous job finished cleanly), taskkill outputs error lines to its stdout/stderr. These lines are not captured/suppressed by execute.ts and flow directly into the process stdout, appearing in the agents log.

Impact: Log noise only. No functional impact. Appears alongside every pid-file cleanup where the previous process had already exited.
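A sketch of a quieter cleanup call for execute.ts (the helper name and message filtering are illustrative):

import { execFile } from "node:child_process";

// Capture taskkill's output instead of letting it inherit the agent process's
// stdio, and drop the expected "already exited" noise.
function killStalePid(pid: number): void {
  execFile("taskkill", ["/F", "/T", "/PID", String(pid)], (err, stdout, stderr) => {
    const out = `${stdout}${stderr}`;
    // "No running instance" / "could not be terminated" for an exited PID is
    // the normal case after a clean previous job: suppress it.
    if (err && !/no running instance|could not be terminated/i.test(out)) {
      console.warn("[claude-local] taskkill failed:", out.trim());
    }
  });
}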


Timeline

Time   | Event
~13:35 | Channel action suggester race condition (concurrency: 3 → 1 fix applied)
~13:50 | Deliverable planner starts
~14:06 | Deliverable planner loses BullMQ lock mid-run (Missing key errors)
~14:10 | Deliverable planner completes (data written despite lock expiry)
~14:20 | Activity planner starts (outer attempt 1)
~14:24 | Activity planner fails — PrismaClientValidationError (missing agentQueue)
~14:24 | Activity planner outer retry 2 starts
~14:31 | Activity planner attempt 1 → “activities field is not an array” (output truncated at act_64)
~14:34 | Activity planner attempt 2 → 36 activities parsed, 48 total with action items
~14:34 | Activities in dm_review, requireActivityApproval: true
~14:35 | activity_pipeline_dm_review email fails → SendGrid 400 (template missing)
~14:50 | DM Reviewer approves activity pipeline
~14:50 | 8 content workers dispatch simultaneously, all get RAG 404 (non-fatal)
~14:51 | GBP post, email writer, landing page writer complete
~14:52 | Backlink researcher #1 completes (18 prospects), crawlers start
~14:52 | Crawler race begins — g2.com, yourstory.com killed by concurrent crawlers
~14:52 | Outreach writer race begins — jobs start killing each other
~14:53 | keyword-researcher #1: invalid JSON, raw saved
~14:53 | content-brief-writer #1, meta-ads-writer complete
~14:55 | First backlink campaign: 6 emails drafted, moved to dm_review
~14:55 | keyword-researcher #2: groups saved successfully
~14:56 | google-ads-writer, backlink-researcher #2, #3 complete
~14:58 | Pipeline still running (second + third batch crawls, outreach writers)

© 2026 Leadmetrics — Internal use only