# E2E Run #1 — Issue Log
Date: 2026-05-06
Tenant: Leadmetrics (cmotpjzgi0004w1dg5ipql5rd)
Pipeline stage reached: Activities approved → Content workers running
## Summary
| # | Worker | Severity | Status | Title |
|---|---|---|---|---|
| 1 | activity-worker | 🔴 Critical | Open | JSON output shape inconsistency — bare array vs wrapper object |
| 2 | activity-worker | 🔴 Critical | Open | Retry prompt missing activity field schema — Claude drifts on field names |
| 3 | activity-worker | 🔴 Critical | Open | No pre-write validation — agentQueue/label null causes Prisma crash |
| 4 | activity-worker | 🔴 Critical | Open | Claude generates 60+ activities for 12 templates — output truncation |
| 5 | website-crawler | 🔴 Critical | Open | concurrency: 3 + shared cwd per tenant — .agent.pid race kills jobs |
| 6 | backlink-outreach-writer | 🔴 Critical | Open | concurrency: 3 + shared cwd per tenant — same .agent.pid race |
| 7 | multiple workers | 🟡 Medium | Open | BullMQ lock expiry on long-running jobs — “Missing key” on moveToFinished |
| 8 | all content workers | 🟡 Medium | Open | RAG context fetch 404 — Azure AI Search index missing for tenant |
| 9 | keyword-researcher | 🟡 Medium | Open | Invalid JSON output on first job — raw text saved, no structured groups |
| 10 | notifications | 🟡 Medium | Open | 8 of 11 email template slugs missing from seed — SendGrid 400 on all pipeline emails |
| 11 | all workers | 🟢 Low | Open | taskkill error output leaks into agents log |
## Issue Details
### #1 — Activity planner: Bare array vs wrapper object
Worker: `packages/agents/src/workers/activity.worker.ts`
Severity: 🔴 Critical — causes retry on every run, doubles cost and runtime
Symptom:

```
WARN [activity-worker] JSON parse failed on attempt 1 — retrying
err: "Error: 'activities' field is not an array"
```

Root cause:
Claude returns a top-level JSON array `[{...}, {...}]` on the first attempt instead of the expected wrapper `{"activities": [...]}`. The `extractJson<T>` helper finds the first `{` character inside the array and parses that single element as the root object. That object has no `activities` key, so the schema check fails.
`extractJson` limitation: it uses brace-depth matching starting from the first `{` found. It cannot handle top-level arrays, so it will always find the `{` inside the first array element and parse that element as the root.
Impact: Every run incurs at least one internal retry. Each retry is a full ~7-minute Claude invocation. Observed across outer retry attempts too — Claude consistently produces arrays on first attempt.
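A tolerant parser illustrating the fix is sketched below. `extractJsonLoose` and `parseActivities` are hypothetical names, not the real helper's API: the idea is to scan for the first `{` or `[` (whichever comes first), brace-match from there, and normalize a bare top-level array into the expected wrapper shape.

```typescript
// Sketch only: scans for the first '{' OR '[' so a bare top-level array
// is parsed whole instead of its first element being taken as the root.
function extractJsonLoose<T>(raw: string): T {
  const starts = [raw.indexOf("{"), raw.indexOf("[")].filter((i) => i >= 0);
  if (starts.length === 0) throw new Error("no JSON value found");
  const start = Math.min(...starts);
  let depth = 0;
  let inString = false;
  for (let i = start; i < raw.length; i++) {
    const ch = raw[i];
    if (inString) {
      if (ch === "\\") i++; // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{" || ch === "[") depth++;
    else if (ch === "}" || ch === "]") {
      depth--;
      if (depth === 0) return JSON.parse(raw.slice(start, i + 1)) as T;
    }
  }
  throw new Error("unterminated JSON value");
}

// Normalize: accept either {"activities": [...]} or a bare top-level array.
function parseActivities(raw: string): unknown[] {
  const value = extractJsonLoose<unknown>(raw);
  if (Array.isArray(value)) return value;
  const wrapped = value as { activities?: unknown };
  if (Array.isArray(wrapped.activities)) return wrapped.activities;
  throw new Error("'activities' field is not an array");
}
```

With this shape, a bare-array first attempt parses cleanly and the ~7-minute internal retry disappears.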
### #2 — Activity planner: Retry prompt missing activity field schema
Worker: `packages/agents/src/workers/activity.worker.ts`
Severity: 🔴 Critical — causes PrismaClientValidationError on retry attempts
Symptom:

```
ERROR [activity-worker] Activity planner job failed
PrismaClientValidationError: Argument `agentQueue` is missing.
data: { deliverableType: "keyword_research", label: undefined, ... }
```

Root cause:
The retry prompt (line ~437) only includes the top-level wrapper schema `{ "activities": [...] }`. It does not re-state the full `ActivitySpec` field schema. On retries, Claude uses different field names:
| Correct | Claude used on retry |
|---|---|
| `label` | `title` |
| `inputHints` | `inputs` |
| `dueDate` | `scheduledDate` |
| `agentQueue` (required) | omitted for invented type `"keyword_research"` |
Impact: The retry attempt writes to Prisma with `label: undefined` and a missing `agentQueue` → crash.
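One mitigation is to re-state the full per-activity field schema verbatim in the retry prompt, not just the wrapper. The constant below is an illustrative sketch: the field list is taken from the table above, but the wording and constant name are invented.

```typescript
// Illustrative retry-prompt fragment pinning Claude to exact field names.
// Field list comes from this issue's table; the phrasing is not from the codebase.
const ACTIVITY_FIELD_SCHEMA = `
Return ONLY a JSON object of the form {"activities": [ ... ]}.
Every activity object MUST use exactly these field names:
  "tempId"      string, required
  "label"       string, required (NOT "title")
  "agentQueue"  string, required, never omit it, even for new deliverable types
  "inputHints"  object (NOT "inputs")
  "dueDate"     ISO date string (NOT "scheduledDate")
Do not invent, rename, or omit fields.
`.trim();
```

Appending this block to every retry prompt removes the ambiguity that lets Claude drift to `title`/`inputs`/`scheduledDate`.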
### #3 — Activity planner: No pre-write field validation
Worker: `packages/agents/src/workers/activity.worker.ts:539`
Severity: 🔴 Critical — unguarded crash to DB
Symptom: Same PrismaClientValidationError as #2.
Root cause:
`Activity.agentQueue` and `Activity.label` are non-nullable `String` fields in the Prisma schema. The worker passes Claude’s output directly into `db.activity.create()` with no runtime check that required fields are present and non-undefined. Any time Claude omits or misnames a field, the write throws immediately.
Fix approach: Validate each `ActivitySpec` before the `db.$transaction` — filter out or throw on records where `agentQueue` or `label` is undefined/empty.
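A minimal pre-write guard could look like this. `ActivitySpec` here is a stub containing only the fields this log mentions, and `partitionValidSpecs` is a hypothetical helper name:

```typescript
// Stub type: only the fields named in this issue log.
interface ActivitySpec {
  tempId: string;
  label?: string;
  agentQueue?: string;
  deliverableType: string;
}

// Returns only specs safe to hand to db.activity.create(); collects the
// rejects so the caller can log them or fail the job explicitly.
function partitionValidSpecs(specs: ActivitySpec[]): {
  valid: ActivitySpec[];
  invalid: ActivitySpec[];
} {
  const valid: ActivitySpec[] = [];
  const invalid: ActivitySpec[] = [];
  for (const spec of specs) {
    // Prisma schema: label and agentQueue are required, non-nullable strings.
    if (spec.label?.trim() && spec.agentQueue?.trim()) valid.push(spec);
    else invalid.push(spec);
  }
  return { valid, invalid };
}
```

Running this before the transaction turns an unguarded `PrismaClientValidationError` into an explicit, loggable rejection.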
### #4 — Activity planner: Claude generates 60+ activities
Worker: `packages/agents/src/workers/activity.worker.ts`
Severity: 🔴 Critical — output truncation causes parse failure
Symptom:

```
INFO [activity-worker] Claude execution finished
outputChars: 7067
outputPreview: "on editorial and niche link opportunities\"}},{\"tempId\":\"act_64\",..."
err: "Error: 'activities' field is not an array"
```

Root cause:
The prompt has 12 deliverable templates, but Claude generates 64+ activities (one per topic/keyword variant rather than one per template). The output grows to tens of thousands of characters. The Claude adapter captures only the tail of the output stream (7,067 chars), so `extractJson` receives a fragment starting mid-array around `act_55`–`act_64`. Parsing the fragment finds the first `{` inside a truncated array → the same wrapper failure as #1.
Observed: `act_64` visible in `outputPreview`; the full output likely exceeded 60,000 chars.
Fix approach: Cap activities per deliverable type in the prompt (e.g. “max 2 per template”) and add an explicit count limit instruction.
### #5 — Website crawler: `.agent.pid` race with `concurrency: 3`
Worker: `packages/agents/src/workers/website-crawler.worker.ts:277`
Severity: 🔴 Critical — concurrent crawlers kill each other
Symptom:

```
WARN [website-crawler-worker] Crawl failed — prospect skipped
domain: "g2.com" error: "Process exited with code 1"
WARN [website-crawler-worker] Could not parse crawl output — skipping prospect
domain: "medianama.com"
```

Root cause:
Identical to the channel-action-suggester bug (fixed 2026-05-06). The `execute.ts` adapter writes a `.agent.pid` file in the cwd on each Claude spawn. On the next spawn in the same cwd, it reads this file and issues `taskkill /F /T /PID <old>`.
The crawler uses:

```ts
const cwd = path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE);
```

All crawls for the same tenant share one cwd. With `concurrency: 3`, when jobs B and C start while job A is running:
- B reads A’s `.agent.pid` → `taskkill` kills A → A exits code 1
- C reads B’s `.agent.pid` → kills B → B exits code 1 or produces partial output
The “Could not parse crawl output” variant occurs when Claude is killed mid-output — it exits with code 0 but stdout is truncated/garbled.
Observed: g2.com, yourstory.com, medianama.com, digitalvidya.com (first batch), nasscom.in, lighthouseinsights.in, entrepreneur.com all failed on first attempt. Several succeeded on retries when running serially.
Code comment (incorrect): Line 277 says `// crawl up to 3 domains in parallel per server` — the pid-file design prevents this.
Fix options:
- `concurrency: 1` — correct but serializes all crawls (18 prospects × ~30s ≈ 9 min per batch)
- Scope `cwd` by `backlinkId` (e.g. `path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE, backlinkId)`) — allows true concurrency without pid collision, but breaks server-restart orphan cleanup for that worker
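The second option amounts to one extra path segment. A sketch, assuming the existing `tenantId`/`AGENT_ROLE` layout shown in this issue; `crawlCwd` is a hypothetical helper name:

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Scope each crawl's working directory by backlinkId so two concurrent
// Claude spawns never read or write the same .agent.pid file.
function crawlCwd(tenantId: string, agentRole: string, backlinkId: string): string {
  return path.join(os.tmpdir(), "leadmetrics-agents", tenantId, agentRole, backlinkId);
}
```

The trade-off stays as stated in the fix options: server-restart orphan cleanup that scans one directory per role would need a recursive scan instead.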
### #6 — Backlink outreach writer: `.agent.pid` race with `concurrency: 3`
Worker: `packages/agents/src/workers/backlink-outreach-writer.worker.ts:316`
Severity: 🔴 Critical — same root cause as #5
Symptom:

```
ERROR [backlink-outreach-writer-worker] Outreach writer job failed
Error: Process exited with code 1
```

Root cause:
Same `.agent.pid` shared-cwd + `concurrency: 3` pattern:

```ts
const cwd = path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE);
concurrency: 3
```

Outreach writers are queued immediately when a crawler completes. Since the crawler runs with `concurrency: 3` (even if effectively serialized by the pid file), multiple crawlers complete in quick succession and each queues an outreach writer job. Those outreach writer jobs then start concurrently and kill each other.
Observed: 9 outreach writer failures during this session. All retried and eventually succeeded (BullMQ exponential backoff).
Pattern note: Any worker using `path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE)` with `concurrency > 1` has this issue. The architectural rule (see the `project_claude_adapter_pid_file.md` memory) applies: one Claude process per cwd at a time.
### #7 — BullMQ lock expiry on long-running jobs
Workers affected: setup (client-researcher, competitor-researcher, context-file-writer), insight workers, deliverable-planner, strategy-writer, activity-planner, ai-visibility-seeder, opportunity-matcher
Severity: 🟡 Medium — jobs complete but are marked failed; credits wasted on retries
Symptom:

```
Error: could not renew lock for job deliverable-planner__...
Error: Missing key for job deliverable-planner__.... moveToFinished
Error: Missing key for job deliverable-planner__.... moveToDelayed
```

Root cause:
BullMQ’s stall checker runs periodically and reclaims the lock key of any job that hasn’t renewed it within `lockDuration`. Long-running jobs (deliverable planner ~14 min, activity planner ~8 min, strategy writer ~7+ min) exceed their `lockDuration`. When the worker finishes and calls `moveToFinished`, the key is gone — the job is effectively orphaned in BullMQ’s state and gets retried.
Impact: The data is written correctly (the Claude output was processed), but BullMQ retries the job anyway. The retry re-runs Claude from scratch → duplicate writes (guarded only where dedup logic exists) and wasted credit spend.
Fix approach: Increase `lockDuration` and `lockRenewTime` proportionally for each worker based on observed P95 execution time. E.g. deliverable planner: `lockDuration: 1_200_000` (20 min).
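A sketch of deriving the options from observed runtimes. The 1.5× headroom and the renew-at-one-third ratio are assumptions, not values from the codebase, and `lockSettingsFor` is a hypothetical helper; BullMQ's `Worker` options do accept `lockDuration` and `lockRenewTime`:

```typescript
// Derive BullMQ lock options from an observed P95 runtime with headroom.
// The 1.5x factor and 1/3 renew ratio are illustrative choices.
function lockSettingsFor(p95Ms: number): { lockDuration: number; lockRenewTime: number } {
  const lockDuration = Math.ceil(p95Ms * 1.5);
  return { lockDuration, lockRenewTime: Math.floor(lockDuration / 3) };
}

// e.g. deliverable planner at ~14 min P95 → ~21 min lock, renewed every ~7 min
const plannerLock = lockSettingsFor(14 * 60_000);
```

These would then be spread into each worker's options, e.g. `new Worker(queueName, processor, { connection, ...lockSettingsFor(p95Ms) })`.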
### #8 — RAG context fetch: 404 on all content workers
Workers affected: strategy-writer, content-worker (all agent roles), landing-page-writer
Severity: 🟡 Medium — non-fatal but all content generated without client website context
Symptom:

```
WARN [content-worker] RAG context fetch failed — proceeding without it
agentRole: "keyword-researcher"
NotFoundError: 404 Resource not found
apim-request-id: ...
status: 404
```

Root cause:
The content workers call `search()` from `@leadmetrics/feature-search` to fetch relevant website content from an Azure AI Search index. The index for this tenant does not exist — either:
- The website crawl data was never ingested into the RAG index (the prior `ragengine-missing-openai-key` issue may have caused this), or
- The Azure AI Search deployment/index name for this environment is misconfigured
The 404 comes from Azure OpenAI’s APIM gateway when the embedder (`embedder.ts:63`) calls the embeddings endpoint for a query vector.
Impact: All content workers proceed without RAG context. Blog posts, landing pages, emails, ads are generated using only the context injected directly in the prompt (strategy/goals/client context) — not from crawled website content.
Related: `ragengine-missing-openai-key.md` (✅ fixed) — the RAG engine env vars were added, but the index may not have been back-filled after the fix.
### #9 — Keyword researcher: Invalid JSON output (non-retried)
Worker: `packages/agents/src/workers/keyword-researcher.worker.ts` (via `content.worker.ts`)
Severity: 🟡 Medium — keyword data silently missing, no retry
Symptom:

```
WARN [keyword-researcher-worker] keyword-researcher output is not valid JSON — saving raw as outputPayload only
activityId: "cmotu004h00eaw1pk0q5mmacw"
```

Root cause:
Claude occasionally returns keyword research output as prose, as markdown-fenced JSON (`` ```json ... ``` ``), or as a mix of explanation and JSON. The keyword researcher worker’s JSON parser fails on all of these. The fallback saves the raw output string to `outputPayload` on the activity record and marks the job complete (no error, no retry).
Impact: Structured keyword groups are not saved to the DB for that activity. Downstream tasks that depend on keyword groups receive nothing. The first keyword-researcher job (activity cmotu004h00eaw1pk0q5mmacw) silently produced no usable output; the second job for a different activity succeeded.
Fix approach: Strip markdown fences before parsing (the same pattern channel-action-suggester uses for its output). If the result is still invalid, retry rather than silently saving raw text.
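A sketch of the fence-stripping step; `parseJsonLenient` is a hypothetical helper, and the real channel-action-suggester code may differ:

```typescript
// Strip a surrounding markdown fence, then parse; on failure, try again
// from the first '{' or '[' to handle mixed prose + JSON output.
function parseJsonLenient(raw: string): unknown {
  let text = raw.trim();
  const fenced = text.match(/^```(?:json)?\s*\n([\s\S]*?)\n```$/);
  if (fenced) text = fenced[1];
  try {
    return JSON.parse(text);
  } catch {
    const start = text.search(/[{[]/);
    if (start === -1) throw new Error("no JSON value found in output");
    return JSON.parse(text.slice(start)); // may still throw → let the job retry
  }
}
```

Throwing (instead of saving raw text) lets BullMQ's normal retry path handle the bad attempt, so keyword groups are never silently missing.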
### #10 — Email templates: 8 of 11 slugs missing from seed
Service: `packages/db/src/seed.ts`, `packages/agents/src/workers/`
Severity: 🟡 Medium — DMs never receive pipeline email notifications
Symptom:

```
WARN [notifications:email-loader] Email template not found — using slug as subject fallback
slug: "activity_pipeline_dm_review"
ERROR [notifications:email] Email job failed
400 Bad Request
"The content value must be a string at least one character in length."
field: "content.0.value"
```

Root cause:
The seed file (`packages/db/src/seed.ts`) defines only 3 of the 11 email template slugs used by the agent workers. When a template slug is not found in the DB, the email loader falls back to using the slug itself as the subject with an empty HTML/text body, which SendGrid rejects with a 400.
Seeded vs missing:
| Slug | Worker | Seeded? |
|---|---|---|
| `context_ready` | setup.worker | ✅ |
| `social_post_published` | social-publisher | ✅ |
| `performance_report_ready` | performance-report-writer | ✅ |
| `deliverable_plan_ready` | strategy.worker | ❌ |
| `strategy_ready` | strategy-writer | ❌ |
| `activity_pipeline_dm_review` | activity.worker | ❌ |
| `activity_pipeline_ready` | activity.worker | ❌ |
| `activity_pipeline_failed` | activity.worker | ❌ |
| `channel_suggestions_ready` | channel-action-suggester | ❌ |
| `report-ready` | custom-report-writer | ❌ |
| `seo-rank-alert` | gsc-keywords-snapshot | ❌ |
Observed failures this session: deliverable_plan_ready (from strategy.worker after deliverable plan created) and activity_pipeline_dm_review (from activity.worker after activities set to dm_review).
Impact: All pipeline milestone email notifications silently fail. DMs only receive email for onboarding context completion and social post / performance report events.
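The seed gap can be computed mechanically from the two lists; a sketch with the slug lists transcribed from the table in this issue:

```typescript
// Slugs the workers reference (transcribed from this issue's table).
const REQUIRED_SLUGS = [
  "context_ready", "social_post_published", "performance_report_ready",
  "deliverable_plan_ready", "strategy_ready", "activity_pipeline_dm_review",
  "activity_pipeline_ready", "activity_pipeline_failed",
  "channel_suggestions_ready", "report-ready", "seo-rank-alert",
];
// Slugs the seed file currently creates.
const SEEDED_SLUGS = ["context_ready", "social_post_published", "performance_report_ready"];

// The 8 slugs the seed needs to add (each with a real subject and body).
const missingSlugs = REQUIRED_SLUGS.filter((slug) => !SEEDED_SLUGS.includes(slug));
```

A seed fix would iterate `missingSlugs` and upsert a template row per slug so the email loader never falls back to an empty body.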
### #11 — taskkill error output leaks into agents log
Service: `packages/adapters/claude-local/src/server/execute.ts`
Severity: 🟢 Low — cosmetic noise only
Symptom:

```
ERROR: The process with PID 5776 (child process of PID 2948) could not be terminated.
Reason: The operation attempted is not supported.
ERROR: The process with PID 2948 (child process of PID 30672) could not be terminated.
Reason: There is no running instance of the task.
```

Root cause:
`execute.ts` runs `taskkill /F /T /PID <oldPid>` to clean up orphaned Claude subprocesses. When the old PID has already exited (the normal case: the previous job finished cleanly), `taskkill` emits error lines on its stdout/stderr. These lines are not captured or suppressed by `execute.ts` and flow directly into the process stdout, appearing in the agents log.
Impact: Log noise only. No functional impact. Appears alongside every pid-file cleanup where the previous process had already exited.
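One way to silence the noise is to run `taskkill` with piped (captured) stdio and only log genuinely unexpected failures. A sketch: the helper name and the "no running instance" filter are illustrative, and on non-Windows hosts `spawnSync` reports an error and the helper simply returns.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical cleanup helper: capture taskkill's output instead of
// letting it inherit our stdio and leak into the agents log.
function killStalePidTree(pid: number): void {
  const res = spawnSync("taskkill", ["/F", "/T", "/PID", String(pid)], {
    encoding: "utf8",
    stdio: ["ignore", "pipe", "pipe"], // pipe = capture, do not inherit
  });
  if (res.error) return; // taskkill unavailable (non-Windows host)
  // A non-zero exit for an already-gone PID is the normal post-job case;
  // only surface exits that do not match that expected message.
  if (res.status !== 0 && !/no running instance/i.test(res.stderr ?? "")) {
    console.warn(`taskkill exited ${res.status}: ${(res.stderr ?? "").trim()}`);
  }
}
```

The captured output could also be routed to a debug-level logger instead of being dropped entirely.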
## Timeline
| Time | Event |
|---|---|
| ~13:35 | Channel action suggester race condition (concurrency:3→1 fix applied) |
| ~13:50 | Deliverable planner starts |
| ~14:06 | Deliverable planner loses BullMQ lock mid-run (Missing key errors) |
| ~14:10 | Deliverable planner completes (data written despite lock expiry) |
| ~14:20 | Activity planner starts (outer attempt 1) |
| ~14:24 | Activity planner fails — PrismaClientValidationError (missing agentQueue) |
| ~14:24 | Activity planner outer retry 2 starts |
| ~14:31 | Activity planner attempt 1 → “activities field is not an array” (output truncated at act_64) |
| ~14:34 | Activity planner attempt 2 → 36 activities parsed, 48 total with action items |
| ~14:34 | Activities in dm_review, requireActivityApproval: true |
| ~14:35 | activity_pipeline_dm_review email fails → SendGrid 400 (template missing) |
| ~14:50 | DM Reviewer approves activity pipeline |
| ~14:50 | 8 content workers dispatch simultaneously, all get RAG 404 (non-fatal) |
| ~14:51 | GBP post, email writer, landing page writer complete |
| ~14:52 | Backlink researcher #1 completes (18 prospects), crawlers start |
| ~14:52 | Crawler race begins — g2.com, yourstory.com killed by concurrent crawlers |
| ~14:52 | Outreach writer race begins — jobs start killing each other |
| ~14:53 | keyword-researcher #1: invalid JSON, raw saved |
| ~14:53 | content-brief-writer #1, meta-ads-writer complete |
| ~14:55 | First backlink campaign: 6 emails drafted, moved to dm_review |
| ~14:55 | keyword-researcher #2: groups saved successfully |
| ~14:56 | google-ads-writer, backlink-researcher #2, #3 complete |
| ~14:58 | Pipeline still running (second + third batch crawls, outreach writers) |