
E2E Run #1 — Issue Log

Date: 2026-05-06
Tenant: Leadmetrics (cmotpjzgi0004w1dg5ipql5rd)
Pipeline stage reached: Activities approved → Content workers running


Summary

#  | Worker                   | Severity    | Status | Title
1  | activity-worker          | 🔴 Critical | Open   | JSON output shape inconsistency — bare array vs wrapper object
2  | activity-worker          | 🔴 Critical | Open   | Retry prompt missing activity field schema — Claude drifts on field names
3  | activity-worker          | 🔴 Critical | Open   | No pre-write validation — agentQueue/label null causes Prisma crash
4  | activity-worker          | 🔴 Critical | Open   | Claude generates 60+ activities for 12 templates — output truncation
5  | website-crawler          | 🔴 Critical | Open   | concurrency: 3 + shared cwd per tenant — .agent.pid race kills jobs
6  | backlink-outreach-writer | 🔴 Critical | Open   | concurrency: 3 + shared cwd per tenant — same .agent.pid race
7  | multiple workers         | 🟡 Medium   | Open   | BullMQ lock expiry on long-running jobs — “Missing key” on moveToFinished
8  | all content workers      | 🟡 Medium   | Open   | RAG context fetch 404 — Azure AI Search index missing for tenant
9  | keyword-researcher       | 🟡 Medium   | Open   | Invalid JSON output on first job — raw text saved, no structured groups
10 | notifications            | 🟡 Medium   | Open   | 8 of 11 email template slugs missing from seed — SendGrid 400 on all pipeline emails
11 | all workers              | 🟢 Low      | Open   | taskkill error output leaks into agents log

Issue Details


#1 — Activity planner: Bare array vs wrapper object

Worker: packages/agents/src/workers/activity.worker.ts
Severity: 🔴 Critical — causes retry on every run, doubles cost and runtime

Symptom:

WARN [activity-worker] JSON parse failed on attempt 1 — retrying err: "Error: 'activities' field is not an array"

Root cause:
Claude returns a top-level JSON array [{...}, {...}] on first attempt instead of the expected wrapper {"activities": [...]}. The extractJson<T> helper finds the first { character inside the array and parses that single element as the root object. The root object has no activities key, so the schema check fails.

extractJson limitation: It uses brace-depth matching starting from the first { found. It cannot handle top-level arrays — it will always find the { inside the first array element and parse that as the root.

Impact: Every run incurs at least one internal retry. Each retry is a full ~7-minute Claude invocation. Observed across outer retry attempts too — Claude consistently produces arrays on first attempt.
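A minimal sketch of a shape-tolerant extractor (the function name and bracket heuristic here are illustrative, not the current extractJson<T> helper): accept either root shape and normalize to the wrapper.

// Hypothetical: accepts a top-level array or the expected
// {"activities": [...]} wrapper, and normalizes to the wrapper shape.
function extractActivities(raw: string): { activities: unknown[] } {
  // Start at the first "[" or "{", whichever comes first; a leading "["
  // means the model emitted a bare array root.
  const starts = [raw.indexOf("["), raw.indexOf("{")].filter((i) => i !== -1);
  if (starts.length === 0) throw new Error("no JSON found in output");
  const start = Math.min(...starts);
  const end = raw.lastIndexOf(raw[start] === "[" ? "]" : "}");
  if (end <= start) throw new Error("unbalanced JSON in output");

  const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
  if (Array.isArray(parsed)) return { activities: parsed }; // bare array root
  const obj = parsed as { activities?: unknown };
  if (Array.isArray(obj.activities)) return { activities: obj.activities };
  throw new Error("'activities' field is not an array");
}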


#2 — Activity planner: Retry prompt missing activity field schema

Worker: packages/agents/src/workers/activity.worker.ts
Severity: 🔴 Critical — causes PrismaClientValidationError on retry attempts

Symptom:

ERROR [activity-worker] Activity planner job failed PrismaClientValidationError: Argument `agentQueue` is missing. data: { deliverableType: "keyword_research", label: undefined, ... }

Root cause:
The retry prompt (line ~437) only includes the top-level wrapper schema { "activities": [...] }. It does not re-state the full ActivitySpec field schema. On retries, Claude uses different field names:

Correct               | Claude used on retry
label                 | title
inputHints            | inputs
dueDate               | scheduledDate
agentQueue (required) | omitted for invented type "keyword_research"

Impact: Retry attempt writes to Prisma with label: undefined and missing agentQueue → crash.
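One way to harden the retry path is to restate the full per-activity schema, not just the wrapper. A sketch of such a prompt fragment, with field names taken from the drift table above (the constant name and wording are illustrative):

const RETRY_SCHEMA = `
Return ONLY a JSON object of this exact shape:
{
  "activities": [{
    "tempId": "act_1",
    "label": "...",            // REQUIRED, not "title"
    "deliverableType": "...",  // one of the 12 provided templates only
    "agentQueue": "...",       // REQUIRED, never omit
    "inputHints": ["..."],     // not "inputs"
    "dueDate": "YYYY-MM-DD"    // not "scheduledDate"
  }]
}`;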


#3 — Activity planner: No pre-write field validation

Worker: packages/agents/src/workers/activity.worker.ts:539
Severity: 🔴 Critical — unguarded crash to DB

Symptom: Same PrismaClientValidationError as #2.

Root cause:
Activity.agentQueue String and Activity.label String are non-nullable in the Prisma schema. The worker passes Claude’s output directly into db.activity.create() with no runtime check that required fields are present and non-undefined. Any time Claude omits or misnames a field the write throws immediately.

Fix approach: Validate each ActivitySpec before the db.$transaction — filter out or throw on records where agentQueue or label is undefined/empty.
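A minimal pre-write guard, assuming an ActivitySpec shape with the fields above (the type and error handling are illustrative):

interface ActivitySpec {
  label?: string;
  agentQueue?: string;
  [key: string]: unknown;
}

// Reject any spec missing the non-nullable columns before db.$transaction,
// so a bad parse surfaces as a retryable error instead of a Prisma crash.
function assertWritable(specs: ActivitySpec[]): void {
  const bad = specs.filter((s) => !s.label?.trim() || !s.agentQueue?.trim());
  if (bad.length > 0) {
    throw new Error(`${bad.length} activity spec(s) missing required label/agentQueue`);
  }
}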


#4 — Activity planner: Claude generates 60+ activities

Worker: packages/agents/src/workers/activity.worker.ts
Severity: 🔴 Critical — output truncation causes parse failure

Symptom:

INFO [activity-worker] Claude execution finished outputChars: 7067 outputPreview: "on editorial and niche link opportunities\"}},{\"tempId\":\"act_64\",..." err: "Error: 'activities' field is not an array"

Root cause:
The prompt has 12 deliverable templates, but Claude generates 64+ activities (one per topic/keyword variant, not per template). The output grows to tens of thousands of characters. The Claude adapter captures only the tail of the output stream (7067 chars), so extractJson receives a fragment starting mid-array, around act_55 to act_64. Parsing the fragment finds the first { inside the truncated array → same wrapper failure as #1.

Observed: act_64 visible in outputPreview; the full output likely exceeded 60,000 chars.

Fix approach: Cap activities per deliverable type in the prompt (e.g. “max 2 per template”) and add an explicit count limit instruction.
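A sketch of that constraint as a prompt fragment (the constant name and wording are illustrative):

// Hypothetical prompt addition: bound the output so it fits the adapter's
// capture window instead of fanning out one activity per keyword variant.
const COUNT_CONSTRAINT = `
Generate AT MOST 2 activities per deliverable template (24 activities maximum).
Do NOT create a separate activity per topic or keyword variant.`;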


#5 — Website crawler: .agent.pid race with concurrency: 3

Worker: packages/agents/src/workers/website-crawler.worker.ts:277
Severity: 🔴 Critical — concurrent crawlers kill each other

Symptom:

WARN [website-crawler-worker] Crawl failed — prospect skipped domain: "g2.com" error: "Process exited with code 1"
WARN [website-crawler-worker] Could not parse crawl output — skipping prospect domain: "medianama.com"

Root cause:
Identical to the channel-action-suggester bug (fixed 2026-05-06). The execute.ts adapter writes a .agent.pid file in cwd on each Claude spawn. On the next spawn in the same cwd, it reads this file and issues taskkill /F /T /PID <old>.

The crawler uses:

const cwd = path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE);

All crawls for the same tenant share one cwd. With concurrency: 3, when jobs B and C start while job A is running:

  • B reads A’s .agent.pid → taskkill kills A → A exits code 1
  • C reads B’s .agent.pid → taskkill kills B → B exits code 1 or produces partial output

The “Could not parse crawl output” variant occurs when Claude is killed mid-output — it exits with code 0 but stdout is truncated/garbled.

Observed: g2.com, yourstory.com, medianama.com, digitalvidya.com (first batch), nasscom.in, lighthouseinsights.in, entrepreneur.com all failed on first attempt. Several succeeded on retries when running serially.

Code comment (incorrect): Line 277 says // crawl up to 3 domains in parallel per server — the pid-file design prevents this.

Fix options:

  1. concurrency: 1 — correct but serializes all crawls (18 prospects × ~30s = ~9 min per batch)
  2. Scope cwd by backlinkId (e.g. path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE, backlinkId)) — allows true concurrency without pid collision, but breaks server-restart orphan cleanup for that worker; see the sketch below
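A sketch of option 2, reusing the worker's existing constants (AGENT_ROLE is declared here only to keep the snippet self-contained):

import os from "node:os";
import path from "node:path";

const AGENT_ROLE = "website-crawler"; // as in the worker

// Hypothetical per-job cwd: scoping by backlinkId gives each concurrent crawl
// its own .agent.pid file, so spawns can no longer kill each other.
function crawlCwd(tenantId: string, backlinkId: string): string {
  return path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE, backlinkId);
}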

#6 — Outreach writer: .agent.pid race with concurrency: 3

Worker: packages/agents/src/workers/backlink-outreach-writer.worker.ts:316
Severity: 🔴 Critical — same root cause as #5

Symptom:

ERROR [backlink-outreach-writer-worker] Outreach writer job failed Error: Process exited with code 1

Root cause:
Same .agent.pid shared-cwd + concurrency:3 pattern.

const cwd = path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE);
concurrency: 3

Outreach writers are queued immediately when a crawler completes. Since the crawler runs with concurrency:3 (even if effectively serialized by the pid-file), multiple crawlers complete in quick succession and each queues an outreach writer job. Those outreach writers then start concurrently and kill each other.

Observed: 9 outreach writer failures during this session. All retried and eventually succeeded (BullMQ exponential backoff).

Pattern note: Any worker using path.join(os.tmpdir(), "leadmetrics-agents", tenantId, AGENT_ROLE) with concurrency > 1 has this issue. The architectural rule (see project_claude_adapter_pid_file.md memory) applies: one Claude process per cwd at a time.


#7 — BullMQ lock expiry on long-running jobs

Workers affected: setup (client-researcher, competitor-researcher, context-file-writer), insight workers, deliverable-planner, strategy-writer, activity-planner, ai-visibility-seeder, opportunity-matcher
Severity: 🟡 Medium — jobs complete but are marked failed; credits wasted on retries

Symptom:

Error: could not renew lock for job deliverable-planner__...
Error: Missing key for job deliverable-planner__.... moveToFinished
Error: Missing key for job deliverable-planner__.... moveToDelayed

Root cause:
BullMQ’s stall checker runs periodically and reclaims the lock key of any job that hasn’t renewed it within lockDuration. Long-running jobs (deliverable planner: ~14 min, activity planner: ~8 min, strategy writer: ~7+ min) exceed their lockDuration. When the worker finishes and calls moveToFinished, the key is gone; the job is effectively orphaned in BullMQ’s state and is retried.

Impact: Data is written correctly (the Claude output was processed), but BullMQ retries the job anyway. The retry re-runs Claude from scratch, producing duplicate writes wherever dedup logic is missing and wasting credit spend.

Fix approach: Increase lockDuration and lockRenewTime proportionally for each worker based on observed P95 execution time. E.g. deliverable planner: lockDuration: 1_200_000 (20 min).
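A sketch of those worker options using BullMQ's lockDuration / lockRenewTime (the connection and processor are placeholders; the values are the example above, not tuned P95 figures):

import { Worker, type Job } from "bullmq";

// Placeholder for the existing processor.
async function processDeliverablePlan(job: Job): Promise<void> { /* ... */ }

const worker = new Worker("deliverable-planner", processDeliverablePlan, {
  connection: { host: "localhost", port: 6379 }, // placeholder
  lockDuration: 1_200_000, // 20 min: must exceed the longest observed run (~14 min)
  lockRenewTime: 300_000,  // renew every 5 min, well inside lockDuration
});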


#8 — RAG context fetch: 404 on all content workers

Workers affected: strategy-writer, content-worker (all agent roles), landing-page-writer
Severity: 🟡 Medium — non-fatal but all content generated without client website context

Symptom:

WARN [content-worker] RAG context fetch failed — proceeding without it agentRole: "keyword-researcher"
NotFoundError: 404 Resource not found apim-request-id: ... status: 404

Root cause:
The content workers call search() from @leadmetrics/feature-search to fetch relevant website content from an Azure AI Search index. The index for this tenant does not exist — either:

  1. The website crawl data was never ingested into the RAG index (prior ragengine-missing-openai-key issue may have caused this), or
  2. The Azure AI Search deployment/index name for this environment is misconfigured

The 404 comes from Azure OpenAI’s APIM gateway when the embedder (embedder.ts:63) calls the embeddings endpoint for a query vector.

Impact: All content workers proceed without RAG context. Blog posts, landing pages, emails, ads are generated using only the context injected directly in the prompt (strategy/goals/client context) — not from crawled website content.

Related: ragengine-missing-openai-key.md (✅ fixed) — the RAG engine env vars were added but the index may not have been back-filled after the fix.
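A quick diagnostic to separate the two hypotheses, using @azure/search-documents (the endpoint, key, and index name are placeholders):

import { SearchIndexClient, AzureKeyCredential } from "@azure/search-documents";

const client = new SearchIndexClient(
  process.env.AZURE_SEARCH_ENDPOINT!,
  new AzureKeyCredential(process.env.AZURE_SEARCH_KEY!)
);

// If the tenant index exists, the 404 points at the embeddings deployment on
// the APIM gateway (cause 2); if not, the crawl data was never ingested (cause 1).
async function checkIndex(indexName: string): Promise<void> {
  try {
    await client.getIndex(indexName);
    console.log(`index "${indexName}" exists; suspect the embeddings deployment name`);
  } catch {
    console.log(`index "${indexName}" missing; crawl data was never ingested`);
  }
}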


#9 — Keyword researcher: Invalid JSON output (non-retried)

Worker: packages/agents/src/workers/keyword-researcher.worker.ts (via content.worker.ts)
Severity: 🟡 Medium — keyword data silently missing, no retry

Symptom:

WARN [keyword-researcher-worker] keyword-researcher output is not valid JSON — saving raw as outputPayload only activityId: "cmotu004h00eaw1pk0q5mmacw"

Root cause:
Claude occasionally returns keyword research output as prose, markdown-fenced JSON (```json ... ```), or a mix of explanation + JSON. The keyword researcher worker’s JSON parser fails. The fallback saves the raw output string to outputPayload on the activity record and marks the job complete (no error, no retry).

Impact: Structured keyword groups are not saved to the DB for that activity. Downstream tasks that depend on keyword groups receive nothing. The first keyword-researcher job (activity cmotu004h00eaw1pk0q5mmacw) silently produced no usable output; the second job for a different activity succeeded.

Fix approach: Strip markdown fences before parsing (same pattern as channel-action-suggester does for its output). If still invalid, retry rather than silently saving raw text.
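A minimal fence-stripping step, modeled on the channel-action-suggester pattern mentioned above (this version is illustrative):

// Pull the payload out of ```json fences, or fall back to the outermost braces
// when the model mixes prose and JSON.
function stripFences(raw: string): string {
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fenced) return fenced[1].trim();
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  return start !== -1 && end > start ? raw.slice(start, end + 1) : raw.trim();
}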


#10 — Email templates: 8 of 11 slugs missing from seed

Service: packages/db/src/seed.ts, packages/agents/src/workers/
Severity: 🟡 Medium — DMs never receive pipeline email notifications

Symptom:

WARN [notifications:email-loader] Email template not found — using slug as subject fallback slug: "activity_pipeline_dm_review" ERROR [notifications:email] Email job failed 400 Bad Request "The content value must be a string at least one character in length." field: "content.0.value"

Root cause:
The seed file (packages/db/src/seed.ts) defines only 3 of the 11 email template slugs used by the agent workers. When a template slug is not found in the DB, the email loader falls back to using the slug itself as the subject with an empty HTML/text body. SendGrid rejects this with a 400.

Seeded vs missing:

Slug                        | Worker                     | Seeded?
context_ready               | setup.worker               | ✅
social_post_published       | social-publisher           | ✅
performance_report_ready    | performance-report-writer  | ✅
deliverable_plan_ready      | strategy.worker            | ❌
strategy_ready              | strategy-writer            | ❌
activity_pipeline_dm_review | activity.worker            | ❌
activity_pipeline_ready     | activity.worker            | ❌
activity_pipeline_failed    | activity.worker            | ❌
channel_suggestions_ready   | channel-action-suggester   | ❌
report-ready                | custom-report-writer       | ❌
seo-rank-alert              | gsc-keywords-snapshot      | ❌

Observed failures this session: deliverable_plan_ready (from strategy.worker after deliverable plan created) and activity_pipeline_dm_review (from activity.worker after activities set to dm_review).

Impact: All pipeline milestone email notifications silently fail. DMs only receive email for onboarding context completion and social post / performance report events.
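A sketch of the missing seed entries; the slugs match the table above, while subjects and bodies are placeholders (SendGrid only requires the content to be non-empty):

// Hypothetical additions to packages/db/src/seed.ts.
const missingEmailTemplates = [
  { slug: "deliverable_plan_ready",      subject: "Your deliverable plan is ready",  html: "<p>…</p>" },
  { slug: "strategy_ready",              subject: "Your strategy is ready",          html: "<p>…</p>" },
  { slug: "activity_pipeline_dm_review", subject: "Activities awaiting your review", html: "<p>…</p>" },
  { slug: "activity_pipeline_ready",     subject: "Your activity pipeline is ready", html: "<p>…</p>" },
  { slug: "activity_pipeline_failed",    subject: "Activity pipeline failed",        html: "<p>…</p>" },
  { slug: "channel_suggestions_ready",   subject: "New channel suggestions",         html: "<p>…</p>" },
  { slug: "report-ready",                subject: "Your report is ready",            html: "<p>…</p>" },
  { slug: "seo-rank-alert",              subject: "SEO rank alert",                  html: "<p>…</p>" },
];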


#11 — taskkill error output leaks into agents log

Service: packages/adapters/claude-local/src/server/execute.ts
Severity: 🟢 Low — cosmetic noise only

Symptom:

ERROR: The process with PID 5776 (child process of PID 2948) could not be terminated. Reason: The operation attempted is not supported.
ERROR: The process with PID 2948 (child process of PID 30672) could not be terminated. Reason: There is no running instance of the task.

Root cause:
execute.ts runs taskkill /F /T /PID <oldPid> to clean up orphaned Claude subprocesses. When the old PID has already exited (normal case: previous job finished cleanly), taskkill outputs error lines to its stdout/stderr. These lines are not captured/suppressed by execute.ts and flow directly into the process stdout, appearing in the agents log.

Impact: Log noise only. No functional impact. Appears alongside every pid-file cleanup where the previous process had already exited.
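A sketch of a quieter cleanup call for execute.ts (the helper name and message filtering are illustrative):

import { execFile } from "node:child_process";

// Capture taskkill's output instead of letting it inherit the agent process's
// stdio, and drop the expected "already exited" noise.
function killStalePid(pid: number): void {
  execFile("taskkill", ["/F", "/T", "/PID", String(pid)], (err, stdout, stderr) => {
    const out = `${stdout}${stderr}`;
    // "No running instance" / "could not be terminated" for an exited PID is
    // the normal case after a clean previous job: suppress it.
    if (err && !/no running instance|could not be terminated/i.test(out)) {
      console.warn("[claude-local] taskkill failed:", out.trim());
    }
  });
}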


Timeline

Time   | Event
~13:35 | Channel action suggester race condition (concurrency: 3 → 1 fix applied)
~13:50 | Deliverable planner starts
~14:06 | Deliverable planner loses BullMQ lock mid-run (Missing key errors)
~14:10 | Deliverable planner completes (data written despite lock expiry)
~14:20 | Activity planner starts (outer attempt 1)
~14:24 | Activity planner fails — PrismaClientValidationError (missing agentQueue)
~14:24 | Activity planner outer retry 2 starts
~14:31 | Activity planner attempt 1 → “activities field is not an array” (output truncated at act_64)
~14:34 | Activity planner attempt 2 → 36 activities parsed, 48 total with action items
~14:34 | Activities in dm_review, requireActivityApproval: true
~14:35 | activity_pipeline_dm_review email fails → SendGrid 400 (template missing)
~14:50 | DM Reviewer approves activity pipeline
~14:50 | 8 content workers dispatch simultaneously, all get RAG 404 (non-fatal)
~14:51 | GBP post, email writer, landing page writer complete
~14:52 | Backlink researcher #1 completes (18 prospects), crawlers start
~14:52 | Crawler race begins — g2.com, yourstory.com killed by concurrent crawlers
~14:52 | Outreach writer race begins — jobs start killing each other
~14:53 | keyword-researcher #1: invalid JSON, raw saved
~14:53 | content-brief-writer #1, meta-ads-writer complete
~14:55 | First backlink campaign: 6 emails drafted, moved to dm_review
~14:55 | keyword-researcher #2: groups saved successfully
~14:56 | google-ads-writer, backlink-researcher #2, #3 complete
~14:58 | Pipeline still running (second + third batch crawls, outreach writers)

© 2026 Leadmetrics — Internal use only