Opportunity Matcher: BullMQ Lock Expires During LLM Scoring
Status: ✅ Fixed (2026-05-05) — lockDuration:120_000→360_000 + lockRenewTime:60_000
Severity: Medium — job still produces output (with fallback scores) but leaves BullMQ in an inconsistent state and logs misleading errors
File: packages/agents/src/workers/opportunity-matcher.worker.ts
Symptom
When the opportunity pool has 131 candidates, the LLM scoring pass times out and the BullMQ lock expires:
WARN [opportunity-matcher] Scoring batch failed — using fallback scores
err: "Process timed out after 120s"
Error: Missing lock for job opportunity-matcher__<tenantId>__<ts>. moveToFinished
Error: Missing lock for job opportunity-matcher__<tenantId>__<ts>. moveToDelayed

The worker continues and creates 95 opportunities using fallback scores. However, because the lock expired, BullMQ cannot cleanly move the job to completed or delayed:
- moveToFinished fails → job is not marked completed
- moveToDelayed fails → job cannot be retried via BullMQ's retry mechanism
The job then re-runs from scratch (stalled job re-pick) with only the 36 unscored candidates (the 95 already-created ones are filtered out), completing normally.
Root Cause
BullMQ locks a job for lockDuration milliseconds. If the worker takes longer than lockDuration to process, the lock expires and BullMQ assumes the job stalled. The opportunity-matcher’s lock is set to match its timeout of 120s — but scoring 131 candidates with an LLM takes more than 120s in practice.
The sequence:
- Job starts, BullMQ lock set for 120s
- LLM scores candidates in batches — takes >120s
- Lock expires → BullMQ marks job stalled, removes lock
- Worker finishes processing → tries to call moveToFinished → fails (no lock)
- Worker tries moveToDelayed in the failure handler → also fails (no lock)
- BullMQ picks up the stalled job again → runs with the 36 remaining candidates → succeeds
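The timing math behind the lock expiry can be sketched as a quick estimate. The batch size and per-batch latency below are illustrative assumptions, not measured values from this incident:

```typescript
// Rough estimate of whether a scoring run will outlive the BullMQ lock.
// batchSize and perBatchMs are assumed numbers for illustration only.
function willOutliveLock(
  candidates: number,
  batchSize: number,
  perBatchMs: number,
  lockDurationMs: number,
): boolean {
  const batches = Math.ceil(candidates / batchSize);
  return batches * perBatchMs > lockDurationMs;
}

// 131 candidates in batches of 20 at ~20s per LLM batch:
// 7 batches × 20_000 ms = 140_000 ms, which exceeds a 120_000 ms lock
console.log(willOutliveLock(131, 20, 20_000, 120_000)); // true
console.log(willOutliveLock(131, 20, 20_000, 360_000)); // false
```

Under these assumptions the job outlives a 120s lock but fits comfortably inside a 360s one, which motivates Fix 1 below.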
How to Fix
Fix 1 — Increase lockDuration (required)
Set lockDuration to at least 2× the expected max processing time. With 131 candidates and batch LLM scoring, budget 300–360s:
```typescript
// opportunity-matcher.worker.ts
const worker = new Worker(queueName, processJob, {
  connection,
  concurrency: 2,
  lockDuration: 360_000, // was: 120_000 (matched the scoring timeout, which is too short)
  lockRenewTime: 60_000, // renew every 60s so the lock stays alive
});
```

Fix 2 — Enable lock renewal (belt-and-suspenders)
BullMQ supports automatic lock renewal via lockRenewTime. If the worker actively renews the lock while processing, a long-running job never loses it:
```typescript
lockRenewTime: lockDuration / 3, // renew at 1/3 of the lock duration
```

This is the recommended approach for long-running jobs.
Fix 3 — Cap the scoring batch size
If 131 candidates is a realistic upper bound and scoring them always takes >120s, consider capping each scoring batch at a lower count (e.g. 80) and scheduling the remainder as a follow-up job. This keeps individual job durations predictable.
Fix 4 — Separate the scoring timeout from the lock duration
The timeout: 120s on the scoring subprocess is appropriate as a per-call guard. The BullMQ lockDuration is a separate concern and should be longer:
- scoring subprocess timeout: 120s (abort a single LLM call if it hangs)
- BullMQ lockDuration: 360s (total job processing budget)

Impact
- 95 opportunities created with fallback scores instead of LLM-ranked scores
- BullMQ error logs on every run with >~80 candidates
- Job appears in the stalled queue briefly, which may affect the Execution Queue dashboard display