Opportunity Matcher: BullMQ Lock Expires During LLM Scoring
Status: ✅ Fixed (2026-05-05) — lockDuration:120_000→360_000 + lockRenewTime:60_000
Severity: Medium — job still produces output (with fallback scores) but leaves BullMQ in an inconsistent state and logs misleading errors
File: packages/agents/src/workers/opportunity-matcher.worker.ts
Symptom
When the opportunity pool has 131 candidates, the LLM scoring pass times out and the BullMQ lock expires:
WARN [opportunity-matcher] Scoring batch failed — using fallback scores
err: "Process timed out after 120s"
Error: Missing lock for job opportunity-matcher__<tenantId>__<ts>. moveToFinished
Error: Missing lock for job opportunity-matcher__<tenantId>__<ts>. moveToDelayed

The worker continues and creates 95 opportunities using fallback scores. However, because the lock expired, BullMQ cannot cleanly move the job to completed or delayed:
- moveToFinished fails → job is not marked completed
- moveToDelayed fails → job cannot be retried via BullMQ's retry mechanism
The job then re-runs from scratch (stalled job re-pick) with only the 36 unscored candidates (the 95 already-created ones are filtered out), completing normally.
Root Cause
BullMQ locks a job for lockDuration milliseconds. If the worker takes longer than lockDuration to process, the lock expires and BullMQ assumes the job stalled. The opportunity-matcher’s lock is set to match its timeout of 120s — but scoring 131 candidates with an LLM takes more than 120s in practice.
The sequence:
- Job starts, BullMQ lock set for 120s
- LLM scores candidates in batches — takes >120s
- Lock expires → BullMQ marks job stalled, removes lock
- Worker finishes processing → tries to call moveToFinished → fails (no lock)
- Worker tries moveToDelayed in the failure handler → also fails (no lock)
- BullMQ picks up the stalled job again → runs with the 36 remaining candidates → succeeds
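The timing math behind the lock expiry can be sketched as a quick estimate. The batch size and per-batch latency below are illustrative assumptions, not measured values from this incident:

```typescript
// Rough estimate of whether a scoring run will outlive the BullMQ lock.
// batchSize and perBatchMs are assumed numbers for illustration only.
function willOutliveLock(
  candidates: number,
  batchSize: number,
  perBatchMs: number,
  lockDurationMs: number,
): boolean {
  const batches = Math.ceil(candidates / batchSize);
  return batches * perBatchMs > lockDurationMs;
}

// 131 candidates in batches of 20 at ~20s per LLM batch:
// 7 batches × 20_000 ms = 140_000 ms, which exceeds a 120_000 ms lock
console.log(willOutliveLock(131, 20, 20_000, 120_000)); // true
console.log(willOutliveLock(131, 20, 20_000, 360_000)); // false
```

Under these assumptions the job outlives a 120s lock but fits comfortably inside a 360s one, which motivates Fix 1 below.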
How to Fix
Fix 1 — Increase lockDuration (required)
Set lockDuration to at least 2× the expected max processing time. With 131 candidates and batch LLM scoring, budget 300–360s:
```typescript
// opportunity-matcher.worker.ts
const worker = new Worker(queueName, processJob, {
  connection,
  concurrency: 2,
  lockDuration: 360_000, // was: 120_000 (matched the scoring timeout, which is too short)
  lockRenewTime: 60_000, // renew every 60s so the lock stays alive
});
```

Fix 2 — Enable lock renewal (belt-and-suspenders)
BullMQ supports automatic lock renewal via lockRenewTime. If the worker actively renews the lock while processing, a long-running job never loses it:
```typescript
lockRenewTime: lockDuration / 3, // renew at 1/3 of the lock duration
```

This is the recommended approach for long-running jobs.
Fix 3 — Cap the scoring batch size
If 131 candidates is a realistic upper bound and scoring them always takes >120s, consider capping each scoring batch at a lower count (e.g. 80) and scheduling the remainder as a follow-up job. This keeps individual job durations predictable.
Fix 4 — Separate the scoring timeout from the lock duration
The timeout: 120s on the scoring subprocess is appropriate as a per-call guard. The BullMQ lockDuration is a separate concern and should be longer:
- scoring subprocess timeout: 120s (abort a single LLM call if it hangs)
- BullMQ lockDuration: 360s (total job processing budget)

Impact
- 95 opportunities created with fallback scores instead of LLM-ranked scores
- BullMQ error logs on every run with >~80 candidates
- Job appears in the stalled queue briefly, which may affect the Execution Queue dashboard display