
Opportunity Matcher: BullMQ Lock Expires During LLM Scoring

Status: ✅ Fixed (2026-05-05) — lockDuration:120_000→360_000 + lockRenewTime:60_000
Severity: Medium — job still produces output (with fallback scores) but leaves BullMQ in an inconsistent state and logs misleading errors
File: packages/agents/src/workers/opportunity-matcher.worker.ts

Symptom

When the opportunity pool has 131 candidates, the LLM scoring pass times out and the BullMQ lock expires:

WARN [opportunity-matcher] Scoring batch failed — using fallback scores
  err: "Process timed out after 120s"
Error: Missing lock for job opportunity-matcher__<tenantId>__<ts>. moveToFinished
Error: Missing lock for job opportunity-matcher__<tenantId>__<ts>. moveToDelayed

The worker continues and creates 95 opportunities using fallback scores. However, because the lock has expired, BullMQ cannot cleanly move the job to completed or delayed:

  • moveToFinished fails → job is not marked completed
  • moveToDelayed fails → job cannot be retried via BullMQ’s retry mechanism

The job then re-runs from scratch (stalled-job re-pick) with only the 36 remaining unscored candidates (the 95 already-created ones are filtered out), and completes normally.

Root Cause

BullMQ locks a job for lockDuration milliseconds. If the worker takes longer than lockDuration to process, the lock expires and BullMQ assumes the job stalled. The opportunity-matcher’s lock is set to match its timeout of 120s — but scoring 131 candidates with an LLM takes more than 120s in practice.

The sequence:

  1. Job starts, BullMQ lock set for 120s
  2. LLM scores candidates in batches — takes >120s
  3. Lock expires → BullMQ marks job stalled, removes lock
  4. Worker finishes processing → tries to call moveToFinished → fails (no lock)
  5. Worker tries moveToDelayed in the failure handler → also fails (no lock)
  6. BullMQ picks up the stalled job again → runs with 36 remaining candidates → succeeds

How to Fix

Fix 1 — Increase lockDuration (required)

Set lockDuration to at least 2× the expected max processing time. With 131 candidates and batch LLM scoring, budget 300–360s:

```ts
// opportunity-matcher.worker.ts
const worker = new Worker(queueName, processJob, {
  connection,
  concurrency: 2,
  lockDuration: 360_000, // was: 120_000 (matched the timeout, which is too short)
  lockRenewTime: 60_000, // renew every 60s so the lock stays alive
});
```

Fix 2 — Enable lock renewal (belt-and-suspenders)

BullMQ supports automatic lock renewal via lockRenewTime. If the worker actively renews the lock while processing, a long-running job never loses it:

```ts
lockRenewTime: lockDuration / 3, // renew at 1/3 of the lock duration
```

This is the recommended approach for long-running jobs.
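The renewal loop that BullMQ runs internally can be illustrated with a standalone sketch. `withLeaseRenewal` and its parameters are hypothetical helpers for illustration, not BullMQ API: a timer fires the renew callback at a fixed interval for as long as the task is still running, so the lease never reaches its expiry while work is in progress.

```typescript
// Sketch: keep a lease alive while a long-running task executes.
// `renew` stands in for BullMQ's internal lock-extension call;
// all names here are illustrative, not the BullMQ API.
async function withLeaseRenewal<T>(
  task: () => Promise<T>,
  renew: () => void,
  renewEveryMs: number,
): Promise<T> {
  const timer = setInterval(renew, renewEveryMs); // periodic renewal
  try {
    return await task();
  } finally {
    clearInterval(timer); // stop renewing once the task settles
  }
}
```

The key property is that renewal stops as soon as the task settles, so a stalled worker (which cannot run its timer) still loses the lock and lets BullMQ re-pick the job.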

Fix 3 — Cap the scoring batch size

If 131 candidates is a realistic upper bound and scoring them always takes >120s, consider capping each scoring batch at a lower count (e.g. 80) and scheduling the remainder as a follow-up job. This keeps individual job durations predictable.
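The cap-and-continue idea can be sketched as a pure split; the cap of 80 is an assumption from the numbers above, and the remainder would be re-enqueued as a follow-up job (e.g. via `queue.add`, omitted here):

```typescript
// Sketch: cap one scoring run at `cap` candidates and return the
// remainder for a follow-up job. The cap value is illustrative.
function splitForScoring<T>(
  candidates: T[],
  cap = 80,
): { batch: T[]; remainder: T[] } {
  return {
    batch: candidates.slice(0, cap),  // scored in this job
    remainder: candidates.slice(cap), // re-enqueued as a follow-up job
  };
}
```

With 131 candidates and a cap of 80, the first run scores 80 and hands 51 to the follow-up job, keeping each job's duration well inside the lock budget.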

Fix 4 — Separate the scoring timeout from the lock duration

The timeout: 120s on the scoring subprocess is appropriate as a per-call guard. The BullMQ lockDuration is a separate concern and should be longer:

  • scoring subprocess timeout: 120s (abort a single LLM call if it hangs)
  • BullMQ lockDuration: 360s (total job processing budget)
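A minimal sketch of such a per-call guard, assuming a Promise-based LLM client; `withCallTimeout` and the error message are illustrative, not the actual subprocess wrapper:

```typescript
// Sketch: bound a single LLM call independently of BullMQ's
// lockDuration. Uses Promise.race against a timer promise.
function withCallTimeout<T>(call: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Process timed out after ${ms / 1000}s`)),
      ms,
    );
  });
  // Clear the timer whichever side settles first, so a fast call
  // doesn't leave a dangling timeout keeping the process alive.
  return Promise.race([call, timeout]).finally(() => clearTimeout(timer!));
}
```

One call hanging then aborts only that call (and triggers the existing fallback-score path), while the job's overall lock budget remains governed by lockDuration.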

Impact

  • 95 opportunities created with fallback scores instead of LLM-ranked scores
  • BullMQ error logs on every run with >~80 candidates
  • Job appears in the stalled queue briefly, which may affect the Execution Queue dashboard display

© 2026 Leadmetrics — Internal use only