Web Crawler: 0 pagesIndexed on Fresh Tenant Crawl

Status: ✅ Fixed — root cause confirmed and patched (2026-05-05)
Severity: Was Critical — see docs/issues/rag-content-wrong-file-id.md for the actual bug
File: packages/agents/src/workers/tenant-website-crawler.worker.ts

Symptom

The first crawl of a brand-new tenant crawls 100 pages but reports 0 pages indexed:


[09:08:20] Crawl completed
  pagesCrawled:      100
  mediaDownloaded:   247
  pagesIndexed:      0       ← all pages skipped in RAG pipeline
  duplicatesSkipped: 1689
  changedPages:      0
  compressionSavedBytes: 1509777

Root Cause (Confirmed)

The crawler passed webPage.id (a WebPage primary key) as the fileId argument to enqueueRagContent. The RAG ingestion worker looks up a RagFile by that ID. Because WebPage.id and RagFile.id are independently generated CUIDs, a WebPage ID never matches a RagFile record, so the lookup always returns null. The worker then logs “RagFile not found — skipping” and pagesIndexed stays 0.
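
The mismatch can be illustrated with a small in-memory sketch. This is not the actual worker code: the Map-based stores, the ingest function, and the ID values are hypothetical stand-ins, and only the WebPage and RagFile names come from this report.

```typescript
// Hypothetical in-memory stand-ins for the WebPage and RagFile tables.
interface WebPage { id: string; url: string }
interface RagFile { id: string; sourceUrl: string }

const webPages = new Map<string, WebPage>();
const ragFiles = new Map<string, RagFile>();

// Assumed shape of the ingestion step: resolve a RagFile by ID, skip if absent.
function ingest(fileId: string): "indexed" | "skipped" {
  const file = ragFiles.get(fileId);
  if (!file) {
    // The real worker logs its "RagFile not found" skip message at this point.
    return "skipped";
  }
  return "indexed";
}

// A crawled page exists only in the WebPage store.
const page: WebPage = { id: "webpage_cuid_1", url: "https://example.com/" };
webPages.set(page.id, page);

// Buggy call: a WebPage primary key is passed where a RagFile ID is expected.
const buggyResult = ingest(page.id);

// For contrast: an ID that belongs to an existing RagFile resolves correctly.
ragFiles.set("ragfile_cuid_1", { id: "ragfile_cuid_1", sourceUrl: page.url });
const okResult = ingest("ragfile_cuid_1");

console.log({ buggyResult, okResult });
```

Because the two ID spaces never overlap, every page takes the skipped branch, which matches the pagesIndexed: 0 seen in the log above.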

This is not a naming issue — pages were genuinely not reaching the vector store.

See docs/issues/rag-content-wrong-file-id.md for full analysis and fix details.

Fix

Applied 2026-05-05: the crawler now creates a RagFile record before calling enqueueRagContent and passes ragFile.id. The pagesIndexed counter now accurately reflects pages sent to Qdrant.
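
The shape of the fix can be sketched as follows. This is a simplified illustration under stated assumptions, not the patched worker: createRagFile, indexPage, the in-memory queue, and the ID format are hypothetical, and the enqueueRagContent here is a local stand-in whose real signature is not shown in this report.

```typescript
// Hypothetical in-memory stand-ins; the real code uses a database and a job queue.
interface RagFile { id: string; sourceUrl: string }

const ragFiles = new Map<string, RagFile>();
const queue: string[] = [];
let nextId = 0;

// Hypothetical helper: create and persist a RagFile record for a crawled page.
function createRagFile(sourceUrl: string): RagFile {
  nextId += 1;
  const ragFile: RagFile = { id: `ragfile_${nextId}`, sourceUrl };
  ragFiles.set(ragFile.id, ragFile);
  return ragFile;
}

// Local stand-in for enqueueRagContent (assumed to take the RagFile ID).
function enqueueRagContent(fileId: string): void {
  queue.push(fileId);
}

// Fixed flow: create the RagFile first, then enqueue with ragFile.id
// instead of the WebPage primary key.
function indexPage(url: string): string {
  const ragFile = createRagFile(url);
  enqueueRagContent(ragFile.id);
  return ragFile.id;
}

const enqueuedId = indexPage("https://example.com/");

// Every enqueued ID now resolves to a RagFile, so the page counts as indexed.
const pagesIndexed = queue.filter((id) => ragFiles.has(id)).length;
console.log({ enqueuedId, pagesIndexed });
```

Creating the record before enqueueing guarantees that the ID handed to the ingestion worker always refers to an existing RagFile.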

The agents server must be restarted for the fix to take effect.