Web Crawler: 0 pagesIndexed on Fresh Tenant Crawl
Status: ✅ Fixed — root cause confirmed and patched (2026-05-05)
Severity: Was Critical — see rag-content-wrong-file-id.md for the actual bug
File: packages/agents/src/workers/tenant-website-crawler.worker.ts
Symptom
The first crawl of a brand-new tenant fetches 100 pages but reports 0 indexed:
[09:08:20] Crawl completed
pagesCrawled: 100
mediaDownloaded: 247
pagesIndexed: 0 ← all pages skipped in RAG pipeline
duplicatesSkipped: 1689
changedPages: 0
compressionSavedBytes: 1509777
Root Cause (Confirmed)
The crawler passed webPage.id (a WebPage primary key) as the fileId argument to
enqueueRagContent. The RAG ingestion worker looks up a RagFile by that ID. Since
WebPage.id and RagFile.id are independently generated CUIDs, the lookup always returns
null → worker logs “RagFile not found — skipping” → pagesIndexed stays 0.
This is not a naming issue — pages were genuinely not reaching the vector store.
See docs/issues/rag-content-wrong-file-id.md for full analysis and fix details.
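The failure mode described above can be reproduced with a minimal in-memory sketch. It is not the actual worker code: the WebPageRow and RagFileRow shapes, the ragFiles map, and the ingestSketch function are hypothetical stand-ins, assuming only that the ingestion worker looks up a RagFile by the ID it receives and skips the page when nothing is found.

```typescript
// Hypothetical, simplified row shapes; the real models have more fields.
interface WebPageRow { id: string; url: string }
interface RagFileRow { id: string; sourceUrl: string }

// In-memory stand-in for the RagFile table, keyed by RagFile.id.
const ragFiles = new Map<string, RagFileRow>();
ragFiles.set("ragfile_cuid_1", { id: "ragfile_cuid_1", sourceUrl: "https://example.com/" });

// Simplified model of the ingestion worker: look up the RagFile and
// skip the page when the lookup misses (the real worker logs the miss).
function ingestSketch(fileId: string): "indexed" | "skipped" {
  const file = ragFiles.get(fileId);
  return file === undefined ? "skipped" : "indexed";
}

const page: WebPageRow = { id: "webpage_cuid_1", url: "https://example.com/" };

// Buggy call: a WebPage primary key is passed where a RagFile.id is expected,
// so the lookup can never match.
const buggyResult = ingestSketch(page.id);

// Correct call: the ID of an existing RagFile is passed.
const correctResult = ingestSketch("ragfile_cuid_1");

console.log({ buggyResult, correctResult });
```

Because both IDs are plain strings, the type checker cannot catch the mix-up; the only visible symptom is the skipped page.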
Fix
Applied 2026-05-05: the crawler now creates a RagFile record before calling
enqueueRagContent and passes ragFile.id. The pagesIndexed counter now accurately
reflects pages sent to Qdrant.
The agents server must be restarted for the fix to take effect.
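The shape of the fix can be sketched as follows. This is an illustration under stated assumptions rather than the patched worker: createRagFileSketch, enqueueSketch, and indexPagesSketch are hypothetical names, and the in-memory map stands in for the real RagFile table. The only behavior taken from the description above is the ordering: create the RagFile first, then enqueue using ragFile.id, and count the pages that were actually accepted.

```typescript
// Hypothetical, simplified row shapes for illustration only.
interface CrawledPage { id: string; url: string }
interface RagFileRecord { id: string; sourceUrl: string }

// In-memory stand-in for the RagFile table.
const ragFileTable = new Map<string, RagFileRecord>();
let nextRagFileNumber = 1;

// Hypothetical helper: create the RagFile the ingestion worker will look up.
function createRagFileSketch(page: CrawledPage): RagFileRecord {
  const ragFile: RagFileRecord = {
    id: `ragfile_${nextRagFileNumber++}`,
    sourceUrl: page.url,
  };
  ragFileTable.set(ragFile.id, ragFile);
  return ragFile;
}

// Stand-in for enqueueRagContent: accepts the job only if the RagFile exists.
function enqueueSketch(fileId: string): boolean {
  return ragFileTable.has(fileId);
}

// Create a RagFile per page, enqueue with ragFile.id (not page.id),
// and count only the pages that were accepted.
function indexPagesSketch(pages: CrawledPage[]): number {
  let pagesIndexed = 0;
  for (const page of pages) {
    const ragFile = createRagFileSketch(page);
    if (enqueueSketch(ragFile.id)) {
      pagesIndexed += 1;
    }
  }
  return pagesIndexed;
}

const samplePages: CrawledPage[] = [
  { id: "webpage_a", url: "https://example.com/a" },
  { id: "webpage_b", url: "https://example.com/b" },
];
const indexedCount = indexPagesSketch(samplePages);
console.log(`pagesIndexed: ${indexedCount}`);
```

Incrementing the counter only after a successful enqueue keeps pagesIndexed tied to pages that actually entered the pipeline, which is the property the fix restores.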