RAG Ingestion: All Website Pages Silently Skipped (Wrong ID Type)
Status: ✅ Fixed (2026-05-05)
Severity: Critical — website RAG vector store empty; strategy writer degrades to context-file-only mode
File: packages/agents/src/workers/tenant-website-crawler.worker.ts line 314
Symptom
The RAG engine log shows 50–100 consecutive warnings for every fresh crawl:
```
WARN [rag-ingestion-worker] RagFile not found — skipping fileId: cmos2lhnz0003w1bkjz6jwmlm_...
```
The strategy writer then logs:
```
WARN [strategy-writer-worker] RAG context fetch failed — strategy will use context file only
404 Resource not found
```
pagesIndexed in the crawl summary is always 0 despite 100 pages being crawled.
Root Cause
tenant-website-crawler.worker.ts passed webPage.id (a WebPage DB record ID) directly as the fileId argument to enqueueRagContent:
```typescript
// BEFORE (broken)
await enqueueRagContent(extractedText, webPage.id, tenantId, ragDataset.id);
```
The RAG ingestion worker (rag-ingestion.worker.ts line 217) resolves the fileId as a RagFile primary key:
```typescript
const ragFile = await prisma.ragFile.findUnique({ where: { id: data.fileId } });
if (!ragFile) { log.warn("RagFile not found — skipping"); return; }
```
WebPage.id and RagFile.id are separate CUID sequences. Passing a WebPage.id never finds a RagFile row, so every content job is silently dropped.
The result: the website_content Qdrant collection is empty → Azure OpenAI returns 404 on vector search → strategy writer falls back to context file only.
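The failure mode reduces to a few lines. A minimal simulation (not the real worker: the Map stands in for the RagFile table, and the CUID prefixes are invented for illustration):

```typescript
// RagFile primary keys, keyed by id — stand-in for prisma.ragFile
const ragFiles = new Map<string, { id: string }>([
  ["rf_cuid_aaa", { id: "rf_cuid_aaa" }],
]);

// Mirrors the ingestion worker's lookup-and-skip logic: an id from a
// different table (a WebPage.id) can never match, so the job is dropped.
function ingest(fileId: string): "ingested" | "skipped" {
  const ragFile = ragFiles.get(fileId);
  if (!ragFile) return "skipped"; // logged as a WARN, then silently dropped
  return "ingested";
}

console.log(ingest("wp_cuid_bbb")); // WebPage.id → "skipped", every time
console.log(ingest("rf_cuid_aaa")); // correct RagFile.id → "ingested"
```

Because the skip path returns normally, nothing surfaces as an error: the queue drains, the crawl reports success, and only the empty Qdrant collection reveals the problem.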
Fix Applied
At the RAG indexing block in tenant-website-crawler.worker.ts, create a RagFile record first and pass its ID:
```typescript
// AFTER (fixed)
const ragFile = await prisma.ragFile.create({
  data: {
    tenantId,
    datasetId: ragDataset.id,
    fileName: new URL(pageUrl).pathname || "/",
    mimeType: "text/plain",
    fileSizeBytes: Buffer.byteLength(extractedText),
    storageKey: "", // inline content — no DO Spaces object
    storageUrl: pageUrl,
    source: "website_crawl",
    status: "pending",
  },
});
await enqueueRagContent(extractedText, ragFile.id, tenantId, ragDataset.id);
```
This matches the pattern used in rag-ingestion.worker.ts lines 286-300 for the website-content job handler.
Side Effect: pagesIndexed Counter Now Accurate
Previously pagesIndexed was always 0 because enqueueRagContent would silently fail. After this fix, pagesIndexed increments correctly for each page sent to RAG.
Agents Server Restart Required
The fix requires restarting apps/servers/agents to pick up the compiled change.