
RAG Ingestion: All Website Pages Silently Skipped (Wrong ID Type)

Status: ✅ Fixed (2026-05-05)
Severity: Critical — website RAG vector store empty; strategy writer degrades to context-file-only mode
File: packages/agents/src/workers/tenant-website-crawler.worker.ts line 314

Symptom

The RAG engine log shows 50–100 consecutive warnings for every fresh crawl:

WARN [rag-ingestion-worker] RagFile not found — skipping fileId: cmos2lhnz0003w1bkjz6jwmlm_...

The strategy writer then logs:

WARN [strategy-writer-worker] RAG context fetch failed — strategy will use context file only 404 Resource not found

pagesIndexed in the crawl summary is always 0 despite 100 pages being crawled.

Root Cause

tenant-website-crawler.worker.ts passed webPage.id (a WebPage DB record ID) directly as the fileId argument to enqueueRagContent:

```ts
// BEFORE (broken)
await enqueueRagContent(extractedText, webPage.id, tenantId, ragDataset.id);
```

The RAG ingestion worker (rag-ingestion.worker.ts line 217) resolves the fileId as a RagFile primary key:

```ts
const ragFile = await prisma.ragFile.findUnique({
  where: { id: data.fileId },
});
if (!ragFile) {
  log.warn("RagFile not found — skipping");
  return;
}
```

WebPage.id and RagFile.id are independently generated CUIDs, so a WebPage.id never matches any RagFile primary key. The lookup returns null on every job, and each one is silently dropped.
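Because the two ID spaces never overlap, the lookup is guaranteed to miss. A minimal TypeScript sketch illustrates the failure mode, with a hypothetical in-memory Map standing in for the Prisma table and made-up IDs:

```ts
// Hypothetical in-memory stand-in for the RagFile table; the real code
// uses prisma.ragFile.findUnique. IDs here are invented for illustration.
type RagFile = { id: string; fileName: string };

const ragFiles = new Map<string, RagFile>([
  ["rf_ck01", { id: "rf_ck01", fileName: "/pricing" }],
]);

// Mirrors the ingestion worker's primary-key lookup.
function findRagFile(fileId: string): RagFile | undefined {
  return ragFiles.get(fileId);
}

const webPageId = "wp_ck99"; // a WebPage.id, not a RagFile.id
console.log(findRagFile(webPageId)); // undefined: the job is skipped
```

Since the lookup resolves to nothing rather than throwing, nothing surfaces beyond the WARN line, which is why the failure was silent.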

The result: the website_content Qdrant collection stays empty, Azure OpenAI returns 404 on vector search, and the strategy writer falls back to the context file only.
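The fallback on the strategy-writer side can be sketched as below; fetchRagContext and its search parameter are illustrative names, not the worker's actual API, and the real code calls Azure OpenAI rather than a passed-in function:

```ts
// Hedged sketch of the degrade-to-context-file behavior: if the vector
// search throws (e.g. 404 because the collection is empty), return an
// empty context so the caller proceeds with the context file alone.
async function fetchRagContext(
  search: () => Promise<string[]>,
): Promise<string[]> {
  try {
    return await search();
  } catch (err) {
    console.warn(
      "RAG context fetch failed; strategy will use context file only",
    );
    return []; // context-file-only mode
  }
}
```

This is why the symptom surfaces as a WARN rather than a failed strategy run: the writer degrades instead of aborting.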

Fix Applied

At the RAG indexing block in tenant-website-crawler.worker.ts, create a RagFile record first and pass its ID:

```ts
// AFTER (fixed)
const ragFile = await prisma.ragFile.create({
  data: {
    tenantId,
    datasetId: ragDataset.id,
    fileName: new URL(pageUrl).pathname || "/",
    mimeType: "text/plain",
    fileSizeBytes: Buffer.byteLength(extractedText),
    storageKey: "", // inline content — no DO Spaces object
    storageUrl: pageUrl,
    source: "website_crawl",
    status: "pending",
  },
});
await enqueueRagContent(extractedText, ragFile.id, tenantId, ragDataset.id);
```

This matches the pattern used in rag-ingestion.worker.ts lines 286-300 for the website-content job handler.

Side Effect: pagesIndexed Counter Now Accurate

Previously pagesIndexed stayed at 0 because every enqueued content job was silently dropped downstream. After this fix, pagesIndexed increments correctly for each page sent to RAG.
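The counter semantics amount to counting a page only when it actually enters the RAG pipeline. A simplified sketch, where the function and parameter names are illustrative rather than the worker's real API:

```ts
// Count a page as indexed only when its content job was enqueued
// successfully; pages that fail to enqueue are not counted, so the
// crawl summary reflects real work.
async function indexPages(
  pageUrls: string[],
  enqueue: (url: string) => Promise<boolean>,
): Promise<number> {
  let pagesIndexed = 0;
  for (const url of pageUrls) {
    if (await enqueue(url)) pagesIndexed += 1;
  }
  return pagesIndexed;
}
```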

Agents Server Restart Required

The fix requires restarting apps/servers/agents to pick up the compiled change.

© 2026 Leadmetrics — Internal use only