
RAG Ingestion: All Website Pages Silently Skipped (Wrong ID Type)

Status: ✅ Fixed (2026-05-05)
Severity: Critical — website RAG vector store empty; strategy writer degrades to context-file-only mode
File: packages/agents/src/workers/tenant-website-crawler.worker.ts line 314

Symptom

The RAG engine log shows 50–100 consecutive warnings for every fresh crawl:

WARN [rag-ingestion-worker] RagFile not found — skipping fileId: cmos2lhnz0003w1bkjz6jwmlm_...

The strategy writer then logs:

WARN [strategy-writer-worker] RAG context fetch failed — strategy will use context file only 404 Resource not found

pagesIndexed in the crawl summary is always 0 despite 100 pages being crawled.

Root Cause

tenant-website-crawler.worker.ts passed webPage.id (a WebPage DB record ID) directly as the fileId argument to enqueueRagContent:

```ts
// BEFORE (broken)
await enqueueRagContent(extractedText, webPage.id, tenantId, ragDataset.id);
```

The RAG ingestion worker (rag-ingestion.worker.ts line 217) resolves the fileId as a RagFile primary key:

```ts
const ragFile = await prisma.ragFile.findUnique({
  where: { id: data.fileId },
});
if (!ragFile) {
  log.warn("RagFile not found — skipping");
  return;
}
```

WebPage.id and RagFile.id are independently generated CUIDs, so a WebPage.id never matches any RagFile primary key. The lookup returns null on every job, and each one is silently dropped.
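Because the two ID spaces never overlap, the lookup is guaranteed to miss. A minimal TypeScript sketch illustrates the failure mode, with a hypothetical in-memory Map standing in for the Prisma table and made-up IDs:

```ts
// Hypothetical in-memory stand-in for the RagFile table; the real code
// uses prisma.ragFile.findUnique. IDs here are invented for illustration.
type RagFile = { id: string; fileName: string };

const ragFiles = new Map<string, RagFile>([
  ["rf_ck01", { id: "rf_ck01", fileName: "/pricing" }],
]);

// Mirrors the ingestion worker's primary-key lookup.
function findRagFile(fileId: string): RagFile | undefined {
  return ragFiles.get(fileId);
}

const webPageId = "wp_ck99"; // a WebPage.id, not a RagFile.id
console.log(findRagFile(webPageId)); // undefined: the job is skipped
```

Since the lookup resolves to nothing rather than throwing, nothing surfaces beyond the WARN line, which is why the failure was silent.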

The result: the website_content Qdrant collection stays empty, Azure OpenAI returns 404 on vector search, and the strategy writer falls back to the context file only.
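The fallback on the strategy-writer side can be sketched as below; fetchRagContext and its search parameter are illustrative names, not the worker's actual API, and the real code calls Azure OpenAI rather than a passed-in function:

```ts
// Hedged sketch of the degrade-to-context-file behavior: if the vector
// search throws (e.g. 404 because the collection is empty), return an
// empty context so the caller proceeds with the context file alone.
async function fetchRagContext(
  search: () => Promise<string[]>,
): Promise<string[]> {
  try {
    return await search();
  } catch (err) {
    console.warn(
      "RAG context fetch failed; strategy will use context file only",
    );
    return []; // context-file-only mode
  }
}
```

This is why the symptom surfaces as a WARN rather than a failed strategy run: the writer degrades instead of aborting.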

Fix Applied

At the RAG indexing block in tenant-website-crawler.worker.ts, create a RagFile record first and pass its ID:

```ts
// AFTER (fixed)
const ragFile = await prisma.ragFile.create({
  data: {
    tenantId,
    datasetId: ragDataset.id,
    fileName: new URL(pageUrl).pathname || "/",
    mimeType: "text/plain",
    fileSizeBytes: Buffer.byteLength(extractedText),
    storageKey: "", // inline content — no DO Spaces object
    storageUrl: pageUrl,
    source: "website_crawl",
    status: "pending",
  },
});
await enqueueRagContent(extractedText, ragFile.id, tenantId, ragDataset.id);
```

This matches the pattern used in rag-ingestion.worker.ts lines 286-300 for the website-content job handler.

Side Effect: pagesIndexed Counter Now Accurate

Previously pagesIndexed stayed at 0 because every enqueued content job was silently dropped downstream. After this fix, pagesIndexed increments correctly for each page sent to RAG.
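The counter semantics amount to counting a page only when it actually enters the RAG pipeline. A simplified sketch, where the function and parameter names are illustrative rather than the worker's real API:

```ts
// Count a page as indexed only when its content job was enqueued
// successfully; pages that fail to enqueue are not counted, so the
// crawl summary reflects real work.
async function indexPages(
  pageUrls: string[],
  enqueue: (url: string) => Promise<boolean>,
): Promise<number> {
  let pagesIndexed = 0;
  for (const url of pageUrls) {
    if (await enqueue(url)) pagesIndexed += 1;
  }
  return pagesIndexed;
}
```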

Agents Server Restart Required

The fix requires restarting apps/servers/agents to pick up the compiled change.

© 2026 Leadmetrics — Internal use only