Website Channel — Crawler & Media Indexing
Build Status (May 2026)
| Component | Status |
|---|---|
| BullMQ worker (tenant-website-crawler.worker.ts) | [Live] |
| Utility libs (robots, sitemap, image-compression, document-extractor) | [Live] |
| Service layer (webcrawl.service.ts in @leadmetrics/feature-knowledge) | [Live] |
| DB models (WebPage, WebMedia, WebPageMedia, WebCrawlJob) | [Live] — schema pushed |
| Dashboard channel detail UI (5-tab: Overview/Pages/Media/Insights/Issues) | [Live] |
| WebPage detail page (/channels/:id/webpages/:webPageId) with Issues card | [Live] |
| Website Insights (auto-triggered after crawl completes) | [Live] |
| Website Issues — per-page issue detection + AI code-fixer PR agent | [Live May 2026] — see issues.md |
| Manage portal crawl-settings UI | [To Build] |
| Scheduled re-crawl (ScheduledTask on completion) | [To Build] |
| Search indexer integration (Typesense) | [To Build] |
Bug fixed May 2026 (worker wiring):
startTenantWebCrawlerWorker was never called in apps/servers/agents/src/index.ts — crawls were silently ignored. Now wired via the startTenantWebCrawler() export.
Bug fixed May 2026 (brand extractor):
extractBrandAssets() returned confidence = 0 on modern Next.js/Tailwind/shadcn sites for three reasons: (1) document.querySelector("header button") returned the first (often ghost/transparent) button instead of the filled CTA — fixed by scanning ALL header/nav buttons; (2) HSL CSS variables (--primary: 239 84% 67%) were not parsed — fixed by adding parseHsl() inside the evaluate block; (3) next/font inlines fonts as @font-face rules instead of <link> tags — fixed by scanning inline <style> blocks. Also fixed: the onboarding wizard held a stale null from the server render — BrandAssetsStep now fetches fresh DB state on mount.
Overview
During tenant signup, the user provides their company website URL. The system automatically creates a Website ConnectedChannel, crawls the site up to the configured page limit, stores webpages and media assets, and indexes the text content into RAG so AI agents can use it immediately.
What Gets Stored
| Table | Content |
|---|---|
WebPages | Every crawled page: URL, title, description, extracted text, screenshot |
WebMedia | Images, videos, and documents found on those pages |
WebPageMedia | Join table linking media assets to pages (many-to-many) |
WebCrawlJob | Crawl session tracking: progress, statistics, errors |
Qdrant (website_content dataset) | Text chunks for RAG / semantic search |
Data Model
WebPages
Stores one record per crawled page URL.
id cuid
tenantId String
connectedChannelId String (FK → ConnectedChannel)
url String (unique per channel)
title String?
description String?
metaKeywords String?
htmlContent Text? (raw HTML, optional)
extractedText Text? (clean text, scripts/nav stripped)
screenshot String? (DO Spaces key)
contentHash String? (SHA-256 of extractedText — change detection)
httpStatus Int? (200, 404, 500…)
depth Int (0 = start URL)
crawlJobId String (FK → WebCrawlJob)
status String ("ACTIVE" | "FAILED" | "DELETED")
errorMessage Text?
lastCrawledAt DateTime?
lastChangedAt DateTime? (set when contentHash differs from previous crawl)
createdAt DateTime
updatedAt DateTime
Indexes: [tenantId, connectedChannelId], [crawlJobId], unique [url, connectedChannelId]
WebMedia
Stores one record per unique media asset (deduplicated by content hash). The same logo used on every page is stored once in WebMedia and linked to each page via WebPageMedia.
id cuid
tenantId String
mediaType String ("IMAGE" | "VIDEO" | "DOCUMENT")
sourceUrl String (original URL on client website)
fileName String
mimeType String
-- Storage (DO Spaces)
storageKey String (DO Spaces object key)
storageUrl String (CDN URL)
fileSizeBytes Int (compressed size)
originalFileSizeBytes Int? (size before compression)
compressionRatio Float? (e.g. 0.54 = 54% smaller)
-- Deduplication
contentHash String @unique (SHA-256 of file content)
-- Image metadata
width Int?
height Int?
altText String?
-- Video metadata
videoProvider String? ("LOCAL" | "YOUTUBE" | "VIMEO")
embedCode String?
thumbnailUrl String?
-- Document metadata
extractedText Text? (first 10,000 chars of parsed text, for display)
ragIndexed Boolean @default(false) (true once text sent to Qdrant)
status String ("ACTIVE" | "FAILED" | "DELETED")
errorMessage Text?
lastCrawledAt DateTime?
createdAt DateTime
updatedAt DateTime
Indexes: [tenantId], unique [contentHash], [tenantId, mediaType]
WebPageMedia (join table)
webPageId String (FK → WebPages)
webMediaId String (FK → WebMedia)
position Int? (order on page)
Primary key: [webPageId, webMediaId]
WebCrawlJob
Tracks each crawl session.
id cuid
tenantId String
connectedChannelId String (FK → ConnectedChannel)
startUrl String
maxPages Int (from tenant config, default 100)
maxDepth Int (default 3)
sameOriginOnly Boolean (always true)
includeMedia Boolean (default true)
mediaTypes String[] (["IMAGE","VIDEO","DOCUMENT"])
-- Progress
status String ("QUEUED" | "RUNNING" | "COMPLETED" | "FAILED")
pagesCrawled Int
mediaDownloaded Int
pagesIndexed Int
duplicatesSkipped Int
changedPages Int (pages whose contentHash differed from last crawl)
robotsTxtSkipped Int (URLs excluded by robots.txt)
errorMessage Text?
-- Timing
triggeredBy String? (userId)
scheduledFor DateTime?
startedAt DateTime?
completedAt DateTime?
-- Snapshots
crawlConfig Json (copy of config at time of crawl)
statistics Json? (bandwidth, compressionSavingsBytes, duration, etc.)
createdAt DateTime
updatedAt DateTime
Tenant Crawl Configuration
Three fields added to the existing Tenant table:
websiteCrawlMaxPages Int @default(100) // range 10–500
websiteCrawlFrequency String @default("MONTHLY") // "MANUAL" | "WEEKLY" | "MONTHLY"
websiteMaxVideoSizeMb Int @default(50) // hard cap on local video downloads
Updatable from the manage portal. The active WebCrawlJob reads these values at the time the crawl starts.
Crawler Agent
BullMQ Queue
| Property | Value |
|---|---|
| Queue name | agent__tenant-web-crawler |
| Agent role | "tenant-web-crawler" (TenantWebCrawlerJobData in queue/types.ts) |
| Enqueue function | enqueueWebsiteCrawl(webCrawlJobId, tenantId, connectedChannelId) |
| Worker file | packages/agents/src/workers/tenant-website-crawler.worker.ts |
| Deduplication | deduplication.id = "tenant-web-crawler__{connectedChannelId}" — one crawl per channel at a time |
| Timeout | 30 minutes |
| Attempts | 2 (exponential backoff 30s) |
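For reference, a minimal sketch of what the enqueue helper might look like. This assumes BullMQ v5's deduplication job option; the queue name, role, and retry settings come from the table above, while the Redis connection and everything else is illustrative, not the actual queues.ts code:

```ts
// Hypothetical sketch of enqueueWebsiteCrawl() — not the real implementation.
import { Queue } from "bullmq";

interface TenantWebCrawlerJobData {
  webCrawlJobId: string;
  tenantId: string;
  connectedChannelId: string;
}

const webCrawlerQueue = new Queue<TenantWebCrawlerJobData>("agent__tenant-web-crawler", {
  connection: { host: process.env.REDIS_HOST ?? "localhost", port: 6379 },
});

export async function enqueueWebsiteCrawl(
  webCrawlJobId: string,
  tenantId: string,
  connectedChannelId: string,
) {
  return webCrawlerQueue.add(
    "tenant-web-crawler",
    { webCrawlJobId, tenantId, connectedChannelId },
    {
      // one crawl per channel at a time
      deduplication: { id: `tenant-web-crawler__${connectedChannelId}` },
      attempts: 2,
      backoff: { type: "exponential", delay: 30_000 },
    },
  );
}
```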
Crawl Logic (BFS)
enqueueWebsiteCrawl(crawlJobId, tenantId, channelId)
└─ website-crawler.worker.ts
├─ Load tenant config: websiteCrawlMaxPages, websiteCrawlFrequency, websiteMaxVideoSizeMb
├─ Parse startUrl → extract origin (e.g. "https://example.com")
│
├─ Phase 1 — Pre-crawl setup
│ ├─ Fetch {origin}/robots.txt → parse Disallow rules for "*" and "Googlebot" agents
│ │ Store disallowed paths list in memory for this crawl session
│ │
│ └─ Fetch {origin}/sitemap.xml (and sitemap index files)
│ ├─ Parse all <loc> URLs → filter to same-origin only
│ ├─ Seed BFS queue with sitemap URLs (up to maxPages)
│ └─ Add startUrl to front of queue if not already present
│
├─ Open Playwright browser (headless, block ads/trackers)
│
├─ Phase 2 — BFS crawl (queue seeded from sitemap + startUrl)
│ For each URL (up to maxPages):
│ ├─ SKIP if URL origin ≠ startUrl origin (same-domain only)
│ ├─ SKIP if URL path matches any robots.txt Disallow rule
│ │ (increment WebCrawlJob.robotsTxtSkipped, log warning)
│ ├─ Fetch page, wait for load, strip nav/footer/cookies
│ ├─ Extract: title, description, metaKeywords, clean text
│ ├─ Compute contentHash (SHA-256 of extractedText)
│ │
│ ├─ Re-crawl change detection:
│ │ ├─ Look up existing WebPages record by [url, connectedChannelId]
│ │ ├─ If contentHash SAME as stored → skip RAG re-embed, update lastCrawledAt only
│ │ └─ If contentHash DIFFERENT (or new page) → update record, re-embed, increment changedPages
│ │
│ ├─ Create/update WebPages record (set lastChangedAt if hash changed)
│ ├─ Screenshot → compress (JPEG 80%) → upload to DO Spaces
│ │ Path: webpages/{tenantId}/{channelId}/screenshots/{pageId}.jpg
│ │
│ ├─ Download & process media:
│ │ ├─ Images (<img>, <picture>, CSS background-image)
│ │ │ ├─ Download → SHA-256 hash
│ │ │ ├─ Check WebMedia.contentHash — SKIP if duplicate
│ │ │ ├─ Compress: JPEG 80% / WebP 85%, max 1920px width (via sharp)
│ │ │ ├─ Upload: webmedia/{tenantId}/images/{hash}.{ext}
│ │ │ ├─ Create WebMedia (IMAGE) with fileSizeBytes, originalFileSizeBytes
│ │ │ └─ Create WebPageMedia join record
│ │ │
│ │ ├─ Videos (<video> tags)
│ │ │ ├─ YouTube/Vimeo <iframe>: store provider + embedCode (no download)
│ │ │ └─ Local videos:
│ │ │ ├─ Check Content-Length header first
│ │ │ ├─ SKIP if > websiteMaxVideoSizeMb (default 50 MB), log warning
│ │ │ └─ If within limit: download → hash → deduplicate → upload
│ │ │ Path: webmedia/{tenantId}/videos/{hash}.{ext}
│ │ │
│ │ └─ Documents (<a href="*.pdf|*.docx|*.doc">)
│ │ ├─ Download → hash → deduplicate → upload
│ │ │ Path: webmedia/{tenantId}/documents/{hash}.{ext}
│ │ └─ Extract text: pipe through pdf-parse (PDF) or mammoth (DOCX)
│ │ └─ enqueueRagContent(docText) → website_content dataset → Qdrant
│ │
│ ├─ enqueueRagContent(extractedText) → website_content dataset → Qdrant
│ │ (only if page is new or contentHash changed)
│ ├─ Extract same-origin links → add to BFS queue (if not seeded from sitemap)
│ └─ Update WebCrawlJob.pagesCrawled every 5 pages
│
├─ Mark WebCrawlJob status = "COMPLETED"
├─ Publish Redis event with statistics
└─ Create ScheduledTask for periodic re-crawl (if frequency ≠ MANUAL)
Image Compression Utility
packages/agents/src/utils/image-compression.ts
Uses the sharp library:
| Format | Quality | Max Width | Notes |
|---|---|---|---|
| JPEG | 80% | 1920px | Default for photos |
| WebP | 85% | 1920px | Better compression ratio |
| PNG | Level 9 | 1920px | Lossless, used for logos/icons |
Returns: { buffer, width, height, originalSize, compressedSize, compressionRatio }
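A minimal sketch of what the compression call could look like with sharp, using the quality/width values and return shape from the table above (the function name and structure are assumptions, not the utility's exact code):

```ts
import sharp from "sharp";

// Illustrative compressImage() sketch — quality and max-width values from the table above.
export async function compressImage(input: Buffer, format: "jpeg" | "webp" | "png" = "jpeg") {
  const pipeline = sharp(input).resize({ width: 1920, withoutEnlargement: true });

  const formatted =
    format === "jpeg" ? pipeline.jpeg({ quality: 80 })
    : format === "webp" ? pipeline.webp({ quality: 85 })
    : pipeline.png({ compressionLevel: 9 }); // lossless, for logos/icons

  const { data, info } = await formatted.toBuffer({ resolveWithObject: true });
  return {
    buffer: data,
    width: info.width,
    height: info.height,
    originalSize: input.length,
    compressedSize: data.length,
    compressionRatio: 1 - data.length / input.length, // e.g. 0.54 = 54% smaller
  };
}
```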
Document Text Extraction
After uploading a document to DO Spaces, the worker extracts its text for RAG indexing using the same parsers already in rag-ingestion.worker.ts:
| File type | Library | Notes |
|---|---|---|
.pdf | pdf-parse | Extracts plain text from all pages |
.docx / .doc | mammoth | Converts to plain text, strips formatting |
The extracted text is sent via enqueueRagContent() with the WebMedia id as the fileId and the website_content dataset. The WebMedia record stores an extractedText field (first 10,000 chars) for display in the UI.
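A sketch of the extraction step, using pdf-parse and mammoth as typically called from Node (the helper name and return shape are hypothetical):

```ts
import pdfParse from "pdf-parse";
import mammoth from "mammoth";

// Hypothetical helper: full text for RAG plus the first 10,000 chars for UI display.
export async function extractDocumentText(buffer: Buffer, fileName: string) {
  let text = "";
  if (fileName.toLowerCase().endsWith(".pdf")) {
    const result = await pdfParse(buffer); // extracts plain text from all pages
    text = result.text;
  } else if (/\.docx?$/i.test(fileName)) {
    const result = await mammoth.extractRawText({ buffer }); // converts to plain text, strips formatting
    text = result.value;
  }
  return { fullText: text, displayText: text.slice(0, 10_000) };
}
```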
Robots.txt Parsing
The worker fetches {origin}/robots.txt once at crawl start. It parses Disallow rules for user-agent * and Googlebot. Each URL is checked against these rules before fetching. Disallowed URLs are skipped, counted in WebCrawlJob.robotsTxtSkipped, and noted in the crawl summary. If robots.txt returns 404 or cannot be parsed, crawling proceeds without restrictions.
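A minimal sketch of the parsing and matching logic described above. The real robots-parser.ts may be more thorough; simple prefix matching on Disallow paths is the assumption here:

```ts
// Illustrative robots.txt handling: collect Disallow rules for "*" and "Googlebot",
// then test each URL's path against them with prefix matching.
export function parseDisallowRules(robotsTxt: string): string[] {
  const rules: string[] = [];
  let applies = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim();
    if (!line.includes(":")) continue;
    const key = line.slice(0, line.indexOf(":")).trim().toLowerCase();
    const value = line.slice(line.indexOf(":") + 1).trim();
    if (key === "user-agent") {
      applies = value === "*" || value.toLowerCase() === "googlebot";
    } else if (applies && key === "disallow" && value) {
      rules.push(value);
    }
  }
  return rules;
}

export function isDisallowed(url: string, rules: string[]): boolean {
  const path = new URL(url).pathname;
  return rules.some((rule) => path.startsWith(rule));
}
```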
Sitemap.xml Parsing
Before BFS begins, the worker tries {origin}/sitemap.xml. If found:
- Parses all <loc> entries (recursively follows <sitemapindex> references)
- Filters to same-origin URLs only, strips query params, deduplicates
- Seeds the BFS queue with up to maxPages sitemap URLs
- startUrl is prepended to ensure the homepage is always crawled first
If sitemap.xml is not found (404), the worker falls back to pure BFS from startUrl. When the site has a sitemap, this produces much better RAG coverage because priority pages are crawled before obscure deep links.
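A rough sketch of the seeding step, using a plain regex over <loc> entries rather than a full XML parser. The actual sitemap-parser.ts presumably handles <sitemapindex> recursion and malformed feeds more carefully; everything here is illustrative:

```ts
// Illustrative: fetch a sitemap and return same-origin URLs, deduplicated and capped at maxPages.
export async function seedFromSitemap(origin: string, maxPages: number): Promise<string[]> {
  const res = await fetch(`${origin}/sitemap.xml`);
  if (!res.ok) return []; // 404 → caller falls back to pure BFS from startUrl

  const xml = await res.text();
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);

  const seen = new Set<string>();
  for (const loc of locs) {
    try {
      const url = new URL(loc);
      if (url.origin !== origin) continue; // same-origin only
      url.search = "";                     // strip query params
      seen.add(url.toString());
    } catch {
      // ignore malformed URLs
    }
  }
  return [...seen].slice(0, maxPages);
}
```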
Video Size Limit
Before downloading a local video file, the worker sends a HEAD request to check the Content-Length header. If the declared size exceeds websiteMaxVideoSizeMb (default 50 MB), the download is skipped. A warning is logged with the file URL and size. The WebCrawlJob.statistics JSON records videosSkippedOversized count.
YouTube/Vimeo embeds are always stored (only the embed code is saved, not the video file itself).
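The Content-Length pre-check might look roughly like this (hypothetical helper; the real worker also logs the warning and increments videosSkippedOversized):

```ts
// Illustrative: decide whether a local video is small enough to download.
export async function isVideoWithinLimit(videoUrl: string, maxSizeMb: number): Promise<boolean> {
  const res = await fetch(videoUrl, { method: "HEAD" });
  const contentLength = res.headers.get("content-length");
  // Assumption: if the server omits Content-Length, attempt the download and enforce the cap while streaming.
  if (!contentLength) return true;
  const sizeMb = Number(contentLength) / (1024 * 1024);
  return sizeMb <= maxSizeMb;
}
```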
Deduplication
The same image appearing on 50 pages is stored once in DO Spaces and once in WebMedia. Each page gets a WebPageMedia join record pointing to the shared asset.
WebMedia (1 record)
contentHash = "sha256:abc123..."
storageUrl = "https://cdn.example.com/webmedia/tenant1/images/abc123.jpg"
fileSizeBytes = 45_000 (compressed from 200_000)
WebPageMedia (50 records)
webPageId = page-001 → webMediaId = media-abc
webPageId = page-002 → webMediaId = media-abc
...
Delete logic:
deleteWebMedia(id) → check WebPageMedia count
if count > 0 → refuse or cascade based on caller intent
if count = 0 → delete from DO Spaces + DB
Brand Assets Extraction
During every crawl, the worker runs extractBrandAssets(page) on the homepage only, before any DOM mutation (nav/footer removal). Results are written to brand_assets for the tenant — but only if primaryColor is currently null (never overwrites manual input).
Extraction strategy
| Priority | Source | Notes |
|---|---|---|
| 1 | CSS custom properties on :root | Handles hex, rgb(), and HSL (239 84% 67%) formats |
| 2 | Computed styles on ALL header/nav buttons + links | Skips transparent ghost buttons automatically |
| 3 | Hero/banner/section background colors | Broader fallback if header scan yields nothing |
| 4 | Google Fonts <link> hrefs | URL ?family=Inter:wght@400 → “Inter” |
| 5 | @font-face names in inline <style> tags | Catches Next.js next/font inlined fonts |
| 6 | Computed font-family on body / h1 | Non-system fonts only |
confidence = [primaryColor, secondaryColor, fontPrimary].filter(Boolean).length / 3. Nothing is written if confidence === 0.
Code: packages/agents/src/utils/brand-extractor.ts
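The HSL fix in point (2) boils down to converting Tailwind/shadcn-style variable values like 239 84% 67% into a hex color. A self-contained sketch of that conversion (the real parseHsl() inside the evaluate block may differ):

```ts
// Illustrative parseHsl(): "239 84% 67%" → roughly "#6467f2" (an indigo).
export function parseHsl(value: string): string | null {
  const match = value.trim().match(/^([\d.]+)[,\s]+([\d.]+)%[,\s]+([\d.]+)%$/);
  if (!match) return null;
  const h = Number(match[1]);
  const s = Number(match[2]) / 100;
  const l = Number(match[3]) / 100;

  // Standard HSL → RGB conversion
  const c = (1 - Math.abs(2 * l - 1)) * s;
  const x = c * (1 - Math.abs(((h / 60) % 2) - 1));
  const m = l - c / 2;
  const [r, g, b] =
    h < 60 ? [c, x, 0] : h < 120 ? [x, c, 0] : h < 180 ? [0, c, x]
    : h < 240 ? [0, x, c] : h < 300 ? [x, 0, c] : [c, 0, x];

  const toHex = (v: number) => Math.round((v + m) * 255).toString(16).padStart(2, "0");
  return `#${toHex(r)}${toHex(g)}${toHex(b)}`;
}
```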
Signup Integration
Trigger Point
After the tenant record is created in Step 4 (POST /auth/v1/register/complete):
// In apps/api/src/routers/auth.ts — after tenant.create()
if (companyWebsite) {
// 1. Create Website ConnectedChannel (auto-connects, no auth needed)
const channel = await channelService.createChannel(
{ type: "Website", title: `${companyName} Website`, url: companyWebsite },
userId,
tenantId,
);
// 2. Create WebCrawlJob with tenant defaults
const crawlJob = await db.webCrawlJob.create({
data: {
tenantId,
connectedChannelId: channel.id,
startUrl: companyWebsite,
maxPages: 100, // tenant.websiteCrawlMaxPages default
maxDepth: 3,
sameOriginOnly: true,
status: "QUEUED",
triggeredBy: userId,
},
});
// 3. Enqueue crawl
await enqueueWebsiteCrawl(crawlJob.id, tenantId, channel.id);
}
Tenant Defaults Set at Creation
await db.tenant.create({
data: {
...tenantFields,
websiteCrawlMaxPages: 100,
websiteCrawlFrequency: "MONTHLY",
},
});
API Routes
All routes require tenant authentication (Authorization: Bearer <token>).
Prefix: /tenant/v1/channels/:id/webcrawl
| Method | Path | Description |
|---|---|---|
GET | /status | Active/last crawl job status + progress |
POST | /start | Manually trigger a new crawl |
GET | /pages | List crawled webpages (paginated, filterable by depth) |
GET | /pages/:webPageId | Webpage detail: screenshot, text, associated media |
GET | /media | List media assets (filterable by mediaType) |
GET | /media/:webMediaId | Media detail including usage count (pages using it) |
GET | /statistics | Aggregated stats: pages, media breakdown, compression savings, storage |
DELETE | /pages/:webPageId | Delete webpage + remove WebPageMedia joins |
DELETE | /media/:webMediaId | Delete media (checks usage count before removing from storage) |
Prefix: /manage/v1/tenants/:id/crawl-settings
| Method | Path | Description |
|---|---|---|
GET | / | Get tenant crawl config (websiteCrawlMaxPages, websiteCrawlFrequency) |
PATCH | / | Update config. Validated: pages 10–500, frequency enum. Updates scheduled tasks if frequency changes. |
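The PATCH validation might be expressed with a zod schema along these lines (hypothetical; the actual router may validate differently):

```ts
import { z } from "zod";

// Illustrative request schema for PATCH /manage/v1/tenants/:id/crawl-settings.
export const crawlSettingsSchema = z
  .object({
    websiteCrawlMaxPages: z.number().int().min(10).max(500),
    websiteCrawlFrequency: z.enum(["MANUAL", "WEEKLY", "MONTHLY"]),
    websiteMaxVideoSizeMb: z.number().int().positive(),
  })
  .partial(); // all fields optional on PATCH

export type CrawlSettingsInput = z.infer<typeof crawlSettingsSchema>;
```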
Channel Details UI (Dashboard)
Route: /channels/:id (Website channel type)
┌─────────────────────────────────────────────────────┐
│ 🌐 Acme Corp Website [Re-crawl] │
│ https://acmecorp.com ● Connected │
├─────────────────────────────────────────────────────┤
│ 47 webpages │ 23 images │ 5 videos │ 3 docs │
│ 12 duplicates skipped │ 2.1 MB saved (45%) │
│ Last crawled: 2 hours ago │ Next: May 1 │
├─────────────────────────────────────────────────────┤
│ [Webpages] [Media Gallery] [Crawl History] │
├─────────────────────────────────────────────────────┤
│ Webpages tab: │
│ URL Title Depth Status │
│ / Home 0 ✓ Active │
│ /about About Us 1 ✓ Active │
│ /products/... Products 2 ✓ Active │
│ ... │
└─────────────────────────────────────────────────────┘
Tabs
Webpages tab — WebPages table, paginated
- Columns: URL, title, depth, HTTP status, last crawled, actions (View, Delete)
- Click row → opens webpage detail view
Webpage detail view
- Screenshot preview
- Title, URL, meta description, depth, HTTP status
- Extracted text (first 500 chars + “Show full text”)
- Media grid (images/videos/docs found on this page via WebPageMedia)
Media Gallery tab — WebMedia grid, filterable
- Filter buttons: All, Images, Videos, Documents
- Images: thumbnail grid
- Videos: thumbnail + play icon (or embed preview)
- Documents: file icon + filename
- Each item shows “Used on X pages” badge
- Click image → lightbox with compression info:
Original: 2.4 MB → 1.1 MB (54% saved)
Crawl History tab — WebCrawlJob list
- Status badge (Queued / Running / Completed / Failed)
- Start time, duration, pages crawled, media downloaded, duplicates skipped
- Errors summary if any pages failed
Re-crawl Button
- Disabled while a crawl is already running
- POST /tenant/v1/channels/:id/webcrawl/start
- Shows live progress toast: "Crawled 23/100 pages, 47 images (12 duplicates skipped)"
- Polls /webcrawl/status every 5 seconds until COMPLETED or FAILED (a polling sketch follows below)
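A rough sketch of that polling loop, assuming plain fetch against the tenant routes listed earlier (function name, token handling, and status payload shape are illustrative):

```ts
// Illustrative: trigger a re-crawl, then poll status every 5 s until it finishes.
type CrawlStatus = { status: "QUEUED" | "RUNNING" | "COMPLETED" | "FAILED"; pagesCrawled: number };

export async function recrawlAndWait(channelId: string, token: string): Promise<CrawlStatus> {
  const base = `/tenant/v1/channels/${channelId}/webcrawl`;
  const headers = { Authorization: `Bearer ${token}` };

  await fetch(`${base}/start`, { method: "POST", headers });

  while (true) {
    await new Promise((resolve) => setTimeout(resolve, 5_000));
    const res = await fetch(`${base}/status`, { headers });
    const job: CrawlStatus = await res.json();
    // a progress toast could be updated here, e.g. `Crawled ${job.pagesCrawled} pages`
    if (job.status === "COMPLETED" || job.status === "FAILED") return job;
  }
}
```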
Manage Portal — Crawl Settings
Route: /tenants/:id/crawl-settings
Website Crawl Settings
──────────────────────
Max Pages per Crawl [100 ▼] (10 – 500)
Crawl Frequency [Monthly ▼] (Manual / Weekly / Monthly)
Current usage: 47 pages across 1 website channel
Next scheduled crawl: May 1, 2026
[Save Settings]
Updating frequency immediately:
- MANUAL → deletes any existing ScheduledTask
- WEEKLY → upserts ScheduledTask with cron: "0 0 * * 0"
- MONTHLY → upserts ScheduledTask with cron: "0 0 1 * *"
Scheduled Re-crawl
On crawl completion, the worker creates a ScheduledTask:
{
type: "website-recrawl",
channelId: channel.id,
tenantId: tenantId,
cron: "0 0 1 * *", // from tenant.websiteCrawlFrequency
nextRunAt: <next occurrence>,
}
The scheduler worker (existing) checks for due website-recrawl tasks and:
- Loads tenant config (websiteCrawlMaxPages, websiteMaxVideoSizeMb)
- Creates a new WebCrawlJob
- Calls enqueueWebsiteCrawl()
- Updates ScheduledTask.nextRunAt
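A small sketch of the frequency-to-cron mapping used both here and by the manage-portal settings handler (the helper name is hypothetical; the cron strings are the ones listed above):

```ts
// Illustrative: map tenant.websiteCrawlFrequency to the cron expression stored on the ScheduledTask.
export function cronForFrequency(freq: "MANUAL" | "WEEKLY" | "MONTHLY"): string | null {
  switch (freq) {
    case "WEEKLY":
      return "0 0 * * 0"; // midnight every Sunday
    case "MONTHLY":
      return "0 0 1 * *"; // midnight on the 1st of each month
    default:
      return null; // MANUAL → no ScheduledTask
  }
}
```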
Change Detection on Re-crawl
On each re-crawl, the worker compares the newly computed contentHash against the value stored in the existing WebPages record:
| Scenario | Action |
|---|---|
| Page is new (no DB record) | Insert record, enqueue RAG embed |
| Hash unchanged | Update lastCrawledAt only — skip RAG re-embed |
| Hash changed | Update record + lastChangedAt, delete old Qdrant vectors for webPageId, re-enqueue RAG embed |
| Page returns 4xx/5xx | Mark status = "FAILED", keep last good text in RAG |
This avoids redundant Qdrant writes on large sites where most content is stable across monthly crawls. The changedPages counter in WebCrawlJob.statistics gives visibility into how much content changed.
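The comparison itself is a straightforward hash check. A hedged sketch, assuming Node's crypto module; the field names come from the schema above and the classification function is purely illustrative:

```ts
import { createHash } from "node:crypto";

// SHA-256 of the cleaned page text — the value stored in WebPages.contentHash.
export function sha256(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// Illustrative change-detection decision for one crawled page.
export function classifyPage(
  newExtractedText: string,
  existing: { contentHash: string | null } | null,
): "NEW" | "UNCHANGED" | "CHANGED" {
  const newHash = sha256(newExtractedText);
  if (!existing) return "NEW";                              // insert record, enqueue RAG embed
  if (existing.contentHash === newHash) return "UNCHANGED"; // update lastCrawledAt only
  return "CHANGED";                                         // update record + lastChangedAt, re-embed
}
```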
RAG Integration
Three content sources feed the website_content Qdrant dataset:
website-crawler.worker.ts
│
├─ Page text (new or changed pages only)
│ └─ enqueueRagContent(extractedText, webPageId, tenantId, datasetId)
│ └─ rag-ingestion.worker.ts → chunk → embed → upsert Qdrant
│
├─ Document text (PDF/DOCX)
│ ├─ pdf-parse or mammoth → raw text string
│ └─ enqueueRagContent(docText, webMediaId, tenantId, datasetId)
│ └─ rag-ingestion.worker.ts → chunk → embed → upsert Qdrant
│
└─ Re-crawl change detection
├─ contentHash SAME → skip enqueueRagContent entirely
└─ contentHash DIFFERENT → delete old Qdrant vectors for webPageId,
re-enqueue with new text
All three paths use the existing rag__ingestion queue and rag-ingestion.worker.ts — no changes required to the RAG system. Agents with access to website_content (all content agents by default) will automatically benefit from crawled text and document content.
Storage Layout (DO Spaces)
{bucket}/
webpages/
{tenantId}/
{channelId}/
screenshots/
{webPageId}.jpg (compressed, JPEG 80%)
webmedia/
{tenantId}/
images/
{contentHash}.jpg (compressed, JPEG 80% or WebP)
videos/
{contentHash}.mp4
documents/
{contentHash}.pdf
{contentHash}.docx
CDN URLs are stored in WebMedia.storageUrl. Content-hash-based paths mean the same file uploaded twice always resolves to the same object (natural deduplication at the storage layer too).
Files Built (April 2026)
| File | Status |
|---|---|
packages/agents/src/workers/tenant-website-crawler.worker.ts | ✅ Built |
packages/agents/src/utils/image-compression.ts | ✅ Built |
packages/agents/src/utils/robots-parser.ts | ✅ Built |
packages/agents/src/utils/sitemap-parser.ts | ✅ Built |
packages/agents/src/utils/document-extractor.ts | ✅ Built |
packages/feature-knowledge/src/webcrawl.service.ts | ✅ Built |
packages/db/prisma/schema.prisma | ✅ WebPage/WebMedia/WebPageMedia/WebCrawlJob models + Tenant crawl fields pushed |
packages/queue/src/queues.ts | ✅ enqueueWebsiteCrawl() added |
packages/queue/src/types.ts | ✅ TenantWebCrawlerJobData + "tenant-web-crawler" AgentRole added |
packages/feature-knowledge/src/index.ts | ✅ Exports all webcrawl service functions + DTOs |
Still To Build
| File | Purpose |
|---|---|
apps/api/src/routers/channels.ts | Add /webcrawl/* sub-routes |
apps/api/src/routers/manage.ts | Add /crawl-settings routes |
apps/dashboard/src/app/(dashboard)/channels/[id]/page.tsx | Channel details page with Statistics + Webpages + Media Gallery + History tabs |
apps/dashboard/src/app/(dashboard)/channels/[id]/pages/[webPageId]/page.tsx | Webpage detail view |
apps/manage/src/app/(manage)/tenants/[id]/crawl-settings/page.tsx | Manage portal: max pages + frequency settings |
| Scheduled re-crawl | Create ScheduledTask on completion; scheduler worker picks it up |
apps/api/src/routers/auth.ts | Auto-trigger crawl at signup if companyWebsite present |
Decisions Log
| # | Decision | Detail |
|---|---|---|
| 1 | Sitemap.xml parsed first | Fetch and parse /sitemap.xml before BFS. Seeds BFS queue with sitemap URLs for better coverage within the page limit. Falls back to pure BFS if 404. |
| 2 | robots.txt respected | Fetch /robots.txt at crawl start. Skip any URL matching Disallow rules for * or Googlebot. Count skips in robotsTxtSkipped. |
| 3 | Document text extracted for RAG | PDFs/DOCX text is extracted via pdf-parse/mammoth and sent to website_content via enqueueRagContent(). First 10,000 chars stored in WebMedia.extractedText for UI display. |
| 4 | Video size limit: 50 MB | Local video downloads capped at websiteMaxVideoSizeMb (default 50 MB, stored on Tenant). Checked via Content-Length header before download. Oversized files counted in videosSkippedOversized. |
| 5 | Change detection on re-crawl | On re-crawl, compare new contentHash to stored. If unchanged: skip RAG re-embed. If changed: delete old Qdrant vectors, re-enqueue with new text, set lastChangedAt. |
Open Questions
- Plan-tier page limits: Could enforce per-plan limits (Free=50, Starter=100, Pro=300, Enterprise=500) rather than a single tenant-level setting. The manage portal would show the plan limit as a ceiling.