
Website Channel — Crawler & Media Indexing

Build Status (May 2026)

Component | Status
BullMQ worker (tenant-website-crawler.worker.ts) | [Live]
Utility libs (robots, sitemap, image-compression, document-extractor) | [Live]
Service layer (webcrawl.service.ts in @leadmetrics/feature-knowledge) | [Live]
DB models (WebPage, WebMedia, WebPageMedia, WebCrawlJob) | [Live] — schema pushed
Dashboard channel detail UI (5-tab: Overview/Pages/Media/Insights/Issues) | [Live]
WebPage detail page (/channels/:id/webpages/:webPageId) with Issues card | [Live]
Website Insights (auto-triggered after crawl completes) | [Live]
Website Issues — per-page issue detection + AI code-fixer PR agent | [Live May 2026] — see issues.md
Manage portal crawl-settings UI | [To Build]
Scheduled re-crawl (ScheduledTask on completion) | [To Build]
Search indexer integration (Typesense) | [To Build]

Bug fixed May 2026 (worker wiring): startTenantWebCrawlerWorker was never called in apps/servers/agents/src/index.ts — crawls were silently ignored. Now wired via startTenantWebCrawler() export.

Bug fixed May 2026 (brand extractor): extractBrandAssets() returned confidence = 0 on modern Next.js/Tailwind/shadcn sites for three reasons: (1) document.querySelector("header button") returned the first (often ghost/transparent) button instead of the filled CTA — fixed by scanning ALL header/nav buttons; (2) HSL CSS variables (--primary: 239 84% 67%) were not parsed — fixed by adding parseHsl() inside the evaluate block; (3) next/font inlines fonts as @font-face rules instead of <link> tags — fixed by scanning inline <style> blocks. Also fixed: onboarding wizard held stale null from server render — BrandAssetsStep now fetches fresh DB state on mount.


Overview

During tenant signup, the user provides their company website URL. The system automatically creates a Website connected channel, crawls the entire site (up to the configured page limit), stores webpages and media assets, and indexes the text content into RAG so AI agents can use it immediately.

What Gets Stored

Table | Content
WebPages | Every crawled page: URL, title, description, extracted text, screenshot
WebMedia | Images, videos, and documents found on those pages
WebPageMedia | Join table linking media assets to pages (many-to-many)
WebCrawlJob | Crawl session tracking: progress, statistics, errors
Qdrant (website_content dataset) | Text chunks for RAG / semantic search

Data Model

WebPages

Stores one record per crawled page URL.

id                  cuid
tenantId            String
connectedChannelId  String (FK → ConnectedChannel)
url                 String (unique per channel)
title               String?
description         String?
metaKeywords        String?
htmlContent         Text? (raw HTML, optional)
extractedText       Text? (clean text, scripts/nav stripped)
screenshot          String? (DO Spaces key)
contentHash         String? (SHA-256 of extractedText — change detection)
httpStatus          Int? (200, 404, 500…)
depth               Int (0 = start URL)
crawlJobId          String (FK → WebCrawlJob)
status              String ("ACTIVE" | "FAILED" | "DELETED")
errorMessage        Text?
lastCrawledAt       DateTime?
lastChangedAt       DateTime? (set when contentHash differs from previous crawl)
createdAt           DateTime
updatedAt           DateTime

Indexes: [tenantId, connectedChannelId], [crawlJobId], unique [url, connectedChannelId]

WebMedia

Stores one record per unique media asset (deduplicated by content hash). The same logo used on every page is stored once in WebMedia and linked to each page via WebPageMedia.

id                     cuid
tenantId               String
mediaType              String ("IMAGE" | "VIDEO" | "DOCUMENT")
sourceUrl              String (original URL on client website)
fileName               String
mimeType               String

-- Storage (DO Spaces)
storageKey             String (DO Spaces object key)
storageUrl             String (CDN URL)
fileSizeBytes          Int (compressed size)
originalFileSizeBytes  Int? (size before compression)
compressionRatio       Float? (e.g. 0.54 = 54% smaller)

-- Deduplication
contentHash            String @unique (SHA-256 of file content)

-- Image metadata
width                  Int?
height                 Int?
altText                String?

-- Video metadata
videoProvider          String? ("LOCAL" | "YOUTUBE" | "VIMEO")
embedCode              String?
thumbnailUrl           String?

-- Document metadata
extractedText          Text? (first 10,000 chars of parsed text, for display)
ragIndexed             Boolean @default(false) (true once text sent to Qdrant)

status                 String ("ACTIVE" | "FAILED" | "DELETED")
errorMessage           Text?
lastCrawledAt          DateTime?
createdAt              DateTime
updatedAt              DateTime

Indexes: [tenantId], unique [contentHash], [tenantId, mediaType]

WebPageMedia (join table)

webPageId   String (FK → WebPages)
webMediaId  String (FK → WebMedia)
position    Int? (order on page)

Primary key: [webPageId, webMediaId]

WebCrawlJob

Tracks each crawl session.

id                  cuid
tenantId            String
connectedChannelId  String (FK → ConnectedChannel)
startUrl            String
maxPages            Int (from tenant config, default 100)
maxDepth            Int (default 3)
sameOriginOnly      Boolean (always true)
includeMedia        Boolean (default true)
mediaTypes          String[] (["IMAGE","VIDEO","DOCUMENT"])

-- Progress
status              String ("QUEUED" | "RUNNING" | "COMPLETED" | "FAILED")
pagesCrawled        Int
mediaDownloaded     Int
pagesIndexed        Int
duplicatesSkipped   Int
changedPages        Int (pages whose contentHash differed from last crawl)
robotsTxtSkipped    Int (URLs excluded by robots.txt)
errorMessage        Text?

-- Timing
triggeredBy         String? (userId)
scheduledFor        DateTime?
startedAt           DateTime?
completedAt         DateTime?

-- Snapshots
crawlConfig         Json (copy of config at time of crawl)
statistics          Json? (bandwidth, compressionSavingsBytes, duration, etc.)
createdAt           DateTime
updatedAt           DateTime

Tenant Crawl Configuration

Three fields added to the existing Tenant table:

websiteCrawlMaxPages   Int    @default(100)       // range 10–500
websiteCrawlFrequency  String @default("MONTHLY") // "MANUAL" | "WEEKLY" | "MONTHLY"
websiteMaxVideoSizeMb  Int    @default(50)        // hard cap on local video downloads

Updatable from the manage portal. The active WebCrawlJob reads these values at the time the crawl starts.


Crawler Agent

BullMQ Queue

Property | Value
Queue name | agent__tenant-web-crawler
Agent role | "tenant-web-crawler" (TenantWebCrawlerJobData in queue/types.ts)
Enqueue function | enqueueWebsiteCrawl(webCrawlJobId, tenantId, connectedChannelId)
Worker file | packages/agents/src/workers/tenant-website-crawler.worker.ts
Deduplication | deduplication.id = "tenant-web-crawler__{connectedChannelId}" — one crawl per channel at a time
Timeout | 30 minutes
Attempts | 2 (exponential backoff 30s)

Crawl Logic (BFS)

enqueueWebsiteCrawl(crawlJobId, tenantId, channelId)
└─ website-crawler.worker.ts
   ├─ Load tenant config: websiteCrawlMaxPages, websiteCrawlFrequency, websiteMaxVideoSizeMb
   ├─ Parse startUrl → extract origin (e.g. "https://example.com")
   ├─ Phase 1 — Pre-crawl setup
   │  ├─ Fetch {origin}/robots.txt → parse Disallow rules for "*" and "Googlebot" agents
   │  │    Store disallowed paths list in memory for this crawl session
   │  │
   │  └─ Fetch {origin}/sitemap.xml (and sitemap index files)
   │     ├─ Parse all <loc> URLs → filter to same-origin only
   │     ├─ Seed BFS queue with sitemap URLs (up to maxPages)
   │     └─ Add startUrl to front of queue if not already present
   ├─ Open Playwright browser (headless, block ads/trackers)
   ├─ Phase 2 — BFS crawl (queue seeded from sitemap + startUrl)
   │    For each URL (up to maxPages):
   │    ├─ SKIP if URL origin ≠ startUrl origin (same-domain only)
   │    ├─ SKIP if URL path matches any robots.txt Disallow rule
   │    │    (increment WebCrawlJob.robotsTxtSkipped, log warning)
   │    ├─ Fetch page, wait for load, strip nav/footer/cookies
   │    ├─ Extract: title, description, metaKeywords, clean text
   │    ├─ Compute contentHash (SHA-256 of extractedText)
   │    │
   │    ├─ Re-crawl change detection:
   │    │  ├─ Look up existing WebPages record by [url, connectedChannelId]
   │    │  ├─ If contentHash SAME as stored → skip RAG re-embed, update lastCrawledAt only
   │    │  └─ If contentHash DIFFERENT (or new page) → update record, re-embed, increment changedPages
   │    │
   │    ├─ Create/update WebPages record (set lastChangedAt if hash changed)
   │    ├─ Screenshot → compress (JPEG 80%) → upload to DO Spaces
   │    │    Path: webpages/{tenantId}/{channelId}/screenshots/{pageId}.jpg
   │    │
   │    ├─ Download & process media:
   │    │  ├─ Images (<img>, <picture>, CSS background-image)
   │    │  │  ├─ Download → SHA-256 hash
   │    │  │  ├─ Check WebMedia.contentHash — SKIP if duplicate
   │    │  │  ├─ Compress: JPEG 80% / WebP 85%, max 1920px width (via sharp)
   │    │  │  ├─ Upload: webmedia/{tenantId}/images/{hash}.{ext}
   │    │  │  ├─ Create WebMedia (IMAGE) with fileSizeBytes, originalFileSizeBytes
   │    │  │  └─ Create WebPageMedia join record
   │    │  │
   │    │  ├─ Videos (<video> tags)
   │    │  │  ├─ YouTube/Vimeo <iframe>: store provider + embedCode (no download)
   │    │  │  └─ Local videos:
   │    │  │     ├─ Check Content-Length header first
   │    │  │     ├─ SKIP if > websiteMaxVideoSizeMb (default 50 MB), log warning
   │    │  │     └─ If within limit: download → hash → deduplicate → upload
   │    │  │          Path: webmedia/{tenantId}/videos/{hash}.{ext}
   │    │  │
   │    │  └─ Documents (<a href="*.pdf|*.docx|*.doc">)
   │    │     ├─ Download → hash → deduplicate → upload
   │    │     │    Path: webmedia/{tenantId}/documents/{hash}.{ext}
   │    │     └─ Extract text: pipe through pdf-parse (PDF) or mammoth (DOCX)
   │    │          └─ enqueueRagContent(docText) → website_content dataset → Qdrant
   │    │
   │    ├─ enqueueRagContent(extractedText) → website_content dataset → Qdrant
   │    │    (only if page is new or contentHash changed)
   │    ├─ Extract same-origin links → add to BFS queue (if not seeded from sitemap)
   │    └─ Update WebCrawlJob.pagesCrawled every 5 pages
   ├─ Mark WebCrawlJob status = "COMPLETED"
   ├─ Publish Redis event with statistics
   └─ Create ScheduledTask for periodic re-crawl (if frequency ≠ MANUAL)

Image Compression Utility

packages/agents/src/utils/image-compression.ts

Uses the sharp library:

Format | Quality | Max Width | Notes
JPEG | 80% | 1920px | Default for photos
WebP | 85% | 1920px | Better compression ratio
PNG | Level 9 | 1920px | Lossless, used for logos/icons

Returns: { buffer, width, height, originalSize, compressedSize, compressionRatio }
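The returned size fields relate through a simple convention worth pinning down: compressionRatio expresses how much smaller the output is (0.54 = 54% smaller), matching the WebMedia field. A minimal sketch, assuming this convention (compressionStats is an illustrative helper name, not the actual utility API):

```typescript
// Hypothetical helper deriving the size stats the compression utility returns.
// Convention: compressionRatio = fraction saved (0.54 means 54% smaller).
interface CompressionStats {
  originalSize: number;
  compressedSize: number;
  compressionRatio: number;
}

function compressionStats(originalSize: number, compressedSize: number): CompressionStats {
  // Guard against divide-by-zero on empty inputs.
  const ratio = originalSize > 0 ? 1 - compressedSize / originalSize : 0;
  return {
    originalSize,
    compressedSize,
    compressionRatio: Math.round(ratio * 100) / 100, // two decimal places, e.g. 0.54
  };
}
```

With the lightbox example from the UI section (2.4 MB → 1.1 MB) this yields 0.54.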

Document Text Extraction

After uploading a document to DO Spaces, the worker extracts its text for RAG indexing using the same parsers already in rag-ingestion.worker.ts:

File type | Library | Notes
.pdf | pdf-parse | Extracts plain text from all pages
.docx / .doc | mammoth | Converts to plain text, strips formatting

The extracted text is sent via enqueueRagContent() with the WebMedia id as the fileId and the website_content dataset. The WebMedia record stores an extractedText field (first 10,000 chars) for display in the UI.

Robots.txt Parsing

The worker fetches {origin}/robots.txt once at crawl start. It parses Disallow rules for user-agent * and Googlebot. Each URL is checked against these rules before fetching. Disallowed URLs are skipped, counted in WebCrawlJob.robotsTxtSkipped, and noted in the crawl summary. If robots.txt returns 404 or cannot be parsed, crawling proceeds without restrictions.
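The rule collection and matching described above can be sketched as follows. This is an illustrative simplification, not the actual robots-parser.ts API: it treats each User-agent line as starting a new group (real robots.txt allows several consecutive User-agent lines to share one group) and uses plain prefix matching on Disallow paths:

```typescript
// Collect Disallow rules that apply to the given user-agents ("*" and Googlebot here).
function parseDisallowRules(robotsTxt: string, agents = ["*", "googlebot"]): string[] {
  const rules: string[] = [];
  let applies = false;
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.split("#")[0].trim(); // strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const key = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (key === "user-agent") {
      applies = agents.includes(value.toLowerCase());
    } else if (key === "disallow" && applies && value !== "") {
      rules.push(value);
    }
  }
  return rules;
}

// Prefix match, as the simplest useful interpretation of Disallow paths.
function isDisallowed(path: string, rules: string[]): boolean {
  return rules.some((rule) => path.startsWith(rule));
}
```

In the worker's loop, a URL whose pathname hits isDisallowed would be skipped and counted in robotsTxtSkipped.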

Sitemap.xml Parsing

Before BFS begins, the worker tries {origin}/sitemap.xml. If found:

  • Parses all <loc> entries (recursively follows <sitemapindex> references)
  • Filters to same-origin URLs only, strips query params, deduplicates
  • Seeds the BFS queue with up to maxPages sitemap URLs
  • startUrl is prepended to ensure the homepage is always crawled first

If sitemap.xml is not found (404), the worker falls back to pure BFS from startUrl. When the site has a sitemap, this produces much better RAG coverage because priority pages are crawled before obscure deep links.
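The seeding step above (parse <loc> entries, same-origin filter, strip query params, dedupe, cap at maxPages) can be sketched like this; extractSitemapUrls is an illustrative name, and a regex stands in for whatever XML parsing sitemap-parser.ts actually does:

```typescript
// Pull <loc> URLs out of a sitemap document and normalize them for BFS seeding.
function extractSitemapUrls(xml: string, origin: string, maxPages: number): string[] {
  const seen = new Set<string>();
  for (const match of xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)) {
    try {
      const url = new URL(match[1]);
      if (url.origin !== origin) continue; // same-origin only
      url.search = "";                     // strip query params
      url.hash = "";
      seen.add(url.toString());            // Set gives deduplication
    } catch {
      // skip malformed URLs
    }
    if (seen.size >= maxPages) break;      // respect the page budget
  }
  return [...seen];
}
```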

Video Size Limit

Before downloading a local video file, the worker sends a HEAD request to check the Content-Length header. If the declared size exceeds websiteMaxVideoSizeMb (default 50 MB), the download is skipped. A warning is logged with the file URL and size. The WebCrawlJob.statistics JSON records videosSkippedOversized count.

YouTube/Vimeo embeds are always stored (only the embed code is saved, not the video file itself).
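The size gate can be sketched as below. exceedsVideoLimit is the testable core; shouldSkipVideo shows how a HEAD request might feed it, assuming a fetch-capable runtime (Node 18+). Both names are illustrative, and the behavior when Content-Length is absent is an assumption (allow the download rather than reject it):

```typescript
// Pure check: does the declared byte size exceed the tenant's MB cap?
function exceedsVideoLimit(contentLengthBytes: number | null, maxSizeMb: number): boolean {
  if (contentLengthBytes === null) return false; // unknown size: let the download proceed
  return contentLengthBytes > maxSizeMb * 1024 * 1024;
}

// Illustrative wrapper: HEAD request, then the pure check.
async function shouldSkipVideo(url: string, maxSizeMb = 50): Promise<boolean> {
  const res = await fetch(url, { method: "HEAD" });
  const len = res.headers.get("content-length");
  return exceedsVideoLimit(len === null ? null : Number(len), maxSizeMb);
}
```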


Deduplication

The same image appearing on 50 pages is stored once in DO Spaces and once in WebMedia. Each page gets a WebPageMedia join record pointing to the shared asset.

WebMedia (1 record)
  contentHash   = "sha256:abc123..."
  storageUrl    = "https://cdn.example.com/webmedia/tenant1/images/abc123.jpg"
  fileSizeBytes = 45_000 (compressed from 200_000)

WebPageMedia (50 records)
  webPageId = page-001 → webMediaId = media-abc
  webPageId = page-002 → webMediaId = media-abc
  ...

Delete logic:
  deleteWebMedia(id) → check WebPageMedia count
    if count > 0 → refuse or cascade based on caller intent
    if count = 0 → delete from DO Spaces + DB

Brand Assets Extraction

During every crawl, the worker runs extractBrandAssets(page) on the homepage only, before any DOM mutation (nav/footer removal). Results are written to brand_assets for the tenant — but only if primaryColor is currently null (never overwrites manual input).

Extraction strategy

Priority | Source | Notes
1 | CSS custom properties on :root | Handles hex, rgb(), and HSL (239 84% 67%) formats
2 | Computed styles on ALL header/nav buttons + links | Skips transparent ghost buttons automatically
3 | Hero/banner/section background colors | Broader fallback if header scan yields nothing
4 | Google Fonts <link> hrefs | URL ?family=Inter:wght@400 → “Inter”
5 | @font-face names in inline <style> tags | Catches Next.js next/font inlined fonts
6 | Computed font-family on body / h1 | Non-system fonts only

confidence = [primaryColor, secondaryColor, fontPrimary].filter(Boolean).length / 3. Nothing is written if confidence === 0.

Code: packages/agents/src/utils/brand-extractor.ts
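The confidence formula is simple enough to state directly in code. The BrandAssets shape below mirrors the three fields named above but is an assumption about the real type in brand-extractor.ts:

```typescript
// Assumed shape: the three extracted brand fields, null when not found.
interface BrandAssets {
  primaryColor: string | null;
  secondaryColor: string | null;
  fontPrimary: string | null;
}

// confidence = fraction of the three fields that were successfully extracted.
// The worker writes nothing when this is 0.
function brandConfidence(assets: BrandAssets): number {
  return [assets.primaryColor, assets.secondaryColor, assets.fontPrimary]
    .filter(Boolean).length / 3;
}
```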


Signup Integration

Trigger Point

After the tenant record is created in Step 4 (POST /auth/v1/register/complete):

// In apps/api/src/routers/auth.ts — after tenant.create()
if (companyWebsite) {
  // 1. Create Website ConnectedChannel (auto-connects, no auth needed)
  const channel = await channelService.createChannel(
    { type: "Website", title: `${companyName} Website`, url: companyWebsite },
    userId,
    tenantId,
  );

  // 2. Create WebCrawlJob with tenant defaults
  const crawlJob = await db.webCrawlJob.create({
    data: {
      tenantId,
      connectedChannelId: channel.id,
      startUrl: companyWebsite,
      maxPages: 100, // tenant.websiteCrawlMaxPages default
      maxDepth: 3,
      sameOriginOnly: true,
      status: "QUEUED",
      triggeredBy: userId,
    },
  });

  // 3. Enqueue crawl
  await enqueueWebsiteCrawl(crawlJob.id, tenantId, channel.id);
}

Tenant Defaults Set at Creation

await db.tenant.create({
  data: {
    ...tenantFields,
    websiteCrawlMaxPages: 100,
    websiteCrawlFrequency: "MONTHLY",
  },
});

API Routes

All routes require tenant authentication (Authorization: Bearer <token>).

Prefix: /tenant/v1/channels/:id/webcrawl

Method | Path | Description
GET | /status | Active/last crawl job status + progress
POST | /start | Manually trigger a new crawl
GET | /pages | List crawled webpages (paginated, filterable by depth)
GET | /pages/:webPageId | Webpage detail: screenshot, text, associated media
GET | /media | List media assets (filterable by mediaType)
GET | /media/:webMediaId | Media detail including usage count (pages using it)
GET | /statistics | Aggregated stats: pages, media breakdown, compression savings, storage
DELETE | /pages/:webPageId | Delete webpage + remove WebPageMedia joins
DELETE | /media/:webMediaId | Delete media (checks usage count before removing from storage)

Prefix: /manage/v1/tenants/:id/crawl-settings

Method | Path | Description
GET | / | Get tenant crawl config (websiteCrawlMaxPages, websiteCrawlFrequency)
PATCH | / | Update config. Validated: pages 10–500, frequency enum. Updates scheduled tasks if frequency changes.
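The PATCH validation rules (integer pages within 10–500, frequency enum) can be sketched as a plain validator; the function and field names follow the doc, but the error-list shape is an assumption about how the route reports failures:

```typescript
type CrawlSettingsPatch = {
  websiteCrawlMaxPages?: number;
  websiteCrawlFrequency?: string;
};

// Returns a list of validation errors; empty list means the patch is acceptable.
function validateCrawlSettings(patch: CrawlSettingsPatch): string[] {
  const errors: string[] = [];
  if (patch.websiteCrawlMaxPages !== undefined &&
      (!Number.isInteger(patch.websiteCrawlMaxPages) ||
       patch.websiteCrawlMaxPages < 10 || patch.websiteCrawlMaxPages > 500)) {
    errors.push("websiteCrawlMaxPages must be an integer between 10 and 500");
  }
  if (patch.websiteCrawlFrequency !== undefined &&
      !["MANUAL", "WEEKLY", "MONTHLY"].includes(patch.websiteCrawlFrequency)) {
    errors.push("websiteCrawlFrequency must be MANUAL, WEEKLY, or MONTHLY");
  }
  return errors;
}
```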

Channel Details UI (Dashboard)

Route: /channels/:id (Website channel type)

┌─────────────────────────────────────────────────────┐
│ 🌐 Acme Corp Website                     [Re-crawl] │
│ https://acmecorp.com   ● Connected                  │
├─────────────────────────────────────────────────────┤
│ 47 webpages │ 23 images │ 5 videos │ 3 docs         │
│ 12 duplicates skipped │ 2.1 MB saved (45%)          │
│ Last crawled: 2 hours ago │ Next: May 1             │
├─────────────────────────────────────────────────────┤
│ [Webpages] [Media Gallery] [Crawl History]          │
├─────────────────────────────────────────────────────┤
│ Webpages tab:                                       │
│   URL            Title      Depth  Status           │
│   /              Home       0      ✓ Active         │
│   /about         About Us   1      ✓ Active         │
│   /products/...  Products   2      ✓ Active         │
│   ...                                               │
└─────────────────────────────────────────────────────┘

Tabs

Webpages tab — WebPages table, paginated

  • Columns: URL, title, depth, HTTP status, last crawled, actions (View, Delete)
  • Click row → opens webpage detail view

Webpage detail view

  • Screenshot preview
  • Title, URL, meta description, depth, HTTP status
  • Extracted text (first 500 chars + “Show full text”)
  • Media grid (images/videos/docs found on this page via WebPageMedia)

Media Gallery tab — WebMedia grid, filterable

  • Filter buttons: All, Images, Videos, Documents
  • Images: thumbnail grid
  • Videos: thumbnail + play icon (or embed preview)
  • Documents: file icon + filename
  • Each item shows “Used on X pages” badge
  • Click image → lightbox with compression info: Original: 2.4 MB → 1.1 MB (54% saved)

Crawl History tab — WebCrawlJob list

  • Status badge (Queued / Running / Completed / Failed)
  • Start time, duration, pages crawled, media downloaded, duplicates skipped
  • Errors summary if any pages failed

Re-crawl Button

  • Disabled while a crawl is already running
  • POST /tenant/v1/channels/:id/webcrawl/start
  • Shows live progress toast: "Crawled 23/100 pages, 47 images (12 duplicates skipped)"
  • Polls /webcrawl/status every 5 seconds until COMPLETED or FAILED

Manage Portal — Crawl Settings

Route: /tenants/:id/crawl-settings

Website Crawl Settings
──────────────────────

Max Pages per Crawl   [100 ▼]       (10 – 500)
Crawl Frequency       [Monthly ▼]   (Manual / Weekly / Monthly)

Current usage: 47 pages across 1 website channel
Next scheduled crawl: May 1, 2026

[Save Settings]

Changing the frequency takes effect immediately:

  • MANUAL → deletes any existing ScheduledTask
  • WEEKLY → upserts ScheduledTask with cron: "0 0 * * 0"
  • MONTHLY → upserts ScheduledTask with cron: "0 0 1 * *"
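The mapping above can be stated as a small lookup (cronForFrequency is an illustrative helper name; returning null represents the MANUAL case, where any existing ScheduledTask is deleted):

```typescript
type CrawlFrequency = "MANUAL" | "WEEKLY" | "MONTHLY";

// Map a tenant's crawl frequency to the cron expression used for its ScheduledTask.
function cronForFrequency(frequency: CrawlFrequency): string | null {
  switch (frequency) {
    case "WEEKLY":
      return "0 0 * * 0"; // midnight every Sunday
    case "MONTHLY":
      return "0 0 1 * *"; // midnight on the 1st of each month
    case "MANUAL":
      return null;        // no schedule: delete any existing ScheduledTask
  }
}
```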

Scheduled Re-crawl

On crawl completion, the worker creates a ScheduledTask:

{
  type: "website-recrawl",
  channelId: channel.id,
  tenantId: tenantId,
  cron: "0 0 1 * *", // from tenant.websiteCrawlFrequency
  nextRunAt: <next occurrence>,
}

The scheduler worker (existing) checks for due website-recrawl tasks and:

  1. Loads tenant config (websiteCrawlMaxPages, websiteMaxVideoSizeMb)
  2. Creates a new WebCrawlJob
  3. Calls enqueueWebsiteCrawl()
  4. Updates ScheduledTask.nextRunAt

Change Detection on Re-crawl

On each re-crawl, the worker compares the newly computed contentHash against the value stored in the existing WebPages record:

Scenario | Action
Page is new (no DB record) | Insert record, enqueue RAG embed
Hash unchanged | Update lastCrawledAt only — skip RAG re-embed
Hash changed | Update record + lastChangedAt, delete old Qdrant vectors for webPageId, re-enqueue RAG embed
Page returns 4xx/5xx | Mark status = "FAILED", keep last good text in RAG

This avoids redundant Qdrant writes on large sites where most content is stable across monthly crawls. The WebCrawlJob.changedPages counter gives visibility into how much content changed.
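The first three rows of the table reduce to a pure decision function over the stored and freshly computed hashes. A minimal sketch using Node's crypto module (decideRecrawlAction and the action names are illustrative, not the worker's actual API; the 4xx/5xx row is handled before hashing and is omitted here):

```typescript
import { createHash } from "node:crypto";

type RecrawlAction = "insert_and_embed" | "touch_only" | "update_and_reembed";

// SHA-256 of the page's extracted text, as stored in WebPages.contentHash.
function contentHash(extractedText: string): string {
  return createHash("sha256").update(extractedText, "utf8").digest("hex");
}

function decideRecrawlAction(storedHash: string | null, newText: string): RecrawlAction {
  const newHash = contentHash(newText);
  if (storedHash === null) return "insert_and_embed"; // page is new: insert + embed
  if (storedHash === newHash) return "touch_only";    // unchanged: update lastCrawledAt only
  return "update_and_reembed";                        // changed: delete old vectors, re-embed
}
```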


RAG Integration

Three content sources feed the website_content Qdrant dataset:

website-crawler.worker.ts
├─ Page text (new or changed pages only)
│  └─ enqueueRagContent(extractedText, webPageId, tenantId, datasetId)
│     └─ rag-ingestion.worker.ts → chunk → embed → upsert Qdrant
├─ Document text (PDF/DOCX)
│  ├─ pdf-parse or mammoth → raw text string
│  └─ enqueueRagContent(docText, webMediaId, tenantId, datasetId)
│     └─ rag-ingestion.worker.ts → chunk → embed → upsert Qdrant
└─ Re-crawl change detection
   ├─ contentHash SAME → skip enqueueRagContent entirely
   └─ contentHash DIFFERENT → delete old Qdrant vectors for webPageId, re-enqueue with new text

All three paths use the existing rag__ingestion queue and rag-ingestion.worker.ts — no changes required to the RAG system. Agents with access to website_content (all content agents by default) will automatically benefit from crawled text and document content.


Storage Layout (DO Spaces)

{bucket}/
  webpages/
    {tenantId}/
      {channelId}/
        screenshots/
          {webPageId}.jpg      (compressed, JPEG 80%)
  webmedia/
    {tenantId}/
      images/
        {contentHash}.jpg      (compressed, JPEG 80% or WebP)
      videos/
        {contentHash}.mp4
      documents/
        {contentHash}.pdf
        {contentHash}.docx

CDN URLs stored in WebMedia.storageUrl. Content-hash based paths mean the same file uploaded twice always resolves to the same object (natural deduplication at the storage layer too).
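A sketch of a key builder matching this layout, showing why content-hash paths deduplicate naturally (mediaStorageKey is an illustrative name, not the worker's actual helper):

```typescript
import { createHash } from "node:crypto";

// Build a DO Spaces object key from file content: identical bytes always
// produce the same key, so a re-upload overwrites the same object.
function mediaStorageKey(
  tenantId: string,
  kind: "images" | "videos" | "documents",
  fileContent: Uint8Array,
  ext: string,
): string {
  const hash = createHash("sha256").update(fileContent).digest("hex");
  return `webmedia/${tenantId}/${kind}/${hash}.${ext}`;
}
```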


Files Built (April 2026)

File | Status
packages/agents/src/workers/tenant-website-crawler.worker.ts | ✅ Built
packages/agents/src/utils/image-compression.ts | ✅ Built
packages/agents/src/utils/robots-parser.ts | ✅ Built
packages/agents/src/utils/sitemap-parser.ts | ✅ Built
packages/agents/src/utils/document-extractor.ts | ✅ Built
packages/feature-knowledge/src/webcrawl.service.ts | ✅ Built
packages/db/prisma/schema.prisma | ✅ WebPage/WebMedia/WebPageMedia/WebCrawlJob models + Tenant crawl fields pushed
packages/queue/src/queues.ts | enqueueWebsiteCrawl() added
packages/queue/src/types.ts | TenantWebCrawlerJobData + "tenant-web-crawler" AgentRole added
packages/feature-knowledge/src/index.ts | ✅ Exports all webcrawl service functions + DTOs

Still To Build

File | Purpose
apps/api/src/routers/channels.ts | Add /webcrawl/* sub-routes
apps/api/src/routers/manage.ts | Add /crawl-settings routes
apps/dashboard/src/app/(dashboard)/channels/[id]/page.tsx | Channel details page with Statistics + Webpages + Media Gallery + History tabs
apps/dashboard/src/app/(dashboard)/channels/[id]/pages/[webPageId]/page.tsx | Webpage detail view
apps/manage/src/app/(manage)/tenants/[id]/crawl-settings/page.tsx | Manage portal: max pages + frequency settings
Scheduled re-crawl | Create ScheduledTask on completion; scheduler worker picks it up
apps/api/src/routers/auth.ts | Auto-trigger crawl at signup if companyWebsite present

Decisions Log

# | Decision | Detail
1 | Sitemap.xml parsed first | Fetch and parse /sitemap.xml before BFS. Seeds BFS queue with sitemap URLs for better coverage within the page limit. Falls back to pure BFS if 404.
2 | robots.txt respected | Fetch /robots.txt at crawl start. Skip any URL matching Disallow rules for * or Googlebot. Count skips in robotsTxtSkipped.
3 | Document text extracted for RAG | PDFs/DOCX text is extracted via pdf-parse/mammoth and sent to website_content via enqueueRagContent(). First 10,000 chars stored in WebMedia.extractedText for UI display.
4 | Video size limit: 50 MB | Local video downloads capped at websiteMaxVideoSizeMb (default 50 MB, stored on Tenant). Checked via Content-Length header before download. Oversized files counted in videosSkippedOversized.
5 | Change detection on re-crawl | On re-crawl, compare new contentHash to stored. If unchanged: skip RAG re-embed. If changed: delete old Qdrant vectors, re-enqueue with new text, set lastChangedAt.

Open Questions

  1. Plan-tier page limits: Could enforce per-plan limits (Free=50, Starter=100, Pro=300, Enterprise=500) rather than a single tenant-level setting. The manage portal would show the plan limit as a ceiling.

© 2026 Leadmetrics — Internal use only