
Website Channel — Crawler & Media Indexing

Build Status (May 2026)

Component | Status
BullMQ worker (tenant-website-crawler.worker.ts) | [Live]
Utility libs (robots, sitemap, image-compression, document-extractor) | [Live]
Service layer (webcrawl.service.ts in @leadmetrics/feature-knowledge) | [Live]
DB models (WebPage, WebMedia, WebPageMedia, WebCrawlJob) | [Live] — schema pushed
Dashboard channel detail UI (5-tab: Overview/Pages/Media/Insights/Issues) | [Live]
WebPage detail page (/channels/:id/webpages/:webPageId) with Issues card | [Live]
Website Insights (auto-triggered after crawl completes) | [Live]
Website Issues — per-page issue detection + AI code-fixer PR agent | [Live May 2026] — see issues.md
Manage portal crawl-settings UI | [To Build]
Scheduled re-crawl (ScheduledTask on completion) | [To Build]
Search indexer integration (Typesense) | [To Build]

Bug fixed May 2026 (worker wiring): startTenantWebCrawlerWorker was never called in apps/servers/agents/src/index.ts — crawls were silently ignored. Now wired via startTenantWebCrawler() export.

Bug fixed May 2026 (brand extractor): extractBrandAssets() returned confidence = 0 on modern Next.js/Tailwind/shadcn sites for three reasons: (1) document.querySelector("header button") returned the first (often ghost/transparent) button instead of the filled CTA — fixed by scanning ALL header/nav buttons; (2) HSL CSS variables (--primary: 239 84% 67%) were not parsed — fixed by adding parseHsl() inside the evaluate block; (3) next/font inlines fonts as @font-face rules instead of <link> tags — fixed by scanning inline <style> blocks. Also fixed: onboarding wizard held stale null from server render — BrandAssetsStep now fetches fresh DB state on mount.


Overview

During tenant signup, the user provides their company website URL. The system automatically creates a Website connected channel, crawls the entire site (up to the configured page limit), stores webpages and media assets, and indexes the text content into RAG so AI agents can use it immediately.

What Gets Stored

Table | Content
WebPages | Every crawled page: URL, title, description, extracted text, screenshot
WebMedia | Images, videos, and documents found on those pages
WebPageMedia | Join table linking media assets to pages (many-to-many)
WebCrawlJob | Crawl session tracking: progress, statistics, errors
Qdrant (website_content dataset) | Text chunks for RAG / semantic search

Data Model

WebPages

Stores one record per crawled page URL.

id                  cuid
tenantId            String
connectedChannelId  String (FK → ConnectedChannel)
url                 String (unique per channel)
title               String?
description         String?
metaKeywords        String?
htmlContent         Text? (raw HTML, optional)
extractedText       Text? (clean text, scripts/nav stripped)
screenshot          String? (DO Spaces key)
contentHash         String? (SHA-256 of extractedText — change detection)
httpStatus          Int? (200, 404, 500…)
depth               Int (0 = start URL)
crawlJobId          String (FK → WebCrawlJob)
status              String ("ACTIVE" | "FAILED" | "DELETED")
errorMessage        Text?
lastCrawledAt       DateTime?
lastChangedAt       DateTime? (set when contentHash differs from previous crawl)
createdAt           DateTime
updatedAt           DateTime

Indexes: [tenantId, connectedChannelId], [crawlJobId], unique [url, connectedChannelId]

WebMedia

Stores one record per unique media asset (deduplicated by content hash). The same logo used on every page is stored once in WebMedia and linked to each page via WebPageMedia.

id                     cuid
tenantId               String
mediaType              String ("IMAGE" | "VIDEO" | "DOCUMENT")
sourceUrl              String (original URL on client website)
fileName               String
mimeType               String

-- Storage (DO Spaces)
storageKey             String (DO Spaces object key)
storageUrl             String (CDN URL)
fileSizeBytes          Int (compressed size)
originalFileSizeBytes  Int? (size before compression)
compressionRatio       Float? (e.g. 0.54 = 54% smaller)

-- Deduplication
contentHash            String @unique (SHA-256 of file content)

-- Image metadata
width                  Int?
height                 Int?
altText                String?

-- Video metadata
videoProvider          String? ("LOCAL" | "YOUTUBE" | "VIMEO")
embedCode              String?
thumbnailUrl           String?

-- Document metadata
extractedText          Text? (first 10,000 chars of parsed text, for display)
ragIndexed             Boolean @default(false) (true once text sent to Qdrant)

status                 String ("ACTIVE" | "FAILED" | "DELETED")
errorMessage           Text?
lastCrawledAt          DateTime?
createdAt              DateTime
updatedAt              DateTime

Indexes: [tenantId], unique [contentHash], [tenantId, mediaType]

WebPageMedia (join table)

webPageId   String (FK → WebPages)
webMediaId  String (FK → WebMedia)
position    Int? (order on page)

Primary key: [webPageId, webMediaId]

WebCrawlJob

Tracks each crawl session.

id                  cuid
tenantId            String
connectedChannelId  String (FK → ConnectedChannel)
startUrl            String
maxPages            Int (from tenant config, default 100)
maxDepth            Int (default 3)
sameOriginOnly      Boolean (always true)
includeMedia        Boolean (default true)
mediaTypes          String[] (["IMAGE","VIDEO","DOCUMENT"])

-- Progress
status              String ("QUEUED" | "RUNNING" | "COMPLETED" | "FAILED")
pagesCrawled        Int
mediaDownloaded     Int
pagesIndexed        Int
duplicatesSkipped   Int
changedPages        Int (pages whose contentHash differed from last crawl)
robotsTxtSkipped    Int (URLs excluded by robots.txt)
errorMessage        Text?

-- Timing
triggeredBy         String? (userId)
scheduledFor        DateTime?
startedAt           DateTime?
completedAt         DateTime?

-- Snapshots
crawlConfig         Json (copy of config at time of crawl)
statistics          Json? (bandwidth, compressionSavingsBytes, duration, etc.)
createdAt           DateTime
updatedAt           DateTime

Tenant Crawl Configuration

Three fields added to the existing Tenant table:

websiteCrawlMaxPages   Int    @default(100)       // range 10–500
websiteCrawlFrequency  String @default("MONTHLY") // "MANUAL" | "WEEKLY" | "MONTHLY"
websiteMaxVideoSizeMb  Int    @default(50)        // hard cap on local video downloads

Updatable from the manage portal. The active WebCrawlJob reads these values at the time the crawl starts.


Crawler Agent

BullMQ Queue

Property | Value
Queue name | agent__tenant-web-crawler
Agent role | "tenant-web-crawler" (TenantWebCrawlerJobData in queue/types.ts)
Enqueue function | enqueueWebsiteCrawl(webCrawlJobId, tenantId, connectedChannelId)
Worker file | packages/agents/src/workers/tenant-website-crawler.worker.ts
Deduplication | deduplication.id = "tenant-web-crawler__{connectedChannelId}" — one crawl per channel at a time
Timeout | 30 minutes
Attempts | 2 (exponential backoff 30s)

Crawl Logic (BFS)

enqueueWebsiteCrawl(crawlJobId, tenantId, channelId)
└─ website-crawler.worker.ts
   ├─ Load tenant config: websiteCrawlMaxPages, websiteCrawlFrequency, websiteMaxVideoSizeMb
   ├─ Parse startUrl → extract origin (e.g. "https://example.com")
   ├─ Phase 1 — Pre-crawl setup
   │  ├─ Fetch {origin}/robots.txt → parse Disallow rules for "*" and "Googlebot" agents
   │  │    Store disallowed paths list in memory for this crawl session
   │  │
   │  └─ Fetch {origin}/sitemap.xml (and sitemap index files)
   │     ├─ Parse all <loc> URLs → filter to same-origin only
   │     ├─ Seed BFS queue with sitemap URLs (up to maxPages)
   │     └─ Add startUrl to front of queue if not already present
   ├─ Open Playwright browser (headless, block ads/trackers)
   ├─ Phase 2 — BFS crawl (queue seeded from sitemap + startUrl)
   │    For each URL (up to maxPages):
   │    ├─ SKIP if URL origin ≠ startUrl origin (same-domain only)
   │    ├─ SKIP if URL path matches any robots.txt Disallow rule
   │    │    (increment WebCrawlJob.robotsTxtSkipped, log warning)
   │    ├─ Fetch page, wait for load, strip nav/footer/cookies
   │    ├─ Extract: title, description, metaKeywords, clean text
   │    ├─ Compute contentHash (SHA-256 of extractedText)
   │    │
   │    ├─ Re-crawl change detection:
   │    │  ├─ Look up existing WebPages record by [url, connectedChannelId]
   │    │  ├─ If contentHash SAME as stored → skip RAG re-embed, update lastCrawledAt only
   │    │  └─ If contentHash DIFFERENT (or new page) → update record, re-embed, increment changedPages
   │    │
   │    ├─ Create/update WebPages record (set lastChangedAt if hash changed)
   │    ├─ Screenshot → compress (JPEG 80%) → upload to DO Spaces
   │    │    Path: webpages/{tenantId}/{channelId}/screenshots/{pageId}.jpg
   │    │
   │    ├─ Download & process media:
   │    │  ├─ Images (<img>, <picture>, CSS background-image)
   │    │  │  ├─ Download → SHA-256 hash
   │    │  │  ├─ Check WebMedia.contentHash — SKIP if duplicate
   │    │  │  ├─ Compress: JPEG 80% / WebP 85%, max 1920px width (via sharp)
   │    │  │  ├─ Upload: webmedia/{tenantId}/images/{hash}.{ext}
   │    │  │  ├─ Create WebMedia (IMAGE) with fileSizeBytes, originalFileSizeBytes
   │    │  │  └─ Create WebPageMedia join record
   │    │  │
   │    │  ├─ Videos (<video> tags)
   │    │  │  ├─ YouTube/Vimeo <iframe>: store provider + embedCode (no download)
   │    │  │  └─ Local videos:
   │    │  │     ├─ Check Content-Length header first
   │    │  │     ├─ SKIP if > websiteMaxVideoSizeMb (default 50 MB), log warning
   │    │  │     └─ If within limit: download → hash → deduplicate → upload
   │    │  │          Path: webmedia/{tenantId}/videos/{hash}.{ext}
   │    │  │
   │    │  └─ Documents (<a href="*.pdf|*.docx|*.doc">)
   │    │     ├─ Download → hash → deduplicate → upload
   │    │     │    Path: webmedia/{tenantId}/documents/{hash}.{ext}
   │    │     └─ Extract text: pipe through pdf-parse (PDF) or mammoth (DOCX)
   │    │          └─ enqueueRagContent(docText) → website_content dataset → Qdrant
   │    │
   │    ├─ enqueueRagContent(extractedText) → website_content dataset → Qdrant
   │    │    (only if page is new or contentHash changed)
   │    ├─ Extract same-origin links → add to BFS queue (if not seeded from sitemap)
   │    └─ Update WebCrawlJob.pagesCrawled every 5 pages
   ├─ Mark WebCrawlJob status = "COMPLETED"
   ├─ Publish Redis event with statistics
   └─ Create ScheduledTask for periodic re-crawl (if frequency ≠ MANUAL)

Image Compression Utility

packages/agents/src/utils/image-compression.ts

Uses the sharp library:

Format | Quality | Max Width | Notes
JPEG | 80% | 1920px | Default for photos
WebP | 85% | 1920px | Better compression ratio
PNG | Level 9 | 1920px | Lossless, used for logos/icons

Returns: { buffer, width, height, originalSize, compressedSize, compressionRatio }
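The returned size fields relate through a simple convention worth pinning down: compressionRatio expresses how much smaller the output is (0.54 = 54% smaller), matching the WebMedia field. A minimal sketch, assuming this convention (compressionStats is an illustrative helper name, not the actual utility API):

```typescript
// Hypothetical helper deriving the size stats the compression utility returns.
// Convention: compressionRatio = fraction saved (0.54 means 54% smaller).
interface CompressionStats {
  originalSize: number;
  compressedSize: number;
  compressionRatio: number;
}

function compressionStats(originalSize: number, compressedSize: number): CompressionStats {
  // Guard against divide-by-zero on empty inputs.
  const ratio = originalSize > 0 ? 1 - compressedSize / originalSize : 0;
  return {
    originalSize,
    compressedSize,
    compressionRatio: Math.round(ratio * 100) / 100, // two decimal places, e.g. 0.54
  };
}
```

With the lightbox example from the UI section (2.4 MB → 1.1 MB) this yields 0.54.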

Document Text Extraction

After uploading a document to DO Spaces, the worker extracts its text for RAG indexing using the same parsers already in rag-ingestion.worker.ts:

File type | Library | Notes
.pdf | pdf-parse | Extracts plain text from all pages
.docx / .doc | mammoth | Converts to plain text, strips formatting

The extracted text is sent via enqueueRagContent() with the WebMedia id as the fileId and the website_content dataset. The WebMedia record stores an extractedText field (first 10,000 chars) for display in the UI.

Robots.txt Parsing

The worker fetches {origin}/robots.txt once at crawl start. It parses Disallow rules for user-agent * and Googlebot. Each URL is checked against these rules before fetching. Disallowed URLs are skipped, counted in WebCrawlJob.robotsTxtSkipped, and noted in the crawl summary. If robots.txt returns 404 or cannot be parsed, crawling proceeds without restrictions.
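The rule collection and matching described above can be sketched as follows. This is an illustrative simplification, not the actual robots-parser.ts API: it treats each User-agent line as starting a new group (real robots.txt allows several consecutive User-agent lines to share one group) and uses plain prefix matching on Disallow paths:

```typescript
// Collect Disallow rules that apply to the given user-agents ("*" and Googlebot here).
function parseDisallowRules(robotsTxt: string, agents = ["*", "googlebot"]): string[] {
  const rules: string[] = [];
  let applies = false;
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.split("#")[0].trim(); // strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const key = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (key === "user-agent") {
      applies = agents.includes(value.toLowerCase());
    } else if (key === "disallow" && applies && value !== "") {
      rules.push(value);
    }
  }
  return rules;
}

// Prefix match, as the simplest useful interpretation of Disallow paths.
function isDisallowed(path: string, rules: string[]): boolean {
  return rules.some((rule) => path.startsWith(rule));
}
```

In the worker's loop, a URL whose pathname hits isDisallowed would be skipped and counted in robotsTxtSkipped.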

Sitemap.xml Parsing

Before BFS begins, the worker tries {origin}/sitemap.xml. If found:

  • Parses all <loc> entries (recursively follows <sitemapindex> references)
  • Filters to same-origin URLs only, strips query params, deduplicates
  • Seeds the BFS queue with up to maxPages sitemap URLs
  • startUrl is prepended to ensure the homepage is always crawled first

If sitemap.xml is not found (404), the worker falls back to pure BFS from startUrl. When the site has a sitemap, this produces much better RAG coverage because priority pages are crawled before obscure deep links.
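The seeding step above (parse <loc> entries, same-origin filter, strip query params, dedupe, cap at maxPages) can be sketched like this; extractSitemapUrls is an illustrative name, and a regex stands in for whatever XML parsing sitemap-parser.ts actually does:

```typescript
// Pull <loc> URLs out of a sitemap document and normalize them for BFS seeding.
function extractSitemapUrls(xml: string, origin: string, maxPages: number): string[] {
  const seen = new Set<string>();
  for (const match of xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)) {
    try {
      const url = new URL(match[1]);
      if (url.origin !== origin) continue; // same-origin only
      url.search = "";                     // strip query params
      url.hash = "";
      seen.add(url.toString());            // Set gives deduplication
    } catch {
      // skip malformed URLs
    }
    if (seen.size >= maxPages) break;      // respect the page budget
  }
  return [...seen];
}
```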

Video Size Limit

Before downloading a local video file, the worker sends a HEAD request to check the Content-Length header. If the declared size exceeds websiteMaxVideoSizeMb (default 50 MB), the download is skipped. A warning is logged with the file URL and size. The WebCrawlJob.statistics JSON records videosSkippedOversized count.

YouTube/Vimeo embeds are always stored (only the embed code is saved, not the video file itself).
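The size gate can be sketched as below. exceedsVideoLimit is the testable core; shouldSkipVideo shows how a HEAD request might feed it, assuming a fetch-capable runtime (Node 18+). Both names are illustrative, and the behavior when Content-Length is absent is an assumption (allow the download rather than reject it):

```typescript
// Pure check: does the declared byte size exceed the tenant's MB cap?
function exceedsVideoLimit(contentLengthBytes: number | null, maxSizeMb: number): boolean {
  if (contentLengthBytes === null) return false; // unknown size: let the download proceed
  return contentLengthBytes > maxSizeMb * 1024 * 1024;
}

// Illustrative wrapper: HEAD request, then the pure check.
async function shouldSkipVideo(url: string, maxSizeMb = 50): Promise<boolean> {
  const res = await fetch(url, { method: "HEAD" });
  const len = res.headers.get("content-length");
  return exceedsVideoLimit(len === null ? null : Number(len), maxSizeMb);
}
```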


Deduplication

The same image appearing on 50 pages is stored once in DO Spaces and once in WebMedia. Each page gets a WebPageMedia join record pointing to the shared asset.

WebMedia (1 record)
  contentHash   = "sha256:abc123..."
  storageUrl    = "https://cdn.example.com/webmedia/tenant1/images/abc123.jpg"
  fileSizeBytes = 45_000 (compressed from 200_000)

WebPageMedia (50 records)
  webPageId = page-001 → webMediaId = media-abc
  webPageId = page-002 → webMediaId = media-abc
  ...

Delete logic:
  deleteWebMedia(id) → check WebPageMedia count
    if count > 0 → refuse or cascade based on caller intent
    if count = 0 → delete from DO Spaces + DB

Brand Assets Extraction

During every crawl, the worker runs extractBrandAssets(page) on the homepage only, before any DOM mutation (nav/footer removal). Results are written to brand_assets for the tenant — but only if primaryColor is currently null (never overwrites manual input).

Extraction strategy

Priority | Source | Notes
1 | CSS custom properties on :root | Handles hex, rgb(), and HSL (239 84% 67%) formats
2 | Computed styles on ALL header/nav buttons + links | Skips transparent ghost buttons automatically
3 | Hero/banner/section background colors | Broader fallback if header scan yields nothing
4 | Google Fonts <link> hrefs | URL ?family=Inter:wght@400 → “Inter”
5 | @font-face names in inline <style> tags | Catches Next.js next/font inlined fonts
6 | Computed font-family on body / h1 | Non-system fonts only

confidence = [primaryColor, secondaryColor, fontPrimary].filter(Boolean).length / 3. Nothing is written if confidence === 0.

Code: packages/agents/src/utils/brand-extractor.ts
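The confidence formula is simple enough to state directly in code. The BrandAssets shape below mirrors the three fields named above but is an assumption about the real type in brand-extractor.ts:

```typescript
// Assumed shape: the three extracted brand fields, null when not found.
interface BrandAssets {
  primaryColor: string | null;
  secondaryColor: string | null;
  fontPrimary: string | null;
}

// confidence = fraction of the three fields that were successfully extracted.
// The worker writes nothing when this is 0.
function brandConfidence(assets: BrandAssets): number {
  return [assets.primaryColor, assets.secondaryColor, assets.fontPrimary]
    .filter(Boolean).length / 3;
}
```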


Signup Integration

Trigger Point

After the tenant record is created in Step 4 (POST /auth/v1/register/complete):

// In apps/api/src/routers/auth.ts — after tenant.create()
if (companyWebsite) {
  // 1. Create Website ConnectedChannel (auto-connects, no auth needed)
  const channel = await channelService.createChannel(
    { type: "Website", title: `${companyName} Website`, url: companyWebsite },
    userId,
    tenantId,
  );

  // 2. Create WebCrawlJob with tenant defaults
  const crawlJob = await db.webCrawlJob.create({
    data: {
      tenantId,
      connectedChannelId: channel.id,
      startUrl: companyWebsite,
      maxPages: 100, // tenant.websiteCrawlMaxPages default
      maxDepth: 3,
      sameOriginOnly: true,
      status: "QUEUED",
      triggeredBy: userId,
    },
  });

  // 3. Enqueue crawl
  await enqueueWebsiteCrawl(crawlJob.id, tenantId, channel.id);
}

Tenant Defaults Set at Creation

await db.tenant.create({
  data: {
    ...tenantFields,
    websiteCrawlMaxPages: 100,
    websiteCrawlFrequency: "MONTHLY",
  },
});

API Routes

All routes require tenant authentication (Authorization: Bearer <token>).

Prefix: /tenant/v1/channels/:id/webcrawl

Method | Path | Description
GET | /status | Active/last crawl job status + progress
POST | /start | Manually trigger a new crawl
GET | /pages | List crawled webpages (paginated, filterable by depth)
GET | /pages/:webPageId | Webpage detail: screenshot, text, associated media
GET | /media | List media assets (filterable by mediaType)
GET | /media/:webMediaId | Media detail including usage count (pages using it)
GET | /statistics | Aggregated stats: pages, media breakdown, compression savings, storage
DELETE | /pages/:webPageId | Delete webpage + remove WebPageMedia joins
DELETE | /media/:webMediaId | Delete media (checks usage count before removing from storage)

Prefix: /manage/v1/tenants/:id/crawl-settings

Method | Path | Description
GET | / | Get tenant crawl config (websiteCrawlMaxPages, websiteCrawlFrequency)
PATCH | / | Update config. Validated: pages 10–500, frequency enum. Updates scheduled tasks if frequency changes.
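The PATCH validation rules (integer pages within 10–500, frequency enum) can be sketched as a plain validator; the function and field names follow the doc, but the error-list shape is an assumption about how the route reports failures:

```typescript
type CrawlSettingsPatch = {
  websiteCrawlMaxPages?: number;
  websiteCrawlFrequency?: string;
};

// Returns a list of validation errors; empty list means the patch is acceptable.
function validateCrawlSettings(patch: CrawlSettingsPatch): string[] {
  const errors: string[] = [];
  if (patch.websiteCrawlMaxPages !== undefined &&
      (!Number.isInteger(patch.websiteCrawlMaxPages) ||
       patch.websiteCrawlMaxPages < 10 || patch.websiteCrawlMaxPages > 500)) {
    errors.push("websiteCrawlMaxPages must be an integer between 10 and 500");
  }
  if (patch.websiteCrawlFrequency !== undefined &&
      !["MANUAL", "WEEKLY", "MONTHLY"].includes(patch.websiteCrawlFrequency)) {
    errors.push("websiteCrawlFrequency must be MANUAL, WEEKLY, or MONTHLY");
  }
  return errors;
}
```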

Channel Details UI (Dashboard)

Route: /channels/:id (Website channel type)

┌─────────────────────────────────────────────────────┐
│ 🌐 Acme Corp Website                     [Re-crawl] │
│ https://acmecorp.com   ● Connected                  │
├─────────────────────────────────────────────────────┤
│ 47 webpages │ 23 images │ 5 videos │ 3 docs         │
│ 12 duplicates skipped │ 2.1 MB saved (45%)          │
│ Last crawled: 2 hours ago │ Next: May 1             │
├─────────────────────────────────────────────────────┤
│ [Webpages] [Media Gallery] [Crawl History]          │
├─────────────────────────────────────────────────────┤
│ Webpages tab:                                       │
│   URL            Title      Depth  Status           │
│   /              Home       0      ✓ Active         │
│   /about         About Us   1      ✓ Active         │
│   /products/...  Products   2      ✓ Active         │
│   ...                                               │
└─────────────────────────────────────────────────────┘

Tabs

Webpages tab — WebPages table, paginated

  • Columns: URL, title, depth, HTTP status, last crawled, actions (View, Delete)
  • Click row → opens webpage detail view

Webpage detail view

  • Screenshot preview
  • Title, URL, meta description, depth, HTTP status
  • Extracted text (first 500 chars + “Show full text”)
  • Media grid (images/videos/docs found on this page via WebPageMedia)

Media Gallery tab — WebMedia grid, filterable

  • Filter buttons: All, Images, Videos, Documents
  • Images: thumbnail grid
  • Videos: thumbnail + play icon (or embed preview)
  • Documents: file icon + filename
  • Each item shows “Used on X pages” badge
  • Click image → lightbox with compression info: Original: 2.4 MB → 1.1 MB (54% saved)

Crawl History tab — WebCrawlJob list

  • Status badge (Queued / Running / Completed / Failed)
  • Start time, duration, pages crawled, media downloaded, duplicates skipped
  • Errors summary if any pages failed

Re-crawl Button

  • Disabled while a crawl is already running
  • POST /tenant/v1/channels/:id/webcrawl/start
  • Shows live progress toast: "Crawled 23/100 pages, 47 images (12 duplicates skipped)"
  • Polls /webcrawl/status every 5 seconds until COMPLETED or FAILED

Manage Portal — Crawl Settings

Route: /tenants/:id/crawl-settings

Website Crawl Settings
──────────────────────

Max Pages per Crawl   [100 ▼]       (10 – 500)
Crawl Frequency       [Monthly ▼]   (Manual / Weekly / Monthly)

Current usage: 47 pages across 1 website channel
Next scheduled crawl: May 1, 2026

[Save Settings]

Changing the frequency takes effect immediately:

  • MANUAL → deletes any existing ScheduledTask
  • WEEKLY → upserts ScheduledTask with cron: "0 0 * * 0"
  • MONTHLY → upserts ScheduledTask with cron: "0 0 1 * *"
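The mapping above can be stated as a small lookup (cronForFrequency is an illustrative helper name; returning null represents the MANUAL case, where any existing ScheduledTask is deleted):

```typescript
type CrawlFrequency = "MANUAL" | "WEEKLY" | "MONTHLY";

// Map a tenant's crawl frequency to the cron expression used for its ScheduledTask.
function cronForFrequency(frequency: CrawlFrequency): string | null {
  switch (frequency) {
    case "WEEKLY":
      return "0 0 * * 0"; // midnight every Sunday
    case "MONTHLY":
      return "0 0 1 * *"; // midnight on the 1st of each month
    case "MANUAL":
      return null;        // no schedule: delete any existing ScheduledTask
  }
}
```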

Scheduled Re-crawl

On crawl completion, the worker creates a ScheduledTask:

{
  type: "website-recrawl",
  channelId: channel.id,
  tenantId: tenantId,
  cron: "0 0 1 * *", // from tenant.websiteCrawlFrequency
  nextRunAt: <next occurrence>,
}

The scheduler worker (existing) checks for due website-recrawl tasks and:

  1. Loads tenant config (websiteCrawlMaxPages, websiteMaxVideoSizeMb)
  2. Creates a new WebCrawlJob
  3. Calls enqueueWebsiteCrawl()
  4. Updates ScheduledTask.nextRunAt

Change Detection on Re-crawl

On each re-crawl, the worker compares the newly computed contentHash against the value stored in the existing WebPages record:

Scenario | Action
Page is new (no DB record) | Insert record, enqueue RAG embed
Hash unchanged | Update lastCrawledAt only — skip RAG re-embed
Hash changed | Update record + lastChangedAt, delete old Qdrant vectors for webPageId, re-enqueue RAG embed
Page returns 4xx/5xx | Mark status = "FAILED", keep last good text in RAG

This avoids redundant Qdrant writes on large sites where most content is stable across monthly crawls. The WebCrawlJob.changedPages counter gives visibility into how much content changed.
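The first three rows of the table reduce to a pure decision function over the stored and freshly computed hashes. A minimal sketch using Node's crypto module (decideRecrawlAction and the action names are illustrative, not the worker's actual API; the 4xx/5xx row is handled before hashing and is omitted here):

```typescript
import { createHash } from "node:crypto";

type RecrawlAction = "insert_and_embed" | "touch_only" | "update_and_reembed";

// SHA-256 of the page's extracted text, as stored in WebPages.contentHash.
function contentHash(extractedText: string): string {
  return createHash("sha256").update(extractedText, "utf8").digest("hex");
}

function decideRecrawlAction(storedHash: string | null, newText: string): RecrawlAction {
  const newHash = contentHash(newText);
  if (storedHash === null) return "insert_and_embed"; // page is new: insert + embed
  if (storedHash === newHash) return "touch_only";    // unchanged: update lastCrawledAt only
  return "update_and_reembed";                        // changed: delete old vectors, re-embed
}
```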


RAG Integration

Three content sources feed the website_content Qdrant dataset:

website-crawler.worker.ts
├─ Page text (new or changed pages only)
│  └─ enqueueRagContent(extractedText, webPageId, tenantId, datasetId)
│     └─ rag-ingestion.worker.ts → chunk → embed → upsert Qdrant
├─ Document text (PDF/DOCX)
│  ├─ pdf-parse or mammoth → raw text string
│  └─ enqueueRagContent(docText, webMediaId, tenantId, datasetId)
│     └─ rag-ingestion.worker.ts → chunk → embed → upsert Qdrant
└─ Re-crawl change detection
   ├─ contentHash SAME → skip enqueueRagContent entirely
   └─ contentHash DIFFERENT → delete old Qdrant vectors for webPageId, re-enqueue with new text

All three paths use the existing rag__ingestion queue and rag-ingestion.worker.ts — no changes required to the RAG system. Agents with access to website_content (all content agents by default) will automatically benefit from crawled text and document content.


Storage Layout (DO Spaces)

{bucket}/
  webpages/
    {tenantId}/
      {channelId}/
        screenshots/
          {webPageId}.jpg      (compressed, JPEG 80%)
  webmedia/
    {tenantId}/
      images/
        {contentHash}.jpg      (compressed, JPEG 80% or WebP)
      videos/
        {contentHash}.mp4
      documents/
        {contentHash}.pdf
        {contentHash}.docx

CDN URLs stored in WebMedia.storageUrl. Content-hash based paths mean the same file uploaded twice always resolves to the same object (natural deduplication at the storage layer too).
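A sketch of a key builder matching this layout, showing why content-hash paths deduplicate naturally (mediaStorageKey is an illustrative name, not the worker's actual helper):

```typescript
import { createHash } from "node:crypto";

// Build a DO Spaces object key from file content: identical bytes always
// produce the same key, so a re-upload overwrites the same object.
function mediaStorageKey(
  tenantId: string,
  kind: "images" | "videos" | "documents",
  fileContent: Uint8Array,
  ext: string,
): string {
  const hash = createHash("sha256").update(fileContent).digest("hex");
  return `webmedia/${tenantId}/${kind}/${hash}.${ext}`;
}
```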


Files Built (April 2026)

File | Status
packages/agents/src/workers/tenant-website-crawler.worker.ts | ✅ Built
packages/agents/src/utils/image-compression.ts | ✅ Built
packages/agents/src/utils/robots-parser.ts | ✅ Built
packages/agents/src/utils/sitemap-parser.ts | ✅ Built
packages/agents/src/utils/document-extractor.ts | ✅ Built
packages/feature-knowledge/src/webcrawl.service.ts | ✅ Built
packages/db/prisma/schema.prisma | ✅ WebPage/WebMedia/WebPageMedia/WebCrawlJob models + Tenant crawl fields pushed
packages/queue/src/queues.ts | enqueueWebsiteCrawl() added
packages/queue/src/types.ts | TenantWebCrawlerJobData + "tenant-web-crawler" AgentRole added
packages/feature-knowledge/src/index.ts | ✅ Exports all webcrawl service functions + DTOs

Still To Build

File | Purpose
apps/api/src/routers/channels.ts | Add /webcrawl/* sub-routes
apps/api/src/routers/manage.ts | Add /crawl-settings routes
apps/dashboard/src/app/(dashboard)/channels/[id]/page.tsx | Channel details page with Statistics + Webpages + Media Gallery + History tabs
apps/dashboard/src/app/(dashboard)/channels/[id]/pages/[webPageId]/page.tsx | Webpage detail view
apps/manage/src/app/(manage)/tenants/[id]/crawl-settings/page.tsx | Manage portal: max pages + frequency settings
Scheduled re-crawl | Create ScheduledTask on completion; scheduler worker picks it up
apps/api/src/routers/auth.ts | Auto-trigger crawl at signup if companyWebsite present

Decisions Log

# | Decision | Detail
1 | Sitemap.xml parsed first | Fetch and parse /sitemap.xml before BFS. Seeds BFS queue with sitemap URLs for better coverage within the page limit. Falls back to pure BFS if 404.
2 | robots.txt respected | Fetch /robots.txt at crawl start. Skip any URL matching Disallow rules for * or Googlebot. Count skips in robotsTxtSkipped.
3 | Document text extracted for RAG | PDFs/DOCX text is extracted via pdf-parse/mammoth and sent to website_content via enqueueRagContent(). First 10,000 chars stored in WebMedia.extractedText for UI display.
4 | Video size limit: 50 MB | Local video downloads capped at websiteMaxVideoSizeMb (default 50 MB, stored on Tenant). Checked via Content-Length header before download. Oversized files counted in videosSkippedOversized.
5 | Change detection on re-crawl | On re-crawl, compare new contentHash to stored. If unchanged: skip RAG re-embed. If changed: delete old Qdrant vectors, re-enqueue with new text, set lastChangedAt.

Open Questions

  1. Plan-tier page limits: Could enforce per-plan limits (Free=50, Starter=100, Pro=300, Enterprise=500) rather than a single tenant-level setting. The manage portal would show the plan limit as a ceiling.

© 2026 Leadmetrics — Internal use only