
Ollama

Category: AI / LLM
Adapter: OllamaAdapter in packages/agent-core/src/adapters/ollama.ts
External SDK: axios (Ollama REST API)


Purpose

Ollama runs open-source LLMs locally on the platform server. It is used for:

  1. Research agents — Topic Researcher, Research Note Writer — where privacy matters or cost must be zero
  2. Data-local tasks — operations on sensitive client data that should never leave the server
  3. High-volume low-stakes tasks — bulk operations where per-token cost would be prohibitive
  4. On-premise enterprise — customers who deploy Leadmetrics on-site can run all agents on Ollama with no external API dependency

Ollama runs as a sidecar service in Docker Compose alongside the main platform services.


Models Used

| Agent | Model | Parameters | Notes |
|---|---|---|---|
| Topic Researcher | llama3.1:8b | 8B | Fast, good at ideation |
| Research Note Writer | llama3.1:8b | 8B | Good at structured extraction |
| On-premise (all agents) | llama3.1:70b | 70B | Enterprise quality; requires significant GPU |

Models are pulled once via ollama pull <model> and cached locally (see First-run model pull below).


Config Structure

Platform config (env vars)

```bash
OLLAMA_BASE_URL=http://ollama:11434   # Docker service name in Compose network
OLLAMA_DEFAULT_MODEL=llama3.1:8b
OLLAMA_TIMEOUT_MS=120000              # 2 minutes — local inference can be slow on CPU
```
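
For illustration, a minimal sketch of how these variables could be mapped onto the adapter's constructor arguments; loadOllamaConfig is a hypothetical helper, not the platform's actual config loader:

```typescript
// Hypothetical helper: reads the env vars documented above, with matching defaults.
function loadOllamaConfig() {
  return {
    baseUrl: process.env.OLLAMA_BASE_URL ?? 'http://ollama:11434',
    model: process.env.OLLAMA_DEFAULT_MODEL ?? 'llama3.1:8b',
    timeoutMs: Number(process.env.OLLAMA_TIMEOUT_MS ?? 120_000),
  };
}

const { baseUrl, model, timeoutMs } = loadOllamaConfig();
const ollamaAdapter = new OllamaAdapter(baseUrl, model, timeoutMs);
```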

Per-tenant override (Enterprise only)

Enterprise tenants running on-premise can point to their own Ollama instance:

```typescript
interface OllamaIntegrationConfig {
  baseUrl: string;   // e.g. "http://internal-ollama.acme.com:11434"
  model?: string;    // Override default model
}
```
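
A sketch of how the override could be resolved against the platform defaults; adapterForTenant is illustrative, only the config shapes come from this page:

```typescript
// Illustrative precedence: tenant override first, platform defaults as fallback.
function adapterForTenant(
  defaults: { baseUrl: string; model: string; timeoutMs: number },
  override?: OllamaIntegrationConfig,
): OllamaAdapter {
  return new OllamaAdapter(
    override?.baseUrl ?? defaults.baseUrl,
    override?.model ?? defaults.model,
    defaults.timeoutMs,
  );
}
```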

Integration Pattern

Adapter class (packages/agent-core/src/adapters/ollama.ts)

Ollama’s /api/chat endpoint streams NDJSON (one JSON object per line) — similar to Claude Code CLI’s output format:

```typescript
import axios from 'axios';

class OllamaAdapter implements LLMAdapter {
  private history: OllamaMessage[] = [];

  constructor(
    private baseUrl: string,
    private model: string,
    private timeout: number,
  ) {}

  async *run(systemPrompt: string, userPrompt: string): AsyncGenerator<LLMEvent> {
    this.history.push({ role: 'user', content: userPrompt });

    const response = await axios.post(
      `${this.baseUrl}/api/chat`,
      {
        model: this.model,
        messages: [
          { role: 'system', content: systemPrompt },
          ...this.history,
        ],
        stream: true,
      },
      { responseType: 'stream', timeout: this.timeout },
    );

    let assistantContent = '';
    for await (const chunk of readNDJSON(response.data)) {
      if (chunk.done) {
        // Final chunk — contains token counts
        yield {
          type: 'usage',
          inputTokens: chunk.prompt_eval_count,
          outputTokens: chunk.eval_count,
        };
        break;
      }
      const text = chunk.message?.content ?? '';
      assistantContent += text;
      yield { type: 'content', text };
    }

    this.history.push({ role: 'assistant', content: assistantContent });
  }
}
```
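
The readNDJSON helper is referenced above but not shown. A minimal sketch, assuming a Node readable stream with one JSON object per line; the actual helper in agent-core may differ:

```typescript
import { Readable } from 'node:stream';

// Splits the byte stream on newlines and yields each non-empty line as parsed JSON.
async function* readNDJSON(stream: Readable): AsyncGenerator<any> {
  let buffer = '';
  for await (const chunk of stream) {
    buffer += chunk.toString('utf8');
    let newline: number;
    while ((newline = buffer.indexOf('\n')) !== -1) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line) yield JSON.parse(line);
    }
  }
  if (buffer.trim()) yield JSON.parse(buffer.trim());
}
```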

Tool use limitation

Ollama models have limited tool use compared to Claude or GPT-4o. The Ollama adapter does not implement web_search or rag_search tool dispatch. Research agents running on Ollama receive pre-fetched context in the prompt instead:

```typescript
// In the research-agent worker: pre-fetch context before the Ollama call
const webResults = await webSearch(topic, { maxResults: 5 });
const prompt = assembleResearchPrompt(topic, webResults);

// Pass entire context in user prompt — no tool loop needed.
// run() is an async generator, so collect the streamed content events.
let output = '';
for await (const event of ollamaAdapter.run(systemPrompt, prompt)) {
  if (event.type === 'content') output += event.text;
}
```

History management

Like the OpenAI adapter, Ollama history is managed manually by the adapter and stored in activities.llm_history for cross-activity continuity.
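
A rough sketch of the continuity flow. The getHistory()/setHistory() accessors and activitiesRepo are hypothetical; they are not part of the adapter shown above:

```typescript
// Illustrative only: restore prior history, run the activity, persist the updated history.
// `ollamaAdapter` is an OllamaAdapter instance; the accessors below are assumed, not real.
async function runWithContinuity(activityId: string, systemPrompt: string, userPrompt: string) {
  const prior = await activitiesRepo.getLlmHistory(activityId); // reads activities.llm_history
  if (prior) ollamaAdapter.setHistory(prior);

  for await (const event of ollamaAdapter.run(systemPrompt, userPrompt)) {
    // forward content/usage events to the worker as usual
  }

  await activitiesRepo.saveLlmHistory(activityId, ollamaAdapter.getHistory());
}
```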


Docker Compose Setup

```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama   # Model cache persists across restarts
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]   # Remove if CPU-only
volumes:
  ollama_data:
```

First-run model pull

Models must be pulled before first use. Add to the platform startup script:

```bash
# startup.sh
ollama pull llama3.1:8b
ollama pull llama3.1:70b   # Enterprise only
```

Or trigger via the Ollama REST API:

```
POST http://ollama:11434/api/pull
{ "name": "llama3.1:8b" }
```
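
A startup check could also trigger the pull from TypeScript. ensureModelPulled is illustrative; stream: false asks Ollama to respond once the pull completes rather than streaming progress lines:

```typescript
import axios from 'axios';

// Illustrative startup check: pull the model if it is not already cached.
async function ensureModelPulled(model: string): Promise<void> {
  const baseUrl = process.env.OLLAMA_BASE_URL ?? 'http://ollama:11434';
  await axios.post(
    `${baseUrl}/api/pull`,
    { name: model, stream: false },  // wait for the pull to finish
    { timeout: 0 },                  // model downloads can take far longer than inference
  );
}
```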

Performance Characteristics

| Config | Throughput | Latency (8B) | Notes |
|---|---|---|---|
| CPU only (8 cores) | ~5 tok/s | ~30s for 150 tokens | Acceptable for background batch |
| 1× NVIDIA RTX 3090 | ~80 tok/s | ~2s for 150 tokens | Good for production |
| 1× NVIDIA A100 | ~150 tok/s | ~1s for 150 tokens | Recommended for 70B models |

Research agents run in background BullMQ queues with low priority, so CPU-mode latency is acceptable for most deployments.


Cost Model

Ollama runs on the platform’s own hardware — there is no per-token cost. The cost is the infrastructure: server, GPU, electricity. In the credit system, Ollama-backed activities consume a flat rate of 0.2 credits regardless of token count, since marginal cost is effectively zero.
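
As a sketch of that rule (the function name and the per-token rates for other providers are placeholders, not the actual billing code):

```typescript
// Placeholder per-token rates for externally hosted providers (illustrative only).
const INPUT_RATE = 0.00001;
const OUTPUT_RATE = 0.00003;
const OLLAMA_FLAT_CREDITS = 0.2;   // flat rate from the cost model above

function creditsForActivity(provider: string, inputTokens: number, outputTokens: number): number {
  if (provider === 'ollama') {
    return OLLAMA_FLAT_CREDITS;    // token counts ignored; marginal cost is effectively zero
  }
  return inputTokens * INPUT_RATE + outputTokens * OUTPUT_RATE;
}
```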


Test Cases

Unit tests (packages/agent-core/src/adapters/ollama.test.ts)

| Test | Approach |
|---|---|
| Streams content from NDJSON chunks | Mock axios.post to return an NDJSON stream; assert content events |
| Emits usage event on done: true chunk | Assert token counts from final chunk |
| Appends assistant reply to history | Assert history updated after run |
| Timeout throws ECONNABORTED | Mock axios timeout; assert error propagated |

Integration tests

| Test | Approach |
|---|---|
| Full run against local Ollama | Requires Ollama running at localhost:11434 with llama3.1:8b pulled; assert content |
| Model not found returns 404 | Request a non-existent model; assert error |

© 2026 Leadmetrics — Internal use only