Ollama
Category: AI / LLM
Adapter: OllamaAdapter in packages/agent-core/src/adapters/ollama.ts
External SDK: axios (Ollama REST API)
Purpose
Ollama runs open-source LLMs locally on the platform server. It is used for:
- Research agents — Topic Researcher, Research Note Writer — where privacy matters or cost must be zero
- Data-local tasks — operations on sensitive client data that should never leave the server
- High-volume low-stakes tasks — bulk operations where per-token cost would be prohibitive
- On-premise enterprise — customers who deploy Leadmetrics on-site can run all agents on Ollama with no external API dependency
Ollama runs as a sidecar service in Docker Compose alongside the main platform services.
Models Used
| Agent | Model | Parameters | Notes |
|---|---|---|---|
| Topic Researcher | llama3.1:8b | 8B | Fast, good at ideation |
| Research Note Writer | llama3.1:8b | 8B | Good at structured extraction |
| On-premise (all agents) | llama3.1:70b | 70B | Enterprise quality; requires significant GPU |
Models are pulled on first use via ollama pull <model> and cached locally.
Config Structure
Platform config (env vars)
OLLAMA_BASE_URL=http://ollama:11434 # Docker service name in Compose network
OLLAMA_DEFAULT_MODEL=llama3.1:8b
OLLAMA_TIMEOUT_MS=120000 # 2 minutes — local inference can be slow on CPU
Per-tenant override (Enterprise only)
Enterprise tenants running on-premise can point to their own Ollama instance:
interface OllamaIntegrationConfig {
baseUrl: string; // e.g. "http://internal-ollama.acme.com:11434"
model?: string; // Override default model
}
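As a sketch of how these two layers could be combined when constructing the adapter (the resolveOllamaAdapter helper and fallback values are illustrative; only the env var names come from the config above):
// Hypothetical helper: merge platform env defaults with an optional
// per-tenant override (Enterprise on-premise) before constructing the adapter.
function resolveOllamaAdapter(tenantConfig?: OllamaIntegrationConfig): OllamaAdapter {
  const baseUrl = tenantConfig?.baseUrl ?? process.env.OLLAMA_BASE_URL ?? 'http://ollama:11434';
  const model = tenantConfig?.model ?? process.env.OLLAMA_DEFAULT_MODEL ?? 'llama3.1:8b';
  const timeout = Number(process.env.OLLAMA_TIMEOUT_MS ?? 120_000);
  return new OllamaAdapter(baseUrl, model, timeout);
}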
Integration Pattern
Adapter class (packages/agent-core/src/adapters/ollama.ts)
Ollama’s /api/chat endpoint returns NDJSON — similar to Claude Code CLI’s output format:
import axios from 'axios';
class OllamaAdapter implements LLMAdapter {
private history: OllamaMessage[] = [];
constructor(
private baseUrl: string,
private model: string,
private timeout: number,
) {}
async *run(systemPrompt: string, userPrompt: string): AsyncGenerator<LLMEvent> {
this.history.push({ role: 'user', content: userPrompt });
const response = await axios.post(
`${this.baseUrl}/api/chat`,
{
model: this.model,
messages: [
{ role: 'system', content: systemPrompt },
...this.history,
],
stream: true,
},
{
responseType: 'stream',
timeout: this.timeout,
},
);
let assistantContent = '';
for await (const chunk of readNDJSON(response.data)) {
if (chunk.done) {
// Final chunk — contains token counts
yield {
type: 'usage',
inputTokens: chunk.prompt_eval_count,
outputTokens: chunk.eval_count,
};
break;
}
const text = chunk.message?.content ?? '';
assistantContent += text;
yield { type: 'content', text };
}
this.history.push({ role: 'assistant', content: assistantContent });
}
}
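The adapter depends on a readNDJSON helper that is not shown above. A minimal sketch, assuming axios hands back a Node Readable when responseType is 'stream' and that Ollama emits one JSON object per line:
import { Readable } from 'node:stream';
// Parse an NDJSON stream: buffer partial lines, yield one parsed object per
// complete line.
async function* readNDJSON(stream: Readable): AsyncGenerator<any> {
  let buffer = '';
  for await (const chunk of stream) {
    buffer += chunk.toString();
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the trailing partial line for the next chunk
    for (const line of lines) {
      if (line.trim()) yield JSON.parse(line);
    }
  }
  if (buffer.trim()) yield JSON.parse(buffer); // flush a final line without a newline
}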
Tool use limitation
Ollama models have limited tool use compared to Claude or GPT-4o. The Ollama adapter does not implement web_search or rag_search tool dispatch. Research agents running on Ollama receive pre-fetched context in the prompt instead:
// In research-agent worker: pre-fetch context before Ollama call
const webResults = await webSearch(topic, { maxResults: 5 });
const prompt = assembleResearchPrompt(topic, webResults);
// Pass entire context in user prompt — no tool loop needed
let output = '';
for await (const event of ollamaAdapter.run(systemPrompt, prompt)) {
  if (event.type === 'content') output += event.text;
}
History management
Like the OpenAI adapter, Ollama history is managed manually by the adapter and stored in activities.llm_history for cross-activity continuity.
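A rough sketch of that persistence, assuming Postgres via pg, a jsonb llm_history column on activities, and hypothetical exportHistory()/importHistory() accessors (the class above keeps history private, so the real wiring may differ):
import { Pool } from 'pg';
const db = new Pool(); // connection settings come from PG* env vars
// Persist the adapter's message history after a run.
async function saveHistory(activityId: string, adapter: OllamaAdapter): Promise<void> {
  await db.query(
    'UPDATE activities SET llm_history = $1 WHERE id = $2',
    [JSON.stringify(adapter.exportHistory()), activityId],
  );
}
// Restore history before the next activity so the conversation continues.
async function restoreHistory(activityId: string, adapter: OllamaAdapter): Promise<void> {
  const { rows } = await db.query('SELECT llm_history FROM activities WHERE id = $1', [activityId]);
  const raw = rows[0]?.llm_history;
  adapter.importHistory(typeof raw === 'string' ? JSON.parse(raw) : raw ?? []);
}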
Docker Compose Setup
# docker-compose.yml
services:
ollama:
image: ollama/ollama:latest
volumes:
- ollama_data:/root/.ollama # Model cache persists across restarts
ports:
- "11434:11434"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu] # Remove if CPU-only
volumes:
ollama_data:
First-run model pull
Models must be pulled before first use. Add to the platform startup script:
# startup.sh
ollama pull llama3.1:8b
ollama pull llama3.1:70b # Enterprise only
Or trigger via the Ollama REST API:
POST http://ollama:11434/api/pull
{ "name": "llama3.1:8b" }Performance Characteristics
Performance Characteristics
| Config | Throughput | Latency (8B) | Notes |
|---|---|---|---|
| CPU only (8 cores) | ~5 tok/s | ~30s for 150 tokens | Acceptable for background batch |
| 1× NVIDIA RTX 3090 | ~80 tok/s | ~2s for 150 tokens | Good for production |
| 1× NVIDIA A100 | ~150 tok/s | ~1s for 150 tokens | Recommended for 70B models |
Research agents run in background BullMQ queues with low priority, so CPU-mode latency is acceptable for most deployments.
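A sketch of that enqueueing, with illustrative queue, job, and payload names; in BullMQ a larger priority number means the job is processed later:
import { Queue } from 'bullmq';
// Research jobs go on a dedicated queue; a high priority number keeps
// Ollama-backed research work behind interactive jobs.
const researchQueue = new Queue('research-agent', {
  connection: { host: process.env.REDIS_HOST ?? 'redis', port: 6379 },
});
await researchQueue.add(
  'topic-research',
  { topic: 'industrial IoT lead scoring', tenantId: 'acme' },
  { priority: 10, attempts: 2 }, // priority 1 is highest; 10 keeps this job low priority
);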
Cost Model
Ollama runs on the platform’s own hardware — there is no per-token cost. The cost is the infrastructure: server, GPU, electricity. In the credit system, Ollama-backed activities consume a flat rate of 0.2 credits regardless of token count, since marginal cost is effectively zero.
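A sketch of how that flat rate could be applied; the helper names and provider keys are assumptions, and only the 0.2-credit figure comes from this section:
type TokenUsage = { inputTokens: number; outputTokens: number };
const OLLAMA_FLAT_CREDITS = 0.2;
// Pricing tables for token-metered providers live elsewhere; stubbed here.
declare function meteredCredits(provider: 'claude' | 'openai', usage: TokenUsage): number;
// Ollama ignores token counts: the flat rate reflects near-zero marginal cost.
function creditsForActivity(provider: 'ollama' | 'claude' | 'openai', usage: TokenUsage): number {
  return provider === 'ollama' ? OLLAMA_FLAT_CREDITS : meteredCredits(provider, usage);
}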
Test Cases
Unit tests (packages/agent-core/src/adapters/ollama.test.ts)
| Test | Approach |
|---|---|
| Streams content from NDJSON chunks | Mock axios.post to return NDJSON stream; assert content events |
| Emits usage event on done: true chunk | Assert token counts from final chunk |
| Appends assistant reply to history | Assert history updated after run |
| Timeout throws ECONNABORTED | Mock axios timeout; assert error propagated |
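A sketch of the first two unit tests, assuming vitest, that OllamaAdapter is exported from the adapter module, and a test-only NDJSON stream factory:
import { Readable } from 'node:stream';
import { describe, expect, it, vi } from 'vitest';
import axios from 'axios';
import { OllamaAdapter } from './ollama';

vi.mock('axios');

// Build a fake NDJSON body: one JSON object per line, like Ollama's /api/chat.
function ndjsonStream(chunks: object[]): Readable {
  return Readable.from(chunks.map((c) => JSON.stringify(c) + '\n'));
}

describe('OllamaAdapter', () => {
  it('streams content events and emits usage on the final chunk', async () => {
    vi.mocked(axios.post).mockResolvedValue({
      data: ndjsonStream([
        { message: { content: 'Hello' }, done: false },
        { message: { content: ' world' }, done: false },
        { done: true, prompt_eval_count: 12, eval_count: 5 },
      ]),
    });

    const adapter = new OllamaAdapter('http://localhost:11434', 'llama3.1:8b', 120_000);
    const events: any[] = [];
    for await (const event of adapter.run('system prompt', 'user prompt')) {
      events.push(event);
    }

    const text = events.filter((e) => e.type === 'content').map((e) => e.text).join('');
    expect(text).toBe('Hello world');
    expect(events.at(-1)).toEqual({ type: 'usage', inputTokens: 12, outputTokens: 5 });
  });
});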
Integration tests
| Test | Approach |
|---|---|
| Full run against local Ollama | Requires Ollama running at localhost:11434 with llama3.1:8b pulled; assert content |
| Model not found returns 404 | Request non-existent model; assert error |
Related
- Adapter — Ollama — full adapter detail
- Claude Provider — primary LLM
- OpenAI Provider — alternative LLM
- Agent Execution Engine — adapter selection logic
- Infrastructure — Docker Compose, GPU setup