Ollama
Category: AI / LLM
Adapter: OllamaAdapter in packages/agent-core/src/adapters/ollama.ts
External SDK: axios (Ollama REST API)
Purpose
Ollama runs open-source LLMs locally on the platform server. It is used for:
- Research agents — Topic Researcher, Research Note Writer — where privacy matters or cost must be zero
- Data-local tasks — operations on sensitive client data that should never leave the server
- High-volume low-stakes tasks — bulk operations where per-token cost would be prohibitive
- On-premise enterprise — customers who deploy Leadmetrics on-site can run all agents on Ollama with no external API dependency
Ollama runs as a sidecar service in Docker Compose alongside the main platform services.
Models Used
| Agent | Model | Parameters | Notes |
|---|---|---|---|
| Topic Researcher | llama3.1:8b | 8B | Fast, good at ideation |
| Research Note Writer | llama3.1:8b | 8B | Good at structured extraction |
| On-premise (all agents) | llama3.1:70b | 70B | Enterprise quality; requires significant GPU |
Models are pulled on first use via ollama pull <model> and cached locally.
Config Structure
Platform config (env vars)
OLLAMA_BASE_URL=http://ollama:11434 # Docker service name in Compose network
OLLAMA_DEFAULT_MODEL=llama3.1:8b
OLLAMA_TIMEOUT_MS=120000 # 2 minutes — local inference can be slow on CPU
Per-tenant override (Enterprise only)
Enterprise tenants running on-premise can point to their own Ollama instance:
interface OllamaIntegrationConfig {
baseUrl: string; // e.g. "http://internal-ollama.acme.com:11434"
model?: string; // Override default model
}
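As a sketch of how these two layers could be combined when constructing the adapter (the resolveOllamaAdapter helper and fallback values are illustrative; only the env var names come from the config above):
// Hypothetical helper: merge platform env defaults with an optional
// per-tenant override (Enterprise on-premise) before constructing the adapter.
function resolveOllamaAdapter(tenantConfig?: OllamaIntegrationConfig): OllamaAdapter {
  const baseUrl = tenantConfig?.baseUrl ?? process.env.OLLAMA_BASE_URL ?? 'http://ollama:11434';
  const model = tenantConfig?.model ?? process.env.OLLAMA_DEFAULT_MODEL ?? 'llama3.1:8b';
  const timeout = Number(process.env.OLLAMA_TIMEOUT_MS ?? 120_000);
  return new OllamaAdapter(baseUrl, model, timeout);
}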
Integration Pattern
Adapter class (packages/agent-core/src/adapters/ollama.ts)
Ollama’s /api/chat endpoint returns NDJSON — similar to Claude Code CLI’s output format:
import axios from 'axios';
class OllamaAdapter implements LLMAdapter {
private history: OllamaMessage[] = [];
constructor(
private baseUrl: string,
private model: string,
private timeout: number,
) {}
async *run(systemPrompt: string, userPrompt: string): AsyncGenerator<LLMEvent> {
this.history.push({ role: 'user', content: userPrompt });
const response = await axios.post(
`${this.baseUrl}/api/chat`,
{
model: this.model,
messages: [
{ role: 'system', content: systemPrompt },
...this.history,
],
stream: true,
},
{
responseType: 'stream',
timeout: this.timeout,
},
);
let assistantContent = '';
for await (const chunk of readNDJSON(response.data)) {
if (chunk.done) {
// Final chunk — contains token counts
yield {
type: 'usage',
inputTokens: chunk.prompt_eval_count,
outputTokens: chunk.eval_count,
};
break;
}
const text = chunk.message?.content ?? '';
assistantContent += text;
yield { type: 'content', text };
}
this.history.push({ role: 'assistant', content: assistantContent });
}
}
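The adapter depends on a readNDJSON helper that is not shown above. A minimal sketch, assuming axios hands back a Node Readable when responseType is 'stream' and that Ollama emits one JSON object per line:
import { Readable } from 'node:stream';
// Parse an NDJSON stream: buffer partial lines, yield one parsed object per
// complete line.
async function* readNDJSON(stream: Readable): AsyncGenerator<any> {
  let buffer = '';
  for await (const chunk of stream) {
    buffer += chunk.toString();
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the trailing partial line for the next chunk
    for (const line of lines) {
      if (line.trim()) yield JSON.parse(line);
    }
  }
  if (buffer.trim()) yield JSON.parse(buffer); // flush a final line without a newline
}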
Tool use limitation
Ollama models have limited tool use compared to Claude or GPT-4o. The Ollama adapter does not implement web_search or rag_search tool dispatch. Research agents running on Ollama receive pre-fetched context in the prompt instead:
// In research-agent worker: pre-fetch context before Ollama call
const webResults = await webSearch(topic, { maxResults: 5 });
const prompt = assembleResearchPrompt(topic, webResults);
// Pass entire context in user prompt — no tool loop needed
let output = '';
for await (const event of ollamaAdapter.run(systemPrompt, prompt)) {
  if (event.type === 'content') output += event.text;
}
History management
Like the OpenAI adapter, Ollama history is managed manually by the adapter and stored in activities.llm_history for cross-activity continuity.
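A rough sketch of that persistence, assuming Postgres via pg, a jsonb llm_history column on activities, and hypothetical exportHistory()/importHistory() accessors (the class above keeps history private, so the real wiring may differ):
import { Pool } from 'pg';
const db = new Pool(); // connection settings come from PG* env vars
// Persist the adapter's message history after a run.
async function saveHistory(activityId: string, adapter: OllamaAdapter): Promise<void> {
  await db.query(
    'UPDATE activities SET llm_history = $1 WHERE id = $2',
    [JSON.stringify(adapter.exportHistory()), activityId],
  );
}
// Restore history before the next activity so the conversation continues.
async function restoreHistory(activityId: string, adapter: OllamaAdapter): Promise<void> {
  const { rows } = await db.query('SELECT llm_history FROM activities WHERE id = $1', [activityId]);
  const raw = rows[0]?.llm_history;
  adapter.importHistory(typeof raw === 'string' ? JSON.parse(raw) : raw ?? []);
}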
Docker Compose Setup
# docker-compose.yml
services:
ollama:
image: ollama/ollama:latest
volumes:
- ollama_data:/root/.ollama # Model cache persists across restarts
ports:
- "11434:11434"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu] # Remove if CPU-only
volumes:
ollama_data:
First-run model pull
Models must be pulled before first use. Add to the platform startup script:
# startup.sh
ollama pull llama3.1:8b
ollama pull llama3.1:70b # Enterprise only
Or trigger via the Ollama REST API:
POST http://ollama:11434/api/pull
{ "name": "llama3.1:8b" }Performance Characteristics
Performance Characteristics
| Config | Throughput | Latency (8B) | Notes |
|---|---|---|---|
| CPU only (8 cores) | ~5 tok/s | ~30s for 150 tokens | Acceptable for background batch |
| 1× NVIDIA RTX 3090 | ~80 tok/s | ~2s for 150 tokens | Good for production |
| 1× NVIDIA A100 | ~150 tok/s | ~1s for 150 tokens | Recommended for 70B models |
Research agents run in background BullMQ queues with low priority, so CPU-mode latency is acceptable for most deployments.
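A sketch of that enqueueing, with illustrative queue, job, and payload names; in BullMQ a larger priority number means the job is processed later:
import { Queue } from 'bullmq';
// Research jobs go on a dedicated queue; a high priority number keeps
// Ollama-backed research work behind interactive jobs.
const researchQueue = new Queue('research-agent', {
  connection: { host: process.env.REDIS_HOST ?? 'redis', port: 6379 },
});
await researchQueue.add(
  'topic-research',
  { topic: 'industrial IoT lead scoring', tenantId: 'acme' },
  { priority: 10, attempts: 2 }, // priority 1 is highest; 10 keeps this job low priority
);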
Cost Model
Ollama runs on the platform’s own hardware — there is no per-token cost. The cost is the infrastructure: server, GPU, electricity. In the credit system, Ollama-backed activities consume a flat rate of 0.2 credits regardless of token count, since marginal cost is effectively zero.
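A sketch of how that flat rate could be applied; the helper names and provider keys are assumptions, and only the 0.2-credit figure comes from this section:
type TokenUsage = { inputTokens: number; outputTokens: number };
const OLLAMA_FLAT_CREDITS = 0.2;
// Pricing tables for token-metered providers live elsewhere; stubbed here.
declare function meteredCredits(provider: 'claude' | 'openai', usage: TokenUsage): number;
// Ollama ignores token counts: the flat rate reflects near-zero marginal cost.
function creditsForActivity(provider: 'ollama' | 'claude' | 'openai', usage: TokenUsage): number {
  return provider === 'ollama' ? OLLAMA_FLAT_CREDITS : meteredCredits(provider, usage);
}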
Test Cases
Unit tests (packages/agent-core/src/adapters/ollama.test.ts)
| Test | Approach |
|---|---|
| Streams content from NDJSON chunks | Mock axios.post to return NDJSON stream; assert content events |
| Emits usage event on done: true chunk | Assert token counts from final chunk |
| Appends assistant reply to history | Assert history updated after run |
| Timeout throws ECONNABORTED | Mock axios timeout; assert error propagated |
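A sketch of the first two unit tests, assuming vitest, that OllamaAdapter is exported from the adapter module, and a test-only NDJSON stream factory:
import { Readable } from 'node:stream';
import { describe, expect, it, vi } from 'vitest';
import axios from 'axios';
import { OllamaAdapter } from './ollama';

vi.mock('axios');

// Build a fake NDJSON body: one JSON object per line, like Ollama's /api/chat.
function ndjsonStream(chunks: object[]): Readable {
  return Readable.from(chunks.map((c) => JSON.stringify(c) + '\n'));
}

describe('OllamaAdapter', () => {
  it('streams content events and emits usage on the final chunk', async () => {
    vi.mocked(axios.post).mockResolvedValue({
      data: ndjsonStream([
        { message: { content: 'Hello' }, done: false },
        { message: { content: ' world' }, done: false },
        { done: true, prompt_eval_count: 12, eval_count: 5 },
      ]),
    });

    const adapter = new OllamaAdapter('http://localhost:11434', 'llama3.1:8b', 120_000);
    const events: any[] = [];
    for await (const event of adapter.run('system prompt', 'user prompt')) {
      events.push(event);
    }

    const text = events.filter((e) => e.type === 'content').map((e) => e.text).join('');
    expect(text).toBe('Hello world');
    expect(events.at(-1)).toEqual({ type: 'usage', inputTokens: 12, outputTokens: 5 });
  });
});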
Integration tests
| Test | Approach |
|---|---|
| Full run against local Ollama | Requires Ollama running at localhost:11434 with llama3.1:8b pulled; assert content |
| Model not found returns 404 | Request non-existent model; assert error |
Related
- Adapter — Ollama — full adapter detail
- Claude Provider — primary LLM
- OpenAI Provider — alternative LLM
- Agent Execution Engine — adapter selection logic
- Infrastructure — Docker Compose, GPU setup