
BYO personal Ollama

Point a Make_Skills subagent at a model running on your own hardware: a laptop behind a tunnel, Docker Cloud, a small VPS, or an AWS/GCP GPU instance.

By default the ollama provider points at host.docker.internal:11434, which works for self-hosted deployments but not for the hosted site at humancensys.com. Setting OLLAMA_BASE_URL (and optionally OLLAMA_AUTH_HEADER) sends the subagent's traffic to your own endpoint instead.

The per-tenant "register your endpoint" UI on humancensys.com is part of Pillar 0 (tenant abstraction). Until then, set the env var on your own deployment.

What you need

  1. A base URL the api container can reach over HTTPS (e.g. https://my-ollama.example.com).
  2. An auth header to protect the endpoint (e.g. Authorization: Bearer <secret>); a quick check of both is sketched below.
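
A minimal check, assuming the placeholder URL and secret above (a bare Ollama answers GET /api/tags; the auth header is enforced by whatever proxy you put in front):

# Placeholders: substitute your own URL and secret
curl -s https://my-ollama.example.com/api/tags \
  -H "Authorization: Bearer <secret>"
# A JSON list of installed models means the endpoint is reachable and authed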

Deployment options

Option                             | Cost                         | Always-on          | Notes
Tunnel from a laptop               | Free                         | No (laptop sleeps) | Local development, single-user
Docker Cloud / Docker Offload      | Pay-per-second               | On demand          | Per-second container runtime
Small VPS (Hetzner / Fly / Render) | $5–20/mo                     | Yes                | CPU-only models
AWS / GCP GPU instance             | Variable (spot or on-demand) | Yes                | 70B+ weights, multiple users

Option 1 — Tunnel from a laptop

Cloudflare Tunnel provides a free HTTPS endpoint on *.trycloudflare.com.

# install cloudflared (mac: brew install cloudflared, win: winget install Cloudflare.cloudflared)
ollama serve &  # if not already running
cloudflared tunnel --url http://localhost:11434
# → outputs https://random-words-1234.trycloudflare.com

Add bearer-token auth via a Caddy reverse proxy in front of Ollama, or use Cloudflare Access on the tunnel.
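
A minimal sketch of the Caddy option, with placeholder hostname and secret: requests without the expected bearer token get a 401, everything else is proxied to Ollama. With a real domain pointed at the machine, Caddy provisions HTTPS automatically.

# Hypothetical Caddyfile: reject requests missing the bearer token,
# proxy the rest to the local Ollama
cat > Caddyfile <<'EOF'
my-ollama.example.com {
    @unauthorized not header Authorization "Bearer your-long-random-secret"
    respond @unauthorized 401
    reverse_proxy localhost:11434
}
EOF
caddy run --config Caddyfile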

In platform/deploy/.env:

OLLAMA_BASE_URL=https://random-words-1234.trycloudflare.com
OLLAMA_AUTH_HEADER=Bearer your-long-random-secret

When the laptop sleeps, the tunnel goes down and the subagent falls back to the orchestrator's model.


Option 2 — Docker Cloud (Docker Offload)

Docker Offload runs containers in Docker's cloud, billed per-second. The standard ollama/ollama image runs there with a persistent volume for model weights.

# Sketch — exact Docker Cloud syntax may vary; consult docs.docker.com
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
    # Recommended: put a Caddy/Traefik sidecar in front for bearer-token auth
    # so port 11434 is never publicly unauthenticated.

volumes:
  ollama-models: {}

Pull weights once (ollama pull llama3.1:8b), then point Make_Skills at the container's public URL.
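
A sketch of the one-time pull and a smoke test, assuming docker compose manages the service and using placeholder URL and secret (as noted above, the exact Docker Offload invocation may differ):

# One-time pull of model weights into the persistent volume
docker compose exec ollama ollama pull llama3.1:8b

# Smoke test from outside (placeholder URL and secret)
curl -s https://your-offload-host.example.com/api/tags \
  -H "Authorization: Bearer your-long-random-secret"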


Option 3 — AWS / GCP GPU

For 70B+ models or multi-user deployments, run Ollama or vLLM on a g5.xlarge / g6.xlarge (AWS) or g2-standard-4 (GCP) with an HTTPS-fronted endpoint and an auth header.

# AWS — Deep Learning AMI on a g5.xlarge
ssh ec2-user@your-instance
docker run -d --gpus all -p 11434:11434 \
  -v ollama:/root/.ollama \
  --name ollama \
  ollama/ollama:latest

# Front with Caddy or Cloudflare Tunnel for HTTPS + auth

For multi-user setups, put an Application Load Balancer with Cognito auth in front, or use Cloudflare Tunnel with Access groups. Stop the instance when idle (e.g. an EventBridge schedule) to control costs; the commands such a schedule would invoke are sketched below.
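
# Placeholder instance ID; wire these into an EventBridge schedule
# or plain cron to stop the box off-hours and start it on demand
aws ec2 stop-instances --instance-ids i-0abc123def456
aws ec2 start-instances --instance-ids i-0abc123def456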

vLLM is an alternative to Ollama on GPU hosts: it serves an OpenAI-compatible API and batches concurrent requests for higher throughput.
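
A minimal vLLM sketch, assuming the official vllm/vllm-openai image and a placeholder model; the OpenAI-compatible API comes up on port 8000 under /v1:

docker run -d --gpus all -p 8000:8000 \
  -v hf-cache:/root/.cache/huggingface \
  --name vllm \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct
# Front with Caddy or Cloudflare Tunnel for HTTPS + auth, as above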


Configuring the subagent

With a URL and token, point a subagent at the endpoint:

# subagents/researcher/deepagents.toml
[model]
provider = "ollama"
name = "llama3.1:8b"

Restart the api container so it picks up the new settings.
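
A sketch, assuming the stack is driven by docker compose from platform/deploy (where the .env above lives):

cd platform/deploy
docker compose restart api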

For OpenAI-compatible endpoints (vLLM, LM Studio, llama.cpp server):

[model]
provider = "openai"
name = "your-model-name"
base_url = "https://your-endpoint.example.com/v1"

The openai provider in langchain-openai honors base_url.
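
Before wiring it in, a quick sanity check that the endpoint speaks the OpenAI API (placeholder URL and token; vLLM and similar servers answer GET /v1/models):

curl -s https://your-endpoint.example.com/v1/models \
  -H "Authorization: Bearer your-token"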

Two-mode notes

Mode               | URL configuration
Self-host          | OLLAMA_BASE_URL + OLLAMA_AUTH_HEADER in platform/deploy/.env.
Hosted-multitenant | Per-tenant; registered via a planned UI, stored encrypted in tenant_model_endpoints. Tracked in Pillar 0.

Security checklist

  • Endpoint requires an auth header. Public Ollama URLs without auth are scraped quickly.
  • Rotate the auth secret if it has been shared in tunnel URLs or logs.
  • Rate-limit at the proxy.
  • Keep the auth header out of logs. model_registry.py passes it as a header, not a URL parameter.

See also