The conversation about AI agents has shifted. A year ago, people were debating whether agents were real. Today, the debate is about architecture: how to run them well, how to run them safely, and how to run them at a cost that makes sense.
I’ve spent considerable time standing up agent environments from scratch, and I keep seeing the same mistakes: teams pick a great model but pair it with a fragile setup, or obsess over capabilities while ignoring the infrastructure holding everything together. The environment matters as much as the model itself.
This is the guide I wish existed when I started. We’ll cover the platform landscape, walk through a production-grade deployment, and dig into why running a local open-source model is no longer a compromise; it’s often the right engineering call.
The Platform Landscape: Who’s Building What
Let’s be clear about what we’re choosing between. The “AI agent” label gets applied to everything from simple chatbots to multi-agent orchestration platforms, and conflating them wastes your time.
OpenClaw
With 320,000+ GitHub stars and Apache 2.0 licensing, OpenClaw has become the de facto open-source standard for persistent agent deployments. It’s not a framework you build on; it’s a running gateway daemon that handles sessions, channels, tool routing, and subagent lifecycle out of the box.
The architecture looks like this:
┌─────────────────────────────────────────────┐
│              OpenClaw Gateway               │
│              (127.0.0.1:18789)              │
│                                             │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  │
│  │ Channels │  │  Agents  │  │   Tools   │  │
│  │ (WA/TG/…)│  │ Sessions │  │  Runtime  │  │
│  └──────────┘  └──────────┘  └───────────┘  │
│                                             │
│  ┌───────────────────────────────────────┐  │
│  │            Model Router               │  │
│  │  anthropic/*  →  Anthropic API        │  │
│  │  ollama/*     →  localhost:11434      │  │
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘
The gateway persists across sessions, maintains memory, and routes between model providers based on configuration. It connects to WhatsApp, Telegram, Discord, Signal, iMessage, and more; your agent lives wherever you already communicate.
Best for: General-purpose deployments, teams that want community support, anyone building on top of an ecosystem rather than from scratch.
Hermes Agent
Hermes takes a different philosophy: deep, multi-tier memory and aggressive cost optimization. It uses three memory layers (session, persistent, and skill memory) and routes tasks across 200+ models via OpenRouter. The self-improving skill system is genuinely novel: the agent identifies capability gaps and generates new skills to fill them.
Best for: Solo founders and power users who care about long-term agent learning and flexible model routing.
NanoClaw
Roughly 500 lines of TypeScript. Zero config files. Five-minute setup. OS-level container isolation (Docker on Linux, Apple Container on macOS). What it trades away is model flexibility: it’s tightly coupled to Anthropic’s Claude stack.
Best for: Security-conscious small teams who need a fast, auditable deployment with messaging platform support.
CrewAI / LangGraph
These aren’t agents; they’re frameworks for building multi-agent systems. If you need a researcher, writer, and editor working together in a coordinated pipeline, CrewAI’s role-based orchestration is purpose-built for that. LangGraph gives you maximum flexibility via stateful graph-based workflows. Both require real development investment.
Best for: Teams with specific multi-agent workflow requirements and the engineering resources to build custom.
AutoGPT
Respect to the project that kicked off the entire agent movement. In 2026, it’s mostly a learning tool; it’s been outpaced in production-readiness by everything else on this list.
The summary:
| Platform | Best For | Setup Time | License |
|---|---|---|---|
| OpenClaw | General-purpose, ecosystem | ~30 min | Apache 2.0 |
| Hermes Agent | Power users, cost optimization | ~15 min | MIT |
| NanoClaw | Security-focused small teams | ~5 min | MIT |
| CrewAI | Multi-agent workflows | ~15 min | MIT |
| LangGraph | Custom build | 30+ min | MIT |
Production Deployment: OpenClaw From Scratch
Here’s how to stand up a production-grade OpenClaw instance. The order matters: connect channels before hardening security and you’re exposed the moment the first message comes in.
Infrastructure
You need a Linux server with a public IP. For production, the sweet spot is a Hetzner CX22 at $3.79/month (2 vCPU, 4GB RAM, 40GB SSD in Frankfurt). If you’re in North America, Contabo VPS S at $5.49/month (4 vCPU, 8GB RAM) is hard to beat. DigitalOcean at $6/month has the most beginner-friendly UI.
Minimum spec: 2 vCPU, 4GB RAM, 40GB SSD running Ubuntu 22.04+. If you plan to run a local model alongside OpenClaw, you need at least 8GB RAM.
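A sixty-second sanity check on a fresh box before you invest any setup time:
nproc     # vCPU count (want 2+)
free -h   # total RAM (want 4GB+; 8GB+ if running a local model)
df -h /   # free disk on the root volume (want 40GB+)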
Initial server hardening:
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl git ufw
# Firewall: SSH, HTTP, HTTPS only
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
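Confirm the ruleset took effect before moving on:
sudo ufw status verbose   # expect: OpenSSH, 80, 443 allowed; default deny (incoming)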
Port 3008 (OpenClaw’s default) is intentionally not exposed directly. It runs behind a reverse proxy on 443.
Docker + Reverse Proxy
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
Configure Docker log rotation before you start writing logs:
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
sudo systemctl restart docker
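A one-liner confirms the new logging settings are active:
docker info --format '{{.LoggingDriver}}'   # should print: json-file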
Use Caddy for the reverse proxy; it provisions Let’s Encrypt SSL automatically with zero cert management:
sudo apt install -y caddy
/etc/caddy/Caddyfile:
your-domain.com {
    reverse_proxy localhost:3008
}
Point your DNS A record at the server IP, then sudo systemctl restart caddy. That’s it: HTTPS with auto-renewing certs.
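Once DNS has propagated, verify from any machine that Caddy obtained a valid certificate:
curl -sI https://your-domain.com | head -n 5   # expect a 200 (or redirect) with no TLS errors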
Deploying OpenClaw
mkdir -p ~/openclaw && cd ~/openclaw
docker-compose.yml:
version: "3.8"
services:
  openclaw:
    image: openclaw/openclaw:3.23
    container_name: openclaw
    restart: always
    ports:
      - "127.0.0.1:3008:3008" # loopback only; Caddy handles external access
    volumes:
      - ./data:/app/data
    env_file:
      - .env
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3008/health"]
      interval: 30s
      timeout: 10s
      retries: 3
The port binding 127.0.0.1:3008:3008 is intentional. External traffic never hits OpenClaw directly.
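Two quick checks from the server confirm the binding behaves as intended:
curl -s http://127.0.0.1:3008/health   # gateway answers on loopback
ss -tlnp | grep 3008                   # expect 127.0.0.1:3008, not 0.0.0.0:3008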
Security Hardening (Do This Before Anything Else)
Generate a strong gateway token:
OPENCLAW_GATEWAY_TOKEN=$(openssl rand -hex 32)
echo $OPENCLAW_GATEWAY_TOKEN
Your .env:
# Auth
OPENCLAW_GATEWAY_TOKEN=<your-generated-token>
# Model (Claude Sonnet for quality; DeepSeek for budget)
OPENCLAW_MODEL_PROVIDER=anthropic
OPENCLAW_MODEL_NAME=claude-sonnet-4-20250514
ANTHROPIC_API_KEY=sk-ant-api03-your-key-here
# Cost controls: start conservative
OPENCLAW_DAILY_TOKEN_LIMIT=100000
# Rate limiting: 10 messages/user/minute
OPENCLAW_RATE_LIMIT_PER_USER=10
OPENCLAW_RATE_LIMIT_WINDOW=60
# Reduce attack surface
OPENCLAW_PUPPETEER_ENABLED=false
chmod 600 .env # restrict file permissions
docker compose pull && docker compose up -d
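Give the container thirty seconds to pass its first health check, then verify:
docker compose ps                                            # STATUS should include (healthy)
docker inspect --format '{{.State.Health.Status}}' openclaw  # expect: healthy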
Model Configuration With Fallback
For production resilience, add a fallback from a different provider:
# Primary
OPENCLAW_MODEL_PROVIDER=anthropic
OPENCLAW_MODEL_NAME=claude-sonnet-4-20250514
ANTHROPIC_API_KEY=sk-ant-...
# Fallback (different provider = survives Anthropic outages)
OPENCLAW_FALLBACK_1_PROVIDER=openai
OPENCLAW_FALLBACK_1_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...
If you want local inference as the primary (more on this below), it looks like this:
{
  "agents": {
    "defaults": {
      "model": "ollama/gemma4:26b",
      "fallbacks": ["anthropic/claude-haiku-3-5"]
    }
  }
}
The Local Model Case: Why It Changes the Architecture
Most guides treat local models as a curiosity or a budget hack. I’d argue they’re a legitimate architectural choice, and in some cases the right default.
Here’s the honest breakdown:
| Factor | Local (Gemma 4 26B) | Cloud API (Claude/GPT) |
|---|---|---|
| Cost | $0/month (hardware you own) | $2–$50+/month by usage |
| Privacy | 100% local; nothing leaves the machine | Data sent to third-party servers |
| Latency | Hardware-dependent (7–300+ tokens/sec) | Network-dependent, typically fast |
| Tool use accuracy | 85.5% on τ²-bench (26B MoE) | Higher (frontier models) |
| Availability | Always on | Subject to API outages + rate limits |
| Offline | Full functionality | Requires internet |
For sensitive business data, proprietary code, or any environment where data residency matters, that privacy column isn’t a nice-to-have. It’s the whole point.
Google Gemma 4: The Model That Changed the Equation
Gemma 4 dropped on April 2, 2026 under Apache 2.0. Its 26B Mixture-of-Experts architecture activates only 3.8B parameters per inference, meaning it runs at roughly the speed of a 4B model while delivering near-13B quality. It scores 85.5% on τ²-bench (agentic tool use), which is the benchmark that actually matters for agent workflows. Native function calling and 256K context out of the box.
Model size selection:
| Model | Active Params | RAM Required | τ²-bench | Best For |
|---|---|---|---|---|
| E4B | 4.5B | 8GB+ | 57.5% | Laptops, quick tasks |
| 26B MoE ✓ | 3.8B active | 16GB+ | 85.5% | OpenClaw sweet spot |
| 31B Dense | 31B | 24GB+ VRAM | 86.4% | Max quality, serious hardware |
For most agent deployments, the 26B MoE is the right call.
Setting Up Ollama + Gemma 4
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Pull the 26B MoE model
ollama pull gemma4:26b
# Verify
curl -s http://localhost:11434/api/tags | jq '.models[].name'
Tune Ollama for OpenClaw’s workload. Create a custom Modelfile to reduce context size (full 256K is overkill and kills throughput for most agent interactions):
FROM gemma4:26b
PARAMETER num_ctx 8192
PARAMETER temperature 0.3
PARAMETER top_p 0.9
ollama create gemma4-openclaw -f Modelfile
temperature 0.3 is important. You want deterministic tool call output, not creative prose. Lower temperature = more consistent JSON.
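A quick way to see this for yourself is to hit Ollama’s native generate endpoint with the tuned model and run it a few times. The prompt here is just an illustration; at temperature 0.3 the output should come back nearly identical across runs:
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4-openclaw",
  "prompt": "Return only a JSON object with keys \"city\" and \"country\" for Paris.",
  "stream": false
}' | jq -r '.response'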
Connecting OpenClaw to Ollama
In ~/.openclaw/openclaw.json:
{
  "env": {
    "OLLAMA_API_KEY": "ollama-local",
    "OLLAMA_MAX_LOADED_MODELS": "3",
    "OLLAMA_KEEP_ALIVE": "-1"
  },
  "agents": {
    "defaults": {
      "model": "ollama/gemma4-openclaw"
    }
  },
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434"
      }
    }
  }
}
Critical: Use the native Ollama API (http://localhost:11434), not the OpenAI-compatible /v1 endpoint. Using /v1 breaks tool calling: the model outputs raw JSON as plain text instead of executing the tool call. This is the most common misconfiguration, and it’s not obvious from the error.
OLLAMA_KEEP_ALIVE: "-1" keeps loaded models in memory indefinitely, eliminating cold-start latency for frequently used models. On 16GB+ hardware, this pays for itself immediately.
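To see the difference concretely, here’s a minimal tool-call probe against the native chat endpoint. The get_weather tool is a hypothetical example; the request shape follows Ollama’s native /api/chat schema:
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4-openclaw",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}' | jq '.message.tool_calls'
On the native endpoint you should get back a structured tool_calls array. Send the same request to the /v1 path and the “call” tends to arrive as JSON-shaped text in the content field, which is exactly the failure mode described above.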
The Hybrid Architecture: Local Workers + Cloud Orchestrator
The most cost-effective production pattern isn’t “all local” or “all cloud.” It’s a frontier model for orchestration and local models for execution:
┌─────────────────────────────────────────────┐
│       Main Agent (Claude Sonnet/Opus)       │
│      Orchestration + User Interaction       │
└────────┬────────────┬────────────┬──────────┘
         │            │            │
         ▼            ▼            ▼
  ┌────────────┐ ┌────────────┐ ┌────────────┐
  │  qwen3:8b  │ │llama3.1:8b │ │ gemma4:26b │
  │  Research  │ │  Writing   │ │ Code + QA  │
  └────────────┘ └────────────┘ └────────────┘
All local. All free. All parallel.
The economics are compelling:
Cloud API (Claude Sonnet):
Input: ~$3/M tokens
Output: ~$15/M tokens
Every complex orchestrator turn costs real money.
Local Ollama:
Input: $0
Output: $0
~5W power draw during inference on Apple Silicon.
At $0.30/kWh, that 5W works out to roughly $0.0015/hour (0.005 kW × $0.30/kWh).
One developer I tracked dropped their monthly API spend from ~$40 to under $5 by routing 90% of agent interactions through local models, reserving cloud APIs only for complex reasoning tasks. The 85.5% τ²-bench score on Gemma 4 26B means it handles the vast majority of real agent work reliably.
Set this up in OpenClaw with model routing:
{
  "agents": {
    "defaults": {
      "model": "ollama/gemma4-openclaw",
      "fallbacks": [
        "ollama/qwen3:8b",
        "anthropic/claude-haiku-3-5"
      ]
    }
  }
}
What a Production-Grade Environment Actually Looks Like
Memory as a First-Class Concern
An agent that forgets between sessions is a very expensive chatbot. OpenClaw’s memory configuration:
OPENCLAW_MEMORY_ENABLED=true
OPENCLAW_MEMORY_PROVIDER=local
OPENCLAW_MEMORY_MAX_CONTEXT=10
OPENCLAW_MEMORY_AUTO_SAVE=true
OPENCLAW_MEMORY_RETENTION_DAYS=0
With AUTO_SAVE=true, the agent analyzes conversations and extracts persistent facts, preferences, and decisions automatically, building a knowledge base over time without you having to engineer memory explicitly. For larger deployments (10,000+ memory entries), consider Qdrant or Weaviate instead of the local JSON provider.
Your agent persona prompt should reinforce this explicitly: “When you learn a new fact about a user (name, preference, project context), save it to memory. When discussing anything you’ve covered before, check memory first.”
Skills: Extending Capability Safely
Skills are markdown files that give the agent new capabilities. Install them from the CLI:
docker exec openclaw openclaw skills search calendar
docker exec openclaw openclaw skills install google-calendar
Before enabling any community skill, read it. The skill system is powerful, but “powerful” cuts both ways: a skill with unrestricted tool execution is an attack surface. The CVE-2026-25253 vulnerability discovered in February 2026 highlighted exactly this risk.
Add a skills allowlist to your .env:
OPENCLAW_SKILLS_ALLOWLIST=publisher:openclaw-official
Monitoring
Three layers, in order of priority:
Layer 1 (Docker): The restart: always policy handles container crashes. The health check in the compose file detects hung processes. This covers most failure modes.
Layer 2 (External uptime): UptimeRobot (free tier) pinging https://your-domain.com/health every 5 minutes catches network outages and DNS failures that Docker health checks miss. Set up email or SMS alerts.
Layer 3 (Cost alerts): Set budget alerts in your model provider dashboard. Combine with OpenClaw’s OPENCLAW_DAILY_TOKEN_LIMIT for defense in depth. The provider catches unexpected spend. OpenClaw prevents runaway conversations.
Enable structured logging for full observability:
OPENCLAW_LOG_FORMAT=json
OPENCLAW_LOG_LEVEL=info
OPENCLAW_METRICS_ENABLED=true # Exposes /metrics for Prometheus
Forward to Grafana Loki or Datadog if you want alerting on error rates, rate limit hits, and skill execution failures.
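If you scrape the /metrics endpoint with Prometheus, the job definition is small. A minimal sketch, assuming Prometheus runs on the same host and the metrics are served on the gateway port:
# prometheus.yml
scrape_configs:
  - job_name: "openclaw"
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ["localhost:3008"]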
Backup Strategy
The critical data lives in data/. Everything else is reproducible from your compose file and .env.
# Daily backup (add to cron)
tar -czf ~/openclaw-backup-$(date +%Y%m%d).tar.gz ~/openclaw/data/
Copy the archive to S3, Backblaze B2, or any offsite location. Recovery is straightforward: provision a new server, deploy the stack, restore data/, start the container. Your agent resumes with full memory and conversation history.
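Wiring the backup into cron with an offsite copy is one line. A sketch assuming an rclone remote named b2 is already configured; note the escaped % signs, which cron would otherwise treat as line separators:
# crontab -e: daily at 03:00
0 3 * * * tar -czf $HOME/openclaw-backup-$(date +\%Y\%m\%d).tar.gz $HOME/openclaw/data/ && rclone copy $HOME/openclaw-backup-$(date +\%Y\%m\%d).tar.gz b2:openclaw-backups/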
The Startup Checklist
# 1. Install OpenClaw
npm install -g openclaw
openclaw onboard
# 2. Install Ollama
brew install ollama # macOS
# curl -fsSL https://ollama.com/install.sh | sh # Linux
# 3. Pull models
ollama pull gemma4:26b # Primary: 26B MoE, 16GB+
ollama pull qwen3:8b # Fallback worker: 8B, 5GB
# 4. Create tuned Modelfile
echo "FROM gemma4:26b
PARAMETER num_ctx 8192
PARAMETER temperature 0.3" | ollama create gemma4-openclaw -f -
# 5. Verify Ollama is serving
curl -s http://localhost:11434/api/tags | jq '.models[].name'
# 6. Start the gateway
openclaw gateway start
# 7. Verify connectivity
openclaw gateway status # Should show "RPC probe: ok"
# 8. Open the control UI
open http://127.0.0.1:18789/
What This Buys You
Get the environment right and you end up with something genuinely useful: a persistent agent that knows your context, routes intelligently between models, handles routine work locally for free, escalates to frontier models when it needs real reasoning power, and runs on infrastructure you control.
The conversation in the field has moved from “can AI agents do real work?” to “how do you architect them so they do real work reliably?” That shift is worth paying attention to. The answer isn’t just picking the right model โ it’s building the right environment around it.
The stack is mature enough now that there’s no excuse for running a half-configured agent on a cloud-only setup when local inference is this capable. Run the numbers, build the hybrid, own your stack.
Pete Haas is CTO at WellSaid AI and writes regularly about the future of voice and conversational AI at conversationcurve.com.