Software stack
The cleanest portable stack is out-of-process binaries + a thin orchestrator. No daemon, no host Python required, no per-machine install.
Engine: llama.cpp llama-server (default)
Section titled “Engine: llama.cpp llama-server (default)”A single self-contained binary, no Python runtime, runs on a Pi or a workstation.
llama-server exposes an OpenAI-compatible HTTP API (/v1/chat/completions,
/v1/embeddings), with streaming, batching, idle sleep/unload, and multimodal
input via libmtmd.
llama-server -m models/qwen3-8b-q4_k_m.gguf --host 127.0.0.1 --port 8080- Acceleration: Metal (Apple), CUDA (NVIDIA), ROCm (AMD), Vulkan, CPU fallback.
- Pin a tested build tag per release; vendor per-OS/arch binaries under
engine/llama-server/.
Why not the alternatives (kept as optional providers): Ollama is friendly
but daemon/cache-oriented; LM Studio isn’t built for redistributable USB
packaging; vLLM/SGLang are too CUDA/Python-heavy for arbitrary offline laptops;
MLX is Apple-only. llama-server wins on portability.
Voice (STT): whisper.cpp
Section titled “Voice (STT): whisper.cpp”Same philosophy: a portable binary, offline, with Metal/Vulkan/CUDA/ROCm/CPU and
VAD. Use base for the Pocket tier, large-v3-turbo for Field/Lab.
Vision (VLM): Qwen2.5-VL GGUF via llama-server
Section titled “Vision (VLM): Qwen2.5-VL GGUF via llama-server”| Use | Model |
|---|---|
| Default local vision | Qwen2.5-VL-7B-Instruct-GGUF |
| Smaller (Field) | Qwen2.5-VL-3B-Instruct-GGUF |
| Tiny caption / OCR-lite | moondream2 GGUF |
Caveats (still version-sensitive): the model GGUF and its mmproj projector
must match; freeze exact files + SHA-256 in the manifest. Images consume context
tokens, so budget a larger -c.
Distribution & integrity
Section titled “Distribution & integrity”| Layer | Pick | Why |
|---|---|---|
| Model source | Hugging Face Hub (hf download, pinned revision + exact filename) | Standard GGUF distribution |
| Format | GGUF, usually Q4_K_M | Portable, compact for USB |
| Integrity | models.lock.json + SHA-256 | repo_id, revision, filename, size, license, sha256, required engine build, source URL |
| App build | uv + uv.lock, frozen with PyInstaller --onedir | reproducible, no host Python |
| Orchestrator | FastAPI + httpx / OpenAI client | thin: start llama-server, call localhost, stream council stages |
Do not make
llama-cpp-pythonthe default runtime. Bundling native wheels across Metal/CUDA/Vulkan/ROCm is messier than keepingllama-serverout-of-process. See the build runbook to implement this.
Sources: llama.cpp · llama-server README · whisper.cpp · HF GGUF docs. Verify versions per release.