Frank's brain is a single Qwen 2.5 3B Instruct model (Q4_K_M quantized, ~2.2 GB) with a custom LoRA personality adapter, served via llama.cpp on Vulkan GPU.
Why One Model
Frank v0.7 ran three LLMs simultaneously: an 8B for chat, a second 8B for reasoning (hot-swapped via a GPU slot manager), and a 3B on CPU for background tasks. Each GPU slot swap took 8-12 seconds and caused queue stalls, cold-start latency, and memory pressure.
v0.8+ runs a single model for everything. The LoRA adapter (trained via the IAPT Training Method) compensates for the smaller model size by encoding personality and safety behavior directly into the weights.
Performance
| Backend | Prompt (tok/s) | Generation (tok/s) | VRAM |
|---|---|---|---|
| Vulkan GPU (AMD Phoenix1 iGPU) | 168.9 | 12.7 | 2.8 GB |
| CPU only | 45.7 | 6.2 | 0 GB |
| Dedicated GPU (est.) | 200+ | 80-100 | 2.8 GB |
The LoRA adapter adds zero inference overhead, since llama.cpp merges the adapter weights into the base model at load time.
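As a rough sanity check on the table above, prefill and generation latency can be estimated from the Vulkan iGPU numbers. A back-of-the-envelope sketch (the example token counts are illustrative assumptions, not measurements from the source):

```python
# Throughput figures from the Vulkan GPU row of the table above.
PROMPT_SPEED = 168.9   # prompt processing, tok/s
GEN_SPEED = 12.7       # generation, tok/s

def estimate_latency(prompt_tokens: int, gen_tokens: int) -> float:
    """Seconds to process the prompt plus generate the reply."""
    return prompt_tokens / PROMPT_SPEED + gen_tokens / GEN_SPEED

# e.g. a 1024-token prompt with a 128-token reply:
total = estimate_latency(1024, 128)   # ~6.1 s prefill + ~10.1 s generation
```

Generation speed, not prompt processing, dominates end-to-end latency on the iGPU, which is why the router keeps reasoning-heavy (token-hungry) requests off the user-chat path.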
Server Configuration
```bash
llama-server \
  --host 127.0.0.1 --port 8105 \
  --model Qwen2.5-3B-Instruct-abliterated.Q4_K_M.gguf \
  --lora frank-lora-v16.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 99
```
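llama-server exposes llama.cpp's OpenAI-compatible chat endpoint, so any client can talk to this process with a plain JSON POST. A minimal sketch of building such a request against the config above (the helper name is hypothetical; the endpoint path `/v1/chat/completions` is llama.cpp's standard one):

```python
import json

# The server configured above listens on 127.0.0.1:8105.
URL = "http://127.0.0.1:8105/v1/chat/completions"

def build_chat_request(user_text: str, max_tokens: int = 256) -> bytes:
    """Serialize a minimal chat-completions payload for llama-server."""
    payload = {
        # Model name is informational for a single-model server.
        "model": "Qwen2.5-3B-Instruct-abliterated.Q4_K_M.gguf",
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload).encode("utf-8")

# To actually send it (requires the server to be running):
#   import urllib.request
#   req = urllib.request.Request(URL, data=build_chat_request("hi"),
#                                headers={"Content-Type": "application/json"})
#   reply = json.load(urllib.request.urlopen(req))
```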
3-Tier Routing
The router (services/router.py, port 8091) decides which endpoint handles each request:
- `llm` (GPU, no reasoning multiplier): user chat, tool responses
- `rlm` (GPU, 2.5× token multiplier): philosophical/deep reasoning only
- `llama` (CPU fallback): background tasks when the GPU is busy
User chat always sets force="llm", getting GPU speed without the reasoning-token overhead.