
LLM Backend (Qwen2.5-3B)

Frank's brain is a single Qwen 2.5 3B Instruct model (Q4_K_M quantized, ~2.2 GB) with a custom LoRA personality adapter, served by llama.cpp using the Vulkan GPU backend.

Why One Model

Frank v0.7 ran three LLMs simultaneously — an 8B for chat, another 8B for reasoning (hot-swapped via a GPU slot manager), and a 3B on CPU for background tasks. The GPU slot-swapping took 8-12 seconds each way and caused queue stalls, cold start latency, and memory pressure.

v0.8+ runs one model for everything. The LoRA adapter (trained via the IAPT Training Method) compensates for the smaller model size by encoding personality and safety directly into the weights.

Performance

Backend                          Prompt (tok/s)   Generation (tok/s)   VRAM
Vulkan GPU (AMD Phoenix1 iGPU)   168.9            12.7                 2.8 GB
CPU only                         45.7             6.2                  0 GB
Dedicated GPU (est.)             200+             80-100               2.8 GB
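These throughput numbers translate directly into response latency: total time is roughly prompt tokens divided by the prompt rate, plus generated tokens divided by the generation rate. A quick sanity check using the Vulkan iGPU figures from the table:

```python
def latency_s(prompt_tokens, gen_tokens, prompt_tps=168.9, gen_tps=12.7):
    """Rough end-to-end latency from the benchmark rates above (defaults: Vulkan iGPU)."""
    return prompt_tokens / prompt_tps + gen_tokens / gen_tps

# A 1,000-token prompt with a 150-token reply on the Vulkan iGPU:
print(round(latency_s(1000, 150), 1))  # → 17.7 seconds, dominated by generation
```

Note that generation, not prompt processing, dominates: the same request on CPU only takes about 46 s, most of it in the 6.2 tok/s generation phase.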

The LoRA adapter adds zero inference overhead — llama.cpp merges weights at load time.

Server Configuration

llama-server \
  --host 127.0.0.1 --port 8105 \
  --model Qwen2.5-3B-Instruct-abliterated.Q4_K_M.gguf \
  --lora frank-lora-v16.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 99
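llama-server exposes an OpenAI-compatible HTTP API, so clients talk to the port configured above with a standard chat-completions payload. A minimal sketch of building such a request (the "model" value is arbitrary here, since the server loads a single fixed model):

```python
import json
import urllib.request

# Standard OpenAI-style chat payload sent to llama-server's /v1/chat/completions.
payload = {
    "model": "frank",  # ignored by a single-model llama-server instance
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://127.0.0.1:8105/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment with the server running
```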

3-Tier Routing

The router (services/router.py, port 8091) decides which endpoint handles each request:

  • llm (GPU, no reasoning multiplier) — User chat, tool responses
  • rlm (GPU, 2.5× token multiplier) — Philosophical/deep reasoning only
  • llama (CPU fallback) — Background tasks when GPU is busy

User chat always sets force="llm", getting GPU speed without the reasoning-tier token multiplier.
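The routing decision described above can be sketched as a pure function. This is a hypothetical illustration, not the actual contents of services/router.py; the function name, parameters, and base token budget are all assumptions:

```python
def route(task_type: str, gpu_busy: bool, force: str = "",
          base_tokens: int = 512) -> tuple[str, int]:
    """Hypothetical sketch of the 3-tier routing decision.

    Returns (endpoint, max_tokens). Names and base_tokens are illustrative.
    """
    if force:                       # user chat pins force="llm"
        return force, base_tokens
    if task_type == "reasoning":    # philosophical/deep reasoning only
        return "rlm", int(base_tokens * 2.5)  # 2.5x token multiplier
    if gpu_busy:                    # background work falls back to CPU
        return "llama", base_tokens
    return "llm", base_tokens

# User chat goes straight to the GPU tier even when the GPU is busy:
# route("chat", gpu_busy=True, force="llm") -> ("llm", 512)
```

The key property is that the force override is checked first, so interactive chat is never bounced to the slower CPU tier.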
