
LLM Backend (Qwen2.5-3B)

Frank's brain is a single Qwen 2.5 3B Instruct model (Q4_K_M quantized, ~2.2 GB) with a custom LoRA personality adapter, served by llama.cpp using the Vulkan GPU backend.

Why One Model

Frank v0.7 ran three LLMs simultaneously — an 8B for chat, another 8B for reasoning (hot-swapped via a GPU slot manager), and a 3B on CPU for background tasks. The GPU slot-swapping took 8-12 seconds each way and caused queue stalls, cold start latency, and memory pressure.

v0.8+ runs one model for everything. The LoRA adapter (trained via the IAPT Training Method) compensates for the smaller model size by encoding personality and safety directly into the weights.

Performance

Backend                          Prompt (tok/s)   Generation (tok/s)   VRAM
Vulkan GPU (AMD Phoenix1 iGPU)   168.9            12.7                 2.8 GB
CPU only                         45.7             6.2                  0 GB
Dedicated GPU (est.)             200+             80-100               2.8 GB
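These throughput numbers translate directly into response latency: total time is roughly prompt tokens divided by the prompt rate, plus generated tokens divided by the generation rate. A quick sanity check using the Vulkan iGPU figures from the table:

```python
def latency_s(prompt_tokens, gen_tokens, prompt_tps=168.9, gen_tps=12.7):
    """Rough end-to-end latency from the benchmark rates above (defaults: Vulkan iGPU)."""
    return prompt_tokens / prompt_tps + gen_tokens / gen_tps

# A 1,000-token prompt with a 150-token reply on the Vulkan iGPU:
print(round(latency_s(1000, 150), 1))  # → 17.7 seconds, dominated by generation
```

Note that generation, not prompt processing, dominates: the same request on CPU only takes about 46 s, most of it in the 6.2 tok/s generation phase.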

The LoRA adapter adds zero inference overhead — llama.cpp merges weights at load time.

Server Configuration

llama-server \
  --host 127.0.0.1 --port 8105 \
  --model Qwen2.5-3B-Instruct-abliterated.Q4_K_M.gguf \
  --lora frank-lora-v16.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 99
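llama-server exposes an OpenAI-compatible HTTP API, so clients talk to the port configured above with a standard chat-completions payload. A minimal sketch of building such a request (the "model" value is arbitrary here, since the server loads a single fixed model):

```python
import json
import urllib.request

# Standard OpenAI-style chat payload sent to llama-server's /v1/chat/completions.
payload = {
    "model": "frank",  # ignored by a single-model llama-server instance
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://127.0.0.1:8105/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment with the server running
```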

3-Tier Routing

The router (services/router.py, port 8091) decides which endpoint handles each request:

  • llm (GPU, no reasoning multiplier) — User chat, tool responses
  • rlm (GPU, 2.5× token multiplier) — Philosophical/deep reasoning only
  • llama (CPU fallback) — Background tasks when GPU is busy

User chat always sets force="llm", getting GPU speed without the reasoning-tier token multiplier.
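The routing decision described above can be sketched as a pure function. This is a hypothetical illustration, not the actual contents of services/router.py; the function name, parameters, and base token budget are all assumptions:

```python
def route(task_type: str, gpu_busy: bool, force: str = "",
          base_tokens: int = 512) -> tuple[str, int]:
    """Hypothetical sketch of the 3-tier routing decision.

    Returns (endpoint, max_tokens). Names and base_tokens are illustrative.
    """
    if force:                       # user chat pins force="llm"
        return force, base_tokens
    if task_type == "reasoning":    # philosophical/deep reasoning only
        return "rlm", int(base_tokens * 2.5)  # 2.5x token multiplier
    if gpu_busy:                    # background work falls back to CPU
        return "llama", base_tokens
    return "llm", base_tokens

# User chat goes straight to the GPU tier even when the GPU is busy:
# route("chat", gpu_busy=True, force="llm") -> ("llm", 512)
```

The key property is that the force override is checked first, so interactive chat is never bounced to the slower CPU tier.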
