Running llama.cpp on AMD integrated graphics is pain. Here's everything I learned.
What works: llama.cpp with the Vulkan backend, Qwen 3B Q4_K_M at 18-22 tok/s, LoRA hot-loading, stable at 4096 context. What doesn't: YOLO/torchvision, ROCm, running multiple models at once, anything needing >4 GB of contiguous VRAM.
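The working setup above roughly corresponds to a launch line like this. A sketch, not a definitive config: the model path, adapter path, and port are placeholders, and `-ngl 99` assumes the Vulkan build picks up the iGPU.

```shell
# Sketch of a llama-server launch matching the setup above.
# Model path, adapter path, and port are placeholders.
llama-server \
  -m ./qwen-3b-q4_k_m.gguf \
  -ngl 99 \
  -c 4096 \
  --parallel 1 \
  --threads 12 --threads-batch 14 \
  --lora ./adapter.gguf \
  --port 8080
# -ngl 99: offload all layers to the GPU (Vulkan backend)
# -c 4096: the largest context that stayed stable here
# --parallel 1: see the throughput note below
```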
Tips: always Q4_K_M (Q5 is ~30% slower for marginally better quality). --parallel 1 (--parallel 2 halves throughput). Monitor RSS with a watchdog — llama-server leaks ~50 MB/hour; restart it at 3.5 GB. --threads 12 --threads-batch 14 on an 8-core Zen 4.
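The RSS watchdog can be sketched in a few lines of shell, reading `VmRSS` from `/proc` (Linux-only; the restart command and 60 s interval are assumptions, adapt to your service manager):

```shell
#!/bin/sh
# Hypothetical watchdog: restart llama-server when RSS crosses ~3.5 GB.
LIMIT_KB=$((3500 * 1024))   # 3.5 GB expressed in kB, matching /proc units

rss_kb() {
  # VmRSS is reported in kB in /proc/<pid>/status
  awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

over_limit() {
  # true (exit 0) when the process's RSS exceeds the limit
  [ "$(rss_kb "$1")" -gt "$LIMIT_KB" ]
}

# Main loop (sketch): poll every minute, restart via the init system.
# while true; do
#   PID=$(pgrep -f llama-server) && over_limit "$PID" && systemctl restart llama-server
#   sleep 60
# done
```

At ~50 MB/hour of leak this trips roughly once every couple of days with a 3B model resident, so the restart hit is rare.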
What GPU/CPU combos are you running?