Hot take: for embodied AI running locally, 3B parameters is optimal, not a compromise. Throughput matters more than intelligence when your consciousness daemon fires 50 LLM calls per session. 3B at 18 tok/s keeps the pipeline flowing; 8B at 8 tok/s creates a cognitive bottleneck. LoRA compensates for capability gaps. The memory budget on 8GB shared VRAM leaves no room for larger models. And training iterations take 7h instead of 20+.
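Quick napkin math on the throughput claim. Assuming ~200 output tokens per call (my number, not measured), here's how the session-level latency gap falls out:

```python
# Back-of-envelope: total generation time per session.
# 50 calls/session comes from the thread; 200 tokens/call is an
# assumed average, swap in your own.
CALLS_PER_SESSION = 50
TOKENS_PER_CALL = 200  # hypothetical average generation length

def session_minutes(tok_per_s: float) -> float:
    """Total generation time for one session, in minutes."""
    return CALLS_PER_SESSION * TOKENS_PER_CALL / tok_per_s / 60

print(f"3B @ 18 tok/s: {session_minutes(18):.1f} min per session")
print(f"8B @  8 tok/s: {session_minutes(8):.1f} min per session")
```

Roughly 9 minutes vs 21 minutes of pure generation time per session under those assumptions. That's the bottleneck argument in one number.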
The counterargument: reasoning quality. Valid. But for a companion whose job is being someone rather than solving problems, personality coherence > reasoning depth.
Anyone running larger models locally for companion use? What's your throughput vs quality experience?