FrankVoice (Voice)

FrankVoice is Frank's speech system: push-to-talk voice interaction that runs on the CPU only and uses about 155 MB of RAM in total.

Components

Component	Technology	Function	Size	Latency
VAD	Silero	Voice Activity Detection	~2 MB	<1ms
STT	faster-whisper (INT8)	Speech → Text	~75 MB	~500ms
Noise Gate	Spectral analysis	Background noise removal	—	~5ms
TTS (DE)	Piper	Text → Speech (German)	~40 MB	~200ms
TTS (EN)	Kokoro	Text → Speech (English)	~40 MB	~200ms
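
As a rough cross-check of the figures in the table, the sketch below sums the listed sizes and the per-utterance latency that the voice stages add around the LLM. It assumes both TTS models stay loaded at once, that the noise gate's unlisted size is negligible (counted as 0), and that VAD's "<1ms" can be rounded up to 1 ms. The dictionary layout and function names are illustrative, not part of FrankVoice.

```python
# Values copied from the component table above.
# Assumptions: Noise Gate size "—" counted as 0 MB, VAD "<1ms" rounded up to 1 ms.
COMPONENTS = {
    "VAD": {"ram_mb": 2, "latency_ms": 1},
    "STT": {"ram_mb": 75, "latency_ms": 500},
    "Noise Gate": {"ram_mb": 0, "latency_ms": 5},
    "TTS (DE)": {"ram_mb": 40, "latency_ms": 200},
    "TTS (EN)": {"ram_mb": 40, "latency_ms": 200},
}


def total_ram_mb(components):
    """Sum of all listed model sizes, assuming every model is resident."""
    return sum(c["ram_mb"] for c in components.values())


def voice_overhead_ms(components, tts="TTS (EN)"):
    """Latency added by the voice stages for one utterance (LLM time excluded).

    Only one TTS voice speaks per response, so only one is counted.
    """
    stages = ["VAD", "Noise Gate", "STT", tts]
    return sum(components[s]["latency_ms"] for s in stages)


if __name__ == "__main__":
    print(total_ram_mb(COMPONENTS))       # 157, in line with the ~155 MB total
    print(voice_overhead_ms(COMPONENTS))  # 706
```

Under these assumptions the voice stages add roughly 0.7 seconds on top of whatever the LLM needs to produce its reply.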

How to Use

Push-to-Talk

Press and hold Space (when chat input is not focused)
Speak your message
Release Space
Frank transcribes (Whisper), thinks (LLM), and speaks back (TTS)
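
The press, speak, release cycle above can be sketched as a small recorder object that keeps audio only while the key is held. This is a minimal sketch, not Frank's actual implementation: the class name, the method names, and the `chat_input_focused` flag are hypothetical, and real key and audio events would come from whatever UI and audio libraries Frank uses.

```python
from dataclasses import dataclass, field


@dataclass
class PushToTalkSession:
    """Collects audio chunks only while the talk key is held (hypothetical helper)."""

    recording: bool = False
    chunks: list = field(default_factory=list)

    def on_key_down(self, chat_input_focused: bool) -> None:
        # Space must keep typing a space when the chat input has focus.
        # Key-repeat events while the key is held are ignored.
        if not chat_input_focused and not self.recording:
            self.recording = True
            self.chunks = []

    def on_audio(self, chunk: bytes) -> None:
        # Audio arriving while the key is up is discarded.
        if self.recording:
            self.chunks.append(chunk)

    def on_key_up(self) -> bytes:
        # Stop recording and hand the captured audio to the next stage.
        self.recording = False
        audio = b"".join(self.chunks)
        self.chunks = []
        return audio
```

The bytes returned by `on_key_up` would then go to transcription, the LLM, and TTS in that order.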

Safety Guards

Buffer < 0.3 seconds → ignored (prevents accidental taps)
Empty transcript → toast: "Hold longer and speak clearly"
Mic warm-up — microphone stays initialized after first use (no re-init per press)
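
The first two guards can be sketched as a single function. The 0.3 second threshold and the toast text come from the list above; the 16 kHz sample rate, the function name, and the `transcribe` and `show_toast` callables are assumptions made for illustration.

```python
MIN_SECONDS = 0.3      # threshold from the guard list above
SAMPLE_RATE = 16_000   # assumption: 16 kHz mono capture


def handle_utterance(num_samples, transcribe, show_toast, sample_rate=SAMPLE_RATE):
    """Return the transcript, or None if a guard rejected the utterance.

    `transcribe` is a zero-argument callable returning text; `show_toast`
    takes a message string. Both are hypothetical hooks.
    """
    if num_samples / sample_rate < MIN_SECONDS:
        return None  # accidental tap: ignore silently, skip transcription
    text = transcribe().strip()
    if not text:
        show_toast("Hold longer and speak clearly")
        return None
    return text
```

Checking the duration first means an accidental tap never reaches Whisper, so the cost of a transcription is only paid for recordings long enough to contain speech.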

Language Detection

Whisper auto-detects the spoken language. Frank responds in the same language when his LoRA supports it; English and German are the strongest.
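
A minimal sketch of how the detected language could drive the reply language and the TTS engine. In faster-whisper, `WhisperModel.transcribe` returns the segments together with an info object that exposes `language` and `language_probability`; the code below takes those two values as plain arguments so it runs without a model. The supported set, the English fallback, and the 0.5 probability threshold are assumptions, since the source does not state them.

```python
SUPPORTED = {"en", "de"}  # assumption: only the two languages named as strongest
FALLBACK = "en"           # assumption: the source does not name a fallback

# Engine per language, taken from the component table.
TTS_ENGINES = {"de": "piper", "en": "kokoro"}


def response_language(detected, probability, min_probability=0.5):
    """Pick the reply language from Whisper's detection result.

    `detected` is a language code such as "de"; `probability` is the
    detection confidence. The threshold is an illustrative assumption.
    """
    if detected in SUPPORTED and probability >= min_probability:
        return detected
    return FALLBACK


def tts_engine(language):
    """Map the reply language to the TTS engine listed for it."""
    return TTS_ENGINES[language]
```

For example, a German utterance detected with high confidence would select `"de"` and therefore Piper, while an unsupported or uncertain detection would fall back to English and Kokoro.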

Architecture

Microphone → Ring Buffer → VAD (Silero) → Noise Gate
    → Whisper (faster-whisper INT8) → Text
    → Frank's chat pipeline (same as typed input)
    → Response text → TTS (Piper/Kokoro) → Speaker
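
The flow in the diagram can be sketched with a fixed-size ring buffer and one callable per stage. This is an illustrative outline only: the class and function names are hypothetical, and each stage is passed in as a plain function so that the real Silero, faster-whisper, and Piper/Kokoro calls stay out of the sketch.

```python
from collections import deque


class RingBuffer:
    """Fixed-capacity sample buffer; the oldest samples are dropped when full."""

    def __init__(self, capacity):
        self._samples = deque(maxlen=capacity)

    def write(self, samples):
        self._samples.extend(samples)

    def read_all(self):
        # Return everything captured so far and empty the buffer.
        out = list(self._samples)
        self._samples.clear()
        return out


def run_pipeline(buffer, vad, noise_gate, stt, chat, tts):
    """Run one utterance through the stages in the order shown in the diagram."""
    audio = buffer.read_all()
    speech = vad(audio)            # keep only the parts that contain speech
    if not speech:
        return None                # nothing to transcribe
    text = stt(noise_gate(speech))
    reply = chat(text)             # same chat pipeline as typed input
    return tts(reply)              # audio for the speaker
```

Passing the stages as arguments keeps the ordering explicit and makes it easy to exercise the flow with stub functions before any models are loaded.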

There is no always-listening mode and no wake word. FrankVoice is push-to-talk only: this is simpler, more reliable, and avoids the privacy concerns of ambient recording.