FrankVoice is Frank's speech system — push-to-talk voice interaction, CPU-only, ~155 MB RAM total.
Components
| Component | Technology | Function | Size | Latency |
|---|---|---|---|---|
| VAD | Silero | Voice Activity Detection | ~2 MB | <1ms |
| STT | faster-whisper (INT8) | Speech → Text | ~75 MB | ~500ms |
| Noise Gate | Spectral analysis | Background noise removal | — | ~5ms |
| TTS (DE) | Piper | Text → Speech (German) | ~40 MB | ~200ms |
| TTS (EN) | Kokoro | Text → Speech (English) | ~40 MB | ~200ms |
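As a quick sanity check on the table, the listed sizes and latencies can be summed. The sketch below copies the figures from the table; treating the noise gate as 0 MB and the VAD as a full 1 ms are assumptions, and LLM response time is not listed in the table, so it is excluded.

```python
# Figures copied from the component table above.
# Assumptions: noise gate counted as 0 MB (no size listed), VAD "<1ms" rounded up to 1 ms.
COMPONENTS = {
    "VAD": {"size_mb": 2, "latency_ms": 1},
    "STT": {"size_mb": 75, "latency_ms": 500},
    "Noise Gate": {"size_mb": 0, "latency_ms": 5},
    "TTS (DE)": {"size_mb": 40, "latency_ms": 200},
    "TTS (EN)": {"size_mb": 40, "latency_ms": 200},
}


def total_size_mb(components=COMPONENTS):
    """Memory footprint if every component, including both TTS voices, is loaded."""
    return sum(c["size_mb"] for c in components.values())


def voice_latency_ms(tts, components=COMPONENTS):
    """Speech-side latency for one turn using a single TTS voice (LLM time excluded)."""
    stages = ["VAD", "Noise Gate", "STT", tts]
    return sum(components[name]["latency_ms"] for name in stages)


print(total_size_mb())               # 157
print(voice_latency_ms("TTS (EN)"))  # 706
```

The 157 MB sum is in line with the stated total of about 155 MB, and the speech stages add roughly 0.7 seconds on top of whatever the LLM needs.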
How to Use
Push-to-Talk
- Press and hold Space (when chat input is not focused)
- Speak your message
- Release Space
- Frank transcribes (Whisper), thinks (LLM), speaks back (TTS)
Safety Guards
- Recording shorter than 0.3 seconds → ignored (prevents accidental taps)
- Empty transcript → toast: "Hold longer and speak clearly"
- Mic warm-up — microphone stays initialized after first use (no re-init per press)
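The press, hold, release flow and the first two guards can be sketched as a small state holder. This is a minimal illustration, not FrankVoice's actual code: the `PushToTalk` class, the `transcribe` and `show_toast` callables, and the 16 kHz sample rate are all assumptions; only the 0.3 second threshold and the toast text come from the list above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

MIN_SECONDS = 0.3                             # threshold from the guard list
EMPTY_TOAST = "Hold longer and speak clearly"  # toast text from the guard list
SAMPLE_RATE = 16_000                          # assumption: 16 kHz mono input


@dataclass
class PushToTalk:
    """Hypothetical push-to-talk recorder with the two input guards."""
    transcribe: Callable[[List[float]], str]  # stand-in for the Whisper call
    show_toast: Callable[[str], None]         # stand-in for the UI toast
    sample_rate: int = SAMPLE_RATE
    _buffer: List[float] = field(default_factory=list, init=False)
    _recording: bool = field(default=False, init=False)

    def press(self) -> None:
        """Space pressed: start a fresh recording."""
        self._buffer = []
        self._recording = True

    def feed(self, samples: List[float]) -> None:
        """Called with microphone samples; kept only while the key is held."""
        if self._recording:
            self._buffer.extend(samples)

    def release(self) -> Optional[str]:
        """Space released: apply the guards, return the transcript or None."""
        self._recording = False
        audio, self._buffer = self._buffer, []
        if len(audio) / self.sample_rate < MIN_SECONDS:
            return None  # accidental tap: ignored without any message
        text = self.transcribe(audio).strip()
        if not text:
            self.show_toast(EMPTY_TOAST)
            return None
        return text
```

A caller would wire `press` and `release` to the Space key events and pass any non-`None` result on to the chat pipeline.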
Language Detection
Whisper auto-detects the spoken language. Frank responds in the same language if his LoRA supports it (strongest: English, German).
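One way the detected language could drive the choice of TTS voice is sketched below. faster-whisper's `WhisperModel.transcribe` does return an info object with `language` and `language_probability`; everything else here is an assumption for illustration: the `"base"` model size, the 0.5 confidence cutoff, the English fallback, and the helper names `transcribe_with_language` and `choose_voice`.

```python
from typing import Tuple

# Engine per language, taken from the component table (Piper = German, Kokoro = English).
TTS_BY_LANGUAGE = {"de": "piper", "en": "kokoro"}
FALLBACK_LANGUAGE = "en"  # assumption: unsupported or uncertain input falls back to English


def choose_voice(language: str, probability: float,
                 min_probability: float = 0.5) -> Tuple[str, str]:
    """Return (reply language, TTS engine) for a detected language."""
    if language in TTS_BY_LANGUAGE and probability >= min_probability:
        return language, TTS_BY_LANGUAGE[language]
    return FALLBACK_LANGUAGE, TTS_BY_LANGUAGE[FALLBACK_LANGUAGE]


def transcribe_with_language(audio_path: str, model_size: str = "base"):
    """Transcribe a file and report the language Whisper detected."""
    # Imported lazily so choose_voice works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(audio_path)
    text = " ".join(segment.text.strip() for segment in segments)
    return text, info.language, info.language_probability


print(choose_voice("de", 0.93))  # ('de', 'piper')
print(choose_voice("fr", 0.88))  # ('en', 'kokoro')
```

Keeping the routing in a pure function makes the fallback behavior easy to adjust without touching the transcription call.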
Architecture
Microphone → Ring Buffer → VAD (Silero) → Noise Gate
→ Whisper (faster-whisper INT8) → Text
→ Frank's chat pipeline (same as typed input)
→ Response text → TTS (Piper/Kokoro) → Speaker
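The flow in the diagram can be sketched as a bounded buffer feeding a chain of stages. This is only an outline under stated assumptions: `RingBuffer`, `run_turn`, and the stage callables (`has_speech`, `denoise`, `transcribe`, `chat`, `speak`) are hypothetical stand-ins for Silero, the noise gate, faster-whisper, the chat pipeline, and Piper/Kokoro, not the project's real interfaces.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


class RingBuffer:
    """Fixed-capacity sample buffer; the oldest samples drop out when it is full."""

    def __init__(self, max_samples: int) -> None:
        self._data = deque(maxlen=max_samples)

    def write(self, samples: Iterable[float]) -> None:
        self._data.extend(samples)

    def read_all(self) -> List[float]:
        """Return the buffered samples and empty the buffer."""
        out = list(self._data)
        self._data.clear()
        return out


def run_turn(
    buffer: RingBuffer,
    has_speech: Callable[[List[float]], bool],      # stand-in for the VAD
    denoise: Callable[[List[float]], List[float]],  # stand-in for the noise gate
    transcribe: Callable[[List[float]], str],       # stand-in for Whisper
    chat: Callable[[str], str],                     # stand-in for the chat pipeline
    speak: Callable[[str], None],                   # stand-in for TTS playback
) -> Optional[str]:
    """Run one voice turn in the order shown in the diagram."""
    audio = buffer.read_all()
    if not has_speech(audio):
        return None  # nothing to transcribe
    text = transcribe(denoise(audio))
    reply = chat(text)  # same entry point as typed input
    speak(reply)
    return reply
```

Passing the stages in as callables keeps the sketch testable with simple lambdas and mirrors the point that voice input joins the same chat pipeline as typed input.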
No always-listening mode. No wake words. Push-to-talk only — simpler, more reliable, no privacy concerns from ambient recording.