
FrankVoice (Voice)

FrankVoice is Frank's speech system: push-to-talk voice interaction that runs entirely on the CPU and uses about 155 MB of RAM in total.

Components

Component   Technology             Function                  Size          Latency
VAD         Silero                 Voice Activity Detection  ~2 MB         <1 ms
STT         faster-whisper (INT8)  Speech → Text             ~75 MB        ~500 ms
Noise Gate  Spectral analysis      Background noise removal  (not listed)  ~5 ms
TTS (DE)    Piper                  Text → Speech (German)    ~40 MB        ~200 ms
TTS (EN)    Kokoro                 Text → Speech (English)   ~40 MB        ~200 ms
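
As a quick sanity check on the table, the listed sizes and latencies can be added up. The sketch below copies the figures from the table; treating the noise gate's unlisted size as 0 MB and rounding the VAD's "<1 ms" up to 1 ms are assumptions made here for the arithmetic, and the function names are hypothetical. The latency figure excludes the LLM, whose response time the table does not give.

```python
# Figures copied from the component table above.
COMPONENTS = {
    "VAD": {"size_mb": 2, "latency_ms": 1},          # listed as <1 ms; 1 ms used as an upper bound
    "STT": {"size_mb": 75, "latency_ms": 500},
    "Noise Gate": {"size_mb": 0, "latency_ms": 5},   # size not listed; assumed negligible
    "TTS (DE)": {"size_mb": 40, "latency_ms": 200},
    "TTS (EN)": {"size_mb": 40, "latency_ms": 200},
}


def total_memory_mb(components):
    """Sum the listed model sizes, assuming all components stay loaded."""
    return sum(c["size_mb"] for c in components.values())


def voice_latency_ms(components, tts_engine):
    """Approximate voice overhead of one turn, excluding LLM time.

    Only one TTS engine runs per reply, so it is passed in explicitly.
    """
    stages = ["VAD", "Noise Gate", "STT", tts_engine]
    return sum(components[stage]["latency_ms"] for stage in stages)


print(total_memory_mb(COMPONENTS))               # 157
print(voice_latency_ms(COMPONENTS, "TTS (EN)"))  # 706
```

The 157 MB total is in line with the rounded "~155 MB" figure, and the voice stages add roughly 0.7 seconds on top of whatever the LLM needs.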

How to Use

Push-to-Talk

  1. Press and hold Space (when chat input is not focused)
  2. Speak your message
  3. Release Space
  4. Frank transcribes the audio (Whisper), generates a reply (LLM), and speaks it back (TTS)

Safety Guards

  • Recordings shorter than 0.3 seconds → ignored (prevents accidental taps)
  • Empty transcript → toast: "Hold longer and speak clearly"
  • Mic warm-up: the microphone stays initialized after first use (no re-init per press)
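
The first two guards can be sketched as a single check that runs when Space is released. This is a minimal illustration, not FrankVoice's actual code: the function name `handle_release`, the 16 kHz sample rate, and the `transcribe` callback are assumptions; only the 0.3-second threshold and the toast text come from the list above.

```python
SAMPLE_RATE_HZ = 16_000     # assumed capture rate, not stated in the article
MIN_BUFFER_SECONDS = 0.3    # threshold from the guard list
EMPTY_TOAST = "Hold longer and speak clearly"


def handle_release(samples, transcribe, sample_rate=SAMPLE_RATE_HZ):
    """Apply the guards to a finished recording.

    Returns (transcript, toast). Exactly one of them is set, or both are
    None when the recording is dropped silently.
    """
    duration_s = len(samples) / sample_rate
    if duration_s < MIN_BUFFER_SECONDS:
        return None, None               # accidental tap: ignore without feedback

    text = transcribe(samples).strip()
    if not text:
        return None, EMPTY_TOAST        # nothing recognized: ask the user to retry

    return text, None


# Usage with a stand-in transcriber (hypothetical; the real one would call Whisper).
tap = [0.0] * 1_600                     # 0.1 s of audio
utterance = [0.0] * 16_000              # 1.0 s of audio
print(handle_release(tap, lambda s: "hi"))             # (None, None)
print(handle_release(utterance, lambda s: "  "))       # (None, 'Hold longer and speak clearly')
print(handle_release(utterance, lambda s: " hello "))  # ('hello', None)
```

Checking the duration before transcribing means an accidental tap never reaches the speech model at all.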

Language Detection

Whisper auto-detects the spoken language. Frank responds in the same language if his LoRA supports it (strongest: English, German).
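
One way to picture the "same language if supported" rule is a small lookup on the detected language code. In faster-whisper, `WhisperModel.transcribe` returns the detected code as `info.language`; everything else in this sketch is an assumption for illustration, including the function name, the supported set, and the fallback to English, which the article does not specify.

```python
SUPPORTED_LANGUAGES = ("en", "de")   # the two strongest languages named above
FALLBACK_LANGUAGE = "en"             # assumption: the article names no fallback


def pick_response_language(detected, supported=SUPPORTED_LANGUAGES,
                           fallback=FALLBACK_LANGUAGE):
    """Reply in the detected language when supported, else use the fallback."""
    code = detected.lower().split("-")[0]   # normalize e.g. "EN-us" to "en"
    return code if code in supported else fallback


print(pick_response_language("de"))     # de
print(pick_response_language("fr"))     # en
print(pick_response_language("EN-us"))  # en
```

The returned code could also decide which TTS engine speaks the reply: Piper for German, Kokoro for English.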

Architecture

Microphone → Ring Buffer → VAD (Silero) → Noise Gate
    → Whisper (faster-whisper INT8) → Text
    → Frank's chat pipeline (same as typed input)
    → Response text → TTS (Piper/Kokoro) → Speaker

No always-listening mode. No wake words. Push-to-talk only: simpler, more reliable, and free of the privacy concerns that come with ambient recording.
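
The flow in the diagram can be sketched as a chain of interchangeable stages. This is an illustration under stated assumptions, not the actual implementation: the `VoicePipeline` class and its field names are hypothetical, and the stages in the usage example are trivial stand-ins for Silero, the noise gate, faster-whisper, the chat pipeline, and Piper/Kokoro.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Audio = Sequence[float]


@dataclass
class VoicePipeline:
    """One push-to-talk turn, in the stage order shown in the diagram."""
    vad: Callable[[Audio], Audio]          # keep only the speech portion
    noise_gate: Callable[[Audio], Audio]   # remove background noise
    stt: Callable[[Audio], str]            # speech to text
    chat: Callable[[str], str]             # same pipeline as typed input
    tts: Callable[[str], Audio]            # text to speech

    def run(self, recording: Audio) -> Audio:
        speech = self.vad(recording)
        cleaned = self.noise_gate(speech)
        text = self.stt(cleaned)
        reply = self.chat(text)
        return self.tts(reply)


# Stand-in stages so the sketch runs without any models installed.
pipeline = VoicePipeline(
    vad=lambda audio: [s for s in audio if abs(s) > 0.01],
    noise_gate=lambda audio: audio,
    stt=lambda audio: f"{len(audio)} samples heard",
    chat=lambda text: text.upper(),
    tts=lambda text: [0.0] * len(text),
)

output = pipeline.run([0.0, 0.5, -0.5])
print(len(output))  # 15
```

Because the chat stage only receives text, voice input and typed input can share everything downstream of transcription, which matches the diagram.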
