
FrankEye (Vision)

FrankEye is Frank's vision system — CPU-only, ~100ms per frame, no cloud APIs.

Pipeline (5 stages)

| Stage | Technology | What It Does | Latency |
|---|---|---|---|
| 1. Scene Classification | Digital Retina v8 CNN (CIFAR-10) | Initial categorization | ~5ms |
| 2. Scene Understanding | DINOv2 (INT8 ONNX) | Self-supervised feature extraction | ~47ms |
| 3. Texture Analysis | GOLPU | Material/surface detection | ~10ms |
| 4. Text Recognition | Tesseract OCR (eng+deu) | Read text in images | ~30ms |
| 5. Face Detection | OpenCV Haar cascades | Detect human faces | ~8ms |
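The five stages above can be sketched as a sequential chain with per-stage timing. The stage functions below are hypothetical stubs standing in for the real implementations, not FrankEye's actual API:

```python
import time

# Hypothetical stand-ins for the real stage implementations.
def classify_scene(image):    return {"scene": "workspace"}     # CNN, ~5ms
def extract_features(image):  return {"features": [0.1, 0.2]}   # DINOv2 INT8, ~47ms
def analyze_texture(image):   return {"material": "wood"}       # GOLPU, ~10ms
def read_text(image):         return {"text": "hello"}          # Tesseract, ~30ms
def detect_faces(image):      return {"faces": 0}               # Haar cascades, ~8ms

STAGES = [classify_scene, extract_features, analyze_texture, read_text, detect_faces]

def run_pipeline(image):
    """Run all five stages sequentially, recording per-stage latency."""
    result, timings = {}, {}
    for stage in STAGES:
        t0 = time.perf_counter()
        result.update(stage(image))
        timings[stage.__name__] = (time.perf_counter() - t0) * 1000  # ms
    result["timings_ms"] = timings
    return result
```

With the real stages plugged in, the recorded latencies would sum to roughly the ~100ms per-frame budget (5 + 47 + 10 + 30 + 8 = 100).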

Three Access Methods

Screenshot Hotkey

Ctrl+Shift+F opens a PyQt5 crosshair overlay. Select any screen region. Frank analyzes it.
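The selection the overlay hands to the analysis pipeline might be represented as a simple region record like this (a hypothetical sketch; FrankEye's actual data structures aren't documented here):

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A screen region selected with the crosshair overlay."""
    x: int   # left edge, pixels
    y: int   # top edge, pixels
    w: int   # width
    h: int   # height

    def crop(self, frame):
        """Cut this region out of a frame given as rows of pixels."""
        return [row[self.x:self.x + self.w]
                for row in frame[self.y:self.y + self.h]]
```

The cropped sub-image, rather than the whole screen, is what would then go through the 5-stage pipeline.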

Chat Drag & Drop

Drop any image file into the Web UI chat. Choose:

  • Describe — full 5-stage pipeline → LLM interpretation
  • Transcribe — Tesseract OCR only (fast, text-only)
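The two modes suggest a small dispatcher along these lines. Everything below is an illustrative stand-in for FrankEye internals, not its real handler:

```python
def ocr_only(image):       return "recognized text"       # Tesseract stage only
def full_pipeline(image):  return {"scene": "workspace"}  # all 5 stages
def llm_interpret(a):      return f"An image of a {a['scene']}."

def handle_image(image, mode):
    """Dispatch a dropped image to the chosen analysis mode.

    'transcribe' runs OCR only (fast); 'describe' runs the full
    5-stage pipeline and hands the result to the LLM.
    """
    if mode == "transcribe":
        return {"mode": "transcribe", "text": ocr_only(image)}
    if mode == "describe":
        return {"mode": "describe", "summary": llm_interpret(full_pipeline(image))}
    raise ValueError(f"unknown mode: {mode}")
```

The point of the split is latency: transcribe skips the ~70ms of non-OCR stages when the user only wants the text.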

Slash Command

/screenshot — captures full screen, runs analysis, Frank describes what he sees.

What Frank Can See

  • Text — error messages, code, documents, web pages (English + German OCR)
  • UI Elements — buttons, windows, menus, application layout
  • Scenes — indoor/outdoor, workspace, nature
  • Objects — common objects from CIFAR-10 categories
  • Faces — presence detection (not identification)
  • Materials — surface textures via GOLPU analysis

Adaptive Vision

tools/frank_adaptive_vision.py provides a two-stage pipeline: Stage 1 runs the fast pipeline (~100ms). If certain escalation triggers fire (complex scene, ambiguous content, important text), Stage 2 calls a VLM (Vision Language Model) for deeper analysis. The VLM is optional and only activated when the fast pipeline isn't confident enough.
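The escalation decision could look roughly like this. The trigger names and thresholds are illustrative assumptions, not the actual logic in tools/frank_adaptive_vision.py:

```python
def should_escalate(fast_result, confidence_threshold=0.6):
    """Decide whether Stage 2 (the optional VLM) is needed.

    Returns (escalate, triggers): escalate is True if any trigger
    fired; triggers lists which ones, for logging.
    """
    triggers = []
    if fast_result.get("scene_confidence", 1.0) < confidence_threshold:
        triggers.append("low_confidence")          # ambiguous content
    if fast_result.get("object_count", 0) > 10:
        triggers.append("complex_scene")
    if len(fast_result.get("ocr_text", "")) > 200:
        triggers.append("important_text")
    return bool(triggers), triggers
```

On the common path no trigger fires and the ~100ms fast result is returned as-is; the VLM cost is paid only on the hard cases.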

No Cloud

Everything runs locally on CPU. No API calls to GPT-4V, Claude Vision, or any cloud service. DINOv2, Meta's self-supervised vision transformer, runs as an INT8-quantized ONNX model at ~47ms per inference.
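Feeding a frame to a ViT ONNX model like this typically means converting it to an NCHW float32 tensor with ImageNet normalization. A minimal sketch, assuming standard DINOv2 preprocessing constants (the model path and input name are hypothetical, not confirmed details of Frank's export):

```python
import numpy as np

# ImageNet normalization constants commonly used with DINOv2.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(rgb_uint8, size=224):
    """Convert an HxWx3 uint8 RGB image to a 1x3xSxS float32 tensor.

    Nearest-neighbor resize keeps this dependency-free; a real
    pipeline would use proper interpolation (e.g. bilinear).
    """
    h, w, _ = rgb_uint8.shape
    ys = np.arange(size) * h // size   # source row per output row
    xs = np.arange(size) * w // size   # source col per output col
    resized = rgb_uint8[ys][:, xs].astype(np.float32) / 255.0
    normalized = (resized - MEAN) / STD
    return normalized.transpose(2, 0, 1)[None]  # HWC -> NCHW

# The tensor would then go to onnxruntime, roughly:
#   session = onnxruntime.InferenceSession("dinov2_int8.onnx")  # hypothetical path
#   features = session.run(None, {"input": preprocess(frame)})
```

INT8 quantization is what keeps that inference at ~47ms on CPU; the preprocessing itself is a negligible fraction of the frame budget.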
