
FrankEye (Vision)

FrankEye is Frank's vision system — CPU-only, ~100ms per frame, no cloud APIs.

Pipeline (5 stages)

| Stage | Technology | What It Does | Latency |
|---|---|---|---|
| 1. Scene Classification | Digital Retina v8 CNN (CIFAR-10) | Initial categorization | ~5ms |
| 2. Scene Understanding | DINOv2 (INT8 ONNX) | Self-supervised feature extraction | ~47ms |
| 3. Texture Analysis | GOLPU | Material/surface detection | ~10ms |
| 4. Text Recognition | Tesseract OCR (eng+deu) | Read text in images | ~30ms |
| 5. Face Detection | OpenCV Haar cascades | Detect human faces | ~8ms |
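The five stages above can be sketched as a sequential chain with per-stage timing. The stage functions below are hypothetical stubs standing in for the real implementations, not FrankEye's actual API:

```python
import time

# Hypothetical stand-ins for the real stage implementations.
def classify_scene(image):    return {"scene": "workspace"}     # CNN, ~5ms
def extract_features(image):  return {"features": [0.1, 0.2]}   # DINOv2 INT8, ~47ms
def analyze_texture(image):   return {"material": "wood"}       # GOLPU, ~10ms
def read_text(image):         return {"text": "hello"}          # Tesseract, ~30ms
def detect_faces(image):      return {"faces": 0}               # Haar cascades, ~8ms

STAGES = [classify_scene, extract_features, analyze_texture, read_text, detect_faces]

def run_pipeline(image):
    """Run all five stages sequentially, recording per-stage latency."""
    result, timings = {}, {}
    for stage in STAGES:
        t0 = time.perf_counter()
        result.update(stage(image))
        timings[stage.__name__] = (time.perf_counter() - t0) * 1000  # ms
    result["timings_ms"] = timings
    return result
```

With the real stages plugged in, the recorded latencies would sum to roughly the ~100ms per-frame budget (5 + 47 + 10 + 30 + 8 = 100).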

Three Access Methods

Screenshot Hotkey

Ctrl+Shift+F opens a PyQt5 crosshair overlay. Select any screen region. Frank analyzes it.
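The selection the overlay hands to the analysis pipeline might be represented as a simple region record like this (a hypothetical sketch; FrankEye's actual data structures aren't documented here):

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A screen region selected with the crosshair overlay."""
    x: int   # left edge, pixels
    y: int   # top edge, pixels
    w: int   # width
    h: int   # height

    def crop(self, frame):
        """Cut this region out of a frame given as rows of pixels."""
        return [row[self.x:self.x + self.w]
                for row in frame[self.y:self.y + self.h]]
```

The cropped sub-image, rather than the whole screen, is what would then go through the 5-stage pipeline.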

Chat Drag & Drop

Drop any image file into the Web UI chat. Choose:

  • Describe — full 5-stage pipeline → LLM interpretation
  • Transcribe — Tesseract OCR only (fast, text-only)
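The two modes suggest a small dispatcher along these lines. Everything below is an illustrative stand-in for FrankEye internals, not its real handler:

```python
def ocr_only(image):       return "recognized text"       # Tesseract stage only
def full_pipeline(image):  return {"scene": "workspace"}  # all 5 stages
def llm_interpret(a):      return f"An image of a {a['scene']}."

def handle_image(image, mode):
    """Dispatch a dropped image to the chosen analysis mode.

    'transcribe' runs OCR only (fast); 'describe' runs the full
    5-stage pipeline and hands the result to the LLM.
    """
    if mode == "transcribe":
        return {"mode": "transcribe", "text": ocr_only(image)}
    if mode == "describe":
        return {"mode": "describe", "summary": llm_interpret(full_pipeline(image))}
    raise ValueError(f"unknown mode: {mode}")
```

The point of the split is latency: transcribe skips the ~70ms of non-OCR stages when the user only wants the text.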

Slash Command

/screenshot — captures full screen, runs analysis, Frank describes what he sees.

What Frank Can See

  • Text — error messages, code, documents, web pages (English + German OCR)
  • UI Elements — buttons, windows, menus, application layout
  • Scenes — indoor/outdoor, workspace, nature
  • Objects — common objects from CIFAR-10 categories
  • Faces — presence detection (not identification)
  • Materials — surface textures via GOLPU analysis

Adaptive Vision

tools/frank_adaptive_vision.py provides a two-stage pipeline: Stage 1 runs the fast pipeline (~100ms). If certain escalation triggers fire (complex scene, ambiguous content, important text), Stage 2 calls a VLM (Vision Language Model) for deeper analysis. The VLM is optional and only activated when the fast pipeline isn't confident enough.
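The escalation decision could look roughly like this. The trigger names and thresholds are illustrative assumptions, not the actual logic in tools/frank_adaptive_vision.py:

```python
def should_escalate(fast_result, confidence_threshold=0.6):
    """Decide whether Stage 2 (the optional VLM) is needed.

    Returns (escalate, triggers): escalate is True if any trigger
    fired; triggers lists which ones, for logging.
    """
    triggers = []
    if fast_result.get("scene_confidence", 1.0) < confidence_threshold:
        triggers.append("low_confidence")          # ambiguous content
    if fast_result.get("object_count", 0) > 10:
        triggers.append("complex_scene")
    if len(fast_result.get("ocr_text", "")) > 200:
        triggers.append("important_text")
    return bool(triggers), triggers
```

On the common path no trigger fires and the ~100ms fast result is returned as-is; the VLM cost is paid only on the hard cases.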

No Cloud

Everything runs locally on CPU. No API calls to GPT-4V, Claude Vision, or any cloud service. DINOv2, Meta's self-supervised vision transformer, runs as an INT8-quantized ONNX model at ~47ms per inference.
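Feeding a frame to a ViT ONNX model like this typically means converting it to an NCHW float32 tensor with ImageNet normalization. A minimal sketch, assuming standard DINOv2 preprocessing constants (the model path and input name are hypothetical, not confirmed details of Frank's export):

```python
import numpy as np

# ImageNet normalization constants commonly used with DINOv2.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(rgb_uint8, size=224):
    """Convert an HxWx3 uint8 RGB image to a 1x3xSxS float32 tensor.

    Nearest-neighbor resize keeps this dependency-free; a real
    pipeline would use proper interpolation (e.g. bilinear).
    """
    h, w, _ = rgb_uint8.shape
    ys = np.arange(size) * h // size   # source row per output row
    xs = np.arange(size) * w // size   # source col per output col
    resized = rgb_uint8[ys][:, xs].astype(np.float32) / 255.0
    normalized = (resized - MEAN) / STD
    return normalized.transpose(2, 0, 1)[None]  # HWC -> NCHW

# The tensor would then go to onnxruntime, roughly:
#   session = onnxruntime.InferenceSession("dinov2_int8.onnx")  # hypothetical path
#   features = session.run(None, {"input": preprocess(frame)})
```

INT8 quantization is what keeps that inference at ~47ms on CPU; the preprocessing itself is a negligible fraction of the frame budget.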
