## What it is
A real-time video inference engine that responds to natural-language prompts. Point it at a video stream, ask "is anyone wearing a helmet?" or "is the door open?", and get an answer in ~150ms, running entirely on CPU.
## Why I built it
Most "ask your video" demos require a GPU and a pipeline that costs more to run than the data is worth. The interesting question was: how far can you push CPU-only inference if you compose a fast object-detection backbone with a thin LLM head that only sees the structured detection output, never the raw frames?
## What's inside
- YOLO as the detection backbone; fast enough on CPU when frames are batched.
- OpenCV for stream ingestion and frame sampling (a combined ingestion-and-detection sketch follows this list).
- LLM head that consumes structured detection summaries (not pixels) so prompt-to-answer latency stays bounded.
- ~150ms end-to-end measured on a laptop CPU; numbers vary with stream resolution and prompt complexity.
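To make the ingestion and batching points concrete, here is a rough sketch of the capture loop, again assuming OpenCV and the `ultralytics` API; the stream URL, stride, and batch size are illustrative, not values from the project.

```python
# Sketch: strided frame sampling + batched detection on CPU.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
STRIDE = 5  # keep every 5th frame; illustrative, not a tuned value
BATCH = 8   # frames per detector call; illustrative

cap = cv2.VideoCapture("rtsp://example/stream")  # or a device index like 0
batch, i = [], 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    if i % STRIDE == 0:
        batch.append(frame)
    i += 1
    if len(batch) == BATCH:
        # One batched call amortizes per-invocation overhead, which is
        # where the CPU wins come from.
        results = model(batch)
        batch.clear()
cap.release()
```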