I built kfe, an application that uses models from Hugging Face to index multimedia files by extracting visual features, text from images, audio transcriptions, and more, so that you can search them with arbitrary text queries. Supported search methods include standard lexical BM25, embedding-based, and hybrid approaches. There are also similarity search features, like finding images similar to a selected one. All data stays private and is stored exclusively on your device, with no external uploads.
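For readers unfamiliar with hybrid search, here is a minimal sketch of one common way to merge a lexical ranking with an embedding-based ranking: reciprocal rank fusion. This illustrates the general idea only and is not necessarily how kfe combines results; the function name, the file names, and the constant `k=60` are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of file ids into one list, best first.

    Each list contributes 1 / (k + rank) to a file's score, so files that
    rank well in more than one list rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, file_id in enumerate(ranking, start=1):
            scores[file_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to stay deterministic.
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical rankings for the same query from two retrievers.
lexical = ["cat.jpg", "dog.mp4", "notes.png"]   # e.g. from BM25
semantic = ["dog.mp4", "beach.jpg", "cat.jpg"]  # e.g. from embeddings

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

Rank-based fusion like this avoids having to normalize BM25 scores and embedding similarities onto a common scale, which is one reason it is a popular default.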
A browser-based UI exposes all search options and lets you view and edit generated file metadata (e.g., correct transcriptions or manually describe files for improved retrieval).
The app is cross-platform (tested on an M3 Mac, Ubuntu with CUDA, and a Windows 10 VM) and works with or without a GPU. It uses heavy SOTA (or close to SOTA) open AI models. I made some effort to load only the necessary models into memory lazily and to evict them after some idle time, so that you can keep it running in the background without wasting too many resources. The only external dependency is ffmpeg; the database and search features are built in.
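The lazy loading and idle eviction idea can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the actual kfe code: the `LazyModel` class, the `loader` callable, and the `idle_seconds` parameter are hypothetical names, and a real app would call `evict_if_idle` periodically from a background task.

```python
import threading
import time


class LazyModel:
    """Loads a model on first use and drops it once it has been idle."""

    def __init__(self, loader, idle_seconds=300.0):
        self._loader = loader            # callable that builds the model
        self._idle_seconds = idle_seconds
        self._model = None               # nothing is loaded up front
        self._last_used = 0.0
        self._lock = threading.Lock()

    def get(self):
        """Return the model, loading it only if it is not in memory."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._last_used = time.monotonic()
            return self._model

    def evict_if_idle(self):
        """Drop the model if unused for idle_seconds; return True if dropped."""
        with self._lock:
            idle = time.monotonic() - self._last_used
            if self._model is not None and idle >= self._idle_seconds:
                self._model = None       # release the reference so memory can be freed
                return True
            return False


# Usage with a stand-in loader; a real loader would build a heavy model.
ocr = LazyModel(loader=lambda: "fake-ocr-model", idle_seconds=300.0)
model = ocr.get()              # loaded here, on first use
still_loaded = not ocr.evict_if_idle()  # used just now, so it stays in memory
```

Dropping the reference only frees memory once nothing else holds the model; with GPU-backed models, the framework's cache may also need to be cleared explicitly.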
Here’s a quick demo showing the UI in action: https://www.youtube.com/watch?v=LSe0QB6dzEY
Source code, installation guide, and more details about models and the inner workings of the app can be found here: https://github.com/Fl0k3n/kfe