This is a python application that enables natural Audio/Video communication with computers and robots. It handles:
Video processing and automatic analysis:
* Provides your computer (or robot) with vision and allows you to speak to your computer / robot naturally to ask about what it is seeing (How many fingers am I holding up, how many people are in the scene, etc).
* Semantic vision processing is handled through an image analsis adapter, so that any video analysis model can be used to provide descriptions of the scene based on a user's query.
* Prebuilt adapters currently include: Google Cloud's Vertex AI API and OpenAI's -vision models (currently in preview and may not be available to all users).
Natural two-way voice communication.
* Wake words are configured in config.py to activate the system.
* You can continue conversations without repeating the wake word for a certain period.
* The computer's voice output is adapter based and can use any TTS (Text-to-Speech) system.
* Prebuilt adapters include: pyttsx3 and Google Cloud's text-to-speech.
Natural language processing is handled by OpenAI models.
* We are currently working on turning this into an adapter, so any NL processor can be used.
Conversations are stored in a PostgreSQL database, with associated embeddings. They are passed along in the messages array to maintain conversational context.
The system automatically ingests functions (tools) through a python decorator openai_function to facilitate extended functionality. Current built-in tools provide
* video analysis (as part of the main application) and
* natural language robotic control (as part of an example in the examples/ directory).
* Functions are automatically passed to openai (when appropriate) during the normal course of a conversation and tool responses are automatically handled.
Broadcasting to external systems through the broadcasting adapter.
* If your conversation requires action in the real world (for other computers, robots, or IoT devices to act), the system has a broadcaster to broadcast system intentions across the network.
* The prebuilt adapter provides Websocket support. I think MQTT would be a good fit as well, but it hasn't been implemented yet.
Long running tasks can be handled through the background processor.
* Background tasks are handled through celery.
* Current background tasks include offloading conversation embeddings and storage.
Some additional features that we're working on include:
* Automatic speaker diarization to facilitate voice recognition and conversation part assignments (with video assist).
* Echo Cancellation to avoid feedback processing (best to be used with a headphone for now)
* Similarity querying against the conversation data to allow for all-time conversation recall.
* Better robot <-> system two-way communication and visual action processing.
More ambitious features include
* automatic dataset training generation based on visual and auditory cues to facilitate automatic Pytorch model training against specific individuals. The idea is to allow the computer / robot to learn about new things and people through direct communication.
And obviously, test coverage.
This is just a side project, but I thought other people might find it interesting / useful.