Show HN: ChatClue, Natural audio/video communication with computers and robots

2 points

2 years ago

Currently tested and running on Ubuntu (Debian) 22 for reference.

This is a python application that enables natural Audio/Video communication with computers and robots. It handles:

Video processing and automatic analysis:

* Provides your computer (or robot) with vision and allows you to speak to your computer / robot naturally to ask about what it is seeing (How many fingers am I holding up, how many people are in the scene, etc).

* Semantic vision processing is handled through an image analsis adapter, so that any video analysis model can be used to provide descriptions of the scene based on a user's query.

* Prebuilt adapters currently include: Google Cloud's Vertex AI API and OpenAI's -vision models (currently in preview and may not be available to all users).

Natural two-way voice communication.

* Wake words are configured in config.py to activate the system.

* You can continue conversations without repeating the wake word for a certain period.

* The computer's voice output is adapter based and can use any TTS (Text-to-Speech) system.

* Prebuilt adapters include: pyttsx3 and Google Cloud's text-to-speech.

Natural language processing is handled by OpenAI models.

* We are currently working on turning this into an adapter, so any NL processor can be used.

Conversations are stored in a PostgreSQL database, with associated embeddings. They are passed along in the messages array to maintain conversational context.

The system automatically ingests functions (tools) through a python decorator openai_function to facilitate extended functionality. Current built-in tools provide

* video analysis (as part of the main application) and

* natural language robotic control (as part of an example in the examples/ directory).

* Functions are automatically passed to openai (when appropriate) during the normal course of a conversation and tool responses are automatically handled.

Broadcasting to external systems through the broadcasting adapter.

* If your conversation requires action in the real world (for other computers, robots, or IoT devices to act), the system has a broadcaster to broadcast system intentions across the network.

* The prebuilt adapter provides Websocket support. I think MQTT would be a good fit as well, but it hasn't been implemented yet.

Long running tasks can be handled through the background processor.

* Background tasks are handled through celery.

* Current background tasks include offloading conversation embeddings and storage.

Some additional features that we're working on include:

* Automatic speaker diarization to facilitate voice recognition and conversation part assignments (with video assist).

* Echo Cancellation to avoid feedback processing (best to be used with a headphone for now)

* Similarity querying against the conversation data to allow for all-time conversation recall.

* Better robot <-> system two-way communication and visual action processing.

More ambitious features include

* automatic dataset training generation based on visual and auditory cues to facilitate automatic Pytorch model training against specific individuals. The idea is to allow the computer / robot to learn about new things and people through direct communication.

And obviously, test coverage.

This is just a side project, but I thought other people might find it interesting / useful.