It uses Whisper for transcription and StyleTTS2 for TTS, both running locally.
Here's a little demo video of it in action: https://twitter.com/lxe/status/1745348827983560991
It streams both the LLM output and the TTS audio, which significantly reduces interaction latency compared to something like ChatGPT Voice mode.
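
For anyone curious why streaming helps with latency, here is a rough sketch of one common way to wire it up: buffer the LLM's tokens, and hand each sentence to TTS as soon as it is complete instead of waiting for the whole reply. This is not taken from the project's code. The sentence-splitting rule and the `synthesize` callback are assumptions for illustration, and `synthesize` is a hypothetical stand-in rather than StyleTTS2's real API.

```python
import re
from typing import Callable, Iterable, Iterator

# Assumed boundary rule: '.', '!' or '?' followed by whitespace ends a sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last part
        # may still be growing, so it goes back into the buffer.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left once the LLM stops generating.
    if buffer.strip():
        yield buffer.strip()


def speak_stream(tokens: Iterable[str], synthesize: Callable[[str], None]) -> None:
    """Send each sentence to a (hypothetical) TTS callback as soon as it is ready."""
    for sentence in sentences_from_tokens(tokens):
        synthesize(sentence)


if __name__ == "__main__":
    # Fake token stream standing in for streamed LLM output.
    fake_tokens = ["Hel", "lo the", "re. How", " are", " you? I", " am fine"]
    speak_stream(fake_tokens, lambda s: print("TTS <-", s))
```

With this shape, the first sentence can start playing while the model is still generating the rest, so the perceived delay is roughly the time to the first sentence rather than the time to the full response.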
The goal was to create a fully self-contained locally-running AI chat app, and this is the result.
Enjoy!