At Cerebrium, we have recently built a few demos showing voice AI capabilities (worlds fastest voice agent & realtime RAG agent) but we wanted to push the boundary and see if we could create realistic, human-like situations to train and onboard teams to perform better - recreating real life scenarios!
An example of this is a sales coach for your sales team, an investor pitch or even prep for a notoriously stressful YC interview . To achieve this there were a few difficult problems to solve, namely:
- How do you recreate a human-like video call both physically and/or emotionally (an angry customer, a fast speaker etc)? - How do you steer the conversation to a specific outcome? - How do you do function calling at low latency?
Here's how we solved each of these problems:
How do you recreate a human-like video call both physically and/or emotionally (an angry customer, a fast speaker etc)
- We used Tavus (tavus.io) to create a realistic AI avatar. Tavus allows you to build AI-generated video experiences with an API. They have created very modular API’s whereby you can select an OpenAI compatible endpoint as your LLM as well as any TTS service. You are also able to train my own human replica with a few video clips. - To get across emotion, we used the new emotional control released by Cartesia (cartesia.ai) - a fast, realistic voice API. They allow you to select from a range of emotions (Angry, Sad, Positive etc) as well as different talking speeds. This allows us to convey various emotions in these simulated environments such as angry customers.
How to you steer the conversation to a specific outcome
- This wasn’t the most complex issue - we simply used function calling to steer a conversation in a specific direction based on answers given by a user. Our implementation isn’t bulletproof, but given more time you could implement some robust methods.
Make the above have low latency interactions?
- What's tough about implementing function-calling is latency. If you are using a API that has function-calling capabilities (Mistral, OpenAI etc) the latencies are very high because the process of function calling is: - Make a request to the API with an instruction. - API determines that you need to use a function. You then run your function and get a result. - You send the result to the API (AGAIN) and then get a response which you can show to the user. - Both API calls incur network time and so we saw the average roundtrip response time of 600-800ms for EACH API call with the TTFT fluctuating around ~300ms each time. Therefore you are looking at a minimum TTFT of ~1s when you take into consideration the two requests and your TTS service. - To get around this we implemented Mistral-7B locally on Cerebrium (cerebrium.ai), which has function calling capabilities. Our TTFT was ~80ms in us-east-1 and we never had to go over the network for the second request (since we were calling the model locally). Therefore our TTFT was roughly 150ms for our LLM which put our response time of voice-to-voice responses at roughly ~300-400ms - 3x lower!
Room for improvement
There is definitely some room for improvement in our implementation that you would need to make this production ready for many company use cases, most notably is the issue of user pauses and detecting when a user has finished their response. Models are not sophisticated enough to know if a user is still thinking or formulating their response. Currently when we hear silence, the model starts responding which really stresses you out in the interview use case!
We made all the code available as well as wrote a tutorial of how we got this up and running. We would love feedback or ideas from the community on how to make this application better. More importantly, it would be great if you commit to the GitHub repo so the community can benefit from it.