Demos Available: https://firefly-ai.notion.site/OpenAI-Examples-34975ae263a9496c84e89fb7b1ea25a4?pvs=4
As you can hear in the demos, the model's emotion, rhythm, accent, and timbre match the OpenAI speakers, though there is some degradation in audio quality, which we are working on. To avoid any legal issues, we are unable to release the fine-tuned model, but I believe anyone can tune Fish Speech to this level within hours for around $20.
Our experiment shows that with only 25 seconds of prompts (few-shot learning) and no fine-tuning, the model can mimic most behaviors except for how it reads numbers. To the best of our knowledge, this framework lets you clone how someone speaks in English, Chinese, and Japanese with 30 minutes of data.
Repo: https://github.com/fishaudio/fish-speech