Fish Speech TTS: clone OpenAI TTS in 30 minutes

2 years ago

While we are still working on bringing the agent's emotional responses up to OpenAI GPT-4 level, we have already made significant progress in matching OpenAI's TTS performance. For this experiment, we collected 10 hours of OpenAI TTS data and ran supervised fine-tuning (SFT) on both the LLM and the VITS model, which took approximately 30 minutes. At inference time, we used 15 seconds of audio as a prompt.
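If you want to try the prompting step yourself, you will need a short reference clip. Here is a minimal sketch of cutting a WAV file down to a 15-second prompt using only the Python standard library. It is not part of Fish Speech: `trim_wav`, its parameters, and the file names are hypothetical, and it assumes uncompressed PCM WAV input.

```python
import wave


def trim_wav(src_path, dst_path, max_seconds=15.0):
    """Copy at most max_seconds of audio from src_path to dst_path.

    Hypothetical helper for preparing a prompt clip; not a Fish Speech API.
    Returns the duration in seconds of the audio that was kept.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frame_rate = src.getframerate()
        # Keep the whole file if it is already shorter than the limit.
        n_keep = min(src.getnframes(), int(max_seconds * frame_rate))
        frames = src.readframes(n_keep)

    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width, and sample rate from the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return n_keep / frame_rate
```

For example, `trim_wav("speaker.wav", "prompt.wav")` would write the first 15 seconds of `speaker.wav` to `prompt.wav`. How the clip is then passed to the model depends on the Fish Speech version you use, so check the repo for the current inference instructions.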

Demos Available: https://firefly-ai.notion.site/OpenAI-Examples-34975ae263a9496c84e89fb7b1ea25a4?pvs=4

As you can hear, the model's emotion, rhythm, accent, and timbre match the OpenAI speakers, though there is some degradation in audio quality, which we are working on. To avoid legal issues, we cannot release the fine-tuned model, but we believe anyone can tune Fish Speech to this level within hours for around $20.

Our experiment also shows that with only 25 seconds of prompt audio (few-shot learning) and no fine-tuning, the model can mimic most of a speaker's behaviors, except for how numbers are read. To the best of our knowledge, this framework lets you clone how someone speaks in English, Chinese, or Japanese with 30 minutes of data.

Repo: https://github.com/fishaudio/fish-speech
