Ask HN: GPU Resource Estimation Text to Speech

1 point

3 years ago

Hello all, I have a question about the GPU resource requirements for a training project I am doing with Piper: https://github.com/rhasspy/piper

I am following the training guide and the video by Thorsten Müller:

https://github.com/rhasspy/piper/blob/master/TRAINING.md https://www.youtube.com/watch?v=b_we_jma220

Data: single speaker, 18,000 files, average length 3 seconds, sample rate 22,050 Hz, LJSpeech format

Batch Size 32, Number of Epochs 10000, Precision 32, Quality High

Training is resumed from the Lessac High Quality Voice Checkpoint on Hugging Face.

https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main/en/en_US/lessac/high

When running on a regular CPU, the program exits with out-of-memory (OOM) errors. The free T4 GPU on Google Colab is not always available, and even when it is, a single epoch takes a long time.

I am trying to estimate how many GPUs, and of what type, I should rent on Lambda Labs, and how long one epoch would take on them.
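One rough way to get at the time-per-epoch part is to time a few dozen training steps on a candidate GPU and extrapolate. The sketch below assumes one step is one batch and that the last partial batch is kept. The function name epoch_time_hours is hypothetical, and sec_per_step is a placeholder, not a measured benchmark for any GPU.

```python
import math

def epoch_time_hours(num_samples: int, batch_size: int, sec_per_step: float) -> float:
    """Extrapolate wall-clock hours for one epoch from a measured per-step time."""
    # One step processes one batch; the last partial batch is assumed to be kept.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * sec_per_step / 3600.0

# Dataset size and batch size from the post.
NUM_SAMPLES = 18_000
BATCH_SIZE = 32

# Placeholder, NOT a benchmark: replace with the average seconds per training
# step that you actually measure on the GPU you are considering.
sec_per_step = 1.0

hours = epoch_time_hours(NUM_SAMPLES, BATCH_SIZE, sec_per_step)
print(f"{hours:.3f} hours per epoch at {sec_per_step} s/step")
```

Multiplying the result by the hourly rental price and the number of epochs you plan to run gives a first-order cost estimate. It ignores data loading stalls, validation passes, and checkpoint writes.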

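I have also read that models like Piper and Tacotron need 100K steps to produce a good-quality clone. If one step is one batch, then steps = (number of files / batch size) * number of epochs, so 18,000 files at batch size 32 is about 563 steps per epoch, and 10,000 epochs would be roughly 5.6 million steps, far more than 100K. Any advice there would be appreciated as well, thanks.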
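For reference, here is the step arithmetic under the usual convention that one step is one optimizer update on one batch, so steps per epoch is the dataset size divided by the batch size, rounded up. This is a sketch under that assumption. The step counter a particular trainer logs may be defined differently, and the variable names are only illustrative.

```python
import math

# Values from the post.
num_samples = 18_000     # audio files in the dataset
batch_size = 32
num_epochs = 10_000
target_steps = 100_000   # the "100K steps" rule of thumb

# Assumption: one step = one batch, and the last partial batch is kept.
steps_per_epoch = math.ceil(num_samples / batch_size)

# Total steps if all configured epochs were actually run.
total_steps = steps_per_epoch * num_epochs

# Epochs needed to reach the target step count.
epochs_for_target = math.ceil(target_steps / steps_per_epoch)

print(f"steps per epoch:       {steps_per_epoch}")
print(f"steps in {num_epochs} epochs: {total_steps}")
print(f"epochs for {target_steps} steps: {epochs_for_target}")
```

With these numbers, 100K steps is reached after about 178 epochs, so the 10,000-epoch setting works as an upper bound rather than a target.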