In this project, we combine the DDSP architecture with a domain adaptation technique from speech synthesis [2]. The model is first pre-trained on many different recordings from the Solos dataset [3], and parts of it are then fine-tuned on the new target recording. This lets us build decent-sounding instrument synthesisers from as little as 16 seconds of target audio instead of 6-10 minutes.
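The pre-train-then-fine-tune idea can be illustrated with a deliberately tiny toy sketch. It is not the DDSP model and not the project's training code: every name here (`make_recording`, `pretrain`, `finetune`) is hypothetical, the "model" is a single line `y = shared * x + private`, and the choice of which part is frozen during fine-tuning is an assumption made purely for illustration. The point is only that once the shared part has been learned from many recordings, a handful of samples is enough to fit the small recording-specific part.

```python
import random


def make_recording(offset, n, rng, slope=2.0):
    """Toy 'recording': n pairs (x, y) with y = slope * x + offset (hypothetical data)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, slope * x + offset) for x in xs]


def pretrain(recordings, lr=0.1, steps=500):
    """Learn one shared slope across all recordings, with a private offset per recording."""
    shared = 0.0
    privates = [0.0] * len(recordings)
    for _ in range(steps):
        g_shared = 0.0
        for k, data in enumerate(recordings):
            g_private = 0.0
            for x, y in data:
                err = shared * x + privates[k] - y
                # Shared gradient is averaged over all recordings.
                g_shared += 2 * err * x / (len(data) * len(recordings))
                # Private gradient uses only this recording's samples.
                g_private += 2 * err / len(data)
            privates[k] -= lr * g_private
        shared -= lr * g_shared
    return shared


def finetune(data, shared, lr=0.1, steps=200):
    """Keep the shared slope frozen and fit only a new private offset."""
    private = 0.0
    for _ in range(steps):
        grad = sum(2 * (shared * x + private - y) for x, y in data) / len(data)
        private -= lr * grad
    return private


rng = random.Random(0)
# Pre-training stage: many recordings with plenty of samples each.
corpus = [make_recording(offset, 50, rng) for offset in range(-4, 4)]
shared_slope = pretrain(corpus)

# Adaptation stage: a new recording with only four samples.
target = make_recording(0.5, 4, rng)
target_offset = finetune(target, shared_slope)
print("shared slope:", shared_slope, "adapted offset:", target_offset)
```

In the real system the shared part would be a neural network and the adapted part some subset of its weights; the toy only mirrors the two-stage structure described above.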
[1] https://arxiv.org/abs/2001.04643
[2] https://arxiv.org/abs/1802.06006
[3] https://arxiv.org/abs/2006.07931
We hope to publish a paper on the topic soon.