Audio samples are easy to obtain from their podcast, but manual data labeling is painful for a hobby project. Further, from what I understand, the real difficulty for diarization models is not speaker recognition in general, but speaker recognition when several speakers talk over each other. I am not even sure how best to implement a labeling procedure for segments with overlapping speech.
I started to wonder whether I might bootstrap a decent sample by leveraging TTS vocal cloning models to simulate the five speakers in dialogues with overlapping speech segments. So I ask HN: is this hopelessly naive, or a potentially useful technique? Also, any other advice?
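To make the idea concrete, here is a rough sketch of the mixing step only, assuming the cloned utterances already exist as mono float arrays at a shared sample rate. The function names (`mix_with_overlap`, `to_rttm`), the fixed overlap length, and the peak normalization are illustrative assumptions, not part of any existing tool. The point is that with synthetic dialogue the overlap labels come for free, because the start and end of every utterance are known. The labels are written in RTTM, the plain-text segment format commonly used for diarization references.

```python
import numpy as np

def mix_with_overlap(clips, speakers, overlap_s, sr=16000):
    """Lay clips end to end, starting each one `overlap_s` seconds before
    the previous clip ends. Returns (mixture, segments), where segments
    is a list of (speaker, start_seconds, end_seconds)."""
    overlap = int(overlap_s * sr)
    starts, cursor = [], 0
    for clip in clips:
        start = max(0, cursor - overlap)  # first clip always starts at 0
        starts.append(start)
        cursor = start + len(clip)
    total = max(s + len(c) for s, c in zip(starts, clips))
    mixture = np.zeros(total, dtype=np.float32)
    segments = []
    for start, clip, spk in zip(starts, clips, speakers):
        mixture[start:start + len(clip)] += clip  # overlapping speech sums
        segments.append((spk, start / sr, (start + len(clip)) / sr))
    peak = float(np.max(np.abs(mixture)))
    if peak > 1.0:  # assumption: rescale only when the sum would clip
        mixture /= peak
    return mixture, segments

def to_rttm(segments, file_id):
    """Format segments as RTTM lines (onset and duration in seconds)."""
    return [
        f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} <NA> <NA> {spk} <NA> <NA>"
        for spk, start, end in segments
    ]

# Stand-in audio: two 1-second constant signals instead of real TTS output.
clip_a = np.full(16000, 0.5, dtype=np.float32)
clip_b = np.full(16000, 0.5, dtype=np.float32)
mixture, segments = mix_with_overlap([clip_a, clip_b], ["A", "B"], overlap_s=0.25)
rttm_lines = to_rttm(segments, "demo")
```

A fixed overlap is the simplest possible choice. Real conversation has variable overlaps, interruptions, and backchannels, so in practice the overlap would probably need to be sampled per turn, and the clean sum ignores room acoustics and microphone differences.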
[1] https://www.3d6downtheline.com/
[2] https://github.com/MahmoudAshraf97/whisper-diarization/