Built this bot on top of the RawNet3 speech embedding model (github.com/jungjee/rawnet).
Used a really tiny dataset (100k clips labeled as real or generated, each about 3 sec long). From my testing, it does really well on clean, single-speaker audio. Results are mixed on noisy content from Twitter. Still wanted to ship something usable tho.
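Rough sketch of how a model trained on ~3 sec clips could score a longer recording: cut it into 3 sec windows, score each one, and average. This is not the bot's actual code. The 16 kHz sample rate, the zero-padding of the last window, the averaging, and the `score_window` callable (a stand-in for embedding model plus real/generated classifier) are all assumptions for illustration.

```python
SAMPLE_RATE = 16_000  # assumption: the post doesn't state a sample rate
WINDOW_SEC = 3.0      # matches the ~3 sec clip length mentioned above


def split_into_windows(samples, sample_rate=SAMPLE_RATE, window_sec=WINDOW_SEC):
    """Cut a sequence of samples into non-overlapping fixed-length windows.

    The last window is zero-padded so that no audio is dropped.
    """
    win = int(round(sample_rate * window_sec))
    windows = []
    for start in range(0, len(samples), win):
        chunk = list(samples[start:start + win])
        chunk += [0.0] * (win - len(chunk))  # pad a trailing partial window
        windows.append(chunk)
    return windows


def score_recording(samples, score_window,
                    sample_rate=SAMPLE_RATE, window_sec=WINDOW_SEC):
    """Average per-window 'generated' scores over a whole recording.

    score_window is a hypothetical callable: list of samples -> float in [0, 1].
    """
    windows = split_into_windows(samples, sample_rate, window_sec)
    if not windows:
        raise ValueError("empty recording")
    scores = [score_window(w) for w in windows]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Dummy scorer for demonstration only; a real one would run the model.
    def dummy_scorer(window):
        return 0.5

    ten_seconds_of_silence = [0.0] * (10 * SAMPLE_RATE)
    print(len(split_into_windows(ten_seconds_of_silence)))
    print(score_recording(ten_seconds_of_silence, dummy_scorer))
```

Plain averaging is the simplest option. For noisy clips, taking the max or a median over windows might behave differently, so that choice is worth testing.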
I want as much feedback as possible, and I'd like it to be as harsh as it needs to be :) lmk if it's shit lol