There are also open-source toolkits you can experiment with:
https://github.com/meta-llama/synthetic-data-kit https://github.com/bespokelabsai/curator
But it still feels very research-oriented. I haven’t found many examples of these pipelines running in real-world products.
I’m curious:
1. Who is using synthetic-data pipelines in production today?
2. What tasks does it actually improve. E.g. fine-tuning smaller models for specific tasks?
Any real-world stories, pointers, or further reading would be hugely appreciated. Thanks!