So I built datafast, an open-source library for synthetic text datasets generation.
Right now it supports 5 datasets types:
- Text Classification Dataset - Raw Text Generation Dataset - Instruction Dataset (Ultrachat-like) - Multiple Choice Question (MCQ) Dataset - Preference Dataset
And more to come.
Currently supported LLM providers for generation are: - OpenAI - Anthropic - Google Gemini - Ollama (local LLM server)
There is more to come but I am not in a rush for features. I seek data quality, data diversity and reliability over quantity. I don't measure success by shipping more features: I succeed if it works when you try it out, and if you actually use it.
Hope you like that!