Show HN: I made an open-source synthetic text dataset generator

2 points

a year ago

Many LLM projects suffer from a lack of custom datasets:
- No labelled data at all
- Existing data lacks coverage and diversity
- Data collection and annotation processes are slow and boring
- Not enough examples to fine-tune or evaluate LLMs…

So I built datafast, an open-source library for generating synthetic text datasets.

Right now it supports 5 dataset types:

- Text Classification Dataset
- Raw Text Generation Dataset
- Instruction Dataset (Ultrachat-like)
- Multiple Choice Question (MCQ) Dataset
- Preference Dataset
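
To make the first dataset type concrete, here is a rough sketch of what generating a small text classification dataset can look like. This is not datafast's API: `fake_llm`, `generate_classification_rows`, the prompt wording and the `style` field are all hypothetical names made up for illustration, and the LLM call is replaced by a stub so the snippet runs offline.

```python
import json
import random
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Stand-in for a real provider call (hypothetical); returns canned text."""
    return f"Example text for: {prompt}"


def generate_classification_rows(
    labels: list[str],
    n_per_label: int,
    llm: Callable[[str], str],
    seed: int = 0,
) -> list[dict]:
    """Build labelled rows by prompting once per (label, sample) pair."""
    rng = random.Random(seed)
    # Varying the writing style per prompt is one simple way to add diversity.
    styles = ["formal", "casual", "terse"]
    rows = []
    for label in labels:
        for _ in range(n_per_label):
            style = rng.choice(styles)
            prompt = f"Write a {style} customer review with sentiment '{label}'."
            rows.append({"text": llm(prompt), "label": label, "style": style})
    return rows


rows = generate_classification_rows(["positive", "negative"], 2, fake_llm)
for row in rows:
    # One JSON object per line (JSONL) is a common format for such datasets.
    print(json.dumps(row))
```

Swapping `fake_llm` for a function that calls a real provider would be the only change needed to produce actual text; the label comes from the prompt, so no manual annotation step is involved.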

And more to come.

Currently supported LLM providers for generation are:
- OpenAI
- Anthropic
- Google Gemini
- Ollama (local LLM server)

There is more to come, but I am not in a rush for features. I value data quality, data diversity and reliability over quantity. I don't measure success by shipping more features: I succeed if it works when you try it out, and if you actually use it.

Hope you like it!