Here’s a demo video: https://www.loom.com/share/4ad30bf1053e46a3846fc5a07495c486
We started working on the auto prompt optimizer because of our own frustration with developing, iterating, and maintaining prompts across different use cases and models. A minor update to the underlying LLM, a change in user requirements, or a shift in application infrastructure can render a carefully crafted prompt useless. As one user put it, “Prompt engineering is not software engineering; it’s wishful thinking.”
We tried prompt optimization tools like DSPy and TextGrad, but realized they require you to adopt new frameworks, craft custom metrics from scratch, and offer limited visibility into the optimization process (or even the final optimized prompt). This lack of transparency left us guessing whether the new prompts were genuinely better or just different.
Our Auto Prompt Optimizer aims to be an easy-to-use yet robust alternative, with maximum visibility into the optimization process and final results. It takes two inputs: a dataset with inputs and expected outputs for a given LLM task, and a target metric (we have 30+ out-of-the-box metrics). The optimizer then starts from your initial prompt and uses the dataset to align the LLM output with your desired outcomes. It does this iteratively, mutating the prompt based on feedback from the target metric. The optimizer automatically selects examples from the dataset to create few-shot prompts and bakes in common techniques such as chain of thought when appropriate.
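To make the loop concrete, here is a toy sketch of a dataset-driven, metric-guided prompt search in Python. It illustrates the general idea only and is not the actual optimizer implementation: every name in it (`optimize`, `score_prompt`, `toy_llm`, `toy_mutate`, `exact_match`) is hypothetical, the "LLM" is a deterministic stub so the code runs offline, and it leaves out few-shot example selection and chain of thought.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, expected output)


def exact_match(predicted: str, expected: str) -> float:
    """Hypothetical target metric: 1.0 on an exact (case-sensitive) match."""
    return 1.0 if predicted.strip() == expected.strip() else 0.0


def score_prompt(prompt: str, dataset: List[Example],
                 run_llm: Callable[[str, str], str],
                 metric: Callable[[str, str], float]) -> float:
    """Average metric score of a prompt over the whole dataset."""
    scores = [metric(run_llm(prompt, x), y) for x, y in dataset]
    return sum(scores) / len(scores)


def optimize(initial_prompt: str, dataset: List[Example],
             run_llm: Callable[[str, str], str],
             mutate: Callable[[str, List[Example]], str],
             metric: Callable[[str, str], float] = exact_match,
             rounds: int = 5):
    """Iteratively mutate the prompt, keeping a candidate only if it scores higher."""
    best_prompt = initial_prompt
    best_score = score_prompt(best_prompt, dataset, run_llm, metric)
    history = [(best_prompt, best_score)]  # every version tried, for visibility
    for _ in range(rounds):
        # Feedback for the mutation step: examples the current best prompt gets wrong.
        failures = [(x, y) for x, y in dataset
                    if metric(run_llm(best_prompt, x), y) < 1.0]
        if not failures:
            break
        candidate = mutate(best_prompt, failures)
        candidate_score = score_prompt(candidate, dataset, run_llm, metric)
        history.append((candidate, candidate_score))
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score, history


def toy_llm(prompt: str, text: str) -> str:
    """Stand-in for a real model call: shouts unless the prompt asks for lowercase."""
    label = "positive" if "good" in text else "negative"
    return label if "lowercase" in prompt else label.upper()


def toy_mutate(prompt: str, failures: List[Example]) -> str:
    """Stand-in for a feedback-driven rewrite; a real one would inspect the failures."""
    return prompt + " Answer in lowercase."


dataset = [("good movie", "positive"), ("bad movie", "negative")]
best_prompt, best_score, history = optimize(
    "Classify the sentiment.", dataset, toy_llm, toy_mutate)
for version, (prompt, score) in enumerate(history):
    print(version, score, prompt)
```

Keeping the full `history` of prompt versions and scores is what lets you check whether a new prompt is genuinely better rather than just different. In a real setup, `toy_llm` and `toy_mutate` would be replaced by actual model calls.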
Here are two examples of the results, each showing the initial prompt, every new version of the prompt, and its performance on the target metric:
- Drug Review Prompt: https://app.relari.ai/demo/prompt/drug-review (a non-standard task where the optimizer created sophisticated instructions with detailed rating rubric and corner case handling)
- Summarization Prompt: https://app.relari.ai/demo/prompt/cnn-highlights (a simple task where the optimizer added more straightforward instructions on styling)
We see the prompt optimizer as a lightweight and practical alternative for adapting LLMs for domain-specific tasks. It can deliver high-quality prompts with as few as 100 data points.
Try it yourself (https://app.relari.ai/). You can upload your dataset or generate a simple synthetic dataset to start the optimization process. We recommend using a dataset with at least 30 samples. The optimization process can take up to an hour depending on the size of the dataset and metrics, so we ask you to create an account. That lets us keep track of each optimization run and email you once it’s completed.
What’s next? We’re currently working on support for more advanced features like prompt chaining and agent tool call use cases. For power users, we offer custom metrics and multi-objective optimization to address the most complex use cases.
What’s been your biggest challenge with prompt engineering? Would a dataset-driven approach improve your prompt workflow? We’d love to hear your thoughts and feedback on our approach.