Show HN: Lone Arena – Self-hosted LLM human evaluation, you be the judge

Heykuki News

1 point

2 years ago

You need to evaluate a few fine-tuned LLM checkpoints. None of the existing benchmark suites fits your domain task, and your content can't be reviewed by a third party (e.g. GPT-4). Human evaluation seems to be the most viable option... Well, maybe let's start with this question: Which of the two responses is better?
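
To make the "which of the two responses is better?" idea concrete, here is a minimal sketch of how pairwise human judgments could be tallied into per-checkpoint win rates. This is not Lone Arena's actual code or data format: the `win_rates` function, the `(model_a, model_b, winner)` tuple layout, and the checkpoint names are all assumptions made for illustration.

```python
from collections import Counter


def win_rates(judgments):
    """Turn pairwise judgments into a win rate per checkpoint.

    `judgments` is an iterable of (model_a, model_b, winner) tuples,
    where `winner` must be either model_a or model_b.
    This tuple layout is a hypothetical format, not Lone Arena's.
    """
    wins = Counter()   # comparisons each checkpoint won
    games = Counter()  # comparisons each checkpoint took part in
    for model_a, model_b, winner in judgments:
        if winner not in (model_a, model_b):
            raise ValueError(f"winner {winner!r} is not one of the compared models")
        games[model_a] += 1
        games[model_b] += 1
        wins[winner] += 1
    return {model: wins[model] / games[model] for model in games}


# Hypothetical judgments from a single human judge.
judgments = [
    ("ckpt-1", "ckpt-2", "ckpt-1"),
    ("ckpt-1", "ckpt-3", "ckpt-3"),
    ("ckpt-2", "ckpt-3", "ckpt-3"),
    ("ckpt-1", "ckpt-2", "ckpt-1"),
]
rates = win_rates(judgments)
for model in sorted(rates, key=rates.get, reverse=True):
    print(f"{model}: {rates[model]:.2f}")
```

A plain win rate is the simplest possible summary; with few judgments or unevenly matched pairs, a rating scheme such as Elo or Bradley-Terry would likely give a fairer ranking.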
