You need to evaluate a few fine-tuned LLM checkpoints.
None of the existing benchmark suites fit your domain task,
and your content can't be reviewed by a third party (e.g. GPT-4).
Human evaluation seems to be the most viable option...
Well, maybe let’s start from this question:
Which of the two responses is better?
Show HN: Lone Arena – Self-hosted LLM human evaluation, you be the judge