Matt here. We've been building retrieval pipelines for LLMs and, like many others, kept wondering how changes to our pipeline (e.g. chunking, cleaning) would affect the overall outcome.
We also faced the problem of what data to evaluate against. MTEB datasets are used in the literature, but we found it difficult to use them with our existing processing pipelines. We also didn't want to label a dataset by hand, because it's hard to hand-label one that is representative.
Retri-evals is our attempt to solve these problems. We pulled out the MTEB abstractions that let us evaluate against open-source datasets, and we're going to open source the code we use to automatically generate evaluation datasets from production data.
I'd love to hear your thoughts! We're looking to complement existing solutions in this space with tooling that makes it easier to get to production.