Heykuki News
391.
Show HN: We Evaluate Medical Research Agent Skills
(github.com/aipoch)
2 points
The_resa
2 months ago
392.
Tax Logic Evaluation with Prolog
(github.com/mthom)
2 points
triska
2 months ago
393.
Show HN: Aludel – LLM eval workbench for Phoenix apps
(github.com/ccarvalho-eng)
2 points
wood-archer
2 months ago
394.
Show HN: A tool to create and evaluate document processing pipelines for RAG
(ragbandit.com)
2 points
martimchaves
2 months ago
395.
I built a local-only eval runner for AI agents (quickbench)
(github.com/iamGodofall)
2 points
Godofall
3 months ago
396.
LLM evals test outputs. Rarely whether the model understood first
(github.com/NoxionAI)
2 points
noxion
3 months ago
397.
Dynamic E2E Agentic Simulation and Evaluation with Cypress
(github.com/gojiplus)
2 points
neehao
3 months ago
398.
TLAi+ Benchmarks for Evaluating LLMs
(github.com/tlaplus)
2 points
alhazrod
3 months ago
399.
Edge – Generate structured evaluation criteria for any domain using a local LLM
(github.com/EviAmarates)
2 points
TiagoSantos
3 months ago
400.
Engine-Bench: Evaluating Coding Agents on Writing Game Engine Code
(github.com/JoshuaPurtell)
2 points
JoshPurtell
4 months ago
401.
Show HN: Simboba – Evals in under 5 mins
(github.com/ntkris)
2 points
ntkris
5 months ago
402.
Show HN: Dokimos – LLM Evaluation Framework for Java
(github.com/dokimos-dev)
2 points
fkapsahili
5 months ago
403.
Chess LLM Benchmark: Evaluating LLMs' ability to play chess
(github.com/lightnesscaster)
2 points
dwohnitmok
6 months ago
404.
Show HN: AI PM Evaluation Framework (Open Source)
(aipmframework.com)
2 points
abediaz
7 months ago
405.
Codegen Scorer – evaluate the quality of code generated by LLMs
(github.com/angular)
2 points
martypitt
9 months ago
406.
Physical_Atari: Platform for evaluating RL algorithms on a physical Atari
(github.com/Keen-Technologies)
2 points
simonpure
9 months ago
407.
OpenBench: Provider-agnostic, open-source evaluation infrastructure for LLMs
(github.com/groq)
2 points
gmays
10 months ago
408.
Show HN: KARMA – An evaluation framework for Medical AI systems
(karma.eka.care)
2 points
k2so
10 months ago
409.
LLM Speedrunner: Eval for frontier models to reproduce scientific findings
(github.com/facebookresearch)
2 points
zerojames
a year ago
410.
MAIR: A Benchmark for Evaluating Instructed Retrieval
(github.com/sunnweiwei)
2 points
fzliu
a year ago
411.
Doyensec – Security Policy Evaluation Framework
(github.com/gravitational)
2 points
tony-ds
a year ago
412.
Evaluate Any Model from the HuggingFace Hub on ImageNet on Free Colab GPUs
(github.com/SauravMaheshkar)
2 points
sauravmaheshkar
a year ago
413.
Lambda calculus - compiler, type inference, and evaluator in less than 100 LOC
(gist.github.com)
2 points
tearflake
a year ago
414.
Show HN: I built an open-source benchmark that evaluates LLMs through gameplay
(llmshowdown.io)
2 points
jmogi
a year ago
415.
Show HN: GenderBench – Evaluation suite for gender biases in LLMs
(genderbench.readthedocs.io)
2 points
matus-pikuliak
a year ago
416.
SIMD library for evaluating elementary functions, vectorized libm and DFT
(github.com/shibatch)
2 points
ashvardanian
2 years ago
417.
Show HN: Mandoline – Custom LLM Evaluations for Real-World Use Cases
(mandoline.ai)
2 points
kmckiern
2 years ago
418.
UpTrain is an open-source unified platform to evaluate and improve Gen AI apps
(github.com/uptrain-ai)
2 points
mafro
2 years ago
419.
Optimal Evaluation in 1 Minute (or 10 Minutes) (or 10 Years)
(gist.github.com)
2 points
LightMachine
2 years ago
420.
Evaluating LLMs locally, on a laptop, with Llama 3 and Ollama
(github.com/rasbt)
2 points
rasbt
2 years ago