Heykuki News

Top | New | Best | Ask | Show | Jobs
391.
Show HN: We Evaluate Medical Research Agent Skills (github.com/aipoch)
2 points
The_resa
2 months ago
discuss
392.
Tax Logic Evaluation with Prolog (github.com/mthom)
2 points
triska
2 months ago
discuss
393.
Show HN: Aludel – LLM eval workbench for Phoenix apps (github.com/ccarvalho-eng)
2 points
wood-archer
2 months ago
discuss
394.
Show HN: A tool to create and evaluate document processing pipelines for RAG (ragbandit.com)
2 points
martimchaves
2 months ago
discuss
395.
I built a local-only eval runner for AI agents (quickbench) (github.com/iamGodofall)
2 points
Godofall
3 months ago
discuss
396.
LLM evals test outputs. Rarely whether the model understood first (github.com/NoxionAI)
2 points
noxion
3 months ago
discuss
397.
Dynamic E2E Agentic Simulation and Evaluation with Cypress (github.com/gojiplus)
2 points
neehao
3 months ago
discuss
398.
TLAi+ Benchmarks for Evaluating LLMs (github.com/tlaplus)
2 points
alhazrod
3 months ago
discuss
399.
Edge – Generate structured evaluation criteria for any domain using a local LLM (github.com/EviAmarates)
2 points
TiagoSantos
3 months ago
discuss
400.
Engine-Bench: Evaluating Coding Agents on Writing Game Engine Code (github.com/JoshuaPurtell)
2 points
JoshPurtell
4 months ago
discuss
401.
Show HN: Simboba – Evals in under 5 mins (github.com/ntkris)
2 points
ntkris
5 months ago
discuss
402.
Show HN: Dokimos – LLM Evaluation Framework for Java (github.com/dokimos-dev)
2 points
fkapsahili
5 months ago
discuss
403.
Chess LLM Benchmark: Evaluating LLMs' ability to play chess (github.com/lightnesscaster)
2 points
dwohnitmok
6 months ago
discuss
404.
Show HN: AI PM Evaluation Framework (Open Source) (aipmframework.com)
2 points
abediaz
7 months ago
discuss
405.
Codegen Scorer – evaluate the quality of code generated by LLMs (github.com/angular)
2 points
martypitt
9 months ago
discuss
406.
Physical_Atari: Platform for evaluating RL algorithms on a physical Atari (github.com/Keen-Technologies)
2 points
simonpure
9 months ago
discuss
407.
OpenBench: Provider-agnostic, open-source evaluation infrastructure for LLMs (github.com/groq)
2 points
gmays
10 months ago
discuss
408.
Show HN: KARMA – An evaluation framework for Medical AI systems (karma.eka.care)
2 points
k2so
10 months ago
discuss
409.
LLM Speedrunner: Eval for frontier models to reproduce scientific findings (github.com/facebookresearch)
2 points
zerojames
a year ago
discuss
410.
MAIR: A Benchmark for Evaluating Instructed Retrieval (github.com/sunnweiwei)
2 points
fzliu
a year ago
discuss
411.
Doyensec – Security Policy Evaluation Framework (github.com/gravitational)
2 points
tony-ds
a year ago
discuss
412.
Evaluate Any Model from the HuggingFace Hub on the ImageNet on Free Colab GPUs (github.com/SauravMaheshkar)
2 points
sauravmaheshkar
a year ago
discuss
413.
Lambda calculus - compiler, type inference, and evaluator in less than 100 LOC (gist.github.com)
2 points
tearflake
a year ago
discuss
414.
Show HN: I built an open-source benchmark that evaluates LLMs through gameplay (llmshowdown.io)
2 points
jmogi
a year ago
discuss
415.
Show HN: GenderBench – Evaluation suite for gender biases in LLMs (genderbench.readthedocs.io)
2 points
matus-pikuliak
a year ago
discuss
416.
SIMD library for evaluating elementary functions, vectorized libm and DFT (github.com/shibatch)
2 points
ashvardanian
2 years ago
discuss
417.
Show HN: Mandoline – Custom LLM Evaluations for Real-World Use Cases (mandoline.ai)
2 points
kmckiern
2 years ago
discuss
418.
UpTrain is an open-source unified platform to evaluate and improve Gen AI apps (github.com/uptrain-ai)
2 points
mafro
2 years ago
discuss
419.
Optimal Evaluation in 1 Minute (or 10 Minutes) (or 10 Years) (gist.github.com)
2 points
LightMachine
2 years ago
discuss
420.
Evaluating LLMs locally, on a laptop, with Llama 3 and Ollama (github.com/rasbt)
2 points
rasbt
2 years ago
discuss