Heykuki News
Top
New
Best
Ask
Show
Jobs
511.
Show HN: Orangensaft – A mini Python-like language with LLM eval in lang runtime
(github.com/jargnar)
1 point
jargnar
4 months ago
512.
Show HN: Praetorian Guard – Free AI tool to self-evaluate your CV (educational)
(github.com/simonesan-afk)
1 point
saimonsan
4 months ago
513.
MiRAGE: Open-source framework for multimodal RAG evaluation
1 point
mmhetric
4 months ago
514.
The Vocabulary Priming Confound in LLM Evaluation [pdf]
(github.com/Palmerschallon)
1 point
palmerschallon
4 months ago
515.
Open source agents to evaluate, debug, and optimize your prompts
(github.com/comet-ml)
1 point
ChefboyOG
5 months ago
516.
Simboba: Evals for your AI product in under 5 mins
(github.com/ntkris)
1 point
handfuloflight
5 months ago
517.
Live-trade-bench: Live evaluation of trading agents
(github.com/ulab-uiuc)
1 point
simonpure
5 months ago
518.
Show HN: Dokimos – LLM evaluation framework for Java
(github.com/dokimos-dev)
1 point
fkapsahili
5 months ago
519.
Benchmark that evaluates LLMs using 759 NYT Connections puzzles
(github.com/lechmazur)
1 point
ShrugLife
6 months ago
520.
Show HN: smallevals – Local LLM Evaluation Framework with Tiny 0.6B Models
(github.com/mburaksayici)
1 point
mburaksayici
6 months ago
521.
Open source LLM prompt eval and optimization CLI
(github.com/davismartens)
1 point
davismartens
6 months ago
522.
Show HN: StructEval - a structured output evaluation and comparison tool
(github.com/jhiker)
1 point
jwesleyharding
7 months ago
523.
Rogue – The AI Agent Evaluator
(github.com/qualifire-dev)
1 point
maxloh
7 months ago
524.
Show HN: Local RAG Eval Harness – reproducible benchmarks for retrieval pipelines
1 point
myroslavmokhamm
8 months ago
525.
TinyExpr: Parser, compiler, and evaluation engine for math expressions
(github.com/codeplea)
1 point
gregsadetsky
8 months ago
526.
Benchmark code for evaluating different ASR packages and APIs
(github.com/huggingface)
1 point
pinter69
9 months ago
527.
Show HN: PromptDev – Prompt eval and testing for AI agents across providers
(github.com/artefactop)
1 point
sabatesduran
9 months ago
528.
numexpr: fast numerical array expression evaluator for Python
(github.com/pydata)
1 point
cl3misch
10 months ago
529.
Quality and Safety Evaluations for AI Agents on Azure
(github.com/aymenfurter)
1 point
jacksensi
10 months ago
530.
Show HN: Hypersigil – Prompt management UI – test, evaluate, deploy
(github.com/hypersigilhq)
1 point
piterrro
10 months ago
531.
Safe-MCP: Security Analysis Framework for Evaluation of Model Context Protocol
(github.com/fkautz)
1 point
mooreds
10 months ago
532.
RawBench: A minimal prompt evaluation framework
(github.com/0xsomesh)
1 point
handfuloflight
a year ago
533.
Assayer: Python-RQ watchdog for ML model checkpoint monitoring and evaluation
(github.com/amoudgl)
1 point
amoudgl
a year ago
534.
Show HN: Digit-Class Prime Product Framework (Prime Factorization Evals for LMs)
(github.com/arthurcolle)
1 point
arthurcolle
a year ago
535.
E2E LLM evals, with less focus on metrics and more focus on binary assertions
(github.com/openchatai)
1 point
gharbat
a year ago
536.
Ask HN: What RAG evaluations do you care about?
1 point
ArnavAgrawal03
a year ago
537.
NoLiMa: Long-Context Evaluation Beyond Literal Matching
(github.com/adobe-research)
1 point
llm_nerd
a year ago
538.
Evaluating and Training Multi-Modal Large Language Models for Action Recognition
(github.com/AdaptiveMotorControlLab)
1 point
moatmoat
a year ago
539.
An Implementation of Eval() for Rust
(github.com/evcxr)
1 point
jcbhmr
a year ago
540.
I built a Python pipeline to evaluate the Exosome Complex in AlphaFold & CombFold
(github.com/christopheragnus)
1 point
christopher8827
2 years ago