Heykuki News

511.
Show HN: Orangensaft – A mini Python-like language with LLM eval in lang runtime (github.com/jargnar)
1 point
jargnar
4 months ago
discuss
512.
Show HN: Praetorian Guard – Free AI tool to self-evaluate your CV (educational) (github.com/simonesan-afk)
1 point
saimonsan
4 months ago
discuss
513.
MiRAGE: Open-source framework for multimodal RAG evaluation
1 point
mmhetric
4 months ago
discuss
514.
The Vocabulary Priming Confound in LLM Evaluation [pdf] (github.com/Palmerschallon)
1 point
palmerschallon
4 months ago
discuss
515.
Open source agents to evaluate, debug, and optimize your prompts (github.com/comet-ml)
1 point
ChefboyOG
5 months ago
discuss
516.
Simboba: Evals for your AI product in under 5 mins (github.com/ntkris)
1 point
handfuloflight
5 months ago
discuss
517.
Live-trade-bench: Live evaluation of trading agents (github.com/ulab-uiuc)
1 point
simonpure
5 months ago
discuss
518.
Show HN: Dokimos – LLM evaluation framework for Java (github.com/dokimos-dev)
1 point
fkapsahili
5 months ago
discuss
519.
Benchmark that evaluates LLMs using 759 NYT Connections puzzles (github.com/lechmazur)
1 point
ShrugLife
6 months ago
discuss
520.
Show HN: smallevals – Local LLM Evaluation Framework with Tiny 0.6B Models (github.com/mburaksayici)
1 point
mburaksayici
6 months ago
discuss
521.
Open source LLM prompt eval and optimization CLI (github.com/davismartens)
1 point
davismartens
6 months ago
discuss
522.
Show HN: StructEval - a structured output evaluation and comparison tool (github.com/jhiker)
1 point
jwesleyharding
7 months ago
discuss
523.
Rogue – The AI Agent Evaluator (github.com/qualifire-dev)
1 point
maxloh
7 months ago
discuss
524.
Show HN: Local RAG Eval Harness – reproducible benchmarks for retrieval pipelines
1 point
myroslavmokhamm
8 months ago
discuss
525.
TinyExpr: Parser, compiler, and evaluation engine for math expressions (github.com/codeplea)
1 point
gregsadetsky
8 months ago
discuss
526.
Benchmark code for evaluating different ASR packages and APIs (github.com/huggingface)
1 point
pinter69
9 months ago
discuss
527.
Show HN: PromptDev – Prompt eval and testing for AI agents across providers (github.com/artefactop)
1 point
sabatesduran
9 months ago
discuss
528.
numexpr: fast numerical array expression evaluator for Python (github.com/pydata)
1 point
cl3misch
10 months ago
discuss
529.
Quality and Safety Evaluations for AI Agents on Azure (github.com/aymenfurter)
1 point
jacksensi
10 months ago
discuss
530.
Show HN: Hypersigil – Prompt management UI – test, evaluate, deploy (github.com/hypersigilhq)
1 point
piterrro
10 months ago
discuss
531.
Safe-MCP: Security Analysis Framework for Evaluation of Model Context Protocol (github.com/fkautz)
1 point
mooreds
10 months ago
discuss
532.
RawBench: A minimal prompt evaluation framework (github.com/0xsomesh)
1 point
handfuloflight
a year ago
discuss
533.
Assayer: Python-RQ watchdog for ML model checkpoint monitoring and evaluation (github.com/amoudgl)
1 point
amoudgl
a year ago
discuss
534.
Show HN: Digit-Class Prime Product Framework (Prime Factorization Evals for LMs) (github.com/arthurcolle)
1 point
arthurcolle
a year ago
discuss
535.
E2E LLM evals, with less focus on metrics and more focus on binary assertions (github.com/openchatai)
1 point
gharbat
a year ago
discuss
536.
Ask HN: What RAG evaluations do you care about?
1 point
ArnavAgrawal03
a year ago
discuss
537.
NoLiMa: Long-Context Evaluation Beyond Literal Matching (github.com/adobe-research)
1 point
llm_nerd
a year ago
discuss
538.
Evaluating and Training Multi-Modal Large Language Models for Action Recognition (github.com/AdaptiveMotorControlLab)
1 point
moatmoat
a year ago
discuss
539.
An Implementation of Eval() for Rust (github.com/evcxr)
1 point
jcbhmr
a year ago
discuss
540.
I built a Python pipeline to evaluate the Exosome Complex in AlphaFold & CombFold (github.com/christopheragnus)
1 point
christopher8827
2 years ago
discuss