Heykuki News
Top
New
Best
Ask
Show
Jobs
91.
▲
Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps
106 points
antonap
2 years ago
15 comments
92.
▲
Show HN: Web-eval-agent – Let the coding agent debug itself
(github.com/Operative-Sh)
84 points
neversettles
a year ago
12 comments
93.
▲
Show HN: Ellipsis – Automatic pull request reviews
(ellipsis.dev)
18 points
hunterbrooks
2 years ago
11 comments
94.
▲
Show HN: Honcho – Open-source memory infrastructure, powered by custom models
(github.com/plastic-labs)
8 points
vvoruganti
4 months ago
discuss
95.
▲
Bad MCP design costs your agent 5x more tokens
6 points
JohnnyZhang483
19 hours ago
discuss
96.
▲
Show HN: Agent Tinman – Autonomous failure discovery for LLM systems
(github.com/oliveskin)
4 points
oliveskin
4 months ago
discuss
97.
▲
Show HN: Open Operator Evals – real-world benchmarks for LLM web agents
(github.com/nottelabs)
3 points
monoid73
a year ago
1 comment
98.
▲
Show HN: PromptProof – CI gate for LLM outputs (schema/regex/cost; no API keys)
(news.ycombinator.com)
2 points
geminimir
9 months ago
discuss
99.
▲
Show HN: I made web agents reliable with smaller LLMs via natural language
(github.com/nottelabs)
2 points
giordanol
a year ago
discuss
100.
▲
Deprecating A/B tests with offline policy evaluation
1 point
econti
5 years ago
discuss
101.
▲
Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs
(github.com/darkrishabh)
79 points
darkrishabh
a month ago
37 comments
102.
▲
Show HN: Continuous-eval – Granular evaluation of GenAI pipelines
(github.com/relari-ai)
10 points
antonap
2 years ago
2 comments
103.
▲
Show HN: I designed a ChatGPT prompt evaluator to ruin your fun;)
(github.com/alignedai)
8 points
buildaligned
3 years ago
1 comment
104.
▲
Show HN: Image Eval – An evaluation toolkit for image generation models
(github.com/Storia-AI)
7 points
nutellalover
3 years ago
discuss
105.
▲
Open RAG Eval
(github.com/vectara)
6 points
TastyLamps
a year ago
1 comment
106.
▲
In a sample of >1000 games, GPT-3.5-turbo-instruct plays chess with ~1800 elo
(github.com/adamkarvonen)
4 points
sebzim4500
3 years ago
4 comments
107.
▲
Show HN: Eval.js – a JavaScript interpreter written in JavaScript
(github.com/marten-de-vries)
4 points
marten-de-vries
11 years ago
1 comment
108.
▲
Open Game Eval: an eval for agentic Lua game development in Roblox
(github.com/Roblox)
3 points
kartayyar
6 months ago
discuss
109.
▲
Show HN: TypeScript type-level math expression parser and evaluator
(github.com/dqbd)
3 points
dqbd
3 years ago
discuss
110.
▲
GPT4 Learning from Reflection
(github.com/GammaTauAI)
3 points
agomez314
3 years ago
discuss
111.
▲
Can LLMs accurately evaluate their own confidence?
(github.com/anerli)
2 points
anerli
a year ago
2 comments
112.
▲
Show HN: CLI tool to analyze your Vector Embeddings!
(github.com/dakshjain-1616)
2 points
gauravvij137
4 months ago
1 comment
113.
▲
Show HN: OpenSciEval-AI Deriving Prime Theorem from Chaos
(github.com/maris205)
2 points
mairswang
6 months ago
1 comment
114.
▲
Show HN: PromptProof – CI gate for LLM outputs (schema/regex/cost; no API keys)
(github.com/marketplace)
2 points
geminimir
10 months ago
1 comment
115.
▲
Keyboard Layout Evaluation
(github.com/bclnr)
2 points
Egoist
4 years ago
1 comment
116.
▲
Evaluation Code – GPT-5 on Multimodal Medical Reasoning
(github.com/wangshansong1)
2 points
Topfi
9 months ago
discuss
117.
▲
Opensource operators evals
(github.com/nottelabs)
2 points
kernelito
a year ago
discuss
118.
▲
Show HN: Python library to run a “function” over a set of data via ChatGPT
(github.com/TylerGlaiel)
2 points
TylerGlaiel
3 years ago
discuss
119.
▲
Show HN: Spark-LLM-eval – Distributed LLM evaluation for Spark
(github.com/bassrehab)
1 point
subhadipmitra
6 months ago
1 comment
120.
▲
LLM-eval-kit: Distributed LLM evaluation framework (v0.3.0)
(github.com/benmeryem-tech)
1 point
benmeryem_ai
a month ago
discuss