Search: github.com/eval | Heykuki News

Heykuki News

Top New Best Ask Show Jobs

91.

Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps

106 points

2 years ago

92.

Show HN: Web-eval-agent – Let the coding agent debug itself (github.com/Operative-Sh)

84 points

a year ago

93.

Show HN: Ellipsis – Automatic pull request reviews (ellipsis.dev)

18 points

2 years ago

94.

Show HN: Honcho – Open-source memory infrastructure, powered by custom models (github.com/plastic-labs)

8 points

4 months ago

95.

Bad MCP design costs your agent 5x more tokens

6 points

19 hours ago

96.

Show HN: Agent Tinman – Autonomous failure discovery for LLM systems (github.com/oliveskin)

4 points

4 months ago

97.

Show HN: Open Operator Evals – real-world benchmarks for LLM web agents (github.com/nottelabs)

3 points

a year ago

98.

Show HN: PromptProof – CI gate for LLM outputs (schema/regex/cost; no API keys) (news.ycombinator.com)

2 points

9 months ago

99.

Show HN: I made web agents reliable with smaller LLMs via natural language (github.com/nottelabs)

2 points

a year ago

100.

Deprecating A/B tests with offline policy evaluation

1 point

5 years ago

101.

Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs (github.com/darkrishabh)

79 points

a month ago

102.

Show HN: Continuous-eval – Granular evaluation of GenAI pipelines (github.com/relari-ai)

10 points

2 years ago

103.

Show HN: I designed a ChatGPT prompt evaluator to ruin your fun;) (github.com/alignedai)

8 points

3 years ago

104.

Show HN: Image Eval – An evaluation toolkit for image generation models (github.com/Storia-AI)

7 points

3 years ago

105.

Open RAG Eval (github.com/vectara)

6 points

a year ago

106.

In a sample of >1000 games, GPT-3.5-turbo-instruct plays chess with ~1800 elo (github.com/adamkarvonen)

4 points

3 years ago

107.

Show HN: Eval.js – a JavaScript interpreter written in JavaScript (github.com/marten-de-vries)

4 points

marten-de-vries

11 years ago

108.

Open Game Eval: an eval for agentic Lua game development in Roblox (github.com/Roblox)

3 points

6 months ago

109.

Show HN: TypeScript type-level math expression parser and evaluator (github.com/dqbd)

3 points

3 years ago

110.

GPT4 Learning from Reflection (github.com/GammaTauAI)

3 points

3 years ago

111.

Can LLMs accurately evaluate their own confidence? (github.com/anerli)

2 points

a year ago

112.

Show HN: CLI tool to analyze your Vector Embeddings! (github.com/dakshjain-1616)

2 points

4 months ago

113.

Show HN: OpenSciEval-AI Deriving Prime Theorem from Chaos (github.com/maris205)

2 points

6 months ago

114.

Show HN: PromptProof – CI gate for LLM outputs (schema/regex/cost; no API keys) (github.com/marketplace)

2 points

10 months ago

115.

Keyboard Layout Evaluation (github.com/bclnr)

2 points

4 years ago

116.

Evaluation Code – GPT-5 on Multimodal Medical Reasoning (github.com/wangshansong1)

2 points

9 months ago

117.

Opensource operators evals (github.com/nottelabs)

2 points

a year ago

118.

Show HN: Python library to run a “function” over a set of data via ChatGPT (github.com/TylerGlaiel)

2 points

3 years ago

119.

Show HN: Spark-LLM-eval – Distributed LLM evaluation for Spark (github.com/bassrehab)

1 point

6 months ago

120.

LLM-eval-kit: Distributed LLM evaluation framework (v0.3.0) (github.com/benmeryem-tech)

1 point

a month ago