Heykuki News
Top
New
Best
Ask
Show
Jobs
931.
BrowseComp-Plus: A More Fair and Transparent Benchmark of Deep-Research Agent
(github.com/texttron)
2 points
colonCapitalDee
4 days ago
932.
Show HN: AgentThreatBench – Benchmark for AI Agent Memory Security
(github.com/OWASP)
2 points
vgudur297
8 days ago
933.
Prompter – Compare and benchmark Ollama models side-by-side in your terminal
(github.com/whonixnetworks)
2 points
whonixnetworks
14 days ago
934.
Show HN: 97% on SWE-bench Verified with subscription-token agents
(github.com/kimjune01)
2 points
kimjune01
16 days ago
935.
Show HN: Verdict – model evals on your own data, not someone else's benchmark
(github.com/aevyraai)
2 points
agunapal
a month ago
936.
talkie-coder: From 1930 to SWE-bench
(github.com/RicardoDominguez)
2 points
Philpax
a month ago
937.
Open macro placement benchmark and $20k challenge (HRT-sponsored)
(github.com/partcleda)
2 points
anonymousmoos
2 months ago
938.
Show HN: WMB-100K – Open benchmark for AI memory systems at 100K turns
(github.com/Irina1920)
2 points
wontopos
2 months ago
939.
Show HN: OpenClaw Arena – Benchmark models on real tasks, rank by perf and cost
(app.uniclaw.ai)
2 points
skysniper
2 months ago
940.
An open source benchmarking framework for IT automation
(github.com/itbench-hub)
2 points
pranay01
2 months ago
941.
Mitata: Benchmark tooling that loves you
(github.com/evanwashere)
2 points
jcbhmr
3 months ago
942.
Help me improve this benchmark for vector engines
(github.com/M4iKZ)
2 points
M4iKZ
3 months ago
943.
Some critical issues with the SWE-bench-Pro environments
(github.com/SWE-agent)
2 points
snoopyswe
3 months ago
944.
BetterKV – A multithreaded Rust Redis alternative, 10-30x faster in benchmarks
2 points
1jmdev
3 months ago
945.
Show HN: ModelSweep - Open-Source Benchmarking for Local LLMs
(github.com/leonickson1)
2 points
leonickson
3 months ago
946.
FratBench – Social Calibration Benchmark (OAI Scores Dead Last) [pdf]
(github.com/richar-wang)
2 points
richardwang5
3 months ago
947.
TLAi+ Benchmarks for Evaluating LLMs
(github.com/tlaplus)
2 points
alhazrod
3 months ago
948.
An Nginx Engineer Took over AI's Benchmark Tool
(github.com/hongzhidao)
2 points
zhidao9
4 months ago
949.
KiteSQL: Rust-native embedded SQL with TPC-C benchmarks and WASM support
(github.com/KipData)
2 points
Jacques2Marais
4 months ago
950.
WorkBench-Pro – PC benchmark designed for developer workflows
(github.com/johanmcad)
2 points
johanmcad
4 months ago
951.
Benchmark Comparison: JSONL vs. TOON output for JSON-render efficiency
(github.com/vercel-labs)
2 points
lafalce
5 months ago
952.
Show HN: Rerankers – Models, benchmarks, and papers for RAG
(github.com/agentset-ai)
2 points
midamurat
5 months ago
953.
Show HN: sc-membench for modern memory bandwidth and latency benchmarks
(github.com/spareCores)
2 points
daroczig
5 months ago
954.
Show HN: Long-horizon LLM coherence benchmark (500 cycles)
(zenodo.org)
2 points
teugent
5 months ago
955.
Epiplexity to Beat DeepMind's Alchemy Meta RL Benchmark
(github.com/RandMan444)
2 points
Phillip98798
5 months ago
956.
Show HN: JSONBench, a Benchmark for Data Analytics on JSON
(github.com/ClickHouse)
2 points
saisrirampur
5 months ago
957.
Stop benchmarking LLMs. Make them fight
(github.com/AGI-Eval-Official)
2 points
jinqueeny
5 months ago
958.
Show HN: Sigma Runtime – 550-cycle identity stability benchmark on GPT-5.2
(github.com/sigmastratum)
2 points
teugent
6 months ago
959.
Benchmarking LLMs on whether they can play FizzBuzz
(github.com/venkatasg)
2 points
_venkatasg
6 months ago
960.
Running a 270M LLM on Android (architecture and benchmarks)
2 points
ayushranjan99
7 months ago