Heykuki News

Top | New | Best | Ask | Show | Jobs
931.
BrowseComp-Plus: A More Fair and Transparent Benchmark of Deep-Research Agent (github.com/texttron)
2 points
colonCapitalDee
4 days ago
discuss
932.
Show HN: AgentThreatBench – Benchmark for AI Agent Memory Security (github.com/OWASP)
2 points
vgudur297
8 days ago
discuss
933.
Prompter – Compare and benchmark Ollama models side-by-side in your terminal (github.com/whonixnetworks)
2 points
whonixnetworks
14 days ago
discuss
934.
Show HN: 97% on SWE-bench Verified with subscription-token agents (github.com/kimjune01)
2 points
kimjune01
16 days ago
discuss
935.
Show HN: Verdict – model evals on your own data, not someone else's benchmark (github.com/aevyraai)
2 points
agunapal
a month ago
discuss
936.
talkie-coder: From 1930 to SWE-bench (github.com/RicardoDominguez)
2 points
Philpax
a month ago
discuss
937.
Open macro placement benchmark and $20k challenge (HRT-sponsored) (github.com/partcleda)
2 points
anonymousmoos
2 months ago
discuss
938.
Show HN: WMB-100K – Open benchmark for AI memory systems at 100K turns (github.com/Irina1920)
2 points
wontopos
2 months ago
discuss
939.
Show HN: OpenClaw Arena – Benchmark models on real tasks, rank by perf and cost (app.uniclaw.ai)
2 points
skysniper
2 months ago
discuss
940.
An open source benchmarking framework for IT automation (github.com/itbench-hub)
2 points
pranay01
2 months ago
discuss
941.
Mitata: Benchmark tooling that loves you (github.com/evanwashere)
2 points
jcbhmr
3 months ago
discuss
942.
Help me improve this benchmark for vector engines (github.com/M4iKZ)
2 points
M4iKZ
3 months ago
discuss
943.
Some critical issues with the SWE-bench-Pro environments (github.com/SWE-agent)
2 points
snoopyswe
3 months ago
discuss
944.
BetterKV – A multithreaded Rust Redis alternative, 10-30x faster in benchmarks
2 points
1jmdev
3 months ago
discuss
945.
Show HN: ModelSweep - Open-Source Benchmarking for Local LLMs (github.com/leonickson1)
2 points
leonickson
3 months ago
discuss
946.
FratBench – Social Calibration Benchmark (OAI Scores Dead Last) [pdf] (github.com/richar-wang)
2 points
richardwang5
3 months ago
discuss
947.
TLAi+ Benchmarks for Evaluating LLMs (github.com/tlaplus)
2 points
alhazrod
3 months ago
discuss
948.
An Nginx Engineer Took over AI's Benchmark Tool (github.com/hongzhidao)
2 points
zhidao9
4 months ago
discuss
949.
KiteSQL: Rust-native embedded SQL with TPC-C benchmarks and WASM support (github.com/KipData)
2 points
Jacques2Marais
4 months ago
discuss
950.
WorkBench-Pro – PC benchmark designed for developer workflows (github.com/johanmcad)
2 points
johanmcad
4 months ago
discuss
951.
Benchmark Comparison: JSONL vs. TOON output for JSON-render efficiency (github.com/vercel-labs)
2 points
lafalce
5 months ago
discuss
952.
Show HN: Rerankers – Models, benchmarks, and papers for RAG (github.com/agentset-ai)
2 points
midamurat
5 months ago
discuss
953.
Show HN: sc-membench for modern memory bandwidth and latency benchmarks (github.com/spareCores)
2 points
daroczig
5 months ago
discuss
954.
Show HN: Long-horizon LLM coherence benchmark (500 cycles) (zenodo.org)
2 points
teugent
5 months ago
discuss
955.
Epiplexity to Beat DeepMind's Alchemy Meta RL Benchmark (github.com/RandMan444)
2 points
Phillip98798
5 months ago
discuss
956.
Show HN: JSONBench, a Benchmark for Data Analytics on JSON (github.com/ClickHouse)
2 points
saisrirampur
5 months ago
discuss
957.
Stop benchmarking LLMs. Make them fight (github.com/AGI-Eval-Official)
2 points
jinqueeny
5 months ago
discuss
958.
Show HN: Sigma Runtime – 550-cycle identity stability benchmark on GPT-5.2 (github.com/sigmastratum)
2 points
teugent
6 months ago
discuss
959.
Benchmarking LLMs on whether they can play FizzBuzz (github.com/venkatasg)
2 points
_venkatasg
6 months ago
discuss
960.
Running a 270M LLM on Android (architecture and benchmarks)
2 points
ayushranjan99
7 months ago
discuss