Heykuki News

Top | New | Best | Ask | Show | Jobs
1.
FlexGen: Running large language models on a single GPU (github.com/FMInference)
192 points
behnamoh
3 years ago
43 comments
2.
Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model (github.com/cactus-compute)
774 points
HenryNdubuaku
24 days ago
211 comments
3.
DeepSeek 4 Flash local inference engine for Metal (github.com/antirez)
499 points
tamnd
a month ago
159 comments
4.
Atlas TQ1_0 – Pure C++ Ternary (1.58-Bit) Inference Engine for CPU (github.com/xxxn3m3s1sxxx)
3 points
xxxn3m3s1sxxx
18 days ago
discuss
5.
Ternative – C++/CUDA inference engine for ternary LLMs with runtime LoRA (github.com/michelangeloromerochisco)
2 points
michelangeloro
17 days ago
1 comment
6.
Why Gemma-4 26B MoE works in HuggingFace but breaks in prod inference engines (github.com/maeddesg)
1 point
maeddesg
21 days ago
discuss
7.
Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks (github.com/antoinezambelli)
197 points
zambelli
17 days ago
67 comments
8.
Show HN: I blind-tested 14 LLMs on a WP plugin task. Surprising Findings (github.com/guilamu)
3 points
guilamu
a month ago
2 comments