Heykuki News
Top
New
Best
Ask
Show
Jobs
1.
FlexGen: Running large language models on a single GPU
(github.com/FMInference)
192 points
behnamoh
3 years ago
43 comments
2.
Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model
(github.com/cactus-compute)
774 points
HenryNdubuaku
24 days ago
211 comments
3.
DeepSeek 4 Flash local inference engine for Metal
(github.com/antirez)
499 points
tamnd
a month ago
159 comments
4.
Atlas TQ1_0 – Pure C++ Ternary (1.58-Bit) Inference Engine for CPU
(github.com/xxxn3m3s1sxxx)
3 points
xxxn3m3s1sxxx
18 days ago
discuss
5.
Ternative – C++/CUDA inference engine for ternary LLMs with runtime LoRA
(github.com/michelangeloromerochisco)
2 points
michelangeloro
17 days ago
1 comment
6.
Why Gemma-4 26B MoE works in HuggingFace but breaks in prod inference engines
(github.com/maeddesg)
1 point
maeddesg
21 days ago
discuss
7.
Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks
(github.com/antoinezambelli)
197 points
zambelli
17 days ago
67 comments
8.
Show HN: I blind-tested 14 LLMs on a WP plugin task. Surprising Findings
(github.com/guilamu)
3 points
guilamu
a month ago
2 comments