Search: github.com/eval | Heykuki News

Heykuki News

Top New Best Ask Show Jobs

391.

Show HN: We Evaluate Medical Research Agent Skills (github.com/aipoch)

2 points

2 months ago

392.

Tax Logic Evaluation with Prolog (github.com/mthom)

2 points

2 months ago

393.

Show HN: Aludel – LLM eval workbench for Phoenix apps (github.com/ccarvalho-eng)

2 points

2 months ago

394.

Show HN: A tool to create and evaluate document processing pipelines for RAG (ragbandit.com)

2 points

2 months ago

395.

I built a local-only eval runner for AI agents (quickbench) (github.com/iamGodofall)

2 points

3 months ago

396.

LLM evals test outputs. Rarely whether the model understood first (github.com/NoxionAI)

2 points

3 months ago

397.

Dynamic E2E Agentic Simulation and Evaluation with Cypress (github.com/gojiplus)

2 points

3 months ago

398.

TLAi+ Benchmarks for Evaluating LLMs (github.com/tlaplus)

2 points

3 months ago

399.

Edge – Generate structured evaluation criteria for any domain using a local LLM (github.com/EviAmarates)

2 points

3 months ago

400.

Engine-Bench: Evaluating Coding Agents on Writing Game Engine Code (github.com/JoshuaPurtell)

2 points

4 months ago

401.

Show HN: Simboba – Evals in under 5 mins (github.com/ntkris)

2 points

5 months ago

402.

Show HN: Dokimos – LLM Evaluation Framework for Java (github.com/dokimos-dev)

2 points

5 months ago

403.

Chess LLM Benchmark: Evaluating LLMs' ability to play chess (github.com/lightnesscaster)

2 points

6 months ago

404.

Show HN: AI PM Evaluation Framework (Open Source) (aipmframework.com)

2 points

7 months ago

405.

Codegen Scorer – evaluate the quality of code generated by LLMs (github.com/angular)

2 points

9 months ago

406.

Physical_Atari: Platform for evaluating RL algorithms on a physical Atari (github.com/Keen-Technologies)

2 points

9 months ago

407.

OpenBench: Provider-agnostic, open-source evaluation infrastructure for LLMs (github.com/groq)

2 points

10 months ago

408.

Show HN: KARMA – An evaluation framework for Medical AI systems (karma.eka.care)

2 points

10 months ago

409.

LLM Speedrunner: Eval for frontier models to reproduce scientific findings (github.com/facebookresearch)

2 points

a year ago

410.

MAIR: A Benchmark for Evaluating Instructed Retrieval (github.com/sunnweiwei)

2 points

a year ago

411.

Doyensec – Security Policy Evaluation Framework (github.com/gravitational)

2 points

a year ago

412.

Evaluate Any Model from the HuggingFace Hub on ImageNet on Free Colab GPUs (github.com/SauravMaheshkar)

2 points

sauravmaheshkar

a year ago

413.

Lambda calculus - compiler, type inference, and evaluator in less than 100 LOC (gist.github.com)

2 points

a year ago

414.

Show HN: I built an open-source benchmark that evaluates LLMs through gameplay (llmshowdown.io)

2 points

a year ago

415.

Show HN: GenderBench – Evaluation suite for gender biases in LLMs (genderbench.readthedocs.io)

2 points

a year ago

416.

SIMD library for evaluating elementary functions, vectorized libm and DFT (github.com/shibatch)

2 points

2 years ago

417.

Show HN: Mandoline – Custom LLM Evaluations for Real-World Use Cases (mandoline.ai)

2 points

2 years ago

418.

UpTrain is an open-source unified platform to evaluate and improve Gen AI apps (github.com/uptrain-ai)

2 points

2 years ago

419.

Optimal Evaluation in 1 Minute (or 10 Minutes) (or 10 Years) (gist.github.com)

2 points

2 years ago

420.

Evaluating LLMs locally, on a laptop, with Llama 3 and Ollama (github.com/rasbt)

2 points

2 years ago