DEV Community

# benchmark

Posts

I Tested CodeGraph on Hono. The Tool-Call Savings Reproduce — the Cost Savings Don't.

13 min read
We Benchmarked the Most Popular Code Search Tools. We Beat All of Them.

11 min read
Multi-Shot vs Zero-Shot: When Adding Examples Actually Hurts Accuracy

8 min read
Open-Source A3M Router Tops RouterArena Benchmark

1 min read
How does an AI agent pick from 686 skills in a second?

7 min read
LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025)

5 min read
Claude Sonnet 4.6 vs GPT-4.1 vs Gemini 2.5 Flash: which wins JSON extraction?

3 min read
Benchmarks- Kubernetes MCP Servers Passed. That Was Not Enough.

Comments 1
4 min read
We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better.

5 min read
10 Models Tested: From 81.6% to 10%. The Free Tier is a Full-On Gamble.

4 min read
I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.

3 min read
I Benchmarked 15 AI Models for Speed – Here's What Will Blow Your Mind

5 min read
Two Models Just Hit 90% on Agent Coding. One Cost Less Than a Penny.

2 min read
Model Showdown Round 4: Opus vs Qwen — Writers, Not Coders

10 min read
Why Most Browser AI Demos Fail on Real Hardware

4 min read