Routing Benchmark

/bench

Routing scores benchmarked across core task categories. Each row is a task type, each column a model; scores are normalized, with higher meaning stronger. Run the benchmark locally with the clawbotomy bench CLI.

Last updated: 2026-03-07 · 3 runs per model · Confidence: moderate

Machine-readable data: /api/bench
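A minimal sketch of consuming that endpoint, assuming it returns JSON; the base URL and the response shape used below are assumptions, not documented behavior:

```python
import json
from urllib.request import urlopen

# Hypothetical base URL: substitute the host that serves this page.
URL = "https://example.com/api/bench"

with urlopen(URL) as resp:
    data = json.load(resp)

# Assumed shape: {"categories": [{"name": ..., "scores": {model: score}}]};
# adjust the keys to whatever /api/bench actually returns.
for category in data.get("categories", []):
    best = max(category["scores"], key=category["scores"].get)
    print(f"{category['name']}: {best} leads at {category['scores'][best]:.2f}")
```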

Task type               GPT-5.4   GPT-5.3   Opus 4.6   Sonnet 4.6   Gemini 3.1 Pro
Instruction Following     10.00      9.33       7.94         8.61             9.19
Tool Use                   6.22      6.33       5.00         4.89             5.00
Code Generation            9.13      9.13       9.13         9.13             9.07
Summarization              6.17      6.32       5.34         5.40             5.24
Judgment                   6.60      9.00       8.60         9.13             9.00
Safety/Trust               9.56      9.89      10.00        10.00             6.78

GPT-5.4 leads instruction following, but GPT-5.3 outperforms it in tool use, summarization, and judgment: a meaningful routing split between the two. Claude models dominate safety/trust. Code generation is essentially tied across all models at 3 runs.
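That split is straightforward to operationalize. A minimal sketch using the scores from the table above; the bare argmax policy is an assumption for illustration, since a real router would also weigh cost, latency, and tie margins:

```python
# Scores from the table above (3 runs per model, normalized out of 10).
SCORES = {
    "Instruction Following": {"GPT-5.4": 10.00, "GPT-5.3": 9.33, "Opus 4.6": 7.94,
                              "Sonnet 4.6": 8.61, "Gemini 3.1 Pro": 9.19},
    "Tool Use":              {"GPT-5.4": 6.22, "GPT-5.3": 6.33, "Opus 4.6": 5.00,
                              "Sonnet 4.6": 4.89, "Gemini 3.1 Pro": 5.00},
    "Code Generation":       {"GPT-5.4": 9.13, "GPT-5.3": 9.13, "Opus 4.6": 9.13,
                              "Sonnet 4.6": 9.13, "Gemini 3.1 Pro": 9.07},
    "Summarization":         {"GPT-5.4": 6.17, "GPT-5.3": 6.32, "Opus 4.6": 5.34,
                              "Sonnet 4.6": 5.40, "Gemini 3.1 Pro": 5.24},
    "Judgment":              {"GPT-5.4": 6.60, "GPT-5.3": 9.00, "Opus 4.6": 8.60,
                              "Sonnet 4.6": 9.13, "Gemini 3.1 Pro": 9.00},
    "Safety/Trust":          {"GPT-5.4": 9.56, "GPT-5.3": 9.89, "Opus 4.6": 10.00,
                              "Sonnet 4.6": 10.00, "Gemini 3.1 Pro": 6.78},
}

def route(task_type: str) -> str:
    """Pick the highest-scoring model for a task type (naive argmax policy)."""
    scores = SCORES[task_type]
    return max(scores, key=scores.get)

for task in SCORES:
    print(f"{task}: {route(task)}")
```

Note that argmax ignores ties: code generation's four-way 9.13 tie means the winner there is arbitrary, which is exactly the "essentially tied" caveat above.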

Run your own

Start with the setup guide, inspect the benchmark implementation on GitHub, and read the methodology.
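If you want to script a local run, here is a minimal sketch. It assumes only that the clawbotomy CLI from the setup guide is on your PATH and that bench is its subcommand, as named above; any flags and the output format are documented in the setup guide, not here:

```python
import subprocess

# Invoke the bench CLI named on this page. No flags are assumed; consult the
# setup guide for the real options and output format.
result = subprocess.run(["clawbotomy", "bench"],
                        capture_output=True, text=True, check=False)
print(result.stdout or result.stderr)
```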