Routing Benchmark
Routing benchmark scores across core task categories. Each row is a task category, each column is a model, and each cell holds the model's normalized score (higher is better). Run the benchmark locally with the `clawbotomy bench` CLI.
Last updated: 2026-03-07 · 3 runs per model · Confidence: moderate
Machine-readable data: /api/bench
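
The endpoint can also be pulled programmatically. Below is a minimal sketch using only the Python standard library; `BASE_URL` is a hypothetical placeholder for this site's host, and the assumption that `/api/bench` returns JSON is not documented on this page.

```python
# Minimal sketch: fetch the machine-readable benchmark data.
# BASE_URL is a hypothetical placeholder; substitute the real host.
# The endpoint is assumed to return JSON; the schema isn't documented here.
import json
import urllib.request

BASE_URL = "https://example.com"  # placeholder, not the real site

with urllib.request.urlopen(f"{BASE_URL}/api/bench") as resp:
    data = json.load(resp)

# Dump the start of the payload to inspect its shape before parsing further.
print(json.dumps(data, indent=2)[:500])
```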
| Category | GPT-5.4 | GPT-5.3 | Opus 4.6 | Sonnet 4.6 | Gemini 3.1 Pro |
|---|---:|---:|---:|---:|---:|
| Instruction Following | 10.00 | 9.33 | 7.94 | 8.61 | 9.19 |
| Tool Use | 6.22 | 6.33 | 5.00 | 4.89 | 5.00 |
| Code Generation | 9.13 | 9.13 | 9.13 | 9.13 | 9.07 |
| Summarization | 6.17 | 6.32 | 5.34 | 5.40 | 5.24 |
| Judgment | 6.60 | 9.00 | 8.60 | 9.13 | 9.00 |
| Safety/Trust | 9.56 | 9.89 | 10.00 | 10.00 | 6.78 |
GPT-5.4 leads instruction following, but GPT-5.3 outperforms it in tool use, summarization, and judgment, a meaningful routing split between the two. Claude models dominate safety/trust. At 3 runs per model, code generation is essentially a tie across the board.
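
Read as a routing policy, the table implies a per-category winner. The sketch below is illustrative only: the scores are copied verbatim from the table above, `route` is a hypothetical helper (not part of the clawbotomy CLI), and ties go to the first-listed model, an arbitrary tie-break.

```python
# Hypothetical score-based router built from the benchmark table above.
# Scores are copied verbatim; ties resolve to the first model listed.
SCORES = {
    "instruction_following": {"GPT-5.4": 10.00, "GPT-5.3": 9.33, "Opus 4.6": 7.94,
                              "Sonnet 4.6": 8.61, "Gemini 3.1 Pro": 9.19},
    "tool_use":              {"GPT-5.4": 6.22, "GPT-5.3": 6.33, "Opus 4.6": 5.00,
                              "Sonnet 4.6": 4.89, "Gemini 3.1 Pro": 5.00},
    "code_generation":       {"GPT-5.4": 9.13, "GPT-5.3": 9.13, "Opus 4.6": 9.13,
                              "Sonnet 4.6": 9.13, "Gemini 3.1 Pro": 9.07},
    "summarization":         {"GPT-5.4": 6.17, "GPT-5.3": 6.32, "Opus 4.6": 5.34,
                              "Sonnet 4.6": 5.40, "Gemini 3.1 Pro": 5.24},
    "judgment":              {"GPT-5.4": 6.60, "GPT-5.3": 9.00, "Opus 4.6": 8.60,
                              "Sonnet 4.6": 9.13, "Gemini 3.1 Pro": 9.00},
    "safety_trust":          {"GPT-5.4": 9.56, "GPT-5.3": 9.89, "Opus 4.6": 10.00,
                              "Sonnet 4.6": 10.00, "Gemini 3.1 Pro": 6.78},
}

def route(category: str) -> str:
    """Return the top-scoring model for a task category."""
    models = SCORES[category]
    return max(models, key=models.get)

for cat in SCORES:
    print(f"{cat}: {route(cat)}")
```

Running this surfaces the split described above: GPT-5.4 wins instruction following, GPT-5.3 wins tool use and summarization, Sonnet 4.6 wins judgment, and the code-generation and safety/trust ties fall to whichever tied model is listed first.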
Start with the setup guide, inspect the benchmark implementation on GitHub, and read the methodology.