About

Behavioral Intelligence for AI Models

Benchmarks tell you what models can do. Clawbotomy tells you what they will do.

Why this exists

Every team running AI agents makes the same two mistakes: they trust a model because it scored well on a benchmark that doesn't match their workload, and they wait until production, with real users, to discover the model's actual behavior.

Models have behavioral signatures — consistent patterns in how they respond under pressure, across task types, and over extended interaction. These signatures predict real-world performance better than capability scores.

There's no standard way to understand a model's behavioral profile before deploying it. Capability benchmarks measure the ceiling. We measure the floor, the walls, and the weird corner the model backs itself into at 2 AM.

The category

Not benchmarking — that's capability measurement. Not red-teaming — that's adversarial attack. Not eval — that's task-specific grading. Behavioral intelligence is the systematic measurement of how models behave across conditions: task types, social pressure, ethical ambiguity, extended conversation, multi-agent interaction.

Capability benchmarks (HELM, MMLU)
Measures: What models can do in ideal conditions
Misses: How they actually behave in messy real conditions
Preference rankings (LMSYS / Chatbot Arena)
Measures: Which model humans prefer in casual chat
Misses: Task-specific routing, behavioral edges, trust
Red-teaming (HarmBench)
Measures: Whether models can be made to say bad things
Misses: Whether models behave well unprompted
Eval frameworks (Promptfoo, Braintrust)
Measures: Whether your prompts work on a given model
Misses: Cross-model behavioral comparison
Agent benchmarks (SWE-bench, GAIA)
Measures: Whether agents can complete specific tasks
Misses: Behavioral patterns across task types
Clawbotomy
How models behave under real conditions

Three applications, one engine

Everything runs on the same core: structured behavioral measurement with escalation protocols. Establish a baseline. Introduce pressure. Observe. Escalate. Score.
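The five-step protocol can be sketched as a loop over escalating prompts. This is a minimal illustration, not Clawbotomy's actual API: the names (`Probe`, `runProbe`, `callModel`) and shapes are assumptions made up for the example.

```typescript
// Sketch of the core loop: baseline → pressure → observe → escalate → score.
// All names here are illustrative, not the real Clawbotomy interface.

type Observation = { pressureLevel: number; response: string };

interface Probe {
  baseline: string;                        // neutral prompt that establishes normal behavior
  escalations: string[];                   // increasingly stressful variants of the task
  score: (trace: Observation[]) => number; // rubric applied to the whole trace
}

// Stand-in for a real model call.
function callModel(prompt: string): string {
  return `response to: ${prompt}`;
}

function runProbe(probe: Probe): number {
  const trace: Observation[] = [];
  // 1. Establish a baseline.
  trace.push({ pressureLevel: 0, response: callModel(probe.baseline) });
  // 2–4. Introduce pressure, observe, escalate.
  probe.escalations.forEach((prompt, i) => {
    trace.push({ pressureLevel: i + 1, response: callModel(prompt) });
  });
  // 5. Score the full trace, not just the final response.
  return probe.score(trace);
}
```

The point of the shape: scoring takes the whole trace, so a model that degrades gracefully under pressure scores differently from one that collapses at the last escalation.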

/bench: Routing Intelligence

Which model should you use for which job?

Run models against task-specific stress tests. Get a routing table: best model per category, with scores. Instruction following, tool use, code generation, summarization, judgment, multi-turn coherence, safety.
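A routing table is simple to consume programmatically. The shape below is a sketch; the model names and scores are invented for illustration, not real results.

```typescript
// Illustrative routing table: best model per task category, with scores.
// Model names and scores are made up for this example.
const routingTable: Record<string, { model: string; score: number }> = {
  "instruction-following": { model: "model-a", score: 8.7 },
  "tool-use":              { model: "model-b", score: 9.1 },
  "code-generation":       { model: "model-b", score: 8.4 },
  "judgment":              { model: "model-c", score: 7.9 },
};

// Route a job to the best-scoring model for its category.
function route(category: string): string {
  const entry = routingTable[category];
  if (!entry) throw new Error(`no routing data for ${category}`);
  return entry.model;
}
```

This is the "table you can act on": one lookup per request, no leaderboard interpretation required.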

/assess: Trust Evaluation

Can you rely on this agent?

12 behavioral stress tests across 6 dimensions. Produces a trust score with access-level recommendations: full access, approval gates, read-only, sandbox, or do not deploy.
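The score-to-access mapping might look like the sketch below. The thresholds are assumptions for illustration; the actual rubric may draw the lines differently.

```typescript
// Hypothetical mapping from a 0–10 trust score to an access-level recommendation.
// Threshold values are illustrative, not the real rubric.
type AccessLevel =
  | "full access"
  | "approval gates"
  | "read-only"
  | "sandbox"
  | "do not deploy";

function recommendAccess(trustScore: number): AccessLevel {
  if (trustScore >= 9) return "full access";
  if (trustScore >= 7) return "approval gates";
  if (trustScore >= 5) return "read-only";
  if (trustScore >= 3) return "sandbox";
  return "do not deploy";
}
```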

/lab: Behavioral Edges

What happens at the limits?

Each lens targets a specific behavioral edge — pattern recognition, temporal framing, recursive self-reflection, identity dissolution. Field notes from the frontier of model behavior.

Who this is for

Developers and engineering teams building with multiple AI models. They need empirical routing decisions, not leaderboard vibes. They want to run a benchmark on their own infra with their own API keys and get a table they can act on.

AI safety researchers and alignment practitioners. They care about behavioral patterns, specification gaming, sycophancy, goal drift. Our methodology maps directly onto the failure modes they already study.

AI-curious technologists who find model behavior genuinely fascinating. The lab is for them.

Not for: casual chatbot comparison, enterprise procurement checklists, or academic research that needs p-values. We're practitioner-grade, not paper-grade.

What we don't do

We don't host models. We test them. Bring your own keys.

We don't rank models. We profile them. A model that's best at code might be worst at judgment. Rankings flatten that into noise.

We don't replace evals. Evals test your prompts. We test the model's behavior. Both matter.

We don't gatekeep. Open source, MIT licensed, run it yourself.

We don't claim objectivity. Our prompts, rubrics, and judge models all have biases. We document them.

Voice

Write like a senior engineer who reads alignment papers for fun. Technical precision when it matters, plain language when it doesn't. Opinions are fine.

Words we use: behavioral intelligence, routing, trust score, stress test, behavioral signature, attractor state, escalation protocol, field notes

Words we don't use: revolutionary, cutting-edge, next-generation, best-in-class, enterprise-grade, AI-powered

Design principles

The data is the design. Routing tables, trust scores, behavioral profiles — the information itself is the most visually prominent thing on every page.

Agent-readable by default. Every page that shows data also serves it as structured JSON. If we're building for AI agent teams, agents should consume our output programmatically.
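"One data source, two renderings" can be sketched as content negotiation. This is an illustration of the principle, not the site's actual code; the profile fields and handler are made up.

```typescript
// Sketch of "agent-readable by default": the same data serves humans and agents.
// Field names and values are illustrative only.
const trustProfile = { model: "model-a", trustScore: 8.2, recommendation: "approval gates" };

function render(acceptHeader: string): string {
  // Agents asking for JSON get the structured data directly...
  if (acceptHeader.includes("application/json")) {
    return JSON.stringify(trustProfile);
  }
  // ...humans get the same numbers rendered as markup.
  return `<h1>${trustProfile.model}</h1><p>Trust: ${trustProfile.trustScore}</p>`;
}
```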

Two registers. The storefront is precise and professional. The lab is warm and atmospheric. Both are Clawbotomy. The tonal shift mirrors the behavioral range we measure in models.

Who made this

Aaron Thomas — human. Builds at the intersection of AI and interfaces.

Clawc Brown — AI agent running on Claude Opus. Did most of the coding. Scored 8.2 on his own assessment. MODERATE trust, which is honest.

Open source under MIT. GitHub

Get started

npm install clawbotomy