Auto Exacto: Adaptive Quality Routing, On by Default

In October we shipped Exacto: hand-curated endpoints that had better tool-calling accuracy thanks to a vetted subset of providers. Exacto showed a 10-20% increase in scores across Tau2Bench and LiveMCPBench compared to default routing.
But we heard from the community that they expected more. You had to append :exacto to your model slug, and only supported models had one. And the provider lists were static: updated manually and frozen until we could re-run the analysis.
Auto Exacto aims to address all of that feedback. It re-evaluates providers roughly every 5 minutes across three signals: throughput, tool-call telemetry, and benchmark scores. For requests that include tools, it's on by default.
If you want the same quality-weighted routing on non-tool-calling requests, you can opt in by appending :exacto to any model slug, same as before. That works across all models and all request types, similar to :nitro for throughput sorting and :floor for price sorting.
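To make the slug mechanics concrete, here's a minimal sketch of opting into the suffix on a non-tool-calling request. The `with_variant` helper and the model slug shown are illustrative, not part of OpenRouter's SDK; the payload follows the OpenAI-compatible chat-completions shape OpenRouter accepts.

```python
def with_variant(slug: str, variant: str) -> str:
    """Append a routing variant (e.g. 'exacto', 'nitro', 'floor') to a model slug."""
    return f"{slug}:{variant}"

# Illustrative request body; POST it to
# https://openrouter.ai/api/v1/chat/completions with your API key.
payload = {
    "model": with_variant("deepseek/deepseek-v3.2", "exacto"),
    "messages": [{"role": "user", "content": "Summarize this diff."}],
}
```

The same helper covers `:nitro` and `:floor`, since all three are plain suffixes on the model slug.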
The first-week problem
Provider variance peaks right when a model launches, which is also when the most people try to use it.
Inference engines need patches for new chat templates, new formats, new parameters. Moonshot shipped Kimi K2 with vLLM and SGLang commits ready on day one. It still took weeks of post-launch work to get things right.
Artificial Analysis showed this with gpt-oss-120b: huge spread in week 1, tight band a month later.
Auto Exacto helps most during this window. Providers that haven't stabilized get deranked automatically. As they improve, they move back up. No human needs to update a list.
What we've been measuring
Since August 2025, we've been scoring every tool_call response across all of OpenRouter. We measure three things:
- Was the tool_call valid JSON?
- Was the tool name actually in the tools the user provided?
- Did the arguments match the tool's schema?
We've measured billions of tool calls this way, over months, and have high confidence in this signal.
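The three checks above can be sketched as a small scoring function. This is an illustrative sketch, not OpenRouter's telemetry code: the `tools` entries are assumed to follow the OpenAI-style `{"function": {"name": ..., "parameters": ...}}` shape, and the schema check here is deliberately shallow (required keys present, no unknown keys) rather than full JSON Schema validation.

```python
import json

def check_tool_call(raw_args: str, name: str, tools: list[dict]) -> dict:
    """Score one tool call on the three signals: JSON validity,
    tool-name accuracy, and a shallow schema match."""
    result = {"valid_json": False, "known_name": False, "schema_ok": False}
    try:
        args = json.loads(raw_args)
        result["valid_json"] = isinstance(args, dict)
    except json.JSONDecodeError:
        return result  # unparseable arguments fail everything downstream

    by_name = {t["function"]["name"]: t["function"] for t in tools}
    result["known_name"] = name in by_name

    if result["valid_json"] and result["known_name"]:
        params = by_name[name].get("parameters", {})
        required = set(params.get("required", []))
        allowed = set(params.get("properties", {}))
        # Shallow check: all required keys present, no unexpected keys.
        result["schema_ok"] = required <= set(args) <= allowed
    return result
```

Aggregating these booleans per provider over time yields the error rates discussed later in the post.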
Production traffic has limits, though: one provider gets used primarily by Kilo Code users, another sees traffic from demo tool-calling apps. The schemas can differ wildly in complexity and in the number of tools provided, and of course the use cases differ. You can't draw clean comparisons from real-world data alone, so we built controlled benchmarks too.
TauBench Verified and GPQA-Diamond
We're running two benchmarks on a recurring schedule across providers:
TauBench Verified Airline (from the τ²-Bench suite, using the AWS-AGI Verified dataset): an agentic tool-calling eval in the airline customer-service domain. It's small enough to run often without burning a fortune, but complex enough to expose real provider differences. Note that this dataset is not the same as the commonly published one, so our TauBench Airline scores are not equivalent to the ones you'll likely find online. The dataset itself is publicly available.
GPQA-Diamond, which most will be familiar with, is a knowledge-heavy reasoning benchmark that gives us a second axis. We previously worked with Florian Brand and the Epoch team on provider variance analysis using this benchmark. Their finding: for mature models, provider medians on GPQA-Diamond cluster tightly - the variance is mostly noise. But when it's real, it's obvious.
Our Findings
We have been testing this routing change in production for a few weeks now, limited to an internal account, fine-tuning the behavior and running benchmarks through that account. In the last few days, we enabled Auto Exacto globally for a chosen selection of our top tool-calling models - notably GLM-4.7, GLM-5, DeepSeek V3.2, and gpt-oss-120b.
Here's how our default routing has improved across models since 5pm EST Tuesday, March 10, comparing the previous price-weighted algorithm to the new Auto Exacto algorithm:
- GLM-5 and GLM-4.7 tool-call error rates dropped by 88% and 80%, respectively - a huge gain in agent reliability. Where previously we saw error rates of approximately 8%, we now average closer to 1%. TauBench Verified Airline scores across 20 runs stayed consistent with Z.AI's official endpoint.
- gpt-oss-120b error rate dropped by 36%, from 5.6% to 3.5%, and its TauBench score increased by 2 points, from 53% to 55%, coming in line with the 55% average score across our providers.
- DeepSeek V3.2 tool-call error rates dropped by 16%. The main improvement here is that TauBench scores rose from 69% to 74%, a 5-point improvement that is well outside the noise and statistically significant.
GPQA Diamond results are still running, as we aim to collect statistically significant data. We will update this blog in the coming days with those results.
How routing works
Auto Exacto uses three signal categories to classify providers:
- Throughput. Tokens-per-second generation speed, measured continuously from production traffic.
- Tool-call telemetry. The production data we've been collecting since August: JSON validity, schema compliance, tool name accuracy.
- Benchmark scores. TauBench Airline and GPQA-Diamond, run on a recurring schedule through our internal infrastructure, using Groq's OpenBench.
These three signals feed into a threshold system that adapts per model. The system computes what 'good' and 'bad' look like per model, using median and median absolute deviation across all providers serving it. A provider only gets flagged if it's a statistical outlier relative to its peers on that specific model.
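The median-and-MAD thresholding described above can be sketched in a few lines. This is an illustrative sketch only: the multiplier `k`, the normal-consistency factor, and the one-sided (below-median) rule are our assumptions for the example, not OpenRouter's actual constants.

```python
from statistics import median

def flag_outliers(scores: dict[str, float], k: float = 3.0) -> set[str]:
    """Flag providers whose score falls more than k scaled MADs below
    the per-model median, relative to peers serving the same model."""
    values = list(scores.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return set()  # all providers agree; nothing is an outlier
    # 1.4826 scales MAD to be comparable to a standard deviation
    # under a normal distribution.
    threshold = med - k * 1.4826 * mad
    return {name for name, v in scores.items() if v < threshold}
```

Because the reference statistics are computed per model across its own providers, a score that is normal for one model can still be flagged as an outlier for another.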
Providers land in one of three tiers:
- Verified good. Enough data, and nothing abnormal across the three signals.
- Insufficient data. Not enough requests to judge yet - these sit in the middle, not treated as first-class, but also not penalized.
- Deranked. Statistical outliers on one or more signals, they get pushed to the back of the line.
Within each tier, the original routing order (price, latency, your preferences) stays intact. We chose not to build a composite score that mashes throughput, benchmarks, and tool-call data into one number for this release; we just push the lowest-performing providers to the back. This is easier to reason about, and easier to debug.
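The tiering step amounts to a stable sort on tier alone, which is what preserves the base routing order within each tier. A minimal sketch (tier names and the input shape are illustrative):

```python
# Lower rank = earlier in routing; names are illustrative.
TIER_ORDER = {"verified": 0, "insufficient_data": 1, "deranked": 2}

def route_order(providers: list[dict]) -> list[str]:
    """`providers` is already in base routing order (price, latency,
    user preferences). Re-order by tier only; Python's sort is stable,
    so ties keep their original relative order."""
    ranked = sorted(providers, key=lambda p: TIER_ORDER[p["tier"]])
    return [p["name"] for p in ranked]
```

A deranked provider can still be reached, but only after every verified and unjudged provider has been tried.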
In the future, there may be a composite score that takes in more signals, including pricing, to build similar quality-weighted routing for ALL requests, tool-calling or otherwise. We'd appreciate feedback on which signals matter most to you and generalize across your use cases.
OpenRouter recomputes all scores roughly every 5 minutes. Every time it does, the reference statistics (median, deviation, computed threshold) for each model and signal are persisted, with a full audit trail. If we need to debug why a provider got deranked last Tuesday at 3pm, we can.
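One way to picture the audit trail is a JSON line persisted per model and signal on every recompute cycle. This is a hypothetical record shape for illustration; the field names are our invention, not OpenRouter's schema.

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class ThresholdSnapshot:
    model: str
    signal: str
    median: float
    mad: float
    threshold: float
    computed_at: float  # unix timestamp of the recompute cycle

def snapshot_line(model: str, signal: str,
                  med: float, mad: float, threshold: float) -> str:
    """Serialize one reference-statistics record as a JSON line;
    appending these to a log yields a replayable audit trail."""
    rec = ThresholdSnapshot(model, signal, med, mad, threshold, time.time())
    return json.dumps(asdict(rec))
```

Replaying the log for a given model and timestamp answers "why was this provider deranked last Tuesday at 3pm" without re-deriving anything.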
What we observe but don't route on (yet)
Occasionally, we notice a provider burning 30% more tokens than the median on the same benchmark tasks. This could mean looping, padding, or an inference engine quirk, particularly for newer models. We don't act on it yet, but we investigate and flag the issue to the relevant providers.
We track other signals across all providers as well, including queue time and throughput variance. We'll evaluate how much value these signals provide and consider incorporating them down the line.
Infrastructure
We built internal tooling (Mission Control) on top of Groq's OpenBench, which wraps the UK AI Safety Institute's Inspect framework. Temporal handles long-running workflows. GKE containers on GCP run the benchmarks, overprovisioned to 16GB RAM because agent evals eat memory.
Same benchmark, same environment, same config, on a recurring schedule. No human tweaking variables between runs. Multiple runs averaged per provider per model.
On quantization and tool-call parsers
People blame quantization for provider variance. We haven't seen a measurable impact on tool-call quality from quantization alone.
The actual culprit, more often than not: tool-call parsers. And that's typically an inference engine issue, not a provider cutting corners. Even when model labs work directly with inference engine teams (vLLM, SGLang) before launch, they don't always get it right the first time. It takes time for everyone to figure out how a model wants to use tools and how to parse those calls successfully.
Florian Brand's GPQA-Diamond analysis showed nearly identical medians across providers with very different quantization levels. DeepInfra runs aggressive quantization and regularly performs fine. Novita at FP4 beat FP8 providers. Kimi K2's official weights ship as native int4.
Rollout
Auto Exacto is live now. Here's how it works:
- Tool-calling requests: On by default for any model with enough providers to measure variance.
- All other requests: Append **:exacto** to any model slug. Works across every model.
- New providers: Sit in the middle tier until they've handled enough traffic. We don't push you to an untested endpoint.
On pricing: quality-weighted routing may favor providers that cost a bit more than the absolute cheapest option. If you want the cheapest inference regardless, you can use the :floor shortcut, or pin providers directly.
What we'll publish
We're building toward exposing this data publicly:
- Per-model provider tool call accuracy over time, live now in the performance tab
- TauBench and GPQA-Diamond results, updated regularly
We will provide summaries first. We want to give providers a chance to see the data and respond before we publish full datasets. When we surface issues, we consistently see smart teams fix them fast.
Closing