A drop-in gateway that routes every request to the cheapest Claude model that's provably good enough — informed by the whole conversation, never downgrading hard work — and shows you exactly what you saved, per prompt and per session, recalculated from the actual tokens each request consumed.
Sign up, connect your gateway, and see savings in your dashboard. After the trial you pay just 20% of the savings we deliver — no savings, no bill.
# no API key, no spend — see it route in two minutes $ pip install "modelpilot @ git+https://github.com/krethikram-sudo/modelpilot.git" $ modelpilot demo --offline 1 classification SWITCH → claude-haiku-4-5 conf 0.85 2 debugging stay → claude-opus-4-8 quality protected 3 extraction SWITCH → claude-haiku-4-5 conf 0.85 ... REALIZED savings: $0.07 POTENTIAL: $0.12 (48% of baseline)
Nothing about your traffic changes. ModelPilot scores every request and shows you, on the dashboard, exactly what autopilot would save — so you decide with evidence, at zero risk.
Confidence-gated routing with a built-in randomized holdout, so the savings are measured, not estimated — and an escalation valve whose costs count against us. Flip one flag when the dashboard convinces you.
The whole product is built to earn that switch: run in guidance for a day, open the dashboard, and you'll see the would-have-saved number, the side-by-side proof that the cheaper model's output holds up, and a one-flag path to capture it.
One base-URL change. Your app
behaves exactly as before; ModelPilot just watches and scores.
modelpilot gateway --mode guidance
It shows what autopilot would save (per chat and cumulative), and a side-by-side of your real prompts — recommended model vs your standard model — so you can see the output is just as good.
When the evidence is
convincing, flip to --mode autopilot. The randomized holdout
keeps proving quality parity while you bank the savings.
The thing naive routers get wrong: prompt caches are model-scoped, so switching models mid-conversation can forfeit your cache discount and cost more. ModelPilot prices the cache break before every switch and vetoes any move that doesn't actually pay.
| Hard problem | How ModelPilot handles it |
|---|---|
| Prompt caches are model-scoped — naive switching loses money | Cache-aware economics price the rewrite before any mid-conversation switch and veto switches that don't pay |
| Does the router itself add latency? | The decision is local heuristics: ~0.05 ms per request — negligible next to multi-second generation |
| Structured outputs / JSON schemas can break brittle parsers | Requests with a tool or output-schema contract are never downgraded below Sonnet, so the response shape your code depends on is preserved |
| "Summarize" can be trivial or expert-level | Calibrated on a golden dataset with a ≤2% false-downgrade target; when unsure, it doesn't touch your request |
| "why?" after a hard exchange looks trivial | Session-context routing: follow-ups inherit the conversation's difficulty; mechanical tasks over existing content stay cheap |
| Vendors grade their own homework | Randomized holdout + escalation costs deducted + side-by-side output
comparisons you can run yourself: modelpilot compare --judge |
A generic router leaves money on the table on your workload. ModelPilot learns four things from your own traffic — every change earned by evidence on your prompts, never a blunt global default:
Each category's confidence gate loosens where routing has proven safe at volume and tightens the instant anything escalates — live, no restart.
It samples your own prompts, runs the next-cheaper model against the
baseline, and lowers the floor only when the cheaper output is judged
non-inferior on your data. modelpilot learn-floors
Domain phrasing ("redline this clause", "triage this ticket") a generic
router dumps into a catch-all gets mapped to the right category, so it's
actually optimized. modelpilot learn-rules
Finds uncached repeated context and oversized prompts in your ledger and
tells you the dollars caching or trimming would save — it never rewrites
your prompts. modelpilot prompt-audit
Every downgrade is gated by non-inferiority on your own prompts and protected by the structured-output guard, so "cheaper" never means "riskier". The longer it runs, the more it can safely route down — savings compound with use.
Don't wait for the learning loop to warm up. Drop in a starter pack for your space and routing fits your workload from the first request — then your own traffic refines it:
MODELPILOT_POLICY=packs/doc-extraction.json modelpilot gateway --mode guidance
doc-extraction, support, coding —
route bulk extraction, classification, triage and simple codegen to the
cheapest model that holds up (where the savings are largest).
legal, healthcare — conservative by design: never
below Sonnet, with a built-in compliance profile (model floor + cautious
gate) for legal and clinical work.
Packs are a strong starting prior, not a black box: every guardrail
still applies, and learn-floors confirms or pulls back each
category on your own traffic over time.
A deployment profile makes routing obey your constraints, so cost optimization never fights compliance or your contract:
Allow- and block-lists keep routing inside the models you've cleared. A blocked cheap model falls back to the next approved tier — never silently somewhere you didn't sanction.
Set a min_model and nothing — not the heuristics, not the
learned floors — ever routes below it.
Drop in your committed-use / enterprise pricing and every number reflects your bill, not list price.
One dial — conservative, balanced, or aggressive — sets how confident the router must be before it switches.
It runs in your infrastructure with your own API key: keys and
prompt text stay on your side — the ledger stores token counts and routing
metadata, never prompt content. Telemetry is opt-in and aggregate-only
(modelpilot telemetry --preview shows the exact payload first —
never prompts, outputs, or keys).
AWS Bedrock has a cost router — Intelligent Prompt Routing. If you're on the first-party Anthropic API it isn't available to you at all, and ModelPilot is the only cost router for your traffic. If you are on Bedrock, the honest comparison:
| Bedrock IPR | ModelPilot | |
|---|---|---|
| Models it routes | Two older models per router (Claude 3.x era) | The current lineup — Opus 4.8, Sonnet 4.6, Haiku 4.5, Fable 5 |
| Quality evidence | Vendor prediction; no holdout, no output proof | RCT holdout + both outputs side by side on your own prompts |
| Whose interest | The vendor billing you also picks the model | Customer-side; our only job is your bill going down |
| Lock-in | AWS Bedrock only | Your infra, first-party key, portable |
Run the head-to-head on your own prompts:
modelpilot compare --bedrock-router <your-router-arn> --judge.
This is what converts guidance users to autopilot. Run your captured prompts through both arms — ModelPilot's routing vs everything on the biggest model — and get one page with costs and outputs side by side, plus independent non-inferiority verdicts per prompt:
$ modelpilot compare --from-captures --judge routed 14/20 prompts · saved 61% ($1.87) · non-inferior 100% report: compare_report.html
Savings claims you can audit: actual token counts at list prices,
the same prompts in both arms, and every output on the page for your own eyes —
not just a judge's verdict. It gets better as you go: modelpilot tune
learns from your traffic — loosening routing on categories that prove safe for
your workload and backing off any that don't — so savings climb the more you use it.
# 1. Install pip install "modelpilot @ git+https://github.com/krethikram-sudo/modelpilot.git" # 2. Prove it to yourself (no key, no spend) modelpilot demo --offline # 3. Guidance mode: set your API key, point your app at it # (your free 7-day trial starts now — no license needed) export ANTHROPIC_API_KEY=sk-ant-... modelpilot gateway --mode guidance --port 8400 # Python: anthropic.Anthropic(base_url="http://localhost:8400") # any SDK: ANTHROPIC_BASE_URL=http://localhost:8400 # 4. Watch the dashboard — it shows what autopilot would save + the proof open http://localhost:8400/modelpilot/dashboard # 5. Convinced? One flag (free during your trial). modelpilot gateway --mode autopilot --port 8400 # 6. After 7 days, keep it running with a license export MODELPILOT_LICENSE="<your license token>"
Your first 7 days are free — the trial starts on first run, no key needed,
full functionality; after that a license keeps it running. The offline
demo is always free. Your API keys pass through and are never stored;
the ledger holds token counts and routing metadata — no prompt text. Full details in the
README.