The first question every engineering and finance leader asks. Here's the straight answer: allocating what you've already spent is exact; forecasting future work is a measured range, and we backtest it on your own data instead of quoting a generic number.
Conflating them is how vendors lose credibility. We keep them separate:
Token counts come straight from your provider's admin/usage API, and we cost them with a cache-aware model: cache reads bill at roughly 0.1× and cache writes at roughly 1.25× of the base input rate. That is the detail naive trackers miss, and it puts them off by 5–10× on agentic workloads. So the dollars are computed, not estimated. The only open question on the backward-looking side is coverage, meaning the fraction of spend we could tie to a real ticket. We report that number plainly, including what we couldn't attribute (reconciled to your invoice, never dropped).
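To make the cache-aware arithmetic concrete, here is a minimal Python sketch. The 0.1× and 1.25× multipliers are the ones quoted above. Everything else is hypothetical and chosen only for illustration: the `Usage` fields, the function names, the prices ($3 and $15 per million tokens), and the token counts. None of it is taken from a provider's price list or from Outlay's implementation.

```python
from dataclasses import dataclass

# Multipliers quoted in the text; real ratios vary by provider and model.
CACHE_READ_MULT = 0.1
CACHE_WRITE_MULT = 1.25


@dataclass
class Usage:
    input_tokens: int        # uncached input tokens
    cache_read_tokens: int   # input tokens served from the prompt cache
    cache_write_tokens: int  # input tokens written to the prompt cache
    output_tokens: int


def cache_aware_cost(u: Usage, input_price: float, output_price: float) -> float:
    """Dollar cost; prices are dollars per million tokens (hypothetical)."""
    per_in = input_price / 1_000_000
    per_out = output_price / 1_000_000
    return (
        u.input_tokens * per_in
        + u.cache_read_tokens * per_in * CACHE_READ_MULT
        + u.cache_write_tokens * per_in * CACHE_WRITE_MULT
        + u.output_tokens * per_out
    )


def naive_cost(u: Usage, input_price: float, output_price: float) -> float:
    """What a tracker reports if it bills every input token at the base rate."""
    all_input = u.input_tokens + u.cache_read_tokens + u.cache_write_tokens
    return all_input * input_price / 1_000_000 + u.output_tokens * output_price / 1_000_000


def coverage(attributed_dollars: float, invoice_dollars: float) -> float:
    """Fraction of invoiced spend tied to a ticket; the rest stays visible as unattributed."""
    return attributed_dollars / invoice_dollars


# A made-up agentic session that rereads a large cached context many times.
session = Usage(input_tokens=20_000, cache_read_tokens=2_000_000,
                cache_write_tokens=50_000, output_tokens=10_000)

aware = cache_aware_cost(session, input_price=3.0, output_price=15.0)
naive = naive_cost(session, input_price=3.0, output_price=15.0)
print(f"cache-aware: ${aware:.4f}")      # cache-aware: $0.9975
print(f"naive:       ${naive:.4f}")      # naive:       $6.3600
print(f"naive / aware: {naive / aware:.1f}x")  # naive / aware: 6.4x
print(f"coverage: {coverage(850.0, 1000.0):.0%}")  # coverage: 85%
```

In this made-up session the cached reads dominate the token count, which is why the naive figure lands several times above the cache-aware one.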
Per-item AI cost is heavy-tailed — the same "bugfix" might be one cheap call or a 40-turn agentic loop. So we never hand you a single number for a single ticket; we give a p10–p90 band. But here's the key: the aggregate is far more predictable than any one item. Across a quarter of work, the per-item over- and under-shoots partially cancel (that's why our roadmap total is variance-pooled, not a naive sum of worst cases). Realistically:
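The pooling effect can be illustrated with a small simulation on synthetic data. This is a sketch under stated assumptions, not measured results and not Outlay's model: item costs are drawn from a lognormal distribution (a stand-in for "heavy-tailed"), items are treated as independent (correlated items would cancel less), and the item count, seed, and helper names are arbitrary.

```python
import random


def percentile(sorted_vals, q):
    """Linear-interpolation percentile on a pre-sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def band(samples):
    """Return the (p10, p50, p90) of a list of samples."""
    s = sorted(samples)
    return percentile(s, 0.10), percentile(s, 0.50), percentile(s, 0.90)


rng = random.Random(0)   # fixed seed so the run is repeatable
N_ITEMS = 50             # hypothetical number of tickets in a quarter
N_SIMS = 5_000           # simulated quarters


def item_cost():
    # Assumption: heavy-tailed per-item cost, modelled as lognormal(0, 1).
    return rng.lognormvariate(0.0, 1.0)


single = [item_cost() for _ in range(N_SIMS)]
totals = [sum(item_cost() for _ in range(N_ITEMS)) for _ in range(N_SIMS)]

p10_s, p50_s, p90_s = band(single)
p10_t, p50_t, p90_t = band(totals)

# Width of the p10-p90 band relative to the median, per item vs. per quarter.
rel_single = (p90_s - p10_s) / p50_s
rel_total = (p90_t - p10_t) / p50_t

# "Naive sum of worst cases": every item assumed to land at its own p90.
naive_worst_case = N_ITEMS * p90_s

print(f"single item band / median:   {rel_single:.2f}")
print(f"quarter total band / median: {rel_total:.2f}")
print(f"pooled p90 of total: {p90_t:.1f} vs naive worst-case sum: {naive_worst_case:.1f}")
```

Under these assumptions the quarter total has a much narrower relative band than any single item, and its p90 sits well below the sum of per-item p90s.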
This is the part that matters. Outlay backtests its own forecast on your closed tickets (leave-one-out cross-validation) and reports the median error before you trust it. If the forecast isn't good enough on your data, you'll see that in week one — we won't hide it. Most tools quote a savings or accuracy figure from someone else's deployment; we show you the number from yours.
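As a rough illustration of what a leave-one-out backtest measures, the sketch below holds out each closed ticket in turn, predicts its cost from the remaining tickets, and reports the median absolute percentage error. The predictor used here (the median cost of other tickets with the same work type) is a deliberately simple stand-in, not Outlay's forecasting model, and the ticket data and function name are hypothetical.

```python
from statistics import median


def loo_median_error(tickets):
    """Leave-one-out backtest.

    tickets: list of (work_type, actual_cost_in_dollars).
    Returns the median absolute percentage error as a fraction,
    or None if no ticket could be scored.
    """
    errors = []
    for i, (work_type, actual) in enumerate(tickets):
        # Every other ticket of the same work type; the held-out one is excluded.
        peers = [cost for j, (wt, cost) in enumerate(tickets)
                 if j != i and wt == work_type]
        if not peers or actual <= 0:
            continue  # nothing to predict from, or nothing to divide by
        predicted = median(peers)  # stand-in predictor (assumption)
        errors.append(abs(predicted - actual) / actual)
    return median(errors) if errors else None


# Made-up closed tickets: (work type, actual AI cost in dollars).
closed = [
    ("bugfix", 1.0),
    ("bugfix", 2.0),
    ("bugfix", 4.0),
    ("feature", 10.0),
    ("feature", 20.0),
]

err = loo_median_error(closed)
print(f"median absolute error: {err:.1%}")  # median absolute error: 62.5%
```

Because each prediction is made without seeing the ticket it is scored against, the reported error reflects how the forecast would have done on work it had not yet seen.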
To budget against a roadmap, you need to cost work that has no history of its own. The estimate is only as good as what you feed it, so Outlay reads three things:
From those, each planned item is classified, sized, and placed within your team's own historical cost range for that work type. Each estimate carries a confidence that rises with the input: high with story points plus a fitted size model; medium when sized from requirements and design docs; low, or declined outright, on a bare title or a work type with no history. Outlay also tells you what to add to tighten the estimate. So a sprint or epic becomes a compute budget with an honest confidence interval, and the more scope you give it, the tighter that interval gets.
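The confidence tiers described above can be sketched as a small rule. This is a hedged reading of the prose, not Outlay's actual logic: the `PlannedItem` fields, the `grade` function, and the wording of the hints are hypothetical, and the handling of a case the text does not cover (story points present but no fitted size model and no docs) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW_OR_DECLINED = "low / declined"


@dataclass
class PlannedItem:
    title: str
    work_type_has_history: bool
    story_points: Optional[int] = None
    has_requirements_or_design_docs: bool = False


def grade(item: PlannedItem, size_model_fitted: bool) -> Tuple[Confidence, List[str]]:
    """Return a confidence tier plus hints on what input would tighten it."""
    hints: List[str] = []
    if not item.work_type_has_history:
        # No history for this work type: nothing to place the item against.
        hints.append("closed tickets of this work type")
        return Confidence.LOW_OR_DECLINED, hints
    if item.story_points is not None and size_model_fitted:
        return Confidence.HIGH, hints
    if item.story_points is None:
        hints.append("story points")
    if item.has_requirements_or_design_docs:
        return Confidence.MEDIUM, hints
    # Bare title. Assumption: points without a fitted size model also land here.
    hints.append("requirements or design docs")
    return Confidence.LOW_OR_DECLINED, hints


backlog = [
    PlannedItem("Fix login redirect", work_type_has_history=True, story_points=3),
    PlannedItem("CSV export", work_type_has_history=True,
                has_requirements_or_design_docs=True),
    PlannedItem("Misc cleanup", work_type_has_history=True),
]

for item in backlog:
    tier, hints = grade(item, size_model_fitted=True)
    print(f"{item.title}: {tier.value}; add: {', '.join(hints) or 'nothing'}")
```

Returning the hints alongside the tier mirrors the idea that a low-confidence estimate should say what input would improve it.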
More accurate: more history per work type · stable workflows and model mix · estimating at the quarter/team level rather than per ticket · good attribution coverage · story points on your backlog (size conditioning measurably tightens the estimate).
Less accurate, and the biggest factor is change: AI usage per developer has grown roughly 18× in well under a year, and the error band of a forecast trained on history widens when the regime shifts (a new agent adopted, autonomy turned up, a vendor price change). We're upfront that a fast-moving team's forecast carries a wider band, and the backtest will show it.
We won't promise a crystal ball. What we promise is the first real forward visibility you have: a quarter budget with a measured error bar, and a flag on the epic about to blow its estimate weeks before the invoice. Even a ±20% forecast that warns you early beats the blank page you have today, and unlike most tools, we tell you exactly how good it is on your numbers.
In a two-week pilot we backtest the forecast on your real history and show you the error before you trust a dollar of it.