Parallel Works

Spending on AI Is Easy. Governing It Is Hard.

Matthew Shaxted

Founder & CEO


Enterprises learned the hard way that buying GPUs is not the same as putting them to work. That lesson is now arriving one layer up, in tokens. The fix lives in the request path.

A budget that lasted four months

In late May, Fortune reported that Uber had spent its entire 2026 AI coding budget in four months. The company had rolled out Claude Code to its engineers and, to drive adoption, stood up an internal leaderboard that ranked teams by how much AI tooling they used. Adoption was not the problem. The problem surfaced when leadership tried to connect the spend to anything a customer would notice. Asked whether the rising AI bill was translating into shipped value, president and COO Andrew Macdonald put it plainly: "That link is not there yet."

This is not a story about one company overspending. A few days earlier, Fortune had also written about Microsoft's own reporting on how quickly agent token costs accumulate once a tool is in daily use, and how hard those costs are to forecast before the fact. The point is not that AI is too expensive to be worth it; for most of this work it is plainly cheaper than the alternative, and the unit economics keep improving. The point is narrower and more uncomfortable: organizations turned on AI tooling at the speed their engineers wanted, then discovered that consumption and value had quietly decoupled, and that nobody owned the number in between.

If that arc feels familiar, it should. It is the GPU story told again at a higher layer. A few years ago the lesson was that acquiring accelerators is a procurement exercise, while making them productive is a platform engineering problem, and that the gap between the two is where capital strands as idle hardware. Tokens invert the failure mode. The capital is not idle; it is being consumed enthusiastically. What gets stranded is the ability to say which spend produced value, for which team, against which budget, and to stop the spend that does not. Idle GPUs and runaway tokens are the same disease, an absence of governance between the purchase and the outcome, presenting with opposite symptoms.

Why tokens are harder to govern than GPUs

A GPU-hour is a forgiving unit. It is roughly constant, it is easy to meter, and an idle card announces itself in a utilization graph. A token is none of those things, and three properties of modern AI usage make token spend behave in ways that defeat the tools most organizations point at it.

Consumption scales non-linearly with adoption. The shift from chat assistants to agents changed the cost curve. An agent does not answer once; it reasons, calls a tool, reads the result, reasons again, and repeats, and every step re-sends the accumulated context. Gartner analysis this spring put agentic tasks at roughly five to thirty times the tokens of a comparable chatbot interaction, and a Microsoft Research study on agentic coding found that a single user request can fan out into ten or twenty model calls, with input tokens, not output tokens, driving most of the bill. Reasoning models add another multiplier, emitting long internal traces that the customer never sees but always pays for.
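To see why input tokens dominate, consider a rough sketch of an agent loop that re-sends its growing context on every step. The starting context of 2,000 tokens and the 500 tokens appended per step are illustrative assumptions, not figures from the Gartner or Microsoft work cited above.

```python
def agent_input_tokens(steps: int, base_context: int = 2_000, added_per_step: int = 500) -> int:
    """Total input tokens when every step re-sends the whole context so far."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context            # the full accumulated context is sent again
        context += added_per_step   # tool output and reasoning are appended to it
    return total


single_call = agent_input_tokens(1)
ten_steps = agent_input_tokens(10)
print(single_call, ten_steps, ten_steps / single_call)
```

Under these assumptions, ten steps consume about twenty-one times the input tokens of a single call, and the total grows roughly with the square of the step count rather than linearly.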

Falling unit prices do not rescue the budget. The per-token price of frontier models has dropped sharply, and it is tempting to assume the bill follows. It does not. When the cost per task falls and the number of tasks per user rises faster, total spend climbs even as the unit price collapses. Cheaper tokens, more of them, used more reflexively, is a recipe for a larger invoice, not a smaller one.
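The arithmetic is easy to check with made-up numbers. In this sketch the price per million tokens falls by 60 percent while task volume quadruples; every figure is an assumption chosen for illustration, not a quoted price.

```python
def monthly_spend(usd_per_million_tokens: float, tokens_per_task: int, tasks_per_month: int) -> float:
    """Monthly bill in dollars for a given unit price and volume."""
    return usd_per_million_tokens * tokens_per_task * tasks_per_month / 1_000_000


before = monthly_spend(10.0, 20_000, 50_000)    # higher unit price, fewer tasks
after = monthly_spend(4.0, 20_000, 200_000)     # 60% cheaper tokens, four times the tasks
print(before, after)
```

Here the unit price drops from $10 to $4 per million tokens and the monthly bill still rises from $10,000 to $16,000.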

The spend is invisible to the tools meant to watch it. This is the part that catches finance teams. The established cost-management stack (Apptio and Cloudability, Kubecost, CloudHealth) was built to attribute cloud infrastructure: instances, clusters, storage. None of it sees a token. It cannot tell you that the marketing team's retrieval pipeline spent more last week than the entire research org, because the data never reaches it. The FinOps Foundation's 2026 State of FinOps work named AI the fastest-growing new category of spend its members are being asked to manage, and reported that a large majority of practitioners saw AI costs exceed the projections they started the year with. Gartner's survey of data, analytics, and AI leaders found that fewer than half had put any financial guardrail around AI at all.

Put those three together and the conclusion is structural, not a matter of better dashboards. You cannot forecast token spend precisely, because it depends on how agents behave on inputs you have not seen yet. You cannot reconcile it monthly, because by the time the invoice arrives the money is gone. And you cannot attribute it with cloud-cost tooling, because that tooling is blind to the unit. Token governance has to happen where the tokens are spent, at the moment they are spent. It has to live in the request path.

Why a gateway, and why it has to sit inline

If governance has to happen there, then something has to occupy the request path. That something is an AI gateway: a service that every model call flows through, presenting a single endpoint to callers and brokering the request out to whichever provider should serve it. The architectural point is not the convenience of one endpoint, though that matters. It is that a component sitting inline can do two things a component watching from the side cannot: meter every request completely rather than sampling logs after the fact, and refuse a request before it reaches a model rather than discovering the overage at month end.

This is the role ACTIVATE AI plays. It exposes one OpenAI-compatible endpoint that fronts every provider an organization uses, commercial frontier APIs and private models alike, and it makes that endpoint the single point at which identity, budget, and policy are resolved. Because it is OpenAI-shaped, adopting it is a base-URL change rather than a rewrite. Existing SDKs, agent frameworks, and editor integrations such as Cursor, Continue, LangChain, and Open WebUI keep working, pointed at the gateway instead of directly at a vendor.

Figure 1. Every call enters the gateway, where the key resolves to an identity and an allocation, the budget is checked before the request reaches any model, and the response is metered on the way back.

The lifecycle of a single call makes the design concrete. A caller sends an ordinary chat-completions request carrying a gateway API key. The gateway authenticates that key and resolves it to a specific user, team, project, and allocation, because in ACTIVATE a key is not a bearer token to a vendor; it is bound to an allocation, and spend against it is scoped before the request goes anywhere. The gateway checks the allocation's remaining budget. If the budget is exhausted, the call is rejected with a payment-required error and never reaches a model, which is the difference between a guardrail and a report. If there is budget, the gateway routes the request to the resolved provider, streams the response back, counts input and output tokens, prices them against the organization's rate card, writes a usage event, and decrements the allocation.
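The lifecycle can be sketched in a few dozen lines. This is a toy model under stated assumptions, not ACTIVATE's implementation: the key store, rate card, `call_provider` stub, and the `invalid_key` error type are all hypothetical stand-ins held in memory.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    name: str
    budget_usd: float
    spent_usd: float = 0.0


@dataclass
class Identity:
    user: str
    team: str
    project: str
    allocation: Allocation


# Hypothetical in-memory stand-ins for the key store and the rate card.
ALLOCATION = Allocation(name="genai-research-q2", budget_usd=1.00)
KEYS = {"pw-demo-key": Identity("avery", "ml-platform", "genai-research", ALLOCATION)}
USD_PER_1K_TOKENS = {"input": 0.003, "output": 0.015}
USAGE_EVENTS: list[dict] = []


def call_provider(model: str, messages: list[dict]) -> tuple[str, int, int]:
    """Placeholder for the upstream call: returns (text, input_tokens, output_tokens)."""
    return "(model output)", 40_000, 10_000


def handle(key: str, model: str, messages: list[dict]) -> tuple[int, dict]:
    identity = KEYS.get(key)                       # 1. authenticate and resolve the key
    if identity is None:
        return 401, {"error": {"type": "invalid_key"}}
    alloc = identity.allocation
    if alloc.spent_usd >= alloc.budget_usd:        # 2. check budget before any model is called
        return 402, {"error": {"type": "budget_exceeded", "allocation": alloc.name}}
    text, tokens_in, tokens_out = call_provider(model, messages)    # 3. route to the provider
    cost = (tokens_in * USD_PER_1K_TOKENS["input"]
            + tokens_out * USD_PER_1K_TOKENS["output"]) / 1000      # 4. meter and price
    USAGE_EVENTS.append({"user": identity.user, "team": identity.team,
                         "project": identity.project, "allocation": alloc.name,
                         "model": model, "cost_usd": cost})         # 5. write a usage event
    alloc.spent_usd += cost                        # 6. decrement the allocation
    return 200, {"text": text}


statuses = [handle("pw-demo-key", "anthropic/claude-sonnet-4-5", [])[0] for _ in range(5)]
print(statuses)
```

In this toy version the fourth call is admitted because the allocation still has budget when the call arrives, and the fifth is refused. The last admitted call can overshoot slightly, which is the bounded worst case an inline check trades for never blocking on an unknown response size.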

Figure 2. A key is created with an expiry and a budget allocation, so it is bound to a budget before it can issue a single token.

From the developer's side, none of that is visible. The integration looks like any other OpenAI client:

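import os

from openai import OpenAI

# A scoped, allocation-bound gateway key, read from the environment.
PW_GATEWAY_KEY = os.environ["PW_GATEWAY_KEY"]

client = OpenAI(
    base_url="https://<your-org>.parallel.works/api/ai/v1",  # representative path
    api_key=PW_GATEWAY_KEY,
)

resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5",   # provider/model namespacing through one endpoint
    messages=[{"role": "user", "content": "Summarize this incident report."}],
    stream=True,
)

# With stream=True the response is an iterator of chunks; print text as it arrives.
for chunk in resp:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")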

The same call against a private model is a one-word change to the model field. And when an allocation is spent, the caller does not get a silent overage; it gets an explicit, machine-readable refusal:

HTTP/1.1 402 Payment Required
Content-Type: application/json
 
{
  "error": {
    "type": "budget_exceeded",
    "message": "Allocation 'genai-research-q2' has reached its token budget.",
    "team": "ml-platform",
    "project": "genai-research",
    "allocation": "genai-research-q2"
  }
}

The exact field set follows the API reference; the body above is representative, like the base URL in the earlier example.

That refusal is the whole argument for sitting inline. The overage a company would otherwise discover months later, after the budget is gone, becomes a 402 the first time a team crosses its ceiling, surfaced to the team that owns the budget while there is still one to defend.
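On the caller's side, a machine-readable refusal can be handled differently from a transient failure. A minimal sketch, assuming the representative body shown earlier; the helper name `explain_response` and the exact wording of its messages are hypothetical, and the real field set follows the API reference.

```python
import json

# The representative refusal body from the example above.
REPRESENTATIVE_BODY = """
{
  "error": {
    "type": "budget_exceeded",
    "message": "Allocation 'genai-research-q2' has reached its token budget.",
    "team": "ml-platform",
    "project": "genai-research",
    "allocation": "genai-research-q2"
  }
}
"""


def explain_response(status: int, body: str) -> str:
    """Turn a gateway response into a message a caller or agent loop can act on."""
    if status < 400:
        return "ok"
    if status != 402:
        return f"request failed with HTTP {status}"
    error = json.loads(body).get("error", {})
    if error.get("type") == "budget_exceeded":
        allocation = error.get("allocation", "unknown")
        team = error.get("team", "unknown")
        return (f"Budget exhausted for allocation '{allocation}' (team '{team}'); "
                "do not retry, contact the budget owner.")
    return "payment required"


print(explain_response(402, REPRESENTATIVE_BODY))
```

The design point is that a 402 is not worth retrying: an agent loop that treats it like a rate limit would spin without making progress, so the sketch surfaces the owning team and allocation instead.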

Metering as a pipeline, not a monthly reconciliation

Putting a gateway in the path solves capture. Turning captured requests into governance that finance and engineering can both trust takes a little more structure. ACTIVATE treats every consumable action, not just a token call, as an event flowing through four stages: capture, normalize, attribute, and enforce.

At capture, every request that touches the gateway emits a usage event. That event is rich on purpose. It records six things: identity (the user, team, project, allocation, and cost center the request belongs to); the model, named by provider and model with the deployment recorded, so that azure/gpt-4o and anthropic/claude-sonnet-4-5 are distinguishable line items; the token counts (input, output, and total); performance, including time to first token and total latency; the cost, computed at capture against the rate card in force; and the status (whether the call succeeded, errored, or was rejected for budget). Crucially, identity is stamped at capture, not reconstructed later by joining logs against a directory. You are never guessing after the fact whose spend this was.
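One way to picture such an event is as a flat, immutable record. The field names below are illustrative assumptions; the actual schema is whatever the API reference defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    # Identity, stamped at capture
    user: str
    team: str
    project: str
    allocation: str
    cost_center: str
    # Model, namespaced by provider, with the deployment recorded
    model: str
    deployment: str
    # Token counts
    input_tokens: int
    output_tokens: int
    # Performance
    time_to_first_token_ms: float
    total_latency_ms: float
    # Cost at the rate in force when the call happened
    cost_usd: float
    # "success", "error", or "budget_rejected"
    status: str

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


event = UsageEvent(
    user="avery", team="ml-platform", project="genai-research",
    allocation="genai-research-q2", cost_center="cc-1042",
    model="anthropic/claude-sonnet-4-5", deployment="default",
    input_tokens=1_200, output_tokens=300,
    time_to_first_token_ms=420.0, total_latency_ms=2_850.0,
    cost_usd=0.0081, status="success",
)
print(event.total_tokens)
```

Freezing the record reflects the point made above: identity and cost are fixed at capture and are not rewritten later.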

At normalize, raw counts become billable units through a per-organization rate card that can vary by cluster, partition, model, and provider. The same mechanism that prices a token prices a core-hour, which is what lets a token and a GPU-hour appear on the same statement in comparable terms.
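A minimal sketch of that idea: one lookup prices a token count and a GPU-hour alike. The keys, scopes, and rates here are hypothetical, and `Decimal` is used so the currency arithmetic stays exact.

```python
from decimal import Decimal

# Hypothetical per-organization rate card: (resource, scope) -> price per unit in USD.
RATE_CARD = {
    ("input_tokens", "anthropic/claude-sonnet-4-5"): Decimal("0.000003"),
    ("output_tokens", "anthropic/claude-sonnet-4-5"): Decimal("0.000015"),
    ("core_hours", "cluster-a/compute"): Decimal("0.05"),
    ("gpu_hours", "cluster-a/gpu"): Decimal("2.50"),
}


def price(resource: str, scope: str, quantity: int) -> Decimal:
    """Convert a raw count into dollars using the rate card."""
    return RATE_CARD[(resource, scope)] * quantity


statement = [
    ("input_tokens", price("input_tokens", "anthropic/claude-sonnet-4-5", 200_000)),
    ("output_tokens", price("output_tokens", "anthropic/claude-sonnet-4-5", 40_000)),
    ("gpu_hours", price("gpu_hours", "cluster-a/gpu", 4)),
]
total = sum(amount for _, amount in statement)
print(statement, total)
```

Because every line item comes out in the same unit, the token rows and the GPU row can be summed on one statement.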

At attribute, the stamped identity rolls the event up the hierarchy, so a single request contributes to its user's total, its team's total, its project's total, and its allocation's total at once. Reporting is therefore a read, not a reconstruction.
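The roll-up can be sketched as a single pass that credits each event to every level of its stamped identity. The event shape and names are illustrative assumptions.

```python
from collections import defaultdict


def attribute(events: list[dict]) -> dict[tuple[str, str], float]:
    """Sum cost per (level, name) so one event counts toward user, team, project, and allocation."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for event in events:
        for level in ("user", "team", "project", "allocation"):
            totals[(level, event[level])] += event["cost_usd"]
    return dict(totals)


events = [
    {"user": "avery", "team": "ml-platform", "project": "genai-research",
     "allocation": "genai-research-q2", "cost_usd": 0.25},
    {"user": "blake", "team": "ml-platform", "project": "genai-research",
     "allocation": "genai-research-q2", "cost_usd": 0.50},
]
totals = attribute(events)
print(totals[("user", "avery")], totals[("team", "ml-platform")])
```

No join against a directory is needed at report time, because each event already carries the identity it was captured with.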

At enforce, each request is gated at the gateway against the allocation's last settled budget state, while a reconciliation sweep aggregates accumulated spend across usage events on an interval an administrator configures. The interval matters more than it first appears. Reconciling monthly bounds your worst-case overage to a month of unconstrained spend, which is precisely how a year's budget evaporates in four months. Enforcing on a tight interval, a few minutes rather than a billing cycle, bounds the worst case to roughly one interval of consumption. The guarantee ACTIVATE offers is not that spend is predicted perfectly; it is that an exhausted allocation cannot keep spending for longer than that interval before the gateway stops honoring it.
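The bound is simple arithmetic: worst-case overage is roughly the burn rate multiplied by the reconciliation interval. The $200-per-hour burn rate below is an assumed figure for illustration only.

```python
def worst_case_overage(burn_rate_usd_per_hour: float, interval_minutes: float) -> float:
    """Upper bound on spend past the budget before the next sweep catches it."""
    return burn_rate_usd_per_hour * interval_minutes / 60


monthly = worst_case_overage(200, 30 * 24 * 60)   # reconcile once per 30-day month
five_minutes = worst_case_overage(200, 5)         # reconcile every five minutes
print(monthly, round(five_minutes, 2))
```

At the same burn rate, a monthly cycle leaves up to $144,000 exposed, while a five-minute interval leaves under $17.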

One ledger for tokens, GPU-hours, and storage

Most of what is described above, an inline gateway with virtual keys, per-key budgets, and token accounting, is the established shape of the AI-gateway category, and several tools implement parts of it well. LiteLLM issues virtual keys and supports budgets at multiple time scales, for instance a daily and a monthly cap on the same key. Portkey lets you set a dollar budget that expires a key when reached, and a separate token ceiling. These are real capabilities, and for a team whose only concern is frontier-API token spend they may be enough.

The structural limit shows up the moment the question widens. A pure AI gateway meters tokens because tokens are all that flow through it. It has no concept of a GPU-hour, an HPC core-hour, or a gigabyte-month, because those resources never pass through its request path, so it can report a team's API spend but never set it beside what that team's training jobs cost on the cluster. The cloud FinOps tools have the mirror-image blind spot: they see the cluster and the storage and are blind to the tokens. Neither can produce a single bill, because neither owns both halves of the spend.

ACTIVATE can, for an unglamorous reason: it owns the compute layer. The same control plane that schedules a Slurm job and provisions a Kubernetes namespace also runs the AI gateway, so the same ledger that records a job's core-hours records a call's tokens. Tokens, GPU-hours, node-hours, and storage are all just metered resources entering the same four-stage pipeline, and they exit it as one chargeback view: this team, this project, this allocation, this much spent across AI and compute and storage together.

Figure 3. Tokens, compute, and storage enter the same metering pipeline and leave it as one bill, attributed to the same identities and capped by the same allocations.

That single ledger is also where an organization sets its own prices, not only where it reads the provider's. The rate card is the control point. Each model, public or private, can be priced at the provider's raw cost, at cost plus an internal overhead, or at a promotional rate for a strategic team, and every usage record reflects whatever rate was in force when the call happened. Because the card is versioned, a price change applies from a known date forward while historical records keep the rate they were captured at, so last quarter's chargeback does not move when this quarter's prices do. This is how a platform or shared-services group funds itself: it meters consumption across every provider, applies its own rate card on top, and bills internal customers a blended figure per team and project that they recognize, rather than handing them a stack of vendor invoices nobody can reconcile. The same control point lets a research organization recover overhead on a private cluster and a commercial API through one mechanism, in one currency, on one statement.
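A versioned rate card can be sketched as a sorted list of effective dates with a lookup for the version in force on a given day. The dates, provider cost, and 20 percent overhead are hypothetical values, and this is one possible structure rather than a description of how ACTIVATE stores its card.

```python
from bisect import bisect_right
from datetime import date
from decimal import Decimal

# Hypothetical versions for one model: (effective_from, provider cost per million tokens, markup).
VERSIONS = [
    (date(2026, 1, 1), Decimal("3.00"), Decimal("0.00")),   # billed at raw provider cost
    (date(2026, 4, 1), Decimal("3.00"), Decimal("0.20")),   # cost plus 20% internal overhead
]
EFFECTIVE_DATES = [version[0] for version in VERSIONS]


def rate_on(day: date) -> Decimal:
    """Internal price per million tokens under the version in force on `day`."""
    index = bisect_right(EFFECTIVE_DATES, day) - 1
    if index < 0:
        raise ValueError(f"no rate card version in force on {day}")
    _, provider_cost, markup = VERSIONS[index]
    return provider_cost * (1 + markup)


print(rate_on(date(2026, 3, 31)), rate_on(date(2026, 4, 1)))
```

A call captured on March 31 keeps the earlier rate and a call on April 1 picks up the new one, which is why last quarter's chargeback stays fixed when prices change.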

Figure 4. The AI Usage view rolls spend and tokens up by model, user, and allocation over a chosen window. Figures shown are from a demonstration organization.

The same divide shows up against every adjacent tool category, row by row:

| Capability | ACTIVATE | Pure AI gateways (LiteLLM, Portkey, and similar) | Cloud FinOps tools (Kubecost, Apptio, CloudHealth) | CSP-native |
| --- | --- | --- | --- | --- |
| Meters LLM tokens per request | Yes | Yes | No | Partial |
| Hard pre-request budget enforcement | Yes | Partial | No | No |
| Meters GPU-hours, HPC core-hours, storage | Yes | No | Yes (cloud only) | Partial |
| One bill across tokens and compute | Yes | No | No | No |
| One rate card and markup across tokens and compute | Yes | No | No | No |
| Multi-cloud and provider-agnostic | Yes | Yes | Partial | No |
| Private, on-prem, and IL5 deployment | Yes | No | No | Partial |

The "partial" entries are deliberate. Several gateways enforce budgets in the sense of expiring a key or returning a 429 once a limit is crossed, which is genuine and useful; the gap is that enforcement covers tokens only and lives outside any compute accounting. Several also expose custom per-token pricing, which helps with showback, but it prices tokens alone and cannot reach the GPU-hours and storage on the other half of the bill. The CSP-native column refers to the cloud providers' own metering, such as AWS Bedrock or Google Vertex, which stays bound to a single provider's models. The point of the table is not that the other tools are bad at what they do. It is that AI governance has been sold in slices, and an organization running both an HPC estate and a fleet of agents ends up reconciling three systems that each see a third of the picture.

The other fragmentation: capacity, not just cost

Cost is the fragmentation that finance feels. There is a second one that platform teams feel, and the gateway addresses it with the same machinery. Commercial APIs fragment an organization along financial lines, a separate bill, key, and audit trail per vendor. Private models, the ones an organization runs on GPUs it rented or bought, fragment along capacity lines instead. Two teams sharing a self-hosted endpoint contend for the same accelerators, and without a fair-share mechanism the loudest workload wins.

Because ACTIVATE provisions private inference into the same schedulers and namespaces it governs, the allocation that caps a team's frontier-API tokens also governs its share of a private deployment. An organization registers a private model served by vLLM, Ollama, TGI, or NVIDIA NIM, and it appears behind the same gateway endpoint as the commercial providers, subject to the same keys, budgets, and audit trail. The weights never leave the environment, which is the property that makes the pattern usable in regulated and classified settings. ACTIVATE provides the surrounding stack, current NVIDIA drivers, a CUDA toolkit, NCCL tuned to the interconnect, and container runtimes, as part of the base image. The model-specific toolchains, PyTorch, vLLM, and the serving frameworks, are left to the team to install on the shared filesystem, because production teams reasonably want to pin those themselves.

Figure 5. Registering a provider. Any OpenAI-compatible endpoint, Azure OpenAI, or a private model joins the same gateway behind one endpoint.

The serving foundation is open source. The retrieval-augmented vLLM stack that backs ACTIVATE's private deployments, vLLM for high-throughput GPU inference paired with a vector store and an OpenAI-compatible front end, is published as a workflow you can read and run yourself. You are not asked to take the gateway's behavior on faith; the inference layer underneath it is inspectable. Operating it day to day stays in the same vocabulary as the rest of the platform:

$ pw ai providers list          # registered providers, commercial and private
$ pw ai models list             # models available to you, by provider
$ pw ai chats                   # your governed conversations from the shell

Figure 6. The same governed surface from the shell. The CLI lists every model available to the caller, commercial and private, by provider.

The CLI, the OpenAI-compatible API, and the Go, Python, and TypeScript SDKs are three views of the same governed surface, so automation, agent frameworks, and interactive use all route through the same enforcement and land in the same ledger.

Why this gets more important, not less

The direction of travel makes a governance layer in the request path harder to treat as optional. Gartner expects worldwide AI spending to grow on the order of forty-seven percent in 2026, with generative AI a leading contributor, and the center of gravity is shifting from training to inference as organizations move from experiments to production. The same analysts project that a large share of enterprise applications will embed agents by the end of the year, up from a small fraction the year before. Agents are exactly the workloads whose token consumption is hardest to predict and easiest to let run, which means the gap between adoption and governance that caught Uber will be a more common place to fall, not a rarer one.

For organizations that already run serious compute, national labs, federal programs, research enterprises, the case is sharper still, because AI did not replace their HPC; it sat on top of it. ACTIVATE runs across AWS, Azure, Google Cloud, and Oracle Cloud, on-premises, and in air-gapped environments, and the GovCloud boundary operates under an HPCMP authorization at IL5, FedRAMP High aligned, with private and sovereign deployment supported. It runs in production across federal, defense, and research organizations, governing Slurm jobs and LLM calls through one control plane. For those teams, a token gateway that cannot also see the cluster would just be a fourth console to reconcile.

The lesson of the GPU era was that capacity without an operating model strands capital. The lesson arriving now is the same one at a finer grain: access to every model without governance over every token strands it faster, because tokens are spent in milliseconds and agents spend them in loops. The fix is not to spend less on AI. It is to put the spend in the request path, bind it to an identity and an allocation, meter it beside the compute it runs on, and refuse it when the budget is gone, so that the question Uber's COO could not answer, what did we get for it, has an owner and a number long before the budget runs out.


Matthew Shaxted is the CEO of Parallel Works, which he co-founded in 2015 as a spin-out of Argonne National Laboratory. Parallel Works builds ACTIVATE, a control plane that governs HPC and AI workloads across clouds and on-premises infrastructure. The retrieval-augmented vLLM stack referenced here is open source; the ACTIVATE AI gateway is documented at parallelworks.com/activate/ai. He can be reached at shaxted@parallelworks.com.
