Methodology

How your cost estimate is calculated

The Goldfinch Agent Cost Grader produces pre-deployment cost scenarios based on published provider pricing, agent-type benchmarks, and your architecture choices. This page explains the model in full.

This grader produces estimates, not actuals. Estimates are based on provider pricing as of April 2026 and behavioral benchmarks derived from documented agent patterns. Your actual costs will differ based on runtime behavior. The purpose is to inform budget planning before spend begins — not to predict a billing invoice to the dollar.

1. What your inputs represent

The grader collects five categories of input. Each maps directly to a cost variable in the model.

Input and what it controls in the model:

Agent type(s): Selects the token benchmark profile (input and output token ranges per task, conservative to upside). A support agent has a different profile than a code agent.
Tasks per month: Scales the per-task cost to a monthly total and an annual projection. Ranges are used (not point estimates) to reflect the variability of agent workloads.
Model / provider: Sets the per-token pricing from published provider documentation. The model tier is the single largest driver of absolute cost; a 10× difference between tiers is common.
Caching policy: Applies a reduction multiplier to input token costs. Partial caching reduces average input billing by ~18%; aggressive caching by ~45%. Source: Anthropic and OpenAI pricing documentation, April 2026.
Tool calls and retrieval rate: Adds an overhead multiplier. Tool calls generate additional input tokens (function definitions, results) and output tokens (call payloads). Overhead ranges from +5% (rarely) to +35% (almost always).
Human-in-the-loop rate: Adds a cost premium representing review overhead: latency buffers, retry patterns on rejected completions, and additional passes. Ranges from +3% (low HITL) to +15% (high HITL).
Retry configuration: Sets the Agentic Resource Exhaustion risk factor and the ARE variance buffer applied to your budget envelope. No retry controls is the highest-risk configuration.
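
The way these inputs combine can be sketched as a per-task cost formula. This is an illustrative sketch, not the grader's actual implementation: the function name, token counts, and prices below are all hypothetical placeholders.

```python
def per_task_cost(
    input_tokens: int,            # from the agent-type benchmark profile
    output_tokens: int,
    price_in_per_m: float,        # provider price, USD per 1M input tokens
    price_out_per_m: float,       # provider price, USD per 1M output tokens
    cache_multiplier: float = 1.0,  # ~0.82 partial (~18% off), ~0.55 aggressive (~45% off)
    tool_overhead: float = 0.0,     # +0.05 (rarely) .. +0.35 (almost always)
    hitl_premium: float = 0.0,      # +0.03 (low HITL) .. +0.15 (high HITL)
) -> float:
    """Combine the five input categories into one per-task USD cost."""
    input_cost = input_tokens / 1e6 * price_in_per_m * cache_multiplier
    output_cost = output_tokens / 1e6 * price_out_per_m
    return (input_cost + output_cost) * (1 + tool_overhead) * (1 + hitl_premium)

# A support-agent-style task at made-up pricing:
cost = per_task_cost(
    input_tokens=3_000, output_tokens=800,
    price_in_per_m=3.00, price_out_per_m=15.00,
    cache_multiplier=0.82, tool_overhead=0.10, hitl_premium=0.03,
)
```

Multiplying a per-task figure like this by the monthly task range is what produces the monthly and annual projections.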

2. Pricing sources

All per-token pricing is sourced directly from provider pricing pages. We do not use aggregator sites, community estimates, or third-party databases. Prices are updated manually when providers publish changes.

Sources — April 2026

OpenAI pricing page — input/output rates, cached input rates, Batch API discount
Anthropic pricing page — model tier rates, prompt caching write and read multipliers
Google Gemini API pricing — input/output rates, context caching storage, thinking token output pricing
Mistral and Cohere pricing pages — where models are included in the grader

Reasoning models carry a 2× output token multiplier, consistent with provider documentation describing how thinking tokens are billed as output tokens.

Batch API pricing is available at approximately 50% of synchronous pricing for OpenAI and Anthropic — the grader surfaces this as a Shift-Left Costing lever where applicable.
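
The two adjustments above can be expressed as a small helper. This is a sketch under the model's stated assumptions (2× output for reasoning models, ~50% for batch); the function name is hypothetical and the rates in the example are placeholders, not actual provider prices.

```python
def effective_rates(price_in_per_m: float, price_out_per_m: float,
                    reasoning: bool = False, batch: bool = False) -> tuple[float, float]:
    """Return (input, output) USD per 1M tokens after adjustments."""
    inp = price_in_per_m
    out = price_out_per_m * (2.0 if reasoning else 1.0)  # thinking tokens billed as output
    if batch:                                            # Batch API: ~50% of synchronous pricing
        inp *= 0.5
        out *= 0.5
    return inp, out

# Hypothetical base rates of $3 / $15 per 1M tokens:
rates = effective_rates(3.00, 15.00, reasoning=True, batch=True)
```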

3. Agent type benchmarks

Each agent type carries a token benchmark profile — a range of input and output token counts per task, from conservative to upside.

Agent type and profile rationale:

Support agent: Customer-facing, moderate context window, structured outputs. Relatively predictable per-task token volume.
Document agent: High input token volume; full document ingestion per task. Output is typically structured and constrained.
Code agent: High output token volume; code generation produces longer completions. Input includes context (files, instructions, history).
Ops agent: Multi-step workflows with tool calls. Moderate per-step token count but higher task counts and retry exposure.
Research agent: Highest token volume; iterative retrieval, large context windows, extensive output synthesis.

When multiple agent types are selected, the model blends the benchmark profiles proportionally. The blended benchmark is shown in the grader before you confirm your estimate.
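
Proportional blending can be sketched as a weighted average over the selected profiles. The token figures below are illustrative, not the grader's real benchmarks, and the function name is hypothetical.

```python
def blend_profiles(profiles: dict[str, tuple[int, int]],
                   weights: dict[str, float]) -> tuple[float, float]:
    """Weighted average of (input_tokens, output_tokens) per task."""
    total = sum(weights.values())
    inp = sum(profiles[name][0] * w for name, w in weights.items()) / total
    out = sum(profiles[name][1] * w for name, w in weights.items()) / total
    return inp, out

# 75% support-agent tasks, 25% code-agent tasks (made-up token counts):
profiles = {"support": (3_000, 800), "code": (6_000, 2_500)}
blended = blend_profiles(profiles, {"support": 0.75, "code": 0.25})
```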


4. The three scenario bands

The grader always produces three cost scenarios. A single-point estimate is not produced — it would misrepresent the uncertainty inherent in pre-deployment agent economics.

Rainy Day (conservative estimate): Higher token counts, lower cache hit rates, elevated tool call overhead. Plans for behavior at the high end of your input range.

Baseline (most likely): Mid-range token volumes, expected cache performance, average tool call rates. The planning anchor for budget conversations.

Blue Sky (optimized estimate): Lower token counts, strong cache performance, minimal overhead. Represents the outcome of well-executed policy choices.

The budget envelope is the Baseline monthly cost plus an ARE variance buffer — a percentage added to account for cost unpredictability introduced by your retry and architecture configuration. The buffer ranges from 10% to 35% depending on your ARE risk level.
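
A minimal sketch of how the three bands and the envelope might be assembled, assuming per-task cost estimates at the low, mid, and high end of the input ranges. All names and figures here are illustrative, not the grader's internals.

```python
def scenario_bands(per_task: dict[str, float],
                   tasks_low: int, tasks_high: int) -> dict[str, float]:
    """Three monthly cost scenarios from per-task costs and a task range."""
    tasks_mid = (tasks_low + tasks_high) / 2
    return {
        "rainy_day": per_task["high"] * tasks_high,  # conservative
        "baseline":  per_task["mid"] * tasks_mid,    # most likely
        "blue_sky":  per_task["low"] * tasks_low,    # optimized
    }

def budget_envelope(baseline_monthly: float, are_buffer: float) -> float:
    """Baseline monthly cost plus the ARE variance buffer (0.10 to 0.35)."""
    return baseline_monthly * (1 + are_buffer)

bands = scenario_bands({"high": 0.05, "mid": 0.03, "low": 0.015},
                       tasks_low=80_000, tasks_high=120_000)
envelope = budget_envelope(bands["baseline"], are_buffer=0.20)  # medium-risk config
```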


5. Agentic Resource Exhaustion (ARE) risk score

The ARE risk score reflects exposure to uncontrolled cost escalation from agent behavior. See the full ARE definition for background. The score is determined by retry configuration, tool call rate, and human review rate.

Low risk: Robust retry controls with exponential backoff and per-session spend caps. Tool calls are infrequent. ARE variance buffer: 10%.

Medium risk: Basic retry controls or an undocumented retry policy. Moderate tool call rate. Some unbounded execution paths. ARE variance buffer: 20–25%.

High risk: No retry logic configured, or fully automated execution with a high tool call rate and no human review. ARE variance buffer: 30–35%.
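
The three tiers above can be sketched as a classification over the same configuration inputs. The thresholds and function name are hypothetical illustrations of the tier definitions, not the grader's exact rules.

```python
def are_risk(has_backoff: bool, has_spend_caps: bool,
             tool_call_rate: str, human_review: bool) -> tuple[str, float]:
    """Map a retry/tooling configuration to (risk level, variance buffer)."""
    if has_backoff and has_spend_caps and tool_call_rate == "low":
        return "low", 0.10
    if not (has_backoff or has_spend_caps) and tool_call_rate == "high" and not human_review:
        return "high", 0.35
    return "medium", 0.225  # midpoint of the 20-25% band

level, buffer = are_risk(has_backoff=False, has_spend_caps=False,
                         tool_call_rate="high", human_review=False)
```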

6. Shift-Left Costing readiness score

The Shift-Left Costing (SLC) readiness score (0–100) measures how much of your agent's cost profile has been considered and governed at the architecture stage — before deployment.

The score rewards five policy choices: prompt caching adoption, model tier selection discipline, retry and spend controls, output verbosity constraints, and appropriate human review routing.

A higher SLC score correlates with a narrower spread between your Rainy Day and Blue Sky scenarios — meaning your cost model is more predictable and your budget conversations with finance are more defensible.
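
One way to picture a score built from five policy choices is a simple weighted checklist. The weights and policy names below are hypothetical; the grader's actual rubric is not published here.

```python
# Hypothetical weights over the five policy choices (sum to 100).
SLC_WEIGHTS = {
    "prompt_caching": 25,
    "model_tier_discipline": 20,
    "retry_and_spend_controls": 25,
    "output_verbosity_constraints": 15,
    "human_review_routing": 15,
}

def slc_score(policies: dict[str, bool]) -> int:
    """Sum the weights of the policy choices that are in place (0-100)."""
    return sum(w for name, w in SLC_WEIGHTS.items() if policies.get(name))

score = slc_score({"prompt_caching": True, "retry_and_spend_controls": True})
```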


7. Known limitations

This model does not account for egress, storage, or infrastructure costs. Provider pricing for vector databases, object storage, compute, and network egress is not included. For most agent programs, LLM API costs are the dominant variable cost, but at scale, storage and infrastructure costs become material.

Multi-agent coordination overhead is not modeled at the individual agent level. If you are running orchestrated multi-agent pipelines, inter-agent context exchange can multiply per-task token volume by 10–50× compared to single-agent execution. The tool call overhead multiplier partially captures this, but orchestration cost is not explicitly modeled.

Pricing changes without notice. Provider pricing changes with some frequency. We update the grader's pricing data manually when changes are published, but there may be a lag between a provider update and our model reflecting it. The report notes the pricing date.

Behavioral variability is real. Agent token consumption is behavior-driven, not seat-based. Actual costs depend on how users interact with your agents, what tasks are submitted, and how the agent reasons through them. Pre-deployment estimates will diverge from actuals — the goal is a defensible planning range, not a precise forecast.

Methodology version: April 2026
Pricing sources: provider documentation, April 2026
Goldfinch Economics LLC

Score your agent costs before you deploy

Free. No account required. Takes 3 minutes.
Open the Cost Grader →