Pricing model
Public models are billed per token, unlike Compute, which bills per active GPU hour. Charges are based on the input token count and the output token count of every successful call.
Billing terms
| Term | Definition |
|---|---|
| Billable unit | Each input token and each output token of a successful chat-completion call. Token counts are not rounded; every token reported by the provider is billed. |
| Streaming | Streaming responses are billed on the same input/output token counts. The platform aggregates the final usage payload emitted by the provider at the end of the stream. |
| Failed calls | Calls that fail before any token is produced (provider timeout, authentication error, content-filter rejection) are not billed. Calls that produce partial output are billed on the produced tokens. |
| Ledger | Every public-model call is recorded as a ledger entry keyed under inference/public/<model-name>/<id> and grouped under the public-model resource group on the billing page. |
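
The billing terms above can be sketched as a small calculation. This is a minimal illustration, not the platform's implementation: the `Usage` and `Rate` types, the `billable_cost` and `ledger_key` helpers, and the per-token prices are all hypothetical. It also assumes that a call with partial output is billed on whatever usage the provider reported, input tokens included, which the table does not state explicitly.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Usage:
    """Token counts as reported by the provider (hypothetical shape)."""
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Rate:
    """Per-token prices; the values used with this type are made up."""
    input_per_token: Decimal
    output_per_token: Decimal


def billable_cost(usage: Optional[Usage], rate: Rate) -> Decimal:
    """Return the charge for one call.

    usage is None when the call failed before any token was produced
    (timeout, auth error, content-filter rejection): nothing is billed.
    Otherwise every reported token is billed, with no rounding.
    Assumption: partial-output calls arrive here with their reported usage.
    """
    if usage is None:
        return Decimal("0")
    return (usage.input_tokens * rate.input_per_token
            + usage.output_tokens * rate.output_per_token)


def ledger_key(model_name: str, call_id: str) -> str:
    """Build the ledger key in the documented inference/public/... layout."""
    return f"inference/public/{model_name}/{call_id}"
```

With a hypothetical rate of 0.000001 per input token and 0.000002 per output token, a call reporting 1000 input and 500 output tokens would cost 0.002 under this sketch, and a call that failed before producing any token would cost 0. `Decimal` is used instead of `float` so that small per-token prices sum exactly.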
Limits and availability
- Chat only: The current catalog supports chat completions. Embeddings and image generation are not exposed as public models yet; use Compute for those workloads.
- Provider failover: Models with more than one provider (e.g. deepseek-v3.2) are routed through the platform’s public-model router. The provider list is configured per model and tried in priority order; a 502 from the first provider triggers a retry on the next.
- Regional residency: Variants ending in -european are pinned to a single AWS region. The global variants can be served from any region the provider supports.
- No scheduling controls: Public models have no schedule, start/stop, or smart constraints. They are always available as long as the provider is healthy.
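
The provider failover behaviour described in the list above can be sketched roughly as follows. This is an assumption-laden illustration, not the actual public-model router: `ProviderError`, `Provider`, and `route` are hypothetical names, and the sketch assumes that a 502 from any provider (not only the first) moves on to the next one, and that any other error is raised immediately.

```python
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error carrying the HTTP status a provider returned."""

    def __init__(self, status: int) -> None:
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


# A provider is modelled as a callable taking a prompt and returning text.
Provider = Callable[[str], str]


def route(prompt: str, providers: Sequence[Provider]) -> str:
    """Try providers in priority order, failing over on HTTP 502."""
    last_error: Optional[ProviderError] = None
    for call in providers:
        try:
            return call(prompt)
        except ProviderError as err:
            if err.status != 502:
                # Only 502 is documented as triggering failover;
                # treating other statuses as fatal is an assumption.
                raise
            last_error = err
    if last_error is None:
        raise ValueError("no providers configured for this model")
    # Every provider returned 502: surface the last failure.
    raise last_error
```

The order of the `providers` sequence stands in for the per-model priority list, so changing the priority is a matter of reordering the sequence rather than changing the routing logic.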