
🎯 Token-based Billing Systems Design

1️⃣ Core Framework

When discussing token-based billing systems, I frame the design around:

Why token-based billing exists
Token metering
Request lifecycle accounting
Pricing and rating engine
Usage aggregation
Quota and budget enforcement
Billing correctness and idempotency
Trade-offs: accuracy vs latency vs cost

2️⃣ Why Token-based Billing Exists

LLM cost is closely tied to token usage.

A request may consume one or more of the following token types:

Input tokens
Output tokens
Cached tokens
Tool-call tokens
Embedding tokens
Reasoning tokens
Multimodal tokens

Basic Billing Flow

User Request
→ Count Input Tokens
→ Run Inference
→ Count Output Tokens
→ Calculate Cost
→ Store Usage Record
→ Aggregate Bill

👉 Interview Answer

Token-based billing is used because LLM infrastructure cost is strongly related to token usage.

The system meters input tokens, output tokens, model type, cached tokens, and other usage dimensions, then converts usage into cost using a pricing engine.

3️⃣ What Is a Token?

Token Definition

A token is a unit of text processed by the model.

It can be:

A word
Part of a word
Punctuation
Whitespace
Special token

Example

"Hello world"
→ ["Hello", " world"]

Why It Matters

Token count affects:

Latency
Cost
Context window usage
GPU memory
Inference time

👉 Interview Answer

A token is the unit of text that the model processes.

Billing is based on tokens because token count directly affects compute, latency, memory, and cost.
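
To make the "Hello world" example concrete, here is a toy splitter. It is only an illustration: real models use learned subword vocabularies, so the regex and the function name `toy_tokenize` are assumptions, and real token counts will differ.

```python
import re

# Toy rule (an assumption, not a real tokenizer): a word keeps one leading
# space, and each punctuation mark becomes its own token.
TOKEN_PATTERN = re.compile(r" ?\w+|[^\w\s]")


def toy_tokenize(text: str) -> list[str]:
    """Split text into word-like and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = toy_tokenize("Hello world")
print(tokens, len(tokens))
```

The point for billing is that the billable unit is the token count, not the character or word count, so metering must use the same tokenizer as the model being served.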

4️⃣ High-Level Architecture

Architecture

Client
→ API Gateway
→ Auth / Quota Check
→ Request Validator
→ Token Meter
→ Inference Service
→ Usage Event Producer
→ Usage Event Stream
→ Aggregation Service
→ Rating Engine
→ Invoice / Billing System

Core Components

Token Meter

Counts input and output tokens.

Usage Event Producer

Emits a usage event when a request completes, fails, or is cancelled.

Aggregation Service

Aggregates usage by user, organization, model, and time window.

Rating Engine

Converts usage into cost.

Billing System

Generates invoices or charges.

👉 Interview Answer

A token billing system usually includes token metering, usage event generation, streaming ingestion, aggregation, rating, quota enforcement, and invoice generation.

The key requirement is accurate, reliable, and auditable usage accounting.

5️⃣ Token Metering

What to Meter

The system should meter:

Prompt input tokens
Completion output tokens
Cached input tokens
Embedding tokens
Tool-call tokens
Model name
Request ID
Organization ID
Timestamp

Example Usage Record

{
  "request_id": "req_123",
  "organization_id": "org_456",
  "model": "large-model",
  "input_tokens": 1200,
  "output_tokens": 300,
  "cached_tokens": 600,
  "timestamp": "2026-05-24T10:00:00Z"
}

👉 Interview Answer

Token metering records how many tokens were consumed by each request.

A good usage record should include request ID, organization, model, input tokens, output tokens, cached tokens, timestamp, and pricing-relevant dimensions.
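
A minimal sketch of the usage record as a typed, validated structure. The field names come from the example record above. The validation rules are assumptions, in particular that `cached_tokens` is counted as a subset of `input_tokens`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UsageRecord:
    """One metered request. Immutable so it can be stored as-is for audit."""
    request_id: str
    organization_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int  # assumption: a subset of input_tokens
    timestamp: str      # ISO 8601, UTC

    def __post_init__(self) -> None:
        counts = (self.input_tokens, self.output_tokens, self.cached_tokens)
        if any(count < 0 for count in counts):
            raise ValueError("token counts must be non-negative")
        if self.cached_tokens > self.input_tokens:
            raise ValueError("cached_tokens cannot exceed input_tokens")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


record = UsageRecord(
    request_id="req_123",
    organization_id="org_456",
    model="large-model",
    input_tokens=1200,
    output_tokens=300,
    cached_tokens=600,
    timestamp="2026-05-24T10:00:00Z",
)
print(record.to_json())
```

Rejecting impossible counts at metering time keeps bad records out of the rating and invoicing stages.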

6️⃣ Request Lifecycle Accounting

Request Lifecycle

Request accepted
→ Input tokens counted
→ Inference starts
→ Output tokens generated
→ Final token count captured
→ Usage event emitted

Important Cases

The system must handle:

Successful requests
Failed requests
Cancelled requests
Streamed responses
Partial outputs
Retries
Timeouts

Example

Client disconnects after 100 output tokens
→ Bill only generated tokens
→ Emit partial usage event

👉 Interview Answer

Billing must follow the full request lifecycle.

The system should correctly account for successful, failed, cancelled, timed-out, retried, and partially streamed requests.
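
The lifecycle cases above can be sketched as one function that closes a request and decides what is billable. The names `Outcome`, `UsageEvent`, and `close_request` are hypothetical, and the policy (bill nothing if inference never started, otherwise bill input tokens plus whatever was generated) is an assumption rather than a universal rule.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    TIMED_OUT = "timed_out"


@dataclass(frozen=True)
class UsageEvent:
    request_id: str
    outcome: Outcome
    input_tokens: int
    output_tokens: int
    partial: bool  # True when the request ended before normal completion


def close_request(request_id: str, outcome: Outcome, input_tokens: int,
                  generated_tokens: int, inference_started: bool) -> UsageEvent:
    if not inference_started:
        # Assumed policy: the model never ran, so nothing is billable.
        return UsageEvent(request_id, outcome, 0, 0, partial=False)
    # Bill only the tokens that were actually generated.
    return UsageEvent(request_id, outcome, input_tokens, generated_tokens,
                      partial=outcome is not Outcome.COMPLETED)


# Client disconnects after 100 output tokens.
event = close_request("req_123", Outcome.CANCELLED, 1200, 100,
                      inference_started=True)
print(event)
```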

7️⃣ Pricing Model

Pricing Dimensions

Pricing may depend on:

Model
Input tokens
Output tokens
Cached tokens
Embedding tokens
Batch vs real-time mode
Region
Customer tier
Volume discounts

Example Pricing

Input tokens: $X per million
Output tokens: $Y per million
Cached tokens: discounted

Why Output Tokens May Cost More

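Input tokens can be processed in parallel in a single prefill pass, while output tokens are decoded sequentially, one forward pass per token. That is why output tokens are often priced higher than input tokens.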

👉 Interview Answer

The pricing model converts measured usage into cost.

Pricing may differ by model, input tokens, output tokens, cached tokens, batch mode, region, and customer tier.

8️⃣ Rating Engine

What Is Rating?

Rating converts usage into billable cost.

Usage Event
→ Pricing Rules
→ Cost

Rating Example

{
  "input_tokens": 1000,
  "output_tokens": 500,
  "input_price": 0.000001,
  "output_price": 0.000003,
  "total_cost": 0.0025
}

Rating Engine Responsibilities

Apply pricing rules
Apply discounts
Apply free tier
Apply credits
Handle currency
Handle billing periods
Preserve audit trail

👉 Interview Answer

The rating engine applies pricing rules to token usage.

It calculates cost, applies discounts, credits, and free-tier rules, and stores auditable cost records.
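
A minimal rating sketch that reproduces the rating example above. The price table, the `PRICING_VERSION` label, and the `rate` function are hypothetical. `Decimal` is used instead of floats so that money arithmetic stays exact.

```python
from decimal import Decimal

# Hypothetical per-token prices, matching the rating example above.
PRICES = {
    "large-model": {
        "input": Decimal("0.000001"),
        "output": Decimal("0.000003"),
    },
}
PRICING_VERSION = "v1"  # hypothetical pricing-rule version label


def rate(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Convert metered token counts into a cost record."""
    price = PRICES[model]
    input_cost = price["input"] * input_tokens
    output_cost = price["output"] * output_tokens
    return {
        "pricing_version": PRICING_VERSION,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }


rated = rate("large-model", input_tokens=1000, output_tokens=500)
print(rated["total_cost"])
```

Storing the pricing version next to the cost is what later lets a charge be re-derived during a dispute.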

9️⃣ Usage Event Pipeline

Why Event Pipeline Is Needed

Billing data must be durable and reliable.

Event Flow

Inference Service
→ Usage Event
→ Durable Queue / Stream
→ Usage Processor
→ Aggregation Store
→ Billing System

Requirements

Durable delivery
Idempotent processing
Ordering where needed
Replay support
Dead-letter queue
Audit logs

👉 Interview Answer

Usage events should flow through a durable event pipeline.

This allows reliable processing, retries, replay, aggregation, and auditing.

Billing events should never rely only on in-memory state.
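
A rough sketch of the processing side of the pipeline: bounded retries, then a dead-letter list for events that keep failing. The in-memory `deque` only stands in for a durable queue or stream so that the example runs on its own. The names `drain` and `handle` are hypothetical.

```python
from collections import deque
from typing import Callable


def drain(stream: deque, handle: Callable[[dict], None],
          max_attempts: int = 3) -> list[dict]:
    """Process every event; park events that keep failing in a dead-letter list."""
    dead_letters = []
    while stream:
        event = stream.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                handle(event)
                break  # processed successfully, move to the next event
            except Exception as exc:
                if attempt == max_attempts:
                    # Keep the event and the reason for later inspection or replay.
                    dead_letters.append({"event": event, "error": str(exc)})
    return dead_letters


processed = []


def handle(event: dict) -> None:
    if "organization_id" not in event:
        raise ValueError("missing organization_id")
    processed.append(event["request_id"])


stream = deque([
    {"request_id": "req_1", "organization_id": "org_456"},
    {"request_id": "req_2"},  # malformed on purpose
])
dead = drain(stream, handle)
print(processed, dead)
```

A malformed event does not block the stream and is not silently dropped. It stays available for a fix and a replay.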

🔟 Idempotency and Deduplication

Why It Matters

Requests may be retried.

Events may be delivered more than once.

Without deduplication, customers may be double-billed.

Idempotency Key

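request_id + usage_event_type + attempt_id

Here attempt_id identifies an inference attempt, not a delivery attempt, so a redelivered event carries the same key and is skipped.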

Deduplication Flow

Usage event received
→ Check idempotency key
→ If already processed, skip
→ Else process and store

👉 Interview Answer

Idempotency is critical in billing systems.

Since requests and events may be retried, the system must deduplicate usage events using stable request IDs or idempotency keys to avoid double charging.
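
The deduplication flow above can be sketched as follows. The `UsageProcessor` class and the event fields `tokens` and `organization_id` are hypothetical, and the in-memory set stands in for a durable store with a unique constraint on the key.

```python
class UsageProcessor:
    def __init__(self) -> None:
        self._seen: set[tuple] = set()   # stand-in for a durable key store
        self.totals: dict[str, int] = {}

    @staticmethod
    def idempotency_key(event: dict) -> tuple:
        return (event["request_id"], event["usage_event_type"],
                event["attempt_id"])

    def process(self, event: dict) -> bool:
        """Apply the event once. Return False if it was already processed."""
        key = self.idempotency_key(event)
        if key in self._seen:
            return False
        # In a real system, recording the key and updating the totals
        # should happen atomically, for example in one transaction.
        self._seen.add(key)
        org = event["organization_id"]
        self.totals[org] = self.totals.get(org, 0) + event["tokens"]
        return True


processor = UsageProcessor()
event = {
    "request_id": "req_123",
    "usage_event_type": "final",
    "attempt_id": 1,
    "organization_id": "org_456",
    "tokens": 1500,
}
first = processor.process(event)
second = processor.process(dict(event))  # the same event delivered again
print(first, second, processor.totals)
```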

1️⃣1️⃣ Quota and Budget Enforcement

Why Quotas Matter

Users and organizations need spending control.

Common Controls

Token-per-minute limit
Request-per-minute limit
Daily budget
Monthly budget
Model-specific quota
Organization-level quota
Hard and soft limits

Flow

Request arrives
→ Estimate token usage
→ Check quota / budget
→ Allow or reject

Challenge

Output tokens are unknown before generation.

👉 Interview Answer

Quota enforcement protects both the platform and customers.

The system should check estimated usage before inference and reconcile actual usage after completion.

Output token usage is harder to enforce because it is known only after generation, so the pre-inference check has to rely on an estimate or an upper bound.
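
One common way to handle unknown output size is to reserve a worst-case amount before inference and settle with the actual count afterwards. This is a sketch under assumptions: `TokenBudget` is a hypothetical class, and `max_output_tokens` is assumed to be an upper bound known at request time, such as a per-request output limit.

```python
class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0
        self.reserved: dict[str, int] = {}  # request_id -> reserved tokens

    def available(self) -> int:
        return self.limit - self.used - sum(self.reserved.values())

    def reserve(self, request_id: str, input_tokens: int,
                max_output_tokens: int) -> bool:
        """Hold the worst-case usage. Return False to reject the request."""
        estimate = input_tokens + max_output_tokens
        if estimate > self.available():
            return False
        self.reserved[request_id] = estimate
        return True

    def settle(self, request_id: str, actual_tokens: int) -> None:
        """Release the hold and record what was actually consumed."""
        self.reserved.pop(request_id)
        self.used += actual_tokens


budget = TokenBudget(limit=10_000)
ok_1 = budget.reserve("req_1", input_tokens=1200, max_output_tokens=4000)
ok_2 = budget.reserve("req_2", input_tokens=3000, max_output_tokens=4000)
budget.settle("req_1", actual_tokens=1500)  # far below the reserved 5200
ok_2_retry = budget.reserve("req_2", input_tokens=3000, max_output_tokens=4000)
print(ok_1, ok_2, ok_2_retry, budget.available())
```

The trade-off is that worst-case reservations can reject requests that would have fit, which is why settling promptly matters.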

1️⃣2️⃣ Prepaid Credits vs Postpaid Billing

Prepaid

Users buy credits first.

Credit balance
→ Deduct usage
→ Stop when balance is low

Advantages

Lower financial risk
Easier for consumer products
Hard spending limit

Postpaid

Users are billed after usage.

Usage accumulated
→ Monthly invoice

Advantages

Better for enterprise customers
Easier for high-volume usage
Supports contracts and discounts

👉 Interview Answer

Token billing systems can support prepaid or postpaid models.

Prepaid credits reduce financial risk and enforce hard limits.

Postpaid billing is common for enterprise customers with invoices, contracts, and volume discounts.
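
A small sketch contrasting the two models. The class names are hypothetical, and amounts are kept as integer micro-dollars (an assumption) to avoid floating-point drift.

```python
class PrepaidAccount:
    """Credits are bought up front; requests stop once the balance runs out."""

    def __init__(self, balance_micros: int) -> None:
        self.balance_micros = balance_micros

    def can_start(self) -> bool:
        return self.balance_micros > 0

    def charge(self, cost_micros: int) -> None:
        self.balance_micros -= cost_micros


class PostpaidAccount:
    """Usage accrues during the period and is invoiced afterwards."""

    def __init__(self) -> None:
        self.accrued_micros = 0

    def can_start(self) -> bool:
        return True  # a real system would still apply quotas or credit limits

    def charge(self, cost_micros: int) -> None:
        self.accrued_micros += cost_micros

    def close_period(self) -> int:
        invoice_total, self.accrued_micros = self.accrued_micros, 0
        return invoice_total


prepaid = PrepaidAccount(balance_micros=3000)
prepaid.charge(2500)          # a $0.0025 request
still_ok = prepaid.can_start()
prepaid.charge(2500)          # the second request overshoots the balance
blocked = not prepaid.can_start()

postpaid = PostpaidAccount()
postpaid.charge(2500)
postpaid.charge(2500)
invoice = postpaid.close_period()
print(still_ok, blocked, prepaid.balance_micros, invoice)
```

Note that charging after the fact lets the prepaid balance go negative. A truly hard limit needs a reservation before inference, not just a balance check.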

1️⃣3️⃣ Streaming Billing

Why Streaming Is Special

Output tokens are generated gradually.

The system may stream tokens before knowing final usage.

Streaming Billing Flow

Start request
→ Count input tokens
→ Stream output tokens
→ Count generated tokens
→ Emit final usage event

Cancellation Case

Client cancels after 200 tokens
→ Bill 200 generated tokens

👉 Interview Answer

Streaming responses require billing based on actual generated tokens.

The system should count output tokens as they are produced and emit a final usage event when the stream completes, fails, or is cancelled.
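
A minimal sketch of a metered stream using a Python generator. The function `metered_stream` and the `emit` callback are hypothetical. The `finally` block is what ensures one final usage event whether the stream completes, fails, or is cancelled.

```python
from typing import Callable, Iterable, Iterator


def metered_stream(request_id: str, input_tokens: int,
                   tokens: Iterable[str],
                   emit: Callable[[dict], None]) -> Iterator[str]:
    """Yield tokens to the client while counting them for billing."""
    generated = 0
    status = "failed"  # stays "failed" if the token source raises
    try:
        for token in tokens:
            generated += 1
            yield token
        status = "completed"
    except GeneratorExit:
        # Raised when the consumer closes the stream early.
        status = "cancelled"
        raise
    finally:
        emit({
            "request_id": request_id,
            "input_tokens": input_tokens,
            "output_tokens": generated,
            "status": status,
        })


events = []
stream = metered_stream("req_123", 1200,
                        (f"tok{i}" for i in range(1000)), events.append)
received = [next(stream) for _ in range(200)]
stream.close()  # simulates the client cancelling after 200 tokens
print(events)
```

A token is counted as soon as it is generated, which matches the rule of billing for generated tokens even when the client disconnects.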

1️⃣4️⃣ Aggregation

Why Aggregate Usage?

Raw events are too detailed for invoices.

Aggregation summarizes usage by:

Organization
User
API key
Model
Time window
Token type
Region
Product feature

Example

org_123
model = large-model
date = 2026-05-24
input_tokens = 10M
output_tokens = 2M
cost = $X

👉 Interview Answer

Usage aggregation converts raw request-level usage events into billing summaries.

Aggregation usually groups by organization, model, token type, time window, API key, and billing period.
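
A sketch of daily aggregation by organization and model. The function name `aggregate` is hypothetical, and the grouping assumes timestamps are ISO 8601 strings in UTC, so the first ten characters are the date.

```python
from collections import defaultdict


def aggregate(events: list[dict]) -> dict:
    """Sum token counts per (organization, model, UTC date)."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for event in events:
        key = (event["organization_id"], event["model"],
               event["timestamp"][:10])
        totals[key]["input_tokens"] += event["input_tokens"]
        totals[key]["output_tokens"] += event["output_tokens"]
    return dict(totals)


events = [
    {"organization_id": "org_123", "model": "large-model",
     "timestamp": "2026-05-24T10:00:00Z",
     "input_tokens": 1200, "output_tokens": 300},
    {"organization_id": "org_123", "model": "large-model",
     "timestamp": "2026-05-24T18:30:00Z",
     "input_tokens": 800, "output_tokens": 200},
    {"organization_id": "org_123", "model": "large-model",
     "timestamp": "2026-05-25T01:00:00Z",
     "input_tokens": 500, "output_tokens": 100},
]
summary = aggregate(events)
print(summary)
```

Because the summary is derived from raw events, it can be rebuilt by replaying the stream if an aggregation bug is found.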

1️⃣5️⃣ Auditability

Why Audit Matters

Customers may dispute bills.

The system must prove:

Which requests were billed
How many tokens were counted
Which model was used
Which price applied
Which discounts applied
When events were processed

Audit Record

Request ID
Usage event
Token counts
Pricing rule version
Calculated cost
Processing timestamp

👉 Interview Answer

Billing systems must be auditable.

For each charge, the system should be able to trace the request, token counts, model, pricing rule, discounts, and processing history.
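
One way to make a charge verifiable is to store the pricing rule version with it and re-derive the cost on demand. This is a sketch: the rule table, the version label, and `verify_charge` are hypothetical.

```python
from decimal import Decimal

# Hypothetical versioned pricing rules. Old versions are never edited in place.
PRICING_RULES = {
    "2026-05-v1": {
        "input": Decimal("0.000001"),
        "output": Decimal("0.000003"),
    },
}


def verify_charge(audit_record: dict) -> bool:
    """Recompute the cost from the stored counts and rule version."""
    rule = PRICING_RULES[audit_record["pricing_rule_version"]]
    expected = (rule["input"] * audit_record["input_tokens"]
                + rule["output"] * audit_record["output_tokens"])
    return expected == Decimal(audit_record["calculated_cost"])


audit_record = {
    "request_id": "req_123",
    "input_tokens": 1000,
    "output_tokens": 500,
    "pricing_rule_version": "2026-05-v1",
    "calculated_cost": "0.0025",
    "processed_at": "2026-05-24T10:00:05Z",
}
print(verify_charge(audit_record))
```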

1️⃣6️⃣ Best Practices

Practical Rules

Meter usage at request level
Use durable usage events
Make billing processing idempotent
Separate metering from rating
Track pricing rule versions
Support replay and backfill
Reconcile estimated vs actual usage
Enforce quotas before inference
Store audit records
Monitor billing anomalies

Design Principle

Meter first.
Rate later.
Bill with auditability.

👉 Interview Answer

A good token-based billing system separates metering, rating, aggregation, and invoicing.

It should use durable events, idempotency, audit logs, quota enforcement, and pricing-rule versioning.

🧠 Final Staff-Level Answer

👉 Interview Answer (Full Version)

Token-based billing is used for LLM systems because infrastructure cost is strongly related to token usage.

A production billing system needs to meter input tokens, output tokens, cached tokens, embedding tokens, model type, organization, timestamp, and other pricing dimensions.

The high-level architecture usually includes an API gateway, quota checker, token meter, inference service, usage event producer, durable event stream, usage processor, aggregation store, rating engine, and billing or invoicing system.

During the request lifecycle, the system counts input tokens before inference, tracks output tokens during generation, and emits a usage event when the request completes, fails, times out, or is cancelled.

Streaming responses require special care because output tokens are produced gradually.

The system should bill based on the actual number of generated tokens, even if the client disconnects early.

The usage event pipeline must be durable and replayable.

Billing should not depend only on in-memory state.

Since retries and duplicate events can happen, idempotency and deduplication are critical to avoid double billing.

I would separate metering from rating.

Metering records raw usage.

Rating applies pricing rules, discounts, credits, free tier rules, currency handling, and billing-period logic.

Quota and budget enforcement should happen before inference using estimated usage, then actual usage should be reconciled after completion.

Because output tokens are unknown in advance, the system may reserve budget upfront and settle the final amount later.

Finally, auditability is essential.

Every charge should be traceable back to request ID, token counts, model, pricing rule version, discounts, and processing timestamps.

The core principle is: meter first, rate later, and bill with auditability.

⭐ Final Insight

The core of token-based billing is not simply:

"Count the tokens, then multiply by the price."

A real production billing system combines:

Token Metering

Usage Events

Durable Stream

Idempotency

Aggregation

Rating Engine

Quota Enforcement

Audit Logs

Invoice Generation

The most important principle is:

Meter first.

Rate later.

Bill with auditability.
