The Three-Layer Architecture of AI Tokens: Why the Middle Is Eating the Stack

Something interesting is happening in the way smart people talk about AI infrastructure.

For the past two years, the conversation was about models — which one is biggest, which one writes the best code, which one will reach AGI first. That conversation hasn't gone away, but at recent AI infrastructure summits a different framing has been quietly taking over. Industry experts and academic researchers have started describing the token economy as a three-layer stack, not unlike the way we eventually came to think about cloud computing.

The framing goes like this:

Layer 1 — Producers. The model labs that actually train and serve frontier LLMs.
Layer 2 — Aggregators. The middleware that normalizes APIs, pools capacity, and bills users.
Layer 3 — Schedulers. The intelligence that routes each request to the right model at the right price.

If you build with AI today, you almost certainly talk directly to Layer 1: one or two model providers. And if you've felt the pain of vendor lock-in, capacity outages, or surprise bills, the three-layer framing explains why that pain exists and where it's going to be solved.

Spoiler: it's going to be solved in the middle. This article is about why.

The Single-Model Era Is Quietly Ending

In 2023, the typical AI app was a wrapper around gpt-3.5-turbo. In 2024, it was a wrapper around gpt-4 with a fallback to gpt-3.5 for cost. That was the entire architecture.

Look at a production AI app shipped in 2026 and the picture has fundamentally changed. A real example from a B2B SaaS team I spoke with last month:

Customer-facing chat: DeepSeek V3 for general turns, GPT-4o only on escalation
Internal RAG over Chinese documents: Qwen 2.5-72B
Long-document summarization: Kimi K2 (because of its long context window)
Structured extraction: GLM-4-Flash (cheap and reliable)
Coding agent: Claude 3.5 Sonnet
Embeddings: a self-hosted open model

Six models. Six different APIs. Six different billing dashboards. Six different rate-limit policies. Six different ways to get paged at 3 a.m.

This is not because the team is over-engineering. It's because no single model is best at everything anymore, and the price-performance gap between models has gotten so wide that picking the wrong one for a task can multiply your bill by 30x. A request that costs $0.0003 on DeepSeek can cost $0.01 on GPT-4o for output that's qualitatively identical for the task at hand.

If you're still building "the OpenAI app," you're building yesterday's architecture. The multi-model app is the new default, and the multi-model app needs a different kind of infrastructure underneath it.

The Three-Layer Architecture, Properly Explained

Let me unpack the three layers in a way that makes sense if you've ever shipped code.

Layer 1: Producers — The Token Factories

Producers are the labs that train frontier models and operate the inference clusters that turn prompts into tokens. OpenAI, Anthropic, Google, Meta, DeepSeek, Moonshot, Zhipu, Alibaba's Qwen team, Mistral — these are all producers.

Producers compete on three things:

Capability — benchmark scores, reasoning depth, context length, multimodality.
Unit economics — cost per token, throughput per GPU.
Specialization — Chinese-language quality, coding ability, long-context recall, function calling.

What producers don't compete on is consistency. Every producer's API is subtly different. Authentication differs. Streaming formats differ. Function-calling schemas differ. Even the meaning of temperature drifts between vendors. This is not malice; it's just the natural state of a market where every player is moving at maximum speed.

Producers also can't afford to optimize for your workload. Their job is to keep the GPUs hot. Your job is to keep your users happy. Those goals are not always aligned.

Layer 2: Aggregators — The Universal Translators

The aggregator's job is to make the producer layer look like a single, well-behaved system.

A real aggregator does at least seven things:

Protocol normalization. One request schema (typically the OpenAI Chat Completions format) maps to every backend model.
Identity and billing. One API key, one wallet, one invoice — instead of six accounts in six countries with six different KYC processes.
Capacity pooling. Aggregators buy commitments from multiple producers and resell on demand, so individual developers don't have to predict their own usage.
Geographic accessibility. Producers in mainland China, Europe, and the US each have their own access rules. An aggregator can be the only practical way for a developer in, say, Brazil to use a Chinese model legally and reliably.
Payment flexibility. Most developers globally can't easily pay for, say, a DeepSeek API. Aggregators accept PayPal, cards, crypto — whatever the market actually uses.
Observability. Logs, latency metrics, error rates, and spend dashboards in one place.
Compatibility shimming. When a backend producer changes their schema (and they always do), the aggregator absorbs the breakage so your code doesn't.
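
To make the first item concrete, here is a minimal sketch of protocol normalization. It assumes a backend that, like Anthropic's Messages API, takes the system prompt as a top-level `system` field and requires `max_tokens`. The function name, the default of 1024, and the small set of fields handled are illustrative assumptions, not any aggregator's actual implementation.

```python
from typing import Any


def normalize_for_backend(
    openai_request: dict[str, Any],
    default_max_tokens: int = 1024,  # assumed default; the backend requires a value
) -> dict[str, Any]:
    """Translate an OpenAI-style chat request into a payload with a top-level system field.

    Only plain-string message content is handled in this sketch.
    """
    system_parts: list[str] = []
    messages: list[dict[str, str]] = []

    for msg in openai_request["messages"]:
        if msg["role"] == "system":
            # OpenAI-style requests carry system prompts inside the message list;
            # this backend wants them pulled out.
            system_parts.append(msg["content"])
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})

    payload: dict[str, Any] = {
        "model": openai_request["model"],
        "messages": messages,
        "max_tokens": openai_request.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        payload["system"] = "\n\n".join(system_parts)
    if "temperature" in openai_request:
        payload["temperature"] = openai_request["temperature"]
    return payload


request = {
    "model": "example-model",  # placeholder model name
    "messages": [
        {"role": "system", "content": "Be terse."},
        {"role": "user", "content": "Summarize this doc..."},
    ],
    "temperature": 0.2,
}
payload = normalize_for_backend(request)
```

A real aggregator would carry one such adapter per backend, plus the reverse mapping for responses, streaming chunks, and errors.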

If this list sounds familiar, it should. Stripe did this for payment processors. Cloudflare did this for origin servers. Twilio did this for telcos. In every case, the "boring" middle layer ended up being more strategically important — and often more valuable — than the producers it sat in front of.

Layer 3: Schedulers — The Routing Brain

Schedulers sit on top of the aggregator and decide, on a per-request basis, which model should handle the call.

A good scheduler considers:

Task type (reasoning vs. summarization vs. extraction vs. translation)
Required quality tier (is this customer-facing or background?)
Current price per million tokens for each candidate model
Current health and latency of each model
Fallback policy if the first choice fails

Today, the scheduler is usually a few hundred lines of code inside your application. In a couple of years, it will look more like a managed service, much the way Kubernetes eventually swallowed everyone's bespoke deployment scripts.
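
A minimal sketch of that in-application scheduler might look like the following. It filters candidates on task type, quality tier, health, and latency, then orders what remains by price so that the tail of the list doubles as the fallback chain. The `Candidate` fields, the tier numbering, and every model name and number here are placeholders I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model: str
    tasks: frozenset[str]    # task types this model is trusted with
    quality_tier: int        # higher is better, e.g. 1 = background, 3 = customer-facing
    price_per_mtok: float    # current price per million tokens, USD
    healthy: bool            # outcome of a recent health check
    p50_latency_ms: float    # recent median latency


def rank_candidates(
    candidates: list[Candidate],
    task: str,
    min_tier: int,
    max_latency_ms: float = float("inf"),
) -> list[str]:
    """Return eligible model names, cheapest first; later entries are fallbacks."""
    eligible = [
        c for c in candidates
        if task in c.tasks
        and c.quality_tier >= min_tier
        and c.healthy
        and c.p50_latency_ms <= max_latency_ms
    ]
    # Price is the primary sort key; latency breaks ties.
    eligible.sort(key=lambda c: (c.price_per_mtok, c.p50_latency_ms))
    return [c.model for c in eligible]


# Placeholder catalogue: names and numbers are invented for the example.
CATALOGUE = [
    Candidate("model-a", frozenset({"extraction", "summarization"}), 1, 0.1, True, 300),
    Candidate("model-b", frozenset({"reasoning", "summarization"}), 2, 1.0, True, 500),
    Candidate("model-c", frozenset({"reasoning", "summarization", "extraction"}), 3, 10.0, True, 800),
    Candidate("model-d", frozenset({"reasoning"}), 3, 5.0, False, 400),
]

reasoning_chain = rank_candidates(CATALOGUE, task="reasoning", min_tier=2)
```

In this example the unhealthy `model-d` is skipped even though it is cheaper than `model-c`, which is the kind of decision that is tedious to hand-code per call site.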

Why the Middle Layer Eats the Stack

Here's the part that I think gets undersold. In a three-layer architecture, the middle layer is structurally the most strategic place to be — and the place most independent developers and startups should be paying attention to.

1. The middle layer is where lock-in dies

The biggest hidden tax in AI development right now is switching cost. Re-integrating a new model takes a week. Re-integrating five new models takes a quarter. Most teams just don't do it, and they overpay forever as a result.

An aggregator normalizes the interface. Once you're behind one, switching from GPT-4o to DeepSeek V3 is a string change, not a sprint.

2. The middle layer is where economics work

Producers price for their best customers — typically large enterprises with predictable, high-volume commits. Solo developers and small startups pay rack rate. Aggregators sit between the two: they negotiate volume rates with producers and resell in small chunks to long-tail developers. The arbitrage funds everyone in the middle.

This is the same logic behind AWS. EC2 isn't cheaper than running your own server because Amazon has cheaper electricity. It's cheaper because Amazon buys hardware and data-center capacity at industrial scale and rents it to you in small increments.

3. The middle layer is where reliability lives

No single producer has 100% uptime. Anyone who's been on Anthropic during a capacity squeeze, or on OpenAI during a launch day, knows this in their bones. The only durable answer is multi-provider failover — and you can't do multi-provider failover until you have a unified interface to fail over with. That's the middle layer.
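
Once every provider sits behind the same call signature, failover reduces to a loop. The sketch below is one way to write it, under the assumption that each provider is wrapped as a plain callable taking a prompt and returning text. The names `complete_with_failover` and `AllProvidersFailed` are hypothetical, and the stand-in providers only simulate an outage.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has errored."""


def complete_with_failover(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in order and return (provider_name, reply) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Broad catch keeps the sketch short; production code should only
            # fail over on retryable errors (timeouts, 429s, 5xx responses).
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers for illustration. In real use each callable would wrap
# a chat-completion request with a different model name.
def overloaded_provider(prompt: str) -> str:
    raise TimeoutError("capacity squeeze")


def steady_provider(prompt: str) -> str:
    return f"reply to: {prompt}"


served_by, reply = complete_with_failover(
    "Design a sharding scheme...",
    [("primary", overloaded_provider), ("backup", steady_provider)],
)
```

The same loop is much harder to write when each provider needs its own client, its own error types, and its own response parsing.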

4. The middle layer is where new geographies open up

The most underrated story in AI right now is that the price-performance frontier has shifted. The cheapest token that meets the quality bar for many real tasks is no longer made in California. It's made in Hangzhou, in Beijing, in Hangzhou again. DeepSeek V3 is roughly 30x cheaper than GPT-4o on output tokens and ties or beats it on a large fraction of coding and reasoning tasks. Qwen 2.5 is genuinely competitive with Claude for many enterprise use cases. GLM-4 ships an extremely cheap "Flash" tier that's perfect for structured extraction.

Most non-Chinese developers have never used these models. Not because they're inferior — they often aren't — but because the access path is hard: foreign credit cards don't always work, KYC is in a foreign language, payment limits are restrictive, and the regional latency from outside Asia can be brutal without proper routing.

This is, structurally, an aggregator problem. Solve it once for everybody.

5. The middle layer is where the standards will eventually live

One of the consistent points at recent infrastructure conferences is that the AI industry has a standards gap. There's no equivalent of TCP/IP, or POSIX, or even OpenAPI for how a model should expose itself to the world. We're in the pre-standardization era, which is exactly when middleware companies create de facto standards.

The Chat Completions schema — invented by OpenAI, adopted by everyone else because it was already there — is the first such standard. There will be more. They will almost certainly emerge from the aggregator layer, because that's where the pressure to standardize is highest.

What a Production-Grade Middle Layer Actually Looks Like

If you've never used an aggregator, here's what working with one feels like in practice. The example below uses Python, but the same shape applies in any language.

from openai import OpenAI

# One API key. Every model.
client = OpenAI(
    api_key="YOUR_HAOTOKAI_KEY",
    base_url="https://api.haotokai.com/v1"
)

# Cheap, fast, Chinese-language-strong
qwen_reply = client.chat.completions.create(
    model="qwen2.5-72b-instruct",
    messages=[{"role": "user", "content": "Summarize this doc..."}]
)

# Long-context: 128k-token window, per the model name
full_book_text = "..."  # placeholder: load your long document here
kimi_reply = client.chat.completions.create(
    model="moonshot-v1-128k",
    messages=[{"role": "user", "content": full_book_text}]
)

# Reasoning-heavy task
deepseek_reply = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Design a sharding scheme..."}]
)

# Structured extraction at near-zero cost
glm_reply = client.chat.completions.create(
    model="glm-4-flash",
    messages=[{"role": "user", "content": "Extract all invoice line items..."}]
)

Same SDK. Same request shape. Same billing wallet. Same observability. No new authentication, no new error handling, no new rate-limit logic.

That's the whole point. The middle layer's job is to disappear.

Why this matters more than it sounds

Every line of integration code you don't write is a line you don't have to debug, secure, or migrate when the upstream API breaks. A good aggregator turns "I'd love to try DeepSeek but I don't have time" into a one-line change in the model field.

Where Haotokai Fits

This is the part where I should be transparent: Haotokai is a Layer-2 aggregator, and it's the product I work on. The reason we built it is exactly the thesis of this article — the middle layer is where most developers' real pain lives, and there wasn't a good option for developers outside China who wanted clean access to the Chinese model ecosystem.

Concretely, Haotokai gives you:

One OpenAI-compatible endpoint across DeepSeek (V3, R1), Qwen 2.5, GLM-4, Kimi (Moonshot), Spark (iFlytek), and more.
Pricing that mirrors the source providers, so the cheap Chinese models stay cheap — typically 60–90% below GPT-4o-class pricing.
PayPal, card, and crypto payment, so you don't need a Chinese bank account to use Chinese tokens.
One dashboard, one wallet, one invoice for everything you spend across providers.
Drop-in compatibility with the OpenAI SDK and any framework built on top of it (LangChain, LlamaIndex, Vercel AI SDK, etc.).
$20 in free credit to try every model side by side before you commit to anything.

If you're already running a multi-model setup, Haotokai consolidates the integration mess. If you're a single-model shop curious about the price-performance frontier outside the US labs, Haotokai is probably the lowest-friction way to experiment.


The Honest Counter-Arguments

I'd be wasting your time if I didn't address the obvious objections.

"Aggregators are just middlemen taking a cut."

Mathematically, yes: there's a markup. Practically, the markup is small (usually 5–15%), and it's dwarfed by the savings from being able to route to cheaper models. If switching 70% of your traffic to a model that's 10x cheaper saves you about 63% on your bill, a 10% middleware fee is a rounding error.
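
The arithmetic behind that example is easy to check. The sketch below assumes the bill scales linearly with traffic share and that the aggregator's fee applies to the whole bill; the function name is mine, and the inputs are the 70%, 10x, and 10% figures from the paragraph above.

```python
def relative_bill(shifted_share: float, price_ratio: float, markup: float = 0.0) -> float:
    """Return the new bill as a fraction of the all-premium baseline (1.0 = unchanged).

    shifted_share: fraction of traffic moved to the cheaper model
    price_ratio:   cheaper model's price divided by the premium model's price
    markup:        aggregator fee, applied to the whole bill
    """
    blended = (1 - shifted_share) + shifted_share * price_ratio
    return blended * (1 + markup)


direct = relative_bill(0.70, 0.10)                   # 0.30 + 0.07 = 0.37 of baseline
with_fee = relative_bill(0.70, 0.10, markup=0.10)    # 0.37 * 1.10 = 0.407 of baseline

print(f"Savings without fee: {1 - direct:.0%}")
print(f"Savings with 10% fee: {1 - with_fee:.0%}")
```

Under these assumptions the blended bill is 37% of the baseline, a saving of about 63%, and the 10% fee brings the net saving down to about 59%.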

"I'm worried about another point of failure."

Reasonable concern, but in practice a well-run aggregator improves reliability because it can fail over between producers automatically. Single-producer setups have no fallback. Multi-producer setups behind an aggregator have several.

"What about data privacy?"

Pick an aggregator that doesn't log prompts and doesn't train on your data, and the privacy posture is essentially the same as going direct. For workloads that require dedicated compliance (HIPAA, SOC 2, regional data residency), you may want to stick with a producer that offers those certifications. For everything else, the aggregator is fine.

"I'll just build my own routing layer."

You can, and many teams do. The question is whether routing is your business. For Stripe, payment routing is the business. For Cloudflare, traffic routing is the business. For your AI startup, the chatbot or the agent or the document tool is the business. Build the differentiated thing; rent the boring infrastructure.

What to Take Away

The three-layer framing for AI tokens isn't a marketing slide. It's a useful description of where the industry is actually heading, and once you see it you can't unsee it.

Producers will keep training better models and competing on capability.
Schedulers will become a managed service category over the next 2–3 years.
Aggregators in the middle will quietly become the place where most developers actually live.

If you're building a serious AI application today, the highest-leverage architectural decision you can make is to stop talking to producers directly and start talking to a normalized middle layer. It's the same lesson the web learned with CDNs, mobile learned with cross-platform SDKs, and payments learned with Stripe. The middle is where the leverage is.

The single-model era is over. The multi-model era needs a middle layer. That middle layer is the next critical piece of AI infrastructure.