Gone are the days when building an AI application meant choosing one LLM provider and sticking with it. Today’s smartest applications use multiple models: a fast, cheap one for routine tasks, a premium model for complex reasoning, and specialized models for code, images, or audio.
But managing multiple API keys, different SDKs, and inconsistent response formats used to be a nightmare. Not anymore.
In this guide, we’ll show you how to build a production-ready multi-LLM application using a single API key with a unified AI API platform. You’ll learn the architecture, routing strategies, and implementation patterns that top AI teams use to build flexible, cost-effective applications.
Why Build a Multi-LLM Application?
Before we dive into the code, let’s cover why you’d want multiple LLMs in the first place. The benefits fall into four categories:
1. Cost Optimization
Not every request needs GPT-5.2. A customer support FAQ bot can run on a $0.28/MTok model instead of a $14/MTok one. Intelligent routing can cut your AI bill by 70-90%.
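To see where a figure in that range can come from, here is a minimal sketch of the blended output price when most traffic moves to the cheaper model. The 80/20 split is an illustrative assumption, not a measurement; the two prices are the ones quoted above.

```python
def blended_price(cheap_share: float, cheap_price: float, premium_price: float) -> float:
    """Average price per 1M tokens when `cheap_share` of tokens go to the cheap model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price + (1.0 - cheap_share) * premium_price


# Assumed split: 80% of tokens on the $0.28/MTok model, 20% on the $14/MTok one.
price = blended_price(0.8, 0.28, 14.0)
savings = 1.0 - price / 14.0
print(f"Blended: ${price:.3f}/MTok, savings vs. all-premium: {savings:.1%}")
```

Under that assumed split the saving is a little under 80%. The share of traffic you can safely route to the cheap model is what moves the number.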
2. Reliability & Fallback
No provider has 100% uptime. If one model goes down, your application can fail over to another automatically, so a single provider outage no longer takes your users offline.
3. Specialization
Different models excel at different things:
- DeepSeek for cost-effective coding
- Qwen for multilingual support
- GLM for Chinese language understanding
- GPT-5 for complex reasoning
- Claude for long document analysis
4. Negotiation Power
Being locked into one provider gives them all the leverage. When you can switch models with a single line of code, you’re in control of pricing and terms.
Architecture Overview: The Unified API Approach
A multi-LLM application built on a unified API has three layers:
┌─────────────────────────────────────────┐
│ Your Application │
├─────────────────────────────────────────┤
│ Unified API Gateway │
│ (Routing, Caching, Fallback, Billing) │
├─────────────────────────────────────────┤
│ DeepSeek │ Qwen │ GLM │ GPT │ Claude │
└─────────────────────────────────────────┘
The unified API layer handles:
- Single authentication: One API key for all models
- Standardized format: OpenAI-compatible request/response for every model
- Intelligent routing: Automatically choose the best model per request
- Fallback logic: Retry with a different model if one fails
- Unified billing: One invoice for all usage
- Analytics dashboard: Compare performance and costs across models
Step 1: Set Up Your Unified API
First, you need access to a unified AI API platform. For this tutorial, we’ll use Haotokai, which aggregates Chinese AI models (DeepSeek, Qwen, GLM, Moonshot, etc.) through a single OpenAI-compatible endpoint.
# Install the OpenAI SDK (works with Haotokai's compatible endpoint)
# pip install openai
from openai import OpenAI

# Initialize with your Haotokai API key
client = OpenAI(
    api_key="YOUR_HAOTOKAI_API_KEY",
    base_url="https://api.haotokai.com/v1"
)
That’s it. With one client initialization, you now have access to 10+ AI models from different providers. No multiple SDKs, no different authentication schemes, no format conversions.
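In a real deployment you would normally keep the key out of source code. The sketch below reads it from an environment variable; the variable name `HAOTOKAI_API_KEY` and the helper `load_api_key` are assumptions made for illustration, not names defined by the platform.

```python
import os


def load_api_key(env_var: str = "HAOTOKAI_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before starting the app")
    return key
```

The client from Step 1 could then be created with `OpenAI(api_key=load_api_key(), base_url="https://api.haotokai.com/v1")`.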
Step 2: Basic Multi-LLM Chat
Let’s start with something simple: calling different models based on user selection.
def chat_with_model(message: str, model: str = "deepseek-v4-flash") -> str:
    """
    Send a chat message to any available model.

    Available models on Haotokai:
    - deepseek-v4-flash (fast, cheap, great for most tasks)
    - deepseek-v4-pro (premium reasoning)
    - qwen2.5-72b-instruct (strong multilingual)
    - glm-4 (excellent Chinese + coding)
    - moonshot-v1-128k (long context)
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
        temperature=0.7,
        max_tokens=1000
    )
    return response.choices[0].message.content

# Try different models with the same code
print("DeepSeek Flash:", chat_with_model("Explain quantum computing", "deepseek-v4-flash"))
print("Qwen 72B:", chat_with_model("Explain quantum computing", "qwen2.5-72b-instruct"))
print("GLM-4:", chat_with_model("Explain quantum computing", "glm-4"))
Notice what’s happening here: the exact same code works for every model. You just change the model parameter. No SDK changes, no request format changes, no response parsing changes.
Step 3: Intelligent Model Routing
The real power of multi-LLM comes from automatically choosing the right model for each task. Let’s build a smart router.
from typing import Literal
import re

TaskType = Literal["simple", "moderate", "complex", "coding", "creative"]

def classify_task(prompt: str) -> TaskType:
    """Classify a task to determine which model to use."""
    # Work on whole words: a plain substring check would find "api" inside
    # "capital" and "ad" inside "add"
    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))

    def count_prefix_hits(stems) -> int:
        # A stem matches any word that starts with it ("strateg" -> "strategy")
        return sum(1 for stem in stems if any(w.startswith(stem) for w in words))

    # Check for complexity indicators
    complex_indicators = ["analyze", "compare", "evaluate", "strateg",
                          "complex", "advanced", "research", "synthesis"]
    complex_count = count_prefix_hits(complex_indicators)
    # Check length (longer prompts often need more context/reasoning)
    prompt_length = len(prompt)

    # Strong complexity signals win, even if the prompt also mentions a
    # coding keyword such as "API"
    if complex_count >= 2 or prompt_length > 2000:
        return "complex"

    # Check for coding keywords
    coding_keywords = ["code", "programming", "python", "javascript", "function",
                       "debug", "algorithm", "database", "api", "deploy", "refactor"]
    if count_prefix_hits(coding_keywords) > 0:
        return "coding"

    if complex_count == 1 or prompt_length > 500:
        return "moderate"

    # Creative keywords must match exactly so that "ad" does not match "add"
    creative_keywords = {"write", "story", "creative", "poem", "script", "ad"}
    if words & creative_keywords:
        return "creative"
    return "simple"
def get_model_for_task(task_type: TaskType) -> str:
    """Map task types to the optimal model."""
    routing_table = {
        "simple": "deepseek-v4-flash",      # $0.28/MTok output - great for simple tasks
        "moderate": "deepseek-v4-flash",    # Flash handles most moderate tasks well
        "complex": "deepseek-v4-pro",       # Pro for complex reasoning
        "coding": "deepseek-v4-flash",      # Flash has strong coding (79% SWE-bench)
        "creative": "qwen2.5-72b-instruct"  # Qwen for creative writing
    }
    return routing_table[task_type]

def smart_chat(prompt: str) -> dict:
    """
    Smart chat that automatically routes to the best model.
    Returns both the response and metadata about which model was used.
    """
    task_type = classify_task(prompt)
    model = get_model_for_task(task_type)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7 if task_type != "creative" else 1.0,
        max_tokens=2000
    )
    return {
        "response": response.choices[0].message.content,
        "model_used": model,
        "task_type": task_type,
        "tokens": {
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens,
            "total": response.usage.total_tokens
        }
    }
# Test it out
result1 = smart_chat("What's the capital of France?")
print(f"Task: {result1['task_type']}, Model: {result1['model_used']}")
# Task: simple, Model: deepseek-v4-flash
result2 = smart_chat("Write a Python function to implement a binary search tree with insertion, deletion, and traversal methods.")
print(f"Task: {result2['task_type']}, Model: {result2['model_used']}")
# Task: coding, Model: deepseek-v4-flash
result3 = smart_chat("Analyze the competitive landscape of AI API providers in 2026 and recommend a strategy for a startup looking to enter the market. Consider pricing, model capabilities, and go-to-market approaches.")
print(f"Task: {result3['task_type']}, Model: {result3['model_used']}")
# Task: complex, Model: deepseek-v4-pro
This basic router will get you 80% of the benefit with minimal code. For production use, you’d want to:
- Add more sophisticated classification (possibly using an LLM itself)
- Track model performance per task type
- A/B test different routing strategies
- Add cost thresholds to prevent unexpected spending
Step 4: Fallback and Reliability
One of the biggest advantages of multi-LLM architecture is reliability. If one model is down or rate-limited, you automatically try another.
import time
from typing import List, Sequence  # List is used by the later examples

def chat_with_fallback(
    prompt: str,
    # A tuple default avoids the shared-mutable-default pitfall of a list
    preferred_models: Sequence[str] = ("deepseek-v4-flash", "qwen2.5-72b-instruct", "glm-4"),
    max_retries: int = 3
) -> dict:
    """
    Attempt to chat with fallback models if the primary fails.
    """
    last_error = None
    for model in preferred_models:
        for attempt in range(max_retries):
            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.7,
                    max_tokens=1000,
                    timeout=30
                )
                return {
                    "success": True,
                    "response": response.choices[0].message.content,
                    "model_used": model,
                    "attempts": attempt + 1  # Attempts made on this model
                }
            except Exception as e:
                last_error = e
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff
        # If we exhausted retries for this model, try the next one
        print(f"Model {model} failed, trying next...")
    return {
        "success": False,
        "error": str(last_error),
        "models_tried": list(preferred_models)
    }

# Usage
result = chat_with_fallback("Explain transformer architecture")
if result["success"]:
    print(f"Got response from {result['model_used']}: {result['response'][:100]}...")
else:
    print(f"All models failed: {result['error']}")
With this pattern, your application stays available even when individual providers experience outages. For mission-critical workflows, you can even add cross-provider fallbacks (e.g., from Chinese models to OpenAI/Azure as a last resort).
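One refinement worth considering is to remember which models failed recently, so that later requests do not spend their retries on a provider that is still down. The sketch below is an assumption-laden illustration: the class name `ModelCooldown`, its methods, and the 60-second default are all hypothetical, and the clock is injectable only to make the behaviour easy to test.

```python
import time
from typing import Callable, Dict, List


class ModelCooldown:
    """Remember recent failures so a fallback loop can try healthy models first."""

    def __init__(self, cooldown_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self._failed_at: Dict[str, float] = {}

    def mark_failed(self, model: str) -> None:
        self._failed_at[model] = self.clock()

    def mark_ok(self, model: str) -> None:
        self._failed_at.pop(model, None)

    def is_available(self, model: str) -> bool:
        failed = self._failed_at.get(model)
        return failed is None or self.clock() - failed >= self.cooldown_seconds

    def order(self, preferred: List[str]) -> List[str]:
        """Healthy models first (in preference order), cooling-down ones last."""
        healthy = [m for m in preferred if self.is_available(m)]
        cooling = [m for m in preferred if not self.is_available(m)]
        return healthy + cooling
```

A fallback loop could iterate over `cooldown.order(preferred_models)`, calling `mark_failed` when a model exhausts its retries and `mark_ok` after a successful response.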
Step 5: Streaming with Multi-LLM
Streaming is essential for chat applications. Good news — it works the same way across all models with a unified API:
def stream_chat(prompt: str, model: str = "deepseek-v4-flash"):
    """Stream responses from any model."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        temperature=0.7,
        max_tokens=1000
    )
    for chunk in stream:
        # Some chunks carry no choices (for example a trailing usage chunk)
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# Usage
for token in stream_chat("Write a haiku about programming", "qwen2.5-72b-instruct"):
    print(token, end="", flush=True)
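When streaming, you often still need the complete text afterwards, for logging or caching. A minimal sketch of one way to do that follows; the helper name `collect_stream` and its `on_token` callback are assumptions for illustration, and it works with any iterable of strings, including the generator above.

```python
from typing import Callable, Iterable, Optional


def collect_stream(tokens: Iterable[str],
                   on_token: Optional[Callable[[str], None]] = None) -> str:
    """Forward each streamed piece to `on_token` and return the full text at the end."""
    parts = []
    for token in tokens:
        if on_token is not None:
            on_token(token)  # For example, push the piece to the UI
        parts.append(token)
    return "".join(parts)
```

For example, `collect_stream(stream_chat("Write a haiku about programming"), on_token=lambda t: print(t, end="", flush=True))` prints pieces as they arrive and returns the whole reply.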
Step 6: Production Considerations
Cost Tracking & Budget Controls
from typing import Optional

class MultiLLMApplication:
    def __init__(self, api_key: str, monthly_budget: float = 100.0):
        self.client = OpenAI(api_key=api_key, base_url="https://api.haotokai.com/v1")
        self.monthly_budget = monthly_budget
        self.monthly_spend = 0.0
        self.usage_log = []
        # Model pricing (per 1M tokens)
        self.pricing = {
            "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
            "deepseek-v4-pro": {"input": 0.435, "output": 0.87},
            "qwen2.5-72b-instruct": {"input": 0.80, "output": 1.60},
            "glm-4": {"input": 0.50, "output": 1.00},
        }

    def calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Calculate the cost of an API call."""
        # Unknown models fall back to a deliberately pessimistic price
        prices = self.pricing.get(model, {"input": 1.0, "output": 2.0})
        input_cost = (input_tokens / 1_000_000) * prices["input"]
        output_cost = (output_tokens / 1_000_000) * prices["output"]
        return input_cost + output_cost

    def chat(self, prompt: str, model: Optional[str] = None) -> dict:
        """Smart chat with budget tracking."""
        if model is None:
            task_type = classify_task(prompt)
            model = get_model_for_task(task_type)
        # Budget check
        if self.monthly_spend >= self.monthly_budget:
            return {"error": "Monthly budget exceeded", "spend": self.monthly_spend}
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=1000
        )
        cost = self.calculate_cost(
            model,
            response.usage.prompt_tokens,
            response.usage.completion_tokens
        )
        self.monthly_spend += cost
        # Keep a per-call record so spend can be audited later
        self.usage_log.append({"model": model, "cost": cost})
        return {
            "response": response.choices[0].message.content,
            "model": model,
            "cost": cost,
            "total_spend": self.monthly_spend
        }
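The budget check above only looks at what has already been spent. A possible complement is to estimate a request's cost before sending it. The sketch below is a rough illustration under stated assumptions: the helper names are hypothetical, the prices are copied from the class above, and the four-characters-per-token ratio is a crude heuristic rather than a real tokenizer.

```python
PRICING = {  # USD per 1M tokens, copied from the cost-tracking class
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 0.435, "output": 0.87},
}


def pessimistic_cost(model: str, prompt: str, max_tokens: int,
                     chars_per_token: float = 4.0) -> float:
    """Estimate cost assuming the whole `max_tokens` output budget gets used."""
    prices = PRICING[model]
    est_input_tokens = len(prompt) / chars_per_token  # Rough heuristic, not a tokenizer
    return (est_input_tokens * prices["input"] + max_tokens * prices["output"]) / 1_000_000


def would_exceed_budget(spend: float, budget: float, estimated_cost: float) -> bool:
    """True if sending the request could push spend past the budget."""
    return spend + estimated_cost > budget
```

Checking `would_exceed_budget` before the API call lets the application reject or downgrade a request instead of discovering the overrun afterwards.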
A/B Testing Models
Never just assume a model is “good enough” — test it against your actual use case.
import random
import time
from typing import List, Optional

def ab_test_chat(prompt: str, models: List[str], weights: Optional[List[float]] = None) -> dict:
    """
    A/B test different models on real user traffic.
    Randomly assigns requests to models and tracks performance.
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    model = random.choices(models, weights=weights, k=1)[0]
    start_time = time.perf_counter()  # Monotonic clock, suited to measuring latency
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=1000
    )
    latency = time.perf_counter() - start_time
    return {
        "response": response.choices[0].message.content,
        "model": model,
        "latency": latency,
        "tokens": response.usage.total_tokens
    }

# Run A/B test
models_to_test = ["deepseek-v4-flash", "qwen2.5-72b-instruct", "glm-4"]
results = []
for i in range(50):  # Run 50 test queries
    result = ab_test_chat(f"Write a function to {['sort', 'filter', 'parse', 'generate'][i%4]}...", models_to_test)
    results.append(result)
# In production, you'd also collect user feedback or quality metrics
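The collected results only become useful once they are aggregated per model. A minimal sketch of that step follows, assuming each result has the `model`, `latency`, and `tokens` keys returned above; the function name `summarize_ab_results` is an assumption for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def summarize_ab_results(results: List[dict]) -> Dict[str, dict]:
    """Group A/B results by model and report request count, mean latency, mean tokens."""
    by_model = defaultdict(list)
    for r in results:
        by_model[r["model"]].append(r)
    return {
        model: {
            "requests": len(rows),
            "mean_latency": mean(r["latency"] for r in rows),
            "mean_tokens": mean(r["tokens"] for r in rows),
        }
        for model, rows in by_model.items()
    }
```

Latency and token counts say nothing about answer quality, so a summary like this is best read alongside whatever quality signal you collect.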
Advanced Pattern: Model Ensembling
For the highest quality output, you can use multiple models and combine their responses:
def ensemble_chat(prompt: str, models: List[str]) -> dict:
    """
    Get responses from multiple models and synthesize them.
    Useful for high-stakes queries where accuracy is critical.
    """
    responses = []
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=1500
        )
        responses.append({
            "model": model,
            "content": response.choices[0].message.content
        })
    # Use a strong model to synthesize
    joined = "\n".join(f"- {r['model']}: {r['content']}" for r in responses)
    synthesis_prompt = f"""You are an expert synthesizer. Below are responses from {len(models)} different AI models to the same question.
Original question: {prompt}
Model responses:
{joined}
Please synthesize the best answer by combining the strengths of all responses,
correcting any errors, and providing a single comprehensive answer.
"""
    synthesis = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[{"role": "user", "content": synthesis_prompt}],
        temperature=0.3,
        max_tokens=2000
    )
    return {
        "synthesized_answer": synthesis.choices[0].message.content,
        "individual_responses": responses
    }
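The loop above calls the models one after another, so total latency is the sum of all calls. One possible variation is to issue the calls in parallel with a thread pool, as sketched below. The name `fan_out` and the `ask(prompt, model)` callable are assumptions for illustration; `ask` stands in for any function that returns a model's text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fan_out(prompt: str, models: List[str], ask: Callable[[str, str], str]) -> List[dict]:
    """Call `ask(prompt, model)` for every model in parallel, preserving model order."""
    results = []
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = [(model, pool.submit(ask, prompt, model)) for model in models]
        for model, future in futures:
            try:
                results.append({"model": model, "content": future.result()})
            except Exception as exc:
                # Record the failure instead of failing the whole ensemble
                results.append({"model": model, "error": str(exc)})
    return results
```

With the helper from Step 2 this could be called as `fan_out(prompt, models, chat_with_model)`. Entries that carry an `error` key would need to be filtered out before building the synthesis prompt.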
Common Pitfalls to Avoid
1. Over-Engineering Your Router
Don’t build a 1,000-line routing system on day one. Start simple (just 2-3 models, basic keyword-based routing) and iterate based on data.
2. Ignoring Latency Differences
Cheaper models are often faster, but make sure they’re fast enough for your use case. DeepSeek V4 Flash typically responds in 200-500ms for simple queries, while Pro models take 1-3 seconds.
3. Forgetting About Context Limits
Different models have different context windows. A 128K-token model won’t work for a 200K-token document analysis task. Check limits before routing.
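A pre-routing check along these lines can be sketched as follows. The limits table is a placeholder: the 128,000 figure is only inferred from the model's name, `example-small-model` is a hypothetical entry, and real values should come from each provider's documentation.

```python
from typing import List, Optional

CONTEXT_LIMITS = {  # Tokens. Placeholder values: confirm against provider docs.
    "moonshot-v1-128k": 128_000,
    "example-small-model": 32_000,  # Hypothetical model, for illustration only
}


def fits_context(model: str, prompt_tokens: int, max_output_tokens: int) -> bool:
    """True if the prompt plus the reserved output fits the model's context window."""
    limit = CONTEXT_LIMITS.get(model)
    if limit is None:
        return False  # Unknown model: refuse rather than guess
    return prompt_tokens + max_output_tokens <= limit


def first_model_that_fits(candidates: List[str], prompt_tokens: int,
                          max_output_tokens: int) -> Optional[str]:
    """Return the first candidate whose window is large enough, or None."""
    for model in candidates:
        if fits_context(model, prompt_tokens, max_output_tokens):
            return model
    return None
```

Note that the check reserves room for the output as well as the prompt, since both count against the window.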
4. Not Testing Edge Cases
Make sure every model in your rotation can handle your edge cases — non-English text, code blocks, structured outputs, etc.
The Real Cost of Multi-LLM (vs. Single Provider)
Let’s do the math on what multi-LLM architecture saves:
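Scenario: 100,000 API calls/month, 1K input + 500 output tokens each (100M input and 50M output tokens in total). Costs assume list prices per 1M tokens (input/output) of $2.50/$10.00 for GPT-4o and $0.25/$2.00 for GPT-5 Mini, plus the Flash and Pro rates from the cost-tracking class above; check current pricing before relying on them.
| Approach | Model Used | Monthly Cost | Quality |
|---|---|---|---|
| Single premium | GPT-4o | $750 | High |
| Single budget | GPT-5 Mini | $125 | Medium |
| Multi-LLM (smart routing) | 80% Flash + 20% Pro | $39.80 | High (for 80%) + Premium (for 20%) |
Savings: about 95% vs. GPT-4o and about 68% vs. GPT-5 Mini, with better overall quality for complex tasks.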
The multi-LLM approach gives you better quality than a single budget model at a lower price point.
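As a sanity check on the routed row, the sketch below recomputes it from the Flash and Pro rates listed in the cost-tracking class. The 80/20 split and the scenario numbers come from the text above; the helper name `monthly_cost` is an assumption for illustration.

```python
CALLS = 100_000
INPUT_TOKENS, OUTPUT_TOKENS = 1_000, 500

PRICES = {  # USD per 1M tokens as (input, output), from the cost-tracking class
    "flash": (0.14, 0.28),
    "pro": (0.435, 0.87),
}


def monthly_cost(share: float, input_price: float, output_price: float) -> float:
    """Monthly cost of sending `share` of all calls to a model at the given prices."""
    calls = CALLS * share
    per_call = (INPUT_TOKENS * input_price + OUTPUT_TOKENS * output_price) / 1_000_000
    return calls * per_call


routed = monthly_cost(0.8, *PRICES["flash"]) + monthly_cost(0.2, *PRICES["pro"])
print(f"Routed 80/20 cost: ${routed:.2f}")
```

Under those rates the Flash share comes to about $22.40 and the Pro share to about $17.40, for roughly $39.80 per month.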
Getting Started with Haotokai
Building a multi-LLM application doesn’t have to be complicated. With Haotokai’s unified API, you get:
- One API key for all major Chinese AI models
- OpenAI-compatible endpoints — use your existing code
- 10+ models including DeepSeek V4, Qwen 2.5, GLM-4, Moonshot
- Transparent pricing with volume discounts
- 99.9% uptime SLA for production applications
- Developer dashboard with usage analytics and cost tracking
Ready to build your multi-LLM application? Sign up for Haotokai today and get $20 in free credits to test every model. No credit card required.