Moonshot AI's Kimi model offers a context window of up to 2 million tokens, one of the largest available in a large language model. For developers building applications that need to process entire books, legal documents, codebases, or research papers, the Kimi API can remove the need for complex chunking and retrieval systems. This guide covers what you need to know to build with Kimi.
Table of Contents
- What is Kimi? Understanding Moonshot AI
- Why Long Context Matters: The Kimi Advantage
- Kimi Model Variants & Capabilities
- Getting Started with Kimi API
- Kimi API Reference & Key Endpoints
- Code Examples: Building with Kimi API
- Kimi Pricing: Cost of 2M Token Context
- Real-World Use Cases for Kimi's Long Context
- Best Practices for Long-Context Applications
- Kimi vs Other Long-Context Models
- Access Kimi Easily Through Haotokai
What is Kimi? Understanding Moonshot AI
Kimi is the flagship large language model developed by Moonshot AI (月之暗面), a Chinese AI startup founded in 2023 by former ByteDance and Tsinghua University researchers. Despite being a relatively new player, Moonshot AI has quickly made a name for itself by pushing the boundaries of context window size.
What sets Kimi apart from other LLMs is its focus on ultra-long context understanding. While most models top out at 128K or 200K tokens, Kimi supports up to 2 million tokens, enough to process entire novels, comprehensive legal contracts, or large code repositories in a single prompt.
💡 What Does 2 Million Tokens Mean?
2 million tokens is roughly equivalent to 1.5 million words, or about 3,000 pages of text. That is enough to fit the entire Harry Potter series (about 1M words) with roughly 500K words to spare, close to the length of War and Peace (about 560K words), all in a single context window. For developers, this means you can feed entire documents, codebases, or datasets to the model without worrying about chunking or retrieval strategies.
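The arithmetic behind these figures can be sketched in a few lines. The ratios below (0.75 words per token and 500 words per page) are assumptions implied by the numbers above, not values published by Moonshot AI; real token counts depend on the tokenizer and the language of the text.

```python
# Rough conversions between tokens, words, and pages.
# ASSUMPTION: 0.75 words per token and 500 words per page, as implied by
# "2 million tokens ~ 1.5 million words ~ 3,000 pages". Not official figures.
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given token count."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Estimate how many tokens a given word count will consume."""
    return round(words / WORDS_PER_TOKEN)


def tokens_to_pages(tokens: int) -> int:
    """Estimate the page count for a given token count."""
    return round(tokens_to_words(tokens) / WORDS_PER_PAGE)


if __name__ == "__main__":
    print(tokens_to_words(2_000_000))  # words that fit in a 2M window
    print(tokens_to_pages(2_000_000))  # pages that fit in a 2M window
    print(words_to_tokens(1_000_000))  # tokens needed for a 1M-word series
```

Treat the result as a planning estimate only; the `usage` field returned by the API is the authoritative count.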
Moonshot AI's consumer product, Kimi Chat, has gained significant popularity in China for its ability to upload and analyze entire documents. The Kimi API brings this same long-context capability to developers, enabling a class of applications that were impractical with smaller context windows.
Why Long Context Matters: The Kimi Advantage
The context window is one of the most important specifications of an LLM, yet it's often overlooked. Here's why Kimi's massive context window is a game-changer:
1. No More RAG (for Many Use Cases)
Retrieval-Augmented Generation (RAG) has become the standard approach for building applications that need to work with large documents. But RAG is complex: you need vector databases, chunking strategies, embedding models, and retrieval pipelines. With Kimi's 2M token window, many applications can skip RAG entirely and just feed the full document directly to the model.
2. Better Coherence & Understanding
When you chunk documents for RAG, you lose context. The model sees isolated chunks rather than the full document, which can lead to inconsistencies and missed connections. Kimi sees everything at once, enabling deeper understanding and more coherent responses that reference the full context.
3. Faster Development
Building a production-grade RAG system takes weeks or months of engineering work. With Kimi's long context, you can build document analysis applications in hours: just upload the document and start asking questions.
4. Codebase Analysis
For developers working with large codebases, Kimi can ingest entire repositories and provide context-aware suggestions, bug fixes, and architectural analysis, which is difficult to do with models limited to 128K or 200K tokens.
Kimi Model Variants & Capabilities
Moonshot AI offers several Kimi model variants optimized for different use cases:
| Model Name | Context Window | Key Strengths | Best For |
|---|---|---|---|
| moonshot-v1-8k | 8K tokens | Fast, cost-effective, good quality | Simple tasks, high-volume chatbots |
| moonshot-v1-32k | 32K tokens | Balanced context and speed | Medium documents, customer support |
| moonshot-v1-128k | 128K tokens | Large document processing | Reports, articles, research papers |
| moonshot-v1-256k | 256K tokens | Extended document analysis | Books, legal contracts, theses |
| moonshot-v1-512k | 512K tokens | Massive document processing | Codebases, large datasets |
| moonshot-v1-1m | 1M tokens | Industry-leading context | Entire books, full codebases |
| moonshot-v1-2m | 2M tokens | Maximum context available | Massive datasets, multi-book analysis |
Key Capabilities
- Document Understanding: Parse, analyze, and extract information from long documents
- Chinese & English Bilingual: Strong performance in both Chinese and English
- Function Calling: Connect to external tools and APIs
- Code Generation: Write, analyze, and debug code across multiple languages
- Data Analysis: Process and analyze structured data
- Creative Writing: Generate long-form content with consistent style and plot
Getting Started with Kimi API
Prerequisites
Before you can start using the Kimi API, you'll need:
- A Moonshot AI API account (or Haotokai for easier international access)
- An API key from your account dashboard
- Basic knowledge of REST APIs
- Python or your preferred programming language
API Key Setup
Store your API key as an environment variable:
# Set environment variable (Linux/macOS)
export KIMI_API_KEY="your-api-key-here"
# Windows PowerShell
$env:KIMI_API_KEY = "your-api-key-here"
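
Once the variable is set, a script can read it and fail early with a clear message if it is missing. This is a minimal sketch; the helper name `load_api_key` is hypothetical and not part of any SDK.

```python
import os


def load_api_key(var_name: str = "KIMI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    `load_api_key` is an illustrative helper, not part of the Kimi API.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {var_name} is not set. "
            "Export it before running this script."
        )
    return key
```

Failing at startup avoids sending requests with an empty `Authorization` header and getting a less obvious authentication error back.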
The International Developer's Shortcut: Haotokai
While you can sign up directly with Moonshot AI, international developers often encounter hurdles:
- Chinese phone number required for verification
- Payment methods limited to Chinese services
- Documentation primarily in Chinese
- Customer support only available during Chinese business hours
Using Haotokai as your API gateway solves all these issues. You get access to Kimi (and other Chinese models) with:
- PayPal and international credit card support
- Full English documentation and support
- One API key for all models
- Standard OpenAI-compatible API format
Kimi API Reference & Key Endpoints
Chat Completions Endpoint
The primary endpoint for all Kimi models is the chat completions endpoint:
POST /v1/chat/completions
Key request parameters:
| Parameter | Type | Description |
|---|---|---|
| model | string (required) | The model to use (e.g., "moonshot-v1-128k") |
| messages | array (required) | Conversation history as message objects |
| temperature | float (optional) | Sampling temperature (0-1), default 0.3 |
| max_tokens | integer (optional) | Maximum tokens for the response |
| stream | boolean (optional) | Stream responses, default false |
| tools | array (optional) | List of functions/tools available to the model |
| response_format | object (optional) | Set to {"type": "json_object"} for JSON output |
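
As a sketch of how these parameters fit together, the helper below assembles a request body and checks the ranges listed in the table before anything is sent. The function name `build_chat_payload` and its `json_mode` flag are hypothetical conveniences, not part of the API.

```python
from typing import Any, Dict, List, Optional


def build_chat_payload(
    model: str,
    messages: List[Dict[str, str]],
    temperature: float = 0.3,
    max_tokens: Optional[int] = None,
    stream: bool = False,
    json_mode: bool = False,
) -> Dict[str, Any]:
    """Build a chat completions request body from the documented parameters.

    Illustrative helper: validates locally so mistakes surface before the
    request is sent.
    """
    if not messages:
        raise ValueError("messages must contain at least one message")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")

    payload: Dict[str, Any] = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "stream": stream,
    }
    if max_tokens is not None:
        if max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        payload["max_tokens"] = max_tokens
    if json_mode:
        # Matches the response_format value shown in the table above.
        payload["response_format"] = {"type": "json_object"}
    return payload
```

The returned dictionary can be passed as the `json=` argument of an HTTP client call to `POST /v1/chat/completions`.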
⚠️ Important: Long Context Considerations
When using Kimi's larger context windows (128k+), be aware that processing time increases with input length. For very large inputs (500k+ tokens), first response latency can be significant. We recommend using streaming for the best user experience with long documents.
Files API (for Document Upload)
Kimi also provides a Files API that allows you to upload documents and reference them in chat completions. This is particularly useful for document-heavy workflows:
# Upload a file
POST /v1/files
# List files
GET /v1/files
# Get file content
GET /v1/files/{file_id}/content
Code Examples: Building with Kimi API
Basic Python Example
Here's a simple example of calling the Kimi API using Python:
import os
import requests


def call_kimi_api(prompt, model="moonshot-v1-128k", api_key=None):
    """
    Call Kimi API with a user prompt.
    """
    api_key = api_key or os.getenv("KIMI_API_KEY")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": "You are Kimi, a helpful AI assistant developed by Moonshot AI."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        "temperature": 0.3,
        "max_tokens": 2048
    }
    response = requests.post(
        "https://api.moonshot.cn/v1/chat/completions",
        headers=headers,
        json=payload
    )
    if response.status_code == 200:
        result = response.json()
        return {
            "content": result["choices"][0]["message"]["content"],
            "usage": result["usage"]
        }
    else:
        raise Exception(f"API Error: {response.status_code} - {response.text}")


# Example usage
if __name__ == "__main__":
    result = call_kimi_api(
        "Explain the advantages of long-context language models for enterprise applications.",
        model="moonshot-v1-128k"
    )
    print("Response:", result["content"])
    print(f"\nTokens used: {result['usage']['total_tokens']}")
Using Haotokai Unified API (Recommended)
Accessing Kimi through Haotokai is even easier and gives you access to 50+ other models with the same API key:
import os
import requests


def call_haotokai_model(prompt, model="moonshot-v1-128k", api_key=None):
    """
    Call any AI model through Haotokai's unified API.
    Supports Kimi, GLM-4, DeepSeek, Qwen, Claude, and more.
    """
    api_key = api_key or os.getenv("HAOTOKAI_API_KEY")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.3,
        "stream": False
    }
    response = requests.post(
        "https://www.haotokai.com/v1/chat/completions",
        headers=headers,
        json=payload
    )
    if response.status_code == 200:
        return response.json()["choices"][0]["message"]["content"]
    else:
        raise Exception(f"Error: {response.status_code} - {response.text}")


# Compare different long-context models with one function!
models = ["moonshot-v1-128k", "glm-4-long", "claude-3-sonnet-20240229"]
prompt = "Summarize the key advantages of your model in 3 bullet points."
for model in models:
    print(f"\n=== {model} ===")
    try:
        answer = call_haotokai_model(prompt, model=model)
        print(answer[:200] + "..." if len(answer) > 200 else answer)
    except Exception as e:
        print(f"Error: {e}")
Document Analysis Example with Long Context
Here's a practical example of using Kimi's long context to analyze a large document:
import os
import requests


def analyze_document(document_path, analysis_type="summary", api_key=None):
    """
    Analyze a large document using Kimi's long-context model.
    No chunking or RAG required: just feed the whole document!
    """
    api_key = api_key or os.getenv("HAOTOKAI_API_KEY")
    # Read the full document
    with open(document_path, 'r', encoding='utf-8') as f:
        document_content = f.read()
    # Build prompts based on analysis type
    prompts = {
        "summary": "Please provide a comprehensive summary of the following document. "
                   "Include the main arguments, key findings, and important details.",
        "key_points": "Extract the 10 most important key points from the following document. "
                      "Present them as a numbered list with brief explanations.",
        "action_items": "Identify all action items, decisions, and to-do items in the document. "
                        "Format as a checklist with responsible parties and deadlines where mentioned."
    }
    system_prompt = prompts.get(analysis_type, prompts["summary"])
    payload = {
        "model": "moonshot-v1-128k",  # Use 2m for very large docs
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Here is the document:\n\n{document_content}"}
        ],
        "temperature": 0.2,
        "max_tokens": 4096
    }
    response = requests.post(
        "https://www.haotokai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json=payload
    )
    if response.status_code == 200:
        result = response.json()
        return {
            "analysis": result["choices"][0]["message"]["content"],
            "tokens_used": result["usage"]["total_tokens"],
            "document_tokens": result["usage"]["prompt_tokens"]
        }
    else:
        raise Exception(f"API Error: {response.status_code} - {response.text}")


# Example: Analyze a 100-page report
result = analyze_document("annual_report.txt", analysis_type="summary")
print(f"Document tokens: {result['document_tokens']}")
print(f"Total tokens used: {result['tokens_used']}")
print("\n=== Summary ===")
print(result["analysis"])
Streaming Response for Long Documents
When working with long documents, streaming provides a much better user experience:
import os
import requests
import json


def stream_kimi_response(prompt, model="moonshot-v1-128k", api_key=None):
    """
    Stream a response from Kimi for better UX with long documents.
    """
    api_key = api_key or os.getenv("HAOTOKAI_API_KEY")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "Accept": "text/event-stream"
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "temperature": 0.3
    }
    response = requests.post(
        "https://www.haotokai.com/v1/chat/completions",
        headers=headers,
        json=payload,
        stream=True
    )
    full_response = []
    for line in response.iter_lines():
        if line:
            line = line.decode('utf-8')
            if line.startswith('data: '):
                data = line[6:]
                if data == '[DONE]':
                    break
                try:
                    chunk = json.loads(data)
                    if "choices" in chunk and len(chunk["choices"]) > 0:
                        delta = chunk["choices"][0].get("delta", {})
                        content = delta.get("content", "")
                        if content:
                            full_response.append(content)
                            print(content, end="", flush=True)
                except json.JSONDecodeError:
                    pass
    return ''.join(full_response)


# Example: Stream a long document analysis
prompt = """I'm going to give you a 50,000 word technical specification.
Please analyze it and provide:
1. Executive summary
2. System architecture overview
3. Key technical decisions
4. Potential risks and issues
5. Recommendations
Document follows:
[50,000 word document goes here]
"""
stream_kimi_response(prompt, model="moonshot-v1-128k")
Function Calling with Kimi
Kimi supports function calling for building agentic applications:
import os
import requests
import json


def search_documents(query):
    """Simulated document search function."""
    return [
        {"title": "Q3 Financial Report", "relevance": 0.95},
        {"title": "Company Strategy 2026", "relevance": 0.87},
        {"title": "Product Roadmap", "relevance": 0.78}
    ]


def kimi_agent(query, api_key=None):
    api_key = api_key or os.getenv("HAOTOKAI_API_KEY")
    tools = [
        {
            "type": "function",
            "function": {
                "name": "search_documents",
                "description": "Search the internal document library",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "Search query"
                        }
                    },
                    "required": ["query"]
                }
            }
        }
    ]
    payload = {
        "model": "moonshot-v1-128k",
        "messages": [{"role": "user", "content": query}],
        "tools": tools,
        "tool_choice": "auto"
    }
    response = requests.post(
        "https://www.haotokai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json=payload
    )
    result = response.json()
    message = result["choices"][0]["message"]
    if message.get("tool_calls"):
        # Run each requested tool locally and return the raw results.
        # A full agent would send these back to the model in a follow-up request.
        results = []
        for call in message["tool_calls"]:
            if call["function"]["name"] == "search_documents":
                args = json.loads(call["function"]["arguments"])
                results.append(search_documents(args["query"]))
        return results
    # No tool was requested, so the model answered directly.
    return message["content"]
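
The tool-call handling step in the agent above can be sketched as a small dispatcher. This assumes the response follows the OpenAI-compatible shape (each tool call has an `id` and a `function` with a `name` and JSON-encoded `arguments`); the `run_tool_calls` helper and the `registry` mapping are hypothetical names introduced for illustration.

```python
import json
from typing import Any, Callable, Dict, List


def run_tool_calls(
    message: Dict[str, Any],
    registry: Dict[str, Callable[..., Any]],
) -> List[Dict[str, Any]]:
    """Execute each tool call in an assistant message.

    Returns a list of "tool" role messages that can be appended to the
    conversation and sent back to the model in a follow-up request.
    ASSUMPTION: OpenAI-compatible tool_calls structure.
    """
    results = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        # Arguments arrive as a JSON string, not a parsed object.
        args = json.loads(call["function"].get("arguments") or "{}")
        if name in registry:
            output = registry[name](**args)
        else:
            output = {"error": f"unknown tool: {name}"}
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results


if __name__ == "__main__":
    # Hand-written sample message standing in for a real API response.
    sample = {
        "tool_calls": [{
            "id": "call_1",
            "function": {"name": "echo", "arguments": '{"text": "hello"}'},
        }]
    }
    print(run_tool_calls(sample, {"echo": lambda text: {"echo": text}}))
```

After appending the assistant message and these tool messages to `messages`, a second request lets the model compose its final answer from the tool output.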
Kimi Pricing: Cost of 2M Token Context
Kimi's pricing is competitive, especially considering the massive context windows. Here's the current pricing structure:
| Model | Input Cost (/M tokens) | Output Cost (/M tokens) | Context Window |
|---|---|---|---|
| moonshot-v1-8k | $0.60 | $1.20 | 8K |
| moonshot-v1-32k | $1.20 | $2.40 | 32K |
| moonshot-v1-128k | $2.40 | $4.80 | 128K |
| moonshot-v1-256k | $3.00 | $6.00 | 256K |
| moonshot-v1-512k | $3.60 | $7.20 | 512K |
| moonshot-v1-1m | $4.80 | $9.60 | 1M |
| moonshot-v1-2m | $6.00 | $12.00 | 2M |
💡 Cost Perspective
While Kimi's per-token cost is higher than smaller Chinese models like GLM-4, remember that you're getting capabilities that would otherwise require building and maintaining a complex RAG system. For many document-heavy use cases, Kimi can save money by eliminating engineering overhead. A 100-page document (~80K tokens) analyzed with the 128K model costs about $0.19 in input tokens (80K tokens at $2.40 per million), plus a small amount for the output. That is often cheaper than the engineering time to set up and maintain RAG for the same task.
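The cost figure above can be reproduced with a short calculation. The prices are copied from the pricing table in this section; the function name `estimate_cost` is a hypothetical helper, and actual billing may differ from this estimate.

```python
# Prices in USD per million tokens as (input, output), copied from the
# pricing table above. Check current rates before relying on them.
PRICES_USD_PER_M = {
    "moonshot-v1-8k": (0.60, 1.20),
    "moonshot-v1-32k": (1.20, 2.40),
    "moonshot-v1-128k": (2.40, 4.80),
    "moonshot-v1-256k": (3.00, 6.00),
    "moonshot-v1-512k": (3.60, 7.20),
    "moonshot-v1-1m": (4.80, 9.60),
    "moonshot-v1-2m": (6.00, 12.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Estimate the cost in USD of one request (illustrative helper)."""
    input_price, output_price = PRICES_USD_PER_M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # An 80K-token document on the 128K model, input only.
    print(round(estimate_cost("moonshot-v1-128k", 80_000), 4))
    # The same request with a 1,000-token response included.
    print(round(estimate_cost("moonshot-v1-128k", 80_000, 1_000), 4))
```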
Cost Optimization Strategies
- Right-size your context window: Use the smallest model that fits your document. If your docs are 60K tokens, use 128K, not 2M.
- Use Haotokai for volume discounts: Higher usage tiers get better rates across all models.
- Cache repeated inputs: If you ask the same question about the same document more than once, cache the response, or use a two-step approach such as summarizing the document once and running follow-up queries against the summary.
- Preprocess documents: Remove boilerplate, headers, footers, and irrelevant content before sending to the API.
- Compare models: For simpler tasks, try cheaper alternatives like GLM-4-long through Haotokai.
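
The caching strategy from the list above can be sketched as an in-memory cache keyed by a hash of the model, document, and question. The `ResponseCache` class is a hypothetical illustration; `call_model` stands in for whatever function actually calls the API, and a production system would likely use a persistent store instead of a dictionary.

```python
import hashlib
import json
from typing import Callable, Dict


class ResponseCache:
    """Cache model answers so identical requests are only paid for once.

    Illustrative sketch: `call_model` is any function taking
    (model, document, question) and returning the answer text.
    """

    def __init__(self, call_model: Callable[[str, str, str], str]):
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, document: str, question: str) -> str:
        # Hash the full request so large documents do not bloat the key.
        raw = json.dumps([model, document, question], ensure_ascii=False)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def ask(self, model: str, document: str, question: str) -> str:
        key = self._key(model, document, question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._call_model(model, document, question)
        self._store[key] = answer
        return answer
```

This only helps when the exact same request repeats; a different question about the same document still sends the full document again.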
Real-World Use Cases for Kimi's Long Context
1. Legal Document Analysis
Law firms and legal teams use Kimi to analyze contracts, case files, and regulatory documents. The 2M token window means entire case files or multi-contract comparisons can be done in a single prompt, with the model able to cross-reference clauses across hundreds of pages.
2. Codebase Understanding & Analysis
Development teams use Kimi to understand large codebases quickly. By feeding in an entire repository's source code, developers can ask questions about architecture, find bugs, generate documentation, or plan refactoring, all with full context of the codebase.
3. Academic Research & Literature Review
Researchers use Kimi to conduct literature reviews by feeding in hundreds of papers. The model can identify patterns, compare methodologies, and synthesize findings across the entire body of research, something that would take a human researcher weeks or months.
4. Financial Report Analysis
Financial analysts use Kimi to analyze annual reports, earnings calls, and market data. The model can extract key metrics, identify trends across multiple quarters, and compare performance across competitors, all from raw document inputs.
5. Book & Long-Form Content Creation
Authors and content creators use Kimi to write and edit long-form content. With a 2M token context, the model can maintain consistent plotlines, character arcs, and world-building details across an entire novel.
6. Enterprise Knowledge Management
Companies use Kimi to build internal knowledge assistants that understand the full context of company policies, procedures, and documentation without needing to build and maintain complex RAG infrastructure.
Best Practices for Long-Context Applications
1. Structure Your Prompts for Long Documents
When working with very long documents, structure your prompt to guide the model's attention:
- Put instructions at the beginning and end of the prompt (the model pays most attention to these areas)
- Clearly label sections (e.g., "DOCUMENT:", "INSTRUCTIONS:")
- Be specific about what you want the model to extract or analyze
- Ask for structured output (JSON, bullet points) for easier parsing
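
The guidelines above can be combined into a small prompt builder. This is one possible layout, not a format required by Kimi; the function name `build_long_document_prompt` and the exact section labels are illustrative choices.

```python
def build_long_document_prompt(
    instructions: str,
    document: str,
    output_format: str = "bullet points",
) -> str:
    """Assemble a prompt with labeled sections.

    The instructions appear both before and after the document, and the
    closing section names the desired output format.
    """
    instructions = instructions.strip()
    sections = [
        f"INSTRUCTIONS:\n{instructions}",
        f"DOCUMENT:\n{document.strip()}",
        f"INSTRUCTIONS (repeated):\n{instructions}\n"
        f"Respond using {output_format}.",
    ]
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(build_long_document_prompt(
        "List every deadline mentioned in the contract.",
        "[contract text goes here]",
        output_format="a JSON array of objects with 'clause' and 'deadline' keys",
    ))
```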
2. Use Streaming for Better UX
Long documents take longer to process. Streaming responses gives users immediate feedback and makes the application feel faster. Always use streaming for user-facing applications with document inputs.
3. Choose the Right Context Size
Don't automatically use the 2M model for everything. Larger context windows are more expensive and have higher latency. Choose the smallest model that can comfortably fit your input plus the expected output.
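Choosing the smallest model that fits can be automated with a simple lookup. The model names come from the variants table earlier in this guide; treating "8K" as 8,000 tokens (and so on) and the 10% headroom are assumptions made for this sketch, and `pick_model` is a hypothetical helper.

```python
# Model names from the variants table, ordered smallest to largest.
# ASSUMPTION: "8K" is treated as 8,000 tokens, "1M" as 1,000,000, etc.
MODEL_CONTEXT = [
    ("moonshot-v1-8k", 8_000),
    ("moonshot-v1-32k", 32_000),
    ("moonshot-v1-128k", 128_000),
    ("moonshot-v1-256k", 256_000),
    ("moonshot-v1-512k", 512_000),
    ("moonshot-v1-1m", 1_000_000),
    ("moonshot-v1-2m", 2_000_000),
]


def pick_model(
    input_tokens: int,
    expected_output_tokens: int = 2048,
    headroom: float = 0.10,
) -> str:
    """Return the smallest model whose window fits input plus output.

    `headroom` adds a safety margin because token estimates are imprecise.
    """
    needed = int((input_tokens + expected_output_tokens) * (1 + headroom))
    for name, limit in MODEL_CONTEXT:
        if needed <= limit:
            return name
    raise ValueError(f"{needed} tokens exceeds the largest context window")


if __name__ == "__main__":
    print(pick_model(60_000))   # a 60K-token document
    print(pick_model(900_000))  # a very large input
```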
4. Implement Error Handling for Large Inputs
Very large inputs can sometimes cause timeouts or other issues. Implement:
- Graceful degradation (fall back to smaller model if needed)
- Retry logic with exponential backoff
- Input size validation and user warnings
- Progress indicators for long-running requests
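
Two items from this list, retry with exponential backoff and fallback to another model, can be sketched together. The `call_with_retry` helper is hypothetical; `call` stands in for any function that sends a request for a given model name, and the broad `except Exception` is a simplification that real code should narrow to transient errors such as timeouts.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_retry(
    call: Callable[[str], str],
    models: Sequence[str],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order, retrying failures with exponential backoff.

    Delays between attempts on the same model are base_delay * 2**attempt
    (1s, 2s, 4s, ... by default). When a model exhausts its attempts, the
    next model in `models` is tried.
    """
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(max_attempts):
            try:
                return call(model)
            except Exception as exc:  # simplification: narrow this in real code
                last_error = exc
                if attempt < max_attempts - 1:
                    sleep(base_delay * (2 ** attempt))
    raise RuntimeError("all models failed") from last_error
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.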
5. Consider Hybrid Approaches for Very Large Datasets
While Kimi's 2M window is massive, some use cases involve even larger datasets. For these cases, consider a hybrid approach: use Kimi for document-level analysis and combine with vector search for dataset-level retrieval.
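A hybrid pipeline of this kind first narrows the dataset, then sends only the selected documents to the model. The sketch below uses plain keyword overlap as a stand-in for vector search, purely to keep the example self-contained; the function names and the 0.75 words-per-token estimate are assumptions.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Crude token estimate. ASSUMPTION: about 0.75 words per token."""
    return int(len(text.split()) / 0.75) + 1


def relevance(query: str, text: str) -> int:
    """Count query-word occurrences (a stand-in for vector similarity)."""
    query_words = set(query.lower().split())
    return sum(1 for word in text.lower().split() if word in query_words)


def select_documents(
    query: str,
    documents: Dict[str, str],
    token_budget: int,
) -> List[str]:
    """Pick the most relevant documents that fit within the token budget."""
    ranked = sorted(
        documents.items(),
        key=lambda item: relevance(query, item[1]),
        reverse=True,
    )
    chosen: List[str] = []
    used = 0
    for name, text in ranked:
        if relevance(query, text) == 0:
            continue  # skip documents with no match at all
        cost = estimate_tokens(text)
        if used + cost <= token_budget:
            chosen.append(name)
            used += cost
    return chosen
```

The selected documents can then be concatenated into a single long-context prompt, so retrieval decides which documents to read and the model reads each of them in full.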
Kimi vs Other Long-Context Models
How does Kimi compare to other long-context models on the market?
| Model | Provider | Max Context | Input Cost (/M) | Best For |
|---|---|---|---|---|
| Kimi (Moonshot) | Moonshot AI | 2M tokens | $6.00 | Maximum context, document analysis |
| GLM-4 Long | Zhipu AI | 1M tokens | $0.005 | Cost-effective long context |
| Claude 3.5 Sonnet | Anthropic | 200K tokens | $3.00 | Balanced quality & context |
| GPT-4o | OpenAI | 128K tokens | $5.00 | General purpose, best quality |
| Qwen 2.5 72B | Alibaba | 128K tokens | $0.50 | Affordable Chinese model |
Kimi has the largest context window of the models compared here. While it's not the cheapest per-token option, the 2M context enables use cases that models with smaller windows cannot handle in a single prompt. For document-heavy applications where context depth is critical, Kimi is a strong choice.
For developers who want to experiment with multiple models, Haotokai provides access to all of these models through a single API, making it easy to compare and choose the best model for each use case.
Access Kimi Easily Through Haotokai
While Moonshot AI's direct API is powerful, international developers face significant barriers to access. Haotokai solves these problems by providing a unified gateway to Kimi and other top Chinese AI models.
Why Haotokai is the Best Way to Use Kimi
✓ PayPal & International Payments
Skip the hassle of Chinese payment methods. Haotokai supports PayPal, Visa, Mastercard, and other international payment options. Top up your balance in minutes and start building immediately.
✓ One API Key, 50+ Models
Access Kimi alongside GLM-4, DeepSeek, Qwen, Claude, Llama, and more, all with one API key. Mix and match models for different use cases without managing multiple accounts and billing.
✓ OpenAI-Compatible API
Haotokai uses the standard OpenAI API format. If you already have code that works with OpenAI, you can switch to Kimi by changing the endpoint URL, the API key, and the model name. No SDK changes or API rewrites needed.
✓ English Documentation & Support
Get full English documentation, API references, and customer support. No more struggling with machine-translated docs or language barriers when you need help.
✓ Competitive Pricing & Volume Discounts
Get Kimi at competitive rates with volume discounts for high-usage customers. New users get free credits to try the service risk-free.
Getting Started with Haotokai & Kimi
- Visit haotokai.com and create an account
- Top up your balance using PayPal or your preferred payment method
- Copy your API key from the dashboard
- Start building with Kimi's long-context capabilities!
Ready to Build with Kimi's 2M Token Context?
Get started with Haotokai today: access Kimi and 50+ other AI models with one API key. Pay with PayPal, no credit card required, and unlock the power of ultra-long context for your applications.
Conclusion
Moonshot AI's Kimi represents a paradigm shift in what's possible with large language models. Its 2 million token context window enables applications that were impossible just a few years ago, from entire codebase analysis to multi-book research to comprehensive legal document review.
While Kimi's per-token pricing is higher than smaller Chinese models, the productivity gains from eliminating RAG complexity and enabling new use cases often justify the cost. For document-heavy applications, the long context window isn't just a feature; it's a competitive advantage.
For international developers, the easiest way to access Kimi is through Haotokai. With PayPal support, English documentation, and one API key for all models, Haotokai removes the friction of working with Chinese AI services while giving you access to one of the longest-context commercial models available.
As context windows continue to grow, we'll see entirely new categories of applications emerge. Kimi is leading that charge today, and developers who master long-context development will be well-positioned to build the next generation of AI-powered applications.