Kimi API: The Complete Guide to Moonshot AI's Long-Context Model

📅 June 6, 2026 ⏱️ 18 min read 👤 Haotokai Team

Moonshot AI's Kimi models are built around very long context windows, up to 2 million tokens in the lineup described in this guide. For developers building applications that process entire books, legal documents, codebases, or research papers, the Kimi API can remove the need for chunking and retrieval systems in many cases. This guide covers the model variants, the key API endpoints, working code examples, pricing, and best practices for building with Kimi.

What is Kimi? Understanding Moonshot AI

Kimi is the flagship large language model developed by Moonshot AI (月之暗面), a Chinese AI startup founded in 2023 by former ByteDance and Tsinghua University researchers. Despite being a relatively new player, Moonshot AI has quickly made a name for itself by pushing the boundaries of context window size.

What sets Kimi apart from other LLMs is its focus on ultra-long context understanding. While most models top out at 128K or 200K tokens, Kimi supports up to 2 million tokens, enough to process entire novels, comprehensive legal contracts, or large code repositories in a single prompt.

💡 What Does 2 Million Tokens Mean?

2 million tokens is roughly equivalent to 1.5 million words, or about 3,000 pages of text. For scale, the entire Harry Potter series (about 1M words) plus most of War and Peace (about 560K words) would fit in a single context window. For developers, this means you can feed entire documents, codebases, or datasets to the model without worrying about chunking or retrieval strategies.
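The arithmetic behind these figures can be sketched in a few lines. The ratios below (0.75 words per token, 500 words per page) are simply back-derived from the numbers quoted above; they are rules of thumb, not tokenizer output, and the function names are hypothetical.

```python
# Rule-of-thumb ratios implied by the figures above (assumptions, not tokenizer output):
# 2,000,000 tokens ~ 1,500,000 words ~ 3,000 pages.
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500


def estimate_tokens(word_count: int) -> int:
    """Estimate the token count of a text with the given number of words."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_pages(token_count: int) -> int:
    """Estimate how many pages a given number of tokens corresponds to."""
    return round(token_count * WORDS_PER_TOKEN / WORDS_PER_PAGE)


def fits_in_context(word_count: int, context_tokens: int = 2_000_000) -> bool:
    """Return True if the estimated token count fits in the context window."""
    return estimate_tokens(word_count) <= context_tokens
```

For real inputs, actual token counts depend on the tokenizer and the language of the text, so treat these estimates as a first check only.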

Moonshot AI's consumer product, Kimi Chat, has gained significant popularity in China for its ability to upload and analyze entire documents. The Kimi API brings this same long-context capability to developers, enabling applications that were impractical with smaller context windows.

Why Long Context Matters: The Kimi Advantage

The context window is one of the most important specifications of an LLM, yet it's often overlooked. Here's why Kimi's massive context window is a game-changer:

1. No More RAG (for Many Use Cases)

Retrieval-Augmented Generation (RAG) has become the standard approach for building applications that need to work with large documents. But RAG is complex: you need vector databases, chunking strategies, embedding models, and retrieval pipelines. With Kimi's 2M token window, many applications can skip RAG entirely and just feed the full document directly to the model.

2. Better Coherence & Understanding

When you chunk documents for RAG, you lose context. The model sees isolated chunks rather than the full document, which can lead to inconsistencies and missed connections. Kimi sees everything at once, enabling deeper understanding and more coherent responses that reference the full context.

3. Faster Development

Building a production-grade RAG system takes weeks or months of engineering work. With Kimi's long context, you can build document analysis applications in hours: just upload the document and start asking questions.

4. Codebase Analysis

For developers working with large codebases, Kimi can ingest entire repositories and provide context-aware suggestions, bug fixes, and architectural analysis, which few other models can do at this scale.

Kimi Model Variants & Capabilities

Moonshot AI offers several Kimi model variants optimized for different use cases:

Model Name | Context Window | Key Strengths | Best For
moonshot-v1-8k | 8K tokens | Fast, cost-effective, good quality | Simple tasks, high-volume chatbots
moonshot-v1-32k | 32K tokens | Balanced context and speed | Medium documents, customer support
moonshot-v1-128k | 128K tokens | Large document processing | Reports, articles, research papers
moonshot-v1-256k | 256K tokens | Extended document analysis | Books, legal contracts, theses
moonshot-v1-512k | 512K tokens | Massive document processing | Codebases, large datasets
moonshot-v1-1m | 1M tokens | Industry-leading context | Entire books, full codebases
moonshot-v1-2m | 2M tokens | Maximum context available | Massive datasets, multi-book analysis

Key Capabilities

Getting Started with Kimi API

Prerequisites

Before you can start using the Kimi API, you'll need:

API Key Setup

Store your API key as an environment variable:

# Set environment variable (Linux/macOS)
export KIMI_API_KEY="your-api-key-here"

# Windows PowerShell
$env:KIMI_API_KEY = "your-api-key-here"
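On the Python side, it helps to read the variable once and fail with a clear message if it is missing, rather than sending an empty bearer token. A minimal sketch, assuming the `KIMI_API_KEY` name used above; `load_api_key` is a hypothetical helper, not part of any SDK.

```python
import os


def load_api_key(var_name: str = "KIMI_API_KEY") -> str:
    """Read the API key from the environment, failing fast if it is absent."""
    key = os.getenv(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running.")
    return key
```

Calling `load_api_key("HAOTOKAI_API_KEY")` works the same way for the gateway examples later in this guide.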

The International Developer's Shortcut: Haotokai

While you can sign up directly with Moonshot AI, international developers often encounter hurdles:

Using Haotokai as your API gateway solves all these issues. You get access to Kimi (and other Chinese models) with:

Kimi API Reference & Key Endpoints

Chat Completions Endpoint

The primary endpoint for all Kimi models is the chat completions endpoint:

POST /v1/chat/completions

Key request parameters:

Parameter | Type | Description
model | string (required) | The model to use (e.g., "moonshot-v1-128k")
messages | array (required) | Conversation history as message objects
temperature | float (optional) | Sampling temperature (0-1), default 0.3
max_tokens | integer (optional) | Maximum tokens for the response
stream | boolean (optional) | Stream responses, default false
tools | array (optional) | List of functions/tools available to the model
response_format | object (optional) | Set to {"type": "json_object"} for JSON output
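The parameters above map directly onto a JSON request body. The following sketch assembles one and validates the temperature range listed in the table; `build_chat_payload` and its `json_mode` flag are hypothetical names chosen for illustration.

```python
from typing import Any, Dict, Optional


def build_chat_payload(
    prompt: str,
    model: str = "moonshot-v1-128k",
    temperature: float = 0.3,
    max_tokens: Optional[int] = None,
    stream: bool = False,
    json_mode: bool = False,
) -> Dict[str, Any]:
    """Build a chat completions request body from the documented parameters."""
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")

    payload: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "stream": stream,
    }
    # Optional fields are only sent when explicitly requested.
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if json_mode:
        payload["response_format"] = {"type": "json_object"}
    return payload
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, as in the examples later in this guide.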

⚠️ Important: Long Context Considerations

When using Kimi's larger context windows (128k+), be aware that processing time increases with input length. For very large inputs (500k+ tokens), first response latency can be significant. We recommend using streaming for the best user experience with long documents.

Files API (for Document Upload)

Kimi also provides a Files API that allows you to upload documents and reference them in chat completions. This is particularly useful for document-heavy workflows:

# Upload a file
POST /v1/files

# List files
GET /v1/files

# Get file content
GET /v1/files/{file_id}/content
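A rough sketch of how these endpoints might be combined is shown below. Several details are assumptions that should be checked against the current Moonshot documentation: the `purpose` form field and its `"file-extract"` value, the `id` field in the upload response, and the practice of passing the extracted text to the model as a system message. The functions take a `requests.Session`-like object so that authentication is configured in one place; all function names are hypothetical.

```python
BASE_URL = "https://api.moonshot.cn/v1"  # same host as the direct-API example in this guide


def upload_file(session, path, purpose="file-extract"):
    """POST /v1/files as multipart form data and return the new file id.

    The `purpose` field and the `id` response key are assumptions.
    """
    with open(path, "rb") as fh:
        resp = session.post(
            f"{BASE_URL}/files",
            files={"file": fh},
            data={"purpose": purpose},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()["id"]


def fetch_file_text(session, file_id):
    """GET /v1/files/{file_id}/content and return the raw response body."""
    resp = session.get(f"{BASE_URL}/files/{file_id}/content", timeout=120)
    resp.raise_for_status()
    return resp.text


def build_file_messages(file_text, question):
    """Place the extracted document text ahead of the user's question."""
    return [
        {"role": "system", "content": file_text},
        {"role": "user", "content": question},
    ]


# Usage (requires the `requests` package and a valid key):
#   import requests
#   session = requests.Session()
#   session.headers["Authorization"] = f"Bearer {api_key}"
#   file_id = upload_file(session, "contract.pdf")
#   messages = build_file_messages(fetch_file_text(session, file_id), "List the key risks.")
```

The resulting `messages` list goes into a normal chat completions request.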

Code Examples: Building with Kimi API

Basic Python Example

Here's a simple example of calling the Kimi API using Python:

import os
import requests

def call_kimi_api(prompt, model="moonshot-v1-128k", api_key=None):
    """
    Call Kimi API with a user prompt.
    """
    api_key = api_key or os.getenv("KIMI_API_KEY")
    
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": "You are Kimi, a helpful AI assistant developed by Moonshot AI."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        "temperature": 0.3,
        "max_tokens": 2048
    }
    
    response = requests.post(
        "https://api.moonshot.cn/v1/chat/completions",
        headers=headers,
        json=payload,
        timeout=120  # long prompts can be slow; avoid hanging indefinitely
    )
    
    if response.status_code == 200:
        result = response.json()
        return {
            "content": result["choices"][0]["message"]["content"],
            "usage": result["usage"]
        }
    else:
        raise Exception(f"API Error: {response.status_code} - {response.text}")

# Example usage
if __name__ == "__main__":
    result = call_kimi_api(
        "Explain the advantages of long-context language models for enterprise applications.",
        model="moonshot-v1-128k"
    )
    print("Response:", result["content"])
    print(f"\nTokens used: {result['usage']['total_tokens']}")

Using Haotokai Unified API (Recommended)

Accessing Kimi through Haotokai is even easier and gives you access to 50+ other models with the same API key:

import os
import requests

def call_haotokai_model(prompt, model="moonshot-v1-128k", api_key=None):
    """
    Call any AI model through Haotokai's unified API.
    Supports Kimi, GLM-4, DeepSeek, Qwen, Claude, and more.
    """
    api_key = api_key or os.getenv("HAOTOKAI_API_KEY")
    
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.3,
        "stream": False
    }
    
    response = requests.post(
        "https://www.haotokai.com/v1/chat/completions",
        headers=headers,
        json=payload
    )
    
    if response.status_code == 200:
        return response.json()["choices"][0]["message"]["content"]
    else:
        raise Exception(f"Error: {response.status_code} - {response.text}")

# Compare different long-context models with one function!
models = ["moonshot-v1-128k", "glm-4-long", "claude-3-sonnet-20240229"]
prompt = "Summarize the key advantages of your model in 3 bullet points."

for model in models:
    print(f"\n=== {model} ===")
    try:
        answer = call_haotokai_model(prompt, model=model)
        print(answer[:200] + "..." if len(answer) > 200 else answer)
    except Exception as e:
        print(f"Error: {e}")

Document Analysis Example with Long Context

Here's a practical example of using Kimi's long context to analyze a large document:

import os
import requests

def analyze_document(document_path, analysis_type="summary", api_key=None):
    """
    Analyze a large document using Kimi's long-context model.
    No chunking or RAG required: just feed the whole document!
    """
    api_key = api_key or os.getenv("HAOTOKAI_API_KEY")
    
    # Read the full document
    with open(document_path, 'r', encoding='utf-8') as f:
        document_content = f.read()
    
    # Build prompts based on analysis type
    prompts = {
        "summary": "Please provide a comprehensive summary of the following document. "
                   "Include the main arguments, key findings, and important details.",
        "key_points": "Extract the 10 most important key points from the following document. "
                     "Present them as a numbered list with brief explanations.",
        "action_items": "Identify all action items, decisions, and to-do items in the document. "
                       "Format as a checklist with responsible parties and deadlines where mentioned."
    }
    
    system_prompt = prompts.get(analysis_type, prompts["summary"])
    
    payload = {
        "model": "moonshot-v1-128k",  # Use 2m for very large docs
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Here is the document:\n\n{document_content}"}
        ],
        "temperature": 0.2,
        "max_tokens": 4096
    }
    
    response = requests.post(
        "https://www.haotokai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json=payload
    )
    
    if response.status_code == 200:
        result = response.json()
        return {
            "analysis": result["choices"][0]["message"]["content"],
            "tokens_used": result["usage"]["total_tokens"],
            "document_tokens": result["usage"]["prompt_tokens"]
        }
    else:
        raise Exception(f"API Error: {response.status_code} - {response.text}")

# Example: Analyze a 100-page report
result = analyze_document("annual_report.txt", analysis_type="summary")
print(f"Document tokens: {result['document_tokens']}")
print(f"Total tokens used: {result['tokens_used']}")
print("\n=== Summary ===")
print(result["analysis"])

Streaming Response for Long Documents

When working with long documents, streaming provides a much better user experience:

import os
import requests
import json

def stream_kimi_response(prompt, model="moonshot-v1-128k", api_key=None):
    """
    Stream a response from Kimi for better UX with long documents.
    """
    api_key = api_key or os.getenv("HAOTOKAI_API_KEY")
    
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "Accept": "text/event-stream"
    }
    
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "temperature": 0.3
    }
    
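    response = requests.post(
        "https://www.haotokai.com/v1/chat/completions",
        headers=headers,
        json=payload,
        stream=True,
        timeout=(10, 300)  # (connect, read) seconds; long inputs delay the first chunk
    )
    response.raise_for_status()  # fail early instead of parsing an error body as SSE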
    
    full_response = []
    
    for line in response.iter_lines():
        if line:
            line = line.decode('utf-8')
            if line.startswith('data: '):
                data = line[6:]
                if data == '[DONE]':
                    break
                try:
                    chunk = json.loads(data)
                    if "choices" in chunk and len(chunk["choices"]) > 0:
                        delta = chunk["choices"][0].get("delta", {})
                        content = delta.get("content", "")
                        if content:
                            full_response.append(content)
                            print(content, end="", flush=True)
                except json.JSONDecodeError:
                    pass
    
    return ''.join(full_response)

# Example: Stream a long document analysis
prompt = """I'm going to give you a 50,000 word technical specification.
Please analyze it and provide:
1. Executive summary
2. System architecture overview
3. Key technical decisions
4. Potential risks and issues
5. Recommendations

Document follows:
[50,000 word document goes here]
"""

stream_kimi_response(prompt, model="moonshot-v1-128k")

Function Calling with Kimi

Kimi supports function calling for building agentic applications:

import os
import requests
import json

def search_documents(query):
    """Simulated document search function."""
    return [
        {"title": "Q3 Financial Report", "relevance": 0.95},
        {"title": "Company Strategy 2026", "relevance": 0.87},
        {"title": "Product Roadmap", "relevance": 0.78}
    ]

def kimi_agent(query, api_key=None):
    api_key = api_key or os.getenv("HAOTOKAI_API_KEY")
    
    tools = [
        {
            "type": "function",
            "function": {
                "name": "search_documents",
                "description": "Search the internal document library",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "Search query"
                        }
                    },
                    "required": ["query"]
                }
            }
        }
    ]
    
    payload = {
        "model": "moonshot-v1-128k",
        "messages": [{"role": "user", "content": query}],
        "tools": tools,
        "tool_choice": "auto"
    }
    
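    url = "https://www.haotokai.com/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}

    response = requests.post(url, headers=headers, json=payload, timeout=120)
    response.raise_for_status()
    message = response.json()["choices"][0]["message"]

    if message.get("tool_calls"):
        # Echo the assistant turn, then answer every tool call with a "tool" message
        messages = payload["messages"] + [message]
        for call in message["tool_calls"]:
            args = json.loads(call["function"]["arguments"] or "{}")
            if call["function"]["name"] == "search_documents":
                output = search_documents(**args)
            else:
                output = {"error": "unknown tool"}
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(output)
            })

        # Second request: the model writes its final answer from the tool output
        follow_up = requests.post(
            url,
            headers=headers,
            json={"model": payload["model"], "messages": messages, "tools": tools},
            timeout=120
        )
        follow_up.raise_for_status()
        message = follow_up.json()["choices"][0]["message"]

    # Note: content can be None if the model requests another round of tool calls
    return message["content"]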

Kimi Pricing: Cost of 2M Token Context

Kimi's pricing is competitive, especially considering the massive context windows. Here's the current pricing structure:

Model | Input Cost (USD per 1M tokens) | Output Cost (USD per 1M tokens) | Context Window
moonshot-v1-8k | $0.60 | $1.20 | 8K
moonshot-v1-32k | $1.20 | $2.40 | 32K
moonshot-v1-128k | $2.40 | $4.80 | 128K
moonshot-v1-256k | $3.00 | $6.00 | 256K
moonshot-v1-512k | $3.60 | $7.20 | 512K
moonshot-v1-1m | $4.80 | $9.60 | 1M
moonshot-v1-2m | $6.00 | $12.00 | 2M

💡 Cost Perspective

While Kimi's per-token cost is higher than that of cheaper Chinese models like GLM-4, remember that you're getting capabilities that would otherwise require building and maintaining a complex RAG system. For many document-heavy use cases, Kimi can save money overall by eliminating engineering overhead. A 100-page document (~80K tokens) analyzed with the 128K model costs about $0.19 in input tokens (80,000 × $2.40 per million), plus a few cents for the output, which is often cheaper than the engineering time to set up and maintain RAG for the same task.
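The per-request cost follows directly from the pricing table. A small sketch, with the rates copied from the table above and a hypothetical `estimate_cost` helper:

```python
# USD per million tokens, copied from the pricing table: (input rate, output rate)
PRICING = {
    "moonshot-v1-8k": (0.60, 1.20),
    "moonshot-v1-32k": (1.20, 2.40),
    "moonshot-v1-128k": (2.40, 4.80),
    "moonshot-v1-256k": (3.00, 6.00),
    "moonshot-v1-512k": (3.60, 7.20),
    "moonshot-v1-1m": (4.80, 9.60),
    "moonshot-v1-2m": (6.00, 12.00),
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int = 0) -> float:
    """Return the estimated request cost in USD."""
    input_rate, output_rate = PRICING[model]
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000


# The 100-page example: 80K input tokens on the 128K model, input side only.
print(f"${estimate_cost('moonshot-v1-128k', 80_000):.2f}")  # prints $0.19
```

The `usage` object returned by the API gives the actual `prompt_tokens` and `completion_tokens`, so the same function can be used for after-the-fact accounting.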

Cost Optimization Strategies

Real-World Use Cases for Kimi's Long Context

1. Legal Document Analysis

Law firms and legal teams use Kimi to analyze contracts, case files, and regulatory documents. The 2M token window means entire case files or multi-contract comparisons can be done in a single prompt, with the model able to cross-reference clauses across hundreds of pages.

2. Codebase Understanding & Analysis

Development teams use Kimi to understand large codebases quickly. By feeding in an entire repository's source code, developers can ask questions about architecture, find bugs, generate documentation, or plan refactoring, all with full context of the codebase.

3. Academic Research & Literature Review

Researchers use Kimi to conduct literature reviews by feeding in hundreds of papers. The model can identify patterns, compare methodologies, and synthesize findings across the entire body of research, work that would take a human researcher weeks or months.

4. Financial Report Analysis

Financial analysts use Kimi to analyze annual reports, earnings calls, and market data. The model can extract key metrics, identify trends across multiple quarters, and compare performance across competitors, all from raw document inputs.

5. Book & Long-Form Content Creation

Authors and content creators use Kimi to write and edit long-form content. With a 2M token context, the model can maintain consistent plotlines, character arcs, and world-building details across an entire novel.

6. Enterprise Knowledge Management

Companies use Kimi to build internal knowledge assistants that understand the full context of company policies, procedures, and documentation, without needing to build and maintain complex RAG infrastructure.

Best Practices for Long-Context Applications

1. Structure Your Prompts for Long Documents

When working with very long documents, structure your prompt to guide the model's attention:
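One common pattern, offered here as an assumption rather than Moonshot's documented guidance, is to state the task first, wrap the document in explicit delimiters, and repeat the task after the document so it is not lost behind a very long input. A minimal sketch with a hypothetical `build_long_doc_prompt` helper:

```python
from typing import Optional, Sequence


def build_long_doc_prompt(
    document: str,
    task: str,
    focus: Optional[Sequence[str]] = None,
) -> str:
    """Assemble a prompt: task, optional focus areas, delimited document, task reminder."""
    parts = [f"Task: {task}"]
    if focus:
        parts.append("Focus on these sections: " + ", ".join(focus))
    # Explicit delimiters make it clear where the document starts and ends.
    parts.append("<document>\n" + document.strip() + "\n</document>")
    # Restating the task after a long document keeps the instruction close to the answer.
    parts.append(f"Reminder of the task: {task}")
    return "\n\n".join(parts)
```

The `<document>` tag name is arbitrary; any delimiter that cannot be confused with the document's own content works.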

2. Use Streaming for Better UX

Long documents take longer to process. Streaming responses gives users immediate feedback and makes the application feel faster. Always use streaming for user-facing applications with document inputs.

3. Choose the Right Context Size

Don't automatically use the 2M model for everything. Larger context windows are more expensive and have higher latency. Choose the smallest model that can comfortably fit your input plus the expected output.
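This selection rule is easy to automate. The sketch below picks the smallest variant from the model table whose window fits the prompt, the expected output, and a small safety margin. The window sizes are conservative decimal readings of the "8K" to "2M" labels, which is an assumption; `choose_model` and its `headroom` parameter are hypothetical.

```python
# (model name, context window in tokens), ordered from smallest to largest.
# Sizes are conservative decimal readings of the labels in the model table.
MODEL_WINDOWS = [
    ("moonshot-v1-8k", 8_000),
    ("moonshot-v1-32k", 32_000),
    ("moonshot-v1-128k", 128_000),
    ("moonshot-v1-256k", 256_000),
    ("moonshot-v1-512k", 512_000),
    ("moonshot-v1-1m", 1_000_000),
    ("moonshot-v1-2m", 2_000_000),
]


def choose_model(prompt_tokens: int, max_output_tokens: int = 4096, headroom: float = 0.05) -> str:
    """Return the smallest model whose window fits prompt + output + headroom."""
    needed = (prompt_tokens + max_output_tokens) * (1 + headroom)
    for name, window in MODEL_WINDOWS:
        if needed <= window:
            return name
    raise ValueError(f"Input needs about {int(needed)} tokens, which exceeds the largest window")
```

For example, an 80K-token report with a 4,096-token answer budget resolves to `moonshot-v1-128k`, matching the cost example earlier in this guide.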

4. Implement Error Handling for Large Inputs

Very large inputs can sometimes cause timeouts or other issues. Implement:
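A minimal sketch of one such safeguard is retrying with exponential backoff. It assumes that timeouts, connection errors, and HTTP 408/429/5xx responses are worth retrying, which is a general convention rather than documented Kimi behavior. The helper accepts any zero-argument callable that returns an object with a `status_code` attribute, so it can wrap a `requests.post` call via a lambda; the function names are hypothetical.

```python
import random
import time

# Status codes treated as transient (an assumption, not taken from the Kimi docs).
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2^attempt seconds, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def call_with_retries(send, max_attempts: int = 4, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable response or attempts run out."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            response = send()
            if response.status_code not in RETRYABLE_STATUS:
                return response
            last_error = RuntimeError(f"retryable HTTP status {response.status_code}")
        except OSError as exc:
            # requests' exceptions derive from OSError, as do the built-in
            # TimeoutError and ConnectionError.
            last_error = exc
        if attempt < max_attempts - 1:
            # Random jitter avoids many clients retrying at the same instant.
            sleep(backoff_delay(attempt) + random.uniform(0, 1))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Usage with requests (url, headers, and payload as in the earlier examples):
#   response = call_with_retries(
#       lambda: requests.post(url, headers=headers, json=payload, timeout=(10, 300))
#   )
```

Non-retryable errors such as 400 or 401 are returned immediately, so the caller still needs to check the final status code.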

5. Consider Hybrid Approaches for Very Large Datasets

While Kimi's 2M window is massive, some use cases involve even larger datasets. For these cases, consider a hybrid approach: use Kimi for document-level analysis and combine with vector search for dataset-level retrieval.
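One way to picture the hybrid approach: a retrieval step ranks whole documents, and the highest-ranked ones are packed into the context window until the token budget is used up. The sketch below covers only the packing step. It assumes relevance scores and token counts are already available from a vector search and a token estimate (neither is shown), and `select_documents` is a hypothetical helper.

```python
from typing import List, Tuple


def select_documents(
    candidates: List[Tuple[str, float, int]],
    budget_tokens: int,
) -> Tuple[List[str], int]:
    """Greedily pick whole documents by relevance until the token budget is full.

    Each candidate is (doc_id, relevance_score, token_count).
    Returns the chosen ids in relevance order and the total tokens used.
    """
    chosen: List[str] = []
    used = 0
    for doc_id, _score, tokens in sorted(candidates, key=lambda c: c[1], reverse=True):
        # Skip documents that do not fit; smaller, lower-ranked ones may still fit.
        if used + tokens <= budget_tokens:
            chosen.append(doc_id)
            used += tokens
    return chosen, used
```

Leaving part of the budget unallocated (for example, passing 1.9M rather than 2M) keeps room for the instructions and the model's response.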

Kimi vs Other Long-Context Models

How does Kimi compare to other long-context models on the market?

Model | Provider | Max Context | Input Cost (USD per 1M tokens) | Best For
Kimi (Moonshot) | Moonshot AI | 2M tokens | $6.00 | Maximum context, document analysis
GLM-4 Long | Zhipu AI | 1M tokens | $0.005 | Cost-effective long context
Claude 3.5 Sonnet | Anthropic | 200K tokens | $3.00 | Balanced quality & context
GPT-4o | OpenAI | 128K tokens | $5.00 | General purpose, best quality
Qwen 2.5 72B | Alibaba | 128K tokens | $0.50 | Affordable Chinese model

Kimi stands out for having the largest context window among the models compared here. While it's not the cheapest per-token option, the 2M context enables use cases that smaller windows cannot handle in a single prompt. For document-heavy applications where context depth is critical, Kimi is a strong choice.

For developers who want to experiment with multiple models, Haotokai provides access to all of these models through a single API, making it easy to compare and choose the best model for each use case.

Access Kimi Easily Through Haotokai

While Moonshot AI's direct API is powerful, international developers face significant barriers to access. Haotokai solves these problems by providing a unified gateway to Kimi and other top Chinese AI models.

Why Haotokai is the Best Way to Use Kimi

✅ PayPal & International Payments

Skip the hassle of Chinese payment methods. Haotokai supports PayPal, Visa, Mastercard, and other international payment options. Top up your balance in minutes and start building immediately.

✅ One API Key, 50+ Models

Access Kimi alongside GLM-4, DeepSeek, Qwen, Claude, Llama, and more, all with one API key. Mix and match models for different use cases without managing multiple accounts and billing.

✅ OpenAI-Compatible API

Haotokai uses the standard OpenAI API format. If you already have code that works with OpenAI, you can switch to Kimi by pointing the client at Haotokai's base URL, supplying your Haotokai API key, and changing the model name. No SDK changes or API rewrites needed.

✅ English Documentation & Support

Get full English documentation, API references, and customer support. No more struggling with machine-translated docs or language barriers when you need help.

✅ Competitive Pricing & Volume Discounts

Get Kimi at competitive rates with volume discounts for high-usage customers. New users get free credits to try the service risk-free.

Getting Started with Haotokai & Kimi

  1. Visit haotokai.com and create an account
  2. Top up your balance using PayPal or your preferred payment method
  3. Copy your API key from the dashboard
  4. Start building with Kimi's long-context capabilities!

Ready to Build with Kimi's 2M Token Context?

Get started with Haotokai today: access Kimi and 50+ other AI models with one API key. Pay with PayPal, no credit card required, and unlock the power of ultra-long context for your applications.

Conclusion

Moonshot AI's Kimi represents a paradigm shift in what's possible with large language models. Its 2 million token context window enables applications that were impossible just a few years ago, from entire codebase analysis to multi-book research to comprehensive legal document review.

While Kimi's per-token pricing is higher than smaller Chinese models, the productivity gains from eliminating RAG complexity and enabling new use cases often justify the cost. For document-heavy applications, the long context window isn't just a feature; it's a competitive advantage.

For international developers, the easiest way to access Kimi is through Haotokai. With PayPal support, English documentation, and one API key for all models, Haotokai removes the friction of working with Chinese AI services while giving you access to one of the longest-context commercial models available.

As context windows continue to grow, we'll see entirely new categories of applications emerge. Kimi is leading that charge today, and developers who master long-context development will be well-positioned to build the next generation of AI-powered applications.
