Multi-Model Orchestration: When One LLM Is Not Enough
Routers, cascades, ensembles, and specialized chains for multi-model systems.
Introduction
Complex AI systems can’t rely on a single model. You need:
- GPT-4 for complex reasoning
- Claude for long documents
- Local models for sensitive data
- Specialized models for domain tasks
The challenge: orchestrating multiple models coherently.
When Single-Model Breaks
Symptoms:
- High costs (using expensive model for simple tasks)
- Slow responses (overpowered model for easy queries)
- Quality issues (wrong model for task type)
- Compliance problems (sensitive data to external API)
Orchestration Patterns
Pattern 1: Complexity-Based Routing
Route by query complexity:
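```typescript
// Assumed helpers (not defined in this article); each should return a score in [0, 1].
declare function countTechnicalTerms(query: string): number;
declare function measureAmbiguity(query: string): number;

class ComplexityRouter {
  route(query: string): string {
    const complexity = this.classify(query);
    if (complexity < 0.3) {
      return "claude-haiku"; // $1/1M tokens
    } else if (complexity < 0.7) {
      return "gpt-3.5"; // $2/1M tokens
    } else {
      return "gpt-4"; // $30/1M tokens
    }
  }

  // Returns a complexity score in [0, 1]: the mean of the normalized factors.
  classify(query: string): number {
    // Simple heuristics or ML model
    const factors = {
      length: Math.min(query.length / 1000, 1), // capped so long queries stay in range
      technical_terms: countTechnicalTerms(query),
      ambiguity: measureAmbiguity(query)
    };
    const values = Object.values(factors);
    return values.reduce((a, b) => a + b, 0) / values.length;
  }
}
```
Result: 40-60% cost reduction with the same quality.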
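The `classify` method above depends on two helpers, `countTechnicalTerms` and `measureAmbiguity`, that this article does not define. One minimal way to fill them in is sketched below, assuming each should return a score between 0 and 1. The word lists and saturation points are hypothetical placeholders rather than tuned values; a production router would more likely use a small trained classifier.

```typescript
// Hypothetical vocabulary; replace with terms from your own domain.
const TECHNICAL_TERMS = ["api", "latency", "schema", "kubernetes", "regression", "tokenizer"];
// Hypothetical markers of vague or open-ended phrasing.
const AMBIGUITY_MARKERS = ["maybe", "something", "somehow", "stuff", "etc", "whatever"];

// Lowercase the query and split it into alphanumeric words.
function tokenize(query: string): string[] {
  return query.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Number of words found in the vocabulary, divided by `cap` and clamped to 1.
function hitScore(query: string, vocabulary: string[], cap: number): number {
  const vocab = new Set(vocabulary);
  const hits = tokenize(query).filter((word) => vocab.has(word)).length;
  return Math.min(hits / cap, 1);
}

// Score in [0, 1]; three or more technical terms count as fully technical.
function countTechnicalTerms(query: string): number {
  return hitScore(query, TECHNICAL_TERMS, 3);
}

// Score in [0, 1]; two or more vague markers count as fully ambiguous.
function measureAmbiguity(query: string): number {
  return hitScore(query, AMBIGUITY_MARKERS, 2);
}
```

Keeping every factor in the same 0 to 1 range matters because the router compares the averaged score against fixed thresholds (0.3 and 0.7); an unbounded factor would dominate the mean.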
Pattern 2: Task-Specific Routing
Route by task type:
```typescript
// Assumed helper: sends the prompt to the named model and resolves with its text reply.
declare function callModel(model: string, prompt: string): Promise<string>;

const taskModelMap: Record<string, string> = {
  summarization: "claude-3-sonnet", // Best at conciseness
  code_generation: "gpt-4", // Best at code
  translation: "gpt-3.5", // Good enough, cheap
  analysis: "gpt-4", // Needs reasoning
  chat: "claude-3-opus" // Best conversation
};

function routeByTask(query: string, taskType: string): Promise<string> {
  // Unknown task types fall back to the cheap default
  const model = taskModelMap[taskType] ?? "gpt-3.5";
  return callModel(model, query);
}
```
Pattern 3: Cascade with Fallback
Try cheap model first, escalate if needed:
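```typescript
class CascadeOrchestrator {
  constructor(
    // Both dependencies are injected; their implementations are outside this snippet.
    private callModel: (model: string, query: string) => Promise<string>,
    private qualityCheck: (response: string) => number // score in [0, 1]
  ) {}

  async execute(query: string): Promise<string> {
    // Try cheap model
    const response = await this.callModel("gpt-3.5", query);
    // Check quality
    if (this.qualityCheck(response) > 0.8) {
      return response;
    }
    // Escalate to better model
    return this.callModel("gpt-4", query);
  }
}
```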
Trade-off: Latency (+200ms) vs Cost (-30%)
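The cascade depends on a `qualityCheck` function that the snippet above does not define. A minimal heuristic sketch follows, assuming the score lies in [0, 1] so it can be compared against the 0.8 threshold. The refusal phrases, the length cutoff, and the penalty weights are hypothetical placeholders; in practice a model-based grader or task-specific validators may be a better fit.

```typescript
// Hypothetical phrases that suggest the model declined or could not answer.
const REFUSAL_MARKERS = ["i'm not sure", "i cannot", "i can't", "as an ai"];

// Returns a quality score in [0, 1]; higher means the cheap answer is more likely usable.
function qualityCheck(response: string): number {
  const text = response.trim().toLowerCase();
  if (text.length === 0) {
    return 0; // an empty reply is never acceptable
  }
  let score = 1;
  // Penalize refusals and hedged non-answers.
  if (REFUSAL_MARKERS.some((marker) => text.includes(marker))) {
    score -= 0.5;
  }
  // Penalize very short replies, which are often truncated or evasive.
  if (text.length < 20) {
    score -= 0.3;
  }
  return Math.max(score, 0);
}
```

With these weights, any single penalty drops the score below 0.8 and triggers escalation, so the check errs on the side of calling the stronger model.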
Pattern 4: Parallel Ensemble
Call multiple models, choose best:
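```typescript
// Assumed helpers, defined elsewhere.
declare function callModel(model: string, prompt: string): Promise<string>;
declare function selectBestResponse(responses: string[]): string;

async function ensemble(query: string): Promise<string> {
  // Query all models concurrently; Promise.all rejects if any single call fails
  const responses = await Promise.all([
    callModel("gpt-4", query),
    callModel("claude", query),
    callModel("gemini", query)
  ]);
  // Vote or select best
  return selectBestResponse(responses);
}
```
Use case: High-stakes decisions (legal, medical, financial)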
Cost: 3x, but much higher confidence
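The ensemble leaves `selectBestResponse` undefined. For short, categorical answers (a yes/no decision or a label), one simple option is majority voting over normalized responses, sketched below. This is an assumption about how selection could work, not a prescribed implementation; free-form answers usually need a judge model or a similarity measure instead.

```typescript
// Normalize so that differences in case, spacing, or trailing punctuation do not split votes.
function normalize(response: string): string {
  return response.trim().toLowerCase().replace(/\s+/g, " ").replace(/[.!?]+$/, "");
}

// Majority vote: returns the original text of the first response whose
// normalized form received the most votes. Ties go to the earliest response.
function selectBestResponse(responses: string[]): string {
  if (responses.length === 0) {
    throw new Error("selectBestResponse needs at least one response");
  }
  // Count votes per normalized answer.
  const votes = new Map<string, number>();
  for (const response of responses) {
    const key = normalize(response);
    votes.set(key, (votes.get(key) ?? 0) + 1);
  }
  // Pick the earliest response with the highest vote count.
  let best = responses[0];
  let bestVotes = 0;
  for (const response of responses) {
    const count = votes.get(normalize(response)) ?? 0;
    if (count > bestVotes) {
      best = response;
      bestVotes = count;
    }
  }
  return best;
}
```

With three models, two agreeing answers win; if all three disagree, this sketch falls back to the first response, which is a signal that the query may need human review.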
Pattern 5: Specialized Model Chains
Chain models for multi-step tasks:
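```typescript
// Assumed model clients (defined elsewhere), each exposing one task-specific method.
declare const ocrModel: { extract(document: unknown): Promise<string> };
declare const claude: { summarize(text: string): Promise<string> };
declare const sentimentModel: { analyze(text: string): Promise<number> };
declare const gpt4: { generateReport(summary: string, sentiment: number): Promise<string> };

async function documentAnalysisChain(document: unknown): Promise<string> {
  // Step 1: Extract with OCR model
  const text = await ocrModel.extract(document);
  // Step 2: Summarize with Claude (long context)
  const summary = await claude.summarize(text);
  // Step 3: Analyze sentiment with specialized model
  const sentiment = await sentimentModel.analyze(summary);
  // Step 4: Generate report with GPT-4
  return gpt4.generateReport(summary, sentiment);
}
```
Maintaining Consistency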
Challenge: Different models have different output formats.
Solution - Output Schema Enforcement:
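```typescript
import { z } from "zod";

const AnalysisOutputSchema = z.object({
  summary: z.string(),
  sentiment: z.number().min(-1).max(1),
  key_points: z.array(z.string()),
  confidence: z.number()
});
type AnalysisOutput = z.infer<typeof AnalysisOutputSchema>;

// Assumed helper: re-prompts the model with the schema spelled out and returns its parsed reply.
declare function callWithSchema(model: string, schema: z.ZodTypeAny): Promise<unknown>;

async function enforceSchema<T extends z.ZodTypeAny>(
  model: string,
  modelResponse: unknown,
  schema: T
): Promise<z.infer<T>> {
  const result = schema.safeParse(modelResponse);
  if (result.success) {
    return result.data;
  }
  // Retry with schema enforcement in prompt, then validate again;
  // parse() throws if the retry is still malformed
  return schema.parse(await callWithSchema(model, schema));
}
```
Debugging Multi-Model Systems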
Challenges:
- Which model produced this output?
- Why did the router choose this model?
- How can an issue be reproduced?
Solution - Comprehensive Tracing:
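```typescript
// Assumed helper, defined elsewhere; returns a unique ID per request.
declare function generateTraceId(): string;

interface RoutingDecision { model: string; reason: string; confidence: number; }
interface ModelResult { text: string; tokens: number; cost: number; }

class TracedOrchestrator {
  constructor(
    // Injected dependencies; their implementations are outside this snippet.
    private route: (query: string) => RoutingDecision,
    private callModel: (model: string, query: string, opts: { traceId: string }) => Promise<ModelResult>,
    private log: (entry: Record<string, unknown>) => void
  ) {}

  async execute(query: string): Promise<string> {
    const traceId = generateTraceId();
    const decision = this.route(query);
    // Log routing decision (model, reason, confidence)
    this.log({
      trace_id: traceId,
      query: query,
      routing_decision: decision
    });
    // Execute with tracing
    const start = Date.now();
    const result = await this.callModel(decision.model, query, { traceId });
    // Log response
    this.log({
      trace_id: traceId,
      model: decision.model,
      latency: Date.now() - start, // milliseconds
      tokens: result.tokens,
      cost: result.cost
    });
    return result.text;
  }
}
```
Case Study: Document Intelligence Platform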
Before: Single GPT-4 for everything
- Cost: $45K/month
- Latency: 3.2s average
After: Multi-model orchestration
- Cost: $12K/month (73% reduction)
- Latency: 1.8s average (44% improvement)
Setup:
- Simple queries (60%): GPT-3.5
- Complex analysis (30%): GPT-4
- Long documents (10%): Claude (200K context)
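The percentages in the case study follow directly from the before and after figures, and the same arithmetic is handy for sanity-checking a routing mix of your own. The sketch below recomputes the reductions and shows how a blended per-query cost could be estimated from a traffic split. The `Route` shape and both function names are illustrative assumptions, and any per-query costs you feed in are your own estimates; the case study does not report them.

```typescript
// Percentage reduction from `before` to `after`, rounded to the nearest whole percent.
function percentReduction(before: number, after: number): number {
  return Math.round(((before - after) / before) * 100);
}

interface Route {
  share: number;        // fraction of traffic sent to this model, 0..1
  costPerQuery: number; // average cost of one query on this model
}

// Traffic-weighted average cost per query across all routes.
function blendedCost(routes: Route[]): number {
  const totalShare = routes.reduce((sum, r) => sum + r.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error("traffic shares must sum to 1");
  }
  return routes.reduce((sum, r) => sum + r.share * r.costPerQuery, 0);
}
```

Here `percentReduction(45000, 12000)` returns 73 and `percentReduction(3.2, 1.8)` returns 44, matching the cost and latency figures above.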
Conclusion
Multi-model orchestration is essential at scale:
- Route by complexity: 40-60% cost savings
- Route by task: Optimize for quality per task
- Cascade: Try cheap, escalate if needed
- Ensemble: Multiple models for high stakes
- Chain: Specialized models for multi-step tasks
Start simple (complexity routing). Add sophistication as needed.

