Cost-optimized LLM execution via sequential model escalation with confidence-based routing.
Overview
Multi-model cascading allows an LLM task to try a fast, cheap model first and only escalate to a more expensive model if the confidence in the response is below a threshold. This dramatically reduces costs for queries that simpler models can handle well.
Flow: User query → Model A (fast/cheap) → Confidence check → if low → Model B (powerful) → Confidence check → ...
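The control flow above can be sketched as a simple loop (illustrative Python, not the actual implementation; `call_model` and `evaluate_confidence` are hypothetical stand-ins for the provider call and the configured evaluation strategy):

```python
def run_cascade(query, steps, call_model, evaluate_confidence):
    """Try each step in order; accept the first response whose
    confidence meets that step's threshold."""
    best = None  # best (confidence, response) seen so far
    for step in steps:
        response = call_model(step, query)
        confidence = evaluate_confidence(response)
        threshold = step.get("confidenceThreshold")
        if threshold is None or confidence >= threshold:
            return response  # confident enough -- stop escalating
        if best is None or confidence > best[0]:
            best = (confidence, response)
    # Every step fell below its threshold: return the best response seen.
    return best[1]
```

A step with `confidenceThreshold: null` (typically the last one) accepts its response unconditionally, so the loop always terminates with an answer.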
Configuration
Cascading is configured per-task in a `langchain.json` resource:

```json
{
  "tasks": [
    {
      "id": "cascade-task",
      "type": "openai",
      "actions": ["*"],
      "parameters": {
        "systemMessage": "You are a helpful assistant.",
        "apiKey": "${vault:openai-key}",
        "logSizeLimit": "10"
      },
      "modelCascade": {
        "enabled": true,
        "strategy": "cascade",
        "evaluationStrategy": "structured_output",
        "enableInAgentMode": true,
        "steps": [
          {
            "type": "openai",
            "parameters": { "model": "gpt-4o-mini" },
            "confidenceThreshold": 0.7,
            "timeoutMs": 10000
          },
          {
            "type": "openai",
            "parameters": { "model": "gpt-4o" },
            "confidenceThreshold": null,
            "timeoutMs": 30000
          }
        ]
      }
    }
  ]
}
```
Configuration Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Master toggle for cascading |
| `strategy` | string | `"cascade"` | Execution strategy. `cascade` = sequential; `parallel` is reserved for future use. |
| `evaluationStrategy` | string | `"structured_output"` | How confidence is evaluated (see below) |
| `enableInAgentMode` | boolean | `true` | Whether the cascade activates when tools/agents are configured |
| `steps` | array | — | Ordered list of cascade steps |
Step Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| `type` | string | — | Provider type (e.g., `openai`, `anthropic`, `ollama`) |
| `parameters` | object | `{}` | Provider-specific params. Merged with base task parameters. |
| `confidenceThreshold` | number \| null | — | Minimum confidence to accept this step's response; `null` accepts unconditionally |
| `timeoutMs` | number | — | Per-step timeout in milliseconds |
Note: Step parameters are merged with the base task parameters. Steps only need to specify overrides (e.g., a different model). Shared params like apiKey or systemMessage are inherited.
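The merge behavior described in the note amounts to a shallow merge in which step values win (illustrative Python, not the actual implementation):

```python
def effective_parameters(base_params, step_params):
    """Step parameters override base task parameters; everything
    else is inherited unchanged."""
    return {**base_params, **step_params}

base = {
    "apiKey": "${vault:openai-key}",
    "systemMessage": "You are a helpful assistant.",
    "model": "gpt-4o-mini",
}
step = {"model": "gpt-4o"}  # the step only overrides the model
merged = effective_parameters(base, step)
```

Here `merged` keeps the inherited `apiKey` and `systemMessage` while using the step's `model`.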
Confidence Evaluation Strategies
structured_output (default)
Appends a JSON format instruction to the system prompt, asking the model to respond with:
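Roughly the following wrapper object (field names inferred from the evaluator description; the exact wording of the appended instruction may differ):

```json
{
  "response": "The actual answer to the user's question.",
  "confidence": 0.85
}
```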
The evaluator parses the JSON, extracts the confidence score, and uses the response field as the actual output. Falls back to heuristic if JSON parsing fails.
heuristic
Analyzes the response text for signals of uncertainty:
| Signal | Confidence |
|---|---|
| Empty/null response | 0.0 |
| Very short (< 20 chars) | 0.3 |
| Refusal phrases ("I cannot", "I'm sorry but") | 0.2 |
| Hedging language ("I'm not sure", "might be") | 0.4 |
| Confident, full response | 0.8 |
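The signal table maps naturally onto a first-match rule list (illustrative Python; a real implementation would use longer phrase lists):

```python
REFUSAL_PHRASES = ("i cannot", "i'm sorry but")
HEDGING_PHRASES = ("i'm not sure", "might be")

def heuristic_confidence(response):
    """Score a response by checking uncertainty signals in priority order."""
    if not response:
        return 0.0   # empty/null response
    text = response.lower()
    if len(response) < 20:
        return 0.3   # very short
    if any(p in text for p in REFUSAL_PHRASES):
        return 0.2   # refusal
    if any(p in text for p in HEDGING_PHRASES):
        return 0.4   # hedging
    return 0.8       # confident, full response
```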
judge_model
Sends the response to a secondary LLM (the "judge") which evaluates the quality and assigns a confidence score. Falls back to heuristic if no judge model is configured. (Requires additional configuration — see advanced usage.)
none
Always returns 1.0 — effectively disables confidence gating. The first step's response is always accepted. Useful for A/B testing or when you want cascading for timeout/error recovery only.
Error Handling
The cascade handles errors gracefully:
| Error Type | Behavior |
|---|---|
| Rate limited (429) | Auto-escalate to next step |
| Server error (503) | Auto-escalate to next step |
| Timeout | Auto-escalate to next step, log warning |
| Other errors | Log warning, escalate to next step |
| All steps fail | Throw `LifecycleException` with full trace |
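In every row except the last, the behavior reduces to "record the error and try the next step" (illustrative Python; `CascadeError` is a hypothetical stand-in for `LifecycleException`):

```python
class CascadeError(Exception):
    """Raised when every step fails -- stand-in for LifecycleException."""

def run_with_escalation(steps, attempt):
    """Run steps in order; on any error, record it and escalate to the
    next step. Raise only if no step produced a response."""
    errors = []
    for i, step in enumerate(steps):
        try:
            return attempt(step)
        except Exception as e:  # 429 / 503 / timeout / other
            errors.append((i, repr(e)))  # kept for the failure trace
    raise CascadeError(f"all steps failed: {errors}")
```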
The cascade also tracks the "best response" seen so far. If a later step fails but an earlier step produced a usable response, that response is returned rather than throwing an error.
SSE Events
Two new SSE event types provide real-time visibility into cascade execution:
| Event | Fields | When |
|---|---|---|
| `cascade_step_start` | `stepIndex`, `modelType` | Before each step begins |
| `cascade_escalation` | `fromStep`, `toStep`, `confidence`, `reason` | When escalating to next step |
These events are emitted through the existing ConversationEventSink interface and are available on the /agents/{conversationId}/stream SSE endpoint.
Audit Trail
When the audit collector is active, the cascade writes:
| Memory Key | Content |
|---|---|
| `audit:cascade_model` | The model type + params of the step that produced the final response |
| `audit:cascade_confidence` | The confidence score of the accepted response |
| `cascade:trace` | Full execution trace (all steps attempted, timing, errors) |
Agent Mode
When enableInAgentMode is true (default), the cascade also works when tools (built-in, MCP, HTTP calls, A2A) are configured. Each cascade step can independently invoke the tool-calling loop via AgentOrchestrator.
When enableInAgentMode is false, cascading is skipped in agent mode and the standard single-model execution path is used.
Backward Compatibility
- Existing configurations without `modelCascade` work exactly as before — no changes needed
- Setting `enabled: false` (default) keeps standard execution
- The cascade is entirely within `LlmTask` — no engine pipeline changes
Example: Cost Optimization
A typical 3-tier cascade for a customer support agent:
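A sketch of the `modelCascade` block for such a setup (model identifiers and timeouts are illustrative; the step shape follows the Configuration example above):

```json
"modelCascade": {
  "enabled": true,
  "strategy": "cascade",
  "evaluationStrategy": "structured_output",
  "steps": [
    {
      "type": "ollama",
      "parameters": { "model": "llama3.1" },
      "confidenceThreshold": 0.7,
      "timeoutMs": 10000
    },
    {
      "type": "openai",
      "parameters": { "model": "gpt-4o-mini" },
      "confidenceThreshold": 0.7,
      "timeoutMs": 15000
    },
    {
      "type": "anthropic",
      "parameters": { "model": "claude-sonnet" },
      "confidenceThreshold": null,
      "timeoutMs": 30000
    }
  ]
}
```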
This routes simple FAQs to a local Ollama model (free), medium queries to GPT-4o-mini (~$0.15/1M tokens), and only complex queries to Claude Sonnet ($3/1M tokens).