Automatic Model Routing / Smart Model Chain

Problem

Alma currently supports a single global default model (chat.defaultModel), with manual per-session switching as the only alternative. For power users with access to multiple models across different providers, this creates constant friction:

  • Wasted money — an expensive model (e.g., Claude Opus) handles simple questions

  • Degraded quality — a cheap model (e.g., Flash) handles complex coding or reasoning tasks

  • Context switching — manually picking the right model for every message breaks the flow

Alma already has the concept of using different models for different purposes — toolModel uses a lightweight model for tool calls. This feature request proposes extending that idea to the entire chat experience.

Proposed Solution

Add an Automatic Model Routing system that classifies incoming messages and routes them to the optimal model based on task complexity and type.

Configuration Example

{
  "chat": {
    "modelRouting": {
      "enabled": true,
      "classifier": "gemini-3-flash",
      "rules": [
        { "complexity": "simple",  "model": "gemini-3-flash" },
        { "complexity": "medium",  "model": "gemini-3.1-pro" },
        { "complexity": "hard",    "model": "claude-opus-4-6-thinking" },
        { "category": "code",      "model": "claude-opus-4-6-thinking" },
        { "category": "multimodal","model": "gemini-3.1-pro" }
      ],
      "fallback": "gemini-3-flash",
      "showRoutingBadge": true
    }
  }
}
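For clarity, the config above could be typed roughly as follows. This is a hypothetical TypeScript sketch of the schema; none of these type names exist in Alma today.

```typescript
// Hypothetical schema for the proposed chat.modelRouting config.
// Field names mirror the JSON example above.

type Complexity = "simple" | "medium" | "hard";
type Category =
  | "chat" | "code" | "translation" | "multimodal"
  | "research" | "math" | "creative";

interface RoutingRule {
  complexity?: Complexity; // matched against the classifier's complexity verdict
  category?: Category;     // matched against the classifier's task category
  model: string;           // model ID to use when this rule matches
}

interface ModelRoutingConfig {
  enabled: boolean;
  classifier: string;        // fast, cheap model used only for classification
  rules: RoutingRule[];
  fallback: string;          // used when no rule matches or the chosen model fails
  showRoutingBadge: boolean; // surface which model was picked, and why
}
```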

How It Works

  1. User sends a message.

  2. The classifier model (fast and cheap, e.g., Gemini Flash) analyzes the message and determines complexity level (simple/medium/hard) and task category (chat/code/translation/multimodal/research/math/creative).

  3. Based on the matching rule, Alma routes the message to the appropriate model.

  4. The response streams as usual — the user just sees the answer.
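The classify-then-route steps above can be sketched as a runnable toy. This is not Alma code: the classifier is stubbed with a keyword heuristic standing in for the configured Flash-tier model, and every function name here is hypothetical.

```typescript
type Verdict = { complexity: "simple" | "medium" | "hard"; category: string };

// Step 2 (stub): classify the message. A real implementation would ask the
// configured classifier model for this tiny structured verdict.
function classify(message: string): Verdict {
  const complexity = message.length > 200 ? "hard" : "simple";
  const category = /```|function|class /.test(message) ? "code" : "chat";
  return { complexity, category };
}

type Rule = { complexity?: string; category?: string; model: string };

// Step 3: pick the model for the verdict; category rules take precedence
// over complexity rules, and unmatched messages fall back.
function pickModel(verdict: Verdict, rules: Rule[], fallback: string): string {
  const byCategory = rules.find(r => r.category === verdict.category);
  const byComplexity = rules.find(r => r.complexity === verdict.complexity);
  return (byCategory ?? byComplexity)?.model ?? fallback;
}

const rules: Rule[] = [
  { complexity: "simple", model: "gemini-3-flash" },
  { complexity: "hard", model: "claude-opus-4-6-thinking" },
  { category: "code", model: "claude-opus-4-6-thinking" },
];

// Step 4: a code-looking message routes to the code model regardless of length.
console.log(pickModel(classify("function add(a, b) { return a + b; }"), rules, "gemini-3-flash"));
// → claude-opus-4-6-thinking
```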

Key Design Points

  • Classifier overhead: The classifier should be fast and cheap (Flash-tier). Classification is ~100 tokens in, ~10 tokens out, adding less than 200ms latency.

  • Rule priority: Category rules take precedence over complexity rules.

  • Manual override: If the user explicitly selects a model for a session, routing is bypassed.

  • Routing badge: A small UI indicator showing which model was selected and why.

  • Fallback chain: If the primary model fails (rate limit, timeout), try the next model.

  • USER.md integration: Optionally read default routing preferences from USER.md.
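The manual-override and fallback-chain points can be sketched together. `callModel` is a hypothetical stand-in for an actual provider call; nothing here is an existing Alma API.

```typescript
// Send a message, honoring a manual per-session model choice and walking a
// fallback chain when the routed model fails (rate limit, timeout, etc.).
async function sendWithFallback(
  message: string,
  sessionModel: string | undefined, // set when the user manually picked a model
  routedModel: string,
  fallbacks: string[],
  callModel: (model: string, message: string) => Promise<string>,
): Promise<string> {
  // Manual override: an explicit per-session choice bypasses routing entirely.
  if (sessionModel) return callModel(sessionModel, message);

  // Fallback chain: on failure, try the next model in order.
  const chain = [routedModel, ...fallbacks];
  let lastError: unknown;
  for (const model of chain) {
    try {
      return await callModel(model, message);
    } catch (err) {
      lastError = err; // e.g. 429 or timeout; fall through to the next model
    }
  }
  throw lastError;
}
```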

Prior Art

  • OpenRouter's auto model — automatically picks the best model, but is not user-configurable.

  • Alma's toolModel — already routes tool calls to a separate model, proving the multi-model pattern works.

Benefits

  1. Cost optimization — Simple queries use cheap models; complex ones get the power they need.

  2. Quality optimization — Each task type gets the best-suited model.

  3. Zero friction — Users do not need to think about which model to use.

  4. Customizable — Power users can fine-tune rules; casual users use sensible defaults.

  5. Builds on existing architecture — toolModel already proves Alma can handle multi-model flows.

Status: In Review
Board: 💡 Feature Request
Date: About 5 hours ago
Author: Will Wang
