Hi,
I'd love to suggest a feature that could significantly reduce API costs while maintaining response quality: **intelligent query routing**.
**The idea:**
Before sending a request to the main LLM, a lightweight local model (e.g., a small classifier or a tiny model like Phi-3-mini / Qwen2.5-0.5B running via Ollama) evaluates the complexity of the user's query and routes it to the appropriate model:
- **Simple** → fast, cheap model (e.g., Claude Haiku, GPT-4o-mini)
- **Medium** → balanced model (e.g., Claude Sonnet, GPT-4o)
- **Complex** → powerful model (e.g., Claude Opus, o1)
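To make the tiering concrete, here is a rough sketch of the routing step. Everything in it is an assumption for illustration: the model identifiers are placeholders, and `classify` is a crude keyword-and-length heuristic standing in for the local model (Phi-3-mini, Qwen2.5-0.5B, or a trained classifier), not a recommendation for how to score complexity.

```python
from enum import Enum


class Tier(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"


# Hypothetical model identifiers; substitute whatever the app actually exposes.
TIER_TO_MODEL = {
    Tier.SIMPLE: "cheap-fast-model",
    Tier.MEDIUM: "balanced-model",
    Tier.COMPLEX: "powerful-model",
}

# Placeholder signals only; a real router would ask the local model instead.
COMPLEX_HINTS = ("prove", "refactor", "architecture", "step by step", "debug")


def classify(query: str) -> Tier:
    """Stand-in heuristic for the local classifier (an assumption, not a real model)."""
    text = query.lower()
    word_count = len(text.split())
    if word_count > 150 or any(hint in text for hint in COMPLEX_HINTS):
        return Tier.COMPLEX
    if word_count > 25:
        return Tier.MEDIUM
    return Tier.SIMPLE


def route(query: str) -> str:
    """Return the model name configured for the query's tier."""
    return TIER_TO_MODEL[classify(query)]
```

For example, `route("What is the capital of France?")` picks the simple-tier model, while a query containing "debug" goes to the complex tier. Swapping the heuristic for a call to a local model would only change the body of `classify`.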
**Why this matters:**
As a rough estimate, 60–70% of everyday queries are of simple or medium complexity. Routing those to cheaper models could cut API costs by 40–60% with little to no quality loss, though the actual savings depend on the query mix and on model pricing. There's even an open-source framework for this, [RouteLLM from LMSYS](https://github.com/lm-sys/RouteLLM), which supports the viability of this approach.
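A quick back-of-the-envelope calculation shows how a saving in that range could arise. The traffic shares and relative prices below are made-up illustrative numbers, not measured data or real vendor pricing, and the baseline assumes every query currently goes to the most expensive model. The local classifier is treated as free in API terms.

```python
def blended_cost(shares: dict, relative_prices: dict) -> float:
    """Average cost per query, relative to sending everything to the top model."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(shares[tier] * relative_prices[tier] for tier in shares)


# Assumed query mix: 65% of traffic is simple or medium.
shares = {"simple": 0.35, "medium": 0.30, "complex": 0.35}

# Assumed per-query price of each tier relative to the complex-tier model.
relative_prices = {"simple": 0.10, "medium": 0.40, "complex": 1.00}

baseline = 1.0  # every query sent to the complex-tier model
routed = blended_cost(shares, relative_prices)
savings = 1 - routed / baseline
```

With these assumptions the routed cost comes out at about half of the baseline. The result is sensitive to both inputs: a mix that is heavier on complex queries, or a smaller price gap between tiers, shrinks the saving.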
**Suggested implementation:**
1. A local routing layer that classifies each query before it's sent out
2. Three configurable tiers (Simple / Medium / Complex), each mapped to a user-selected model
3. An optional override — users can manually force a specific model for a request
4. A routing log or indicator showing which model was used and why
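The four points above could fit together roughly like this. It is only a sketch under stated assumptions: `RoutingDecision`, `decide`, and `format_indicator` are hypothetical names, the classifier is passed in as a plain callable so any local model could sit behind it, and the tier-to-model mapping is whatever the user has configured.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class RoutingDecision:
    model: str
    tier: Optional[str]  # None when the user forced a model
    reason: str


def decide(
    query: str,
    classify: Callable[[str], str],
    tier_models: Dict[str, str],
    forced_model: Optional[str] = None,
) -> RoutingDecision:
    """Pick a model for one request, honoring a manual override if present."""
    if forced_model is not None:
        # Point 3: the override skips classification entirely.
        return RoutingDecision(forced_model, None, "manual override")
    tier = classify(query)      # point 1: local classification
    model = tier_models[tier]   # point 2: user-configured tier mapping
    return RoutingDecision(model, tier, f"classified as {tier}")


def format_indicator(decision: RoutingDecision) -> str:
    """Point 4: a one-line indicator of which model was used and why."""
    return f"{decision.model} ({decision.reason})"
```

Appending each `RoutingDecision` to a list would give the routing log, and `format_indicator` could feed a small label next to each response.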
This would be especially valuable for power users who send a high volume of mixed queries daily. It turns the app into a cost-aware assistant, not just a model wrapper.
Would love to hear your thoughts on feasibility. Happy to elaborate or test a prototype if helpful!
Thanks for building such a great tool.
In Review
Feature Request
About 3 hours ago

Koben Alex