LLM Routing

Hi,

I'd love to suggest a feature that could significantly reduce API costs while maintaining response quality: **intelligent query routing**.

**The idea:**

Before sending a request to the main LLM, a lightweight local model (e.g., a small classifier or a tiny model like Phi-3-mini / Qwen2.5-0.5B running via Ollama) evaluates the complexity of the user's query and routes it to the appropriate model:

- **Simple** → fast, cheap model (e.g., Claude Haiku, GPT-4o-mini)

- **Medium** → balanced model (e.g., Claude Sonnet, GPT-4o)

- **Complex** → powerful model (e.g., Claude Opus, o1)
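To make the classification step concrete, here is a minimal sketch of what the classifier's interface could look like. The keyword list and word-count thresholds are made up for illustration; in the actual proposal a small local model or trained classifier would replace the body of `classify_complexity`.

```python
from enum import Enum


class Tier(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"


# Hypothetical keyword hints. A real router would use a trained classifier
# or a small local model instead of this hand-written list.
COMPLEX_HINTS = ("prove", "architecture", "refactor", "step by step", "trade-off")


def classify_complexity(query: str) -> Tier:
    """Crude stand-in for the local classifier: keyword hints plus query length."""
    text = query.lower()
    word_count = len(text.split())
    if any(hint in text for hint in COMPLEX_HINTS) or word_count > 150:
        return Tier.COMPLEX
    if word_count > 25:
        return Tier.MEDIUM
    return Tier.SIMPLE
```

The point is only the shape: a query goes in, one of three tiers comes out, and everything downstream depends on that tier rather than on how it was computed.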

**Why this matters:**

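In practice, an estimated 60–70% of everyday queries are of simple or medium complexity. Routing them to cheaper models could cut API costs by roughly 40–60% with little to no quality loss, though the actual saving depends on the query mix and on the price gap between tiers. There's even an open-source framework for this, [RouteLLM](https://github.com/lm-sys/RouteLLM) from the Berkeley-affiliated LMSYS group, which routes between a stronger and a weaker model and supports this approach.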
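For anyone who wants to sanity-check the savings estimate against their own usage, a back-of-the-envelope calculation could look like the sketch below. The traffic shares and per-query prices are illustrative placeholders, not real traffic data or vendor pricing, and `routing_savings` is a hypothetical helper name.

```python
def routing_savings(shares, prices, baseline_tier):
    """Fraction of cost saved by routing, versus sending everything to baseline_tier.

    shares: tier -> fraction of queries (must sum to 1)
    prices: tier -> average cost per query, in any consistent unit
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    routed_cost = sum(shares[tier] * prices[tier] for tier in shares)
    return 1.0 - routed_cost / prices[baseline_tier]


# Illustrative numbers only: 60% of queries are simple or medium,
# and the price ratios between tiers are assumed, not quoted.
shares = {"simple": 0.3, "medium": 0.3, "complex": 0.4}
prices = {"simple": 1.0, "medium": 5.0, "complex": 25.0}
saving = routing_savings(shares, prices, baseline_tier="complex")
print(f"Estimated saving: {saving:.1%}")
```

The result is very sensitive to the baseline: if someone already sends everything to the mid-tier model, routing a share of queries up to the most expensive tier can increase cost rather than reduce it.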

**Suggested implementation:**

1. A local routing layer that classifies each query before it's sent out

2. Three configurable tiers (Simple / Medium / Complex), each mapped to a user-selected model

3. An optional override — users can manually force a specific model for a request

4. A routing log or indicator showing which model was used and why
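The four points above could fit together roughly as in this sketch. Everything here is hypothetical: the `Router` and `RoutingDecision` names, the placeholder model names, and the toy classifier passed in are assumptions for illustration, not anything the app currently exposes.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RoutingDecision:
    model: str
    tier: Optional[str]  # None when the user forced a model
    reason: str


@dataclass
class Router:
    classify: Callable[[str], str]  # returns "simple", "medium" or "complex"
    tier_models: dict[str, str]     # user-configured mapping: tier -> model name
    log: list[RoutingDecision] = field(default_factory=list)

    def route(self, query: str, force_model: Optional[str] = None) -> RoutingDecision:
        """Pick a model for the query and record why it was chosen."""
        if force_model is not None:
            decision = RoutingDecision(force_model, None, "manual override")
        else:
            tier = self.classify(query)
            decision = RoutingDecision(
                self.tier_models[tier], tier, f"classified as {tier}"
            )
        self.log.append(decision)
        return decision


router = Router(
    # Toy stand-in for the local classifier.
    classify=lambda q: "complex" if len(q.split()) > 50 else "simple",
    tier_models={
        "simple": "cheap-model",
        "medium": "balanced-model",
        "complex": "strong-model",
    },
)
first = router.route("What time zone is Lisbon in?")
forced = router.route("Quick question", force_model="strong-model")
```

Keeping the classifier as a plain callable means the routing layer doesn't care whether the tier comes from a heuristic, a small local model, or something else, and the `log` list is what a routing indicator in the UI could read from.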

This would be especially valuable for power users who send a high volume of mixed queries daily. It turns the app into a cost-aware assistant, not just a model wrapper.

Would love to hear your thoughts on feasibility. Happy to elaborate or test a prototype if helpful!

Thanks for building such a great tool.

Alma
