How Optimization Works

Understand how Overmind evaluates your traces and recommends better prompts and models. This page explains each step.

               Traces come in
                     │
                     ▼
              Agent detection
                     │
                     ▼
           LLM Judge evaluation
                     │
          ┌──────────┴─────────────┐
          ▼                        ▼
Prompt experimentation   Model experimentation
          │                        │
          └──────────┬─────────────┘
                     ▼
              Recommendations
                     │
                     ▼
               User feedback
             (accept / reject)
                     │
                     ▼
             Refined criteria,
             repeat the loop

Traces come in

Every LLM call made through the Overmind client is recorded as a trace. A trace captures:

  • Input: The full prompt (system message, user message, tool definitions)
  • Output: The model’s response
  • Model: Which model was used (e.g., gpt-4o, claude-sonnet-4-5-20250929)
  • Timing: Latency in milliseconds
  • Tokens: Input and output token counts
  • Cost: Estimated cost based on model pricing
  • Metadata: Project, user, workflow name, custom tags

Traces are visible immediately in the dashboard under the Traces tab, with a flame chart view for understanding execution timing.
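Conceptually, a trace is a structured record combining all of the fields above. The sketch below shows the kind of data involved; the field names and layout are illustrative, not Overmind's actual schema:

```python
# Illustrative shape of a recorded trace; the real wire format may differ.
trace = {
    "input": {
        "system": "You are a helpful customer support agent for Acme Corp.",
        "user": 'Please help them with their question: "Where is my package?"',
        "tools": [],
    },
    "output": "Your package for order #12345 is out for delivery.",
    "model": "gpt-4o",
    "latency_ms": 1180,
    "tokens": {"input": 412, "output": 57},
    "cost_usd": 0.0031,
    "metadata": {"project": "support-bot", "workflow": "order-status"},
}
```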


Agent detection

Once Overmind has 30+ traces, it analyzes them to detect agents.

Each agent uses its own prompt template, which lets Overmind identify distinct agents among your LLM traces. A template is the fixed structure of your prompt, with the variables separated out. For example, if your traces contain prompts like:

You are a helpful customer support agent for Acme Corp.
The customer's name is John and their order ID is #12345.
Please help them with their question: "Where is my package?"

Overmind identifies the template:

You are a helpful customer support agent for {company}.
The customer's name is {customer_name} and their order ID is {order_id}.
Please help them with their question: "{question}"

This separation is important because optimization works on the template (the fixed part you control), not the variables (which change per request).
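Template detection can be pictured as diffing prompts that share a fixed skeleton: words common to the prompts form the template, and divergent spans become variables. This is a minimal word-level sketch using Python's difflib, not Overmind's actual algorithm:

```python
from difflib import SequenceMatcher

def extract_template(a: str, b: str, placeholder: str = "{var}") -> str:
    """Keep words shared by both prompts; collapse divergent spans into a placeholder."""
    wa, wb = a.split(), b.split()
    out = []
    for tag, i1, i2, _, _ in SequenceMatcher(None, wa, wb).get_opcodes():
        if tag == "equal":
            out.extend(wa[i1:i2])     # fixed part of the template
        else:
            out.append(placeholder)   # spot where the prompts diverge
    return " ".join(out)

p1 = "The customer's name is John and their order ID is #12345."
p2 = "The customer's name is Mary and their order ID is #98765."
print(extract_template(p1, p2))
# The customer's name is {var} and their order ID is {var}
```

A production system would compare many traces, name the variables, and handle reordered or multi-line spans, but the core idea is the same.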

Agent detection happens automatically. If you want more control, you can also define Agents explicitly via the SDK — see the SDK Reference.


LLM Judge evaluation

With agents, templates, and traces in hand, Overmind infers what each agent is trying to achieve and formulates a set of criteria defining what a good response should look like. It then runs an LLM judge: a separate LLM call that scores each trace against these criteria.

Default evaluation dimensions:

Dimension | What it measures
--------- | ----------------
Quality   | Is the output correct? Does it meet the specified criteria?
Cost      | How much did this call cost? Can we achieve similar quality for less?
Latency   | How long did the call take? Are there faster alternatives?

The judge evaluates every trace and produces scores that are aggregated per template and per model.

While Overmind formulates the initial set of criteria by itself, you are always in control and can edit these criteria as you see fit — for example, you might add:

  • “Must not contain financial advice”
  • “Must cite sources when making factual claims”
  • “Must always use formulas and not hardcoded values”
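A judge of this kind boils down to building a scoring prompt from the criteria and parsing the reply into scores. The sketch below is illustrative: the prompt wording, reply format, and function names are assumptions, and the LLM call itself is elided.

```python
import json

CRITERIA = [
    "Response is correct and addresses the question",
    "Must not contain financial advice",
]

def build_judge_prompt(trace_input: str, trace_output: str, criteria: list) -> str:
    """Assemble a scoring prompt for a separate judge LLM call (illustrative wording)."""
    lines = "\n".join(f"- {c}" for c in criteria)
    return (
        "Score the assistant response against each criterion from 1-10.\n"
        f"Criteria:\n{lines}\n\n"
        f"Input: {trace_input}\nOutput: {trace_output}\n"
        'Reply as JSON: {"scores": [...], "rationale": "..."}'
    )

def parse_judge_reply(reply: str) -> float:
    """Average the per-criterion scores from the judge's JSON reply."""
    scores = json.loads(reply)["scores"]
    return sum(scores) / len(scores)

prompt = build_judge_prompt("Where is my package?", "It ships tomorrow.", CRITERIA)
# The judge LLM call is elided; a reply might look like:
reply = '{"scores": [9, 10], "rationale": "Accurate, no financial advice."}'
print(parse_judge_reply(reply))  # 9.5
```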

Prompt experimentation

Once the baseline is established (your current prompts scored by the judge), Overmind generates candidate prompt variations and tests them.

How it works:

  1. Overmind generates modified versions of your prompt template (rephrased instructions, restructured format, different emphasis)
  2. Each variation is tested against the same inputs from your historical traces
  3. The LLM judge scores the outputs of each variation
  4. Variations that score better on your criteria are flagged as recommendations

Example: Overmind might discover that restructuring your system prompt from a paragraph into bullet points improves response quality by 15% as measured by the judge.
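The steps above can be sketched as a replay-and-score loop. Everything here is illustrative: the model and judge are stubbed out, and none of the function names belong to a real API.

```python
def best_variation(variations, historical_inputs, run_model, judge):
    """Replay each candidate template over the same inputs and keep the top scorer."""
    results = {}
    for template in variations:
        outputs = [run_model(template.format(q=q)) for q in historical_inputs]
        scores = [judge(o) for o in outputs]
        results[template] = sum(scores) / len(scores)
    return max(results, key=results.get), results

# Stubs for illustration: here, bullet-style prompts "score" higher with the judge.
run_model = lambda prompt: prompt
judge = lambda output: 8.5 if output.startswith("-") else 7.0

variants = [
    "Answer helpfully: {q}",
    "- Be concise\n- Be accurate\nQuestion: {q}",
]
winner, scores = best_variation(variants, ["Where is my package?"], run_model, judge)
```

The real pipeline replays variations through the actual model and scores real outputs with the LLM judge, but the comparison logic follows this shape.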


Model experimentation

In parallel with prompt experimentation, Overmind tests your traces against different models.

For example, if you’re currently using gpt-5, Overmind will replay your inputs through:

  • gpt-5-mini (cheaper, faster)
  • claude-sonnet-4-5-20250929 (different provider)
  • Other available models

The LLM judge scores the outputs from each model using the same criteria. The result is a cost/quality/latency comparison:

Model             | Quality Score | Avg Latency | Avg Cost
----------------- | ------------- | ----------- | --------
gpt-5 (current)   | 8.5/10        | 1200ms      | $0.03
gpt-5-mini        | 8.1/10        | 400ms       | $0.004
claude-sonnet-4-5 | 8.7/10        | 900ms       | $0.02

This helps you make informed decisions about model selection for your specific use case.
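Conceptually, model backtesting is the same replay-and-score loop applied across models, aggregated into the comparison table above. This is an illustrative sketch with stubbed replay and judge functions, not Overmind's implementation:

```python
def compare_models(traces, candidates, replay, judge):
    """Replay trace inputs through each candidate model and aggregate the results."""
    rows = []
    for model in candidates:
        outs = [replay(model, t["input"]) for t in traces]
        rows.append({
            "model": model,
            "quality": sum(judge(o["text"]) for o in outs) / len(outs),
            "latency_ms": sum(o["latency_ms"] for o in outs) / len(outs),
            "cost_usd": sum(o["cost_usd"] for o in outs) / len(outs),
        })
    return sorted(rows, key=lambda r: -r["quality"])  # best quality first

# Stubs for illustration with made-up numbers:
replay = lambda model, inp: {
    "text": f"{model}: ok",
    "latency_ms": 400 if "mini" in model else 1200,
    "cost_usd": 0.004 if "mini" in model else 0.03,
}
judge = lambda text: 8.1 if "mini" in text else 8.5

traces = [{"input": "Where is my package?"}]
ranking = compare_models(traces, ["gpt-5", "gpt-5-mini"], replay, judge)
```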


Recommendations

Results from prompt and model experimentation appear in the dashboard as suggestions.

Each recommendation includes:

  • What changed: The specific prompt modification or model swap
  • Impact: Before vs after scores on quality, cost, and latency
  • Confidence: How many traces were used to validate the recommendation

You review each recommendation and take action:

  • Accept: Apply the suggested change (update your prompt or model)
  • Reject: Mark the suggestion as not useful
  • Provide feedback: Explain why a suggestion doesn’t work for your use case

Your feedback refines the optimization engine. If you reject a recommendation because the output style doesn't match your brand voice, Overmind learns to factor that into future evaluations. If you say a suggested model is too slow, latency gets a larger weight in future suggestions.

The loop then repeats with refined evaluation criteria and recommendation logic, producing increasingly relevant recommendations.
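One simple way to picture this feedback loop is as a weight adjustment over the evaluation dimensions: a rejection that cites latency nudges the latency weight upward. This sketch is purely illustrative of the idea, not how the engine actually works:

```python
def apply_feedback(weights: dict, signal: str, step: float = 0.1) -> dict:
    """Nudge the weight of the criticized dimension upward, then renormalize."""
    weights = dict(weights)
    weights[signal] = weights.get(signal, 0.0) + step
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

w = {"quality": 0.6, "cost": 0.2, "latency": 0.2}
w = apply_feedback(w, "latency")  # user rejected a model for being too slow
```

After the update, latency counts for more than cost in subsequent scoring, and the weights still sum to 1.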


When optimization runs

Optimization can be triggered in two ways:

  • Automatic: Overmind runs evaluations when new batches of traces arrive (30+ new traces for a given template)
  • Manual: You can trigger an evaluation run from the dashboard for any template at any time

In the Traces tab, you can browse all LLM invocations. Filter by time range, project, agent, or status. Click into any trace to see the full input/output, a timing breakdown (flame chart), and evaluation scores.

For each detected agent, you can see agent-level information: the current prompt template, model, performance statistics, and more. Overmind automatically highlights problematic traces and surfaces new cost-saving and performance-improvement opportunities.

From this page you can also manually trigger prompt tuning and model backtesting, provide feedback, and adjust evaluation criteria.


FAQ

How many traces do I need before optimization starts? At least 30 traces for a given prompt template. More traces produce better recommendations.

Does Overmind modify my prompts in production? No. Overmind only suggests changes. You decide what to adopt and update your code accordingly.

Can I use custom evaluation criteria? Yes. You can define custom criteria for the LLM judge from the dashboard.

What if I disagree with a recommendation? Reject it and optionally provide feedback. The system learns from your preferences.