How Optimization Works

Understand how Overmind evaluates your traces and recommends better prompts and models. This page explains each step.

               Traces come in
                     │
                     ▼
              Agent detection
                     │
                     ▼
           LLM Judge evaluation
                     │
          ┌──────────┴─────────────┐
          ▼                        ▼
Prompt experimentation   Model experimentation
          │                        │
          └──────────┬─────────────┘
                     ▼
              Recommendations
                     │
                     ▼
               User feedback
             (accept / reject)
                     │
                     ▼
             Refined criteria,
             repeat the loop

Traces come in

Every LLM call made through the Overmind client is recorded as a trace. A trace captures:

  • Input: The full prompt (system message, user message, tool definitions)
  • Output: The model’s response
  • Model: Which model was used (e.g., gpt-4o, claude-sonnet-4-5-20250929)
  • Timing: Latency in milliseconds
  • Tokens: Input and output token counts
  • Cost: Estimated cost based on model pricing
  • Metadata: Project, user, workflow name, custom tags

Traces are visible immediately in the dashboard under the Traces tab, with a flame chart view for understanding execution timing.
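Conceptually, a trace is a structured record combining all of the fields above. The sketch below shows the kind of data involved; the field names and layout are illustrative, not Overmind's actual schema:

```python
# Illustrative shape of a recorded trace; the real wire format may differ.
trace = {
    "input": {
        "system": "You are a helpful customer support agent for Acme Corp.",
        "user": 'Please help them with their question: "Where is my package?"',
        "tools": [],
    },
    "output": "Your package for order #12345 is out for delivery.",
    "model": "gpt-4o",
    "latency_ms": 1180,
    "tokens": {"input": 412, "output": 57},
    "cost_usd": 0.0031,
    "metadata": {"project": "support-bot", "workflow": "order-status"},
}
```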


Agent detection

Once Overmind has 30+ traces, it analyzes them to detect agents.

Each agent uses its own prompt template, which lets Overmind identify distinct agents among your LLM traces. A template is the fixed structure of your prompt, with the variables separated out. For example, if your traces contain prompts like:

You are a helpful customer support agent for Acme Corp.
The customer's name is John and their order ID is #12345.
Please help them with their question: "Where is my package?"

Overmind identifies the template:

You are a helpful customer support agent for {company}.
The customer's name is {customer_name} and their order ID is {order_id}.
Please help them with their question: "{question}"

This separation is important because optimization works on the template (the fixed part you control), not the variables (which change per request).
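Template detection can be pictured as diffing prompts that share a fixed skeleton: words common to the prompts form the template, and divergent spans become variables. This is a minimal word-level sketch using Python's difflib, not Overmind's actual algorithm:

```python
from difflib import SequenceMatcher

def extract_template(a: str, b: str, placeholder: str = "{var}") -> str:
    """Keep words shared by both prompts; collapse divergent spans into a placeholder."""
    wa, wb = a.split(), b.split()
    out = []
    for tag, i1, i2, _, _ in SequenceMatcher(None, wa, wb).get_opcodes():
        if tag == "equal":
            out.extend(wa[i1:i2])     # fixed part of the template
        else:
            out.append(placeholder)   # spot where the prompts diverge
    return " ".join(out)

p1 = "The customer's name is John and their order ID is #12345."
p2 = "The customer's name is Mary and their order ID is #98765."
print(extract_template(p1, p2))
# The customer's name is {var} and their order ID is {var}
```

A production system would compare many traces, name the variables, and handle reordered or multi-line spans, but the core idea is the same.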

Agent detection happens automatically. If you want more control, you can also define Agents explicitly via the SDK — see the SDK Reference.


LLM Judge evaluation

With agents, templates, and traces in hand, Overmind infers what each agent is trying to achieve and formulates a set of criteria defining what a good response should look like. It then runs an LLM judge: a separate LLM call that scores each trace against these criteria.

Default evaluation dimensions:

Dimension | What it measures
--------- | ----------------
Quality   | Is the output correct? Does it meet the specified criteria?
Cost      | How much did this call cost? Can we achieve similar quality for less?
Latency   | How long did the call take? Are there faster alternatives?

The judge evaluates every trace and produces scores that are aggregated per template and per model.

While Overmind formulates the initial set of criteria by itself, you are always in control and can edit these criteria as you see fit — for example, you might add:

  • “Must not contain financial advice”
  • “Must cite sources when making factual claims”
  • “Must always use formulas and not hardcoded values”
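A judge of this kind boils down to building a scoring prompt from the criteria and parsing the reply into scores. The sketch below is illustrative: the prompt wording, reply format, and function names are assumptions, and the LLM call itself is elided.

```python
import json

CRITERIA = [
    "Response is correct and addresses the question",
    "Must not contain financial advice",
]

def build_judge_prompt(trace_input: str, trace_output: str, criteria: list) -> str:
    """Assemble a scoring prompt for a separate judge LLM call (illustrative wording)."""
    lines = "\n".join(f"- {c}" for c in criteria)
    return (
        "Score the assistant response against each criterion from 1-10.\n"
        f"Criteria:\n{lines}\n\n"
        f"Input: {trace_input}\nOutput: {trace_output}\n"
        'Reply as JSON: {"scores": [...], "rationale": "..."}'
    )

def parse_judge_reply(reply: str) -> float:
    """Average the per-criterion scores from the judge's JSON reply."""
    scores = json.loads(reply)["scores"]
    return sum(scores) / len(scores)

prompt = build_judge_prompt("Where is my package?", "It ships tomorrow.", CRITERIA)
# The judge LLM call is elided; a reply might look like:
reply = '{"scores": [9, 10], "rationale": "Accurate, no financial advice."}'
print(parse_judge_reply(reply))  # 9.5
```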

Prompt experimentation

Once the baseline is established (your current prompts scored by the judge), Overmind generates candidate prompt variations and tests them.

How it works:

  1. Overmind generates modified versions of your prompt template (rephrased instructions, restructured format, different emphasis)
  2. Each variation is tested against the same inputs from your historical traces
  3. The LLM judge scores the outputs of each variation
  4. Variations that score better on your criteria are flagged as recommendations

Example: Overmind might discover that restructuring your system prompt from a paragraph into bullet points improves response quality by 15% as measured by the judge.
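The steps above can be sketched as a replay-and-score loop. Everything here is illustrative: the model and judge are stubbed out, and none of the function names belong to a real API.

```python
def best_variation(variations, historical_inputs, run_model, judge):
    """Replay each candidate template over the same inputs and keep the top scorer."""
    results = {}
    for template in variations:
        outputs = [run_model(template.format(q=q)) for q in historical_inputs]
        scores = [judge(o) for o in outputs]
        results[template] = sum(scores) / len(scores)
    return max(results, key=results.get), results

# Stubs for illustration: here, bullet-style prompts "score" higher with the judge.
run_model = lambda prompt: prompt
judge = lambda output: 8.5 if output.startswith("-") else 7.0

variants = [
    "Answer helpfully: {q}",
    "- Be concise\n- Be accurate\nQuestion: {q}",
]
winner, scores = best_variation(variants, ["Where is my package?"], run_model, judge)
```

The real pipeline replays variations through the actual model and scores real outputs with the LLM judge, but the comparison logic follows this shape.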


Model experimentation

In parallel with prompt experimentation, Overmind tests your traces against different models.

For example, if you’re currently using gpt-5, Overmind will replay your inputs through:

  • gpt-5-mini (cheaper, faster)
  • claude-sonnet-4-5-20250929 (different provider)
  • Other available models

The LLM judge scores the outputs from each model using the same criteria. The result is a cost/quality/latency comparison:

Model             | Quality Score | Avg Latency | Avg Cost
----------------- | ------------- | ----------- | --------
gpt-5 (current)   | 8.5/10        | 1200ms      | $0.03
gpt-5-mini        | 8.1/10        | 400ms       | $0.004
claude-sonnet-4-5 | 8.7/10        | 900ms       | $0.02

This helps you make informed decisions about model selection for your specific use case.
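Conceptually, model backtesting is the same replay-and-score loop applied across models, aggregated into the comparison table above. This is an illustrative sketch with stubbed replay and judge functions, not Overmind's implementation:

```python
def compare_models(traces, candidates, replay, judge):
    """Replay trace inputs through each candidate model and aggregate the results."""
    rows = []
    for model in candidates:
        outs = [replay(model, t["input"]) for t in traces]
        rows.append({
            "model": model,
            "quality": sum(judge(o["text"]) for o in outs) / len(outs),
            "latency_ms": sum(o["latency_ms"] for o in outs) / len(outs),
            "cost_usd": sum(o["cost_usd"] for o in outs) / len(outs),
        })
    return sorted(rows, key=lambda r: -r["quality"])  # best quality first

# Stubs for illustration with made-up numbers:
replay = lambda model, inp: {
    "text": f"{model}: ok",
    "latency_ms": 400 if "mini" in model else 1200,
    "cost_usd": 0.004 if "mini" in model else 0.03,
}
judge = lambda text: 8.1 if "mini" in text else 8.5

traces = [{"input": "Where is my package?"}]
ranking = compare_models(traces, ["gpt-5", "gpt-5-mini"], replay, judge)
```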


Recommendations

Results from prompt and model experimentation appear in the dashboard as suggestions.

Each recommendation includes:

  • What changed: The specific prompt modification or model swap
  • Impact: Before vs after scores on quality, cost, and latency
  • Confidence: How many traces were used to validate the recommendation

You review each recommendation and take action:

  • Accept: Apply the suggested change (update your prompt or model)
  • Reject: Mark the suggestion as not useful
  • Provide feedback: Explain why a suggestion doesn’t work for your use case

Your feedback refines the optimization engine. If you reject a recommendation because the output style doesn't match your brand voice, Overmind learns to factor that into future evaluations. If you say a suggested model is too slow, latency gets a larger weight in future suggestions.

The loop then repeats with refined evaluation criteria and recommendation logic, producing increasingly relevant recommendations.
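One simple way to picture this feedback loop is as a weight adjustment over the evaluation dimensions: a rejection that cites latency nudges the latency weight upward. This sketch is purely illustrative of the idea, not how the engine actually works:

```python
def apply_feedback(weights: dict, signal: str, step: float = 0.1) -> dict:
    """Nudge the weight of the criticized dimension upward, then renormalize."""
    weights = dict(weights)
    weights[signal] = weights.get(signal, 0.0) + step
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

w = {"quality": 0.6, "cost": 0.2, "latency": 0.2}
w = apply_feedback(w, "latency")  # user rejected a model for being too slow
```

After the update, latency counts for more than cost in subsequent scoring, and the weights still sum to 1.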


When optimization runs

Optimization can be triggered in two ways:

  • Automatic: Overmind runs evaluations when new batches of traces arrive (30+ new traces for a given template)
  • Manual: You can trigger an evaluation run from the dashboard for any template at any time

In the Traces tab, you can browse all LLM invocations. Filter by time range, project, agent, or status. Click into any trace to see the full input/output, a timing breakdown (flame chart), and evaluation scores.

For each detected agent, you can see agent-level information: the current prompt template, model, performance statistics, and more. Overmind automatically highlights problematic traces and surfaces new cost-saving and performance-improvement opportunities.

From this page you can also manually trigger prompt tuning and model backtesting, provide feedback, and adjust evaluation criteria.


FAQ

How many traces do I need before optimization starts? At least 30 traces for a given prompt template. More traces produce better recommendations.

Does Overmind modify my prompts in production? No. Overmind only suggests changes. You decide what to adopt and update your code accordingly.

Can I use custom evaluation criteria? Yes. You can define custom criteria for the LLM judge from the dashboard.

What if I disagree with a recommendation? Reject it and optionally provide feedback. The system learns from your preferences.