
Introduction

Automatically optimize your AI agents — better prompts, better tools, better models.

Most “agent observability” platforms today focus on the manual half of the loop: collecting traces, surfacing slow runs, optimizing prompts in isolation, and putting a human in front of every dashboard to review and tweak. That works for experiments and demos, but it doesn’t scale to the real prize — agents deployed in industry that run autonomously on real data, day after day, doing concrete jobs (qualifying leads, triaging support, extracting contracts, coding clinical encounters, processing invoices, handling on-call alerts, automating returns, generating research briefs).

These agents fail in long-tailed, data-shaped ways that no amount of prompt-by-hand iteration will catch. You ship an agent, it works 70% of the time, and you spend weeks reading logs, tweaking prompts, adjusting tool definitions, and re-running evals — only to find that fixing one case breaks three others.

Overmind closes that loop autonomously: it runs the agent against real data, scores the outputs against your policies, diagnoses the failures with a strong reasoning model, generates and validates concrete code fixes, and only keeps changes that measurably improve the agent without regressions. No dashboards to babysit. No manual prompt-tweak sessions. No bespoke eval scripts.

Overmind is an agent improvement platform focused on one outcome: better production agents over time, driven by real data — without manual review loops.

Want the product-level view first? See the Overmind Platform Overview.

Today, Overmind includes an autonomous optimizer (the overmind optimize CLI), tracing SDKs for Python and JavaScript, and a console at console.overmindlab.ai for viewing results.

The optimizer is the core loop. You register your agent as a Python entrypoint, define (or let Overmind infer) a policy describing what correct behavior looks like, and run overmind optimize. Overmind then:

  1. Runs your agent across a test dataset and collects detailed traces of every LLM call and tool invocation.
  2. Scores outputs against an evaluation spec derived from your policy.
  3. Diagnoses failures with a strong reasoning model that sees the code, policy, traces, and scores.
  4. Generates candidate code fixes (best-of-N) targeting prompts, tool descriptions, model selection, and agent logic.
  5. Accepts or reverts each candidate based on regression-aware acceptance criteria — keeping only changes that genuinely improve the agent.

After a few rounds you get a measurably better agent, plus a readable report and diff.
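To make the acceptance step concrete, here is a toy sketch of regression-aware, best-of-N acceptance in plain Python. This is not Overmind's implementation; every name and number below is hypothetical and only illustrates the idea of keeping a candidate when it beats the baseline on the train split without regressing on the holdout split.

from dataclasses import dataclass
import random

@dataclass
class Candidate:
    description: str      # e.g. "tighten system prompt", "rewrite tool description"
    train_score: float    # mean eval score on the train split
    holdout_score: float  # mean eval score on the holdout split

def propose_candidates(n: int) -> list[Candidate]:
    # Stand-in for the diagnose + generate stages (steps 3-4 above).
    return [Candidate(f"candidate-{i}", random.uniform(0.5, 0.9), random.uniform(0.5, 0.9))
            for i in range(n)]

def accept_best(baseline: float, candidates: list[Candidate], min_gain: float = 0.02) -> Candidate | None:
    # Keep a candidate only if it improves on the train split without regressing
    # the holdout split (step 5 above); otherwise everything is reverted.
    viable = [c for c in candidates
              if c.train_score >= baseline + min_gain and c.holdout_score >= baseline]
    return max(viable, key=lambda c: c.holdout_score, default=None)

baseline = 0.70  # steps 1-2: run the current agent and score it
best = accept_best(baseline, propose_candidates(n=4))
print("accepted:", best.description if best else "none -- all candidates reverted")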

What sets it apart:

  • Autonomous, not manual. Other observability tools collect traces and let you read them; Overmind reads them for you, diagnoses why runs failed, generates code fixes, validates them, and accepts only the ones that improve the agent. The dashboard is the diff and the report, not a queue of runs awaiting your eyeballs.
  • Policy-driven. You define the decision rules, constraints, and expectations. Every stage — evaluation, synthesis, diagnosis, scoring — respects them, so the optimizer can’t game metrics in ways that violate your business rules.
  • Full-stack, not prompt-only. Overmind optimizes system prompts, tool descriptions, model selection, control flow, output parsing, and iteration limits — together, in a single loop. Generic prompt optimizers tweak a string in isolation; real production agents need the whole stack moved at once.
  • Trace-aware. Diagnosis starts from detailed per-call traces, so the analyzer sees why a run failed, not just that it did.
  • Zero infrastructure. Runs from your terminal. Artifacts are pushed to the Overmind backend and viewable at console.overmindlab.ai/agents — no local setup required.
  • Safe by default. Train/holdout splits, regression-aware acceptance, complexity penalties, and label-leakage guards prevent the optimizer from overfitting or hacking the metric.
Your Python agent (registered entrypoint)
                    │
        overmind optimize <name>
                    │
     ┌──────────────┴───────────────────────────┐
     ▼                                          │
Run agent on dataset ──▶ Traces + outputs       │
     │                                          │
     ▼                                          │
Score vs. eval spec (+ policy)                  │
     │                                          │
     ▼                                          │
Diagnose failures                               │
     │                                          │
     ▼                                          │
Generate N candidate fixes                      │
     │                                          │
     ▼                                          │
Validate + re-score                             │
     │                                          │
     ▼                                          │
Accept best / revert rest ─────────────────────┘
     │
     ▼
Optimized agent + report (console)

Requirements: Python 3.10+, uv, and API keys for at least one LLM provider (OpenAI, Anthropic).

  1. Install the CLI:
Terminal window
pip install overmind
  2. Initialize in your project:
Terminal window
cd your-agent-project/
overmind init # configure API keys and default models
  3. Register your agent entrypoint:
Terminal window
overmind agent register my-agent agents.my_agent:run

Your function receives an input dict and must return a dict:

def run(input_data: dict) -> dict:
    result = ...  # your agent logic: LLM calls, tool use, etc.
    return {"response": result}

Framework-based agents (Google ADK, LangChain, CrewAI, …) don’t need a custom wrapper — Overmind detects them and offers to auto-generate an entrypoint.

  4. Set up evaluation:
Terminal window
overmind setup my-agent
# or bring your own policy document:
overmind setup my-agent --policy docs/my_policy.md

This analyzes your code, generates (or imports) a policy, builds a test dataset, and proposes scoring criteria.
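A policy document is ordinary prose or markdown describing the rules the agent must follow. As a purely illustrative sketch (these rules are invented for a support-triage agent, not prescribed by Overmind), docs/my_policy.md might contain:

# Support triage policy
- Classify every ticket as exactly one of: billing, bug, feature-request, other.
- Never promise a refund; escalate refund requests to a human agent.
- Replies must be under 120 words and reference the ticket ID.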

  5. Optimize:
Terminal window
overmind optimize my-agent

Sit back as Overmind iteratively diagnoses, fixes, and validates improvements. Results are pushed to the backend — view them at console.overmindlab.ai/agents.

See the Getting Started guide for the full walkthrough and the Overmind guide for deep reference.

If you want to trace LLM calls from a running application (staging, production, or during local development) — independently of the optimizer — Overmind ships Python and JavaScript SDKs. Call init() once and every LLM call is captured with model, inputs/outputs, latency, token counts, and cost.

Terminal window
pip install overmind
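In Python, that looks roughly like the following. The overmind.init() import path is an assumption here; check the Python SDK reference for the exact call and the JS/TS SDK reference for the JavaScript equivalent.

# Illustrative sketch -- the exact import path and init() options may differ.
import overmind
from openai import OpenAI

overmind.init()  # start capturing LLM calls made by this process

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize today's on-call alerts."}],
)
# The call above is recorded with model, inputs/outputs, latency,
# token counts, and cost, and shows up in the Overmind console.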

See the Python SDK reference, the JS/TS SDK reference, and the integrations guide for supported providers and frameworks.