5 min read
A temperature check on AI agents and workflows

tl;dr: agents and AI workflows are finally practical, but only when you scope tightly, add guardrails, and measure the right things. Most autonomy claims crumble without these basics.


A dream that tripped on reality

Remember Devin, the “AI software engineer”? On paper, it looked unstoppable. Cognition’s technical report showed Devin solved ~13.9% of issues end‑to‑end on the SWE‑bench benchmark, far ahead of prior baselines.

Then real projects happened. Answer.AI spent a month trying Devin on >20 practical tasks; it fully completed three of them. The Register summarized the same outcome: 3 wins, 3 inconclusive, 14 failures.

That gap, between benchmark wins and production reliability, is the thread of this post. It’s where agents meet the unglamorous engineering that actually ships.


Agents vs. workflows (in plain English)

  • Workflows are recipes: predictable steps with input/output contracts.
  • Agents are loops that decide the next step: they plan, call tools, check results, and repeat until they answer, escalate, or give up.

Below is a minimal example of an agent loop:

// Sketch: assumes an `llm` client that already carries the task/conversation,
// a `TOOLS` map of async tool functions, and a `verifyAndFormat` helper.
async function runAgent({ MAX_STEPS = 8 } = {}) {
  const trace = []; // step-by-step record for observability
  let steps = 0;

  while (steps++ < MAX_STEPS) {
    const intent = await llm.decide({
      schema: {
        type: "object",
        properties: {
          tool: { enum: ["search", "code", "none"] },
          args: { type: "object" },
          answer: { type: "string" },
          abstain: { type: "boolean" },
        },
        required: [],
        additionalProperties: false,
      },
    });

    if (intent.abstain) return { answer: "I don't know yet.", confidence: 0.2 };

    if (intent.tool && intent.tool !== "none") {
      const result = await TOOLS[intent.tool](intent.args);
      trace.push({ intent, result }); // observability
      continue; // plan next step
    }

    return verifyAndFormat(intent.answer); // final answer path
  }

  throw new Error("Out of budget");
}

That’s it: an LLM + a tool belt + a budget. The rest is product engineering.


Scope first or hallucinations will eat your lunch

OpenAI’s recent research put a finger on why models “hallucinate”: our evaluations reward guessing over calibrated “I don’t know.” If the scoreboard prizes confidence, models learn to bluff. The fix is dull but powerful: score abstention and calibration, not just accuracy. In product terms: your agent needs permission to say “not enough context” and ask for help.

Here is what that looks like in code:

const schema = {
  type: "object",
  properties: {
    action: { enum: ["answer", "ask_user", "search", "abstain"] },
    answer: { type: "string" },
    need: { type: "string" },
  },
  required: ["action"],
  additionalProperties: false,
};

// If action === "abstain" or "ask_user", surface it in the UI instead of forcing a guess.
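
For instance, a minimal handler on the product side (the ui helpers and runSearchStep are hypothetical, and llm.decide is the same assumed client as in the loop above):

async function handleTurn() {
  const out = await llm.decide({ schema }); // returns one of the typed actions

  switch (out.action) {
    case "answer":
      return ui.showAnswer(out.answer);
    case "ask_user":
      return ui.askFollowUp(out.need); // e.g. "Which repo and branch?"
    case "search":
      return runSearchStep(out); // hand control back to the agent loop
    case "abstain":
      return ui.showNotice("Not enough context to answer reliably.");
  }
}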

Tight scoping + a real abstain path lowers error rates in the wild. (And it makes the rest of your guardrails actually work.)


What actually works

Structured outputs (ban free‑form “maybe JSON”)

Why this matters: tool calls and UI updates can’t run on vibes. Use strict schemas so the model must return valid, typed fields; no missing keys, no invented enums.

  • OpenAI’s Structured Outputs and similar features enforce JSON Schema with a strict option. Anthropic encourages tool use to get JSON‑shaped arguments reliably. (OpenAI Cookbook)

Here is a vendor‑agnostic demo of this in JS:

const AnswerSchema = {
  name: "Answer",
  schema: {
    type: "object",
    properties: {
      answer: { type: "string" },
      confidence: { type: "number", minimum: 0, maximum: 1 },
      sources: { type: "array", items: { type: "string" } },
    },
    required: ["answer", "confidence"],
    additionalProperties: false,
  },
  strict: true,
};

const resp = await llm.call({
  input: "Summarize risks in this PR.",
  response_format: { type: "json_schema", json_schema: AnswerSchema },
});

Stateful orchestration

“Autonomous” code that just loops tends to wedge or wander. Production systems model the flow as a state machine/graph with explicit pauses, retries, and human checkpoints. LangGraph became the default here, and the platform added persistence, tracing, and deployment for long‑running agents.

nodes: plan -> {retrieve?} -> act(tool) -> verify -> (done | plan)
guards: step_budget, tool_allowlist, approval_required
persistence: save state every step; resume on crash

You get control (and a replayable trace) instead of a black box.
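
A minimal sketch of that graph as plain code, independent of any framework (llm.plan, TOOLS, TOOL_ALLOWLIST, looksDone, and saveCheckpoint are assumed helpers):

// Minimal graph runner: each node returns the name of the next node.
const nodes = {
  plan: async (state) => {
    state.plan = await llm.plan(state); // decide the next tool call
    return state.plan ? "act" : "done";
  },
  act: async (state) => {
    if (!TOOL_ALLOWLIST.has(state.plan.tool)) return "done"; // guard: tool_allowlist
    state.result = await TOOLS[state.plan.tool](state.plan.args);
    return "verify";
  },
  verify: async (state) => ((await looksDone(state)) ? "done" : "plan"),
};

async function runGraph(state, { stepBudget = 10 } = {}) {
  let current = "plan";
  for (let i = 0; i < stepBudget && current !== "done"; i++) {
    current = await nodes[current](state);
    await saveCheckpoint(state, current); // persistence: resume on crash
  }
  return state;
}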


Observability + evals as infrastructure

If you can’t trace it, you can’t trust it. Capture each step (prompts, tool args/results, and outputs) and then run dataset‑based evals on every change.

  • Tools that help: LangSmith (tracing + evals) and Arize Phoenix (OpenTelemetry‑based tracing/evals). Both let you run end‑to‑end checks, not just unit prompts.
  • The mantra you’ve seen: “evals are surprisingly often all you need” (Greg Brockman, widely quoted in the evals community). Use them to pick models and prompts with data, not vibes.

A handful of dataset‑based checks goes a long way:
const cases = [
  {
    input: "Create a PR title for this diff",
    expect: (t) => t.answer.length < 80,
  },
  {
    input: "Explain this stack trace",
    expect: (t) => /NullPointer|KeyError/.test(t.answer),
  },
];

for (const c of cases) {
  const t = await agent.run(c.input);
  console.assert(c.expect(t), `Failed on: ${c.input}`);
}

Human‑in‑the‑loop by design

Let the agent propose; you approve. Make “Apply / Revert / Edit” first‑class. Approval hooks should gate any side‑effecting tool (terminal, Git push, DB writes).

  • In Cursor, the Agent can edit files and run terminal commands; changes show up in a review UI and the CLI asks for command approval by default.
  • In Claude Code (Anthropic’s CLI), you can set --permission-mode, cap steps with --max-turns, and whitelist/blacklist tools (sane defaults for safety).
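
A minimal approval hook can be a one‑screen wrapper around any side‑effecting tool (askHuman and RAW_TOOLS are hypothetical names; the shape is what matters):

// Wrap side-effecting tools so a human approves each call before it runs.
const SIDE_EFFECTS = new Set(["terminal", "git_push", "db_write"]);

function withApproval(name, tool) {
  return async (args) => {
    if (SIDE_EFFECTS.has(name)) {
      const ok = await askHuman(`Run ${name} with ${JSON.stringify(args)}?`);
      if (!ok) return { skipped: true, reason: "rejected by reviewer" };
    }
    return tool(args);
  };
}

// The agent only ever sees the wrapped versions.
const TOOLS = Object.fromEntries(
  Object.entries(RAW_TOOLS).map(([name, tool]) => [name, withApproval(name, tool)])
);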

One‑liner you’ll actually use (Claude Code)

claude -p --max-turns 3 \
  --allowedTools "Bash(git diff:*)" "Read" \
  --output-format json "Summarize this repo and propose a two-task plan"

That’s a scoped agent with a budget and an audit trail.


What goes wrong (and how to avoid it)

The “autonomy” trap

General‑purpose autonomy looks magical in demos and melts on open‑ended tasks. Devin’s month‑long trial is the canonical example: benchmarks up, production hit rate down. Treat autonomy as a spectrum and dial it up slowly as your evals improve.

Hallucinations make messes

If your system can’t abstain, it will hallucinate with confidence. Score and surface uncertainty, or accept that wrong answers will slip through.

Security is the main failure mode

Agents with tool access expand the blast radius. In the Cursor ecosystem, recent reports and advisories showed prompt‑injection vectors and “autorun” risks from malicious repos, underscoring why approval gates, workspace trust, and allowlists matter. (Cursor publishes docs and security updates; keep the default protections on.)
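
A deny‑by‑default command allowlist is one cheap mitigation. A sketch (the patterns and the exec helper are illustrative):

// Deny by default: the agent's shell tool only runs commands that match an allowlist.
const ALLOWED_COMMANDS = [/^git diff\b/, /^git status\b/, /^npm test\b/];

async function safeExec(cmd) {
  const allowed = ALLOWED_COMMANDS.some((re) => re.test(cmd.trim()));
  if (!allowed) throw new Error(`Blocked by allowlist: ${cmd}`);
  return exec(cmd); // `exec` is whatever shell runner your agent already uses
}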

What good agents look like

Cursor (agentic IDE)

  • Why it works for many teams: an Agent that edits with reviewable diffs and can run commands with approvals, inside your project. That’s human‑in‑the‑loop, built in.
  • How to set it up sanely: keep approvals on; avoid “run everything”; treat unknown repos as hostile; audit MCP/tool settings in version control. The recent security write‑ups show why.

Claude Code (Anthropic’s CLI)

  • Why it lands well: lives in your terminal, understands your repo, and can run git/test flows, while letting you cap turns, gate permissions, and print JSON for automation. That maps perfectly to “scope + guardrails + observability.”
  • Nice touch: --output-format json means you can pipe results into your own scripts/evals without scraping.
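
For example, a small Node script can run the one‑liner above and parse its output (the result field is an assumption; check the JSON your Claude Code version actually prints):

// Run Claude Code non-interactively and feed its JSON output into your own evals.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

const { stdout } = await run("claude", [
  "-p", "--max-turns", "3", "--output-format", "json",
  "Summarize this repo and propose a two-task plan",
]);

const report = JSON.parse(stdout);
console.log(report.result ?? report); // hand this to your eval harness or CI check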

A quick word on retrieval

Don’t stop at vector search. For bigger, messier corpora, graph‑flavored retrieval like GraphRAG (entity graphs + community summaries) can answer “global” questions more coherently than naive chunking. Start simple, but know this path exists as your knowledge base grows.


Wrap‑up

The pattern behind every working AI agent today is boring, in a good way:

  • Scope narrowly, allow abstention, and budget the loop.
  • Type everything (structured outputs) and trace everything (observability + evals).
  • Put a human in the loop for side‑effects and scary actions.

Do this, and you’ll get compounding wins. Skip it, and you’ll rediscover why glossy demos don’t survive contact with production.