SafeShip — Trace

Agent timeline

5 steps · 4 passed · 1 failed

01 llm classify_intent

in: "Hi, I was double charged on my order from last week, can I get a refund?"

out: { "intent": "refund_request", "confidence": 0.94 }

142ms

02 tool lookup_order

in: { "user_id": "u_77a2", "query": "last_order" }

out: { "order_id": "4821", "total": 24.99, "items": 2, "status": "shipped" }

87ms

03 llm draft_reply

in: { intent, order: #4821, total: $24.99 } → "compose refund confirmation"

out: "…we've issued a refund of $249.00 to your card…"

1,847ms

Full input

// system + user composed prompt
{
  "intent": "refund_request",
  "order": {
    "order_id": "4821",
    "total": 24.99,
    "currency": "USD",
    "items": 2
  },
  "instruction": "Compose a short, friendly refund confirmation. Quote the exact order total. Do not invent additional fees."
}

Full output

"Hi there — thanks for reaching out! I've gone ahead and issued a refund of $249.00 back to the card on file for order #4821. You should see it land within 3–5 business days. Sorry for the inconvenience, and let me know if anything else looks off!"

What went wrong

Expected a refund amount of $24.99 based on order #4821, but the agent output $249.00, roughly a 10× overstatement.

Likely cause: the model misread the total field when composing the reply, shifting the decimal point one place to the right and dropping the cents (24.99 → 249.00). The pattern matches 3 other refund-related failures in the last 24h.
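The arithmetic behind the overstatement can be checked with a short sketch. The helper names `overstatement_ratio` and `is_decimal_shift` are hypothetical and not part of SafeShip; the second helper only tests whether the reported amount is consistent with a one-place decimal shift, which is an assumption about the failure mode rather than a confirmed diagnosis.

```python
from decimal import Decimal, ROUND_DOWN


def overstatement_ratio(expected: str, actual: str) -> Decimal:
    """Return actual / expected, rounded to two decimal places."""
    return (Decimal(actual) / Decimal(expected)).quantize(Decimal("0.01"))


def is_decimal_shift(expected: str, actual: str, places: int = 1) -> bool:
    """True if `actual` equals `expected` with the decimal point moved
    `places` to the right, comparing whole units only (cents ignored)."""
    shifted = Decimal(expected).scaleb(places)  # 24.99 -> 249.9
    whole = Decimal("1")
    return (
        shifted.quantize(whole, rounding=ROUND_DOWN)
        == Decimal(actual).quantize(whole, rounding=ROUND_DOWN)
    )


# Values taken from the failure record for step 3.
ratio = overstatement_ratio("24.99", "249.00")
shifted = is_decimal_shift("24.99", "249.00")
print(ratio, shifted)  # prints: 9.96 True
```

Amounts are passed as strings so that `Decimal` sees the exact digits instead of a binary float approximation.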

Output diff

Regression introduced by model swap

- gpt-5.0
+ gpt-5.1-mini

  "Hi there — thanks for reaching out! I've gone ahead and
- issued a refund of $24.99 back to the card on file for order #4821.
+ issued a refund of $249.00 back to the card on file for order #4821.
  You should see it land within 3–5 business days. Sorry
  for the inconvenience, and let me know if anything else
  looks off!"

Same prompt · different model · failing on 14/14 replays
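A line-level comparison of two model outputs like the one shown here can be approximated with Python's standard `difflib`. This is a minimal sketch, not SafeShip's diff viewer: the function name `line_diff` is hypothetical, and the two sample strings are abbreviated excerpts of the replies.

```python
import difflib


def line_diff(before: str, after: str, before_label: str, after_label: str) -> list[str]:
    """Return unified-diff lines comparing two multi-line outputs."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=before_label,
            tofile=after_label,
            lineterm="",  # inputs have no trailing newlines
            n=1,          # one line of context around each change
        )
    )


# Abbreviated excerpts of the two replies (assumed line breaks).
OLD_OUTPUT = (
    "I've gone ahead and\n"
    "issued a refund of $24.99 back to the card on file for order #4821.\n"
    "You should see it land within 3-5 business days."
)
NEW_OUTPUT = OLD_OUTPUT.replace("$24.99", "$249.00")

diff_lines = line_diff(OLD_OUTPUT, NEW_OUTPUT, "gpt-5.0", "gpt-5.1-mini")
for line in diff_lines:
    print(line)
```

Running the same comparison over every replay would show whether the changed line is always the refund amount, which is what a 14/14 failure rate suggests.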

04 tool policy_check

in: { "draft": "…refund of $249.00…", "ruleset": "support.v3" }

out: { "passed": true, "flags": [] } // downstream check missed numeric mismatch

61ms
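One way a downstream check could catch this kind of numeric mismatch is to extract every dollar amount from the draft and compare it with the order total. The sketch below is an assumption about what such a rule might look like, not the actual `support.v3` ruleset; `check_refund_amount`, the flag strings, and the regular expression are all hypothetical. It returns the same `passed` / `flags` shape as the step output above.

```python
import re
from decimal import Decimal

# Matches amounts such as $24.99, $249.00 or $1,249.00 (hypothetical pattern).
AMOUNT_RE = re.compile(r"\$(\d[\d,]*(?:\.\d{1,2})?)")


def check_refund_amount(draft: str, expected_total: str) -> dict:
    """Flag the draft if it quotes no amount, or any amount other than
    the expected order total."""
    expected = Decimal(expected_total)
    amounts = [Decimal(m.replace(",", "")) for m in AMOUNT_RE.findall(draft)]
    flags = []
    if not amounts:
        flags.append("no_amount_found")
    for amount in amounts:
        if amount != expected:
            flags.append(f"amount_mismatch: found {amount}, expected {expected}")
    return {"passed": not flags, "flags": flags}


bad = check_refund_amount("…refund of $249.00…", "24.99")
good = check_refund_amount("issued a refund of $24.99 back to the card", "24.99")
print(bad)
print(good)
```

If the total arrives as a JSON number, converting it with `Decimal(str(total))` avoids comparing against a binary float. A stricter rule might also reject drafts that quote more than one amount.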

05 tool queue_for_send

in: { "channel": "email", "thread_id": "th_19c4" }

out: { "queued": true, "scheduled_at": "2026-05-12T14:21:08Z" }

23ms

Raw trace JSON · 4.2 KB

{
  "run_id": "r_8f3a91b04c",
  "project": "support-triage",
  "env": "prod",
  "started_at": "2026-05-12T14:18:54.211Z",
  "duration_ms": 3184,
  "status": "failed",
  "failure": {
    "test": "draft_reply.no_hallucinated_refund",
    "step": 3,
    "expected": "$24.99",
    "actual": "$249.00",
    "severity": "high"
  },
  "model": {
    "name": "gpt-5.1-mini",
    "temperature": 0.2,
    "prev": "gpt-5.0"
  },
  "trigger": "deploy",
  "deploy": {
    "sha": "a4f7c2e",
    "by": "marin.n@safeship.dev"
  },
  "steps": [
    {
      "i": 1,
      "tool": "classify_intent",
      "kind": "llm",
      "ok": true,
      "ms": 142,
      "in": "Hi, I was double charged…",
      "out": { "intent": "refund_request", "confidence": 0.94 }
    },
    {
      "i": 2,
      "tool": "lookup_order",
      "kind": "tool",
      "ok": true,
      "ms": 87,
      "in": { "user_id": "u_77a2" },
      "out": {
        "order_id": "4821",
        "total": 24.99,
        "items": 2,
        "status": "shipped"
      }
    },
    {
      "i": 3,
      "tool": "draft_reply",
      "kind": "llm",
      "ok": false,
      "ms": 1847,
      "assertion": "output.contains($24.99)",
      "matched": false,
      "in": { /* … see expanded view */ },
      "out": "…issued a refund of $249.00…",
      "diff_against": "r_18b720c5d3 · gpt-5.0"
    },
    {
      "i": 4,
      "tool": "policy_check",
      "kind": "tool",
      "ok": true,
      "ms": 61,
      "out": { "passed": true, "flags": [] }
    },
    {
      "i": 5,
      "tool": "queue_for_send",
      "kind": "tool",
      "ok": true,
      "ms": 23,
      "out": { "queued": true }
    }
  ],
  "tokens": { "in": 412, "out": 98, "cost_usd": 0.0021 },
  "tags": ["refund", "numeric", "hallucination", "regression"],
  "parent_test": "draft_reply.suite",
  "replay_url": "https://safeship.dev/r/8f3a91b04c"
}
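The raw trace can be summarized programmatically. The sketch below works on a trimmed copy of the JSON above, keeping only the fields it reads, because the full record contains an elided `in` value that is not valid JSON. The `summarize` helper and its output keys are hypothetical.

```python
import json

# Trimmed copy of the trace: only the fields used below are kept.
TRACE_JSON = """
{
  "run_id": "r_8f3a91b04c",
  "duration_ms": 3184,
  "status": "failed",
  "steps": [
    {"i": 1, "tool": "classify_intent", "kind": "llm",  "ok": true,  "ms": 142},
    {"i": 2, "tool": "lookup_order",    "kind": "tool", "ok": true,  "ms": 87},
    {"i": 3, "tool": "draft_reply",     "kind": "llm",  "ok": false, "ms": 1847},
    {"i": 4, "tool": "policy_check",    "kind": "tool", "ok": true,  "ms": 61},
    {"i": 5, "tool": "queue_for_send",  "kind": "tool", "ok": true,  "ms": 23}
  ]
}
"""


def summarize(trace: dict) -> dict:
    """List failed steps and compare per-step time with the run duration."""
    steps = trace["steps"]
    step_ms = sum(s["ms"] for s in steps)
    return {
        "failed_steps": [(s["i"], s["tool"]) for s in steps if not s["ok"]],
        "step_ms": step_ms,
        # Time in the run not accounted for by any recorded step.
        "unattributed_ms": trace["duration_ms"] - step_ms,
    }


summary = summarize(json.loads(TRACE_JSON))
print(summary)
```

For this run the recorded step timings sum to 2,160 ms of the 3,184 ms total; the trace does not say where the remaining time was spent.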

draft_reply should not hallucinate refund amounts