From run #r_8f3a91…b04c
2m ago
support-triage
High severity
01 / 47
Agent draft_reply output a $249.00 refund instead of the correct $24.99 from order #4821.
SafeShip's suggested test
New
Refund amounts in agent output must exactly match the order's total field. No invented numbers.
test: draft_reply.refund_amount_matches_order
when: step == "draft_reply"
assert: output.amount == order.total
This failure matches the pattern of 3 other traces this week. The agent tends to hallucinate dollar amounts when generating refund language, especially after the gpt-5.1-mini swap on Tuesday.
Asserting structural equality between output.amount and the upstream order.total is cheap, deterministic, and catches the entire class.
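For readers who want to see what the assertion amounts to, here is a minimal Python sketch of the same check. It is not SafeShip's implementation: the function name, the dict-shaped `output` and `order` arguments, and the choice to compare through `Decimal` are all assumptions made for illustration.

```python
from decimal import Decimal


def refund_amount_matches_order(step, output, order):
    """Hypothetical stand-in for the suggested test.

    Returns True when the check passes or does not apply,
    and False when the refund amount differs from the order total.
    """
    # Mirrors `when: step == "draft_reply"`: other steps are not checked.
    if step != "draft_reply":
        return True
    # Mirrors `assert: output.amount == order.total`.
    # Going through str() and Decimal avoids binary float rounding
    # when the amounts arrive as floats or strings (an assumption
    # about the data shape, not something the card specifies).
    amount = Decimal(str(output["amount"]))
    total = Decimal(str(order["total"]))
    return amount == total


# The failure from the card: a $249.00 refund against a $24.99 order.
failed = refund_amount_matches_order(
    "draft_reply", {"amount": "249.00"}, {"total": "24.99"}
)
```

Because the comparison only reads two fields that already exist on the trace, it needs no model call, which is consistent with the low runtime cost the card reports.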
3 matching traces · 7d
14/14 replay failures
~2 ms runtime cost
0 false positives in shadow