SafeShip · Suggested tests

From run #r_8f3a91…b04c · 2m ago · support-triage

High severity · 01 / 47

Agent draft_reply output a $249.00 refund instead of the correct $24.99 from order #4821.

SafeShip's suggested test · New

Refund amounts in agent output must exactly match the order's total field. No invented numbers.

test: draft_reply.refund_amount_matches_order
when: step == "draft_reply"
assert: output.amount == order.total

This failure matches the pattern of 3 other traces this week. The agent tends to hallucinate dollar amounts when generating refund language, especially since the gpt-5.1-mini swap on Tuesday.

Asserting exact equality between output.amount and the upstream order.total is cheap, deterministic, and catches this entire class of failure.
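As a rough illustration of what such an assertion amounts to, here is a minimal Python sketch. It is not SafeShip's evaluator: the function name, the dict shapes for output and order, and the use of Decimal for comparing money are all assumptions made for the example. The amounts are the ones from the failing run.

```python
from decimal import Decimal


def refund_amount_matches_order(step, output, order):
    """Hypothetical check: pass when the rule does not apply or the amounts agree."""
    if step != "draft_reply":
        return True  # the rule is scoped to the draft_reply step only
    # Compare as Decimal so currency values are not subject to float rounding.
    return Decimal(str(output["amount"])) == Decimal(str(order["total"]))


# Assumed shapes for the upstream order and the agent output.
order = {"id": 4821, "total": "24.99"}
bad_output = {"amount": "249.00"}   # the hallucinated refund
good_output = {"amount": "24.99"}   # matches order.total

print(refund_amount_matches_order("draft_reply", bad_output, order))   # False
print(refund_amount_matches_order("draft_reply", good_output, order))  # True
```

The check reads only two fields and does no model calls, which is why a comparison like this can stay deterministic and fast.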

3 matching traces (7d) · 14/14 replay failures · ~2ms runtime cost · 0 false positives in shadow