Suggested tests

47 to review · 4 accepted today · auto-generated from production failures
Session: 11 / 47
From run #r_8f3a91…b04c · 2m ago · support-triage
High severity · 01 / 47

Agent draft_reply output a $249.00 refund instead of the correct $24.99 from order #4821.

SafeShip's suggested test · New

Refund amounts in agent output must exactly match the order's total field. No invented numbers.

test: draft_reply.refund_amount_matches_order
when: step == "draft_reply"
assert: output.amount == order.total

This failure matches the pattern of 3 other traces this week. The agent tends to hallucinate dollar amounts when generating refund language, especially after the gpt-5.1-mini swap on Tuesday.

Asserting that output.amount equals the upstream order.total is cheap, deterministic, and catches this entire class of invented-amount failures.
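
As a rough illustration of what this assertion checks, here is a minimal Python sketch. How SafeShip actually evaluates the test is not shown here, so the function name, the dict shapes for output and order, and the use of Decimal for the comparison are all assumptions made for the example.

```python
from decimal import Decimal


def refund_amount_matches_order(step: str, output: dict, order: dict) -> bool:
    """Hypothetical stand-in for the suggested test.

    Returns True when the check passes or does not apply to this step.
    """
    # Mirrors the `when` clause: only the draft_reply step is checked.
    if step != "draft_reply":
        return True
    # Assumption: amounts are compared as Decimal built from their string
    # form, so 24.99 is not subject to binary float rounding.
    return Decimal(str(output["amount"])) == Decimal(str(order["total"]))


# The failure from this card: $249.00 drafted against a $24.99 order total.
failing = refund_amount_matches_order(
    "draft_reply", {"amount": "249.00"}, {"total": "24.99"}
)
passing = refund_amount_matches_order(
    "draft_reply", {"amount": "24.99"}, {"total": "24.99"}
)
```

In this sketch `failing` is False and `passing` is True. The check is a single comparison with no model call, which is consistent with the low runtime cost reported for the test.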

3 matching traces · 7d · 14/14 replay failures · ~2ms runtime cost · 0 false positives in shadow
11 reviewed · 4 accepted · 7 skipped · avg 14s / test