Overview

4,829 runs · last sync just now

Regression score 0 – 100 · higher is better
73/100 ▼ 22

Dropped 22 points on May 10 · draft_reply regressed after gpt-5.1-mini swap.

[Chart: regression score by day, MAY 04 to MAY 11]
Recent runs · last 6
#r_8f3a91…b04c deploy 73 2m ago
#r_2c4e07…91ff production 81 14m ago
#r_b91d44…ae20 scheduled 95 1h ago
#r_06aa12…7d11 production 96 2h ago
#r_f4cd56…2a88 deploy 94 5h ago
#r_18b720…c5d3 scheduled 97 yesterday
142 runs in the last 7 days

Recent failures · 3 unresolved

draft_reply 2m ago

Hallucinated refund amount on order lookup.

draft_reply: expected $24.99, got $249.00

seen 14× · gpt-5.1-mini
tool: search_kb 38m ago

Tool timeout — KB search exceeded 2s budget.

search_kb: TimeoutError after 2,041ms

seen · p95 2.4s
classify_intent 3h ago

Misrouted billing escalation as “feedback”.

classify_intent: expected billing, got feedback

seen · low confidence