Prompt engineering by feel is how teams ship bad agents. This harness lets you A/B test two configurations (different prompts, models, or tool sets) on the same live traffic and see which one converts or scores better.
What's Included
- Traffic Splitter: 50/50 or weighted split between configurations
- Outcome Tracking: User reaction (good / bad), downstream conversion, follow-up question rate
- Statistical Significance: Surfaces winners with confidence intervals (not 'we ran it for an hour and Variant B looked better')
- Cost Compare: Cost per outcome, not just cost per run
- One-Click Promote: Winning config becomes the default with one tap
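To make the significance and cost comparisons concrete, here is a minimal Python sketch. It is not the harness's actual implementation: `VariantStats`, `diff_confidence_interval`, and `pick_winner` are hypothetical names, and the simple Wald interval for a difference in rates is an assumed method chosen for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Aggregated results for one configuration (hypothetical structure)."""
    runs: int          # number of agent runs served
    successes: int     # runs with a good outcome, e.g. a conversion
    total_cost: float  # total spend across all runs

    @property
    def rate(self) -> float:
        return self.successes / self.runs

    @property
    def cost_per_outcome(self) -> float:
        # Cost per successful outcome, not cost per run.
        return self.total_cost / self.successes if self.successes else math.inf


def diff_confidence_interval(a: VariantStats, b: VariantStats, z: float = 1.96):
    """Wald interval for rate(b) - rate(a); z=1.96 is roughly 95% confidence."""
    diff = b.rate - a.rate
    se = math.sqrt(
        a.rate * (1 - a.rate) / a.runs + b.rate * (1 - b.rate) / b.runs
    )
    return diff - z * se, diff + z * se


def pick_winner(a: VariantStats, b: VariantStats, z: float = 1.96):
    """Return "A" or "B" only when the interval excludes zero, else None."""
    low, high = diff_confidence_interval(a, b, z)
    if low > 0:
        return "B"
    if high < 0:
        return "A"
    return None  # interval includes zero: not enough evidence yet
```

With 1,000 runs per variant and 100 versus 150 good outcomes, the interval excludes zero and `pick_winner` returns "B". With 20 runs each and 2 versus 3 good outcomes, it returns `None`, which is the "we ran it for an hour" case.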
How To Use
- Clone this app
- Define two agent configurations in the Variants project
- Connect to your live agent traffic via automation
- The harness routes traffic, tracks outcomes, and surfaces the winner
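The routing and tracking steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the harness's own code: `Variant`, `assign_variant`, `OutcomeLog`, and the model and prompt values are all hypothetical. The sketch hashes a user ID so that the same user always lands on the same configuration, which keeps the split stable across sessions.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One agent configuration (hypothetical fields)."""
    name: str
    model: str
    prompt: str
    weight: float  # relative share of traffic; 1.0 and 1.0 means 50/50


def assign_variant(user_id: str, variants: list[Variant]) -> Variant:
    """Deterministically map a user to a variant in proportion to the weights."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a roughly uniform value in [0, 1].
    point = int.from_bytes(digest[:8], "big") / 2**64
    total = sum(v.weight for v in variants)
    cumulative = 0.0
    for v in variants:
        cumulative += v.weight / total
        if point < cumulative:
            return v
    return variants[-1]  # guard against floating-point rounding at the top end


class OutcomeLog:
    """Counts runs and good outcomes per variant name."""

    def __init__(self) -> None:
        self.counts = defaultdict(lambda: {"runs": 0, "good": 0})

    def record(self, variant: Variant, good: bool) -> None:
        entry = self.counts[variant.name]
        entry["runs"] += 1
        entry["good"] += int(good)


# Hypothetical configurations: same model, different prompts.
variant_a = Variant("A", model="model-x", prompt="You are concise.", weight=1.0)
variant_b = Variant("B", model="model-x", prompt="You are thorough.", weight=1.0)

log = OutcomeLog()
chosen = assign_variant("user-123", [variant_a, variant_b])
log.record(chosen, good=True)
```

Hashing rather than drawing a random number per request is a design choice: it avoids storing assignments, and a returning user never sees both variants, which would muddy the outcome data.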
