Even after a skill passes the held-back test, it doesn't ship. It first runs in shadow mode for 4 to 8 weeks on live customer accounts. The AI generates its answer, but only the framework sees it. That output is then compared to what actually happened.
Plainly: the skill is doing its job behind the scenes for weeks. Nobody reads its output. The framework watches to see whether the AI's calls held up in the real world, with real spend, real customer choices, real outcomes.
If, after 4 to 8 weeks, the shadow output is consistent with what actually happened on those accounts, the skill graduates and starts answering for you. If it isn't, the skill goes back to step 01.
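
To make the graduation decision concrete, here is a minimal sketch of how the shadow-mode check described above could be scored. It is an illustration under stated assumptions, not the framework's actual code: the names `ShadowRecord`, `agreement_rate`, and `shadow_verdict` are hypothetical, "consistent" is simplified to an exact match between the skill's answer and the real outcome, and the 0.9 threshold is a placeholder, since the text does not say how consistency is measured.

```python
from dataclasses import dataclass


@dataclass
class ShadowRecord:
    """One shadow-mode answer paired with what later happened on the account."""
    account_id: str
    predicted: str  # what the skill answered (seen only by the framework)
    actual: str     # what actually happened, recorded afterwards


def agreement_rate(records):
    """Fraction of shadow answers that matched the real outcome."""
    if not records:
        return 0.0
    matches = sum(1 for r in records if r.predicted == r.actual)
    return matches / len(records)


def shadow_verdict(records, weeks_elapsed, min_weeks=4, threshold=0.9):
    """Return 'keep_shadowing', 'graduate', or 'back_to_step_01'.

    min_weeks follows the 4-week lower bound in the text; threshold is an
    assumed cutoff chosen only for illustration.
    """
    # Too early, or nothing to compare yet: keep running silently.
    if weeks_elapsed < min_weeks or not records:
        return "keep_shadowing"
    if agreement_rate(records) >= threshold:
        return "graduate"
    return "back_to_step_01"


# Made-up example data: three of four shadow answers matched reality.
records = [
    ShadowRecord("acct-1", "renew", "renew"),
    ShadowRecord("acct-2", "churn", "renew"),
    ShadowRecord("acct-3", "renew", "renew"),
    ShadowRecord("acct-4", "upgrade", "upgrade"),
]

print(shadow_verdict(records, weeks_elapsed=2))  # still inside the shadow window
print(shadow_verdict(records, weeks_elapsed=6))  # 0.75 agreement vs. 0.9 cutoff
```

With this sample data the skill agrees with reality on 75% of accounts, so under the assumed 0.9 cutoff it would be sent back to step 01 rather than graduate. A real check would likely need a softer notion of agreement than exact match, for example a tolerance on spend figures.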