ChatWithAds — The chat that knows your ads

Step 01

Define what a right answer looks like.

Before the AI answers a given kind of question, the right answers for a hundred real cases get written down. Not what the AI thinks. What an experienced marketer would say.

Example

For the question "Which ad is bleeding money this week?" a structured set of real ad accounts gets manually labeled: what's actually losing money, what isn't, by how much. That's the ground truth.

If the right answer for 100 cases can't be agreed on ahead of time, the question isn't ready for the AI yet.
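One way to check whether the right answers are actually agreed on, sketched under assumptions: two reviewers label the same cases, and any case where their labels differ is still an open question. The reviewer dictionaries, the ad ids, and the `disputed_cases` helper are all hypothetical, made up for illustration.

```python
def disputed_cases(labels_a, labels_b):
    """Return the ids of cases where two reviewers' labels differ.

    A case missing from the second reviewer's labels counts as disputed.
    """
    return sorted(cid for cid in labels_a if labels_a[cid] != labels_b.get(cid))


# Hypothetical labels from two experienced marketers for the same three ads.
reviewer_a = {"ad_001": "losing", "ad_002": "fine", "ad_003": "losing"}
reviewer_b = {"ad_001": "losing", "ad_002": "losing", "ad_003": "losing"}

open_questions = disputed_cases(reviewer_a, reviewer_b)

# The ground truth is only ready when nothing is left in dispute.
ground_truth_ready = not open_questions
```

Here the reviewers disagree on one ad, so the set isn't ready to be used as ground truth until that case is resolved.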

Step 02

Test on data the AI has never seen.

The 100 real cases get split into two piles: 70 to teach the AI's logic, 30 to test it. The AI never sees the 30 during training. When it answers those 30, its calls get compared to the known-correct answers.

What ML people call this: a hold-out split. Why it matters: if you only test on data the AI's already seen, of course it does well. The real test is what it does on questions it hasn't seen before.

If the AI gets fewer than 9 out of 10 of the held-back cases right, the skill doesn't ship. Period.
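A minimal sketch of the split and the ship rule, assuming each case is a small record with a known-correct label. The function names, the fixed shuffle seed, and the placeholder cases are illustrative assumptions, not the actual framework.

```python
import random


def split_cases(cases, train_size=70, seed=0):
    """Shuffle a copy of the labeled cases, then split into train and held-back piles."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


def passes_holdout(predictions, truths, threshold=0.9):
    """True when at least `threshold` of the held-back calls match the known answers."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths) >= threshold


# 100 placeholder cases; a real set would hold manually labeled ad accounts.
cases = [{"id": i, "losing_money": i % 4 == 0} for i in range(100)]
train, held_back = split_cases(cases)

# The held-back pile never overlaps with the training pile.
train_ids = {c["id"] for c in train}
held_back_ids = {c["id"] for c in held_back}
no_leak = train_ids.isdisjoint(held_back_ids)
```

With 30 held-back cases, the 9-out-of-10 bar means at least 27 correct calls. A skill that gets 26 right doesn't pass.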

The 70/30 split. 70 to train, 30 held back.

Step 03

Pick the metric that actually matters.

"Accuracy" can be misleading. If only 1 out of 100 ads is actually bleeding money, an AI that always says no is technically "99% accurate" while catching nothing. Useless.

Different questions need different ways to measure. So for each skill, the test that fits gets picked:

Examples

For "is this ad losing money" → what matters most is how many of the real money-losers got caught. Missing one is the expensive mistake.

For "should I act on this right now?" → what matters most is how often the answer is right when the call is yes. False alarms waste your time.

For "which campaign should you scale" → what matters is whether the top pick actually outperforms the rest of the list.

Picking the right test is half the work. Picking the wrong one is how marketing math lies.
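The three tests above have standard names in ML: recall (how many real money-losers got caught), precision (how often a yes is right), and a top-pick check for rankings. The sketch below computes them in plain Python alongside accuracy, using the always-says-no case as the example. The function names and the sample campaign figures are assumptions for illustration only.

```python
def accuracy(predictions, truths):
    """Share of all calls that match the known answer."""
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)


def recall(predictions, truths):
    """Share of real positives that were flagged. None if there are no real positives."""
    true_pos = sum(p and t for p, t in zip(predictions, truths))
    actual_pos = sum(truths)
    return true_pos / actual_pos if actual_pos else None


def precision(predictions, truths):
    """Share of flagged cases that were real positives. None if nothing was flagged."""
    true_pos = sum(p and t for p, t in zip(predictions, truths))
    flagged = sum(predictions)
    return true_pos / flagged if flagged else None


def top_pick_beats_rest(ranked_ids, outcomes):
    """True when the first-ranked item's outcome exceeds every other item's outcome."""
    top, rest = ranked_ids[0], ranked_ids[1:]
    return all(outcomes[top] > outcomes[r] for r in rest)


# 100 ads, exactly one real money-loser, and a "skill" that always says no.
truths = [True] + [False] * 99
always_no = [False] * 100

print(accuracy(always_no, truths))   # 0.99
print(recall(always_no, truths))     # 0.0
print(precision(always_no, truths))  # None

# Hypothetical observed returns per campaign, with c2 ranked first.
outcomes = {"c1": 1.8, "c2": 3.1, "c3": 0.9}
print(top_pick_beats_rest(["c2", "c1", "c3"], outcomes))  # True
```

The always-no skill scores 0.99 on accuracy and 0.0 on recall, which is why the metric has to match the question.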

Step 04

Run silently for weeks before going live.

Even after a skill passes the held-back test, it doesn't ship yet. It runs in shadow mode for 4-8 weeks on live customer accounts. The AI generates the answer, but nobody except the framework sees it. The output gets compared to what actually happened.

Plainly: the skill is doing its job behind the scenes for weeks. Nobody reads its output. The framework watches to see whether the AI's calls held up in the real world, with real spend, real customer choices, real outcomes.

If, after 4-8 weeks, the shadow output is consistent with what actually happened on those accounts, the skill graduates and starts answering for you. If it isn't, it goes back to Step 01.
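A rough sketch of the shadow-mode comparison, assuming each silent call is logged and later paired with the observed outcome. The `ShadowRecord` fields and the `min_agreement` parameter are hypothetical. The text above doesn't state a numeric bar for "consistent", so the caller has to supply one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShadowRecord:
    account_id: str
    predicted: bool                  # the skill's silent call, never shown to the customer
    observed: Optional[bool] = None  # filled in once the real outcome is known


def agreement_rate(records):
    """Share of resolved records where the silent call matched the real outcome."""
    resolved = [r for r in records if r.observed is not None]
    if not resolved:
        return None
    return sum(r.predicted == r.observed for r in resolved) / len(resolved)


def graduates(records, min_agreement):
    """True when enough silent calls held up; min_agreement is an assumed threshold."""
    rate = agreement_rate(records)
    return rate is not None and rate >= min_agreement


# Hypothetical log: three resolved calls (two held up) and one still pending.
log = [
    ShadowRecord("acct_1", predicted=True, observed=True),
    ShadowRecord("acct_2", predicted=False, observed=False),
    ShadowRecord("acct_3", predicted=True, observed=False),
    ShadowRecord("acct_4", predicted=True),
]
rate = agreement_rate(log)
```

Records with no observed outcome yet are left out of the rate rather than counted as misses, so a skill isn't penalized for calls that haven't resolved.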

No numbers until we've measured them.

Most AI products list performance numbers like "94% accurate" or "F1 score 0.92." Some are real. Most aren't. They're measured against a self-selected demo set, or quoted from one cherry-picked test.

We don't publish ours yet because we don't have enough customers using each skill in production for the numbers to mean anything. The framework above is what we commit to. The specific percentages come when there's a real customer corpus to measure them against. Not before.

What we'll publish when we have it:

Once we have a real customer corpus to measure against, here's what you'll see: how often each skill got it right on a held-back test, how often the AI's draft matched the experienced marketer's call during shadow mode, and the date each skill graduated from shadow to live. All of it updated quarterly, with every figure auditable from chat histories under NDA.

No vanity stats. No demo-set numbers. No "industry-leading" claims.

For a worked example of what one finished answer looks like, see the composite case anatomy →

What happens when it gets it wrong.

Every answer has a one-tap flag. Wrong answers go straight to the founder's inbox. Each correction gets investigated. The underlying skill is either fixed or rolled back. We publish rollback events in our quarterly status update.

What "I don't know" looks like

Will my Q4 ROAS be better than Q3?

Not enough Q4 data to answer that yet. Your holiday window doesn't close for 9 more days. Ask me again on the 24th and I'll have a real read.

We'd rather it say "I don't know" than make something up.

Pre-built skills, backtested.