Methodology

Pre-built skills, backtested.

Every skill is validated on $100M+ of real ad-spend data before it answers for you. Here's the 4-step framework, plus what hasn't been measured yet.

An AI that reads your ads will sometimes be wrong. ChatWithAds can't pretend otherwise. What the framework does is make sure every kind of question the AI answers has been tested against a real, held-back chunk of data, before any answer gets to you.

This page is the framework. Four steps, every skill, every time. The technical bits get translated into plain words. The page also says what hasn't been measured yet, because publishing fake precision numbers would violate the only thing we promise: no comfortable lies.

Step 01

Define what a right answer looks like.

Before the AI answers a kind of question, the right answer for a hundred real cases gets written down first. Not what the AI thinks. What an experienced marketer would say.

Example
For the question "Which ad is bleeding money this week?" a structured set of real ad accounts gets manually labeled: what's actually losing money, what isn't, by how much. That's the ground truth.

If the right answer for 100 cases can't be agreed on ahead of time, the question isn't ready for the AI yet.
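One way to picture the "agreed on ahead of time" rule is the sketch below. It assumes several reviewers each label the same ads, and it treats agreement as identical labels on every case. The function name, the reviewer dictionary, and that strict agreement rule are illustrative assumptions, not ChatWithAds' actual tooling.

```python
def build_ground_truth(reviews, required_cases=100):
    """Return {ad_id: losing_money} if every reviewer agrees, else None.

    `reviews` maps a reviewer name to that reviewer's labels,
    e.g. {"reviewer_a": {"ad_001": True, "ad_002": False, ...}}.
    The schema and the "identical labels" rule are assumptions.
    """
    if not reviews:
        return None
    label_sets = list(reviews.values())
    first = label_sets[0]
    # Not enough labeled cases: the question is not ready yet.
    if len(first) < required_cases:
        return None
    # Any disagreement between reviewers: the question is not ready yet.
    if any(labels != first for labels in label_sets[1:]):
        return None
    return dict(first)
```

If this returns `None`, the question stays out of the AI's hands until the labels can be settled.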

Step 02

Test on data the AI has never seen.

The 100 real cases get split into two piles: 70 to teach the AI's logic, 30 to test it. The AI never sees the 30 during training. When it answers those 30, its calls get compared to the known-correct answers.

What ML people call this: a hold-out split. Why it matters: if you only test on data the AI's already seen, of course it does well. The real test is what it does on questions it hasn't seen before.

If the AI gets fewer than 9 out of 10 of the held-back cases right, the skill doesn't ship. Period.
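A minimal sketch of the split and the ship gate, assuming a simple random shuffle and exact-match scoring. The function names and the fixed seed are hypothetical; only the 70/30 split and the 9-out-of-10 bar come from the text above.

```python
import random

def split_cases(cases, train_size=70, seed=0):
    """Shuffle the labeled cases and split them into train and held-back piles."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    return shuffled[:train_size], shuffled[train_size:]

def passes_holdout(predictions, truths):
    """True if at least 9 out of 10 held-back cases were answered correctly."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    # Integer comparison avoids floating-point edge cases at exactly 90%.
    return correct * 10 >= 9 * len(truths)
```

With 30 held-back cases, 27 correct answers clear the bar and 26 do not.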

Figure: The 70/30 split. 70 cases to train, 30 held back. The skill answers the held-back 30.

Step 03

Pick the metric that actually matters.

"Accuracy" can be misleading. If a skill flags 1 out of 100 ads as bleeding money, and only 1 actually is, technically the AI is "99% accurate" by always saying no. Useless.

Different questions need different ways to measure. So for each skill, the test that fits gets picked:

Examples
For "is this ad losing money" → what matters most is how many of the real money-losers got caught. Missing one is the expensive mistake.
For "should I act on this right now?" → what matters most is how often the answer is right when the call is yes. False alarms waste your time.
For "which campaign should you scale" → what matters is whether the top pick actually outperforms the rest of the list.
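In ML terms, the first two examples correspond to recall and precision, and the third is a check on the top-ranked pick. The sketch below is one plain-Python reading of each. The function names and the use of realized ROAS as the outcome measure are assumptions for illustration.

```python
def recall(predicted, actual):
    """Of the ads that really lost money, what share did the skill catch?"""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    actual_pos = sum(actual)
    return true_pos / actual_pos if actual_pos else None

def precision(predicted, actual):
    """Of the times the skill said yes, how often was it right?"""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    predicted_pos = sum(predicted)
    return true_pos / predicted_pos if predicted_pos else None

def top_pick_outperforms(ranked_campaigns, realized_roas):
    """True if the top-ranked campaign beat every other campaign on the list."""
    top, rest = ranked_campaigns[0], ranked_campaigns[1:]
    return all(realized_roas[top] > realized_roas[c] for c in rest)
```

Both ratio functions return `None` rather than dividing by zero when there is nothing to measure.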

Picking the right test is half the work. Picking the wrong one is how marketing math lies.

04
Step 04

Run silently for weeks before going live.

Even after a skill passes the held-back test, the skill doesn't ship. It runs in shadow mode for 4-8 weeks on live customer accounts. The AI generates the answer, but nobody but the framework sees it. The output gets compared to what actually happened.

Plainly: the skill is doing its job behind the scenes for weeks. Nobody reads its output. The framework watches to see whether the AI's calls held up in the real world, with real spend, real customer choices, real outcomes.

If after 4-8 weeks the shadow output is consistent with what actually happened on those accounts, the skill graduates and starts answering for you. If it isn't, back to step 01.
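A rough sketch of the graduation check, under stated assumptions: each shadow call is stored next to what actually happened, and "consistent" is read as an agreement rate above a threshold. The page does not define a shadow-mode threshold, so the 0.9 default, the record fields, and the function name are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ShadowCall:
    week: int        # week number since shadow mode started
    predicted: bool  # what the skill said, shown to nobody
    actual: bool     # what actually happened on the account

def shadow_verdict(calls, min_weeks=4, min_agreement=0.9):
    """Decide what happens to a skill after its silent run."""
    weeks_run = len({c.week for c in calls})
    if weeks_run < min_weeks:
        return "keep running"  # not enough silent runtime yet
    agreement = sum(c.predicted == c.actual for c in calls) / len(calls)
    # min_agreement is an assumed cutoff, not a published figure.
    return "graduate" if agreement >= min_agreement else "back to step 01"
```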

No numbers until we've measured them.

Most AI products list precision numbers like "94% accurate" or "F1 score 0.92." Some are real. Most aren't. They're measured against a self-selected demo set, or quoted from one cherry-picked test.

We don't publish ours yet because we don't have enough customers using each skill in production for the numbers to mean anything. The framework above is what we commit to. The specific percentages come when there's a real customer corpus to measure them against. Not before.

What we'll publish when we have it:

How often each skill got it right on a held-back test.
How often the AI's draft matched the experienced marketer's call during shadow mode.
The date each skill graduated from shadow to live.

Updated quarterly, every figure auditable from chat histories under NDA.

No vanity stats. No demo-set numbers. No "industry-leading" claims.

For a worked example of what one finished answer looks like, see the composite case anatomy →

When we're wrong

What happens when the AI gets it wrong.

Every answer has a one-tap flag. Wrong answers go straight to the founder's inbox. Each correction gets investigated. The underlying skill is either fixed or rolled back. We publish rollback events in our quarterly status update.

What "I don't know" looks like
Will my Q4 ROAS be better than Q3?
Not enough Q4 data to answer that yet. Your holiday window doesn't close for 9 more days. Ask me again on the 24th and I'll have a real read.

We'd rather it say "I don't know" than make something up.
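One simple way to get that behavior is to refuse to answer while the data window is still open. The sketch below assumes the skill knows the date the window closes. The function name, its parameters, and the message wording are illustrative, not the product's actual logic.

```python
from datetime import date

def answer_or_abstain(today, window_closes, compute_answer):
    """Answer only once the data window has closed; otherwise say so plainly."""
    if today < window_closes:
        days_left = (window_closes - today).days
        return (
            f"Not enough data to answer that yet. The window doesn't close "
            f"for {days_left} more days. Ask me again on {window_closes.isoformat()}."
        )
    return compute_answer()
```

Called on the 15th with a window that closes on the 24th, this abstains and names the date to come back.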

The shortcuts we don't take.

We don't
Push changes to your ad accounts.

Read-only is the wedge. The AI reads your data and answers questions. It never pauses, scales, edits, or spends. You stay in control of every action.

We don't
Train and test on the same data.

If you do, your "accuracy" is theater. The hold-back split is non-negotiable.

We don't
Pick the metric after seeing the results.

The metric for each skill is decided in step 03, before any results are scored. Picking a flattering metric after the fact is just shopping for a story.

We don't
Ship straight from passing tests into live answers.

Test-set performance and real-world performance diverge. Shadow mode catches that divergence. No skill graduates without 4-8 weeks of silent runtime.

We don't
Quote precision numbers we haven't measured against a real customer corpus.

When we have the data, we publish. Until then, the framework is the promise.

We don't
Use your data to train models that other customers see.

Each customer's account is read for that customer's answers only. The skills are pre-built; they don't fine-tune on your ad spend.

The framework is the promise.

Run the 7-day trial on your own account. That's the only honest evidence either of us can use right now: the answers in your own data.

Ad intelligence through conversation. The reasoning engine for growing brands.

© 2026 ChatWithAds. All rights reserved.