AA test: how to run one and read the results

An AA test (also called an A/A test) runs two identical versions of the same page against each other. Same content. Same design. Same everything. The goal isn’t to find a winner. It’s to make sure your testing tool actually works before you trust it with real decisions.

Think of it like stepping on a scale twice in a row. If it shows two different numbers, you wouldn’t trust it to track your weight. Same idea here. If your testing tool says Version A beat Version B when both versions are the exact same page, something’s broken.

Ron Kohavi, the researcher behind Microsoft’s experimentation platform, puts it simply: A/A tests “fail so many times in practice” that the failures themselves are the value. They expose bugs you’d never catch any other way.

What is an AA test?

An AA test shows the same page to two groups of visitors. If your tool reports a difference, your setup has a problem.

An A/A test splits your traffic 50/50 between two groups. Both groups see the same page. No changes. No tweaks. Nothing different at all.

In a normal A/B test, you’d change a headline or a button color and compare results. In an A/A test, you skip the change. You’re not testing your page. You’re testing your testing tool.

The technical name is a “null test” (because you expect no difference between groups). Some people write it as “A/A test” and others write “AA test.” Same thing.

An A/B test asks “which version is better?” An A/A test asks “can I trust the answer?”

If you’re new to what split testing means, the concept is simple. You show different visitors different things and see which performs better. An A/A test is the sanity check you run first.

Why run an A/A test before A/B testing

A/A testing catches broken setups, bad tracking, and uneven traffic splits before they corrupt your real test results.

Your A/B test results are only as good as the system producing them. If the foundation is shaky, every test you run on top of it is suspect.

It confirms your traffic split is actually even. You set 50/50. But does one group get 60% of visitors? That’s called sample ratio mismatch (your visitors aren’t being divided evenly). An A/A test catches this immediately.
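
If you want to check the split yourself, here is a minimal sketch of a sample ratio mismatch check using a chi-square goodness-of-fit test. The function name, the visitor counts, and the 0.001 alert threshold are assumptions for illustration, not values from any particular testing tool.

```python
import math

def srm_p_value(visitors_a: int, visitors_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-group split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

SRM_ALPHA = 0.001  # assumed alert threshold (a strict cutoff keeps false alarms rare)

# Hypothetical visitor counts per group.
for a, b in [(4930, 5070), (4500, 5500), (49300, 50700)]:
    p = srm_p_value(a, b)
    verdict = "possible SRM" if p < SRM_ALPHA else "looks fine"
    print(f"{a} vs {b}: p = {p:.3g} -> {verdict}")
```

Note that the same percentage split can be harmless on 10,000 visitors and suspicious on 100,000, which is why the check works on raw counts instead of percentages.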

It also validates that your tracking works end to end. Maybe your analytics script loads before your testing script on some browsers. Maybe your consent banner blocks tracking for one group. These things happen, and an A/A test reveals them.

Before you start improving your A/B testing conversion rate, you need to know what “normal” looks like. An A/A test confirms your control group validity by proving the baseline performs consistently when nothing changes.

Booking.com runs over 25,000 tests a year. They use A/A testing as fundamental validation for their entire program. If one of the biggest testing operations in the world considers it essential, there’s probably something to it.

In our experience, the most common issue an A/A test catches is a mismatch between what your testing tool reports and what your analytics platform shows. The numbers should be close. If they’re wildly different, you’ve got a data plumbing problem.

It’s one of the most overlooked A/B testing best practices: validate before you test.

What an AA test can (and can’t) catch

A/A tests catch infrastructure problems. They don’t guarantee your future A/B tests will produce valid results.

A passed A/A test is reassuring. But it’s not a free pass.

What it catches:

Broken randomization (visitors not being assigned to groups properly)
Tracking gaps (events firing for one group but not the other)
Uneven traffic splits
Implementation bugs in your testing script
Mismatches between your testing tool and analytics

What it doesn’t catch:

Problems that only show up when you actually change something (like a JavaScript error on a new headline variant)
Effects from running multiple tests at the same time
Novelty effects (where visitors behave differently just because something is new)
Subtle math problems that only appear with specific types of data

A passed A/A test means your foundation is solid. It doesn’t mean everything built on top will be perfect. It’s necessary, not sufficient. If you want to understand the math behind why some results fool you, see Type 1 and Type 2 errors in testing.

Our take: An A/A test is like checking the batteries before you use the remote. It won’t guarantee the movie is good, but at least you’ll know the remote works.

How to run an AA test (step by step)

Pick a high-traffic page, create two identical versions, split traffic 50/50, and let it run for at least a week.

This is simpler than a regular A/B test because you’re not designing anything new. You’re just running the test machinery with no actual change.

1. Pick your page. Choose a page with decent traffic. Your homepage or main landing page usually works. More visitors means faster, more reliable results.
2. Create two identical versions. In your testing tool, set up a test where Version A and Version B are exactly the same. Don’t change a single thing. Not a pixel, not a comma. Kirro makes this straightforward, but any A/B testing tool will do. Just duplicate the page and leave it untouched.
3. Set a 50/50 split. Half your visitors see Version A. Half see Version B. Major platforms such as Statsig, Amplitude, and Optimizely recommend a 50/50 split for A/A testing.
4. Choose your metric. Decide what you’re measuring: conversion rate, click rate, or revenue per visitor. Pick the same metric you’ll use for future A/B tests.
5. Let it run for at least one full week. Weekdays and weekends have different traffic patterns. You need both. Amplitude’s documentation recommends at least one full business cycle, and we agree. Two weeks is even better if you can wait.
6. Compare the results. Check your testing tool’s report and your analytics platform side by side. The numbers for both groups should be close.
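
For the curious, here is a rough sketch of how a 50/50 split can work under the hood. It hashes a visitor ID so the same visitor always lands in the same group. This is an illustrative assumption, not how Kirro or any named platform does it, and `assign_variant`, the experiment name, and the visitor IDs are all hypothetical.

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str = "aa-test-homepage") -> str:
    """Deterministically map a visitor to 'A' or 'B' with equal probability."""
    # Salting with the experiment name keeps assignments independent across tests.
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "A" if bucket < 50 else "B"

# Hypothetical visitor IDs, just to see how even the split comes out.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"visitor-{i}")] += 1
print(counts)
```

Because the assignment depends only on the ID, a returning visitor keeps seeing the same version, which is one of the things an A/A test indirectly verifies.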

If you want to design a marketing experiment properly, this step comes before anything else. And you can set up a free A/A test in Kirro in about three minutes.

One critical rule: don’t check your results early. Researcher Evan Miller showed that repeatedly peeking at results, and stopping at the first “significant” reading, can inflate your false positive rate from 5% to 26.1%. Adobe’s own documentation confirms that with continuous peeking, an A/A test is “guaranteed to show statistical significance at some point.” Set a date. Come back then.
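
You can see the peeking problem for yourself with a small simulation. The sketch below runs many simulated A/A tests where both groups share the same true conversion rate, then compares “a winner showed up at any peek” against “a winner showed up on the planned end date.” The 10% conversion rate, 10 peeks, and 200 visitors per peek are assumed values, and the exact percentages will vary with them.

```python
import math
import random

def is_significant(conv_a: int, conv_b: int, n: int, z_crit: float = 1.96) -> bool:
    """Two-sided two-proportion z-test for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    return abs(conv_a / n - conv_b / n) / se > z_crit

def simulate(runs=1000, peeks=10, visitors_per_peek=200, rate=0.10, seed=42):
    """Return (false positive rate with peeking, false positive rate at fixed end)."""
    rng = random.Random(seed)
    any_peek_hits = 0   # tests where some peek looked significant
    final_hits = 0      # tests that looked significant at the planned end
    for _run in range(runs):
        conv_a = conv_b = n = 0
        ever_significant = False
        significant_now = False
        for _peek in range(peeks):
            # Both groups draw from the same true rate: a genuine A/A test.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_peek))
            n += visitors_per_peek
            significant_now = is_significant(conv_a, conv_b, n)
            ever_significant = ever_significant or significant_now
        any_peek_hits += ever_significant
        final_hits += significant_now
    return any_peek_hits / runs, final_hits / runs

peeking_rate, fixed_rate = simulate()
print(f"false positives with peeking: {peeking_rate:.1%}")
print(f"false positives at fixed end: {fixed_rate:.1%}")
```

The fixed-end rate should land near 5%, while the peeking rate is typically several times higher. More peeks push it higher still.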

How to read your AA test results

The expected result is “no winner.” If your tool declares a winner between two identical pages, something’s wrong.

A boring result is a good result here. You want your testing tool to shrug and say “no meaningful difference.” That means everything is working.

No significant difference? Your setup is working. The traffic split is even, tracking is consistent, and your tool’s math checks out. Go run real A/B tests.

Your tool reports a “winner”? Two possibilities. First, random noise. At the standard 95% confidence level, roughly 1 in 20 A/A tests will show a false “winner” purely by chance. That’s how probability works, not a bug. The technical term is a false positive, and it’s expected about 5% of the time. (More on the null hypothesis in A/B testing if you’re curious.)

But if the difference is large, consistent, or shows up alongside a lopsided traffic split, you’ve probably got a real problem. Time to debug.

Things to check:

Is the traffic split close to 50/50? Small deviations (like 49.3/50.7) are usually normal at small-site traffic levels. Large ones (like 45/55) are a red flag.
Are the conversion rates for both groups similar? With a few thousand visitors per group, a 2.1% vs 2.3% difference is noise. A 2.1% vs 4.7% gap is a problem.
Do your testing tool and analytics platform agree on visitor counts?
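
Whether a gap counts as noise depends on how many visitors are behind it. Here is a minimal sketch of a two-proportion z-test you could run on the raw counts from your report. The visitor and conversion counts below are made up to match the percentages above, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions) in both groups
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 5,000 visitors per group.
small_gap = two_proportion_p_value(105, 5_000, 115, 5_000)    # 2.1% vs 2.3%
large_gap = two_proportion_p_value(105, 5_000, 235, 5_000)    # 2.1% vs 4.7%
# Same 2.1% vs 2.3% gap, but with 100,000 visitors per group.
small_gap_big_site = two_proportion_p_value(2_100, 100_000, 2_300, 100_000)

print(f"2.1% vs 2.3% (5k each):   p = {small_gap:.3f}")
print(f"2.1% vs 4.7% (5k each):   p = {large_gap:.2g}")
print(f"2.1% vs 2.3% (100k each): p = {small_gap_big_site:.4f}")
```

The third case is the one to remember: a gap that is pure noise on a small site can be a genuine warning sign on a high-traffic one.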

The key A/B testing metrics you track in regular tests are the same ones you check here. The difference is that you expect them to be equal.

One more thing to watch. With skewed data (like revenue, where a few big spenders throw off the average), false positive rates can reach 30% instead of the expected 5%. ByteDance’s research team documented this. If you’re testing revenue metrics, you might need a longer test or a different statistical approach. Bayesian A/B testing can handle skewed data more gracefully, provided the model fits the metric.

Our take: If your A/A test shows a clear winner and you haven’t peeked early, don’t panic. Run it again. If it keeps showing a winner, then panic. (Just a little.)

What to do when your AA test fails

Most competitors stop at “check your setup.” Here’s the actual debugging playbook, sourced from Microsoft’s experimentation team and practitioners who’ve found real bugs.

This is where most guides on A/A testing leave you hanging. They say “if results differ, something is wrong with your setup.” Thanks. Very helpful.

Microsoft’s Experimentation Platform team runs hundreds of simulated A/A tests per metric. They found that 10-15% of metrics fail their A/A checks (some products hit 30%). They identified four patterns behind these failures.

The most common: one big spender skews everything. A single visitor makes a massive purchase and throws off your average. The fix is capping extreme values (data teams call this winsorization). If your testing tool doesn’t do this automatically, flag it with whoever manages your analytics.

Sometimes it’s several outliers, not just one. Same idea, bigger scale. Cap values at a reasonable ceiling before comparing groups.
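
Capping is simple enough to sketch in a few lines. The example below caps values above a chosen percentile, using a nearest-rank percentile. The 99th percentile cutoff and the revenue figures are assumptions for illustration, so pick a ceiling that makes sense for your own data.

```python
import math

def winsorize_upper(values, percentile=99.0):
    """Cap every value above the given percentile at that percentile's value."""
    if not values:
        return []
    ordered = sorted(values)
    # Nearest-rank percentile: the smallest value with at least
    # `percentile` percent of the data at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]

# Hypothetical revenue per visitor: mostly zeros, a few orders, one huge purchase.
revenue = [0.0] * 95 + [20.0, 25.0, 30.0, 35.0, 5000.0]
capped = winsorize_upper(revenue, percentile=99.0)

print(f"mean before capping: {sum(revenue) / len(revenue):.2f}")
print(f"mean after capping:  {sum(capped) / len(capped):.2f}")
```

Apply the same cap to both groups before comparing them. Capping one group and not the other would create exactly the kind of artificial difference you are trying to rule out.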

Then there are metrics that barely fire. If only a tiny fraction of visitors trigger the event you’re measuring (say, 0.1% purchase rate), the data is too sparse for a reliable comparison. You either need way more traffic or a different metric.

And finally, stuck tracking. The metric returns the same value for everyone. This usually means your event tracking code isn’t implemented correctly. Check that your conversion events are actually firing.

Beyond these data patterns, Ian Whitestone’s team documented specific bugs caught through A/A testing. Their statistical code failed to adjust for testing multiple metrics at once, which doubled their false positive rate. They only found it because they ran an A/A test.
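
To see why checking several metrics at once matters, here is a small sketch of the arithmetic. It assumes the metrics are independent, which real metrics rarely are, so treat the numbers as a rough guide. The Bonferroni correction shown is one common fix and not necessarily the one that team used.

```python
def family_wise_error_rate(alpha: float, metrics: int) -> float:
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** metrics

def bonferroni_alpha(alpha: float, metrics: int) -> float:
    """Stricter per-metric threshold that keeps the overall rate near alpha."""
    return alpha / metrics

for k in (1, 2, 5, 10):
    overall = family_wise_error_rate(0.05, k)
    per_metric = bonferroni_alpha(0.05, k)
    print(f"{k:>2} metrics: overall false positive rate {overall:.1%}, "
          f"corrected per-metric threshold {per_metric:.4f}")
```

With just two metrics, the chance of at least one false “winner” in an A/A test is already close to 10%.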

The debugging checklist:

Check the traffic split first. Is it actually 50/50? If not, you likely have a randomization issue (cookie problems, caching, or your CDN serving cached versions unevenly).
Verify tracking consistency. Are both groups being tracked identically? Check tag firing order, consent management, and ad blockers.
Look for bots. Bot traffic often hits groups unevenly. Filter known bots before analyzing results.
Compare your testing tool and analytics. If the visitor counts don’t match, something’s leaking between the two systems.
Re-run the test. If the same problem appears twice, it’s real. If it doesn’t, it was probably random noise.

If you’ve gone through this list and can’t find the issue, contact your testing tool’s support team with the data. For more on common A/B testing mistakes, the debugging principles are similar.

Do you actually need an AA test? The case for and against

Experts genuinely disagree on this. We’ll give you both sides and our honest take.

This is a real debate in the testing world. Not every expert agrees that A/A testing is worth the effort.

The case for comes from Ron Kohavi, whose response to a LinkedIn debate on this exact topic got 56+ reactions: “Properly run A/A tests are highly valuable, especially when run in large numbers.” Microsoft, Booking.com, and Twitch all use A/A testing as core infrastructure validation.

The argument is simple. You wouldn’t skip testing your smoke detector because “it’s probably fine.” Your testing tool is a decision-making tool. Validate it.

The case against comes from CXL’s Craig Sullivan. His argument: A/A tests eat into real testing time, knowledge is temporary (traffic patterns shift), and they’re hard to justify to clients who want actionable results. Modern platforms also have automated checks for uneven traffic splits, which handle one of the biggest things A/A tests catch.

The middle ground is where most smart teams land. Kyle Hearnshaw points out that A/A tests are excellent educational tools for teams new to testing. They demonstrate that “significance” can appear early due to noise, before enough data has come in. Aaron Montana frames them as an engineering exercise that prevents your testing program from failing during launch.

There’s also a clever alternative: simulated A/A tests. Instead of sending real visitors through an A/A test, you take historical data and randomly split it thousands of times. Microsoft’s team and Twitch’s engineering team both do this. No wasted traffic. Same validation.
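
Here is a minimal sketch of the idea, assuming your historical data is a simple list of per-visitor outcomes (1 for converted, 0 for not). The dataset, function names, and number of re-splits are hypothetical, and this is not a reproduction of Microsoft’s or Twitch’s actual pipelines.

```python
import math
import random

def z_significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-sided two-proportion z-test at roughly the 5% level."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(conv_a / n_a - conv_b / n_b) / se > z_crit

def simulated_aa_false_positive_rate(outcomes, resplits=1000, seed=7):
    """Randomly re-split historical outcomes and count how often a 'winner' appears."""
    rng = random.Random(seed)
    data = list(outcomes)
    half = len(data) // 2
    false_positives = 0
    for _ in range(resplits):
        rng.shuffle(data)
        group_a, group_b = data[:half], data[half:]
        if z_significant(sum(group_a), len(group_a), sum(group_b), len(group_b)):
            false_positives += 1
    return false_positives / resplits

# Hypothetical historical data: 2,000 visitors, 200 of whom converted.
historical = [1] * 200 + [0] * 1800
rate = simulated_aa_false_positive_rate(historical)
print(f"share of re-splits flagged as significant: {rate:.1%}")
```

If the flagged share sits near 5%, the metric and the test behave as expected. A share far above that suggests the statistical method does not suit the metric.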

Our position: run one when you first set up your tool, and again after major changes. Optimizely recommends quarterly as a general rule, and that feels right for most teams. If you’re just getting started, you can run your first A/A test with Kirro before lunch. Don’t let it become a blocker.

If you’re exploring alternatives to A/B testing, simulated A/A tests are one of the less obvious options.

AA test vs A/B test: key differences

An A/A test validates your tool. An A/B test validates your idea.

People sometimes conflate these, so here’s the clean comparison.

	AA test	A/B test
What you’re testing	Your testing tool and setup	Your page change or idea
Versions	Two identical pages	Two different pages
Expected result	No difference	A difference, if the change has an effect
When to use	Before your first A/B test, after major changes	When you want to compare options
What it tells you	“My testing infrastructure works”	“This change helped (or didn’t)”
Traffic split	Always 50/50	Usually 50/50, sometimes other ratios

An A/A test is a prerequisite, not a replacement. Once your A/A test confirms your setup works, move on to actual A/B testing. That’s where the real improvements happen.

For more complex scenarios where you want to test multiple changes at once, see multivariate testing or our guide to comparing A/B and multivariate testing.

FAQ

Quick answers to the most common A/A testing questions.

How long should an AA test run?

At least one full week, ideally two. You need to capture both weekday and weekend traffic patterns. Statsig recommends running long enough to reach most of your weekly active visitors, and that’s good advice.

If your site has very low traffic, a single A/A test might take too long to be practical. In that case, the simulated approach (splitting historical data offline) is a better option.

What sample size do you need for an AA test?

Use the same statistical power and sample sizing calculations as your A/B tests. The short answer: more is better. For a typical click-through rate of 0.6%, you’d need roughly 191,000 visitors per group to detect a tiny change to 0.65%.

That sounds like a lot (and it is). For most small sites, running the test for 1-2 weeks and checking for obvious problems is plenty. You don’t need to hit textbook sample sizes to catch a broken setup. You’re looking for big, obvious problems, not subtle statistical effects. See also: minimum detectable effect.
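
If you want to rerun the numbers for your own rates, here is a minimal sketch of the standard normal-approximation formula for comparing two proportions. The defaults (two-sided 5% significance, 80% power) are common conventions and an assumption here, since the figure above does not state which settings it used.

```python
import math
from statistics import NormalDist

def visitors_per_group(baseline: float, target: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per group to detect baseline -> target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # 0 when power is 50%
    variance = baseline * (1 - baseline) + target * (1 - target)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (target - baseline) ** 2)

n_power_50 = visitors_per_group(0.006, 0.0065, power=0.50)
n_power_80 = visitors_per_group(0.006, 0.0065, power=0.80)
print(f"50% power: {n_power_50:,} visitors per group")
print(f"80% power: {n_power_80:,} visitors per group")
```

Under these assumptions, the formula lands close to 191,000 per group at 50% power and roughly double that at 80% power, so treat the figure above as a floor. Either way, the point stands: tiny effects on low base rates need a lot of traffic.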

Can you run an AA test and an A/B test at the same time?

Yes. Advanced teams run ongoing A/A tests in parallel with their real tests. It’s like having a control monitor running in the background. If your A/A test suddenly shows a winner while your A/B test is running, something changed in your traffic or tracking.

What does it mean if my AA test shows a winner?

Either random noise or a real problem. At the standard 95% confidence level, about 5% of A/A tests will show a false winner just by chance. That’s normal math.

But if the “winner” has a large effect, if the traffic split is uneven, or if you see the same pattern when you re-run the test, it’s a setup issue. Go back to the debugging checklist above.

Is an AA test the same as a null test?

Yes. “Null test” and “A/A test” are different names for the same thing. Both mean testing identical versions to validate your infrastructure. Kohavi uses both terms interchangeably in his book on online experiments.

Randy Wattilete

CRO expert and founder with nearly a decade running conversion experiments for companies from early-stage startups to global brands. Built programs for Nestlé, felyx, and Storytel. Founder of Kirro (A/B testing).
