Only 1 in 7 A/B tests reach statistical significance. That’s an 86% failure rate. And yet 58% of companies have no framework for deciding what or how to test.
Those two stats are connected. Bad methodology is why most tests fail. Not bad ideas.
This section covers the science behind reliable testing. Every method below has a specific use case. Some need thousands of visitors. Some work with a few hundred. Some give you fast answers. Others give you precise ones. The trick is matching the method to your situation.
Kirro uses Bayesian statistics and handles the math for you. You get answers like “Version B has an 89% chance of being better,” not p-values.
This cluster is part of the A/B Testing & Experimentation pillar.
Which method do I need?
Start here. Don’t read all 18 articles. Answer these three questions, then go to the one that matches.
Question 1: How much traffic does your page get?
- Under 1,000 visitors/month: Focus on big, obvious changes. Test one thing at a time. Read split testing meaning for the basics, then jump to landing page split testing for a step-by-step playbook.
- 1,000 to 10,000 visitors/month: Standard A/B testing works. You can detect meaningful differences with enough patience. Start with how to design a marketing experiment and use our sample size calculator to check timing.
- 10,000+ visitors/month: You have options. A/B tests, multivariate testing, even multi-armed bandits. Read on.
Question 2: What’s your goal?
- “I need to learn which version is better.” Classic A/B test. You want precision and confidence. This is what most teams need most of the time.
- “I need to optimize revenue right now.” Multi-armed bandits send more traffic to the winning version while the test runs. Less learning, more earning. Good for short campaigns and sales events.
- “I need to test several elements at once.” Multivariate testing lets you test headlines, images, and buttons simultaneously. But you’ll need serious traffic (think 50,000+ visitors) to get reliable results.
Question 3: How comfortable are you with statistics?
- “Not at all.” That’s fine. Bayesian A/B testing gives you probabilities in plain language. Kirro shows “89% chance Version B wins” instead of confusing p-values. Start there.
- “I know the basics.” Dig into the sample size formula and A/B testing conversion rate benchmarks to set realistic expectations.
- “I want the deep stats.” Go for type 1 vs type 2 errors, power analysis, null hypothesis testing, and minimum detectable effect.
Our take: A December 2025 Harvard Business Review study found that traditional significance testing demands 24 to 55 times more data than you actually need for a good business decision. Speed often matters more than certainty. Most small teams are better off running more tests with slightly less precision than running fewer “perfect” tests.
The core methods
Standard A/B testing splits traffic 50/50 between your current page and one change. It’s the workhorse. Reliable, easy to understand, works at almost any traffic level. If you’re new to testing, split testing meaning explains the concept, and landing page split testing walks through a real example. For the stats behind it, see A/B testing conversion rate.
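To make the math concrete, here's a rough sketch of the classic comparison behind a standard A/B test: a two-proportion z-test in plain Python. The visitor counts are invented for the demo, and this is an illustration of the textbook method, not Kirro's implementation.

```python
from statistics import NormalDist

def ab_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis "no real difference"
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical data: 500 of 10,000 visitors converted on A; 590 of 10,000 on B
p = ab_test_p_value(500, 10_000, 590, 10_000)
print(f"p-value: {p:.4f}")
```

A p-value below 0.05 is the conventional "significant" threshold, which is exactly the kind of output the Bayesian approach below replaces with a plain-language probability.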
Bayesian A/B testing updates results as visitors arrive instead of making you wait for a fixed sample. Kirro uses this approach because the results make sense to non-statisticians. “89% chance Version B wins” is a sentence your boss can act on. Our Bayesian A/B testing guide covers when it helps and when it’s overkill.
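Here's a minimal sketch of how a "chance Version B wins" number can be computed, assuming uniform Beta(1,1) priors and made-up visitor counts. This illustrates the general Bayesian idea, not Kirro's exact model.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) with Beta(1,1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each conversion rate is Beta(conversions+1, misses+1)
        rate_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical data: 48 of 1,000 converted on A; 63 of 1,000 on B
prob = prob_b_beats_a(48, 1_000, 63, 1_000)
print(f"P(B beats A): {prob:.0%}")
```

The output reads like the sentence your boss can act on: a single probability that B is better, updated every time new visitors arrive.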
Multivariate testing tests combinations of changes. Different headlines paired with different images paired with different buttons. Powerful, but hungry for traffic. Our multivariate testing guide includes the traffic calculator so you can check if your site qualifies before committing.
Multi-armed bandits automatically shift traffic toward whichever version is winning. Less learning, faster revenue. Good for flash sales or time-limited campaigns where waiting for full statistical confidence would cost you money. Deep dive: multi-armed bandit testing.
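To show the idea in miniature, here's a toy Thompson sampling bandit, one common bandit algorithm. The 4% and 6% true conversion rates are invented for the demo; real tools add safeguards this sketch omits.

```python
import random

def thompson_bandit(true_rates, visitors, seed=7):
    """Thompson sampling: route each visitor to the arm whose sampled
    conversion rate (drawn from its Beta posterior) is highest."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)    # conversions per arm
    trials = [0] * len(true_rates)  # visitors per arm
    for _ in range(visitors):
        # Draw one plausible rate per arm, send the visitor to the best draw
        samples = [rng.betavariate(wins[i] + 1, trials[i] - wins[i] + 1)
                   for i in range(len(true_rates))]
        arm = samples.index(max(samples))
        trials[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]
    return trials

# Version B truly converts at 6% vs 4% for A; the bandit shifts traffic to B
traffic = thompson_bandit([0.04, 0.06], visitors=5_000)
print(f"Visitors per version: {traffic}")
```

Notice the trade: the winning arm gets most of the traffic (more revenue during the test), but the losing arm gets so little that your estimate of its rate stays fuzzy (less learning).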
Sequential testing lets you stop a test early (or keep it running longer) based on ongoing results, without inflating your false positive rate. It solves the “peeking problem” that Evan Miller famously showed raises error rates from 5% to 26%. Full guide: sequential testing.
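You can watch the peeking problem happen in a quick simulation: run A/A tests where neither version is actually better, check for significance at ten interim points, and count how often noise alone "wins." This is an illustration of the effect, not Miller's exact setup, and the simulation parameters are arbitrary.

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(sims=1_000, n=1_000, checks=10, seed=1):
    """Simulate A/A tests (no real difference) and declare a 'winner'
    the first time any interim check crosses the p < 0.05 line."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold
    false_positives = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        step = n // checks
        for visitors in range(step, n + 1, step):
            # Both arms convert at the same 5% rate: any "winner" is noise
            conv_a += sum(rng.random() < 0.05 for _ in range(step))
            conv_b += sum(rng.random() < 0.05 for _ in range(step))
            pooled = (conv_a + conv_b) / (2 * visitors)
            if pooled in (0.0, 1.0):
                continue
            se = (pooled * (1 - pooled) * 2 / visitors) ** 0.5
            if abs(conv_b - conv_a) / visitors / se > z_crit:
                false_positives += 1  # stopped early on pure noise
                break
    return false_positives / sims

fp = peeking_false_positive_rate()
print(f"False positive rate with 10 peeks: {fp:.0%}")
```

A test designed for a 5% error rate fires far more often when you stop at the first significant peek. Sequential methods fix this by adjusting the stopping threshold for each look.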
CUPED (variance reduction) uses data you already have about your visitors to reduce the noise in your results. The practical result: 30 to 40% smaller sample sizes for the same precision. If your tests always seem to take too long, this is probably the fix. Guide: CUPED and variance reduction.
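Here's a small sketch of the CUPED adjustment on synthetic data. The pre-experiment metric, correlation strength, and the resulting reduction are all invented for illustration; your numbers depend on how predictive your historical data is.

```python
import random
from statistics import mean, variance

def cuped_adjust(y, x):
    """CUPED: subtract the part of the experiment metric (y) that
    pre-experiment data (x) already predicts, shrinking the variance."""
    y_bar, x_bar = mean(y), mean(x)
    # theta = Cov(y, x) / Var(x), the ordinary regression slope
    cov = sum((yi - y_bar) * (xi - x_bar)
              for yi, xi in zip(y, x)) / (len(y) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - x_bar) for yi, xi in zip(y, x)]

rng = random.Random(3)
pre = [rng.gauss(100, 20) for _ in range(2_000)]      # pre-experiment behavior
post = [0.8 * p + rng.gauss(0, 10) for p in pre]      # correlated test metric
adjusted = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)
print(f"Variance reduction: {reduction:.0%}")
```

Less variance means smaller sample sizes for the same precision, which is exactly why CUPED is the fix when tests drag on too long.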
The statistics that actually matter
Every testing method above relies on the same handful of statistical concepts. You don’t need to calculate them (that’s the tool’s job). But knowing what they mean helps you avoid the most common A/B testing mistakes.
Sample size is “how many visitors do I need?” Too few and your test can’t tell a real winner from random noise. Our sample size formula guide breaks down the math, and the free calculator does it for you.
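The standard sample size formula fits in a few lines. The baseline rate and lift below are hypothetical, and this is the textbook normal-approximation version, so treat it as a sketch rather than a replacement for the calculator.

```python
from statistics import NormalDist

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a relative lift of `mde`
    over a `baseline` conversion rate (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p1 = baseline
    p2 = baseline * (1 + mde)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return int(n) + 1

# Hypothetical page: 3% baseline conversion, detect a 20% relative lift
n = sample_size_per_variant(0.03, 0.20)
print(f"Visitors needed per variant: {n:,}")
```

Note how the answer explodes as the lift you want to detect shrinks, which is the link to minimum detectable effect below.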
Minimum detectable effect is “what’s the smallest improvement worth finding?” If you’d only act on a 20% improvement, don’t set up a test designed to detect 2% changes. It’ll take forever. MDE guide.
Type 1 and type 2 errors are the two ways a test can lie to you. A type 1 error says B wins when it doesn’t (false alarm). A type 2 error misses a real winner (missed opportunity). Understanding both helps you set up tests that balance speed with accuracy.
Statistical power is the probability your test will actually detect a real difference. Low power means you’ll miss winners. Microsoft runs 10,000+ experiments annually and still obsesses over power calculations. If it matters to them at that scale, it matters to you. Power analysis guide.
For the full theoretical foundation (what a null hypothesis is, how to think about probability): null hypothesis in A/B testing.
Implementation and architecture
Picking the right statistical method gets you halfway. The other half is the practical setup: where the test runs, how visitors get assigned to versions, and what happens when cookies disappear.
Experiment design covers the full process: forming a clear guess about what will happen, choosing the right metric, picking the page, and setting up controls. How to design a marketing experiment walks through this start to finish.
Client-side vs server-side testing is about where the test runs. Client-side (in the browser) is easier to set up but can cause page flicker. Server-side (on your server) is invisible to visitors but needs developer involvement. Most small teams start client-side and it works fine. Client-side vs server-side A/B testing helps you decide.
Cookieless testing matters more every year. Safari already blocks third-party cookies. Chrome offers users a choice. If your testing tool relies on third-party cookies, you’re losing data on a growing chunk of visitors. Cookieless A/B testing covers the alternatives.
Feature flags vs A/B testing confuses a lot of teams. Feature flags let developers turn features on and off. A/B tests measure which version performs better. They solve different problems, and some platforms bundle them together. If you’re wondering whether you need a feature flag tool or a testing tool, feature flags vs A/B testing sorts it out. (Short answer for most marketers: you need a testing tool.)
AI-powered testing is the newest addition to the toolkit. AI can help prioritize what to test, generate variations, and analyze results faster. But it’s not magic, and the fundamentals still apply. AI A/B testing separates the real applications from the hype.
Start somewhere
Most teams overthink the methodology and underthink the action. Microsoft found that a 1% improvement to Bing’s revenue equals over $10 million per year. Those gains came from running thousands of simple tests, not from picking the “perfect” statistical method.
Pick a high-traffic page. Change one thing. Run the test. Three minutes to set up in Kirro. The methodology guides above are here for when you want to go deeper. But the first test? Just run it.
More from Testing Methodology, by Randy Wattilete:

- AI A/B testing: what's real, what's hype, and what actually helps (14 Mar, 2026)
- Client-side vs server-side A/B testing: which one your business actually needs (14 Mar, 2026)
- Cookieless A/B testing: what actually changed and what to do about it (14 Mar, 2026)
- CUPED and variance reduction: run faster A/B tests with less traffic (14 Mar, 2026)
- Feature flags vs A/B testing: what's the difference and which do you need? (14 Mar, 2026)
- Multi-armed bandit testing: what it is, when it works, and when it backfires (14 Mar, 2026)
- The null hypothesis in A/B testing: what it means and why most tests prove it right (14 Mar, 2026)
- Sequential testing: when to stop your A/B test early (14 Mar, 2026)
- How to design a marketing experiment (even if most of them fail) (13 Mar, 2026)
- Minimum detectable effect in A/B testing: how to pick the right one for your business (13 Mar, 2026)