Only 1 in 7 A/B tests reach statistical significance. That’s an 86% failure rate. And yet 58% of companies have no framework for deciding what or how to test.
Those two stats are connected. Bad methodology is why most tests fail. Not bad ideas.
This section covers the science behind reliable testing. Every method below has a specific use case. Some need thousands of visitors. Some work with a few hundred. Some give you fast answers. Others give you precise ones. The trick is matching the method to your situation.
Kirro uses Bayesian statistics and handles the math for you. You get answers like “Version B has an 89% chance of being better,” not p-values.
This cluster is part of the A/B Testing & Experimentation pillar.
Which method do I need?
Start here. Don’t read all 18 articles. Answer these three questions, then go to the one that matches.
Question 1: How much traffic does your page get?
- Under 1,000 visitors/month: Focus on big, obvious changes. Test one thing at a time. Read split testing meaning for the basics, then jump to landing page split testing for a step-by-step playbook.
- 1,000 to 10,000 visitors/month: Standard A/B testing works. You can detect meaningful differences with enough patience. Start with how to design a marketing experiment and use our sample size calculator to check timing.
- 10,000+ visitors/month: You have options. A/B tests, multivariate testing, even multi-armed bandits. Read on.
Question 2: What’s your goal?
- “I need to learn which version is better.” Classic A/B test. You want precision and confidence. This is what most teams need most of the time.
- “I need to optimize revenue right now.” Multi-armed bandits send more traffic to the winning version while the test runs. Less learning, more earning. Good for short campaigns and sales events.
- “I need to test several elements at once.” Multivariate testing lets you test headlines, images, and buttons simultaneously. But you’ll need serious traffic (think 50,000+ visitors) to get reliable results.
Question 3: How comfortable are you with statistics?
- “Not at all.” That’s fine. Bayesian A/B testing gives you probabilities in plain language. Kirro shows “89% chance Version B wins” instead of confusing p-values. Start there.
- “I know the basics.” Dig into the sample size formula and A/B testing conversion rate benchmarks to set realistic expectations.
- “I want the deep stats.” Go for type 1 vs type 2 errors, power analysis, null hypothesis testing, and minimum detectable effect.
Our take: A December 2025 Harvard Business Review study found that traditional significance testing demands 24 to 55 times more data than you actually need for a good business decision. Speed often matters more than certainty. Most small teams are better off running more tests with slightly less precision than running fewer “perfect” tests.
The core methods
Standard A/B testing splits traffic 50/50 between your current page and one change. It’s the workhorse. Reliable, easy to understand, works at almost any traffic level. If you’re new to testing, split testing meaning explains the concept, and landing page split testing walks through a real example. For the stats behind it, see A/B testing conversion rate.
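To make the math concrete, here's a rough sketch of the classic comparison behind a standard A/B test: a two-proportion z-test in plain Python. The visitor counts are invented for the demo, and this is an illustration of the textbook method, not Kirro's implementation.

```python
from statistics import NormalDist

def ab_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis "no real difference"
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical data: 500 of 10,000 visitors converted on A; 590 of 10,000 on B
p = ab_test_p_value(500, 10_000, 590, 10_000)
print(f"p-value: {p:.4f}")
```

A p-value below 0.05 is the conventional "significant" threshold, which is exactly the kind of output the Bayesian approach below replaces with a plain-language probability.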
Bayesian A/B testing updates results as visitors arrive instead of making you wait for a fixed sample. Kirro uses this approach because the results make sense to non-statisticians. “89% chance Version B wins” is a sentence your boss can act on. Our Bayesian A/B testing guide covers when it helps and when it’s overkill.
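Here's a minimal sketch of how a "chance Version B wins" number can be computed, assuming uniform Beta(1,1) priors and made-up visitor counts. This illustrates the general Bayesian idea, not Kirro's exact model.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) with Beta(1,1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each conversion rate is Beta(conversions+1, misses+1)
        rate_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical data: 48 of 1,000 converted on A; 63 of 1,000 on B
prob = prob_b_beats_a(48, 1_000, 63, 1_000)
print(f"P(B beats A): {prob:.0%}")
```

The output reads like the sentence your boss can act on: a single probability that B is better, updated every time new visitors arrive.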
Multivariate testing tests combinations of changes. Different headlines paired with different images paired with different buttons. Powerful, but hungry for traffic. Our multivariate testing guide includes the traffic calculator so you can check if your site qualifies before committing.
Multi-armed bandits automatically shift traffic toward whichever version is winning. Less learning, faster revenue. Good for flash sales or time-limited campaigns where waiting for full statistical confidence would cost you money. Deep dive: multi-armed bandit testing.
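To show the idea in miniature, here's a toy Thompson sampling bandit, one common bandit algorithm. The 4% and 6% true conversion rates are invented for the demo; real tools add safeguards this sketch omits.

```python
import random

def thompson_bandit(true_rates, visitors, seed=7):
    """Thompson sampling: route each visitor to the arm whose sampled
    conversion rate (drawn from its Beta posterior) is highest."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)    # conversions per arm
    trials = [0] * len(true_rates)  # visitors per arm
    for _ in range(visitors):
        # Draw one plausible rate per arm, send the visitor to the best draw
        samples = [rng.betavariate(wins[i] + 1, trials[i] - wins[i] + 1)
                   for i in range(len(true_rates))]
        arm = samples.index(max(samples))
        trials[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]
    return trials

# Version B truly converts at 6% vs 4% for A; the bandit shifts traffic to B
traffic = thompson_bandit([0.04, 0.06], visitors=5_000)
print(f"Visitors per version: {traffic}")
```

Notice the trade: the winning arm gets most of the traffic (more revenue during the test), but the losing arm gets so little that your estimate of its rate stays fuzzy (less learning).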
Sequential testing lets you stop a test early (or keep it running longer) based on ongoing results, without inflating your false positive rate. It solves the “peeking problem” that Evan Miller famously showed raises error rates from 5% to 26%. Full guide: sequential testing.
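You can watch the peeking problem happen in a quick simulation: run A/A tests where neither version is actually better, check for significance at ten interim points, and count how often noise alone "wins." This is an illustration of the effect, not Miller's exact setup, and the simulation parameters are arbitrary.

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(sims=1_000, n=1_000, checks=10, seed=1):
    """Simulate A/A tests (no real difference) and declare a 'winner'
    the first time any interim check crosses the p < 0.05 line."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold
    false_positives = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        step = n // checks
        for visitors in range(step, n + 1, step):
            # Both arms convert at the same 5% rate: any "winner" is noise
            conv_a += sum(rng.random() < 0.05 for _ in range(step))
            conv_b += sum(rng.random() < 0.05 for _ in range(step))
            pooled = (conv_a + conv_b) / (2 * visitors)
            if pooled in (0.0, 1.0):
                continue
            se = (pooled * (1 - pooled) * 2 / visitors) ** 0.5
            if abs(conv_b - conv_a) / visitors / se > z_crit:
                false_positives += 1  # stopped early on pure noise
                break
    return false_positives / sims

fp = peeking_false_positive_rate()
print(f"False positive rate with 10 peeks: {fp:.0%}")
```

A test designed for a 5% error rate fires far more often when you stop at the first significant peek. Sequential methods fix this by adjusting the stopping threshold for each look.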
CUPED (variance reduction) uses data you already have about your visitors to reduce the noise in your results. The practical result: 30 to 40% smaller sample sizes for the same precision. If your tests always seem to take too long, this is probably the fix. Guide: CUPED and variance reduction.
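Here's a small sketch of the CUPED adjustment on synthetic data. The pre-experiment metric, correlation strength, and the resulting reduction are all invented for illustration; your numbers depend on how predictive your historical data is.

```python
import random
from statistics import mean, variance

def cuped_adjust(y, x):
    """CUPED: subtract the part of the experiment metric (y) that
    pre-experiment data (x) already predicts, shrinking the variance."""
    y_bar, x_bar = mean(y), mean(x)
    # theta = Cov(y, x) / Var(x), the ordinary regression slope
    cov = sum((yi - y_bar) * (xi - x_bar)
              for yi, xi in zip(y, x)) / (len(y) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - x_bar) for yi, xi in zip(y, x)]

rng = random.Random(3)
pre = [rng.gauss(100, 20) for _ in range(2_000)]      # pre-experiment behavior
post = [0.8 * p + rng.gauss(0, 10) for p in pre]      # correlated test metric
adjusted = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)
print(f"Variance reduction: {reduction:.0%}")
```

Less variance means smaller sample sizes for the same precision, which is exactly why CUPED is the fix when tests drag on too long.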
The statistics that actually matter
Every testing method above relies on the same handful of statistical concepts. You don’t need to calculate them (that’s the tool’s job). But knowing what they mean helps you avoid the most common A/B testing mistakes.
Sample size is “how many visitors do I need?” Too few and your test can’t tell a real winner from random noise. Our sample size formula guide breaks down the math, and the free calculator does it for you.
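The standard sample size formula fits in a few lines. The baseline rate and lift below are hypothetical, and this is the textbook normal-approximation version, so treat it as a sketch rather than a replacement for the calculator.

```python
from statistics import NormalDist

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a relative lift of `mde`
    over a `baseline` conversion rate (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p1 = baseline
    p2 = baseline * (1 + mde)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return int(n) + 1

# Hypothetical page: 3% baseline conversion, detect a 20% relative lift
n = sample_size_per_variant(0.03, 0.20)
print(f"Visitors needed per variant: {n:,}")
```

Note how the answer explodes as the lift you want to detect shrinks, which is the link to minimum detectable effect below.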
Minimum detectable effect is “what’s the smallest improvement worth finding?” If you’d only act on a 20% improvement, don’t set up a test designed to detect 2% changes. It’ll take forever. MDE guide.
Type 1 and type 2 errors are the two ways a test can lie to you. A type 1 error says B wins when it doesn’t (false alarm). A type 2 error misses a real winner (missed opportunity). Understanding both helps you set up tests that balance speed with accuracy.
Statistical power is the probability your test will actually detect a real difference. Low power means you’ll miss winners. Microsoft runs 10,000+ experiments annually and still obsesses over power calculations. If it matters to them at that scale, it matters to you. Power analysis guide.
For the full theoretical foundation (what a null hypothesis is, how to think about probability): null hypothesis in A/B testing.
Implementation and architecture
Picking the right statistical method gets you halfway. The other half is the practical setup: where the test runs, how visitors get assigned to versions, and what happens when cookies disappear.
Experiment design covers the full process: forming a clear guess about what will happen, choosing the right metric, picking the page, and setting up controls. How to design a marketing experiment walks through this start to finish.
Client-side vs server-side testing is about where the test runs. Client-side (in the browser) is easier to set up but can cause page flicker. Server-side (on your server) is invisible to visitors but needs developer involvement. Most small teams start client-side and it works fine. Client-side vs server-side A/B testing helps you decide.
Cookieless testing matters more every year. Safari already blocks third-party cookies. Chrome offers users a choice. If your testing tool relies on third-party cookies, you’re losing data on a growing chunk of visitors. Cookieless A/B testing covers the alternatives.
Feature flags vs A/B testing confuses a lot of teams. Feature flags let developers turn features on and off. A/B tests measure which version performs better. They solve different problems, and some platforms bundle them together. If you’re wondering whether you need a feature flag tool or a testing tool, feature flags vs A/B testing sorts it out. (Short answer for most marketers: you need a testing tool.)
AI-powered testing is the newest addition to the toolkit. AI can help prioritize what to test, generate variations, and analyze results faster. But it’s not magic, and the fundamentals still apply. AI A/B testing separates the real applications from the hype.
Start somewhere
Most teams overthink the methodology and underthink the action. Microsoft found that a 1% improvement to Bing’s revenue equals over $10 million per year. Those gains came from running thousands of simple tests, not from picking the “perfect” statistical method.
Pick a high-traffic page. Change one thing. Run the test. Three minutes to set up in Kirro. The methodology guides above are here for when you want to go deeper. But the first test? Just run it.
More from Testing Methodology, by Randy Wattilete:

- AI A/B testing: what's real, what's hype, and what actually helps (14 Mar, 2026)
- Client-side vs server-side A/B testing: which one your business actually needs (14 Mar, 2026)
- Cookieless A/B testing: what actually changed and what to do about it (14 Mar, 2026)
- CUPED and variance reduction: run faster A/B tests with less traffic (14 Mar, 2026)
- Feature flags vs A/B testing: what's the difference and which do you need? (14 Mar, 2026)
- Multi-armed bandit testing: what it is, when it works, and when it backfires (14 Mar, 2026)
- The null hypothesis in A/B testing: what it means and why most tests prove it right (14 Mar, 2026)
- Sequential testing: when to stop your A/B test early (14 Mar, 2026)
- How to design a marketing experiment (even if most of them fail) (13 Mar, 2026)
- Minimum detectable effect in A/B testing: how to pick the right one for your business (13 Mar, 2026)