Free A/B Test Power Calculator

Find out if your A/B test has enough power to catch a real winner. Enter your numbers and see your result instantly.


Example result: with a 5% baseline rate, 3,900 visitors per version, a 20% relative minimum detectable effect, and 95% significance, the test's power is 49.1%. That's roughly a coin-flip chance of missing a real winner: if Version B actually IS better, the test probably won't catch it. To reach the 80% target, you'd need 8,158 visitors per version, so test bigger changes, run the test longer, or plan your sample size before the next test.
Need more power? Kirro uses Bayesian stats, so you get reliable answers with less traffic. Try it free

Statistical power answers one question: “If Version B really is better, will my test actually catch it?”

Low power means you might miss a real winner. Your test ends, the results look flat, and you conclude “nothing worked” when the truth is you just didn’t have enough data to see the difference. This calculator tells you your test’s power level so you can plan accordingly.

How to use this calculator

  1. Enter your baseline conversion rate. That’s your current page’s conversion rate before the test; check your analytics for the page you’re testing.
  2. Enter the sample size per version. That’s how many visitors each version will see (or has already seen).
  3. Set the minimum detectable effect. That’s the smallest improvement you want to catch, expressed as a relative change: a 20% relative MDE on a 5% baseline means detecting a shift from 5.0% to 6.0%.
  4. Pick your significance level. 95% is standard.
  5. Read the result. 80% power or above means your test is well-equipped to catch a real difference. Below that, you’re rolling the dice.

Try the reverse mode too: enter your traffic and baseline rate, and the calculator tells you the smallest improvement your test can realistically detect. That’s often a more useful question than “what’s my power?”

How we calculate this

Power measures the probability that your test will correctly identify a winner when one actually exists. In statistics, it’s written as 1 − ÎČ, where ÎČ is the chance of a Type II error (missing a real winner).

Here’s an analogy: imagine you’re trying to hear someone whisper in a loud room. Power is how good your hearing is. More visitors means better hearing, so you catch smaller differences.

Four things determine your power:

Sample size is the biggest lever. More visitors means more data, which means smaller differences become visible. Doubling your sample size doesn’t double your power (the relationship isn’t linear), but it always helps.

Baseline conversion rate matters because differences are harder to detect at extreme rates. Spotting a 10% relative improvement is easier when your rate is 10% (detecting a 1 point change) than when it’s 1% (detecting a 0.1 point change). Higher baselines give you more “signal” to work with.

Minimum detectable effect is the smallest improvement you’re looking for. The smaller the difference you want to catch, the more power you need. This is the core tradeoff: you can either detect small changes (which needs a lot of visitors) or detect large changes (which is cheaper and faster). The MDE guide helps you decide what’s right for your situation.

Significance level is your tolerance for false positives. Higher confidence (99% vs 95%) is stricter, which consumes some of your power. At the same sample size, a 99% confidence test has less power than a 95% confidence test.

The formula uses the normal approximation to calculate the probability of rejecting the null hypothesis when the alternative hypothesis is true. In practical terms: given your sample size, baseline rate, and the effect you’re trying to detect, how often would your test correctly pick up on it?

Power = P(Z > Zα/2 − ή√(n/2) / √(p̂(1 − p̂)))

Where ÎŽ is the true difference between conversion rates, n is the number of visitors per version, p̂ is the pooled rate under the alternative hypothesis, and Zα/2 is the two-sided critical value at your significance level.
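
Here’s that formula as a short Python sketch, assuming a two-sided test and a pooled-variance normal approximation. The function and its names are illustrative, not the calculator’s actual internals, but it reproduces the example above:

```python
# Minimal sketch of the power formula above: two-proportion z-test,
# normal approximation, two-sided alpha. Illustrative only.
from scipy.stats import norm

def power(baseline, n_per_version, mde_relative, significance=0.95):
    p1 = baseline
    p2 = baseline * (1 + mde_relative)    # 5% baseline + 20% MDE -> 6%
    delta = p2 - p1                       # true difference (delta)
    p_bar = (p1 + p2) / 2                 # pooled rate under H1 (p-hat)
    se = (2 * p_bar * (1 - p_bar) / n_per_version) ** 0.5
    z_crit = norm.ppf(1 - (1 - significance) / 2)   # two-sided critical value
    return norm.cdf(delta / se - z_crit)

# The example above: 5% baseline, 3,900 visitors/version, 20% relative MDE
print(f"{power(0.05, 3900, 0.20):.1%}")   # -> 49.1%
# Stricter significance eats power at the same sample size:
print(f"{power(0.05, 3900, 0.20, significance=0.99):.1%}")  # -> ~26%
```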

The industry standard is 80% power. That means if Version B really is better by the amount you specified, your test has an 80% chance of catching it. Some practitioners aim for 90% when the stakes are high.

FAQ

What is statistical power in A/B testing?

Power is the probability that your test will detect a real difference when one exists. At 80% power, if Version B truly converts 20% better than Version A, your test has an 80% chance of correctly identifying B as the winner. The remaining 20% is the risk of a Type II error: concluding “no difference” when there actually is one. It’s the false negative rate of your test.

What power level should I aim for?

80% is the standard for most A/B tests. It balances reliability with practical sample size requirements. Aim for 90% if the test results will drive a major decision (full site redesign, pricing change, major product launch). Below 80%, your test has a meaningful chance of missing real improvements. Below 50%, you’re basically flipping a coin.

How does power relate to sample size?

They move together. More visitors equals more power. If your power is too low, the fix is almost always “get more visitors” (by running the test longer or on higher-traffic pages). The sample size calculator works this relationship in reverse: given your desired power level, it tells you exactly how many visitors you need.
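
Here’s a rough sketch of that reverse calculation, using the standard closed-form inversion of the same normal approximation (illustrative; exact tools may treat the variance term slightly differently):

```python
# Sketch: solve the power formula for n at a target power.
# Closed-form normal approximation; real calculators may differ slightly.
from math import ceil
from scipy.stats import norm

def sample_size(baseline, mde_relative, target_power=0.80, significance=0.95):
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    delta = p2 - p1
    p_bar = (p1 + p2) / 2                    # pooled rate under H1
    z_alpha = norm.ppf(1 - (1 - significance) / 2)
    z_beta = norm.ppf(target_power)
    return ceil(2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / delta ** 2)

# 5% baseline, 20% relative MDE, 80% power -> ~8,159 per version,
# close to the 8,158 figure in the example above (small differences
# come from how the variance term is approximated).
print(sample_size(0.05, 0.20))
```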

My test had low power. Are the results useless?

Not entirely. If a low-power test finds a significant result, that result is still valid. Low power doesn’t create false positives. What low power does is increase the chance of false negatives. So if your underpowered test says “no difference found,” you can’t be sure there wasn’t one. You just didn’t have enough data to see it. Think of it this way: a metal detector with low sensitivity won’t create fake signals, but it might miss real ones buried deep.

What’s the relationship between power and the minimum detectable effect?

Inverse. Smaller MDE needs more power (and more visitors) to detect. Larger MDE needs less. If your power is too low and you can’t get more traffic, consider increasing your MDE. In practice, that means testing bigger changes. Instead of tweaking button color (small effect, hard to detect), test a completely different headline (large effect, easier to detect). The CUPED method can also help by reducing variance, effectively boosting power without more traffic.
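
As one possible sketch of that reverse mode (the calculator’s actual solver may differ): fix your traffic and bisect for the smallest relative lift that reaches 80% power.

```python
# Sketch: the smallest relative MDE detectable at a target power,
# found by bisection. Same normal approximation as the sketches above.
from scipy.stats import norm

def power(baseline, n, mde_rel, significance=0.95):
    p2 = baseline * (1 + mde_rel)
    delta, p_bar = p2 - baseline, (baseline + p2) / 2
    se = (2 * p_bar * (1 - p_bar) / n) ** 0.5
    return norm.cdf(delta / se - norm.ppf(1 - (1 - significance) / 2))

def min_detectable_effect(baseline, n, target_power=0.80):
    lo, hi = 0.0, 5.0                 # search relative lifts from 0% to 500%
    for _ in range(60):               # power rises with the MDE, so bisect
        mid = (lo + hi) / 2
        if power(baseline, n, mid) < target_power:
            lo = mid                  # not enough power yet: need a bigger lift
        else:
            hi = mid
    return hi

# With 3,900 visitors/version at a 5% baseline, only lifts of roughly
# 29.5% relative or larger are detectable at 80% power.
print(f"{min_detectable_effect(0.05, 3900):.1%}")
```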

Why don’t most A/B testing tools show power?

Because low power is an uncomfortable truth. Most tools show you whether your result is significant but don’t tell you whether your test was capable of detecting a difference in the first place. It’s like checking if your net caught a fish without asking whether the net had holes in it. Power analysis is the “before” to significance testing’s “after.” Do it before you start the test using this calculator and the sample size calculator together.

Need more power? Kirro uses Bayesian statistics, which deliver reliable answers with less traffic, so you get results sooner.

Try Kirro

Run smarter A/B tests and boost your conversions

Everything. No limits. No surprises.

Get started free