Statistical power answers one question: "If Version B really is better, will my test actually catch it?"
Low power means you might miss a real winner. Your test ends, the results look flat, and you conclude "nothing worked" when the truth is you just didn't have enough data to see the difference. This calculator tells you your test's power level so you can plan accordingly.
How to use this calculator
- Enter your baseline conversion rate. That's your current page's conversion rate before the test.
- Enter the sample size per version. That's how many visitors each version will see (or has already seen).
- Set the minimum detectable effect. That's the smallest improvement you want to catch.
- Pick your significance level. 95% is standard.
- Read the result. 80% power or above means your test is well-equipped to catch a real difference. Below that, you're rolling the dice.
Try the reverse mode too: enter your traffic and baseline rate, and the calculator tells you the smallest improvement your test can realistically detect. That's often a more useful question than "what's my power?"
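Under the hood, reverse mode amounts to searching for the smallest lift that still reaches your target power. Here's a minimal sketch of that idea, assuming a standard normal-approximation power formula; the function names and example numbers are illustrative, not this calculator's exact internals:

```python
from math import sqrt
from statistics import NormalDist

def power(p1: float, p2: float, n: int, alpha: float = 0.05) -> float:
    # Normal-approximation power of a two-sided z-test, n visitors per version.
    norm = NormalDist()
    p_bar = (p1 + p2) / 2                    # pooled rate under the alternative
    se = sqrt(2 * p_bar * (1 - p_bar) / n)   # standard error of the difference
    return 1 - norm.cdf(norm.inv_cdf(1 - alpha / 2) - abs(p2 - p1) / se)

def min_detectable_lift(baseline: float, n: int,
                        target_power: float = 0.8, alpha: float = 0.05) -> float:
    """Smallest relative lift detectable with the target power, by bisection."""
    lo, hi = 1e-6, 5.0                       # search between ~0% and 500% relative lift
    for _ in range(60):
        mid = (lo + hi) / 2
        if power(baseline, baseline * (1 + mid), n, alpha) < target_power:
            lo = mid                         # too small to detect: search higher
        else:
            hi = mid                         # detectable: try smaller
    return hi

# Example: 5% baseline, 10,000 visitors per version
print(f"{min_detectable_lift(0.05, 10_000):.1%}")
```

Under these assumptions, 10,000 visitors per arm at a 5% baseline puts the smallest reliably detectable improvement at roughly an 18% relative lift.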
How we calculate this
Power measures the probability that your test will correctly identify a winner when one actually exists. In statistics, it's written as 1 minus beta, where beta is the chance of a Type II error (missing a real winner).
Here's the analogy. Imagine you're trying to hear someone whisper in a loud room. Power is how good your hearing is. More visitors means better hearing: you catch smaller differences.
Four things determine your power:
Sample size is the biggest lever. More visitors means more data, which means smaller differences become visible. Doubling your sample size doesn't double your power (the relationship isn't linear), but it always helps.
Baseline conversion rate matters because differences are harder to detect at extreme rates. Spotting a 10% relative improvement is easier when your rate is 10% (detecting a 1 point change) than when it's 1% (detecting a 0.1 point change). Higher baselines give you more "signal" to work with.
Minimum detectable effect is the smallest improvement you're looking for. The smaller the difference you want to catch, the more power you need. This is the core tradeoff: you can either detect small changes (which needs a lot of visitors) or detect large changes (which is cheaper and faster). The MDE guide helps you decide what's right for your situation.
Significance level is your tolerance for false positives. Higher confidence (99% vs 95%) is stricter, which consumes some of your power. At the same sample size, a 99% confidence test has less power than a 95% confidence test.
The formula uses the normal approximation: it shifts the test statistic's distribution by a non-centrality term, then calculates the probability of rejecting the null hypothesis when the alternative hypothesis is true. In practical terms: given your sample size, baseline rate, and the effect you're trying to detect, how often would your test correctly pick up on it?
Power = P(Z > Z_α − δ·√(n/2) / √(p̄(1 − p̄)))
Where δ is the true difference between conversion rates, n is visitors per version, and p̄ is the pooled rate under the alternative hypothesis.
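This formula translates almost line-for-line into code. A minimal sketch using Python's standard library; the function name and example numbers are illustrative assumptions, not this calculator's internals:

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1: float, p2: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test comparing two conversion rates."""
    norm = NormalDist()
    delta = abs(p2 - p1)                       # δ: true difference in rates
    p_bar = (p1 + p2) / 2                      # p̄: pooled rate under the alternative
    se = sqrt(2 * p_bar * (1 - p_bar) / n)     # standard error of the difference
    z_alpha = norm.inv_cdf(1 - alpha / 2)      # two-sided critical value
    return 1 - norm.cdf(z_alpha - delta / se)  # P(Z > Z_α − δ/se)

# Example: 10% baseline, 12% variant (a 20% relative lift), 1,000 visitors per arm
print(round(power_two_proportions(0.10, 0.12, 1000), 3))
```

At 1,000 visitors per arm, this works out to roughly 30% power: a real 20% lift would be missed most of the time.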
The industry standard is 80% power. That means if Version B really is better by the amount you specified, your test has an 80% chance of catching it. Some practitioners aim for 90% when the stakes are high.
FAQ
What is statistical power in A/B testing?
Power is the probability that your test will detect a real difference when one exists. At 80% power, if Version B truly converts 20% better than Version A, your test has an 80% chance of correctly identifying B as the winner. The remaining 20% is the risk of a Type II error: concluding "no difference" when there actually is one. It's the false negative rate of your test.
What power level should I aim for?
80% is the standard for most A/B tests. It balances reliability with practical sample size requirements. Aim for 90% if the test results will drive a major decision (full site redesign, pricing change, major product launch). Below 80%, your test has a meaningful chance of missing real improvements. Below 50%, you're basically flipping a coin.
How does power relate to sample size?
They move together. More visitors equals more power. If your power is too low, the fix is almost always "get more visitors" (by running the test longer or on higher-traffic pages). The sample size calculator works this relationship in reverse: given your desired power level, it tells you exactly how many visitors you need.
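Inverting the power formula gives that reverse relationship directly. A minimal sketch under the usual normal-approximation assumptions; the function name is hypothetical, not the sample size calculator's actual code:

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_n(p1: float, p2: float, power: float = 0.8, alpha: float = 0.05) -> int:
    """Visitors needed per version to detect p1 -> p2 (normal approximation)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)      # two-sided critical value
    z_beta = norm.inv_cdf(power)               # quantile for the target power
    p_bar = (p1 + p2) / 2                      # pooled rate under the alternative
    delta = abs(p2 - p1)
    return ceil(2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / delta ** 2)

# Example: detect a lift from 10% to 12% with 80% power at 95% confidence
print(required_n(0.10, 0.12))
```

Raising the target from 80% to 90% power pushes the requirement up by roughly a third; that is the sample size cost of the extra reliability.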
My test had low power. Are the results useless?
Not entirely. If a low-power test finds a significant result, that result is still valid. Low power doesn't create false positives. What low power does is increase the chance of false negatives. So if your underpowered test says "no difference found," you can't be sure there wasn't one. You just didn't have enough data to see it. Think of it this way: a metal detector with low sensitivity won't create fake signals, but it might miss real ones buried deep.
What's the relationship between power and the minimum detectable effect?
Inverse. Smaller MDE needs more power (and more visitors) to detect. Larger MDE needs less. If your power is too low and you can't get more traffic, consider increasing your MDE. In practice, that means testing bigger changes. Instead of tweaking button color (small effect, hard to detect), test a completely different headline (large effect, easier to detect). The CUPED method can also help by reducing variance, effectively boosting power without more traffic.
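The inverse relationship is easy to see numerically. A minimal sketch that holds traffic fixed and varies the size of the effect; the numbers and the normal-approximation power formula are illustrative assumptions:

```python
from math import sqrt
from statistics import NormalDist

def power(p1: float, p2: float, n: int, alpha: float = 0.05) -> float:
    # Normal-approximation power of a two-sided z-test, n visitors per version.
    norm = NormalDist()
    p_bar = (p1 + p2) / 2
    se = sqrt(2 * p_bar * (1 - p_bar) / n)
    return 1 - norm.cdf(norm.inv_cdf(1 - alpha / 2) - abs(p2 - p1) / se)

# Same traffic (2,000 visitors per arm, 5% baseline), increasingly large effects
for lift in (0.05, 0.10, 0.20, 0.40):
    print(f"{lift:.0%} relative lift -> power {power(0.05, 0.05 * (1 + lift), 2000):.0%}")
```

At this traffic level only the largest change comes close to reliable detection, which is exactly why testing bold variants makes sense when traffic is scarce.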
Why don't most A/B testing tools show power?
Because low power is an uncomfortable truth. Most tools show you whether your result is significant but don't tell you whether your test was capable of detecting a difference in the first place. It's like checking if your net caught a fish without asking whether the net had holes in it. Power analysis is the "before" to significance testing's "after." Do it before you start the test using this calculator and the sample size calculator together.
Need more power? Kirro uses Bayesian statistics, which reach reliable conclusions with less traffic. You get answers sooner.