Free A/B Test Significance Calculator

Check if your A/B test results are real or random noise. Enter your data and get a clear answer in seconds.


Version A (Control) — conversion rate: 4.00%

Version B (Variation) — conversion rate: 5.50%

Results

Not enough evidence yet. Keep the test running or test a bigger change.

Uplift: +37.5% (+1.50 percentage points, 4.00% → 5.50%)

Chance the result is random: 11.5%

There's an 11.5% chance this is random noise. You can't trust this result yet.

Chance B is truly better: 94.3%

Based on the data so far, this is the estimated likelihood that Version B genuinely outperforms A.

You need roughly 4,314 more total visitors to reach significance at this effect size.

Want to plan ahead? Calculate your sample size before the next test.


Required certainty: 95%
z-score: 1.5769


You ran a test. One version got more clicks. But is the difference real, or just random noise?

This calculator tells you. Enter your visitors and conversions for both versions, and you'll see whether the difference is statistically significant, plus how likely it is that Version B is actually better. No statistics degree required.

How to use this calculator

  1. Enter the number of visitors and conversions for Version A (your original page).
  2. Enter the same for Version B (your changed page).
  3. Pick your confidence level. 95% is the standard. Use 99% if you want to be extra sure.
  4. Read the verdict. Green means Version B is the real winner. Gray means you need more data.
  5. Check the Bayesian probability. It tells you the chance that Version B is genuinely better, as a simple percentage.

How we calculate this

This calculator uses two methods to analyze your results. You get both because they answer slightly different questions.

The frequentist method (p-value) asks: “If there were no real difference between the two versions, how likely is it that I’d see a gap this big by pure chance?” That probability is the p-value.

Here’s the intuition. Imagine you flip a coin 10 times and get 7 heads. Is the coin rigged? Probably not; you’d need more flips to know. But if you flip it 1,000 times and get 700 heads, something’s going on. The p-value captures that logic. It measures whether your sample is large enough and the difference big enough to rule out luck.
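If you want to see that intuition in numbers, here's a quick sketch using SciPy's exact binomial test (an illustration, not this calculator's actual code):

```python
# Same 70% heads rate, very different verdicts at different sample sizes.
from scipy.stats import binomtest

for flips, heads in [(10, 7), (1000, 700)]:
    result = binomtest(heads, flips, p=0.5)  # null hypothesis: the coin is fair
    print(f"{heads}/{flips} heads -> p-value = {result.pvalue:.3g}")

# 7/10 heads     -> p ≈ 0.34 (easily explained by luck)
# 700/1000 heads -> p is vanishingly small (luck is effectively ruled out)
```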

The math uses a two-proportion z-test. It compares the conversion rates of both versions, accounts for the sample size, and calculates how many standard deviations apart the results are. If that distance is large enough (based on your chosen confidence level), the result is significant.

The formula: z = (p₁ - p₂) / √(p̂(1-p̂)(1/n₁ + 1/n₂))

Where p₁ and p₂ are the conversion rates, n₁ and n₂ are the visitor counts, and p̂ is the pooled conversion rate. The z-score maps to a p-value using the normal distribution.

If p < 0.05 at 95% confidence, you can reject the null hypothesis and call the result significant.
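Here's a minimal Python sketch of that test, for anyone who wants to check the numbers. It's a standalone illustration, not the calculator's actual source, and the visitor counts below are assumed (1,000 per version happens to reproduce the example readout above):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-proportion z-test: returns (z, two-tailed p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pooled = (conv_a + conv_b) / (n_a + n_b)        # p̂, the pooled rate
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))              # two-tailed
    return z, p_value

# Assumed inputs: 40/1,000 (4.00%) vs 55/1,000 (5.50%)
z, p = two_proportion_z_test(40, 1000, 55, 1000)
print(f"z = {z:.4f}, p = {p:.4f}")  # z ≈ 1.5769, p ≈ 0.115 -> not significant at 95%
```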

The Bayesian method asks a friendlier question: “What’s the probability that Version B is actually better?” Instead of “is this statistically significant at the 0.05 level,” you get “there’s an 87% chance B wins.” Most people find this easier to act on.

We calculate this using a normal approximation to the posterior distribution. It won’t always agree with the p-value, and that’s fine. The Bayesian probability is especially useful early in a test when you don’t have enough data for frequentist significance but want a sense of direction.
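For the curious, here's one common way to implement a normal approximation like this: treat each rate's posterior as roughly Normal(p, p(1−p)/n) and measure how much of the difference distribution sits above zero. The calculator's exact prior and approximation may differ:

```python
from math import sqrt
from scipy.stats import norm

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """P(B > A) under independent normal approximations to each posterior."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    var_a = p_a * (1 - p_a) / n_a
    var_b = p_b * (1 - p_b) / n_b
    return norm.cdf((p_b - p_a) / sqrt(var_a + var_b))

# Same assumed inputs as before: 40/1,000 vs 55/1,000
print(f"{prob_b_beats_a(40, 1000, 55, 1000):.1%}")  # ≈ 94.3%, matching the readout above
```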

Which one should you trust? Both. The p-value is the rigorous, industry-standard answer. The Bayesian probability is the practical, “should I care about this yet?” answer. If both agree, you’re in good shape. If the Bayesian probability is high but the p-value isn’t significant yet, you probably need more data. Keep the test running.

FAQ

What does statistical significance mean?

It means the difference between your two versions is unlikely to be caused by random chance. At 95% confidence, there’s only a 5% chance you’re seeing a pattern that isn’t real. It does not mean the result is big or important. A tiny difference (4.01% vs 4.00%) can be statistically significant with enough visitors. Always look at the actual size of the difference alongside the significance. Our guide to null hypothesis testing explains this in more detail.

What confidence level should I use?

95% is the standard and works for most tests. Use 90% if you’re running a quick directional test and can tolerate more risk. Use 99% if the change is expensive or hard to reverse (like a full site redesign). Higher confidence means you need more visitors. The tradeoff is always between certainty and speed. The sample size calculator shows exactly how visitor requirements change at each level.
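As a rough sketch of that tradeoff, here's the standard two-proportion sample-size formula applied to the 4.00% → 5.50% example above. The 80% power assumption is mine; the linked calculator may use slightly different conventions:

```python
from scipy.stats import norm

def visitors_per_version(p1: float, p2: float, confidence: float,
                         power: float = 0.80) -> float:
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)                       # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

for conf in (0.90, 0.95, 0.99):
    n = visitors_per_version(0.04, 0.055, conf)
    print(f"{conf:.0%} confidence: ~{n:,.0f} visitors per version")
# 90% ≈ 2,480 · 95% ≈ 3,150 · 99% ≈ 4,690 (per version, roughly)
```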

My test is significant at 90% but not 95%. What do I do?

You have three options. First, keep the test running until you hit 95% (safest). Second, accept the 90% result if the change is low-risk and easy to reverse (reasonable for minor copy changes). Third, check the Bayesian probability. If it shows an 85%+ chance B is better, that’s additional evidence in the same direction. The choice depends on what’s at stake. A button color change? 90% might be fine. A complete page redesign? Wait for 95%.

Can a test be significant but not meaningful?

Yes. Statistical significance tells you the difference is real (not random). It doesn’t tell you the difference is big enough to matter. A 0.1 percentage point improvement might be statistically significant with 500,000 visitors, but the business impact could be negligible. Always check the absolute uplift alongside the significance. If the difference is real but tiny, your time is better spent testing something with a bigger potential payoff.
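You can sanity-check that claim with a quick back-of-the-envelope z-test (the exact numbers here are assumed for illustration):

```python
from math import sqrt
from scipy.stats import norm

n = 500_000                               # visitors per version
p_a, p_b = 0.040, 0.041                   # 4.0% vs 4.1%: a 0.1-point lift
p_pooled = (p_a + p_b) / 2                # equal sample sizes, so a simple average
z = (p_b - p_a) / sqrt(p_pooled * (1 - p_pooled) * (2 / n))
p_value = 2 * (1 - norm.cdf(z))
print(f"z = {z:.2f}, p = {p_value:.4f}")  # z ≈ 2.54, p ≈ 0.011 -> significant, yet tiny
```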

What’s the difference between frequentist and Bayesian results?

Frequentist statistics (the p-value) answers: “How likely is this data if there’s no real difference?” Bayesian statistics answers: “Given this data, how likely is it that B is better?” The Bayesian approach is more intuitive and works better with smaller samples, which is why Kirro uses Bayesian statistics in the product. This calculator shows both so you can compare.

How many visitors do I need before checking significance?

Calculate your required sample size before you start the test using the sample size calculator. Checking too early and making decisions based on incomplete data is called peeking, and it’s one of the most common A/B testing mistakes. If you’ve already started without calculating, a rough guideline: you need at least 100 conversions per version before significance results become reliable.

Got a winner? Push it live. Got noise? Kirro helps you figure out what to test next. Try it free.

Try Kirro

Run smarter A/B tests and boost your conversions

Everything. No limits. No surprises.

Get started free