Testing Methodology · 12 Jun, 2026

P-value calculator: find and interpret your p-value

A p-value tells you how surprised you should be by your data. Specifically, it’s the probability of seeing results at least as extreme as yours if nothing actually changed. Low p-value? Your data is hard to explain by random chance alone. High p-value? The results could easily be noise.

Most people searching for a p-value calculator already have a test statistic (a Z-score, t-value, chi-square, or F-statistic) and just need to convert it into a p-value. You’ll find instructions for each type below, along with formulas for Excel and Google Sheets.

One thing worth knowing: 67% of published studies get p-value interpretation wrong. Even trained researchers. So after you calculate yours, stick around for the section on what your result actually means.

How to find a p-value (step by step)

Pick your test, plug in your numbers, and compare the result to your significance level. Six steps, no PhD required.

All you need is a test statistic and a significance level.

Step 1: State your starting assumption. This is called the null hypothesis. It’s the boring explanation. “Nothing changed.” “There’s no difference.” “The new headline performs the same as the old one.” You’re trying to disprove it.

Step 2: Pick the right test. This depends on your data type. Comparing averages? Use a t-test. Comparing percentages or proportions? Use a Z-test. Checking whether categories are related? Chi-square. Comparing several group averages at once? F-test. (More on this in the decision table below.)

Step 3: Calculate your test statistic. This is the number that summarizes how far your data is from “nothing changed.” Your statistical software, spreadsheet, or testing tool does this for you.

Step 4: Figure out degrees of freedom. Degrees of freedom (df) are essentially how many independent data points went into your calculation. For a t-test comparing two groups, it’s roughly the total number of observations minus 2. For chi-square, it’s (rows - 1) times (columns - 1). Z-tests and some large-sample tests don’t need this.

Step 5: Look up or compute the p-value. This is where a calculator comes in. Plug in your test statistic, degrees of freedom, and whether you’re testing in one direction or two. The calculator converts your test statistic into a probability using the right distribution.

Step 6: Compare to your significance level. Your significance level (called alpha) is the threshold you picked before running the test. Usually 0.05. If your p-value is below alpha, you reject the null hypothesis. If it’s above? You don’t have enough evidence to reject it. That’s not the same as proving nothing changed. It just means your data wasn’t convincing enough.
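Here is a minimal sketch of the six steps for one common case, a two-tailed Z-test on two conversion rates. The function name and the visitor and conversion counts are made up for illustration, and the pooled standard error is one standard choice among several.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-tailed Z-test for a difference in conversion rates."""
    # Step 1: the null hypothesis says both versions share one true rate.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)

    # Steps 2-3: proportions call for a Z-test; z counts standard errors.
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se

    # Step 4: a Z-test needs no degrees of freedom.
    # Step 5: two-tailed p-value from the standard normal distribution.
    p_value = 2 * NormalDist().cdf(-abs(z))

    # Step 6: compare with the significance level chosen in advance.
    return z, p_value, p_value < alpha


# Hypothetical data: 200/5000 conversions for A, 250/5000 for B.
z, p, reject_null = two_proportion_z_test(200, 5000, 250, 5000)
print(f"z = {z:.2f}, p = {p:.4f}, reject null: {reject_null}")
```

With these made-up numbers the p-value lands below 0.05, so you would reject the null at the 5% level.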

Our take: Step 6 trips up more people than any other. “Not significant” doesn’t mean “no effect.” It means “not enough evidence,” like a jury verdict of “not guilty” versus “innocent.” If your test doesn’t reach significance, check your sample size before giving up. You might just need more data.

How to calculate a p-value

The math behind a p-value depends on your test type, but it always boils down to asking: “how far out in the tail of the distribution is my result?”

Same logic every time: take your test statistic, find where it falls on the appropriate probability distribution, and measure the area in the tail(s). That tail area is your p-value.

Z-test (for proportions or large samples). You have a Z-score. The p-value is the area under the standard normal curve beyond your Z-score. For a two-tailed test, you double it. If Z = 2.0, the one-tailed p-value is about 0.023. Two-tailed? About 0.046.
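If you would rather script the Z-score conversion, a small sketch using only Python's standard library looks like this. The helper name z_to_p is our own, not a library function.

```python
from statistics import NormalDist


def z_to_p(z: float, tails: int = 2) -> float:
    """Convert a Z-score to a p-value (tails=1 means right-tailed)."""
    if tails == 1:
        # Area above z equals the area below -z, by symmetry.
        return NormalDist().cdf(-z)
    # Two-tailed: double the area beyond |z|.
    return 2 * NormalDist().cdf(-abs(z))


print(round(z_to_p(2.0, tails=1), 3))  # 0.023
print(round(z_to_p(2.0, tails=2), 3))  # 0.046
```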

t-test (for means with smaller samples). Same idea, but the curve is slightly wider (fatter tails) to account for the extra uncertainty in small samples. You need degrees of freedom. As your sample grows, the t-distribution approaches the normal distribution.
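Python's standard library has no t-distribution, so this sketch integrates the t density numerically with Simpson's rule. That is an illustrative shortcut, not how statistical packages do it, and the function names are our own.

```python
import math


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_to_p(t: float, df: float, tails: int = 2, steps: int = 2000) -> float:
    """p-value from a t-statistic via Simpson's rule (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3      # area between 0 and |t|
    one_tail = 0.5 - central     # area beyond |t| on one side
    if tails == 2:
        return 2 * one_tail
    return one_tail if t >= 0 else 1 - one_tail  # right-tailed


print(round(t_to_p(2.45, 28), 3))  # 0.021
```

With very large degrees of freedom the result converges on the Z-test answer, which is the "approaches the normal distribution" behavior in action.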

Chi-square (for categorical data). This distribution is always positive and right-skewed. Your p-value is the area to the right of your chi-square statistic. Higher values mean bigger discrepancies between what you expected and what you observed.
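The chi-square right-tail area can be sketched with a textbook power series for the regularized incomplete gamma function. This assumes moderate statistic values (the series loses precision for extremely small p-values), and both function names are our own.

```python
import math


def regularized_lower_gamma(a: float, x: float) -> float:
    """P(a, x) via its power series; adequate for typical test statistics."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi_square_to_p(chi_sq: float, df: int) -> float:
    """Right-tail area beyond chi_sq for df degrees of freedom."""
    return 1.0 - regularized_lower_gamma(df / 2, chi_sq / 2)


print(round(chi_square_to_p(3.84, 1), 3))  # 0.05
```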

F-test (for comparing variances or ANOVA). Also right-skewed, also always positive. You need two sets of degrees of freedom (one for the numerator, one for the denominator). The p-value is the right-tail area beyond your F-statistic.
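For the F-test, the right-tail area can be written with the regularized incomplete beta function. The sketch below evaluates it with a standard continued-fraction expansion; the helper names are our own, and a statistics package would be the safer choice for production work.

```python
import math


def _beta_cf(a: float, b: float, x: float) -> float:
    """Continued fraction for the incomplete beta function (Lentz's method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 301):
        for numerator in (
            m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
            -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1)),
        ):
            d = 1.0 + numerator * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + numerator / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < 1e-14:  # last correction factor is ~1: converged
            break
    return h


def regularized_beta(x: float, a: float, b: float) -> float:
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_to_p(f_stat: float, df1: float, df2: float) -> float:
    """Right-tail area beyond f_stat (numerator df1, denominator df2)."""
    x = df2 / (df2 + df1 * f_stat)
    return regularized_beta(x, df2 / 2, df1 / 2)


print(round(f_to_p(3.0, 2, 2), 4))  # 0.25
```

The printed case is easy to check by hand: with 2 and 2 degrees of freedom the right-tail area is 1 / (1 + F).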

For the full mathematical derivation of each formula, see our p-value formula guide. This page keeps things practical.

Which test statistic should you use?

Your data type picks the test for you. Means get a t-test, proportions get a Z-test, categories get chi-square, and multiple groups get an F-test.

You’ve got a number and need a p-value, but you’re not sure which test applies. Quick reference:

Situation | Test | Example
--- | --- | ---
Comparing two averages (small sample or unknown variance) | t-test | “Is the average order value different between two page designs?”
Comparing two proportions or percentages (large sample) | Z-test | “Did more people click the green button or the blue button?”
Checking if categories are independent | Chi-square | “Is there a relationship between device type and purchase rate?”
Comparing averages across 3+ groups | F-test (ANOVA) | “Do conversion rates differ across four landing page versions?”
Comparing two proportions (very large sample, known variance) | Z-test | “Is the signup rate different this month vs. last month?”

For A/B testing specifically: Most website tests compare conversion rates (proportions), so you’re usually looking at a Z-test. If you’re comparing average revenue per visitor, that’s a t-test. And if you’re running tests with more than two versions (often called A/B/n testing), the F-test enters the picture.

If you’re running A/B tests regularly, dedicated tools handle the statistics automatically. Our free statistical significance calculator lets you paste in conversion numbers and get the answer without picking a test type. For ongoing testing, Kirro uses Bayesian statistics and gives you a direct probability (“89% chance Version B is better”) instead of a p-value.

What a p-value actually tells you (and what it doesn’t)

A p-value measures how surprising your data is under the assumption that nothing changed. It does not tell you the probability that your hypothesis is true.

A p-value of 0.03 means: “If absolutely nothing changed, there’s a 3% chance you’d see data at least this extreme just from random variation.” That’s it. That’s the whole thing.

Three things it does NOT mean:

“There’s a 3% chance my results are wrong.” Nope. The p-value says nothing about the probability of your hypothesis being true or false. It only describes the data under one specific assumption (that nothing changed). The American Statistical Association felt strongly enough about this to publish an official statement in 2016 correcting decades of misuse.

“The effect is big.” Not even close. A tiny, meaningless difference can produce a very small p-value if your sample is large enough. Test a headline change on enough visitors (think millions) and even a lift far too small to matter can hit p < 0.05. Statistically significant? Sure. Practically worthless? Also yes. The gap between statistical significance and practical significance is something the ASA explicitly warns about.

“I can repeat this result.” Not guaranteed. In a landmark 2015 study, researchers tried to replicate 100 psychology experiments. 97 of the originals had significant p-values. Only 36 replicated. The effect sizes on replication were roughly half as large.
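To see the sample-size point in numbers, this sketch holds a lift fixed and only grows the sample. The 5.0% and 5.1% conversion rates are hypothetical, and the helper assumes equal-sized variants and a pooled two-proportion Z-test.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_p(rate_a: float, rate_b: float, n_per_variant: int) -> float:
    """Two-proportion Z-test p-value for equal-sized variants."""
    pooled = (rate_a + rate_b) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_variant)
    z = (rate_b - rate_a) / se
    return 2 * NormalDist().cdf(-abs(z))


# Hypothetical rates: 5.0% vs 5.1% (a 0.1-percentage-point lift).
for n in (10_000, 100_000, 1_000_000):
    p = two_tailed_p(0.050, 0.051, n)
    print(f"{n:>9,} visitors per variant: p = {p:.4f}")
```

The lift never changes, yet the p-value drops from clearly non-significant to well under 0.05 as the sample grows. The p-value is tracking how much data you have, not how big the effect is.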

If you want a method that gives you a direct “what’s the probability this worked?” answer, that’s Bayesian statistics. Different framework, different question, and you get a straight probability instead of a p-value you need to decode.

Our take: P-values are like a smoke detector. They tell you something unusual might be happening, but they can’t tell you whether the building is on fire or how big the flames are. Always check the effect size and your sample size. A p-value of 0.001 with a 0.1% conversion lift? Your test detected a real but useless difference.

The five traps even researchers fall into

In a 2022 study, 100% of medical residents tested got the interpretation wrong. 88% were confident they understood it. Five common mistakes:

  1. Treating the p-value as the probability the null is true. It’s not. It’s the probability of the data given the null. Those are different things. (Confused? You’re in good company. Greenland et al. cataloged 25 distinct misinterpretations.)

  2. Confusing “not significant” with “no effect.” A high p-value doesn’t prove nothing happened. It means your data wasn’t strong enough to prove something did. Maybe your sample was too small. Check your statistical power.

  3. Ignoring effect size. A p-value tells you how surprising your data would be if there were no effect. It says nothing about how large the effect is. Always look at the actual size of the difference, not just whether it crossed a threshold.

  4. Comparing p-values across tests. “Our p-value is 0.001 and theirs is 0.04, so our effect is bigger.” Wrong. P-values depend on sample size, variability, and test design. Two different tests can’t be ranked by p-value alone.

  5. Cherry-picking the direction after seeing results. Deciding to use a one-tailed test after you already know which direction the data went is a form of p-hacking. It doubles your chance of a Type 1 error (false positive). Always pick your test direction before collecting data.

Why 0.05? The surprising history behind the magic number

Ronald Fisher picked 0.05 in 1925 because it was “convenient.” Not because of any deep scientific principle.

Every statistics textbook teaches you to compare your p-value to 0.05. But almost none explain where that number came from.

In 1925, British statistician Ronald Fisher published Statistical Methods for Research Workers. He chose 0.05 because it roughly corresponds to two standard deviations from the mean. A round number that felt about right. Fisher himself called it a matter of personal preference, not a universal truth.

And then it stuck. For a hundred years.

The statistical community has been pushing back. In 2018, 72 prominent statisticians proposed lowering the default threshold from 0.05 to 0.005. Their argument: a p-value of 0.05 corresponds to surprisingly weak evidence (a Bayes factor of only about 3:1 against the null).

Then in 2019, over 800 scientists signed a letter in Nature calling to retire the concept of “statistical significance” entirely. Not to stop using p-values, but to stop drawing binary lines.

Different fields already use different thresholds. Particle physicists require “5 sigma” (roughly p < 0.0000003) before claiming a discovery. Genomics studies use p < 0.00000005. Clinical trials often require p < 0.01 for regulatory approval. So if someone tells you “0.05 is the standard,” ask them: standard where?

Set your significance level before you run the test. Pick a threshold that makes sense for the stakes. Testing a headline? 0.05 is fine. Deciding whether a drug is safe? You’d want much stricter.

How to find p-values in Excel and Google Sheets

Excel and Google Sheets have built-in functions for every common p-value calculation. No add-ons needed.

If you prefer spreadsheets, here are the functions you need:

From a Z-score:

  • Excel: =2*(1-NORM.S.DIST(ABS(Z),TRUE)) for two-tailed
  • Google Sheets: Same formula works

From a t-statistic:

  • Excel: =T.DIST.2T(ABS(t), df) for two-tailed
  • Excel: =T.DIST.RT(t, df) for right-tailed (one-tailed)
  • Google Sheets: =TDIST(ABS(t), df, 2) for two-tailed

From a chi-square statistic:

  • Excel: =CHISQ.DIST.RT(chi_sq, df)
  • Google Sheets: =CHISQ.DIST.RT(chi_sq, df)

From an F-statistic:

  • Excel: =F.DIST.RT(F, df1, df2)
  • Google Sheets: =FDIST(F, df1, df2)

Replace Z, t, chi_sq, F, df, df1, and df2 with your actual values.

Quick example: you ran a t-test and got t = 2.45 with 28 degrees of freedom. In Excel, type =T.DIST.2T(2.45, 28) and you get approximately 0.021. That’s below 0.05, so you’d reject the null hypothesis at the 5% significance level.

If spreadsheet formulas feel like overkill for a landing page test, you can run a quick A/B test in Kirro and skip the manual calculation entirely.

One-tailed vs two-tailed p-values

A one-tailed test checks one direction (“is B better?”). A two-tailed test checks both (“is B different in either direction?”).

When you calculate a p-value, you need to decide: are you testing whether something is specifically better (or specifically worse)? Or are you testing whether it’s simply different in any direction?

Two-tailed (most common): Use this when you’re asking “is there any difference?” You don’t have a strong prediction about which direction. The p-value accounts for extreme results on both sides. This is the safer, more conservative choice.

One-tailed: Use this when you have a strong directional prediction before seeing any data. “I believe Version B will increase conversions.” The p-value only looks at one side, so it’s half the two-tailed value. Easier to reach significance, which is why you should only use it when the direction is genuinely known upfront.

Switching from two-tailed to one-tailed after seeing your results is cheating. It doubles your false positive rate. Simmons, Nelson, and Simonsohn showed that this kind of flexibility can push false-positive rates above 60%.
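The doubling is simple arithmetic on the normal curve. This sketch assumes a Z-test and alpha = 0.05, and the variable names are our own.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()

# Critical value for an honest one-tailed test at alpha.
z_crit = std_normal.inv_cdf(1 - alpha)

# Direction fixed in advance: only one tail can produce a false positive.
honest_rate = std_normal.cdf(-z_crit)

# Direction chosen after seeing the data: either tail now counts.
post_hoc_rate = 2 * std_normal.cdf(-z_crit)

print(f"honest: {honest_rate:.3f}, post hoc: {post_hoc_rate:.3f}")
```

Picking the tail after the fact quietly turns a 5% test into a 10% test.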

Most A/B testing tools default to one-tailed tests (“is the new version better?”). That’s reasonable when you’d never ship a change you believe is worse. For academic or medical research, two-tailed is standard. For a deeper comparison, see our guide on one-tailed vs two-tailed tests.

P-values in A/B testing

Most A/B testing tools calculate p-values for you. The hard part isn’t the math. It’s knowing when to trust the result.

If you’re here because you ran an A/B test, a few things are different from textbook statistics.

The peeking problem. In a textbook, you collect all your data, then calculate the p-value once. In reality, people check their A/B tests daily. Or hourly. Every peek is essentially another test, which inflates your false positive rate. Check 5 times at evenly spaced points during a test run and your actual false positive rate climbs from 5% to roughly 14%. Check 20 times and it approaches 25%.

Tools that account for this use an approach called sequential testing, which adjusts the significance threshold each time you check.
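You can estimate the peeking inflation yourself with a rough simulation. This sketch assumes no real effect, equally spaced looks, and a fixed two-tailed 0.05 threshold (z = 1.96) at every look. The function name, run count, and seed are arbitrary choices.

```python
import random
from math import sqrt


def peeking_false_positive_rate(peeks: int, runs: int = 20_000,
                                z_crit: float = 1.96, seed: int = 0) -> float:
    """Share of no-effect experiments declared significant at any peek."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        running_sum = 0.0
        for k in range(1, peeks + 1):
            # When the null is true, each equal-sized batch of data adds
            # one standard-normal increment to the running sum.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(k)
            if abs(z) > z_crit:
                false_positives += 1
                break  # the experimenter stops and declares a winner
    return false_positives / runs


print(f"1 look:  {peeking_false_positive_rate(1):.3f}")
print(f"5 looks: {peeking_false_positive_rate(5):.3f}")
```

A single look should come out near the nominal 5%. Every extra look pushes the rate higher, even though nothing changed.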

Practical significance matters more. A p-value of 0.001 on a 0.2% conversion lift means you detected a real but tiny effect. You’d need to ask: is 0.2% worth the engineering effort to ship? The minimum detectable effect you set before the test should answer this.

Bayesian A/B testing skips p-values entirely. Instead of asking “how weird is this data if nothing changed?”, Bayesian A/B testing asks “given the data, what’s the probability that B is better than A?” The answer comes as a direct probability, like “92% chance B wins.” No interpretation gymnastics.
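For a feel of where a number like “92% chance B wins” can come from, here is a generic Beta-Binomial sketch. It assumes a flat Beta(1, 1) prior and made-up conversion counts. It is not a description of how Kirro or any other specific tool computes its results.

```python
import random


def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 50_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical data: 100/1000 conversions for A, 130/1000 for B.
print(f"P(B beats A) = {prob_b_beats_a(100, 1000, 130, 1000):.2f}")
```

The output is a direct probability statement about the two versions, which is exactly the kind of answer a p-value cannot give.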

If you’re running website tests and don’t want to deal with p-value interpretation, try setting up a test in Kirro. It uses Bayesian math and gives you results in plain language. You get a straight answer instead of a number you have to decode.

FAQ

How do you calculate a p-value?

Start with your test statistic (Z-score, t-value, chi-square, or F-statistic). Determine degrees of freedom if your test requires them. Then use a calculator, spreadsheet function, or statistical software to find the area in the tail(s) of the appropriate distribution. That tail area is your p-value. For Excel, functions like T.DIST.2T() and CHISQ.DIST.RT() handle this in one step.

Is a p-value of 0.05 significant?

By convention, yes. A p-value at or below 0.05 is considered “statistically significant” at the 5% level. But 0.05 is a convention, not a natural law. Fisher picked it in 1925 because it was a convenient round number. Some fields use much stricter thresholds (0.005 or even 0.0000003 in particle physics). And a p-value of 0.049 isn’t meaningfully different from 0.051, even though one crosses the line and the other doesn’t. Guideline, not cliff edge.

Can a p-value be negative?

No. P-values are probabilities, so they always fall between 0 and 1. A very small p-value (like 0.0001) means extremely surprising data. A p-value of exactly 0 is theoretically impossible but some software displays “0.000” when the value is too small to show.

What does a high p-value mean?

A high p-value (say, 0.45) means your data is perfectly consistent with “nothing changed.” It does not prove that nothing changed. Maybe the effect is real but your sample was too small to detect it. If you suspect that, calculate the statistical power of your test. Low power plus high p-value usually means “inconclusive,” not “no effect.”
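A quick way to check whether your sample was too small is to approximate the power for the lift you care about. This sketch uses a normal approximation for a two-tailed two-proportion test. The rates are hypothetical and the function name is our own.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(rate_a: float, rate_b: float, n_per_variant: int,
                          alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed two-proportion Z-test."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = sqrt((rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / n_per_variant)
    shift = abs(rate_b - rate_a) / se
    # Chance the observed z lands beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical rates: 4% baseline vs 5% variant.
for n in (1_000, 5_000, 10_000):
    power = power_two_proportions(0.04, 0.05, n)
    print(f"{n:>6,} per variant: power = {power:.2f}")
```

If the power at your sample size is low, a high p-value says very little either way.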

What’s the difference between a p-value and a confidence interval?

A p-value tells you how surprising your data would be if there were no effect (one number to compare against your threshold). A confidence interval tells you how big that effect probably is (a range). They’re complementary. A result can be statistically significant (low p-value) but practically useless (the confidence interval shows the effect is tiny). When possible, report both.

How do I find a p-value from a Z-score?

Use the standard normal distribution. For a two-tailed test, find the area beyond your Z-score in both tails: p = 2 * (1 - NORM.S.DIST(ABS(Z), TRUE)) in Excel. For a one-tailed test, drop the “2 *” and just use = 1 - NORM.S.DIST(Z, TRUE) for a right-tailed test. A Z-score of 1.96 gives a two-tailed p-value of almost exactly 0.05.

Randy Wattilete

CRO expert and founder with nearly a decade running conversion experiments for companies from early-stage startups to global brands. Built programs for Nestlé, felyx, and Storytel. Founder of Kirro (A/B testing).
