Frequentist vs Bayesian statistics explained

Frequentist statistics asks “how likely is this data if nothing changed?” Bayesian statistics asks “given the data, how likely is it that something changed?” Same data. Different question.

The frequentist approach treats probability as long-run frequency. Flip a coin a million times, heads comes up about half the time. The Bayesian approach treats probability as a degree of belief. I’m 80% sure this coin is fair, and I’ll update that belief every time I see a new flip.

That’s the core split. Everything else (the math, the tools, the 250-year academic feud) grows from that one philosophical difference. And if you’re running A/B tests or just trying to make better decisions with data, this shapes the answers you get.

You probably don’t need to pick a side. But understanding both will make you a sharper reader of data.

Frequentist vs Bayesian at a glance

Frequentists ask “how surprising is this data?” Bayesians ask “how likely is this theory?”

	Frequentist	Bayesian
What is probability?	How often something happens over many repeats	How confident you are that something is true
Starting point	No prior beliefs. Just the data.	Start with a belief (called a prior), then update it
What you get back	P-values, confidence intervals	Posterior probabilities, credible intervals
Plain-English answer	“The data would be this extreme 3% of the time if nothing changed”	“There’s a 97% chance the new version is better”
Sample size needs	Generally needs more data	Can work with less (the prior fills gaps)
When to stop	You must decide the sample size upfront	You can check results as data arrives
Best for	Large datasets, regulatory work, academic research	Small samples, prior knowledge, business decisions

The Bayesian answer is usually what people actually want. “There’s an 89% chance Version B beats Version A” makes sense to anyone.

The frequentist equivalent? “If Version B were actually the same as A, we’d see data this extreme only 4% of the time.” Even statisticians wince saying it.

Our take: Most people hearing “frequentist vs Bayesian” for the first time realize they’ve been thinking like Bayesians their whole life. You check the weather forecast (prior), look out the window (data), and update your belief about whether to bring an umbrella. That’s Bayesian reasoning. We just didn’t call it that.

What is frequentist statistics?

Frequentist statistics treats probability as long-run frequency and uses only the data in front of you, without folding in prior beliefs.

The frequentist approach says: probability is about what happens over many, many repeats. A coin has a 50% chance of heads because, if you flipped it a million times, roughly half would be heads. The true probability is fixed. You just need enough data to find it.

In practice, that means a few specific tools:

A p-value answers one question: “If nothing actually changed, how often would I see data this extreme?” A low p-value (usually below 0.05) means the data would be pretty rare under “nothing changed,” so you reject that assumption. You can calculate one with a p-value calculator, but knowing what it means matters more than the number.
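
To make that concrete, here’s a minimal Python sketch of one common way to get a p-value for an A/B test: a pooled two-proportion z-test. The function name and the visitor counts are made up for illustration, and a given calculator may well use a different test.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under "nothing changed", both versions share one conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Tail area of the standard normal on both sides: P(|Z| >= |z|).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical test: 100 of 2,000 visitors convert on A, 130 of 2,000 on B.
p = two_proportion_p_value(conv_a=100, n_a=2000, conv_b=130, n_b=2000)
print(f"p-value: {p:.3f}")
```

With these made-up numbers the p-value lands at roughly 0.04, just under the usual 0.05 cutoff. Note what it does not say: it is not the probability that B is better.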

Then there’s hypothesis testing. You start with a null hypothesis (nothing changed) and try to disprove it. Think of a courtroom: the null hypothesis is “innocent until proven guilty.” Your data is the evidence. Strong enough evidence? Reject the null. Not strong enough? Verdict is “not guilty.” Not “innocent.” Just “not enough evidence.”

Confidence intervals trip up almost everyone. A 95% confidence interval doesn’t mean “there’s a 95% chance the true value is in this range.” It means: if you repeated this exact study 100 times, about 95 of those intervals would contain the true value. Even researchers get this wrong.
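
The “repeat the study many times” idea is easy to check by simulation. The sketch below assumes a known true conversion rate of 10% (something you never have in real life) and uses the simple normal-approximation interval. All names and numbers are illustrative.

```python
import math
import random


def coverage_of_95_ci(true_rate=0.10, n=1000, studies=2000, seed=1):
    """Share of simulated studies whose 95% interval contains the true rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        # One simulated study: n visitors, each converting with true_rate.
        conversions = sum(rng.random() < true_rate for _ in range(n))
        p_hat = conversions / n
        half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half_width <= true_rate <= p_hat + half_width:
            hits += 1
    return hits / studies


coverage = coverage_of_95_ci()
print(f"Intervals containing the true rate: {coverage:.1%}")
```

The share should come out close to 95%. The 95% describes the procedure across many studies, not any single interval.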

And statistical significance is the threshold where you say “this result probably isn’t random noise.” Usually set at 95% confidence, meaning you accept a 5% chance of a false positive (a Type 1 error) when nothing has actually changed.

The frequentist framework dominated science for most of the 20th century. Ronald Fisher built its foundation in the 1920s (his 1925 book Statistical Methods for Research Workers dismissed the Bayesian approach, then called inverse probability, as “founded upon an error”). Jerzy Neyman and Egon Pearson formalized hypothesis testing in the 1930s, adding statistical power and Type 2 errors. Together, these ideas became the default toolkit for clinical trials, psychology, and academic journals.

Where it struggles. Frequentist methods don’t let you incorporate what you already know. Every analysis starts from scratch.

They also produce answers that are famously hard to interpret. Ask five researchers what a p-value means and you’ll get six answers. And they require you to plan your sample size upfront. Peeking at results early inflates your false positive rate unless you use sequential testing corrections.

What is Bayesian statistics?

Bayesian statistics starts with what you already know, then updates that belief as new data arrives.

The Bayesian approach says: probability is about degrees of belief. You start with an initial estimate (called a prior), collect data, and update your estimate. The updated belief is called a posterior.

The formula behind it (Bayes’ theorem) is surprisingly simple: take what you believed before, multiply by how well the new data fits, and normalize. You don’t need the math to get the concept.

Think about weather forecasting. You wake up and check the forecast: 30% chance of rain. That’s your prior. You look outside and see dark clouds. That’s new data. You update your belief: maybe now it’s 70% rain. You grab an umbrella. That’s Bayesian reasoning, and you do it naturally every day.
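
Here’s the umbrella example as a small Python sketch of Bayes’ theorem. The 30% prior comes from the forecast above. The two likelihoods (how often you see dark clouds on rainy mornings versus dry ones) are assumptions, picked so the result matches the 70% in the example.

```python
def bayes_update(prior, p_data_if_true, p_data_if_false):
    """Posterior probability of a hypothesis after seeing one piece of data."""
    # What you believed before, times how well the data fits...
    numerator = prior * p_data_if_true
    # ...normalized by every way the data could have shown up.
    evidence = numerator + (1 - prior) * p_data_if_false
    return numerator / evidence


# Assumed likelihoods: dark clouds on 98% of rainy mornings, 18% of dry ones.
posterior = bayes_update(prior=0.30, p_data_if_true=0.98, p_data_if_false=0.18)
print(f"P(rain | dark clouds) = {posterior:.2f}")  # prints 0.70
```

Change the likelihoods and the posterior moves with them. That sensitivity is the whole point: the update is only as good as the inputs.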

Thomas Bayes’s core idea was published posthumously in 1763. Pierre-Simon Laplace formalized it in 1774. Then Fisher’s frequentist revolution pushed Bayesian methods out of mainstream science for decades.

They came back in the 1990s when computers got powerful enough to handle the calculations that pen-and-paper couldn’t. Today, nearly 60% of articles in the Journal of the American Statistical Association reference Bayesian methods. Quite a comeback for a 250-year-old idea.

What makes Bayesian different in practice:

The answers are intuitive. “There’s an 89% probability Version B is better” is something anyone can act on. No translation needed.

It works with small samples. When you don’t have much data, the prior stabilizes your estimates. This matters for businesses without millions of monthly visitors.

You can also check results as they come in without inflating error rates (mostly, with some caveats). And if you’ve run similar tests before, that experience feeds your analysis instead of being thrown away.

Where it struggles. The prior is subjective. Two analysts with different priors can reach different conclusions from the same data. Critics have called this the fundamental problem with Bayesian methods for 250 years.

And until recently, the calculations were too slow for complex problems without serious computing power.

If you want to see how Bayesian methods work in A/B testing specifically, that’s a different conversation. We cover the applied side in our guide to Bayesian A/B testing in practice.

Key differences: where frequentist and Bayesian actually diverge

The philosophical split matters less than the practical one: frequentist methods need big samples and upfront planning, while Bayesian methods are flexible but depend on your starting assumptions.

How they define probability

Frequentist: probability is the frequency of outcomes over infinite repeats. A die has a 1/6 chance of landing on 4 because, rolled enough times, it will land on 4 about 16.7% of the time.

Bayesian: probability is your confidence level. You might say “I’m 60% sure this die is fair” before rolling it, then update after seeing results.

How they handle prior knowledge

Frequentist analysis pretends you’ve never seen data before. Every study is a fresh start. This is sometimes a strength (objectivity) and sometimes a weakness (waste of useful information).

Bayesian analysis folds in what you already know. If you’ve tested 50 landing pages and headlines usually lift conversion by 5-15%, that knowledge becomes your prior for the 51st test.
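
One simple way to picture “the prior for the 51st test” is a normal prior on the lift combined with a noisy measurement. This sketch reads the 5-15% range as a prior centered on 10% with a standard deviation of 5 points. That reading, the observed 30% lift, and its 15-point standard error are all assumptions for illustration.

```python
def combine_prior_and_data(prior_mean, prior_sd, observed, observed_se):
    """Normal prior plus normal measurement gives a normal posterior."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / observed_se ** 2
    post_var = 1 / (prior_precision + data_precision)
    # Precision-weighted average: the more certain source gets more say.
    post_mean = post_var * (prior_mean * prior_precision + observed * data_precision)
    return post_mean, post_var ** 0.5


# Hypothetical small test: a 30% lift, measured very noisily.
mean, sd = combine_prior_and_data(
    prior_mean=0.10, prior_sd=0.05, observed=0.30, observed_se=0.15
)
print(f"Posterior lift: {mean:.0%} (sd {sd:.1%})")
```

The noisy 30% gets pulled most of the way back toward what past tests suggest, landing around 12%. A frequentist analysis of the same test would report the 30% with a wide interval and leave the history out.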

What the answers look like

Ask a frequentist “did this headline change conversion rates?” and you’ll hear: “The p-value is 0.03, so we reject the null hypothesis at the 0.05 significance level.”

Ask a Bayesian the same question: “There’s a 94% probability the new headline increased conversions, and the most likely lift is around 8%.”
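
A statement like “there’s an X% probability the new headline is better” can be computed by drawing from each version’s posterior and counting how often B wins. This is a minimal sketch assuming a uniform prior and made-up visitor counts. Real testing tools typically use more careful priors and models.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=50_000, seed=7):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta(1 + successes, 1 + failures) is the posterior for a
        # conversion rate when the prior is uniform on [0, 1].
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Hypothetical test: 100 of 2,000 convert on A, 130 of 2,000 on B.
chance = prob_b_beats_a(conv_a=100, n_a=2000, conv_b=130, n_b=2000)
print(f"Probability B beats A: {chance:.1%}")
```

The output is a single probability you can read straight off, which is why it plays better in meetings.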

One of these answers would make sense in a meeting with your boss. The other would get you blank stares.

Stopping rules and peeking

Frequentist tests are like sealed envelopes. You decide the sample size before you start, run the test, then open the envelope. If you peek early and act on what you see, your results become unreliable. You need sequential testing methods to correct for this.
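
You can watch peeking break a test in simulation. The sketch below runs A/A tests (both versions identical, so every “significant” result is a false positive) and compares checking once at the end with checking ten times along the way. Sample sizes, rates, and the function name are illustrative assumptions.

```python
import math
import random


def false_positive_rate(peeks, n_per_arm=1000, rate=0.10, trials=500, seed=3):
    """Share of A/A tests declared significant at any of `peeks` looks."""
    rng = random.Random(seed)
    # Evenly spaced looks; the last one is always the planned sample size.
    checkpoints = {n_per_arm * (i + 1) // peeks for i in range(peeks)}
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        for n in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n in checkpoints:
                pooled = (conv_a + conv_b) / (2 * n)
                std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
                # Stop and "ship" the first time |z| crosses 1.96.
                if std_err > 0 and abs(conv_b - conv_a) / n / std_err > 1.96:
                    false_positives += 1
                    break
    return false_positives / trials


once = false_positive_rate(peeks=1)
peeking = false_positive_rate(peeks=10)
print(f"One look: {once:.1%}, ten looks: {peeking:.1%}")
```

One look should sit near the promised 5%. Ten looks with the same 1.96 cutoff pushes the false positive rate well above that, which is the gap sequential methods are designed to close.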

Bayesian tests are more forgiving. The posterior probability is valid at any point, though recent research shows that using fixed decision thresholds while peeking continuously can still inflate false positives. No method is completely peek-proof.

Sample size requirements

Frequentist methods generally need larger samples to produce reliable results. The minimum detectable effect depends heavily on your traffic volume.

Bayesian methods can give useful answers with smaller datasets because the prior fills in gaps. But “useful” isn’t “magic.” With very little data, your results are mostly just your prior beliefs echoed back at you. The data has to do some actual work.
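
Here’s the “echo” effect in a few lines of Python, using a Beta prior on a conversion rate. The prior (worth about 100 past visitors at a 5% rate) and both datasets are assumptions for illustration.

```python
def posterior_mean(prior_a, prior_b, conversions, visitors):
    """Mean of the Beta posterior after observing conversions out of visitors."""
    return (prior_a + conversions) / (prior_a + prior_b + visitors)


# Prior: Beta(5, 95), i.e. roughly "100 earlier visitors, 5% converted".
tiny = posterior_mean(5, 95, conversions=2, visitors=4)        # 50% observed
large = posterior_mean(5, 95, conversions=200, visitors=1000)  # 20% observed

print(f"After 4 visitors:     {tiny:.1%}")
print(f"After 1,000 visitors: {large:.1%}")
```

Four visitors barely move the estimate off the 5% prior, even though half of them converted. A thousand visitors drag it most of the way to what the data says.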

Our take: The “which needs less data” debate is overblown. Bayesian methods handle small samples better, yes. But neither approach creates information from nothing. If you’re asking 4 people whether your shirt looks good, no statistical framework is going to save you. You need more people.

The replication crisis: why this debate actually matters

Misapplied frequentist methods contributed to a crisis in which a large share of published findings, most famously in psychology, couldn’t be reproduced, and that same problem shows up whenever anyone makes decisions with bad statistics.

This isn’t just academics arguing about philosophy. The frequentist-Bayesian debate has real consequences.

In 2015, the Open Science Collaboration tried to replicate 100 published psychology studies. Of the originals, 97% had reported statistically significant results. When other labs repeated them? Only 36% replicated. Nearly two-thirds of “proven” findings fell apart.

The problem wasn’t frequentist statistics itself. It was how people used them. Researchers chased p-values below 0.05 by running multiple tests, tweaking their analysis until something hit the threshold, or ignoring studies where nothing reached significance. Andrew Gelman (Columbia University) calls this the “garden of forking paths,” where ordinary analytical decisions compound into unreliable results.

It got bad enough that the American Statistical Association issued a formal statement on p-values in 2016, warning against their misuse. Three years later, a 2019 editorial called for abandoning the phrase “statistical significance” entirely. The ASA has existed since 1839. They don’t do things like this casually.

John Ioannidis put it most bluntly in his 2005 paper “Why Most Published Research Findings Are False,” the most cited paper in PLoS Medicine history. Using Bayesian reasoning, he showed that standard frequentist practices systematically produce false positives, especially in small studies testing many hypotheses.
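
The core of that argument fits in a short function. This is a simplified sketch, not Ioannidis’s full model (which also accounts for bias and multiple competing teams): given how plausible a hypothesis was beforehand, how likely is a “significant” finding to be true? The scenario numbers are assumptions.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Probability that a statistically significant finding is actually true."""
    true_positives = power * prior          # real effects that get detected
    false_positives = alpha * (1 - prior)   # null effects that slip through
    return true_positives / (true_positives + false_positives)


# Well-powered study of a plausible hypothesis.
strong = positive_predictive_value(prior=0.50, power=0.80)
# Small study fishing among long shots.
weak = positive_predictive_value(prior=0.10, power=0.20)

print(f"Plausible and well-powered: {strong:.0%}")
print(f"Long shot and underpowered: {weak:.0%}")
```

In the second scenario, fewer than a third of “significant” results are real, even though every one of them cleared p < 0.05.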

When statistics send people to prison

The stakes get much higher than academic papers.

In 1999, Sally Clark, a British mother, was convicted of murdering her two infant children. The key evidence? An expert testified the probability of two sudden infant deaths in one family was 1 in 73 million. The court treated this as “the chance she’s innocent is 1 in 73 million.”

That’s a textbook statistical error. The probability of the evidence given innocence is not the probability of innocence given the evidence. This confusion is called the prosecutor’s fallacy, and it’s the frequentist-Bayesian gap playing out in a courtroom.

A proper Bayesian analysis would factor in how rare double child murders are (very) and how rare double SIDS cases are (uncommon, but far more common than double murders). Clark spent three years in prison before her conviction was overturned.
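
The shape of that comparison can be sketched in a few lines. The 1 in 73 million figure is the one quoted at trial, taken at face value here. The double-murder rate is a made-up placeholder, not an estimate, chosen only to reflect “rarer than double SIDS.”

```python
def prob_natural_causes(p_two_natural_deaths, p_two_murders):
    """P(natural causes | two infant deaths), comparing only these two explanations."""
    # Both explanations fully account for the evidence (two deaths),
    # so what matters is how rare each one is relative to the other.
    return p_two_natural_deaths / (p_two_natural_deaths + p_two_murders)


p_sids = 1 / 73_000_000      # figure quoted at trial, taken at face value
p_murder = 1 / 500_000_000   # hypothetical placeholder for illustration

print(f"P(innocent | evidence) = {prob_natural_causes(p_sids, p_murder):.0%}")
```

Under these assumptions the answer is nowhere near 1 in 73 million. A tiny probability of the evidence says very little until you compare it with the probability of the alternative.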

In the Netherlands, nurse Lucia de Berk spent six years imprisoned after a similar statistical error. A Bayesian reanalysis of her case helped overturn the conviction in 2010.

What this means for your business

You’re probably not going to prison over a bad A/B test. But the same logic applies at smaller scales.

A marketing team chasing “statistical significance” by peeking at tests early, running multiple variations without correction, or stopping the moment they hit p < 0.05? That’s the same thing those researchers did. Just with conversion rates instead of psychology papers.
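
The “multiple variations without correction” part is simple arithmetic. Assuming the comparisons are independent and nothing actually works, the chance of at least one false win grows quickly with the number of variations. A Bonferroni correction (dividing the threshold by the number of comparisons) is one standard fix, shown here as a sketch.

```python
def chance_of_false_win(comparisons, alpha=0.05):
    """P(at least one false positive) across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


for k in (1, 5, 10, 20):
    corrected = 0.05 / k  # Bonferroni: a stricter threshold per comparison
    print(
        f"{k:>2} variations: {chance_of_false_win(k):.0%} uncorrected, "
        f"{chance_of_false_win(k, alpha=corrected):.1%} with Bonferroni"
    )
```

At ten variations the uncorrected chance of a false winner is about 40%. With the correction it stays just under 5%.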

Frank Harrell, a biostatistician at Vanderbilt who spent 30 years using frequentist methods before switching to Bayesian, puts it this way: “Investigators will be surprised to know how little we have learned from clinical trials that are not huge when p > 0.05.”

Do you actually have to choose? The modern answer

Most working statisticians and data scientists have moved on from the debate. The best approach depends on the problem, not the philosophy.

The frequentist vs Bayesian war has been going for over 250 years. The good news: it’s mostly over. Not because someone won. Because the practitioners got tired of fighting and started picking whichever tool worked for the problem in front of them.

Rafael Irizarry, a professor of biostatistics at Harvard (and previously at Johns Hopkins), put it plainly: “I declare the Bayesian vs. frequentist debate over.” His point? Good applied statisticians don’t pick a team. They think about the problem.

Andrew Gelman, one of the most influential Bayesian statisticians alive, titled a 2024 blog post “Bayesians are frequentists.” His argument: both sides ultimately want the same thing. You want your inferences to be right most of the time. That’s a frequentist goal, even if you use Bayesian methods to get there.

Gelman also co-authored a paper called “Holes in Bayesian Statistics” (2020), listing six specific weaknesses in Bayesian practice. Among them: flat priors lead to bad inferences, subjective priors are incoherent, and Bayes factors fail with weak priors. A leading Bayesian cataloging the holes in his own methods. That’s intellectual honesty.

Brad Efron, a Stanford statistician who invented the bootstrap (one of the most important frequentist tools), showed in a 2015 paper how to measure the frequentist accuracy of Bayesian estimates. The math doesn’t care about the labels.

What the tech companies actually do

So what do major companies actually use?

Microsoft runs about 100,000 A/B tests per year. Mostly frequentist (t-tests, p-values, chi-squared).
Netflix started frequentist, then built a “rich modular framework” including Bayesian models for growth experiments.
Spotify explicitly chose Group Sequential Tests (a frequentist method) to solve the peeking problem. They blogged their entire decision process.
VWO switched to Bayesian in 2016 and reported cutting test duration by up to 50%. Their reasoning: “The question businesses want answered is ‘What is the probability that page A is better than B.’ That’s a Bayesian question.”
Google used Bayesian methods in Google Optimize. When they shut it down and moved to Firebase, they switched back to frequentist.
LaunchDarkly lets you pick either method, explicitly because “statisticians prefer one approach while business stakeholders understand another better.”

The pattern? Big tech doesn’t pick sides. They pick tools that fit the problem. Many use both.

And in January 2026, the FDA published their first-ever draft guidance on using Bayesian methods in clinical trials for drug approval. The most conservative statistical institution in the world, officially saying “both methods have their place.” If that’s not the debate ending with a handshake, nothing is.

For A/B testing, the tool usually picks for you

Most modern A/B testing tools (including Kirro) have already made this choice. They use Bayesian statistics behind the scenes because the output is clearer: “Version B has an 89% chance of being better” is something a marketing team can act on without a statistics degree.

You don’t need to pick a statistical philosophy to run a test. The tool handles the math. Your job is asking the right question and giving the test enough visitors to answer it.

FAQ

Answers to the most common questions about frequentist and Bayesian statistics.

What is an example of frequentist statistics?

A coin flip. A frequentist says: “If I flip this coin 10,000 times, it’ll land heads about 50% of the time.” The probability comes from the long-run frequency of the event.

In the real world, clinical drug trials are the most common example. Pharmaceutical companies test whether a new drug works by giving it to a treatment group and a placebo to a control group, then checking whether the difference is statistically significant. Most academic research, quality control in manufacturing, and traditional A/B testing also use frequentist methods.

Is ChatGPT Bayesian?

Not exactly. Large language models like ChatGPT are trained using maximum likelihood estimation, which is a frequentist method.

But a 2025 study found that transformers are “Bayesian in expectation, not in realization.” In plain English: when they adjust their responses to new information within a conversation, that mimics Bayesian updating. A separate study found GPT-4o achieved “superhuman and nearly perfect Bayesian classifications” on decision tasks where GPT-3.5 did worse than humans.

So the training is frequentist. The behavior looks Bayesian. Even the AI is hedging its bets.

Is Monte Carlo frequentist or Bayesian?

Neither. And both. Monte Carlo is a computational technique (using random sampling to solve math problems), not a statistical philosophy.

Frequentists use it for bootstrap simulations. Bayesians use a version called MCMC (Markov Chain Monte Carlo) to calculate posterior distributions that would be impossible to solve by hand. The rise of MCMC in the 1990s is actually what made Bayesian methods practical. Before fast computers, Bayesian calculations were often too complex to finish.
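
For a feel of the frequentist side, here’s a minimal percentile-bootstrap sketch: resample your data with replacement many times and read an interval off the spread of the results. The order values are made up, and the percentile method is only one of several bootstrap variants.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, resamples=5000, level=0.95, seed=11):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    # Each resample draws len(data) values from the data, with replacement.
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(resamples)
    )
    lower_idx = int((1 - level) / 2 * resamples)
    upper_idx = int((1 + level) / 2 * resamples) - 1
    return estimates[lower_idx], estimates[upper_idx]


# Hypothetical order values from a small shop.
order_values = [12, 15, 9, 30, 22, 18, 11, 45, 16, 20, 14, 25]
low, high = bootstrap_ci(order_values)
print(f"95% bootstrap interval for the mean: {low:.1f} to {high:.1f}")
```

No priors and no posterior here, just random sampling doing the math that a formula would otherwise do. MCMC uses the same raw ingredient for a different job.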

Why is Bayesian statistics controversial?

The main objection: priors are subjective. Two analysts with different starting beliefs can reach different conclusions from identical data. For 250 years, critics have argued this makes Bayesian methods unscientific.

Defenders counter that frequentist methods also involve subjective choices (which test to use, what significance level, when to stop collecting data). They just don’t label them as subjective. Steven Goodman, a Stanford biostatistician, cataloged 12 common misconceptions about p-values and argued that Bayes factors have “virtually all of the desirable properties” that p-values lack.

The controversy has cooled a lot. Most modern statisticians view both as useful tools, not warring philosophies.

Which is better for small sample sizes?

Bayesian methods generally handle small samples better because they incorporate prior knowledge to stabilize estimates.

If you’ve only got 200 visitors to your landing page, a frequentist test will likely produce wide confidence intervals or fail to reach significance, even if a real difference exists. A Bayesian approach can combine your prior experience (say, from previous tests) with the limited data to give a useful estimate.

This is one reason tools like Kirro use Bayesian statistics. Most small businesses don’t have enterprise-level traffic, and waiting months for a frequentist test to reach significance isn’t practical.

But no method creates information from nothing. With very small samples, your results are mostly just your prior beliefs reflected back at you. The data still has to do some work.

Randy Wattilete

CRO expert and founder with nearly a decade running conversion experiments for companies from early-stage startups to global brands. Built programs for Nestlé, felyx, and Storytel. Founder of Kirro (A/B testing).

