A/B testing UX: a guide for designers and teams

A/B testing in UX means showing two design versions to real visitors and measuring which one performs better. Not “which one the team likes more in a meeting.” Which one actual people prefer, based on what they do on your site.

That’s the short version. The longer version? A/B testing is one of the most useful tools a UX designer can reach for. It’s also one of the most misunderstood. Marketers treat it as a conversion machine. Designers sometimes treat it as a threat to creativity. The truth depends on how you use it.

This guide covers when A/B testing makes sense for UX work and where it fits alongside other research methods. Plus the trap that catches even the most data-driven teams.

What A/B testing means for UX teams

A/B testing shows two versions of a design to different visitors and measures which one leads to better outcomes.

You take your current page (that’s your control), create a second version with a specific change, then split your traffic between them. After enough visitors see both, you compare the results.
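Splitting traffic is usually done by hashing a visitor ID, so the same person keeps seeing the same version on every visit. Here is a minimal sketch of that idea in Python. The function name, the experiment label, and the 50/50 default are assumptions for illustration, not how any particular testing tool does it.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically bucket a visitor into 'control' or 'variant'.

    Hypothetical helper: `split` is the share of traffic sent to control.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "control" if bucket < split else "variant"


if __name__ == "__main__":
    for visitor in ("visitor-1", "visitor-2", "visitor-3"):
        print(visitor, assign_variant(visitor, "checkout-v2"))
```

Including the experiment name in the hash means a visitor who lands in the control of one test is not automatically in the control of every other test.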

Simple concept. The UX version looks different from the marketing version, though.

Marketers usually care about one thing: conversions. Did more people click the button? Buy the thing? Fill out the form? That’s the whole story.

UX teams ask better questions. Did people finish the task faster? Did fewer people angrily click the same button over and over (designers call that rage-clicking)? Did the checkout flow cause fewer drop-offs at step three?

The metric changes because the goal changes. Marketing A/B testing asks “did they convert?” UX A/B testing asks “did they have a better experience?”
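Experience metrics like rage-clicking can be computed from plain click logs. The sketch below counts bursts of repeated clicks on the same element. The thresholds (three or more clicks within one second) and the event format are assumptions for illustration; analytics tools define rage clicks in their own ways.

```python
from collections import defaultdict


def count_rage_click_bursts(clicks, min_clicks=3, window_seconds=1.0):
    """Count bursts of rapid repeat clicks on the same element.

    clicks: iterable of (timestamp_in_seconds, element_id) pairs.
    The thresholds are illustrative assumptions, not an industry standard.
    """
    by_element = defaultdict(list)
    for timestamp, element_id in clicks:
        by_element[element_id].append(timestamp)

    bursts = 0
    for timestamps in by_element.values():
        timestamps.sort()
        i = 0
        while i < len(timestamps):
            # Extend j while clicks stay inside the window that starts at click i.
            j = i
            while (j + 1 < len(timestamps)
                   and timestamps[j + 1] - timestamps[i] <= window_seconds):
                j += 1
            if j - i + 1 >= min_clicks:
                bursts += 1
                i = j + 1  # skip past this burst so it is counted once
            else:
                i += 1
    return bursts


# Hypothetical session: four fast clicks on "pay", then normal behavior.
session = [(0.0, "pay"), (0.3, "pay"), (0.6, "pay"), (0.8, "pay"),
           (2.0, "help"), (5.0, "pay"), (9.0, "pay")]
print(count_rage_click_bursts(session))
```

Compare the number of bursts per session between your two versions, the same way you would compare conversion rates.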

That difference matters more than it sounds. Deceptive design tricks (the industry calls them dark patterns) can boost conversions while making the experience worse. Research from the University of Chicago found that aggressive dark patterns tripled conversion rates (from 11% to 37%). The conversion number went up. The experience went down.

Our take: If your A/B test only tracks conversion rate, you’re measuring the business outcome without checking whether you broke the experience to get there. Always add a guardrail metric, something like task completion time or error rate, that catches collateral damage.

Where A/B testing fits in the UX research toolkit

A/B testing tells you WHAT works. Other methods tell you WHY.

This is where most articles on A/B testing UX get it wrong. They treat A/B testing like it’s the only research method that matters. It’s not. It’s one tool in a toolkit that includes at least a dozen options.

Nielsen Norman Group maps UX research methods across two dimensions: qualitative vs. quantitative, and attitudinal (what people say) vs. behavioral (what people do). A/B testing sits in one quadrant: quantitative + behavioral. It measures what people do, not what they say or think.

And that means A/B testing has blind spots. Big ones.

Method	What it tells you	People needed	Best for
A/B testing	Which version performs better	Hundreds to thousands	Validating a specific design change
Usability testing	Where people struggle and why	5 to 10	Finding problems before you know what to fix
Card sorting	How people organize information	15 to 30	Navigation and information architecture
Surveys	What people say they want	50+	Gathering opinions at scale
Session recordings	What people actually do (unguided)	Dozens	Spotting unexpected behavior patterns

When should you reach for A/B testing? When three things are true:

You have enough traffic (at least a few hundred daily visitors)
You have a clear metric to measure
You have two viable design options ready to compare

If you don’t have the traffic, try usability testing instead. Five people watching your checkout flow will teach you more than an A/B test that never reaches confidence. For low-traffic sites, alternatives to A/B testing work with smaller numbers.

The 2025 State of User Research report tells the same story. UX practitioners still use qualitative methods far more than quantitative ones. Usability testing and interviews dominate. A/B testing doesn’t crack the top three.

That’s not because A/B testing is bad. It’s because most UX teams know it’s only part of the picture.

What UX elements to A/B test (and what to skip)

Test things that change user behavior, not things that change how the page looks.

Not every design decision needs an A/B test. And some of the things teams test most often are a waste of time.

High-impact UX elements worth testing:

Checkout flows and form layouts (Baymard Institute found that simplifying checkout UX can lift conversions by 35% for large e-commerce sites)
Navigation patterns and menu structures
Call-to-action placement and wording
Onboarding sequences
Page layouts on key conversion pages
Product detail pages (47% of all A/B tests run on product pages, and they have the highest win rate)

What to skip:

Font choices (test a headline rewrite instead)
Subtle color variations (the “41 shades of blue” problem, more on that below)
Icon styles and micro-interactions
Decorative elements that don’t affect user decisions

The difference? High-impact tests change what people do. Low-impact tests change what people see without affecting their behavior.

Some famous examples. Jared Spool’s “$300 Million Button” case study: replacing a “Register” button with a “Continue” button (plus a note that no account was needed) at checkout increased purchases by 45%. That’s $300 million in additional revenue in the first year. Tiny UX change. Massive behavioral change.

The Obama 2008 campaign tested 24 homepage variations (images and button text). The winning combination drove 40% more email signups, about $60 million in additional donations.

Meanwhile, you could test 41 shades of blue on a link. You’d learn which shade people click 0.1% more often. Big difference in how you spend your time.

If you’re looking to split test landing pages specifically, the same rule applies: test the things that affect decisions, not the things that affect aesthetics.

How to run a UX-focused A/B test

Start with a UX problem, not a metric target.

Here’s a step-by-step process designed for UX work. It’s different from a marketing A/B test because it starts with understanding the person, not chasing a number.

Step 1: Find the friction first.

Don’t start with “let’s test something.” Start with “where are people struggling?” Look at session recordings, support tickets, usability test findings, or CRO testing data. The best A/B tests come from real problems, not hunches.

NN/g researcher Jen Cardello calls this the “garbage in, garbage out” problem. Without UX research upfront, you end up testing assumptions instead of real user problems. And even the best A/B test can’t find the right answer if you’re asking the wrong question.

Step 2: Write a hypothesis.

“If we reduce the checkout form from six fields to three, more people will complete their purchase because the perceived effort drops.” That’s a hypothesis. “Let’s see if a green button works better” is not.

The hypothesis should include what you’re changing, what you expect to happen, and why. That “why” part is what separates UX testing from random guessing.
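If your team logs experiments in code or a spreadsheet export, it can help to make the three parts mandatory. This is a small sketch of one way to do that. The class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A test hypothesis with its three required parts (hypothetical structure)."""
    change: str            # what you're changing
    expected_outcome: str  # what you expect to happen
    rationale: str         # why you expect it

    def __post_init__(self):
        # Refuse hypotheses that skip any part, especially the "why".
        for field_name in ("change", "expected_outcome", "rationale"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"Hypothesis is missing its {field_name}")

    def statement(self) -> str:
        return f"If we {self.change}, {self.expected_outcome} because {self.rationale}."


checkout = Hypothesis(
    change="reduce the checkout form from six fields to three",
    expected_outcome="more people will complete their purchase",
    rationale="the perceived effort drops",
)
print(checkout.statement())
```

A “let’s see if a green button works better” idea fails the check, because there is nothing to put in the rationale.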

Step 3: Design two versions.

Keep the change meaningful but isolated. If you change the layout, the copy, and the button color all at once, you won’t know which change made the difference. For that you’d need multivariate testing, which means testing multiple changes at the same time. And you’d need a lot more traffic.
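Why does multivariate testing need so much more traffic? Every extra element multiplies the number of versions, and each version only gets a slice of your visitors. A quick sketch, with made-up element names and an even traffic split assumed:

```python
from itertools import product


def multivariate_cells(options_per_element):
    """List every combination of options, one dict per version."""
    names = list(options_per_element)
    return [dict(zip(names, combo))
            for combo in product(*options_per_element.values())]


def visitors_per_cell(daily_visitors, cells):
    """Daily visitors each version receives, assuming an even split."""
    return daily_visitors / len(cells)


# Hypothetical test: three elements with two options each.
cells = multivariate_cells({
    "layout": ["current", "single-column"],
    "copy": ["current", "shorter"],
    "button_color": ["blue", "green"],
})
print(len(cells), "versions,", visitors_per_cell(800, cells), "visitors each per day")
```

Three elements with two options each already means eight versions instead of two.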

Step 4: Pick your metrics carefully.

Choose a primary metric (the thing you’re trying to improve) and at least one guardrail metric (the thing you don’t want to break). If you’re testing a new checkout flow, your primary metric might be completion rate and your guardrail might be average order value. For help choosing, see our guide to A/B testing metrics.
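The primary-plus-guardrail idea can be written down as a simple decision rule. The sketch below is an illustration only: the 2% tolerance, the names, and the wording of the outcomes are assumptions, and it deliberately ignores statistical significance, which you would check separately.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    control: float
    variant: float
    higher_is_better: bool = True

    def improvement(self) -> float:
        """Relative change, signed so that positive always means 'better'."""
        change = (self.variant - self.control) / self.control
        return change if self.higher_is_better else -change


def review(primary: Metric, guardrail: Metric, tolerated_drop: float = 0.02) -> str:
    """Hypothetical decision rule: a win only counts if the guardrail holds."""
    if primary.improvement() <= 0:
        return "keep control"
    if guardrail.improvement() < -tolerated_drop:
        return "hold: primary improved but guardrail regressed"
    return "ship variant"


# Made-up numbers: completion rate goes up, average order value goes down.
print(review(
    Metric("completion rate", control=0.60, variant=0.66),
    Metric("average order value", control=50.0, variant=44.0),
))
```

For metrics where lower is better, such as error rate or time on task, set higher_is_better to False so a drop counts as an improvement.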

Harvard Business Review research found that focusing on averages hides problems. A design change might boost one group by 30% while making things worse for another. Segment your results.
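Here is what that looks like with numbers. The figures below are invented for illustration: the variant looks like a modest win overall, yet it is 30% better on desktop and 20% worse on mobile.

```python
from collections import defaultdict


def conversion_rates(rows):
    """Conversion rate per (segment, version), plus an 'all' segment.

    rows: iterable of (segment, version, visitors, conversions).
    """
    totals = defaultdict(lambda: [0, 0])
    for segment, version, visitors, conversions in rows:
        for key in ((segment, version), ("all", version)):
            totals[key][0] += visitors
            totals[key][1] += conversions
    return {key: conv / vis for key, (vis, conv) in totals.items()}


# Hypothetical results, not real data.
results = [
    ("desktop", "control", 1000, 100),
    ("desktop", "variant", 1000, 130),
    ("mobile", "control", 1000, 100),
    ("mobile", "variant", 1000, 80),
]
rates = conversion_rates(results)
for key in sorted(rates):
    print(key, f"{rates[key]:.1%}")
```

If you only looked at the “all” rows, you would ship a change that hurts every mobile visitor.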

Step 5: Let the test run long enough.

This is where most teams mess up. Data from thousands of e-commerce tests shows the median test runtime is 42 days. Not 3 days. Not a week. 42 days.

Research from Stanford and Optimizely found the reason patience matters. When teams peek at results early and stop the test when things look good, false positive rates jump from 5% to over 40%. Put differently, even when two versions perform identically, peeking can crown a “winner” more than 40% of the time.
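You can see the peeking problem for yourself with a simulation. The sketch below runs A/A tests, where both versions are identical, so every “winner” is a false positive. It compares stopping at the first significant peek with waiting until the end. The traffic numbers, the 10% conversion rate, and the 20 peeks are assumptions, and the exact inflation depends on how often you peek.

```python
import math
import random


def is_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for a difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = (conv_a / n_a - conv_b / n_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_value < alpha


def simulate_aa_tests(experiments=400, peeks=20, visitors_per_peek=100,
                      rate=0.10, seed=1):
    """Return (false positive rate with peeking, rate when waiting to the end)."""
    rng = random.Random(seed)
    flagged_while_peeking = 0
    flagged_at_end = 0
    for _experiment in range(experiments):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _peek in range(peeks):
            n += visitors_per_peek
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_peek))
            if is_significant(conv_a, n, conv_b, n):
                ever_significant = True  # a peeking team would stop here
        flagged_while_peeking += ever_significant
        flagged_at_end += is_significant(conv_a, n, conv_b, n)
    return flagged_while_peeking / experiments, flagged_at_end / experiments


if __name__ == "__main__":
    peeking, waiting = simulate_aa_tests()
    print(f"False positives when peeking: {peeking:.0%}")
    print(f"False positives when waiting: {waiting:.0%}")
```

The peeking rate can never come out lower than the waiting rate here, because the final look is also one of the peeks. More peeks means more chances for random noise to cross the line.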

If you need a specific visitor count, use a sample size calculator before you start.
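Under the hood, most sample size calculators use some version of the standard two-proportion formula. Here is a rough sketch of it, assuming a classic fixed-length test with 5% significance and 80% power. Individual calculators may use different defaults or different statistics, so treat the output as a ballpark.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per version to detect a relative lift.

    Uses the normal-approximation formula for comparing two proportions.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(n_per_variant, daily_visitors, variants=2):
    """Days needed if all daily visitors are split across the versions."""
    return math.ceil(n_per_variant * variants / daily_visitors)


if __name__ == "__main__":
    n = sample_size_per_variant(baseline_rate=0.10, relative_lift=0.10)
    print(n, "visitors per version")
    print(days_to_run(n, daily_visitors=500), "days at 500 visitors a day")
```

With a 10% baseline conversion rate and a 10% relative lift, this works out to roughly 14,700 visitors per version, or about two months at 500 visitors a day. Smaller lifts need far more traffic, which is why tiny tweaks are so hard to test on small sites.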

Step 6: Read the results in context.

A winning version tells you which design performed better. It doesn’t tell you why. Pair your quantitative results with qualitative follow-up. Review session recordings of both versions. Run a quick usability test on the winner. Check whether the improvement holds across device types.

Tools like Kirro make the testing part simple. Visual editor, no code, and Bayesian stats that give you a straight answer without needing a massive sample. But the thinking before and after the test is what separates UX testing from just running numbers.
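For the curious, the “straight answer” from Bayesian stats is typically a probability that the variant beats the control. The sketch below shows one common way to estimate it, using a Beta-Binomial model with a flat prior and random sampling. This is a generic textbook approach, not a description of how Kirro or any other tool calculates its results.

```python
import random


def probability_variant_beats_control(control_conversions, control_visitors,
                                      variant_conversions, variant_visitors,
                                      draws=20000, seed=0):
    """Estimate P(variant rate > control rate) by sampling from Beta posteriors.

    Assumes a flat Beta(1, 1) prior for both versions.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        control_rate = rng.betavariate(
            1 + control_conversions, 1 + control_visitors - control_conversions)
        variant_rate = rng.betavariate(
            1 + variant_conversions, 1 + variant_visitors - variant_conversions)
        wins += variant_rate > control_rate
    return wins / draws


if __name__ == "__main__":
    # Made-up numbers: 100 of 1,000 convert on control, 130 of 1,000 on variant.
    chance = probability_variant_beats_control(100, 1000, 130, 1000)
    print(f"Chance the variant is better: {chance:.0%}")
```

A value near 50% means the data can’t tell the two versions apart yet. A value close to 100% means the variant is very likely better, though it says nothing about why.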

Our take: The test itself is the easy part. The hard part is figuring out what to test and understanding why it worked. Spend 80% of your effort on research and interpretation, 20% on the test mechanics.

The local maximum trap

A/B testing can perfect a bad design. It can’t invent a better one.

This is the thing none of the top search results for “A/B testing UX” talk about. And it’s the most important concept UX teams need to understand.

The local maximum is what happens when you’ve squeezed every possible improvement out of your current design through incremental testing. Your tests keep coming back flat. Small wins get smaller. You’re at the peak of the current mountain. You just can’t see there’s a taller one right next to you.

A/B testing is brilliant at climbing hills. It’s terrible at finding new hills to climb.

In 2009, Google’s visual design lead Doug Bowman left the company. The reason? The data-driven culture had teams testing 41 shades of blue to find the “best” link color. Bowman argued that micro-optimization like that killed bold design decisions. He wasn’t wrong.

The most dramatic real-world example? Booking.com runs over 25,000 A/B tests per year. Arguably the most-tested website on the planet. And EyeQuant research found they score 32 out of 100 on visual clarity. The industry average is 71.

Twenty-five thousand tests. The most cluttered interface in the industry.

That’s the local maximum in action. Each individual test probably improved the specific metric it was measuring. But the sum of thousands of “winning” tests created an interface that’s genuinely hard to use. No single A/B test can fix that, because the problem isn’t any one element. It’s the whole approach.

How to escape it:

CXL’s research on the local maximum recommends alternating between two modes:

Optimization cycles: Use A/B testing to refine your current design. Small, targeted changes. This is what most teams do.
Innovation cycles: Step back. Do user research. Prototype something completely different. Then A/B test the radical redesign against the current version.

The teams that get stuck are the ones that only do mode 1. They optimize forever but never innovate.

Jakob Nielsen put it bluntly in 2005: qualitative UX insights can yield “100% improvements” while A/B testing typically delivers “1-2% improvements.” Both have their place. But if you only do one, you’re leaving the big gains on the table.

Combining A/B testing with qualitative UX research

The best UX teams use qualitative research to find problems and A/B testing to confirm solutions.

This is where the real power is. Not A/B testing alone. Not usability testing alone. The cycle between them.

Observe: Session recordings or usability tests reveal that visitors can’t find the pricing page (if the pricing page itself is the problem, see our guide on pricing page A/B tests)
Hypothesize: “Moving pricing to the main navigation will reduce bounce rate on the comparison page”
Test: A/B test the current navigation against the new version
Learn: The new version wins by 12%. But session recordings also show people now skip the features page entirely
Observe again: Start the cycle over with the new finding

Yann Riche, a UX researcher at Microsoft and Google, described a case that shows this perfectly. His team spent months A/B testing different versions of a conversion dialog. Nothing worked. Every test came back flat.

Then they did qualitative research. The real issue wasn’t the dialog text at all. It was a psychological barrier nobody had considered. They designed a variant that addressed the actual problem, found through research, not testing. The new version projected over $300,000 in additional annual revenue.

A/B testing couldn’t generate that hypothesis. It could only evaluate it.

The numbers back this up. Only about one-third of A/B tests at Microsoft and Google produce positive results. The rest are flat or negative. DRIP Agency’s e-commerce data confirms it: only 36.3% produce a winner.

Better hypotheses from qualitative research improve that hit rate. You’re not guessing anymore. You’re testing solutions to problems you’ve actually observed.

Erin Weigel ran over 1,400 experiments at Booking.com. One thing she noticed: when a test “loses,” teams abandon the idea entirely. But the concept and the execution are separate variables. A good idea with a poor implementation will lose. That doesn’t mean the idea was wrong.

Qualitative follow-up after a losing test is just as valuable as follow-up after a winner.

If you’re working on UX conversion optimisation more broadly, this combined approach is how top teams improve their win rates. Want to try running your first test? The research-first mindset applies whether you’re using Kirro or anything else.

Common UX A/B testing pitfalls

Most UX A/B testing failures happen before the test even starts.

A few traps that catch even experienced UX teams. (For the full list, see our deep-dive on common A/B testing mistakes.)

Testing too many things at once. If you change the hero image, the headline, and the CTA simultaneously, you’ve learned nothing about which change mattered. Either isolate your changes or use multivariate testing to handle multiple variables (see our comparison of A/B testing vs multivariate testing).

Ignoring mobile vs. desktop. A layout that wins on desktop can tank on mobile. Segment your results by device. Always.

Stopping too early. The Stanford/Optimizely peeking research above says it all. Checking results daily and stopping the moment you see a “winner” can push your false positive rate to around 40%, so many of those wins are just noise. Let the test run its course. Bayesian testing approaches can help because they are built to be read while the test is running, but they still need enough data before you call a winner.

Chasing a single metric. A design change can boost one metric while destroying another. Harvard Business Review found that a change can help casual visitors while alienating your most active ones. Aggregate numbers hide the damage.

Killing good ideas too early. Erin Weigel’s insight from 1,400 Booking.com experiments: a test can fail because the execution was poor, not because the idea was wrong. Before you throw out a losing concept, ask whether the implementation gave it a fair shot.

Tools like Kirro include guardrail metrics specifically to catch the “won the metric, broke the experience” problem. You set up your test with both a primary goal and a safety metric, so you know whether a “win” is actually worth shipping.

FAQ

Quick answers to the questions UX teams ask most about A/B testing.

What is A/B testing in UX design?

A/B testing in UX design means comparing two design versions with real visitors to see which delivers a better experience. The key difference from marketing A/B testing? UX teams measure behavior metrics like task completion, time on task, and error rates alongside conversion rates. Not just conversion alone.

What is the difference between usability testing and A/B testing?

Usability testing is qualitative. You watch 5 to 10 people use your site and observe where they struggle. It tells you what’s broken and why. A/B testing is quantitative. You show two versions to hundreds of visitors and measure which performs better. It tells you which option wins but not why. A/B testing best practices call for combining both methods.

When should UX designers use A/B testing?

When you have enough traffic (hundreds of daily visitors), a clear metric to measure, and two viable design options. Don’t use it for early-stage exploration (usability testing is better). Don’t use it when traffic is very low (the test will take months). And don’t use it when you need to understand WHY something isn’t working. A/B testing only tells you WHAT.

Can you A/B test with low traffic?

You can, but it takes longer. Low traffic means longer test durations to reach confidence in your results. If your site gets fewer than 100 daily visitors, look at alternatives to A/B testing. Usability testing, preference testing, and the five-second test method all work with smaller numbers. For the exact visitor count you need, check our sample size formula guide.

What is the difference between UAT and A/B testing?

UAT (user acceptance testing) checks whether software meets its requirements before launch. It’s a pass/fail gate. A/B testing compares two live design versions to see which performs better. Different purposes, different stages. UAT happens before release. A/B testing happens after, with real visitors, to improve what’s already live. If you’re working with a product management team, both have a role, just at different points in the process.

Randy Wattilete

CRO expert and founder with nearly a decade running conversion experiments for companies from early-stage startups to global brands. Built programs for Nestlé, felyx, and Storytel. Founder of Kirro (A/B testing).
