A/B Testing Guide: Avoid the Mistakes That Fool You

What Is A/B Testing?

A/B testing (also called split testing) is a method of comparing two versions of a webpage, email, or other marketing asset to see which one performs better. You show version A to half your audience and version B to the other half, then measure which version achieves more conversions.

The concept is simple. However, running A/B tests correctly is surprisingly difficult. Most businesses make fundamental mistakes that lead to false conclusions and wasted effort.

In other words, they think they’re making data-driven decisions when they’re actually just fooling themselves. This guide walks through how A/B testing works, how much traffic you really need, the statistical traps that quietly invalidate most tests, and the alternatives worth considering when classic A/B testing isn’t a fit.

How A/B Testing Works

The basic process looks straightforward:

Create a hypothesis — “Changing the button color from blue to green will increase clicks”
Build two versions — Control (A) keeps the original, Variant (B) has the change
Split traffic randomly — 50% sees A, 50% sees B
Measure results — Track conversions for each version
Declare a winner — Pick the version with better performance

This sounds easy. Unfortunately, each step hides potential pitfalls that can completely invalidate your results.

The Statistical Significance Problem

Here’s where most A/B tests go wrong: people don’t understand what “statistical significance” actually means.

Statistical significance is usually summarized as the probability that your results occurred by chance: on that reading, a 95% confidence level means there’s only a 5% chance that the difference you observed was random noise.

However, this doesn’t mean what most people think it means.

What 95% Confidence Actually Means

A common misconception is that 95% confidence means “there’s a 95% chance the winning variant is actually better.” That interpretation is incorrect.

Instead, it means: “If there were no real difference between A and B, you’d see results this extreme only 5% of the time.”

This distinction matters. Even with 95% confidence, you can still be wrong. In fact, if you run 20 tests where there’s no real difference, you should expect about one false positive purely by chance, and there’s roughly a 64% chance of getting at least one.

Understanding P-Values in A/B Testing

The p-value is the engine behind statistical significance. It answers a single question: assuming the null hypothesis is true (no real difference between A and B), how likely are you to see results at least this extreme?

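A p-value of 0.05 means that, if there were no real difference, you’d see a result at least this extreme about 5% of the time. It is not the probability that the observed difference is random noise. Most teams set 0.05 as their threshold and call anything below it “significant.”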

That’s a convention, not a law. Microsoft, Google, and Booking publish on much stricter thresholds — Microsoft’s experimentation team has documented why large-scale platforms use p<0.01 for high-stakes decisions. The right threshold depends on the cost of a wrong call. Cheap, reversible changes can tolerate p<0.10. Pricing or checkout changes deserve p<0.01.
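To make the p-value less abstract, here is a minimal sketch of how one could be computed for two conversion rates. It assumes a pooled two-proportion z-test with a normal approximation, which is one common choice and not necessarily what your testing tool uses; the function name and the visitor counts are illustrative.

```python
import math

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both variants share one conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)           # shared rate under the null
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))            # two-sided normal tail area

# Hypothetical data: 3.00% vs 3.45% conversion on 10,000 visitors each.
p = two_proportion_p_value(conv_a=300, n_a=10_000, conv_b=345, n_b=10_000)
print(f"p-value: {p:.3f}")
```

With these made-up numbers, a 15% relative lift on 10,000 visitors per variant still lands above the 0.05 threshold, which is a reminder of how much traffic a modest lift needs.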

The Peeking Problem

The most common A/B testing mistake is checking results too early and stopping when they look good.

Suppose you’re running a test and after three days, version B shows a 15% improvement with 92% confidence. It’s tempting to declare victory and implement the change.

Don’t do it.

Early results are unreliable because they’re based on insufficient data. Consequently, the “winning” variant often regresses to the mean once you collect more data. What looked like a 15% improvement might turn out to be noise.

This phenomenon is called “peeking” or “optional stopping,” and it dramatically increases your false positive rate. If you check results daily and stop whenever significance is reached, your actual false positive rate can climb to 20-30% or more, far higher than the 5% you think you have.
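You can see the inflation for yourself with a small simulation. The sketch below is a simplified model, not a replay of any real test: it assumes there is no true difference (an A/A test), that each day contributes an equal batch of traffic, and that the daily evidence is approximately normal. The function name, the 14 daily looks, and the number of simulated tests are all illustrative choices.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% significance threshold

def false_positive_rates(num_looks: int = 14, num_sims: int = 20_000, seed: int = 0):
    """Simulate A/A tests; return (rate if you stop at any significant look,
    rate if you only analyze once at the planned end)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(num_sims):
        running_sum = 0.0
        crossed_any = False
        z = 0.0
        for look in range(1, num_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)     # one day's standardized evidence under H0
            z = running_sum / math.sqrt(look)      # cumulative z-score after `look` days
            if abs(z) > Z_CRIT:
                crossed_any = True                 # a peeker would have stopped here
        peeking_hits += crossed_any
        final_hits += abs(z) > Z_CRIT              # single analysis on the last day
    return peeking_hits / num_sims, final_hits / num_sims

peeking_rate, final_rate = false_positive_rates()
print(f"stop at first significant look: {peeking_rate:.1%}")
print(f"analyze once at the end:        {final_rate:.1%}")
```

In this model the single planned analysis stays close to the nominal 5%, while stopping at the first significant daily look produces a false positive rate several times higher.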

How Much Traffic Do You Actually Need?

Before running any test, calculate your required sample size. Tools like Evan Miller’s sample size calculator make this easy. The calculation depends on three factors:

Baseline conversion rate — Your current conversion rate
Minimum detectable effect (MDE) — The smallest improvement worth detecting
Statistical power — The probability of detecting a real effect (typically 80%)

Sample Size Reality Check

Here’s what most people don’t realize: detecting small improvements requires enormous sample sizes.

For example, if your baseline conversion rate is 3% and you want to detect a 10% relative improvement (from 3% to 3.3%), you need approximately 50,000 visitors per variant — 100,000 total.

If you only get 10,000 visitors per month, that’s a 10-month test. For most businesses, this is impractical.
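The arithmetic behind these numbers can be sketched in a few lines. This is a minimal version using the standard normal-approximation formula for comparing two proportions, so it may differ slightly from what Evan Miller’s calculator reports; the function name and parameters are illustrative.

```python
import math
from statistics import NormalDist

def visitors_per_variant(baseline: float, relative_lift: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-variant sample size for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n = visitors_per_variant(baseline=0.03, relative_lift=0.10)
print(f"{n:,} visitors per variant, {2 * n:,} total")
```

Under this approximation the 3% to 3.3% case comes out a little above 53,000 per variant, in line with the rough 50,000 figure. Because the required sample scales with the inverse square of the lift, halving the lift you want to detect roughly quadruples the traffic you need.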

As a result, you have two options:

Test bigger changes: A 50% relative improvement (3% to 4.5%) requires only about 2,500 visitors per variant
Accept the limitations: Recognize that small optimizations are undetectable with your traffic

Minimum Viable Sample Sizes

As a general guideline, aim for these minimums:

Metric	Minimum Recommended
Visitors per variant	30,000+
Conversions per variant	300+
Test duration	2 weeks minimum
Maximum duration	6-8 weeks

If you can’t reach these thresholds, your test results will be unreliable regardless of what the statistics say. For lower-traffic sites, qualitative methods and established landing page best practices often produce faster wins than underpowered experiments.

How Long Should You Run an A/B Test?

Test duration matters for two reasons: sample size and capturing behavioral patterns.

The Two-Week Minimum

Always run tests for at least two full weeks, even if you reach statistical significance earlier. This is because user behavior varies by day of week. Monday visitors behave differently than Saturday visitors.

If you run a test for only 5 days, you might miss important patterns. For instance, your variant might perform well on weekdays but poorly on weekends — something you’d never discover with a short test.

The Maximum Duration Problem

On the other hand, don’t let tests run too long. After 6-8 weeks, external factors start contaminating your results:

Seasonality effects
Marketing campaigns that affect traffic quality
Competitor actions
Technical changes to your site

Therefore, if you can’t reach significance within 8 weeks, the effect you’re testing for probably doesn’t exist — or it’s too small to matter.
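The two duration rules combine with your sample size into a simple planning calculation. The sketch below assumes steady daily traffic and an even split between variants; the function name and the traffic figures are hypothetical.

```python
import math

def planned_duration_days(visitors_per_variant: int, num_variants: int,
                          daily_visitors: int, min_days: int = 14) -> int:
    """Days needed to reach the sample size, rounded up to whole weeks."""
    total_needed = visitors_per_variant * num_variants
    raw_days = math.ceil(total_needed / daily_visitors)
    whole_weeks = math.ceil(max(raw_days, min_days) / 7)   # keep full weekly cycles
    return whole_weeks * 7

MAX_DAYS = 56  # the 8-week cap discussed above

days = planned_duration_days(visitors_per_variant=50_000, num_variants=2,
                             daily_visitors=4_000)
print(days, "days:", "too long, rethink the test" if days > MAX_DAYS else "within the cap")
```

If the result exceeds the cap, that is a signal to test a bolder change or accept that the effect is out of reach at your traffic level, rather than to run the test anyway.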

A/B vs Multivariate vs Bandit Testing

A/B testing is one tool in a broader experimentation toolkit. The right method depends on traffic, goals, and how many elements you’re trying to evaluate.

Method	What It Tests	Traffic Required	Statistical Rigor	Best For
A/B Testing	One change vs control	Moderate (10k+/variant)	High	Single hypothesis, clear winner needed
Multivariate (MVT)	Multiple elements simultaneously	Very high (100k+)	High, with caveats	Layout combinations on high-traffic pages
Multi-Armed Bandit	Many variants, auto-allocates traffic	Low to moderate	Lower (sacrifices learning for revenue)	Time-sensitive campaigns, headlines, ads
Split URL Test	Entire pages on different URLs	Same as A/B	High	Major redesigns, different templates

In practice, A/B testing is the default. Multivariate testing only makes sense at scale — most sites simply don’t have the traffic for it. Bandits are popular for ad copy and headlines where the cost of showing a losing variant is high, but they don’t give you the clean lift estimate a proper A/B test does.
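To illustrate what “auto-allocates traffic” means, here is a minimal sketch of Thompson sampling, one common bandit strategy. It is a simulation under assumed true conversion rates (3% and 6%, chosen only for illustration), not a description of how any particular tool implements bandits.

```python
import random

def thompson_sampling(true_rates, num_visitors: int = 20_000, seed: int = 0):
    """Simulate a Beta-Bernoulli bandit; return how many visitors each arm received."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(num_visitors):
        # Draw a plausible conversion rate per arm from its Beta posterior (uniform prior).
        draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))              # show the arm that currently looks best
        if rng.random() < true_rates[arm]:         # simulated visitor converts or not
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]

pulls = thompson_sampling(true_rates=[0.03, 0.06])
print("visitors per arm:", pulls)
```

The stronger arm ends up with most of the traffic, which is good for revenue during the campaign. The weaker arm receives comparatively little data, which is why the lift estimate is less clean than in a fixed 50/50 test.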

Frequentist vs Bayesian A/B Testing

There’s a quieter debate underneath every A/B testing tool: which statistical framework runs the math? Most platforms still default to frequentist methods (p-values, confidence intervals, fixed sample size). A growing number — VWO, Convert, Statsig — offer Bayesian alternatives that report probability statements like “92% chance B is better than A.”

Aspect	Frequentist	Bayesian
Core output	P-value, confidence interval	Posterior probability, credible interval
Sample size	Fixed in advance	Can stop when posterior stabilizes
Peeking	Inflates false positive rate	Less sensitive (but not immune)
Interpretation	“5% chance of false positive if there’s no real effect”	“92% probability B beats A given the data”
Prior knowledge	Ignored	Encoded in prior distribution
Best for	Regulated decisions, auditable methodology	Lower-traffic sites, faster iteration

Both frameworks work when used correctly. The Bayesian approach is more intuitive — most people naturally think in probabilities, not p-values — but it’s not a free pass to peek. Booking.com has published extensively on running large-scale frequentist experiments with variance reduction techniques, while many product-led companies prefer Bayesian for early-stage features.

The choice rarely matters at small scale. What matters is committing to one framework, sticking to it for the test’s duration, and not switching when results look bad.
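A statement like “92% chance B is better than A” can be produced with very little code. The sketch below assumes a Beta-Binomial model with uniform Beta(1, 1) priors and estimates the probability by Monte Carlo sampling; commercial tools may use different priors or closed-form calculations, and the function name and data are illustrative.

```python
import random

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 50_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Sample one plausible conversion rate per variant from its posterior.
        rate_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical data: 3.00% vs 3.45% conversion on 10,000 visitors each.
probability = prob_b_beats_a(conv_a=300, n_a=10_000, conv_b=345, n_b=10_000)
print(f"P(B beats A) = {probability:.1%}")
```

A high probability here is easier to explain to stakeholders than a p-value, but it comes from the same data and is just as vulnerable to a test that was stopped the moment it looked good.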

Common A/B Testing Mistakes

Beyond statistical errors, several practical mistakes undermine A/B tests:

1. Testing Too Many Variants

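Each additional variant increases the traffic you need. A test with a control and three variants has four arms instead of two, so it needs roughly twice the traffic of a simple A/B test just to keep the same sample size per arm, and more once you correct for multiple comparisons.

Moreover, testing multiple variants increases your false positive risk. If you’re comparing one control against three variants, you’re essentially running three tests simultaneously, which pushes the chance of at least one false positive from 5% to roughly 14%.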

Stick to one variant unless you have massive traffic.
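If you do run several variants, one widely used safeguard is the Bonferroni correction, which divides your significance threshold by the number of comparisons. The sketch below is an assumption-laden illustration: it treats the comparisons as independent, which is not strictly true when they share a control, and the function names are hypothetical.

```python
def family_wise_error_rate(alpha: float, num_comparisons: int) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** num_comparisons

def bonferroni_threshold(alpha: float, num_comparisons: int) -> float:
    """Per-comparison threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_comparisons

uncorrected = family_wise_error_rate(0.05, 3)       # three variants vs one control
threshold = bonferroni_threshold(0.05, 3)
corrected = family_wise_error_rate(threshold, 3)

print(f"uncorrected family-wise rate: {uncorrected:.1%}")
print(f"per-comparison threshold:     {threshold:.4f}")
print(f"corrected family-wise rate:   {corrected:.1%}")
```

The stricter threshold restores the overall 5% error rate, but each comparison now needs more traffic to reach it, which is another reason to keep the number of variants small.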

2. Changing Things Mid-Test

Never modify your test after it starts. This includes:

Adjusting the traffic split
Editing the variant design
Changing the goal metric
Adding new audience segments

Any mid-test change invalidates your results. If you must make changes, start a new test.

3. Ignoring Sample Ratio Mismatch

Sample Ratio Mismatch (SRM) occurs when traffic isn’t split between variants in the ratio you configured. Even a deviation of a fraction of a percentage point can indicate a problem when the sample is large.

Before analyzing results, verify that each variant received approximately equal traffic. If one variant got significantly more or less traffic than expected, something went wrong with your test setup — usually a tracking bug, a redirect that fires unevenly, or bot traffic hitting one variant. SRM is the canary that tells you to stop and audit your GA4 event tracking before trusting any results.
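A basic SRM check is a chi-square goodness-of-fit test on the visitor counts. The sketch below handles only the two-arm case, where the p-value can be computed with the standard library; the function name is illustrative, and the 0.001 alert threshold is an assumption (a deliberately strict value to limit false alarms), not a rule from this guide.

```python
import math

def srm_p_value(visitors_a: int, visitors_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    return math.erfc(math.sqrt(chi2 / 2))          # survival function of chi-square(1)

SRM_ALERT = 0.001  # assumed alert threshold

# The same 51.2% / 48.8% split at two different sample sizes.
for a, b in [(5_120, 4_880), (51_200, 48_800)]:
    p = srm_p_value(a, b)
    print(f"{a:,} vs {b:,}: p={p:.3g}", "-> SRM suspected" if p < SRM_ALERT else "-> ok")
```

The same percentage split can be unremarkable at 10,000 visitors and a clear red flag at 100,000, which is why the check should be run on counts, not on percentages alone.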

4. Testing the Wrong Things

Button colors, headline tweaks, and minor copy changes rarely produce meaningful results. These micro-optimizations typically generate effects too small to detect reliably.

Instead, focus on bigger changes:

Different value propositions
Entirely new page layouts
Removing major friction points
Adding or removing entire sections

Bold changes are easier to detect and more likely to produce meaningful business impact. Use funnel analysis to identify the steps where users actually abandon — those are the pages worth testing.

5. Not Having a Hypothesis

Random testing wastes time. Before each test, articulate:

What you’re changing
Why you expect it to work
What success looks like

Without a hypothesis, you’re just guessing. Even if you get a “winner,” you won’t understand why it won — making it impossible to apply the learning elsewhere.

What to Do When Tests Fail

Most A/B tests don’t produce statistically significant results. This isn’t failure — it’s learning.

When a test shows no significant difference, consider:

The change was too small — Your variant wasn’t different enough to affect behavior
The hypothesis was wrong — Your assumption about user behavior was incorrect
The sample was too small — You didn’t have enough traffic to detect the effect

Additionally, a “losing” test is still valuable. It tells you that a particular change won’t improve your conversion rate — information that prevents wasted development effort.

When A/B Testing Isn’t Worth It

A/B testing requires significant traffic and resources. Here’s when you should skip it:

Low traffic sites — If you get fewer than 10,000 visitors per month, most tests will be inconclusive
Low-stakes decisions — Testing a footer link isn’t worth the effort
Obvious improvements — If your checkout is broken, fix it. Don’t test it.
One-time events — You can’t A/B test a product launch

In these cases, use best practices, qualitative research, and common sense instead. Qualitative methods (interviews, usability tests, session recordings) help you understand why users behave certain ways. Use these to generate hypotheses, then validate with A/B tests when traffic allows. Similarly, analyzing bounce rate patterns can reveal which pages need testing most urgently.

Tools for A/B Testing

Several platforms can run A/B tests:

Tool	Best For	Starting Price
Google Optimize (sunset)	Was free, now discontinued	N/A
VWO	Mid-market companies	~$200/month
Optimizely	Enterprise	Custom pricing
AB Tasty	European companies	Custom pricing
Convert	Privacy-focused testing	~$99/month
Statsig	Product-led teams (Bayesian)	Free tier + usage-based

Most tools handle the technical implementation, but none of them prevent statistical mistakes. That’s on you. Before launching, preview the variant copy with a tool like the SERP preview if the change touches titles or descriptions surfaced in search.

A Practical A/B Testing Checklist

Before running your next test, verify:

Hypothesis documented — Clear statement of what you expect and why
Sample size calculated — Know how many visitors you need before starting
Test duration planned — Minimum 2 weeks, maximum 8 weeks
Success metric defined — One primary metric, decided in advance
Technical setup verified — Equal traffic split, tracking working correctly
No peeking commitment — Decide not to check results until the planned end date
SRM check planned — Confirm traffic split matches expectation before reading results

If you can’t check all these boxes, you’re not ready to run a valid test.
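One way to make the checklist hard to skip is to write the plan down as a record that can be validated before launch. The sketch below is only an illustration of that idea; the class, its fields, and the example values are hypothetical, and the duration limits simply mirror the 2-week and 8-week guidance above.

```python
from dataclasses import dataclass

@dataclass
class ExperimentPlan:
    hypothesis: str              # what you expect and why
    primary_metric: str          # one metric, decided in advance
    visitors_per_variant: int    # from the sample size calculation
    planned_days: int            # fixed end date, no early stopping
    srm_check_planned: bool = False

    def problems(self) -> list:
        """Return unmet checklist items; an empty list means the plan passes."""
        issues = []
        if not self.hypothesis.strip():
            issues.append("no documented hypothesis")
        if not self.primary_metric.strip():
            issues.append("no primary metric")
        if self.visitors_per_variant <= 0:
            issues.append("sample size not calculated")
        if not 14 <= self.planned_days <= 56:
            issues.append("duration outside 2 to 8 weeks")
        if not self.srm_check_planned:
            issues.append("no SRM check planned")
        return issues

plan = ExperimentPlan(
    hypothesis="A shorter checkout form reduces abandonment",
    primary_metric="checkout_completion_rate",
    visitors_per_variant=50_000,
    planned_days=28,
    srm_check_planned=True,
)
print(plan.problems() or "ready to launch")
```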

Frequently Asked Questions

What sample size do I need for an A/B test?

It depends on your baseline conversion rate and the smallest improvement you want to detect (MDE). A site with a 3% baseline conversion rate needs roughly 50,000 visitors per variant to detect a 10% relative lift at 95% confidence and 80% power. Smaller lifts need far more traffic: halving the lift you want to detect roughly quadruples the required sample. Use Evan Miller’s calculator before launching, not after.

How long should I run an A/B test?

A minimum of two full weeks, regardless of when you hit significance. This captures full weekly behavioral cycles. Cap tests at 6-8 weeks — beyond that, seasonality, marketing campaigns, and product changes contaminate the results. If you haven’t reached significance in 8 weeks, the effect is either too small to matter or doesn’t exist.

What’s the difference between statistical significance and practical significance?

Statistical significance tells you the difference probably isn’t due to chance. Practical significance tells you whether the difference is large enough to act on. A 0.3% lift can be statistically significant with enough traffic but not worth the engineering cost to ship. Always define a minimum detectable effect that’s worth your time before starting.

Can I trust an A/B test that ran for only 3 days?

No. Three days doesn’t capture weekly behavior cycles, sample sizes are usually too small, and stopping early (“peeking”) inflates your false positive rate to 20-30%. A “winner” after 3 days frequently regresses to the mean once you collect more data.

Is Bayesian A/B testing better than frequentist?

Neither is objectively better — they answer slightly different questions. Frequentist methods report p-values and require a fixed sample size. Bayesian methods report probability statements and allow more flexible stopping. Bayesian is often more intuitive for non-statisticians; frequentist is the standard in regulated environments. Pick one framework and commit to it for the duration of each test.

What is Sample Ratio Mismatch (SRM) and why does it matter?

SRM occurs when traffic isn’t split as expected — for example, you set 50/50 but Variant A got 51.2% of visitors. Even small mismatches signal a setup problem: a tracking bug, a redirect firing unevenly, bot traffic, or browser caching issues. SRM invalidates the test entirely. Run a chi-square SRM check before reading any results.

How do I A/B test on a low-traffic site?

Honestly, you usually can’t get meaningful A/B test results below 10,000 visitors per month. Instead: test bigger, bolder changes (full redesigns, new value propositions); rely on qualitative research and usability testing; benchmark against published conversion rate benchmarks; or batch changes and monitor pre/post effects directionally without claiming statistical significance.

The Bottom Line

A/B testing is a powerful tool, but it’s frequently misused. Most “data-driven” decisions based on A/B tests are actually noise dressed up as insight.

The solution isn’t to abandon testing. Instead, it’s to test properly: calculate sample sizes in advance, run tests long enough, resist the urge to peek, and accept that most tests won’t produce significant results.

Ultimately, a rigorous A/B testing program teaches you more from the failures than from the wins. The goal isn’t to find winners — it’s to learn the truth about what actually affects your users’ behavior.

That requires patience, discipline, and a willingness to be proven wrong.
