A/B Test Calculator

Check statistical significance, plan your sample size and duration, or run a Bayesian test — three calculators in one. Free, instant, no signup.

Bayesian mode answers "what's the chance the variation is actually better?" from the same conversion numbers you enter, using a Beta(1,1) prior.

Continuous metrics (revenue, average order value) use Welch's t-test instead of the proportion z-test.

A Sample Ratio Mismatch (SRM) chi-square test flags when your actual split is off — a sign of a broken experiment.

The Region of Practical Equivalence (ROPE) is the band around "no difference" treated as a tie, powering the Worse / Equivalent / Better decision.

For educational use. This calculator applies standard statistical methods (two-proportion z-test, power analysis, and a Beta-Binomial Bayesian model) to the numbers you enter — it is a decision aid, not a guarantee. Results assume properly randomized, independent samples and a single fixed analysis; real experiments can be affected by peeking, novelty effects, seasonality, sample ratio mismatch and tracking errors. Use it to inform a decision, not to replace sound experiment design.

This free A/B test calculator does the three jobs every experiment needs, in one place: tell you whether a finished test is a real winner (statistical significance), plan how many visitors and days you'll need before you start, and run a Bayesian "chance to beat" when you'd rather think in probabilities than p-values.

Pick a mode, enter your numbers, hit Calculate — everything updates instantly and your data never leaves your browser. Every default (95% confidence, 80% power, z = 1.96) is a documented statistical standard, cited below.

From Test Data to a Clear Call in Three Steps

No account, no email, no limits — just rigorous statistics made readable.

1. Pick Your Mode

"Did my test win?" for a finished test, "Plan my test" to size one before launch, or "Bayesian" for a chance-to-beat. One tool, three jobs.

2. Enter Your Numbers

Visitors and conversions for each variation, or a baseline rate and target effect. Switch to Advanced for revenue metrics, SRM and more.

3. Read the Verdict

Get a plain-English winner call plus the p-value, confidence interval, sample size or chance-to-beat — and the charts that make it obvious.

The Numbers Every A/B Test Uses

These are the conventional thresholds this calculator defaults to — each a documented statistical standard, not an invention.

95%: the standard confidence level; you accept a 5% chance of a false positive (Wikipedia / NIST)
80%: the standard statistical power; an 80% chance to detect a real effect if it exists (VWO / Evan Miller)
z = 1.96: the two-sided critical value at 95% confidence, used in the z-test and the interval (standard normal)
Beta(1,1): the uninformative prior the Bayesian mode starts from for each conversion rate (Evan Miller)

A Pretty Lift Means Nothing Without the Stats

Many "winning" tests get shipped without ever reaching significance. The math is what separates a real improvement from random noise.

Avoid False Winners

A 20% lift on small numbers is often pure chance. Significance tells you whether the difference is real before you roll it out to everyone.

Don't Run Forever (or Stop Too Soon)

Sizing the test up front tells you when you'll have enough data — so you neither waste weeks nor call it the moment it looks good.

Quantify the Risk

The confidence interval and the Bayesian expected loss tell you not just "is it better?" but "how much could I gain, or lose if I'm wrong?"

Align the Team

Share a link so PM, design and data see the same verdict and CI — fewer "but it looked like it won" debates after the fact.

How Significance Is Calculated

It's one z-test. Here's the whole thing, with the canonical example baked in.

Rate B − Rate A (+3.0pp) ÷ Pooled std error (0.0143) = Z-score (2.10)

The z-score maps to a two-sided p-value of 0.035: if there were no real difference, a gap this big would show up only about 3.5% of the time. The calculator reports that as 96.5% confidence, which clears the 95% bar. (Control 10% vs variation 13%, 1,000 visitors each.)
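The worked example above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library, not the calculator's own source; the function name two_proportion_z_test is an illustrative assumption.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test using the pooled standard error."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null both arms share one rate, so pool the two samples.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    pooled_se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, pooled_se, p_value


# Control 10% vs variation 13%, 1,000 visitors each.
z, pooled_se, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"pooled SE = {pooled_se:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

With these inputs the pooled standard error comes out near 0.0143 and the z-score near 2.10, matching the figures quoted above.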

Why two different standard errors?

Pooled SE for the test

The hypothesis test assumes the null is true — that both rates are equal — so it pools the two samples into one shared rate to compute the standard error and the z-score. This is the textbook two-proportion z-test (Wikipedia / NIST).

Unpooled SE for the interval

The confidence interval doesn't assume the rates are equal, so it uses each rate's own variance — the unpooled standard error. Most calculators hide this; ours shows both, because using the right one matters.
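As an illustration, the unpooled interval for the same example (control 10%, variation 13%, 1,000 visitors each) might be computed like this. It is a sketch under the usual normal approximation; the function name is hypothetical and not taken from the tool.

```python
import math
from statistics import NormalDist


def difference_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation CI for rate B minus rate A, using the unpooled SE."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Each arm contributes its own variance; nothing is pooled here.
    unpooled_se = math.sqrt(
        rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b
    )
    # Two-sided critical value, about 1.96 at 95% confidence.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z_crit * unpooled_se, diff + z_crit * unpooled_se


low, high = difference_confidence_interval(100, 1000, 130, 1000)
print(f"95% CI for the difference: {low * 100:+.2f}pp to {high * 100:+.2f}pp")
```

The interval sits entirely above zero (roughly +0.2pp to +5.8pp), which agrees with the significant z-test for this example.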

Absolute vs Relative Uplift

The single most common A/B-testing misread. The same result, two very different-looking numbers.

Control 10% vs variation 13%

+3 percentage points absolute

The raw gap between the two rates: 13% − 10% = 3pp. This is what statisticians work in, and what the confidence interval reports. It can't be inflated.

+30% relative lift

The gap as a share of the baseline: 3pp ÷ 10% = 30%. Marketing headlines love this bigger number — but "+30%" and "+3pp" describe the exact same test.

Always check which one a tool (or a vendor) is quoting. This calculator shows both, every time.
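The two figures come from the same pair of rates, which a short sketch makes concrete. The helper name uplift is an assumption for illustration.

```python
def uplift(control_rate, variation_rate):
    """Return (absolute uplift in percentage points, relative uplift in percent)."""
    gap = variation_rate - control_rate
    absolute_pp = gap * 100
    relative_pct = gap / control_rate * 100
    return absolute_pp, relative_pct


absolute_pp, relative_pct = uplift(0.10, 0.13)
print(f"{absolute_pp:+.1f}pp absolute, {relative_pct:+.0f}% relative")
# prints "+3.0pp absolute, +30% relative"
```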

Frequentist vs Bayesian — Which to Use?

They answer subtly different questions. Both are valid; this tool gives you both.

Frequentist: the p-value

Answers: "If there were no real difference, how surprising is this data?" A low p-value means the result would be unlikely by chance. Familiar, widely reported, and what "statistical significance" refers to — but easy to misinterpret and sensitive to peeking.

Bayesian: chance to beat

Answers the question you actually have: "What's the probability the variation is better, given the data?" Gives a direct chance-to-beat and an expected loss, and copes more gracefully with monitoring — at the cost of choosing a prior.

Rule of thumb: report significance when stakeholders expect a p-value; reach for Bayesian when you want an intuitive risk-based decision.

Sample Size, Power & MDE → Duration

Four inputs decide how long you'll wait. Smaller effects cost dramatically more traffic.

Baseline + MDE + power & confidence → visitors / variation ÷ daily traffic → days to run

Required sample size grows with 1 / MDE²: halving the effect you want to detect roughly quadruples the traffic you need. Detecting a 10% relative lift on a 10% baseline at 95%/80% takes about 14,300 visitors per variation. Pick the smallest lift that would actually change your decision — not the smallest you can imagine.
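A rough sketch of that calculation, using the power formula this page gives in "How the Calculator Works". The function names and the 1,000 visitors per day figure are illustrative assumptions, not values from the tool.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors needed per variation for a two-sided two-proportion test."""
    delta = baseline * relative_mde  # absolute effect to detect
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    null_term = z_alpha * math.sqrt(2 * baseline * (1 - baseline))
    alt_term = z_beta * math.sqrt(
        baseline * (1 - baseline) + (baseline + delta) * (1 - baseline - delta)
    )
    return math.ceil((null_term + alt_term) ** 2 / delta ** 2)


def days_to_run(n_per_variation, daily_visitors, variations=2):
    """Days needed when daily traffic is shared across all variations."""
    return math.ceil(n_per_variation * variations / daily_visitors)


n = sample_size_per_variation(0.10, 0.10)       # 10% baseline, 10% relative lift
n_half = sample_size_per_variation(0.10, 0.05)  # half the effect
days = days_to_run(n, daily_visitors=1000)      # hypothetical traffic level
print(n, n_half, round(n_half / n, 2), days)
```

The first call lands at roughly 14,300 visitors per variation, and halving the MDE pushes the requirement up by a factor close to four, as the 1 / MDE² rule predicts.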

The Peeking Problem

Why "we hit 95%, ship it!" is often wrong.

Checking repeatedly inflates false positives. A test's p-value bounces around as data arrives. If you stop the first time it dips below 0.05, you're cherry-picking noise — a "95% significant" result found by peeking can be wrong far more than 5% of the time.
The fix: decide your sample size up front and run to it. Use the sample-size mode to set a fixed horizon, then evaluate once. If you must monitor continuously, use a sequential method or the Bayesian mode, which is more robust to repeated looks.
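The inflation is easy to see in a small simulation. The sketch below runs A/A tests (both arms share the same true rate, so every "winner" is a false positive) and compares stopping at the first significant look against evaluating once at the end. All parameters (10 looks, 200 visitors per arm per look, 500 experiments, seed 42) are arbitrary assumptions chosen to keep it fast.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from the pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to test
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_false_positive_rates(true_rate=0.10, looks=10, visitors_per_look=200,
                                  experiments=500, alpha=0.05, seed=42):
    """Return (false-positive rate with peeking, false-positive rate at fixed horizon)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _experiment in range(experiments):
        conv_a = conv_b = n = 0
        significant_at_any_look = False
        p_value = 1.0
        for _look in range(looks):
            conv_a += sum(rng.random() < true_rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < true_rate for _ in range(visitors_per_look))
            n += visitors_per_look
            p_value = z_test_p_value(conv_a, n, conv_b, n)
            if p_value < alpha:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look
        fixed_hits += p_value < alpha  # only the final look counts here
    return peeking_hits / experiments, fixed_hits / experiments


peeking_rate, fixed_rate = simulate_false_positive_rates()
print(f"peeking: {peeking_rate:.1%}, fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon rate should hover near the nominal 5%, while the peeking rate comes out several times higher, even though there is no real difference between the arms.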

Sample Ratio Mismatch (SRM)

When your 50/50 split isn't 50/50, the whole test is suspect.

You intended an even split, but you got 53/47 across tens of thousands of visitors. That imbalance is statistically almost impossible by chance — so something is broken: a redirect dropping users, bot traffic, a tracking bug, or a flawed randomizer. A chi-square goodness-of-fit test flags it; if the SRM p-value falls below 0.01, don't interpret the experiment.

What the check does. Advanced mode compares your actual split against the intended one with a chi-square test and reports the p-value, so a mismatch can't slip past you.
What to do if it fails. Don't trust the result and don't "fix" it by reweighting. Find the root cause — redirects, bots, tracking, randomization — repair it, and rerun the test clean.
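A minimal sketch of such a check for a two-arm test, using only the standard library. The function name and the example visitor counts are assumptions; the 0.01 threshold is the one quoted above.

```python
import math


def srm_check(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit test of the observed split against the intended one."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi_square = ((visitors_a - expected_a) ** 2 / expected_a
                  + (visitors_b - expected_b) ** 2 / expected_b)
    # Two groups give 1 degree of freedom, where the chi-square upper tail
    # probability reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


chi_bad, p_bad = srm_check(10600, 9400)  # a 53/47 split on 20,000 visitors
chi_ok, p_ok = srm_check(5010, 4990)     # a near-even split
print(f"53/47: chi2 = {chi_bad:.1f}, SRM detected = {p_bad < 0.01}")
print(f"near-even: chi2 = {chi_ok:.2f}, SRM detected = {p_ok < 0.01}")
```

The erfc shortcut only holds for two groups; a test with more arms would need a general chi-square distribution function.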

Common A/B Testing Mistakes

The errors that turn experiments into expensive guesses.

1. Stopping early at the first "95%." This is the peeking problem: fix a sample size before you start and evaluate once.
2. Samples that are too small. A few hundred visitors can't detect a small lift; size the test first or you're reading noise.
3. Ignoring Sample Ratio Mismatch. A skewed split means a broken test; check it before you read the result.
4. Calling a tie a loser. "Not significant" means inconclusive, not "B lost"; you may just need more data.
5. Many variants, no correction. Test five variations and the odds of a fluke "winner" climb, so apply a Bonferroni or Šidák correction.
6. Running under a week, or over novelty. Cover whole weeks for day-of-week effects and watch for a novelty bump that fades.
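To put numbers on the multiple-variants mistake, here is a small sketch of the two corrections named in the list. It assumes five independent comparisons against the control; the function names are illustrative.

```python
def corrected_alpha(alpha, comparisons, method="bonferroni"):
    """Per-comparison significance threshold that keeps the overall rate near alpha."""
    if method == "bonferroni":
        return alpha / comparisons
    if method == "sidak":
        return 1 - (1 - alpha) ** (1 / comparisons)
    raise ValueError(f"unknown method: {method}")


def familywise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


uncorrected = familywise_error_rate(0.05, 5)
bonferroni = corrected_alpha(0.05, 5, "bonferroni")
sidak = corrected_alpha(0.05, 5, "sidak")
print(f"uncorrected family-wise error: {uncorrected:.1%}")
print(f"Bonferroni threshold: {bonferroni:.4f}, Sidak threshold: {sidak:.4f}")
```

Without a correction, five comparisons at 0.05 each carry roughly a 23% chance of at least one fluke winner; both corrections lower the per-comparison threshold to about 0.01.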

How the Calculator Works

No black box. Every formula, with the cited source — verified against worked numeric cases.

Significance
Two-proportion z-test. Conversion rates p = conversions / visitors. The test uses a pooled standard error √(p̄(1−p̄)(1/n₁+1/n₂)) to get z, then the p-value from the standard normal. The confidence interval uses the unpooled SE — we show both. (Wikipedia, NIST.)
Sample size
Exact power formula. From baseline p, effect δ, and the z-values for confidence and power: n = (z_α·√(2p(1−p)) + z_β·√(p(1−p)+(p+δ)(1−p−δ)))² / δ², rounded up. Then duration = total ÷ daily traffic. (Evan Miller.)
Bayesian
Beta-Binomial. Each rate gets a Beta(1,1) prior, so the posterior is Beta(1+conversions, 1+failures). We compute the exact probability the variation's posterior beats the control's, plus the expected loss. (Evan Miller's Bayesian formulas.)
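The Bayesian step can be sketched as follows. The chance to beat uses the closed-form sum from Evan Miller's "Formulas for Bayesian A/B Testing", evaluated in log space. The expected loss here is only a Monte Carlo estimate, which is an assumption of this sketch and not necessarily how the calculator computes it. Function names are hypothetical.

```python
import math
import random


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def chance_to_beat(conv_a, n_a, conv_b, n_b):
    """P(rate B > rate A) under independent Beta(1,1) priors (closed form)."""
    alpha_a, beta_a = 1 + conv_a, 1 + (n_a - conv_a)
    alpha_b, beta_b = 1 + conv_b, 1 + (n_b - conv_b)
    total = 0.0
    for i in range(alpha_b):  # alpha_b is an integer because counts are integers
        total += math.exp(
            log_beta(alpha_a + i, beta_a + beta_b)
            - math.log(beta_b + i)
            - log_beta(1 + i, beta_b)
            - log_beta(alpha_a, beta_a)
        )
    return total


def expected_loss(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of E[max(rate A - rate B, 0)], the cost of picking B."""
    rng = random.Random(seed)
    loss = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        loss += max(rate_a - rate_b, 0.0)
    return loss / draws


# Control 100/1,000 vs variation 130/1,000.
probability = chance_to_beat(100, 1000, 130, 1000)
loss = expected_loss(100, 1000, 130, 1000)
print(f"chance to beat: {probability:.1%}, expected loss: {loss * 100:.3f}pp")
```

For the page's running example the chance to beat comes out around 98%, with an expected loss well under a tenth of a percentage point.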

The numbers it reports

P-value & confidence
The chance of a difference at least this large if the null were true, and 1 − that.
Confidence interval
The plausible range for the true difference; excludes 0 when significant.
Observed power
Shown in Advanced — with a caution that post-hoc power is debated.
SRM & low-data guards
Flags a broken split or too-sparse data that makes the verdict unreliable.
A note on accuracy. These are standard, widely-used statistical methods, and this tool's outputs were checked against worked examples and reference calculators (Evan Miller, ABTestGuide). Still, a calculator can't see your experiment design: it assumes properly randomized, independent samples analyzed once at a fixed horizon. Peeking, novelty effects, seasonality, sample ratio mismatch and tracking errors can all invalidate an otherwise "significant" result. Use it to inform a decision, not to replace sound experiment design. sem.chat does not provide statistical consulting.

Sources & Further Reading

The authoritative methods and standards behind the math on this page.

Two-proportion z-test — pooled SE for the test, unpooled for the interval: Wikipedia and the NIST/SEMATECH e-Handbook §7.2.4.
Sample size & significance — the exact power formula and the 16·p(1−p)/δ² rule of thumb: Evan Miller, "Sample Size Calculator" and Awesome A/B Tools.
Bayesian A/B testing — Beta(1,1) posteriors and the closed-form probability to beat: Evan Miller, "Formulas for Bayesian A/B Testing".
Practitioner tools & defaults — confidence/power conventions, SRM and ROPE: ABTestGuide and VWO.

Frequently Asked Questions

Significance, sample size, Bayesian and the gotchas — answered in plain English.

Statistical significance tells you whether a result is larger than random chance alone would plausibly produce. The p-value is the probability of seeing a difference at least this large if the variation actually had no effect (the null hypothesis). A p-value of 0.05 means that, with no real effect, a gap this big would still appear about 5% of the time; at a 95% confidence level you call it significant when the p-value drops below 0.05. It is not the probability that your variation is better, and it is not the probability that the result is a fluke. Both are common misreadings.
In significance mode the variation wins when the test reaches your chosen confidence level (95% by default) and the confidence interval for the difference excludes zero. The tool shows the conversion rates, the uplift, the p-value, the confidence percentage and a plain verdict. A result that is not significant means either there is no real difference or you need more data — not that the variation lost.
If your control converts at 10% and the variation at 13%, the absolute uplift is +3 percentage points (pp) and the relative uplift is +30% — (13−10)/10. Marketing tools usually headline the bigger relative number; statisticians work in absolute terms. Confusing the two is the single most common A/B-testing misread, so this calculator shows both.
Use a two-sided test (the default) when you care whether the variation is different — better or worse. Use a one-sided test only when you would never act on a negative result and you fixed the direction before seeing data. A one-sided test halves the p-value, so it reaches significance faster, which is exactly why it is easy to abuse. When in doubt, stay two-sided.
95% is the industry standard, accepting a 5% false-positive rate. Use 90% for low-risk, easily reversible changes where speed matters, and 99% for high-stakes or hard-to-undo decisions. A higher confidence level needs more data to reach.
Power is the probability your test will detect a real effect of a given size when one exists — one minus the false-negative rate. The convention is 80%, meaning if the effect is real you will catch it 80% of the time and miss it 20%. Higher power such as 90% is safer but needs a larger sample.
The minimum detectable effect (MDE) is the smallest improvement you want the test to be able to detect. Smaller MDEs require dramatically more traffic, because sample size grows with one over MDE squared, so pick the smallest lift that would actually change your decision, not an unrealistically tiny one. A common default starting point is a 20% relative MDE.
The sample size you need depends on your baseline conversion rate, your MDE, the confidence level and the power. The sample-size mode computes the exact visitors per variation using the standard two-proportion power formula. For example, lifting a 10% baseline by a relative 10% (to 11%) at 95% confidence and 80% power needs about 14,300 visitors per variation.
Duration = required total sample size divided by your daily eligible visitors. Enter your average daily traffic in sample-size mode and the tool returns the number of days. Run for whole weeks to average out day-of-week effects, and do not stop the moment it looks significant.
No, you should not stop a test the moment it first looks significant. Checking repeatedly and stopping the first time you see significance dramatically inflates your false-positive rate: a 95% result found by peeking can be wrong far more than 5% of the time. Decide your sample size up front and run to it, or use a sequential or Bayesian method designed for monitoring.
Frequentist (the p-value and significance mode) answers how surprising this data is if there were no real difference. Bayesian answers the more intuitive question — what is the probability the variation is actually better, given the data. Bayesian gives a direct chance-to-beat and an expected loss; frequentist gives a p-value and confidence interval. Both are valid, and this tool offers both.
Chance to beat is the posterior probability that the variation's true conversion rate is higher than the control's, given the data and an uninformative Beta(1,1) prior. For example, 98% means there is a 98% chance the variation is genuinely better. A common decision threshold is 95%.
Expected loss is the average amount of conversion rate you would give up if you pick the variation and it turns out to be worse — a risk measure. You ship when expected loss is below a tiny threshold. ROPE, the Region of Practical Equivalence, is a band around no-difference (default 1%) inside which the two are treated as effectively the same, powering a Worse / Equivalent / Better decision.
A 95% confidence interval (frequentist) is a range that, across many repeats of the experiment, would contain the true difference 95% of the time. A 95% credible interval (Bayesian) is a range the true value falls in with 95% probability given your data. The calculator shows the confidence interval for the difference in significance mode and credible intervals in Bayesian mode.
SRM is when your traffic split does not match what you intended — for example you wanted 50/50 but got 53/47 at high volume. The tool runs a chi-square check; a failing result (p below 0.01) signals broken randomization, redirect or bot bias, or tracking issues. If SRM fails, do not interpret the test — fix the cause and rerun.
The z-test relies on a normal approximation that breaks down with very few conversions, roughly fewer than 5 to 10 successes or failures per cell. With sparse data the p-value is unreliable, so the tool flags it and advises collecting more before trusting the verdict.
Yes, revenue and other continuous metrics are supported. In Advanced mode switch the metric type to continuous and enter the mean, standard deviation and sample size per variation. The tool then runs Welch's t-test, which handles unequal variances, instead of the proportion z-test. Most A/B calculators only handle binary conversions.
Yes, you can test more than two variations. Add variations in Advanced mode and the tool compares each against the control while applying a multiple-comparison correction (Bonferroni or Šidák), because testing several variants at once inflates the chance of a false winner if you do not adjust.
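For the continuous-metric case described above, Welch's statistic and its degrees of freedom can be sketched from summary statistics alone. Note the loud assumption: the Python standard library has no t-distribution, so the p-value below uses a normal approximation, which is only reasonable when the degrees of freedom are large (typical for A/B tests). The function name and the revenue figures are hypothetical.

```python
import math
from statistics import NormalDist


def welch_t_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic, Welch-Satterthwaite df, and an approximate p-value."""
    var_a = sd_a ** 2 / n_a  # squared standard error of mean A
    var_b = sd_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    # ASSUMPTION: normal approximation to the t distribution (fine for large df).
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_approx


# Hypothetical average order values: 50 vs 52, 2,000 orders per variation.
t, df, p_approx = welch_t_test(50.0, 20.0, 2000, 52.0, 22.0, 2000)
print(f"t = {t:.2f}, df = {df:.0f}, approximate p = {p_approx:.4f}")
```

For small samples, a proper t-distribution (for example from a statistics package) should replace the normal approximation.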

A/B Testing Terms, in Plain English

The concepts behind the calculator — what they mean and why they matter.

Conversion rate
The proportion of visitors who convert: conversions divided by visitors.
Control vs variation
The control (A) is the existing version; the variation (B) is the change you are testing against it.
Absolute uplift
The difference between the two conversion rates in percentage points, e.g. 13% minus 10% equals +3pp.
Relative uplift
The improvement as a percent of the baseline, e.g. +3pp on a 10% baseline is +30%.
Null hypothesis
The default assumption that the variation has no real effect; a test tries to disprove it.
P-value
The probability of seeing a difference at least this large if the null hypothesis were true.
Significance level (alpha)
The false-positive rate you accept; 0.05 corresponds to 95% confidence.
Confidence level
One minus alpha; how sure you want to be before calling a result real, commonly 95%.
Statistical power (1 - beta)
The probability of detecting a real effect of a given size; 80% is the standard.
Type I and Type II error
A Type I error is a false positive (calling a non-difference real); a Type II error is a false negative (missing a real difference).
Two-proportion z-test
The test that standardizes the distance between the two conversion rates into a z-score to compute the p-value.
Pooled vs unpooled standard error
The hypothesis test uses a pooled standard error (assuming equal rates under the null); the confidence interval uses an unpooled standard error from each rate's own variance.
Confidence interval
A frequentist range likely to contain the true difference, e.g. a 95% confidence interval.
Minimum Detectable Effect (MDE)
The smallest lift a planned test is powered to detect.
Sample size and test duration
The visitors per variation a test needs, and how many days that takes at your traffic.
Bayesian posterior / chance to beat
The probability the variation's true rate beats the control's, given the data.
Credible interval, expected loss and ROPE
A Bayesian range for a value; the average downside of a wrong choice; and the band of practical equivalence treated as no difference.
Sample Ratio Mismatch (SRM)
A traffic-split imbalance versus the intended ratio that signals a broken experiment.
Peeking problem
Repeatedly checking results and stopping at the first significant moment, which inflates false positives.

Optimizing Your Conversion Rate?

The fastest "winning variation" is often just answering visitors faster. sem.chat adds an AI chat and voice agent to your site that answers questions 24/7, captures leads and books calls — a conversion lift you can measure with the calculator above. Try it free.
