Check statistical significance, plan your sample size and duration, or run a Bayesian test — three calculators in one. Free, instant, no signup.
Your test results
Settings
Plan your test
Bayesian mode answers "what's the chance the variation is actually better?" from the same conversion numbers above, using a Beta(1,1) prior.
What are you measuring?
Continuous metrics (revenue, average order value) use Welch's t-test instead of the proportion z-test.
Sample sizes come from the Control / Variation visitor fields above.
Multiple variations & SRM
A Sample Ratio Mismatch (SRM) chi-square test flags when your actual split is off — a sign of a broken experiment.
Advanced planning
Bayesian settings
The Region of Practical Equivalence (ROPE) is the band around "no difference" treated as a tie, powering the Worse / Equivalent / Better decision.
A/B test result
The shaded area is the p-value; the marker is your z-score. The further into the tail, the less likely the result is chance.
If this interval excludes 0, the difference is significant at your confidence level.
Required sample size grows as the effect you want to detect shrinks (≈ proportional to 1/MDE²). Your chosen MDE is marked.
Where each variant's true conversion rate probably lies — the less the two overlap, the clearer the winner.
Probability the variation is meaningfully worse, practically equivalent (within ROPE), or meaningfully better.
This free A/B test calculator does the three jobs every experiment needs, in one place: tell you whether a finished test is a real winner (statistical significance), plan how many visitors and days you'll need before you start, and run a Bayesian "chance to beat" when you'd rather think in probabilities than p-values.
Pick a mode, enter your numbers, hit Calculate — everything updates instantly and your data never leaves your browser. Every default (95% confidence, 80% power, z = 1.96) is a documented statistical standard, cited below.
How It Works
No account, no email, no limits — just rigorous statistics made readable.
"Did my test win?" for a finished test, "Plan my test" to size one before launch, or "Bayesian" for a chance-to-beat. One tool, three jobs.
Visitors and conversions for each variation, or a baseline rate and target effect. Switch to Advanced for revenue metrics, SRM and more.
Get a plain-English winner call plus the p-value, confidence interval, sample size or chance-to-beat — and the charts that make it obvious.
The Standards Behind the Math
These are the conventional thresholds this calculator defaults to — each a documented statistical standard, not an invention.
Why It Matters
Many "winning" tests get shipped without ever reaching significance. The math is what separates a real improvement from random noise.
A 20% lift on small numbers is often pure chance. Significance tells you whether the difference is real before you roll it out to everyone.
Sizing the test up front tells you when you'll have enough data — so you neither waste weeks nor call it the moment it looks good.
The confidence interval and the Bayesian expected loss tell you not just "is it better?" but "how much could I gain, or lose if I'm wrong?"
Share a link so PM, design and data see the same verdict and CI — fewer "but it looked like it won" debates after the fact.
The Core
It's one z-test. Here's the whole thing, with the canonical example baked in.
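The z-score (about 2.10) maps to a two-sided p-value of about 0.035: if there were no real difference, a gap this big would show up only about 3.5% of the time. That is below the 0.05 threshold, so the result is significant at the 95% confidence level. (Control 10% vs variation 13%, 1,000 visitors each.)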
The hypothesis test assumes the null is true — that both rates are equal — so it pools the two samples into one shared rate to compute the standard error and the z-score. This is the textbook two-proportion z-test (Wikipedia / NIST).
The confidence interval doesn't assume the rates are equal, so it uses each rate's own variance — the unpooled standard error. Most calculators hide this; ours shows both, because using the right one matters.
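As a minimal sketch of the pooled-versus-unpooled distinction described above, the following Python reproduces the canonical example (100 of 1,000 conversions against 130 of 1,000). The function name and its parameters are illustrative assumptions and not the calculator's actual code. Only the standard library is used.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Pooled SE for the z-test and p-value, unpooled SE for the interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Hypothesis test: the null says both rates are equal, so pool the samples.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

    # Confidence interval: no equality assumption, so each rate keeps its own variance.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci


z, p_value, ci = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}, CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

For these inputs the z-score comes out near 2.10 and the p-value near 0.035, and the interval for the difference sits entirely above zero, which matches the significant verdict in the example.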
Don't Get Fooled
The single most common A/B-testing misread. The same result, two very different-looking numbers.
The raw gap between the two rates: 13% − 10% = 3pp. This is what statisticians work in, and what the confidence interval reports. It can't be inflated.
The gap as a share of the baseline: 3pp ÷ 10% = 30%. Marketing headlines love this bigger number — but "+30%" and "+3pp" describe the exact same test.
Always check which one a tool (or a vendor) is quoting. This calculator shows both, every time.
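The arithmetic behind the two numbers is short enough to sketch in a few lines of Python. The helper name `lift` is an illustrative assumption; the rates are the 10% and 13% from the example above.

```python
def lift(control_rate, variation_rate):
    """Return (absolute lift in percentage points, relative lift in percent)."""
    absolute_pp = (variation_rate - control_rate) * 100
    relative_pct = (variation_rate - control_rate) / control_rate * 100
    return absolute_pp, relative_pct


absolute_pp, relative_pct = lift(0.10, 0.13)
print(f"+{absolute_pp:.1f}pp absolute, +{relative_pct:.0f}% relative")
# prints "+3.0pp absolute, +30% relative"
```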
Two Lenses
They answer subtly different questions. Both are valid; this tool gives you both.
Answers: "If there were no real difference, how surprising is this data?" A low p-value means the result would be unlikely by chance. Familiar, widely reported, and what "statistical significance" refers to — but easy to misinterpret and sensitive to peeking.
Answers the question you actually have: "What's the probability the variation is better, given the data?" Gives a direct chance-to-beat and an expected loss, and copes more gracefully with monitoring — at the cost of choosing a prior.
Rule of thumb: report significance when stakeholders expect a p-value; reach for Bayesian when you want an intuitive risk-based decision.
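To make the Bayesian lens concrete, here is a rough sketch that estimates chance-to-beat and expected loss by sampling from Beta posteriors with a Beta(1,1) prior. It is a Monte Carlo approximation chosen for brevity, not the exact closed-form calculation, and the function name, draw count, and seed are assumptions made for illustration.

```python
import random


def bayesian_ab(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    """Monte Carlo estimate of chance-to-beat and expected loss, Beta(1,1) prior."""
    rng = random.Random(seed)
    wins = 0
    loss_total = 0.0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
        # Conversion rate given up by shipping the variation when control is better.
        loss_total += max(rate_a - rate_b, 0.0)
    return wins / draws, loss_total / draws


chance_to_beat, expected_loss = bayesian_ab(100, 1000, 130, 1000)
print(f"chance to beat: {chance_to_beat:.3f}, expected loss: {expected_loss:.5f}")
```

With 100 of 1,000 against 130 of 1,000, the chance-to-beat lands around 98% and the expected loss is a small fraction of a percentage point. More draws tighten the estimate at the cost of run time.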
Plan First
Four inputs decide how long you'll wait. Smaller effects cost dramatically more traffic.
Required sample size grows with 1 / MDE²: halving the effect you want to detect roughly quadruples the traffic you need. Detecting a 10% relative lift on a 10% baseline at 95%/80% takes about 14,300 visitors per variation. Pick the smallest lift that would actually change your decision — not the smallest you can imagine.
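The sample-size formula given in the Methodology section can be sketched directly in Python to check the figures above. The function names and the `daily_visitors` parameter are illustrative assumptions; the test is taken to be two-sided, which is what gives z = 1.96 at 95% confidence.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors needed per variation to detect a relative lift on a baseline rate."""
    p = baseline
    delta = baseline * relative_mde  # absolute effect
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 at 95%
    z_beta = NormalDist().inv_cdf(power)  # about 0.84 at 80%
    numerator = (
        z_alpha * sqrt(2 * p * (1 - p))
        + z_beta * sqrt(p * (1 - p) + (p + delta) * (1 - p - delta))
    ) ** 2
    return ceil(numerator / delta**2)


def duration_days(n_per_variation, variations, daily_visitors):
    """Days needed: total visitors across all variations divided by daily traffic."""
    return ceil(n_per_variation * variations / daily_visitors)


n_10 = sample_size_per_variation(0.10, 0.10)  # 10% relative lift on a 10% baseline
n_5 = sample_size_per_variation(0.10, 0.05)   # half the effect
print(n_10, n_5, round(n_5 / n_10, 2))
print(duration_days(n_10, variations=2, daily_visitors=2000))
```

The first result falls close to the 14,300 quoted above, and halving the effect pushes the requirement to roughly four times that, in line with the 1 / MDE² rule.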
The #1 Mistake
Why "we hit 95%, ship it!" is often wrong.
A Silent Killer
When your 50/50 split isn't 50/50, the whole test is suspect.
You intended an even split, but you got 53/47 across tens of thousands of visitors. That imbalance is statistically almost impossible by chance — so something is broken: a redirect dropping users, bot traffic, a tracking bug, or a flawed randomizer. A chi-square goodness-of-fit test flags it; if the SRM p-value falls below 0.01, don't interpret the experiment.
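A minimal sketch of the SRM check for the two-variation case follows. With two groups the chi-square test has one degree of freedom, so the p-value can be computed with the standard library's `erfc`; more than two variations would need a general chi-square distribution function, which this sketch does not cover. The function name and the 20,000-visitor example are assumptions for illustration.

```python
from math import erfc, sqrt


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for a two-group split (1 degree of freedom)."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (
        (visitors_a - expected_a) ** 2 / expected_a
        + (visitors_b - expected_b) ** 2 / expected_b
    )
    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


p_srm = srm_p_value(10_600, 9_400)  # a 53/47 split across 20,000 visitors
print(f"SRM p-value: {p_srm:.3g}", "-> do not interpret" if p_srm < 0.01 else "-> OK")
```

A 53/47 split at this volume produces a p-value far below 0.01, while a split such as 5,010 against 4,990 does not come close to triggering the flag.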
Avoid These
The errors that turn experiments into expensive guesses.
Methodology
No black box. Every formula, with the cited source — verified against worked numeric cases.
p = conversions / visitors. The test uses a pooled standard error √(p̄(1−p̄)(1/n₁+1/n₂)) to get z, then the p-value from the standard normal. The confidence interval uses the unpooled SE; we show both. (Wikipedia, NIST.)
Sample size uses the baseline rate p, effect δ, and the z-values for confidence and power: n = (z_α·√(2p(1−p)) + z_β·√(p(1−p)+(p+δ)(1−p−δ)))² / δ², rounded up. Then duration = total ÷ daily traffic. (Evan Miller.)
Bayesian mode uses a Beta(1,1) prior, so the posterior is Beta(1+conversions, 1+failures). We compute the exact probability the variation's posterior beats the control's, plus the expected loss. (Evan Miller's Bayesian formulas.)
References
The authoritative methods and standards behind the math on this page.
FAQ
Significance, sample size, Bayesian and the gotchas — answered in plain English.
Glossary
The concepts behind the calculator — what they mean and why they matter.