Question 1

What is statistical significance in an A/B test, and what does a p-value actually mean?

Accepted Answer

Statistical significance tells you whether the difference you observed is larger than random variation alone would plausibly produce. The p-value is the probability of seeing a difference at least this large if the variation actually had no effect (the null hypothesis). A p-value of 0.05 means that, if there were truly no effect, a difference this large would still show up about 5% of the time; at a 95% confidence level you call the result significant when the p-value drops below 0.05. It is not the probability that your result is a fluke, and it is not the probability that your variation is better. Both are common misreadings.

Question 2

How do I read this calculator's result — what counts as a winner?

Accepted Answer

In significance mode the variation wins when the test reaches your chosen confidence level (95% by default) and the confidence interval for the difference excludes zero. The tool shows the conversion rates, the uplift, the p-value, the confidence percentage and a plain verdict. A result that is not significant means either there is no real difference or you need more data — not that the variation lost.
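
As a rough illustration of how such a verdict can be computed, here is a minimal sketch of a two-sided two-proportion z-test. It assumes the pooled standard error for the test and the unpooled standard error for the interval, as the page's section titles suggest; the function name `z_test` and the example counts are hypothetical, not the tool's actual code.

```python
from math import sqrt
from statistics import NormalDist

def z_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Two-sided two-proportion z-test plus a CI for the difference (B - A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled SE for the test: under the null hypothesis both rates are equal.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled SE for the interval around the observed difference.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "ci": ci,
        "significant": p_value < 1 - confidence,
    }

# Hypothetical data: 1,000 of 10,000 convert on control, 1,300 of 10,000 on the variation.
result = z_test(conv_a=1000, n_a=10000, conv_b=1300, n_b=10000)
print(result["significant"], result["ci"])
```

With these example counts the interval sits entirely above zero and the p-value is far below 0.05, so the variation would be called a winner at 95% confidence.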

Question 3

What's the difference between absolute uplift (percentage points) and relative uplift (percent lift)?

Accepted Answer

If your control converts at 10% and the variation at 13%, the absolute uplift is +3 percentage points (pp) and the relative uplift is +30%, that is (13−10)/10. Marketing tools usually headline the bigger relative number; statisticians work in absolute terms. Confusing the two is one of the most common A/B-testing misreads, so this calculator shows both.
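
The arithmetic in this answer can be sketched in a few lines. The function name `uplift` is illustrative only.

```python
def uplift(control_rate, variation_rate):
    """Return (absolute uplift in percentage points, relative uplift in percent)."""
    absolute_pp = (variation_rate - control_rate) * 100
    relative_pct = (variation_rate - control_rate) / control_rate * 100
    return absolute_pp, relative_pct

abs_pp, rel_pct = uplift(0.10, 0.13)
print(f"{abs_pp:+.1f} pp absolute, {rel_pct:+.1f}% relative")
```

The same 3-point gap on a 50% baseline would be only a 6% relative lift, which is why both numbers are worth reporting together.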

Question 4

Should I use a one-sided or a two-sided test?

Accepted Answer

Use a two-sided test (the default) when you care whether the variation is different in either direction, better or worse. Use a one-sided test only when you would never act on a negative result and you fixed the direction before seeing data. When the observed difference points in the predicted direction, a one-sided test halves the p-value, so it reaches significance faster, which is exactly why it is easy to abuse. When in doubt, stay two-sided.
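
A small sketch of how the two p-values relate for a given z-score, assuming the one-sided alternative is "the variation is better". The function name `p_values` is hypothetical.

```python
from statistics import NormalDist

def p_values(z):
    """Return (two-sided p, one-sided p for H1: variation is better)."""
    two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    one_sided = 1 - NormalDist().cdf(z)
    return two_sided, one_sided

two, one = p_values(1.8)          # observed lift in the predicted direction
two_neg, one_neg = p_values(-1.8) # observed lift in the opposite direction
print(round(two, 4), round(one, 4), round(one_neg, 4))
```

At z = 1.8 the one-sided p-value falls below 0.05 while the two-sided one does not. At z = -1.8 the one-sided p-value is close to 1, so a one-sided test can never flag a variation that is actually hurting you.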

Question 5

What confidence level should I choose — 90%, 95%, or 99%?

Accepted Answer

95% is the industry standard, accepting a 5% false-positive rate. Use 90% for low-risk, easily reversible changes where speed matters, and 99% for high-stakes or hard-to-undo decisions. A higher confidence level needs more data to reach.

Question 6

What is statistical power, and why is 80% the standard?

Accepted Answer

Power is the probability your test will detect a real effect of a given size when one exists — one minus the false-negative rate. The convention is 80%, meaning if the effect is real you will catch it 80% of the time and miss it 20%. Higher power such as 90% is safer but needs a larger sample.
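
To make the definition concrete, here is a hedged sketch of the usual normal-approximation power calculation for a two-sided two-proportion test. The function name and the example inputs are assumptions for illustration, and the tiny chance of a significant result in the wrong direction is ignored.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_variation, confidence=0.95):
    """Approximate power to detect p1 -> p2 with n visitors in each variation."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_bar = (p1 + p2) / 2
    # Standard error if there is no effect (sets the significance threshold).
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_variation)
    # Standard error if the effect is real (spread around the true difference).
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation)
    return NormalDist().cdf((abs(p2 - p1) - z_alpha * se_null) / se_alt)

power_large = power_two_proportions(0.10, 0.11, 14751)
power_small = power_two_proportions(0.10, 0.11, 5000)
print(round(power_large, 3), round(power_small, 3))
```

Under these assumptions, roughly 14,750 visitors per variation gives about 80% power for a 10% to 11% lift, while 5,000 per variation leaves the test well under 50%, meaning a real effect of that size would be missed more often than found.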

Question 7

What is Minimum Detectable Effect (MDE) and how do I pick one?

Accepted Answer

MDE is the smallest improvement you want the test to be able to detect. Smaller MDEs require dramatically more traffic — sample size grows with one over MDE squared — so pick the smallest lift that would actually change your decision, not an unrealistically tiny one. A common default starting point is a 20% relative MDE.

Question 8

How many visitors (sample size) do I need?

Accepted Answer

It depends on your baseline conversion rate, your MDE, the confidence level and the power. The sample-size mode computes the exact visitors per variation using the standard two-proportion power formula — for example, lifting a 10% baseline by a relative 10% (to 11%) at 95% confidence and 80% power needs about 14,300 visitors per variation.
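
One common textbook form of the two-proportion power formula is sketched below. The tool's exact formula is not shown on the page, so this is an assumption: calculators differ slightly in the variance terms they use, and this variant lands a few percent above the figure quoted in the answer. The function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors needed in each variation (two-sided test, equal split)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    delta = p2 - p1
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / delta ** 2)

n_10 = sample_size_per_variation(0.10, 0.10)  # detect 10% -> 11%
n_05 = sample_size_per_variation(0.10, 0.05)  # detect 10% -> 10.5%
print(n_10, n_05, round(n_05 / n_10, 2))
```

Halving the MDE from 10% to 5% relative roughly quadruples the required sample, which is the one-over-MDE-squared scaling in practice.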

Question 9

How long should I run my test, and how is duration calculated?

Accepted Answer

Duration = required total sample size divided by your daily eligible visitors. Enter your average daily traffic in sample-size mode and the tool returns the number of days. Run for whole weeks to average out day-of-week effects, and do not stop the moment it looks significant.
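
The duration rule can be sketched as follows. Rounding up to whole weeks reflects the advice in this answer; the function name and the traffic figures are illustrative assumptions.

```python
from math import ceil

def duration_days(n_per_variation, variations, daily_visitors, whole_weeks=True):
    """Days needed to collect the total sample, optionally rounded up to full weeks."""
    total_sample = n_per_variation * variations
    days = ceil(total_sample / daily_visitors)
    if whole_weeks:
        days = ceil(days / 7) * 7
    return days

# Hypothetical: 14,300 visitors per variation, 2 variations, 2,000 eligible visitors a day.
raw_days = duration_days(14300, 2, 2000, whole_weeks=False)
planned_days = duration_days(14300, 2, 2000)
print(raw_days, planned_days)  # 15 21
```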

Question 10

Can I stop my test early as soon as it hits 95%? (the peeking problem)

Accepted Answer

No. Checking repeatedly and stopping the first time you see significance dramatically inflates your false-positive rate — a 95% result found by peeking can be wrong far more than 5% of the time. Decide your sample size up front and run to it, or use a sequential or Bayesian method designed for monitoring.
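
A quick simulation makes the inflation visible. The sketch below runs A/A tests (both arms share the same true rate, so every "significant" result is a false positive) and compares a single look at the end with ten peeks along the way. All parameters are illustrative, and the exact rates depend on the random seed.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(looks, batch, rate=0.10, sims=400, alpha=0.05, seed=1):
    """Share of A/A tests declared significant at any of the scheduled looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(sims):
        conv_a = conv_b = n = 0
        for _ in range(looks):
            # Add one batch of visitors per arm, then peek at the z-test.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            pooled = (conv_a + conv_b) / (2 * n)
            se = sqrt(pooled * (1 - pooled) * 2 / n)
            if se > 0 and abs(conv_b - conv_a) / n / se > z_crit:
                hits += 1
                break  # stopped early on a "significant" peek
    return hits / sims

single_look = false_positive_rate(looks=1, batch=2000)
ten_peeks = false_positive_rate(looks=10, batch=200)
print(single_look, ten_peeks)
```

Both runs end at 2,000 visitors per arm. The single look should land near the nominal 5%, while stopping at the first significant peek typically produces a false-positive rate several times higher.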

Question 11

What's the difference between the frequentist and Bayesian modes?

Accepted Answer

Frequentist (the p-value and significance mode) answers how surprising this data is if there were no real difference. Bayesian answers the more intuitive question — what is the probability the variation is actually better, given the data. Bayesian gives a direct chance-to-beat and an expected loss; frequentist gives a p-value and confidence interval. Both are valid, and this tool offers both.

Question 12

In Bayesian mode, what does chance to beat control mean?

Accepted Answer

It is the posterior probability that the variation's true conversion rate is higher than the control's, given the data and an uninformative Beta(1,1) prior — for example, 98% means there is a 98% chance the variation is genuinely better. A common decision threshold is 95%.
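
A minimal Monte Carlo sketch of this quantity, assuming the Beta(1,1) prior named above and independent binomial data per arm. The function name, draw count, and example counts are hypothetical, and the tool may compute the value differently (for example by numerical integration).

```python
import random

def chance_to_beat(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(variation rate > control rate | data) by posterior sampling."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta(1, 1) prior + binomial data -> Beta(1 + conversions, 1 + non-conversions)
        theta_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        theta_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += theta_b > theta_a
    return wins / draws

# Hypothetical data: 100 of 1,000 convert on control, 130 of 1,000 on the variation.
p_better = chance_to_beat(100, 1000, 130, 1000)
print(p_better)
```

For these counts the estimate comes out at roughly 98%, which would clear a 95% decision threshold.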

Question 13

What is expected loss (potential loss) and ROPE?

Accepted Answer

Expected loss is the conversion rate you would give up, on average, by shipping the variation: it counts only the scenarios in which the variation is actually worse than the control and weights each by its posterior probability. It is a risk measure, and you ship when expected loss falls below a small threshold you set in advance. ROPE, the Region of Practical Equivalence, is a band around no difference (default 1%) inside which the two are treated as effectively the same, which powers the Worse / Equivalent / Better decision.
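
Both quantities can be estimated from the same posterior samples. The sketch below assumes a Beta(1,1) prior and treats the ROPE as an absolute band of ±0.01 on the difference in conversion rates; whether the tool's 1% default is absolute or relative is not stated, so that choice is an assumption, as are the function name and example counts.

```python
import random

def loss_and_rope(conv_a, n_a, conv_b, n_b, rope=0.01, draws=20000, seed=0):
    """Return (expected loss of shipping B, shares of worse/equivalent/better)."""
    rng = random.Random(seed)
    total_loss = 0.0
    counts = {"worse": 0, "equivalent": 0, "better": 0}
    for _ in range(draws):
        theta_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        theta_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        diff = theta_b - theta_a
        # Loss is zero when B is better, and the shortfall when B is worse.
        total_loss += max(-diff, 0.0)
        if diff < -rope:
            counts["worse"] += 1
        elif diff > rope:
            counts["better"] += 1
        else:
            counts["equivalent"] += 1
    shares = {label: count / draws for label, count in counts.items()}
    return total_loss / draws, shares

expected_loss, shares = loss_and_rope(100, 1000, 130, 1000)
print(expected_loss, shares)
```

With these example counts the expected loss is a small fraction of a percentage point and most of the posterior mass falls in the "better" region.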

Question 14

What is a confidence interval versus a credible interval?

Accepted Answer

A 95% confidence interval (frequentist) comes from a procedure that, across many repeats of the experiment, would capture the true difference 95% of the time. A 95% credible interval (Bayesian) is a range the true value falls in with 95% probability, given your data and the prior. The calculator shows the confidence interval for the difference in significance mode and credible intervals in Bayesian mode.

Question 15

What is Sample Ratio Mismatch (SRM) and what should I do?

Accepted Answer

SRM is when your traffic split does not match what you intended — for example you wanted 50/50 but got 53/47 at high volume. The tool runs a chi-square check; a failing result (p below 0.01) signals broken randomization, redirect or bot bias, or tracking issues. If SRM fails, do not interpret the test — fix the cause and rerun.
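
A chi-square goodness-of-fit check for two arms can be sketched with the standard library alone, since the one-degree-of-freedom chi-square tail probability has a closed form via `erfc`. The function name is hypothetical, and the 0.01 threshold follows the answer above.

```python
from math import erfc, sqrt

def srm_check(n_a, n_b, expected_share_a=0.5, threshold=0.01):
    """Return (chi-square statistic, p-value, True if SRM is detected)."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (n_a - expected_a) ** 2 / expected_a + (n_b - expected_b) ** 2 / expected_b
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold

chi2_big, p_big, srm_big = srm_check(5300, 4700)    # 53/47 on 10,000 visitors
chi2_small, p_small, srm_small = srm_check(530, 470)  # 53/47 on 1,000 visitors
print(srm_big, srm_small)  # True False
```

The same 53/47 split fails the check at 10,000 visitors but passes at 1,000, which is why the mismatch matters "at high volume".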

Question 16

Why does the calculator warn me when I have too few conversions or visitors?

Accepted Answer

The z-test relies on a normal approximation that breaks down with very few conversions, roughly fewer than 5 to 10 successes or failures per cell. With sparse data the p-value is unreliable, so the tool flags it and advises collecting more before trusting the verdict.

Question 17

Can I use this for revenue, average order value, or non-conversion metrics?

Accepted Answer

Yes. In Advanced mode switch the metric type to continuous and enter the mean, standard deviation and sample size per variation. The tool then runs Welch's t-test, which handles unequal variances, instead of the proportion z-test. Most A/B calculators only handle binary conversions.
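
For reference, Welch's statistic and its Welch-Satterthwaite degrees of freedom can be computed from exactly the summary inputs listed above. This is a sketch with a hypothetical function name and made-up example values; turning `t` and `df` into a p-value needs a Student t distribution, which the Python standard library does not provide.

```python
from math import sqrt

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    var_a = sd_a ** 2 / n_a  # squared standard error of mean A
    var_b = sd_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_b - mean_a) / sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical average order values: control 50 (sd 20), variation 53 (sd 25), 500 orders each.
t_stat, dof = welch_t(50.0, 20.0, 500, 53.0, 25.0, 500)
print(round(t_stat, 3), round(dof, 1))
```

Revenue data is often heavily skewed, so the summary statistics should come from a sample large enough for the means to be approximately normal.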

Question 18

Does it support more than two variations (A/B/n)?

Accepted Answer

Yes. Add variations in Advanced mode and the tool compares each against the control while applying a multiple-comparison correction (Bonferroni or Sidak), because testing several variants at once inflates the chance of a false winner if you do not adjust.
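
The two corrections named above adjust the per-comparison significance threshold as sketched here. The function names are illustrative, and how the tool applies the adjustment internally is not shown on the page.

```python
def adjusted_alpha(alpha, comparisons, method="bonferroni"):
    """Per-comparison alpha that keeps the family-wise error rate at `alpha`."""
    if method == "bonferroni":
        return alpha / comparisons
    if method == "sidak":
        return 1 - (1 - alpha) ** (1 / comparisons)
    raise ValueError(f"unknown method: {method}")

def familywise_error(alpha, comparisons):
    """Chance of at least one false winner across independent uncorrected tests."""
    return 1 - (1 - alpha) ** comparisons

bonferroni = adjusted_alpha(0.05, 3)
sidak = adjusted_alpha(0.05, 3, method="sidak")
uncorrected = familywise_error(0.05, 3)
print(round(bonferroni, 5), round(sidak, 5), round(uncorrected, 4))
```

With three variations against one control, testing each at 0.05 without correction gives roughly a 14% chance of at least one false winner. Bonferroni is slightly stricter than Sidak, and both pull the family-wise rate back to about 5%.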

A/B Test Calculator

From Test Data to a Clear Call in Three Steps

Pick Your Mode

Enter Your Numbers

Read the Verdict

The Numbers Every A/B Test Uses

A Pretty Lift Means Nothing Without the Stats

Avoid False Winners

Don't Run Forever (or Stop Too Soon)

Quantify the Risk

Align the Team

How Significance Is Calculated

Why two different standard errors?

Pooled SE for the test

Unpooled SE for the interval

Absolute vs Relative Uplift

+3 percentage points absolute

+30% relative lift

Frequentist vs Bayesian — Which to Use?

Frequentist the p-value

Bayesian chance to beat

Sample Size, Power & MDE → Duration

The Peeking Problem

Sample Ratio Mismatch (SRM)

Common A/B Testing Mistakes

How the Calculator Works

The numbers it reports

Sources & Further Reading

Frequently Asked Questions

A/B Testing Terms, in Plain English
