Data & Analysis · Premium · Advanced

Interpret A/B Test Results Like a Data Scientist

Get statistical significance, practical significance, and clear next steps from any A/B test

Copy & Paste this prompt
I ran an A/B test and need help interpreting the results.

Test details:
- What I tested: [DESCRIBE THE CHANGE — e.g., new button color, different headline]
- Metric measured: [PRIMARY METRIC — e.g., conversion rate, click-through rate]
- Test duration: [HOW LONG IT RAN]

Results:
- Control (A): [SAMPLE SIZE] visitors, [CONVERSIONS] conversions ([RATE]%)
- Variant (B): [SAMPLE SIZE] visitors, [CONVERSIONS] conversions ([RATE]%)
- Any secondary metrics: [LIST THEM]

Analyze this test:

1. STATISTICAL SIGNIFICANCE
   - Calculate the p-value and confidence interval
   - Is this result statistically significant at 95% confidence?
   - Was the sample size sufficient? What would be needed?

2. PRACTICAL SIGNIFICANCE
   - What is the absolute lift and relative lift?
   - Is this difference meaningful in business terms?
   - Calculate the projected annual impact (if I give you revenue/user data)

3. VALIDITY CHECK
   - Was the test duration long enough? (full business cycles)
   - Are there signs of sample ratio mismatch?
   - Could novelty effect or seasonality explain the result?

4. SEGMENTATION
   - Suggest 3 segments worth analyzing (device, source, new vs returning)
   - Could the result be driven by one segment?

5. DECISION & NEXT STEPS
   - Ship it / Kill it / Keep testing — with clear reasoning
   - If keep testing: what to change and required sample size
   - What follow-up test would you recommend?

Be rigorous. Do not let me make a decision on noisy data.
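
The prompt asks the model to do the step-1 math for you, but language models can slip on arithmetic, so it is worth re-running the numbers yourself. Here is a minimal Python sketch (assuming scipy is available) of the standard two-proportion z-test and confidence interval; the counts are the ones from the example output further down and should be swapped for your own.

```python
# Two-proportion z-test and 95% CI for an A/B test (normal approximation).
# Counts below are illustrative; swap in your own control/variant numbers.
from scipy.stats import norm

n_a, conv_a = 12_450, 387   # control: visitors, conversions
n_b, conv_b = 12_380, 425   # variant: visitors, conversions

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a            # absolute lift

# Pooled z-test of H0: the two conversion rates are equal
p_pool = (conv_a + conv_b) / (n_a + n_b)
se_pool = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
z = diff / se_pool
p_value = 2 * norm.sf(abs(z))

# Unpooled 95% confidence interval for the difference in rates
se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"absolute lift: {diff:+.2%}")
print(f"z = {z:.2f}, p = {p_value:.3f}")
print(f"95% CI for the difference: [{ci_low:+.2%}, {ci_high:+.2%}]")
```

If the model's reported p-value or interval disagrees with what this prints, trust the code.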
#data #analytics #interpret #test #results

Works with

ChatGPT · Claude · Gemini

💡 Pro Tips

  • Always check for sample ratio mismatch — if A and B have very different sample sizes, something went wrong
  • Statistical significance ≠ practical significance — a 0.01% lift can be "significant" with enough data
  • Run tests for full weeks (7, 14, 21 days) to avoid day-of-week bias
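
For the first tip, a quick concrete check for sample ratio mismatch is a chi-square goodness-of-fit test of the observed assignment counts against the intended split. A small sketch, assuming a 50/50 design and scipy (the counts are again the example's):

```python
# Sample ratio mismatch (SRM) check: is the observed traffic split consistent
# with the intended 50/50 allocation?
from scipy.stats import chisquare

n_a, n_b = 12_450, 12_380            # visitors assigned to control / variant
expected = [(n_a + n_b) / 2] * 2     # intended equal split

stat, p_value = chisquare([n_a, n_b], f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p_value:.3f}")

# SRM checks usually use a very strict threshold (e.g. p < 0.001),
# because the split is something the platform controls directly.
if p_value < 0.001:
    print("Likely sample ratio mismatch — investigate before trusting the test.")
```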

✨ Example Output

Test: New CTA button ("Start Free" vs "Sign Up")

STATISTICAL SIGNIFICANCE:
- Control: 12,450 visitors → 387 conversions (3.11%)
- Variant: 12,380 visitors → 425 conversions (3.43%)
- Absolute lift: +0.32 percentage points
- Relative lift: +10.3%
- p-value: ≈0.15 → Not statistically significant at 95%
- 95% CI for difference: [-0.12%, +0.77%] (includes zero)

PRACTICAL SIGNIFICANCE:
- The CI crosses zero — the true effect could be negligible or even slightly negative
- If the observed +0.32 pp lift holds, at 100K monthly visitors: ~320 extra conversions/month
- If each conversion is worth $20 → ~$6.4K/month uplift

VALIDITY CHECK:
⚠️ Test ran 8 days — should run at least 2 full weeks to capture weekly cycles
✅ Sample ratio: 50.1% / 49.9% — no mismatch detected
⚠️ Consider novelty effect for UI changes

DECISION: KEEP TESTING for at least 1 more full week. The observed lift is promising, but it is not statistically significant and the confidence interval includes zero. If the lift holds and reaches significance after a full 2-week cycle, ship it.
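
If the verdict is "keep testing", the natural follow-up is how much traffic a conclusive test needs. A rough sketch of the standard normal-approximation sample-size formula for two proportions (80% power, 5% two-sided alpha; the 3.1% baseline and 10% relative lift are just the example's numbers):

```python
# Rough per-arm sample size needed to detect a given relative lift between
# two conversion rates (normal approximation, two-sided test).
from scipy.stats import norm

def required_n_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    p1, p2 = p_base, p_base * (1 + rel_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the test
    z_beta = norm.ppf(power)            # value needed for the target power
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

# Example: ~3.1% baseline conversion, hoping to confirm a 10% relative lift
print(f"~{required_n_per_arm(0.031, 0.10):,.0f} visitors per arm")
```

This comes out at roughly 50K visitors per arm, so the example test's ~12K per arm is well short of a conclusive sample, which is consistent with the wide confidence interval above.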