Growth Experiment Framework: Our Template and Results
The exact growth experiment framework, hypothesis template, backlog system, and prioritization matrix we use to run 10+ experiments per month without chaos.
TL;DR: Most growth experiments fail before a single user sees them — because the hypothesis is vague, the success metric is wrong, or there is no baseline to measure against. This post shares the exact framework we use at every stage: hypothesis writing, scoring, templating, prioritizing, running, and reviewing experiments. You get three copy-paste templates, a 20-experiment results log with real learnings, and an FAQ covering everything from statistical significance for small teams to knowing when to stop experimenting and just make the call.
The first time I ran growth experiments seriously, I thought execution was the hard part. Build a variant, split traffic, wait two weeks, declare a winner. How complicated could it be?
Embarrassingly complicated, it turns out. After running over 300 experiments across three companies, I've landed on a humbling conclusion: most experiments fail not because the idea was bad, but because the experiment was never properly structured in the first place. The idea never had a real chance.
There are three structural failure modes that kill experiments before a single user sees them.
The first failure mode is a vague hypothesis. A hypothesis is not an idea. "Let's try a shorter onboarding flow" is an idea. A hypothesis reads like this:
"If we reduce the onboarding flow from seven steps to three by removing the company-size and industry fields — which our product logic doesn't actually use — then 7-day activation rate will increase by at least 12%, because we're reducing friction for users who receive no visible value from providing that information."
Notice what the hypothesis specifies: what you're changing (and what you're removing, and why), what metric you expect to move, the direction and magnitude of the expected change, and the causal mechanism — the why behind the prediction.
Without that structure, your team cannot evaluate whether the experiment is worth running. You cannot learn anything meaningful from the result either. A vague hypothesis produces a vague insight. You run the test, the metric moves or it doesn't, and nobody knows what actually happened or what to do next.
"A poorly written hypothesis is not a starting point. It is a dead end disguised as a beginning. Every experiment that lacks a mechanism statement is destined to teach you nothing actionable."
The second failure mode is the wrong success metric. It is subtler, and more dangerous, because the wrong metric is often a metric that sounds right.
Consider a team testing a new pricing page. They set their success metric as "pricing page visits." Visits go up during the test. They declare success. But conversion to paid doesn't move. They optimized for traffic to a page, not revenue from it.
Wrong metrics cluster into three types:
Vanity metrics — they go up easily but don't connect to real business outcomes. Raw signups, page views, sessions. Each can increase while revenue stays flat or declines. See growth metrics that actually matter for the full taxonomy of vanity vs. real metrics and how to pick the right primary metric for any experiment.
Proxy metrics too far from the outcome — your experiment affects step 2 of a 10-step funnel, but you're measuring step 8. The signal gets diluted across seven other variables you're not controlling.
Aggregate metrics that mask segment movement — overall conversion is flat, but a key segment doubled, and you never saw it because you didn't segment the result.
Before every experiment, I ask one question: if this metric moves exactly as I predict, will that unambiguously represent business value? If the answer requires more than one "if," the metric is wrong.
The third failure mode is the missing baseline. You cannot measure lift without one. This sounds obvious. It is violated constantly — especially in fast-moving teams where the pressure to ship is high and instrumentation is an afterthought.
Without a baseline, you have no idea whether a 14% conversion rate is good, bad, or unchanged from before the experiment. You cannot calculate whether you've reached statistical significance. You cannot determine the required sample size. You cannot compare this experiment's result to others in your backlog.
The discipline of recording baselines before experiments begin is one of the most valuable habits a growth team can build. It forces you to instrument your metrics before you need them, which surfaces measurement gaps early rather than mid-experiment — when it's too late to fix them without invalidating your results.
"A baseline without a source is worth nothing. Document not just the number but where it came from, what query produced it, and over what time period. Metric definitions shift subtly, and a baseline without provenance is useless six months later."
Before you can run experiments at scale, you need a consistent way to decide which experiments to run first. Three frameworks dominate the growth world. All three score experiments on multiple dimensions and produce a priority rank. They differ in what they measure and how they structure the process.
| Framework | Dimensions | Scale | Strengths | Weaknesses | Best For |
|---|---|---|---|---|---|
| PIE | Potential, Importance, Ease | 1-10 each | Forces identification of performance gaps | "Potential" is highly subjective without baselines; "Importance" hard to define consistently | CRO-focused teams testing across multiple landing pages |
| ICE | Impact, Confidence, Ease | 1-10 each | Fast to apply; makes confidence an explicit, discussable dimension | Ignores reach; two scorers produce very different numbers without calibration sessions | Teams new to experimentation — excellent starting framework |
| GROWS | Gather, Rank, Outline, Work, Study | Process-level | Treats scoring as one step in a complete workflow; the Study step enforces learning documentation | More overhead; harder to adopt quickly | Mature growth teams with established OKRs and high experiment volume |
PIE was popularized by Chris Goward at WiderFunnel. You score each experiment on Potential (how much improvement is possible?), Importance (how valuable is this traffic or user segment?), and Ease (how hard is it to run?). Average the three for a final score.
PIE's biggest weakness: "Potential" is almost entirely subjective without baseline data. Two people on the same team will score the same experiment's potential differently, which defeats the purpose of having a shared scoring system. PIE works best when paired with actual conversion data to anchor the Potential dimension.
ICE was popularized through Sean Ellis's growth hacking work. You score Impact (if it works, how big is the effect?), Confidence (how sure are you it will work, based on data and research?), and Ease (how easy is it to implement?).
ICE improves on PIE by making confidence an explicit, quantified dimension. Forcing your team to score confidence — and justify the number — surfaces assumptions that would otherwise stay hidden. The weakness: it ignores reach. A high-confidence, high-impact experiment that only affects 2% of users will score identically to one that affects 80% of users. Add reach as a fourth dimension once you have consistent traffic data.
GROWS is a complete experimentation process, not just a scoring model. The acronym maps to: Gather ideas, Rank ideas, Outline experiments, Work experiments, Study results.
What distinguishes GROWS is the Study step — a structured, documented review of what you learned, not just whether the metric moved. Teams that skip the Study step are not running a growth program. They are running a series of one-off tests that accumulate no institutional knowledge.
My recommendation: Use GROWS as your process frame, ICE as your scoring model within the Rank step (calibrated quarterly), and the template below for the Outline step. Start with ICE alone if your team is new to experimentation. Graduate to the full GROWS process once you're running five or more experiments per month.
This is the template I've used across every growth experiment since 2019. It has gone through about a dozen iterations. Every field is required. If you cannot fill out a field, the experiment is not ready to run. That is a feature, not a bug — incomplete experiments waste more time than they save.
EXPERIMENT ID: [AUTO-INCREMENT, e.g. EXP-2026-047]
EXPERIMENT NAME: [Human-readable, 5-8 words]
OWNER: [Single person accountable end-to-end]
CREATED: [YYYY-MM-DD]
STATUS: [Backlog / Ready / Running / Complete / Killed]
━━━━━━━━━━━━━━━━━━━━━━━━
HYPOTHESIS
━━━━━━━━━━━━━━━━━━━━━━━━
If [specific change],
then [primary metric] will [increase/decrease] by [X%],
because [mechanism — the causal reason this change produces this result].
━━━━━━━━━━━━━━━━━━━━━━━━
PRIMARY METRIC
━━━━━━━━━━━━━━━━━━━━━━━━
Metric: [One metric. Not two. One.]
Baseline: [Current value, e.g. 4.3%]
Observation period: [e.g. 4 weeks, Feb 3 – Mar 3]
Data source: [Platform + specific dashboard/query link]
Expected lift: [Minimum meaningful lift, e.g. +12% relative]
━━━━━━━━━━━━━━━━━━━━━━━━
GUARDRAIL METRICS
━━━━━━━━━━━━━━━━━━━━━━━━
[Metrics you watch to ensure you aren't breaking something while
optimizing the primary metric. Auto-flag if any of these degrade.]
| Guardrail Metric | Baseline | Acceptable Floor |
|------------------------|----------|------------------|
| | | |
| | | |
━━━━━━━━━━━━━━━━━━━━━━━━
EXPERIMENT DESIGN
━━━━━━━━━━━━━━━━━━━━━━━━
Control: [What control group sees / experiences]
Variant(s): [What each variant group sees / experiences]
Traffic split: [e.g. 50/50, or 33/33/33]
Targeting: [Who is eligible — new users only? US? Free plan only?]
━━━━━━━━━━━━━━━━━━━━━━━━
SAMPLE SIZE & DURATION
━━━━━━━━━━━━━━━━━━━━━━━━
Minimum detectable effect: [e.g. 10% relative lift]
Confidence threshold: [e.g. 95%]
Min sample per variant: [calculated, e.g. 1,800 users]
Estimated duration: [at current traffic, e.g. 14 days]
Start date: [YYYY-MM-DD]
Planned end date: [YYYY-MM-DD]
Early stop rule: [Do not stop before minimum sample unless
guardrail metric breached or major incident]
━━━━━━━━━━━━━━━━━━━━━━━━
IMPLEMENTATION NOTES
━━━━━━━━━━━━━━━━━━━━━━━━
Engineering effort: [e.g. 3 hours — copy and layout change, no backend]
New tracking events: [e.g. pricing_variant_b_viewed]
Dependencies: [e.g. social proof copy from marketing by Mar 8]
━━━━━━━━━━━━━━━━━━━━━━━━
RESULTS (fill after experiment ends)
━━━━━━━━━━━━━━━━━━━━━━━━
Control performance: [e.g. 4.3%]
Variant performance: [e.g. 5.1%]
Observed lift: [e.g. +18.6% relative]
Statistical significance: [e.g. p = 0.023, 97.7% confidence]
Confidence interval: [lower bound — upper bound]
Sample size reached: [Yes / No]
Guardrail status: [All within range / [Metric] breached — details]
Actual end date: [YYYY-MM-DD]
━━━━━━━━━━━━━━━━━━━━━━━━
DECISION
━━━━━━━━━━━━━━━━━━━━━━━━
[Ship / Revert / Iterate / Inconclusive — and why in 2-3 sentences]
━━━━━━━━━━━━━━━━━━━━━━━━
LEARNINGS
━━━━━━━━━━━━━━━━━━━━━━━━
[What did this experiment teach you about your users, product, or
assumptions? 3-5 sentences minimum. Was the mechanism correct?
This is the most important section — do not skip or abbreviate it.]
━━━━━━━━━━━━━━━━━━━━━━━━
FOLLOW-UP EXPERIMENTS
━━━━━━━━━━━━━━━━━━━━━━━━
[What experiments does this result suggest you run next?]
Hypothesis — the three-part structure. The "because" clause is not cosmetic. It forces you to state your causal model explicitly. When you review results, you're not just asking "did it work?" but "did it work for the reason we thought?" That distinction drives dramatically better learnings. An experiment can produce a positive result for a completely different reason than hypothesized — and if you don't examine the mechanism, you'll draw the wrong conclusion and run follow-up experiments in the wrong direction.
Single primary metric. The moment you have two primary metrics, you will cherry-pick. Human nature is not a character flaw — it is a design constraint. One experiment, one success condition. If you genuinely cannot choose between two metrics, you need to run two experiments.
Guardrail metrics. These are the metrics you watch to ensure your experiment isn't creating downstream harm while optimizing the primary. Testing aggressive email subject lines to improve open rates? Your guardrail is unsubscribe rate. Simplifying checkout to improve completion? Your guardrail is return rate and average order value. Name guardrails before you start, not after the results come in.
Minimum meaningful lift. This is different from expected lift. Expected lift is your prediction. Minimum meaningful lift is the threshold below which the result wouldn't justify shipping the change. A 0.3% lift on a metric contributing $8K/month of incremental revenue does not justify $40K of engineering work, even if it is statistically significant. Set this number explicitly before the experiment runs.
Learnings. Most teams skip this section or write one sentence. Don't. The learnings accumulated across 50 experiments become a compounding asset — they reveal which types of changes consistently work for your users, which assumptions were systematically wrong, and which areas are worth continued investment. A result without a learning is a data point. A result with a learning is institutional knowledge.
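If your team tracks experiments in a database or in code rather than a document, the same template maps naturally onto a structured record. A minimal sketch, with field names mirroring the template above rather than any particular tool:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Experiment:
    """One experiment brief, mirroring the template's required pre-launch fields."""
    experiment_id: str              # e.g. "EXP-2026-047"
    name: str
    owner: str                      # single person accountable end-to-end
    status: str                     # Backlog / Ready / Running / Complete / Killed
    hypothesis: str                 # "If X, then Y, because Z"
    primary_metric: str             # exactly one
    baseline: float
    minimum_meaningful_lift: float  # relative lift below which shipping isn't justified
    guardrails: dict[str, float] = field(default_factory=dict)  # metric -> acceptable floor
    # Results, filled only after the experiment ends.
    observed_lift: Optional[float] = None
    p_value: Optional[float] = None
    decision: Optional[str] = None  # Ship / Revert / Iterate / Inconclusive
    learnings: Optional[str] = None

    def is_ready_to_run(self) -> bool:
        # The template's rule: if a required field can't be filled, the experiment isn't ready.
        return all([self.hypothesis, self.primary_metric, self.minimum_meaningful_lift > 0])
```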
The experiment backlog is where ideas go before they become experiments. It is not a wishlist. It is a structured, scored queue with enough information to prioritize ideas and convert the best ones into fully-specced experiments without recreating context from scratch.
Most teams manage this in Notion, Linear, Airtable, or a simple spreadsheet. The platform doesn't matter. The structure does.
This is the quick-capture version for initial entry. The full experiment template gets completed only when an experiment moves to "Ready to Run."
| Field | Notes |
|---|---|
| Idea name | Short, descriptive |
| Proposed by | Person or source (customer interview, data analysis, competitor research) |
| Funnel area | Acquisition / Activation / Retention / Revenue / Referral |
| One-line hypothesis | If X, then Y, because Z |
| Target metric | The single metric this would move |
| Estimated reach | Percentage of active users affected |
| Data source | What evidence or insight generated this idea? |
| Impact score (1-10) | How large is the effect if it works? |
| Confidence score (1-10) | How sure are you it will work? |
| Ease score (1-10) | How straightforward is implementation? |
| ICE total | Average of the three scores |
| Status | Idea / Scoring / Ready / Running / Complete / Killed |
Not every idea belongs in the active backlog. Run ideas through three gates before they consume prioritization bandwidth.
Gate 1 — Minimum bar for entry. Does the idea have a one-line hypothesis, a single target metric it would move, and a source of evidence behind it (the quick-capture fields above)? If any answer is no, park the idea in a "maybe later" list, not the active backlog.
Gate 2 — Scoring threshold. Score the idea on Impact, Confidence, and Ease. Only ideas whose ICE total clears the threshold your team has set move toward sprint consideration; the rest stay in the backlog until new evidence changes a score.
Gate 3 — Spec completeness. Before an idea moves to "Ready to Run," the full experiment template must be completed. If a field cannot be filled, the idea is not ready.
Ideas themselves come from a handful of recurring sources worth mining systematically:
| Source | How to Mine It |
|---|---|
| Customer interviews | Tag transcripts for friction points and confusion moments; cluster themes monthly |
| Support tickets | Group by theme quarterly; high-frequency topics signal real friction |
| Session recordings | Watch 20-30 recordings per quarter; note precise drop-off moments |
| Funnel analytics | Find steps where conversion drops more than 15% relative to adjacent steps |
| Competitor analysis | Note what established players do differently — each is a testable hypothesis |
| Failed experiments | Review old losing variants; the idea may have been right but execution wrong |
| Team brainstorms | Monthly 30-minute ideation session; anyone on the team can contribute |
| Industry benchmarks | Significant gap vs. benchmark = opportunity worth testing |
A scoring framework gives you a number. A prioritization matrix turns numbers into a decision. These are different things. The scoring framework tells you which experiments are worth running in isolation. The prioritization matrix helps you sequence them given real-world constraints: team bandwidth, development cycles, dependencies, and where your business bottleneck actually sits right now.
Not all funnel stages have equal leverage at a given moment. Apply a weight multiplier based on your primary bottleneck. If you're unsure which stage is your bottleneck, the growth plateau diagnostic has a self-scoring framework for identifying your highest-priority constraint.
| Funnel Stage | Bottleneck: Acquisition | Bottleneck: Activation | Bottleneck: Retention |
|---|---|---|---|
| Awareness | 1.2x | 0.6x | 0.4x |
| Acquisition | 1.5x | 0.8x | 0.5x |
| Activation | 1.0x | 1.5x | 0.8x |
| Retention | 0.8x | 1.0x | 1.5x |
| Revenue | 1.0x | 1.0x | 1.2x |
| Referral | 0.7x | 0.7x | 1.0x |
Adjusted ICE = Raw ICE Average × Stage Weight Multiplier
| ID | Experiment | Stage | Impact | Conf. | Ease | Raw ICE | Stage Wt. | Adj. Score | Sprint-Ready | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| EXP-041 | In-app tutorial trigger on Day 2 | Activation | 8 | 7 | 6 | 7.0 | 1.5x | 10.5 | Yes | Approved |
| EXP-040 | Reduce onboarding steps 6→3 | Activation | 9 | 6 | 4 | 6.3 | 1.5x | 9.5 | No | Backlog |
| EXP-043 | Weekly digest email for inactive users | Retention | 7 | 8 | 7 | 7.3 | 1.0x | 7.3 | Yes | Running |
| EXP-038 | Rewrite pricing page headline | Acquisition | 7 | 6 | 9 | 7.3 | 0.8x | 5.8 | Yes | Backlog |
| EXP-039 | Social proof on signup page | Acquisition | 6 | 7 | 8 | 7.0 | 0.8x | 5.6 | Yes | Backlog |
| EXP-045 | Referral incentive for power users | Referral | 5 | 5 | 7 | 5.7 | 0.7x | 4.0 | No | Backlog |
In this example the team's bottleneck is activation. EXP-041 and EXP-040 float to the top regardless of their raw ICE scores. Acquisition and referral experiments — even well-scored ones — fall because they aren't solving the current constraint.
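A minimal sketch of the arithmetic behind this table, assuming the backlog lives in a simple list of records with the quick-capture fields described earlier:

```python
# Stage weight multipliers when the current bottleneck is activation (middle column above).
STAGE_WEIGHTS = {
    "awareness": 0.6, "acquisition": 0.8, "activation": 1.5,
    "retention": 1.0, "revenue": 1.0, "referral": 0.7,
}

backlog = [
    {"id": "EXP-041", "name": "In-app tutorial trigger on Day 2", "stage": "activation",
     "impact": 8, "confidence": 7, "ease": 6},
    {"id": "EXP-038", "name": "Rewrite pricing page headline", "stage": "acquisition",
     "impact": 7, "confidence": 6, "ease": 9},
]

def raw_ice(item: dict) -> float:
    # ICE total is the average of the three scores.
    return round((item["impact"] + item["confidence"] + item["ease"]) / 3, 1)

def adjusted_ice(item: dict, weights: dict) -> float:
    # Adjusted ICE = Raw ICE Average x Stage Weight Multiplier.
    return round(raw_ice(item) * weights[item["stage"]], 1)

for item in sorted(backlog, key=lambda i: adjusted_ice(i, STAGE_WEIGHTS), reverse=True):
    print(item["id"], raw_ice(item), adjusted_ice(item, STAGE_WEIGHTS))
# EXP-041: raw 7.0, adjusted 10.5; EXP-038: raw 7.3, adjusted 5.8 -- matching the table.
```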
Dependencies are one of the most overlooked complexity factors in experiment prioritization. Map them explicitly before sequencing your sprint.
When you have a cluster of dependent experiments, map the critical path and schedule accordingly. Ignoring dependencies costs weeks of rework and produces contaminated results that teach you nothing.
Running 10 experiments per month sounds like chaos. Done wrong, it is. Done right, it creates a compounding learning advantage that is almost impossible for slower competitors to replicate. This velocity matters even more if you're running growth loops alongside your experiments — each loop cycle generates hypotheses that feed directly back into the backlog. The difference between the two is not ambition — it is operational discipline across four dimensions.
Not every experiment needs to be a full A/B test with statistical significance. Categorize by rigor level:
| Level | Type | Threshold | Best Use |
|---|---|---|---|
| L1 — Qualitative | User interviews, session observations | 5-8 users; not statistically representative | Hypothesis generation; understanding mechanisms |
| L2 — Directional | Before/after with no control group | 2+ weeks of post-change data; acknowledge confounds | Low-traffic areas; validating direction before L3 investment |
| L3 — Controlled A/B | Proper randomized test with control | Statistical significance at pre-set minimum sample | Medium-to-high-traffic surfaces; decisions with real downstream cost |
| L4 — Multivariate | Multiple variables simultaneously | Requires very large traffic; complex statistical analysis | Only appropriate above 50,000 monthly unique visitors |
Target mix for 10 experiments per month: 4-5 L1 (fast, qualitative), 3-4 L2 (directional), 1-2 L3 (rigorous). Running too many L3 experiments at low traffic means you're spending weeks waiting for significance that will never arrive at meaningful effect sizes.
| Day | Activity | Who |
|---|---|---|
| Day 1 (Mon) | Sprint planning — select experiments, assign owners, confirm tracking is in place | Growth lead |
| Days 1-3 | Launch approved experiments; prior-sprint experiments continue running | Owners + engineering |
| Day 7 (Mon) | Mid-sprint sanity check — are experiments running correctly? any tracking failures? | Growth lead + analyst |
| Day 14 (Fri) | Sprint close — document all completed results, submit new backlog ideas | All owners |
| Day 14 (Fri) | Experiment review meeting | Full growth team |
Chaos in high-velocity experimentation almost always traces to role confusion — who approves an experiment? Who monitors whether tracking is correct? Who decides to stop early?
| Role | Responsibilities |
|---|---|
| Growth Lead | Approves experiments before launch; final ship/no-ship call; maintains master backlog |
| Experiment Owner | Writes the brief; coordinates with engineering and design; monitors results; writes learnings |
| Growth Analyst | Validates baseline measurement; calculates sample sizes; runs significance calculations; flags data anomalies |
| Engineering DRI | Implements experiment; manages feature flags; flags feasibility issues before prioritization, not after |
One person owns each experiment end-to-end. If that person is unavailable, the experiment pauses. Running experiments without accountable ownership is how you end up with unmonitored tests contaminating each other for weeks without anyone noticing.
The peeking problem: if you check results every day and stop when you first see statistical significance, you will declare false winners at rates far higher than your stated confidence level. With p < 0.05 as your threshold, peeking daily and stopping when you hit it produces actual false positive rates closer to 25-30%, not 5%.
The rule: the experiment owner is not permitted to check results until the minimum sample size is reached. If business urgency requires stopping early — a product incident, a major traffic anomaly — the growth lead documents the reason and the experiment is flagged inconclusive. Not a winner.
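To see why this rule exists, here is a minimal simulation sketch: both arms have an identical true conversion rate (an A/A test), yet stopping at the first daily p < 0.05 still declares a winner far more often than 5% of the time. The exact inflation depends on traffic and the number of looks:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value with pooled variance."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * norm.sf(abs(z))

def peeked_aa_test(true_rate=0.05, daily_users=300, days=21):
    """Returns True if daily peeking ever 'finds' significance in an A/A test."""
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(days):
        conv_a += rng.binomial(daily_users, true_rate)
        conv_b += rng.binomial(daily_users, true_rate)
        n_a += daily_users
        n_b += daily_users
        if z_test_p(conv_a, n_a, conv_b, n_b) < 0.05:
            return True  # stopped early and declared a winner that does not exist
    return False

runs = 2000
false_positive_rate = sum(peeked_aa_test() for _ in range(runs)) / runs
print(f"False positive rate with daily peeking: {false_positive_rate:.0%}")  # well above the nominal 5%
```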
Running multiple experiments simultaneously introduces interaction effects: one experiment changes the user experience in a way that affects another's results. Minimize the risk by keeping concurrent experiments on separate surfaces or non-overlapping user segments wherever possible, and by flagging any unavoidable overlap in the experiment brief so the analyst can check for interaction when reading results.
Statistical significance is the most misunderstood concept in growth experimentation. Small teams make one of two mistakes: they declare results significant too early (peeking), or they demand textbook rigor that's impossible at their traffic volumes and end up never shipping anything.
Here is a practical standard that doesn't require a staff statistician.
A p-value of 0.05 means: if there were truly no difference between control and variant, you would observe a result at least this extreme by random chance only 5% of the time. Strictly speaking, that is not the same as being 95% confident a real effect exists, but in practice it tells you the observed difference would be very surprising if the change did nothing.
It does not mean the effect is large. It does not mean the effect will persist. It means the result is unlikely to be noise. Statistical significance is a necessary condition for acting on a result, not a sufficient one.
A 0.2% lift in conversion with p = 0.03 is statistically significant. It is almost certainly not worth shipping if it requires two weeks of engineering work and the incremental revenue impact is marginal.
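As a concrete sketch, here is how the significance numbers in the RESULTS block of the template (4.3% control, 5.1% variant) might be computed. The per-arm counts are illustrative, chosen so the output lands close to the example p-value in the template:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Illustrative counts: roughly 5.1% of 7,200 variant users vs 4.3% of 7,200 control users converted.
conversions = np.array([367, 310])  # variant, control
users = np.array([7200, 7200])

z_stat, p_value = proportions_ztest(conversions, users, alternative="two-sided")

variant_rate, control_rate = conversions / users
relative_lift = (variant_rate - control_rate) / control_rate

print(f"Control {control_rate:.1%}, variant {variant_rate:.1%}, "
      f"lift {relative_lift:+.1%}, p = {p_value:.3f}")

# Significance is only the first gate: compare the observed lift to the minimum meaningful
# lift set before launch, then decide Ship / Revert / Iterate.
```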
Before every experiment, set your minimum detectable effect (MDE) — the smallest lift that would justify the cost of the change. Don't declare victory on results smaller than your MDE even if they're statistically significant. The MDE is also what you use to calculate required sample size.
Approximate minimum users per variant (90% confidence, 80% power, two-tailed test):
| Baseline Rate | 10% Relative Lift | 20% Relative Lift | 30% Relative Lift |
|---|---|---|---|
| 2% | ~33,000 | ~8,600 | ~3,900 |
| 5% | ~13,000 | ~3,400 | ~1,500 |
| 10% | ~6,400 | ~1,700 | ~750 |
| 20% | ~3,000 | ~800 | ~360 |
| 40% | ~1,300 | ~340 | ~155 |
Use 90% confidence rather than 95% for most growth decisions. Reserve 95% for high-cost, hard-to-reverse changes. The cost of under-experimentation for most small teams is higher than the cost of occasionally acting on a false positive.
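A minimal sketch of the underlying calculation, using statsmodels' normal-approximation power tools with a two-sided test. Different approximations and one-sided versus two-sided choices can shift the answer noticeably, which is why any lookup table, including the one above, is a planning guide rather than a precise requirement:

```python
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def users_per_variant(baseline: float, relative_lift: float,
                      alpha: float = 0.10, power: float = 0.80) -> int:
    """Minimum users per variant to detect a relative lift at the given confidence and power."""
    target = baseline * (1 + relative_lift)
    effect = proportion_effectsize(target, baseline)  # Cohen's h for two proportions
    n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                             power=power, alternative="two-sided")
    return math.ceil(n_per_arm)

def estimated_duration_days(n_per_variant: int, variants: int, daily_eligible_users: int) -> int:
    """Rough run length at current traffic, assuming an even split across variants."""
    return math.ceil(n_per_variant * variants / daily_eligible_users)

n = users_per_variant(baseline=0.043, relative_lift=0.10)  # template example: 4.3% baseline, 10% MDE
print(n, "users per variant,", estimated_duration_days(n, variants=2, daily_eligible_users=2000), "days")
```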
If sample size requirements exceed your available traffic, you have four honest options: test a bigger, bolder change with a larger expected effect; measure a more sensitive metric closer to the change, where the signal is less diluted; drop the rigor level and accept a directional L2 result instead of a controlled L3 test; or answer the question qualitatively with interviews and session recordings rather than an A/B test.
What you should not do: run underpowered experiments, observe a promising number, and call it significant because the p-value crossed 0.05 on the last day. That is motivated reasoning wearing a statistics costume.
A negative result is not a failure. It is evidence. Document it with the same rigor as a positive result.
When an experiment shows no significant effect, ask three distinct questions: was the hypothesis genuinely falsified? Was the experiment underpowered to detect a real effect? Or did an implementation error invalidate the test?
Each has a different implication. Falsified hypotheses update your mental model of how your users work. Underpowered experiments should be re-run with more traffic or a larger change. Implementation errors should be diagnosed before you declare the idea dead.
The experiment review meeting is the most important ritual in a high-velocity growth program. Done well, it's where the team builds shared knowledge and sharpens its collective intuition about users. Pairing this cadence with a structured growth OKR framework ensures experiments are always connected to quarterly priorities rather than running in isolation. Done badly, it becomes a status update where experiments get a thumbs-up or thumbs-down and nobody learns anything transferable to the next experiment.
| Time Block | Activity | Owner | Purpose |
|---|---|---|---|
| 0-15 min | Results presentations (3 min each) | Experiment owners | Hypothesis, result, decision — no deep analysis yet |
| 15-35 min | Learning synthesis | Growth lead + team | What do results collectively tell us? Which user models do we update? |
| 35-50 min | Backlog prioritization | Growth lead | Re-score items with new information; confirm next sprint experiments |
| 50-57 min | Blocked experiments | All | Resolve blockers; assign concrete actions |
| 57-60 min | Learning log entry | Designated note-taker | 3-5 sentences capturing the meeting's key insights |
Each experiment owner presents using this structure — no slides required, just the completed template: the hypothesis as written before launch, the observed result, the decision, and a mechanism check, meaning did the change work, or fail, for the reason predicted?
The mechanism check is the most generative question in the meeting. It forces the team to examine whether the experiment's causal story held up — which is what distinguishes real learning from statistical coincidence.
Every experiment's learning gets logged in a shared database. I use a Notion table:
| Field | Content |
|---|---|
| Experiment ID | Link to full experiment template |
| Date | When this learning was established |
| Category | User psychology / UX friction / copy / pricing / email / onboarding |
| Learning | 1-3 sentences summarizing the insight |
| Confidence | High (significant + replicated) / Medium (significant, not replicated) / Low (directional only) |
| Replicated? | Has a subsequent experiment confirmed this finding? |
| Implications | Where else in the product or funnel does this learning apply? |
After a year of consistent documentation, the learnings database becomes one of your most valuable growth assets. New team members onboard faster by reading it. You avoid re-running experiments that were already tested. You can search it before speccing a new experiment and either validate that the idea is worth testing or confirm it has already been answered.
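Wherever the log lives (Notion, a database, a CSV export), the search step before speccing a new experiment can be trivial. A minimal sketch, assuming records shaped like the table above; the example entries are taken from the results log later in this post:

```python
learnings_log = [
    {"experiment_id": "EXP-001", "category": "onboarding", "confidence": "High",
     "learning": "Users resist providing data they don't see used in the product."},
    {"experiment_id": "EXP-017", "category": "UX friction", "confidence": "High",
     "learning": "Progress indicators work consistently across all form types tested."},
]

def search_learnings(records, keyword=None, category=None, min_confidence=None):
    """Filter the learnings log before speccing a new experiment."""
    rank = {"Low": 0, "Medium": 1, "High": 2}
    matches = []
    for record in records:
        if keyword and keyword.lower() not in record["learning"].lower():
            continue
        if category and record["category"] != category:
            continue
        if min_confidence and rank[record["confidence"]] < rank[min_confidence]:
            continue
        matches.append(record)
    return matches

print(search_learnings(learnings_log, keyword="progress", min_confidence="Medium"))
```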
The HiPPO override. The Highest Paid Person's Opinion should not override a properly powered statistical result. Build an explicit team norm: results are results. If the CEO disagrees with what the data shows, the correct response is to run a follow-up experiment — not to override the finding.
The retrospective justification. After seeing results, generating post-hoc explanations for why they make sense. Counter this by requiring hypothesis and mechanism to be written and locked before the experiment launches. The team can only explain a result using evidence that existed before the result was known.
The inconclusive indefinite hold. Some experiments produce ambiguous results and the team debates endlessly whether to act. Set a policy: if there is no consensus within 10 minutes of discussion, the growth lead makes the call. Indecision has a real cost — it blocks the experiments that should have been running in its place.
These are real experiments run across companies I've worked with or advised. Metrics are generalized but directional results and learnings are authentic.
| # | Area | Hypothesis Summary | Primary Metric | Baseline | Variant | Lift | Sig | Decision | Key Learning |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Activation | Remove company-size + industry fields from signup | 7-day activation | 31% | 38% | +22.6% | 97% | Ship | Users resist providing data they don't see used in the product |
| 2 | Acquisition | Add 5 customer logos to pricing page | Trial start rate | 4.2% | 4.5% | +7.1% | 71% | Inconclusive | Directionally positive; insufficient traffic to conclude |
| 3 | Acquisition | Show "X teams signed up this week" counter on signup | Signup conversion | 8.1% | 7.6% | -6.2% | 91% | Revert | Scarcity signals feel manipulative to B2B audience |
| 4 | Retention | Send welcome email at 9am local vs. 2pm | Day-1 return rate | 22% | 26% | +18.2% | 96% | Ship | Morning email aligns with users' daily planning mindset |
| 5 | Revenue | Two-step checkout vs. single page | Checkout completion | 61% | 68% | +11.5% | 98% | Ship | Progress indicators reduce abandonment anxiety in checkout |
| 6 | Activation | Show estimated time to complete onboarding | Onboarding completion | 44% | 51% | +15.9% | 95% | Ship | Naming the time cost removes it as a barrier |
| 7 | Acquisition | "Start free" vs. "Try for free" CTA copy | CTA click rate | 11.3% | 12.1% | +7.1% | 89% | Monitor | Marginal; not worth continued copy iteration on this element |
| 8 | Retention | In-app tooltip on day 3 pointing to key feature | 30-day retention | 39% | 43% | +10.3% | 93% | Ship | Users who find the key feature on day 3 retain at significantly higher rates |
| 9 | Revenue | Annual pricing displayed first vs. monthly | Annual plan selection | 18% | 24% | +33.3% | 97% | Ship | Anchoring to annual normalizes it as the default choice |
| 10 | Activation | Remove navigation from onboarding flow | Onboarding completion | 44% | 49% | +11.4% | 94% | Ship | Removing escape paths during onboarding reliably increases completion |
| 11 | Revenue | Exit-intent modal on pricing page | Trial start rate | 4.2% | 4.4% | +4.8% | 68% | Kill | Insufficient lift; feels intrusive to B2B audience |
| 12 | Acquisition | Email subject line with company name vs. first name | Open rate | 21% | 29% | +38.1% | 99% | Ship | Company name personalization outperforms first name for B2B |
| 13 | Acquisition | Video testimonial vs. text testimonial on landing page | Demo request rate | 3.1% | 2.9% | -6.5% | 85% | Revert | Video increases time-on-page but users watch then leave without converting |
| 14 | Revenue | Show feature usage stats in upgrade nudge | Upgrade conversion | 2.8% | 3.9% | +39.3% | 98% | Ship | Usage-based nudges dramatically outperform generic upgrade prompts |
| 15 | Revenue | Add free plan to pricing page | Paid conversion | 5.1% | 4.6% | -9.8% | 91% | Revert | Free plan cannibalized paid trial signups without expanding total TAM |
| 16 | Retention | Reduce email sequence from 7 to 4 emails | 14-day return rate | 34% | 33% | -2.9% | 61% | Inconclusive | Sequence length not the variable; test email content instead |
| 17 | Activation | Add progress bar to multi-step setup form | Form completion | 58% | 66% | +13.8% | 96% | Ship | Progress indicators work consistently across all form types tested |
| 18 | Revenue | "Most popular" badge on middle pricing tier | Middle tier selection | 31% | 41% | +32.3% | 98% | Ship | Social proof on pricing is consistently one of the highest-impact levers |
| 19 | Acquisition | Show integration count on homepage hero | Signup rate | 8.1% | 8.3% | +2.5% | 63% | Kill | Integrations not a primary decision driver for this ICP |
| 20 | Activation | Move Slack integration to onboarding step 1 (was step 5) | Slack connection rate | 28% | 44% | +57.1% | 99% | Ship | Moving high-value sticky actions earlier dramatically increases completion and downstream retention |
Looking at the full set, clear patterns emerge.
What reliably works (confirmed across multiple experiments): removing fields, steps, and escape paths that add friction without visible value (#1, #10); progress and time indicators on multi-step flows (#5, #6, #17); moving high-value actions earlier in the journey (#20); personalized, usage-based messaging instead of generic prompts (#12, #14); and defaults or anchors that normalize the higher-value choice (#9, #18).
What consistently does not work (for B2B SaaS specifically): scarcity and urgency mechanics such as signup counters and exit-intent modals (#3, #11); swapping text testimonials for video (#13); adding a free plan in the hope of widening the funnel (#15); and leading with feature or integration counts that are not primary decision drivers (#19). Marginal copy tweaks (#2, #7, #16) mostly produced inconclusive results rather than clear wins or losses.
The most important pattern in the table: The highest-lift experiments were almost always about removing friction or moving a valuable moment earlier — not about adding persuasion mechanisms. This points to a principle I return to repeatedly: users want to succeed at their job. Remove obstacles. Don't add urgency designed to override their judgment.
How many experiments should we be running at our stage?
Traffic and team size — not ambition — should set your target. One experiment per growth team member per month is a sustainable starting pace. As process matures, two to three per person per month is achievable without quality degradation. Running fewer, better-structured experiments is always preferable to running many poorly-specced ones. Volume without rigor produces a large library of inconclusive results and a team that is busy but not learning.
Should every experiment be a proper A/B test?
No. Not every question requires a randomized controlled trial. Some questions are better answered by user interviews, session recordings, or simple before/after measurements. Use A/B tests when: you need to isolate the effect of a specific change, you have sufficient traffic to reach significance in a reasonable timeframe, and the decision is important enough to justify the rigor. Use simpler methods for exploration and hypothesis generation — they're faster and often more revealing.
What do we do when results contradict each other across segments?
First, verify that the segment sample sizes are large enough to draw conclusions from. A segment result based on 200 users is usually noise. If sample sizes are adequate and the contradiction is real, don't try to reconcile it into a single conclusion — document both results. Segment-level contradictions are often the most valuable learnings because they reveal that your user population is not homogeneous. Consider running dedicated experiments designed for each segment rather than seeking one solution that works for everyone.
How long should we run an experiment?
Minimum two full business cycles — two weeks — to account for day-of-week effects. Maximum six weeks before external factors contaminate the data (seasonality, product changes, marketing campaigns). If you haven't reached statistical significance in six weeks, your minimum detectable effect was set too small for your traffic volume. Accept a directional result, or redesign the experiment around a larger change.
What is the single biggest mistake teams make with growth experimentation?
Not writing down learnings. Teams invest real resources running experiments and then move on without extracting or recording the insight. Six months later, the same experiment gets re-run because nobody remembers the original result. The learnings database compounds over time — a year of consistent documentation is worth more than any single experiment result.
How do we handle seasonality when running experiments?
For experiments running over periods with known seasonality — end of quarter, holidays, major marketing campaigns — either avoid that period when possible, or ensure both control and variant are exposed to the same seasonal conditions simultaneously. What to avoid: comparing a variant period that includes a seasonal spike to a pre-change baseline period that didn't. That measures the calendar, not your experiment.
Should we use a dedicated A/B testing platform or build our own?
Use an existing platform for most teams. Optimizely, VWO, Statsig, LaunchDarkly, and GrowthBook are all viable options depending on your stack. Building experimentation infrastructure is expensive, easy to do incorrectly, and almost never a source of competitive advantage. The exception: if you need tight integration with custom data infrastructure or highly specific assignment logic that no platform supports. Start off-the-shelf. Invest in custom infrastructure only when platform limitations are measurably constraining experiment quality.
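For context on what "assignment logic" means in practice, here is a minimal sketch of the core primitive most platforms provide: deterministic, hash-based bucketing, so a given user always sees the same variant of a given experiment. Function and parameter names are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment"), weights=(0.5, 0.5)) -> str:
    """Deterministically assign a user to a variant; stable across sessions and devices."""
    key = f"{experiment_id}:{user_id}".encode()
    # Hash to a uniform number in [0, 1); different experiment IDs produce different keys,
    # so the same user is randomized independently in each experiment.
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variants[-1]

print(assign_variant("user-123", "EXP-041"))  # same output every time for this pair
```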
How do we get engineering buy-in for experiments requiring code changes?
Frame it as risk reduction, not additional work. Engineers who have shipped features that underperformed in production understand the value of a seatbelt. "Ship this change with a clean way to revert if it underperforms" resonates more than "we need to test before we ship." Investing in good feature flag infrastructure reduces per-experiment engineering cost dramatically after the initial setup. The first five experiments carry high overhead. Experiments 20 through 100 are cheap.
When should we stop experimenting and just make the call?
When you have a clear pattern from multiple experiments. If four consecutive tests have shown positive effects from simplifying your onboarding, you don't need a fifth to confirm the direction. Ship and move to the next question. The goal of experiments is to build a model of your users that guides product decisions. Once the model is clear enough to act on, act on it. Continued experimentation on a settled question is a form of analysis paralysis dressed up as rigor.
When should we kill an experiment vs. iterate on it?
Kill when: the result is clearly negative and the mechanism you hypothesized was wrong. If users actively dislike the change — guardrail metrics deteriorate, support ticket volume increases — kill it without revisiting the idea unless you have significantly new evidence. Iterate when: the result is neutral or negative but the mechanism still seems sound, suggesting execution was the problem rather than the idea. Also iterate when guardrail metrics are intact but the primary metric didn't move — sometimes a good idea needs a different implementation to work.
Growth experimentation at scale is not primarily a technical challenge. The tools exist. The methods are documented. The statistics are learnable with an afternoon of study. The hard part is discipline: writing crisp hypotheses before you're tempted to just try something, recording baselines before you need them, resisting the urge to peek at results early, and — most importantly — writing down what you learned even when the result was disappointing.
The framework in this post is the result of years of iteration across companies of different sizes and stages. It will not survive first contact with your specific team and context unchanged. Take what's useful, discard what doesn't fit, and build the habit of documenting learnings from your very first experiment. That compounding asset will pay dividends long after any individual result is forgotten.
If you run an experiment using this template and get a result worth sharing, I'd genuinely like to hear about it.