Determining statistical significance for email split tests, pt. 2: sample sizes

In one of the last posts, we addressed the chi-squared test for independence. With this test, we wanted to check whether, for example, two subject lines have a significantly different impact on the absolute number of email opens. I provided you with a “flexible” solution – “flexible” meaning that it can easily be extended to your needs. One such extension is determining the required sample size for each of your test cells a priori. There’s no question that a split A/B test which incorporates only 2 x 100 recipients delivers a different reliability than one which includes 2 x 1,500 recipients. So here’s a solution for choosing the right sample size – again using R.

First, let’s think back and recall our significant split A/B test example from the last post. Subject line A yielded 300 openers from 1,000 emails sent (a 30% open rate); subject line B generated 342 openers and 658 non-openers (a 34.2% open rate). Here’s the R code, including the commands and the console output:

> (test.mat <- matrix(c(
+ 300,700,
+ 342,658),2,
+ dimnames=list(c("openers","nonopeners"))))

[,1] [,2]
openers     300  342
nonopeners  700  658

> round(chisq.test(x=test.mat)$p.value*100,2)
[1] 4.96

(Hint: if you copy and paste this into the R console, remove the leading > and + prompts first.)

Beware the alpha and beta errors

The chi-squared test’s p-value is about 4.96%. Fortunately, it’s smaller than the “alpha” we chose, namely 5%. Thus, we can be (statistically) 95% confident that we are not falsely seeing a dependency between subject line and email opens. The false positive risk represented by alpha is also called the type I error.

Now, there’s also a type II error. It’s called “beta” and quantifies the false negative risk, i.e. the probability of falsely seeing no dependency between subject lines and email opens. Just as we call 1-alpha the confidence level, 1-beta is called the test’s statistical power. You can view it as the chance of actually discovering a real effect as a significant result.
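
If you want to see both error types in action, here is a minimal simulation sketch built around our example figures (the helper simulate.pvalue is made up purely for illustration and is not part of the actual testing workflow):

# Hypothetical simulation: how often does a chi-squared test flag a difference
# (a) when there is none and (b) when subject line B truly lifts the open rate
# from 30% to 34.1%?
set.seed(1)
simulate.pvalue <- function(p1, p2, n) {
  opens.A <- rbinom(1, n, p1)                    # simulated openers, line A
  opens.B <- rbinom(1, n, p2)                    # simulated openers, line B
  chisq.test(matrix(c(opens.A, n - opens.A,
                      opens.B, n - opens.B), 2))$p.value
}
# (a) no true difference: the share of p-values < 0.05 estimates alpha
#     (it should land near, or slightly below, 5%)
mean(replicate(5000, simulate.pvalue(0.30, 0.30, 1000)) < 0.05)
# (b) true difference: the share of p-values < 0.05 estimates the power (1-beta)
#     (roughly every second such test detects the difference)
mean(replicate(5000, simulate.pvalue(0.30, 0.341, 1000)) < 0.05)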

Sample size depends on several factors

Some rules of thumb: the smaller the alpha and beta (in other words, the bigger the confidence and power), the higher the required sample size “n”. Furthermore, n depends on the expected effect size of our subject line variation. In email marketing, we are mostly dealing with small effect sizes of about +/-10% when testing opens, clicks, and conversions. Sometimes we see medium changes of maybe +/-30%. Either way, proving small effects demands greater sample sizes than measuring large ones of +/-50% and more.
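
To get a feeling for how strongly the expected effect size drives n, here is a back-of-envelope sketch using the usual normal-approximation formula for comparing two proportions (essentially what the R function introduced in the next section solves for). The helper name n.per.cell and the example open rates are assumptions for illustration only:

# Approximate sample size per cell for detecting p1 vs. p2 with a two-sided test
n.per.cell <- function(p1, p2, alpha = 0.05, power = 0.8) {
  pbar <- (p1 + p2) / 2
  (qnorm(1 - alpha / 2) * sqrt(2 * pbar * (1 - pbar)) +
     qnorm(power) * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2
}
# Small, medium and large relative lifts on a 30% baseline open rate:
ceiling(n.per.cell(0.30, 0.33))   # +10% lift: several thousand recipients per cell
ceiling(n.per.cell(0.30, 0.39))   # +30% lift: a few hundred per cell
ceiling(n.per.cell(0.30, 0.45))   # +50% lift: fewer still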

Down to business: Calculating cell sizes!

Our R function of interest is called power.prop.test. It expects the parameters

  1. n (sample size per cell),
  2. p1 & p2 (expected email open rates / proportions),
  3. sig.level (desired level of significance, alpha),
  4. and power (statistical power, 1-beta).

The function invites you to play with it: set one parameter to NULL and power.prop.test will automatically calculate the missing one from the other three that you specified.
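
For instance, you can leave p2 empty and turn the question around: with a fixed budget of, say, 1,000 recipients per cell, how high would subject line B’s open rate have to be for the difference to be detected with 80% power? A minimal sketch (the 1,000 recipients, the 30% baseline and the 80% power are assumptions for illustration):

# Solve for the detectable open rate p2 instead of the sample size
power.prop.test(n = 1000, p1 = 0.3, p2 = NULL,
                sig.level = 0.05, power = 0.8)
# The reported p2 is the open rate subject line B would need to reach so that
# a test with 2 x 1,000 recipients achieves 80% power.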

As a little warm-up, let’s see what power we achieved in our previous example by setting “power=NULL”:

> power.prop.test(n=1000,p1=0.3,p2=0.341,
+ sig.level=0.05,power=NULL)

     Two-sample comparison of proportions power calculation

              n = 1000
             p1 = 0.3
             p2 = 0.341
      sig.level = 0.05
          power = 0.5018259
    alternative = two.sided

 NOTE: n is number in *each* group

What does this mean? Well, with 1,000 observations per subject line group, there’s only a 50% chance of revealing a difference between open rates of 0.300 and 0.341 as significant at a (two-sided) significance level of 5%. That’s not much. Many real differences could slip past us unnoticed.

Common choices for power are 0.8 and 0.9. So let’s see how big the test would have to be for a detection chance of 80% (beta=0.2, power=0.8):

> 2*power.prop.test(n=NULL,p1=0.3,p2=0.341,
+ sig.level=0.05,power=0.8)$n
[1] 4065.047

We would have to include 4,066 recipients in our test (2,033 per group, after rounding up the fractional group size of about 2,032.5) – more than twice as many as before (2 x 1,000). What if we (1) accepted a greater type I error of, let’s say, 10%, and (2) were additionally a little more optimistic about the effect size of our subject line impact – e.g. +33.33% more opens instead of +13.67%?

> 2*power.prop.test(n=NULL,p1=0.3,p2=0.341,
+ sig.level=0.1,power=0.8)$n
[1] 3201.799
> 2*power.prop.test(n=NULL,p1=0.3,p2=0.4,
+ sig.level=0.1,power=0.8)$n
[1] 560.5162

In the end, we could try our subject lines on 562 recipients – 281 per test group. But we would falsely assume a significant result in one out of ten tests (alpha = 0.1 = 10%). Which value should you choose for each parameter? Well, that depends on you and your setup.
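
If you would rather compare several such setups at a glance before committing to one, a small planning grid helps. This is only a sketch built around our example figures; the scenario values are assumptions to be replaced by your own:

# Hypothetical planning grid: total recipients needed (both cells together)
# for a few combinations of expected open rate, alpha and power
scenarios <- expand.grid(p2 = c(0.341, 0.4),
                         sig.level = c(0.05, 0.1),
                         power = c(0.8, 0.9))
scenarios$total.n <- mapply(function(p2, sig.level, power)
  2 * ceiling(power.prop.test(p1 = 0.3, p2 = p2,
                              sig.level = sig.level, power = power)$n),
  scenarios$p2, scenarios$sig.level, scenarios$power)
scenarios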

Sample size too small? Try this…

Of course, you can also evaluate all parameters post hoc, i.e. after or even while collecting your responses from the test send-out. If you see a winner at an acceptable confidence level after just 30 minutes – go for it! But if you expected 30% and 40% open rates, and after three hours you only get 28.3% and 34.6%…

> power.prop.test(n=281,p1=0.283,p2=0.346,
+ sig.level=c(0.05,0.1),power=NULL)

     Two-sample comparison of proportions power calculation 

              n = 281
             p1 = 0.283
             p2 = 0.346
      sig.level = 0.05, 0.10
          power = 0.3622373, 0.4853832
    alternative = two.sided

 NOTE: n is number in *each* group

…your statistical power suffers. A clear winner might fly under your radar with a risk of 64% to 51% (depending on the alpha risk you accept).

One solution would be to repeat the test on another sample from your list (ceteris paribus): e.g. the same subject lines, but a different set of 562 recipients. The trick is to sum up the resulting X-squared values and their degrees of freedom from each test, which is possible thanks to the additive property of the chi-squared distribution. Then compare the sum of Pearson’s approximated X-squared statistics to the critical X-squared value from the theoretical distribution for the corresponding degrees of freedom and your desired confidence level.

What if we got 28.3%/34.6% open rate in test #1, and 27.1%/35.2% in test #2 – can we safely assume a subject line impact for the whole test on a 5% significance level?

> (test.1<-chisq.test(matrix(c(
+     0.283*281,
+     0.346*281,
+     (1-0.283)*281,
+     (1-0.346)*281),nrow=2,byrow=T)))

        Pearson's Chi-squared test with Yates' continuity correction

X-squared = 2.3026, df = 1, p-value = 0.1292

> (test.2<-chisq.test(matrix(c(
+     0.271*281,
+     0.352*281,
+     (1-0.271)*281,
+     (1-0.352)*281),nrow=2,byrow=T)))

        Pearson's Chi-squared test with Yates' continuity correction

X-squared = 3.9288, df = 1, p-value = 0.04747

> test.chisqSum <- test.1$statistic + test.2$statistic
> test.df <- test.1$parameter + test.2$parameter
> (test.chisqCrit <- qchisq(0.95,test.df))
[1] 5.991465
> test.chisqCrit < test.chisqSum

The comparison returns “TRUE”, which means: yes, we see enough evidence to assume a significant result for the experiment as a whole – even though the individual tests delivered different verdicts.
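
If you repeat such follow-up tests more often, you can wrap the summing step into a small helper that also returns the combined p-value via pchisq. A sketch (the name combine.chisq is made up; it merely packages the additive-chi-square idea from above):

# Combine independent chi-squared tests by summing statistics and degrees of freedom
combine.chisq <- function(...) {
  tests <- list(...)
  stat  <- sum(sapply(tests, function(t) unname(t$statistic)))
  df    <- sum(sapply(tests, function(t) unname(t$parameter)))
  c(X.squared = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}
combine.chisq(test.1, test.2)   # for our example, the combined p-value is just below 0.05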

What if you only have 562 subscribers available in your list? Well, test again on the same sample. However, you should then at least divide the type I error probability alpha by the number of tests run on that sample. This is called the Bonferroni correction. So in the above example you would want to be 97.5% confident, not just 95% (5% / 2 tests = 2.5%; 1 – 2.5% = 97.5%). Why? Well… have a look at this cartoon.
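
In R, you can either lower alpha itself or, equivalently, inflate the p-values with the built-in p.adjust function. A quick sketch; the two p-values are made up purely for illustration:

alpha    <- 0.05
m        <- 2                      # number of tests run on the same sample
alpha / m                          # per-test alpha: 0.025, i.e. 97.5% confidence
p.values <- c(0.0475, 0.0310)      # hypothetical p-values from two such tests
p.adjust(p.values, method = "bonferroni") < alpha   # same decision as p.values < alpha/m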

All in all, statisticians would perhaps throw their hands up in horror here and there, but hey – we are internet marketers, not pharma researchers. (Although pharma and email are so closely interwoven with one another… but that’s a different story. ;-))
