# Determining statistical significance for email split tests, pt. 2: sample sizes

In an earlier post, we looked at the chi-squared test for independence. We used it to check whether, for example, two subject lines differ significantly in the number of email opens they generate. I gave you a “flexible” solution, meaning one that can easily be extended to your needs. One such extension is to determine the required sample size for each of your test cells a priori. There’s no question that an A/B split test with only 2 x 100 recipients is less reliable than one with 2 x 1,500 recipients. So here’s a way to choose the right sample size, again using R.

First, let’s recall our significant A/B split test example from the last post. Subject line A yielded 300 openers from 1,000 emails sent (30% open rate); subject line B generated 342 openers and 658 non-openers (34.2% open rate; the power calculations further down use 0.341). Here’s the R code, including the commands and the console output:

```
> (test.mat <- matrix(c(
+ 300,700,
+ 342,658),2,
+ dimnames=list(c("openers","nonopeners"))))

           [,1] [,2]
openers     300  342
nonopeners  700  658

> round(chisq.test(x=test.mat)$p.value*100,2)
[1] 4.96
```

(Hint: If you copy/paste this into the R console, exclude the > and the + first)

### Beware the alpha and beta error

The X-squared test’s p-value is about 4.96%. Fortunately, that’s smaller than the “alpha” we chose, namely 5%. Put differently, alpha is the risk we accept of seeing a dependency between subject line and email opens where there is none, so we can (statistically) be 95% confident that we are not falsely seeing one. This false positive risk is also called the type I error.

Now, there’s also a type II error. Its probability is called “beta” and quantifies the false negative risk, that is, the probability of falsely seeing no dependency between subject lines and email opens when there is one. Just as we call 1-alpha the confidence, 1-beta is called the test’s statistical power. You can view it as the chance of detecting an effect that really exists.
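To get a feel for both error types, here is a small simulation sketch. It assumes binomially distributed opens, takes 30% and 34.1% (the rates used in this post’s power calculation) as the “true” open rates, and uses base R’s prop.test, which for two groups should give the same p-value as the chi-squared test with continuity correction. The seed, the number of simulated tests and the helper name reject.rate are arbitrary choices for illustration.

```r
set.seed(42)        # arbitrary seed, only for reproducibility
n      <- 1000      # recipients per subject line
alpha  <- 0.05      # chosen significance level
n.sims <- 2000      # number of simulated split tests

# Share of simulated split tests whose p-value falls below alpha
reject.rate <- function(p1, p2) {
  p.values <- replicate(n.sims, {
    opens <- rbinom(2, size = n, prob = c(p1, p2))  # openers in cell A and B
    prop.test(opens, c(n, n))$p.value
  })
  mean(p.values < alpha)
}

type1.rate <- reject.rate(0.30, 0.30)    # no real difference: false positives
power.est  <- reject.rate(0.30, 0.341)   # real difference: true positives
beta.est   <- 1 - power.est              # real difference missed: false negatives

c(type1 = type1.rate, power = power.est, beta = beta.est)
```

The first number should land near (or a little below) alpha, the second one estimates the power for this cell size.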

### Sample size depends on several factors

Some rules of thumb: the smaller alpha and beta are (that is, the higher the confidence and the power), the larger the required sample size “n”. Furthermore, n depends on the expected effect size of our subject line variation. In email marketing, we mostly deal with small effect sizes of about +/-10% when testing opens, clicks, and conversions. Sometimes we see medium changes of maybe +/-30%. In any case, proving small effects demands larger sample sizes than measuring large ones of +/-50% and more.
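As a rough illustration of how strongly the effect size drives n, the following sketch asks power.prop.test for the cell size at three relative lifts. The 30% baseline open rate, alpha = 5% and power = 80% are assumptions picked for the example, not recommendations.

```r
# Assumed for illustration: 30% baseline open rate, alpha = 5%, power = 80%
baseline <- 0.30
lifts    <- c(0.10, 0.30, 0.50)   # relative changes of +10%, +30%, +50%

# Exact (fractional) recipients per cell for each lift
n.per.cell <- sapply(lifts, function(lift) {
  power.prop.test(p1 = baseline, p2 = baseline * (1 + lift),
                  sig.level = 0.05, power = 0.8)$n
})

# One row per effect size; recipients per cell rounded up
data.frame(lift = lifts, n.per.cell = ceiling(n.per.cell))
```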

### Down to business: Calculating cell sizes!

Our R function of interest is called power.prop.test. It expects the parameters

1. n (sample size per cell),
2. p1 & p2 (expected email open rates / proportions),
3. sig.level (desired level of significance, alpha),
4. and power (statistical power, 1-beta).

The function invites you to play with it: set one parameter to NULL, and power.prop.test will automatically calculate the missing one from the remaining parameters that you specified.
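For instance, the same call can be turned around to ask for the detectable open rate or for the required alpha. This is only a sketch using the open rates from this post; the variable names are made up for the example.

```r
# Smallest open rate p2 (above p1) that 1,000 recipients per cell can detect
# with 80% power at alpha = 5%
detectable <- power.prop.test(n = 1000, p1 = 0.3, p2 = NULL,
                              sig.level = 0.05, power = 0.8)
detectable$p2

# Alpha we would have to accept to reach 80% power for 0.3 vs. 0.341.
# sig.level defaults to 0.05, so it has to be set to NULL explicitly.
needed.alpha <- power.prop.test(n = 1000, p1 = 0.3, p2 = 0.341,
                                sig.level = NULL, power = 0.8)
needed.alpha$sig.level
```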

As a little warm-up, let’s see what power we achieved in our previous example by setting “power=NULL”:

```
> power.prop.test(n=1000,p1=0.3,p2=0.341,
+ sig.level=0.05,power=NULL)

     Two-sample comparison of proportions power calculation

              n = 1000
             p1 = 0.3
             p2 = 0.341
      sig.level = 0.05
          power = 0.5018259
    alternative = two.sided

NOTE: n is number in *each* group
```

What does this mean? Well, with 1,000 observations per subject line group, there’s only a 50% chance of revealing a difference between 0.341 and 0.300 as significant at a (two-sided) significance level of 5%. That’s not much. Many real effects could slip past us unseen.
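Where does that 50% come from? The sketch below recomputes the power by hand with the usual normal approximation for two proportions, which, to my understanding, is what power.prop.test evaluates in its default (non-strict) mode. The variable names are my own.

```r
n     <- 1000     # recipients per cell
p1    <- 0.3
p2    <- 0.341
alpha <- 0.05

z.crit  <- qnorm(1 - alpha / 2)                    # two-sided critical value
# sqrt(2 * pbar * (1 - pbar)), with pbar the average of both rates
sd.null <- sqrt((p1 + p2) * (1 - (p1 + p2) / 2))
# spread of the difference under the two assumed open rates
sd.alt  <- sqrt(p1 * (1 - p1) + p2 * (1 - p2))

power.manual <- pnorm((sqrt(n) * abs(p1 - p2) - z.crit * sd.null) / sd.alt)
power.manual

# Cross-check against the built-in function
power.prop.test(n = n, p1 = p1, p2 = p2, sig.level = alpha)$power
```

With these inputs the scaled difference sqrt(n) * |p1 - p2| barely exceeds the critical threshold, which is why the power sits almost exactly at one half.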

Common choices for power are 0.8 and 0.9. So let’s see how big the test would have to be for an 80% chance (beta=0.2, power=0.8):

```
> 2*power.prop.test(n=NULL,p1=0.3,p2=0.341,
+ sig.level=0.05,power=0.8)$n
[1] 4065.047
```

We would have to include 4,066 recipients in our test (2,033 per cell, rounded up), more than twice as many as before (2 x 1,000). What if we (1) accepted a greater type I error of, let’s say, 10% and (2) were additionally a little more optimistic about the effect size of our subject line, e.g. +33.33% more opens instead of +13.67%?

```
> 2*power.prop.test(n=NULL,p1=0.3,p2=0.341,
+ sig.level=0.1,power=0.8)$n
[1] 3201.799
> 2*power.prop.test(n=NULL,p1=0.3,p2=0.4,
+ sig.level=0.1,power=0.8)$n
[1] 560.5162
```

In the end, we could try our subject lines on 562 recipients, 281 per test group. But where there is actually no difference between the subject lines, we would falsely assume a significant result in one out of ten tests (alpha = 0.1 = 10%). What to choose for each parameter? Well, the setup depends on you.
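Since you can only mail whole recipients, it helps to round the per-cell size up before doubling it. A minimal sketch, with cells.needed as a hypothetical helper name and the defaults taken from the examples above:

```r
# Hypothetical helper: whole recipients per cell and in total
cells.needed <- function(p1, p2, sig.level = 0.05, power = 0.8) {
  n.exact  <- power.prop.test(p1 = p1, p2 = p2,
                              sig.level = sig.level, power = power)$n
  per.cell <- ceiling(n.exact)          # always round up, never down
  c(per.cell = per.cell, total = 2 * per.cell)
}

cells.needed(0.3, 0.341)                     # alpha = 5%
cells.needed(0.3, 0.341, sig.level = 0.1)    # alpha = 10%
cells.needed(0.3, 0.4,   sig.level = 0.1)    # alpha = 10%, bigger effect
```

The three calls correspond to the three scenarios discussed above.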

### Sample size too small? Try this…

Of course, you can also evaluate all parameters post hoc, i.e. after or even while collecting the responses from your test send-out. If you already see a winner at an acceptable confidence level after 30 minutes, go for it! But if you expected 30% and 40% open rates and only get 28.3% and 34.6% after three hours…

```
> power.prop.test(n=281,p1=0.283,p2=0.346,
+ sig.level=c(0.05,0.1),power=NULL)

     Two-sample comparison of proportions power calculation

              n = 281
             p1 = 0.283
             p2 = 0.346
      sig.level = 0.05, 0.10
          power = 0.3622373, 0.4853832
    alternative = two.sided

NOTE: n is number in *each* group
```

…you’d suffer from low statistical power. A clear winner might fly under your radar with a risk of 64% or 51% (depending on your alpha).
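To see how far off 281 recipients per cell are, you could ask how many would have been needed for the observed rates. A small sketch, again assuming a target power of 80%; the variable names are illustrative only.

```r
# Observed open rates after three hours, taken from the example above
p1.obs <- 0.283
p2.obs <- 0.346

# Recipients per cell needed for 80% power at alpha = 5% and alpha = 10%.
# n is solved numerically, so each alpha gets its own call.
n.needed <- sapply(c(0.05, 0.1), function(a) {
  power.prop.test(p1 = p1.obs, p2 = p2.obs, sig.level = a, power = 0.8)$n
})
ceiling(n.needed)
```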

One solution would be to repeat the test on another sample from your list (ceteris paribus), e.g. same subject lines, 562 different recipients. The trick is to sum up the resulting X-squared values and their degrees of freedom from each test. This is possible due to the additive property of chi-square: the sum of independent chi-squared statistics is again chi-squared distributed, with the summed degrees of freedom. Then compare the sum of Pearson’s approximated X-squared values to the critical X-squared value from the theoretical distribution for the corresponding degrees of freedom and your desired confidence level.

What if we got 28.3%/34.6% open rates in test #1 and 27.1%/35.2% in test #2: can we safely assume a subject line impact for the whole test at a 5% significance level?

```
> (test.1<-chisq.test(matrix(c(
+     0.283*281,
+     0.346*281,
+     (1-0.283)*281,
+     (1-0.346)*281),nrow=2,byrow=TRUE)))

Pearson's Chi-squared test with Yates' continuity correction

X-squared = 2.3026, df = 1, p-value = 0.1292

> (test.2<-chisq.test(matrix(c(
+     0.271*281,
+     0.352*281,
+     (1-0.271)*281,
+     (1-0.352)*281),nrow=2,byrow=TRUE)))

Pearson's Chi-squared test with Yates' continuity correction

X-squared = 3.9288, df = 1, p-value = 0.04747

> (test.chisqSum <- test.1$statistic + test.2$statistic)
X-squared
 6.231427
> test.df <- test.1$parameter + test.2$parameter
> (test.chisqCrit <- qchisq(0.95,test.df))
[1] 5.991465
> test.chisqCrit < test.chisqSum
X-squared
     TRUE
```

“TRUE” means “yes, we see enough evidence to assume a significant result for both tests combined”, i.e. for the experiment as a whole, although the individual tests delivered different results (only test #2 was significant on its own).
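If you repeat tests more often, the summing step can be wrapped in a small function that also returns a pooled p-value. This is only a sketch: pool.chisq and make.test are hypothetical helper names, and the approach assumes the individual tests are independent (different recipients) and test the same hypothesis.

```r
# Hypothetical helper: pool independent chi-squared tests of the same hypothesis
pool.chisq <- function(tests, conf.level = 0.95) {
  stat <- sum(sapply(tests, function(tst) unname(tst$statistic)))
  df   <- sum(sapply(tests, function(tst) unname(tst$parameter)))
  crit <- qchisq(conf.level, df)
  list(statistic   = stat,
       df          = df,
       critical    = crit,
       p.value     = pchisq(stat, df, lower.tail = FALSE),
       significant = stat > crit)
}

# Rebuild the two tests from the open rates and the cell size used above
make.test <- function(rate.a, rate.b, n = 281) {
  chisq.test(matrix(c(rate.a * n,       rate.b * n,
                      (1 - rate.a) * n, (1 - rate.b) * n),
                    nrow = 2, byrow = TRUE))
}

pooled <- pool.chisq(list(make.test(0.283, 0.346),
                          make.test(0.271, 0.352)))
pooled$significant
pooled$p.value
```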

What if you only have 562 subscribers available in your list? Well, test again. However, you should then at least divide the type I error probability alpha by the number of tests on the same sample. This is called Bonferroni correction. So in the above example you want to be 97.5% confident, not just 95% (5% / 2 tests = 2.5%; 1 – 2.5% = 97.5%). Why? Because every additional test on the same sample is another chance of hitting a false positive.
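A short sketch of the arithmetic behind the correction. The two p-values are simply reused from the console output above as sample numbers; p.adjust is part of base R’s stats package.

```r
alpha   <- 0.05
n.tests <- 2

# Chance of at least one false positive if both tests ran at the full alpha
# (this exact formula assumes the tests are independent)
fwer <- 1 - (1 - alpha)^n.tests
fwer                                # 0.0975 instead of 0.05

# Bonferroni: compare each p-value with alpha / number of tests ...
alpha.adj <- alpha / n.tests        # 0.025
p.values  <- c(0.1292, 0.04747)     # sample p-values, for illustration only
p.values < alpha.adj

# ... or, equivalently, inflate the p-values and keep alpha as it is
p.adjust(p.values, method = "bonferroni") < alpha
```

With the stricter threshold, neither of the two sample p-values would count as significant on its own.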

All in all, statisticians would perhaps throw their hands up in horror here and there, but hey – we are internet marketers, not pharma researchers. (Although pharma and email are so closely interwoven with one another. But that’s a different story. 😉)
