In one of the last posts, we looked at the chi-squared test for independence. We used it to check whether, for example, two subject lines have a significantly different impact on the absolute number of email opens. I provided you with a “flexible” solution, meaning it can easily be extended to your needs. One extension is to determine the required sample size for each of your test cells a priori. There’s no question that a split A/B test with only 2 x 100 recipients is less reliable than one with 2 x 1,500 recipients. So here’s a solution for choosing the right sample size, again using R.
First, let’s think back and recall our significant split A/B test example from the last post. Subject line A yielded 300 openers from 1,000 emails sent (30% open rate); subject line B generated 342 openers and 658 non-openers (34.2% open rate, entered as 0.341 in the power calculations further down). Here’s the R code, including the commands and the console output:
> (test.mat <- matrix(c(
+ 300,700,
+ 342,658),2,
+ dimnames=list(c("openers","nonopeners"))))
           [,1] [,2]
openers     300  342
nonopeners  700  658
> round(chisq.test(x=test.mat)$p.value*100,2)
[1] 4.96
(Hint: If you copy/paste this into the R console, remove the leading > and + first.)
Beware the alpha and beta error
The p-value of the chi-squared test is about 4.96%. Fortunately, that is smaller than the “alpha” we chose, namely 5%, so we call the result significant. Alpha is the risk we accept of falsely seeing a dependency between subject line and email opens when there is none; 1-alpha (here 95%) is the confidence level. This false positive error is also called the type I error.
Now, there’s also a type II error. Its probability is called “beta” and quantifies the false negative risk, that is, the probability of falsely seeing no dependency between subject lines and email opens although there is one. Where we call 1-alpha the confidence, 1-beta is called the test’s statistical power. You can view it as the chance of detecting an effect that really exists.
Sample size depends on several factors
Some rules of thumb: The smaller the alpha (i.e. the higher the confidence) and the smaller the beta (i.e. the higher the power), the larger the required sample size “n”. Furthermore, n depends on the expected effect size of our subject line variation. In email marketing, we are mostly dealing with small effect sizes of about +/-10% when testing opens, clicks, and conversions. Sometimes we see medium changes of maybe +/-30%. In any case, proving small effects demands larger sample sizes than measuring large ones of +/-50% and more.
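To make these rules of thumb tangible, here is a minimal sketch using power.prop.test from R’s built-in stats package, the function discussed in the next section. The 30% baseline open rate and the three relative uplifts are illustrative assumptions, not measured campaign data.

```r
# Assumed baseline open rate (illustrative, not from a real campaign)
p1 <- 0.30
# Relative uplifts standing in for a small, a medium and a large effect
uplift <- c(0.10, 0.30, 0.50)

# Required recipients per cell at alpha = 5% and power = 80%
n_per_cell <- sapply(uplift, function(u) {
  power.prop.test(n = NULL, p1 = p1, p2 = p1 * (1 + u),
                  sig.level = 0.05, power = 0.8)$n
})

# Round up to whole recipients and show the result side by side
data.frame(uplift = uplift, n_per_cell = ceiling(n_per_cell))
```

The required cell size should shrink quickly as the assumed uplift grows, which is the pattern described above.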
Down to business: Calculating cell sizes!
Our R function of interest is called power.prop.test. It expects the parameters
- n (sample size per cell),
- p1 & p2 (expected email open rates / proportions),
- sig.level (desired level of significance, alpha),
- and power (statistical power, 1-beta).
The function invites experimentation: set exactly one parameter to NULL, and power.prop.test will calculate it from the other parameters that you specified.
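As a hedged illustration of that mechanism, the sketch below leaves p2 open instead of n or power. It asks which open rate for subject line B could be detected with 80% power, given 1,000 recipients per cell and a 30% baseline. These input values are assumptions chosen to match the running example.

```r
# Leave p2 as NULL: power.prop.test solves for the open rate that would be
# detected with 80% power, given n = 1000 per cell and alpha = 5%
res <- power.prop.test(n = 1000, p1 = 0.3, p2 = NULL,
                       sig.level = 0.05, power = 0.8)

res$p2             # smallest detectable open rate for subject line B
res$p2 / 0.3 - 1   # the same, expressed as relative uplift over the baseline
```

Read this way, the function tells you how big an effect has to be before a test of a given size is likely to catch it.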
As a little warm-up, let’s see what power we achieved in our previous example by setting “power=NULL”:
> power.prop.test(n=1000,p1=0.3,p2=0.341,
+ sig.level=0.05,power=NULL)
     Two-sample comparison of proportions power calculation
              n = 1000
             p1 = 0.3
             p2 = 0.341
      sig.level = 0.05
          power = 0.5018259
    alternative = two.sided
NOTE: n is number in *each* group
What does this mean? Well, with 1,000 observations per subject line group, there’s only a 50% chance of revealing a difference between 0.341 and 0.300 as significant at a (two-sided) significance level of 5%. That’s not much. Many real differences could slip past us unseen.
Common choices for power are 0.8 and 0.9. So let’s see how big the test would have to be for an 80% chance (beta=0.2, power=0.8):
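One way to get a feel for what 50% power means is to simulate the test many times. The following is a minimal sketch under stated assumptions: the true open rates are exactly 30% and 34.1%, the seed and the number of repetitions (2,000) are arbitrary choices, and correct = FALSE switches off Yates’ continuity correction to stay close to the normal approximation behind power.prop.test.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
n    <- 1000   # recipients per cell
sims <- 2000   # number of simulated A/B tests (assumption)

significant <- replicate(sims, {
  a <- rbinom(1, n, 0.300)   # simulated openers for subject line A
  b <- rbinom(1, n, 0.341)   # simulated openers for subject line B
  tab <- matrix(c(a, n - a, b, n - b), nrow = 2)
  chisq.test(tab, correct = FALSE)$p.value < 0.05
})

# Share of simulated tests that reached significance
mean(significant)
```

The share should land near one half: even though the difference is real in every simulated test, roughly every second test fails to flag it.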
> 2*power.prop.test(n=NULL,p1=0.3,p2=0.341,
+ sig.level=0.05,power=0.8)$n
[1] 4065.047
We would have to include 4,066 recipients in our test (2,033 per cell, rounded up), more than twice as many as before (2 x 1,000). What if we (1) accepted a greater type I error of, let’s say, 10% and (2) were additionally a little more optimistic about the effect size of our subject line, e.g. +33.33% more opens instead of +13.67%?
> 2*power.prop.test(n=NULL,p1=0.3,p2=0.341,
+ sig.level=0.1,power=0.8)$n
[1] 3201.799
> 2*power.prop.test(n=NULL,p1=0.3,p2=0.4,
+ sig.level=0.1,power=0.8)$n
[1] 560.5162
In the end, we could try our subject lines on 562 recipients, 281 per test group. But where there is no real difference, we would falsely assume a significant result in one out of ten tests (alpha = 0.1 = 10%). What to choose for each parameter? Well, the setup depends on you.
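Since recipients only come in whole numbers, the per-cell result has to be rounded up before doubling it. A small helper can take care of that. Note that cell_sizes is a hypothetical convenience function written for this post, not part of R; it merely wraps power.prop.test.

```r
# Hypothetical helper (not part of base R): recipients per cell and in
# total, rounded up to whole recipients
cell_sizes <- function(p1, p2, sig.level = 0.05, power = 0.8) {
  n <- power.prop.test(n = NULL, p1 = p1, p2 = p2,
                       sig.level = sig.level, power = power)$n
  c(per_cell = ceiling(n), total = 2 * ceiling(n))
}

cell_sizes(0.3, 0.341)                  # alpha = 5%, power = 80%
cell_sizes(0.3, 0.4, sig.level = 0.1)   # the more optimistic setup
```

The two calls correspond to the 4,066 and 562 recipients discussed above.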
Sample size too small? Try this…
Of course, you can also evaluate all parameters post-hoc, i.e. after or even during collecting your responses from the test send-out. If you see a winner at an acceptable confidence level already after 30 minutes – go for it! If you expected 30% and 40% open rates, but you only get 28.3% and 34.6% after three hours…
> power.prop.test(n=281,p1=0.283,p2=0.346,
+ sig.level=c(0.05,0.1),power=NULL)
     Two-sample comparison of proportions power calculation
              n = 281
             p1 = 0.283
             p2 = 0.346
      sig.level = 0.05, 0.10
          power = 0.3622373, 0.4853832
    alternative = two.sided
NOTE: n is number in *each* group
…you’d suffer from low statistical power. A clear winner might fly under your radar with a risk of 64% (alpha = 5%) or 51% (alpha = 10%).
One solution would be to repeat the test on another sample from your list (ceteris paribus): e.g. same subject lines, a different 562 recipients. The trick is to sum up the resulting X-squared values and their degrees of freedom from each test. This is possible due to the additive property of chi-square: the sum of independent chi-squared statistics is itself chi-squared distributed, with the summed degrees of freedom. Then compare the sum of Pearson’s approximated X-squared values to the critical X-squared value from the theoretical distribution for the corresponding degrees of freedom and your desired confidence level.
What if we got open rates of 28.3%/34.6% in test #1 and 27.1%/35.2% in test #2? Can we safely assume a subject line impact for the whole test at a 5% significance level?
> (test.1<-chisq.test(matrix(c(
+ 0.283*281,
+ 0.346*281,
+ (1-0.283)*281,
+ (1-0.346)*281),nrow=2,byrow=T)))
	Pearson's Chi-squared test with Yates' continuity correction
X-squared = 2.3026, df = 1, p-value = 0.1292
> (test.2<-chisq.test(matrix(c(
+ 0.271*281,
+ 0.352*281,
+ (1-0.271)*281,
+ (1-0.352)*281),nrow=2,byrow=T)))
	Pearson's Chi-squared test with Yates' continuity correction
X-squared = 3.9288, df = 1, p-value = 0.04747
> (test.chisqSum <- test.1$statistic + test.2$statistic);
X-squared
 6.231427
> test.df <- test.1$parameter + test.2$parameter;
> (test.chisqCrit <- qchisq(0.95,test.df))
[1] 5.991465
> test.chisqCrit < test.chisqSum
X-squared
     TRUE
“TRUE” means “yes, we see enough evidence to assume a significant result for both tests combined”, i.e. for the experiment as a whole, although the individual experiments delivered different results.
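Instead of comparing against a critical value, the summed statistic can also be turned into a p-value with pchisq. This is a minimal sketch that simply re-enters the two rounded X-squared values reported for the tests above; it assumes the two tests are independent, which is what the additive property requires.

```r
# X-squared values and degrees of freedom as reported for the two tests
chisq_values <- c(2.3026, 3.9288)
dfs          <- c(1, 1)

chisq_sum <- sum(chisq_values)
df_sum    <- sum(dfs)

# Upper-tail probability of the summed statistic under the null hypothesis
p_combined <- pchisq(chisq_sum, df = df_sum, lower.tail = FALSE)

p_combined
p_combined < 0.05   # same decision as the critical-value comparison
```

Both routes answer the same question; the p-value just shows how close the combined result is to the 5% threshold.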
What if you only have 562 subscribers available in your list? Well, test again. However, you should at least divide the type I error probability alpha by the number of tests on the same sample. This is called Bonferroni correction. So in the above example you want to be 97.5% confident, not just 95% (5% / 2 tests = 2.5%; 1 - 2.5% = 97.5%). Why? Because every additional test on the same sample is another chance to hit a false positive.
All in all, statisticians would perhaps throw their hands up in horror here and there, but hey, we are internet marketers, not pharma researchers. (Although pharma and email are so closely interwoven with one another. But that’s a different story...)
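Applied to the summed X-squared value from the example, the correction could look like the sketch below. It reuses the reported sum of 6.231427 with 2 degrees of freedom and assumes two tests; the variable names are made up for illustration.

```r
alpha     <- 0.05
n_tests   <- 2
alpha_adj <- alpha / n_tests   # 0.025, i.e. 97.5% confidence

chisq_sum <- 6.231427          # summed X-squared from the example
crit_adj  <- qchisq(1 - alpha_adj, df = 2)

crit_adj                       # stricter critical value
crit_adj < chisq_sum           # still significant after the correction?
```

With these numbers the stricter critical value lies above the summed statistic, so the comparison comes out FALSE: the result that passed at 95% does not hold up at 97.5%.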