Sample size calculation

Randomised trials and observational studies

Sample sizes for analytical studies, such as randomised trials and observational studies, are based on required power rather than required accuracy. The information needed to determine the sample size for an analytical study includes:

  • Required Type 1 error
  • Required power (1 – Type 2 error)
  • Expected effect in terms of the outcome measure
  • Some measure of the variability of the outcome measure

Let us look at each of these pieces of information in turn.

 

Type 1 error and power

In order to understand these concepts, it is probably best to first have some knowledge of hypothesis testing. Hypothesis testing usually involves 6 steps:

  1. State H0 the null hypothesis
  2. State HA the alternative hypothesis
  3. Decide on a suitable statistical test based on H0
  4. Calculate the test statistic
  5. Check the calculated p-value against the acceptable Type 1 error (usually set at 0.05)
  6. If p ≤ 0.05, reject H0

For example, the null hypothesis might be that the Intervention and Control groups have the same mean value. This is a bit like assuming a prisoner is not guilty before the start of a trial. In fact, we usually want to disprove or reject the null hypothesis.

The alternative hypothesis might be that the Intervention group mean is not the same as the Control group mean (this is called a two-sided alternative hypothesis), or the Intervention group mean is higher than the Control group mean (one-sided alternative hypothesis).

The test statistic depends on what it is you are comparing. For example, if we are comparing two means, then we usually use the independent samples t-test. If we are comparing two correlation coefficients, we would use a Z-test. If we are comparing two proportions, we would use a chi-square test. An F-test is used to compare two variances.

The next stage is calculating the value of the test statistic. Luckily, our statistical software usually does this for us. The computer then looks up the value of the test statistic in a set of tables; for each value of the test statistic, there is an associated p-value. The p-value is a probability, that is, it lies between 0 and 1. If the null hypothesis is true, we expect a small value of the test statistic and a correspondingly large p-value. If the null hypothesis is wrong, we expect a large value of the test statistic and a correspondingly small p-value. In the latter case, we say we reject the null hypothesis.
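
As a concrete illustration, here is a minimal Python sketch of steps 4 to 6 for an independent samples t-test, using scipy; the measurements themselves are made up purely for illustration:

    from scipy import stats

    # Illustrative (made-up) outcome measurements for the two groups
    intervention = [52, 48, 55, 60, 47, 53, 58, 50]
    control = [45, 41, 50, 44, 39, 47, 42, 46]

    # Step 4: calculate the test statistic (independent samples t-test)
    t_stat, p_value = stats.ttest_ind(intervention, control)

    # Steps 5 and 6: compare the p-value with the acceptable Type 1 error
    alpha = 0.05
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    if p_value <= alpha:
        print("Reject H0: the group means differ")
    else:
        print("Do not reject H0")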

However, in making this decision, there are two mistakes or errors that can happen, which are called Type 1 and Type 2 errors. These are demonstrated in Figure 2.

 

Figure 2: Type 1 and 2 errors

                    H0 true             H0 false
    Reject H0       Type 1 error        Correct decision (power)
    Accept H0       Correct decision    Type 2 error

 

As pointed out earlier, epidemiologists love to put things into 2 x 2 tables. Here we have the true state of the null hypothesis (that is, whether it is true or false) along the top, and the statistical decision (reject or accept the null hypothesis) down the side.

Let’s start with the top left-hand corner. Here, the null hypothesis is genuinely true, that is, there was no treatment effect, yet based on the results of the statistical test and associated p-value, we have decided to reject it. This is called a Type 1 error, and is usually set at 0.05, or 5%. Type 1 errors are usually caused by things like selection bias, misclassification, confounding or effect modification.

In the top right-hand corner, the null hypothesis is false (there was a genuine treatment effect), and based on the statistical test and associated p-value, we have decided to reject the null hypothesis. We were very clever and made the right decision. The probability of making this correct decision is called power, and is usually set at 0.80, or 80%.

In the bottom left-hand corner, the null hypothesis is true and we wisely do not reject it.

Finally, in the bottom right-hand corner, the null hypothesis is genuinely wrong, but we decide not to reject it. This is called a Type 2 error, and is calculated as (1 – power), so that if the power is 80%, then the Type 2 error is 20%. Type 2 errors are invariably caused by having a sample size that is too small.

Most sample size programs or websites ask you for the acceptable Type 1 error (usually set at 0.05), and either the acceptable Type 2 error (usually set at 0.1 or 0.2) or power (usually set at 90% or 80%).

 

Expected effect

The expected effect is the difference between the control and intervention groups at the end of the study. This could be a simple comparison of post-intervention outcome means, or a comparison of mean change scores. In other words, HA states that you expect one group to have a different mean or proportion from the other group, but by how much? An estimate of the expected effect is usually obtained from:

  • The literature
  • A pilot study
  • Clinical judgment
  • Interim analysis of a current study
  • Generic effect size

The generic effect size can be used if none of the things above it are available. It is usually based on Cohen’s d, which is basically the difference between two means divided by the standard deviation of the data. In other words:

            d = (mean1 – mean2) / s

Here, s is defined as:

            s = √[ (s1² + s2²) / 2 ]

where s1² is the variance of group 1 and s2² is the variance of group 2.
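
As a quick sketch of this calculation in Python (the function name and example values are mine, for illustration only, using the formula above):

    import math

    def cohens_d(mean1, mean2, sd1, sd2):
        """Cohen's d, using the average of the two group variances as above."""
        s = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)  # s = sqrt((s1^2 + s2^2) / 2)
        return (mean1 - mean2) / s

    # Example: means of 50 and 40 with standard deviations of 10 and 8
    print(round(cohens_d(50, 40, 10, 8), 2))  # 1.1, a large effect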

The absolute value of Cohen’s d ranges from 0 upwards, and is usually interpreted as:

  • 0.2 Small effect
  • 0.5 Medium effect
  • 0.8 Large effect

Although Cohen’s d was developed in psychology, it has now been widely adopted in other disciplines. A rule of thumb is that you would like to see at least a medium effect in a study for it to be clinically significant. So if you do not have an estimate of the effect size, many sample size programs allow you to enter a generic effect size instead. Be aware, though, that in some clinical fields even a small effect size is of clinical interest. There are different effect sizes for different types of statistical test. A good explanation can be found here:

https://en.wikipedia.org/wiki/Effect_size

 

Measures of variability

The usual measure of variability of the outcome measure asked for by sample size software is the standard deviation.  Like the effect size, the expected standard deviation can be found from:

  • The literature
  • A pilot study

In the published literature, you are often provided with the standard error of the mean (SEM), rather than the standard deviation (s). However, it is easy to convert one to the other:

            SEM = s / √n,   so   s = SEM × √n

where n is the sample size. So if SEM = 2 and the sample size is 36, s = 12.
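
In Python, the conversion is a single line (the SEM and sample size here are the values from the example above):

    import math

    sem, n = 2, 36
    s = sem * math.sqrt(n)  # s = SEM * sqrt(n)
    print(s)  # 12.0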

 

 

Another way of getting an approximate estimate of the standard deviation is to use the range. The range is the difference between the largest and smallest expected values of the outcome measure; simply divide the range by 4. Why does this work? If your outcome measure is Normally distributed (i.e., the frequency distribution looks like a bell-shaped curve), then approximately 95% of the observations fall within the mean ± 2 standard deviations, so the range spans roughly 4 standard deviations. So if you know the range, then 1 standard deviation is approximately a quarter of the range.
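
A minimal sketch of the range rule, assuming (purely for illustration) an outcome expected to range from 20 to 80:

    # Expected largest and smallest values of the outcome measure (assumed)
    largest, smallest = 80, 20
    sd_estimate = (largest - smallest) / 4  # the range spans roughly 4 SDs
    print(sd_estimate)  # 15.0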

 

Putting it all together

Comparing two means

Suppose we have two groups of people, and we measure their quality of life using the SF-36 quality of life measure. The first group has a mean Physical Component Score (PCS) of 50, with a standard deviation of 10. The second group has a mean PCS of 40 with a standard deviation of 8. What sample size do you need to demonstrate that the two groups differ significantly with respect to PCS?

 

First, let us go to an online sample size calculator:

https://www.sealedenvelope.com/power/continuous-superiority/

You should see the calculator's input screen.

Note that the default Type 1 error (alpha) is 5%, or 0.05, and the default power is 90%. Now change the power to 80%, enter the two means, and enter the average of the two standard deviations, approximately 9. The answer provided is 13; in other words, you would need 13 people in each group.
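
If you would like to check the website's answer yourself, the sketch below implements the usual normal-approximation formula for comparing two means; note that calculators based on the t distribution may give a slightly larger answer:

    import math
    from scipy.stats import norm

    def n_per_group_means(mean1, mean2, sd, alpha=0.05, power=0.80):
        """Sample size per group for comparing two means (normal approximation)."""
        z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for a two-sided alpha of 0.05
        z_beta = norm.ppf(power)           # 0.84 for 80% power
        n = 2 * ((z_alpha + z_beta) * sd / (mean1 - mean2)) ** 2
        return math.ceil(n)

    print(n_per_group_means(50, 40, 9))  # 13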

Comparing two proportions

Suppose you again had two groups, and wanted to compare the proportion in each group that had been to university. The two expected proportions are 28% and 20%. What sample size do you need to demonstrate that the two groups differ significantly with respect to a university education? Note that you do not need an estimate of variability, as the variability is a function of the proportion itself.

We again turn to the Sealed Envelope website:

https://www.sealedenvelope.com/power/binary-superiority/

Now change the power to 80% and enter the two proportions. The answer provided is 444; you would need 444 people in each group.
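
The same approach works for two proportions. The sketch below uses one common normal-approximation formula; other calculators may use slightly different formulas (for example, with a continuity correction) and give marginally different answers:

    import math
    from scipy.stats import norm

    def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
        """Sample size per group for comparing two proportions (normal approximation)."""
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        # The variability is a function of the proportions themselves
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
        return math.ceil(n)

    print(n_per_group_proportions(0.28, 0.20))  # 444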