Sample size calculation


Introduction

Whether you are undertaking qualitative or quantitative research as part of your PhD, and whether your study is on communities, people, rats or test tubes, you must justify the size of the sample you intend to use. Epidemiologists love breaking things into two: for example, dead or alive, smoker or non-smoker. So following on from this tradition, studies can also be broadly broken down into two types, descriptive and analytic.

With descriptive studies, we are simply trying to describe something about a group of people. Examples of descriptive studies include: surveys, case series, and qualitative research. Of the descriptive study designs, it is only surveys that need concern us with respect to sample size. Sample sizes for case series or qualitative research are usually based on the need to reach saturation, that is, the point at which further interviews would provide no new information.

Analytic studies are different in that they usually involve hypothesis testing. Examples of analytic studies include: Randomised Controlled Trials (RCT), cohort studies, case-control studies and cross-sectional studies. Sometimes there is a bit of overlap. For example, you might want to compare groups as part of a survey. However, the primary objective is still descriptive, and the analytic part a secondary objective, or hypothesis generating exercise. Similarly, when reporting the results of an RCT, you still need to describe the subjects in each study arm.

 

Topic Objectives

On completion of this topic students should be able to:

  1. Appreciate the different requirements for sample sizes for descriptive and analytic studies
  2. For descriptive studies, be able to calculate sample sizes based on desired accuracy
  3. For analytic studies, be able to calculate sample sizes based on power
  4. Have an understanding of available software and websites for sample size calculations, and when to seek further help

Descriptive studies

Descriptive studies are usually undertaken using surveys. This requires some understanding of the target population, i.e., the entire group of individuals to which researchers wish to generalize the results. For example, you might be undertaking a survey of adults (20 years and over) in South Australia (survey population of approximately 1,200,000), but wish to be able to generalize the results to all adults in Australia (approximately 16,800,000) – the target population.

 

Surveys

Sample sizes for surveys are based on accuracy. In other words, if you are trying to measure a rate (or proportion), how accurate do you want the estimate to be? For example, you might want to estimate the proportion (p) of adults in Australia who are married. You think it is probably around 50%, and you would like an estimate within ± 1% accuracy – that is, the true value is likely to fall within the range 49–51% with 95% confidence. This is in fact the 95% Confidence Interval for the sample proportion (p) of 0.5, and we use this to determine the required sample size.

The formula for the 95% Confidence Interval (CI) for p is:

            95% CI (p) = p ± 1.96 × SE (p),   where SE (p) = √( p(1 – p) / n )

Here, SE (p) is the standard error of the proportion. The formula for SE (p) has the square root of n, the sample size, in the denominator. In other words, as the sample size gets bigger, SE (p) gets smaller, the 95% CI (p) gets narrower, and we get a more accurate estimate. So to increase accuracy, we simply need to increase the sample size.

From the above example, with an expected proportion of 50%, a required accuracy of ± 1%, and a survey population of 1,200,000 adults in South Australia, Figure 1 shows the required sample size is about 6,500 responses.

Figure 1: Required sample size

In fact, the required sample size is very much dependent on the expected response for each individual question in a survey. The required sample size is at a maximum when 50% of respondents say “yes” and 50% say “no”. For this reason, we tend to use 50% in the sample size formula, since it will provide us with the maximum sample size required for any question.

Some points to remember are:

  • Higher requirements for accuracy, e.g., ±1%, require a larger sample size than lower accuracy, e.g., ±5%
  • Any deviation from an expected prevalence estimate of 50% requires a smaller sample size
  • The bigger the target population, the bigger the required sample size, tapering off after about 3,000 (see the sketch below).
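These behaviours all follow from the confidence interval formula above. Here is a rough sketch in Python, assuming the usual normal-approximation formula with a finite population correction (individual websites differ in their exact methods, so their answers may differ somewhat):

    from math import ceil

    def survey_sample_size(p=0.5, accuracy=0.01, population=None, z=1.96):
        """Required sample size to estimate a proportion p to within
        +/- accuracy, with 95% confidence when z = 1.96."""
        n = z**2 * p * (1 - p) / accuracy**2        # infinite-population formula
        if population is not None:                  # finite population correction
            n = n / (1 + (n - 1) / population)
        return ceil(n)

    print(survey_sample_size(accuracy=0.01))                     # 9604: +/- 1% needs far more...
    print(survey_sample_size(accuracy=0.05))                     # 385: ...than +/- 5%
    print(survey_sample_size(p=0.2, accuracy=0.01))              # 6147: moving away from 50% helps
    print(survey_sample_size(accuracy=0.05, population=3000))    # 341: the population size...
    print(survey_sample_size(accuracy=0.05, population=1200000)) # 385: ...matters little once large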

 A website that allows you to undertake these calculations online can be found here:

https://www.surveymonkey.com/mp/sample-size-calculator/

Randomised trials and observational studies

Sample sizes for analytical studies such as randomised trials and observational studies are based on required power, rather than required accuracy. The information required to determine the sample size for an analytical study includes:

  • Required Type 1 error
  • Required Power (1 – Type 2 error)
  • Expected effect in terms of the outcome measure
  • Some measure of the variability of the outcome measure

Let us have a look at these pieces of information that are required.

 

Type 1 error and power

In order to understand these concepts, it is probably best to first have some knowledge of hypothesis testing. Hypothesis testing usually involves 6 steps:

  1. State H0 the null hypothesis
  2. State HA the alternative hypothesis
  3. Decide on a suitable statistical test based on H0
  4. Calculate the test statistic
  5. Check the calculated p-value against the acceptable Type 1 error (Usually set at 0.05)
  6. If p ≤ 0.05, reject H0

For example, the null hypothesis might be that the Intervention and Control groups have the same mean value. This is a bit like assuming a prisoner is not guilty before the start of a trial. In fact, we usually want to disprove or reject the null hypothesis.

The alternative hypothesis might be that the Intervention group mean is not the same as the Control group mean (this is called a two-sided alternative hypothesis), or the Intervention group mean is higher than the Control group mean (one-sided alternative hypothesis).

The test statistic depends on what it is you are comparing. For example, if we are comparing two means, then we usually use the independent samples t-test. If we are comparing two correlation coefficients, we would use a Z-test. If we are comparing two proportions, we would use a chi-square test. An F-test is used to compare two variances.

The next stage is calculating the value of the test statistic. Luckily, our statistical software usually does this for us. The computer then looks up the value of the test statistic in a set of tables; for each value of the test statistic, there is an associated p-value. The p-value is a probability, that is, it lies between 0 and 1. If the null hypothesis is true, we expect a small value of the test statistic, and a correspondingly large p-value. If the null hypothesis is wrong, we expect a large value of the test statistic, and a correspondingly small p-value. In the latter case, we say we reject the null hypothesis.
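As a concrete illustration of steps 4 to 6, here is a minimal sketch in Python using simulated data (the group means, standard deviation and sample sizes are made up purely for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    control = rng.normal(loc=40, scale=9, size=30)       # simulated control group scores
    intervention = rng.normal(loc=50, scale=9, size=30)  # simulated intervention group scores

    # Independent samples t-test of H0: the two group means are equal
    t_stat, p_value = stats.ttest_ind(intervention, control)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

    # Compare the p-value against the acceptable Type 1 error
    if p_value <= 0.05:
        print("Reject H0")
    else:
        print("Do not reject H0")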

However, in making this decision, there are two mistakes or errors that can happen, which are called Type 1 and Type 2 errors. These are demonstrated in Figure 2.

 

Figure 2: Type 1 and 2 errors 

                          H0 true                 H0 false
    Reject H0             Type 1 error (α)        Correct decision (power, 1 – β)
    Do not reject H0      Correct decision        Type 2 error (β)

 

As pointed out earlier, epidemiologists love to put things into 2 x 2 tables. Here we have the true state of the null hypothesis (that is, whether it is true or false) on the top, and the statistical decision (reject or accept the null hypothesis) on the side.

Let’s start with the top left hand corner. Here, the null hypothesis is genuinely true, that is, there was no treatment effect, yet based on the results of the statistical test and associated p-value, we have decided to reject it. This is called a Type 1 error, and is usually set at 0.05, or 5%. Type 1 errors are usually caused by things like selection bias, misclassification, confounding or effect modification. 

In the top right hand corner, the null hypothesis is false, there was a genuine treatment effect, and based on the statistical test and associated p-value, we have decided to reject the null hypothesis. We were very clever and made the right decision. This is called power, and is usually set at 0.80 or 80%.

In the bottom left hand corner, the null hypothesis is true and we wisely do not reject it.

Finally, in the bottom right hand corner, the null hypothesis is genuinely wrong, but we decide not to reject it. This is called a Type 2 error, and is calculated as (1 – power), so that if the power is 80%, then the Type 2 error is 20%. Type 2 errors are invariably caused by having a sample size that is too small.

Most sample size programs or websites ask you for the acceptable Type 1 error (usually set at 0.05), and either the acceptable Type 2 error (usually set at 0.1 or 0.2) or power (usually set at 90% or 80%).

 

Expected effect

The expected effect is the difference between the control and intervention groups at the end of the study. This could be a simple comparison of post-intervention outcome means, or a comparison of mean change scores. In other words, HA states that you expect one group to have a different mean or proportion from the other group, but by how much? An estimate of the expected effect is usually obtained from:

  • The literature
  • A pilot study
  • Clinical judgment
  • Interim analysis of a current study
  • Generic effect size

The generic effect size can be used if none of the things above it are available. It is usually based on Cohen’s d, which is basically the difference between two means divided by the standard deviation of the data. In other words:

            d = (mean1 – mean2) / s

Here, s is defined as:

            s = √( (s₁² + s₂²) / 2 )

where s₁² is the variance of group 1 and s₂² is the variance of group 2.

Cohen’s d (in absolute value) falls in the range 0 to a large positive number, but is usually interpreted as:

  • 0.2 Small effect
  • 0.5 Medium effect
  • 0.8 Large effect
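For example, a minimal calculation of d in Python (the means and standard deviations are illustrative values only):

    from math import sqrt

    def cohens_d(mean1, mean2, sd1, sd2):
        """Cohen's d, using the average of the two group variances."""
        s = sqrt((sd1**2 + sd2**2) / 2)
        return (mean1 - mean2) / s

    print(cohens_d(50, 40, 10, 8))  # about 1.10 – a large effect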

Although Cohen’s d was developed in psychology, it has now been widely adopted in other disciplines. A rule of thumb is that you would like to see at least a medium effect in a study, in order for it to be clinically significant. So if you do not have an estimate of the effect size, many sample size programs allow you to enter a generic effect size instead. Be aware that in some branches of clinical study, a small effect size would be of clinical interest. There are different effect sizes for different types of statistical test. A good explanation can be found here:

https://en.wikipedia.org/wiki/Effect_size

 

Measures of variability

The usual measure of variability of the outcome measure asked for by sample size software is the standard deviation.  Like the effect size, the expected standard deviation can be found from:

  • The literature
  • A pilot study

In the published literature, you are often provided with the standard error of the mean (SEM), rather than the standard deviation (s). However, it is easy to convert one to the other:

            SEM = s / √n,   so   s = SEM × √n

where n is the sample size. So if SEM = 2 and the sample size is 36, s = 2 × √36 = 12.
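Or, as a one-off check in Python:

    from math import sqrt

    sem, n = 2, 36      # figures from the example above
    s = sem * sqrt(n)   # s = SEM x sqrt(n)
    print(s)            # 12.0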

 

 

Another way of getting an approximate estimate of the standard deviation is by using the range. The range is the difference between the largest and smallest expected values of the outcome measure – then simply divide the range by 4. Why does this work? Well, if your outcome measure is Normally distributed (i.e., the frequency distribution looks like a bell-shaped curve), then about 95% of the observations fall within the mean ± 2 standard deviations – a span of roughly 4 standard deviations. So if you know the range, then 1 standard deviation is approximately a quarter of the range.
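A quick simulated check of this rule of thumb (purely illustrative – the sample size and true standard deviation here are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=50, scale=10, size=30)   # true standard deviation is 10

    print((x.max() - x.min()) / 4)   # range/4 estimate: roughly 10
    print(x.std(ddof=1))             # sample standard deviation: roughly 10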

 

Putting it all together

Comparing two means

Suppose we have two groups of people, and we measure their quality of life using the SF-36 quality of life measure. The first group has a mean Physical Component Score (PCS) of 50, with a standard deviation of 10. The second group has a mean PCS of 40 with a standard deviation of 8. What sample size do you need to demonstrate that the two groups differ significantly with respect to PCS?

 

First, let us go to an online sample size calculator:

 https://www.sealedenvelope.com/power/continuous-superiority/

Note that the default Type 1 error (alpha) is 5% or 0.05, and the default power is 90%. Now change the power to 80%, enter the two means, and enter the average of the two standard deviations, approximately 9. The calculator reports that you would need 13 people in each group.
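If you would like to check this figure yourself, here is a minimal sketch of the usual normal-approximation formula in Python (the website may use a slightly different method, so small discrepancies are possible):

    from math import ceil
    from scipy.stats import norm

    def n_per_group_means(mean1, mean2, sd, alpha=0.05, power=0.80):
        """Per-group sample size for comparing two means (normal approximation)."""
        z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
        z_beta = norm.ppf(power)            # 0.84 for 80% power
        n = 2 * (sd * (z_alpha + z_beta) / (mean1 - mean2))**2
        return ceil(n)

    print(n_per_group_means(50, 40, 9))  # 13 per group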

Comparing two proportions

Suppose you again had two groups, and wanted to compare the proportion in each group that had been to university. The two expected proportions are 28% and 20%. What sample size do you need to demonstrate that the two groups differ significantly with respect to a university education? Note that you do not need an estimate of variability, as it is a function of the proportion itself.

We again turn to the Sealed Envelope website:

https://www.sealedenvelope.com/power/binary-superiority/

Now change the power to 80% and enter the two proportions. The calculator reports that you would need 444 people in each group.
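The corresponding normal-approximation formula for two proportions can be sketched the same way (again, the website's exact method may differ slightly):

    from math import ceil
    from scipy.stats import norm

    def n_per_group_props(p1, p2, alpha=0.05, power=0.80):
        """Per-group sample size for comparing two proportions (normal approximation)."""
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        n = (z_alpha + z_beta)**2 * variance / (p1 - p2)**2
        return ceil(n)

    print(n_per_group_props(0.28, 0.20))  # 444 per group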

 

Where to next?

The above examples are very simple, and there is a good chance you might have more complicated analyses requiring a sample size calculation. More advanced analyses might include:

  • Analysis of variance (ANOVA)
  • Analysis of covariance (ANCOVA)
  • Many types of regression analysis
  • Repeated measures designs
  • Clustered designs

A very good and comparatively powerful sample size program is called GPower. It can be downloaded free of charge from here:

http://www.gpower.hhu.de/en.html

GPower can undertake sample size calculations for most of the above situations, and primarily uses effect sizes. It comes with an extensive help manual.

However, unless you are very familiar with the different types of effect sizes (and they can be complicated), it would be better to book some time with one of the biostatisticians.