Descriptive statistics: Correlation | learnonline

Introduction

Health professionals are confronted with statistics on a daily basis. Examples include interpreting clinical values measured on a patient, understanding clinical guidelines or departmental reports, and importantly, reading scientific papers in order to assess the evidence for treatment. Having an understanding of statistics will empower health professionals and provide them with key tools in both understanding and applying evidence in their practice.

Topic Objectives

On completion of this topic students should be able to:

Appreciate the rationale for using statistics in health sciences
Differentiate between types of data
Use correct descriptive statistics for categorical and numeric variables
Describe the mean, median, standard deviation, range, IQR and correlation coefficient

Descriptive vs Inferential statistics

Whenever we collect health information, it is invariably on a sample. Apart from a national census, it is usually impossible to collect information on everyone in the population, for either logistical or cost reasons. When we describe our sample in terms of, for example, average age or percentage female, we are undertaking descriptive statistics. When we draw conclusions about the whole population based on our sample data, it is called inferential statistics, because we are making inferences about the population based only on sample data.

Types of quantitative data

When undertaking any statistical analysis, the type of statistics calculated or statistical test undertaken depends to a large extent on the type of variable being analysed. In this section you will learn about continuous, categorical and nominal variables.

A variable is, by definition, something that you measure that is able to vary. For example, height, weight and gender are variables. In contrast, a constant is something that always keeps the same value. Examples include pi (approximately 3.142) and e (approximately 2.718). Variables can broadly be divided into two types, categorical and numerical.

Categorical variables can be dichotomous (also called binary), nominal or ordinal. Nominal variables (from Latin for name) are things like eye colour or hair colour. We might have: 1=blue eyes, 2=brown eyes, 3=green eyes. However, we might equally have: 1=brown eyes, 2=green eyes, 3=blue eyes. In other words, it is the label that is important, not the number attached to it. When we describe nominal variables or dichotomous variables, we simply count the number and percentage in each category. It would make no sense, for example, to ask what the average eye colour is!
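To see why the label matters and the number does not, here is a minimal Stata sketch. The four observations, the variable name eyecolour and the label name eyelbl are all made up for illustration; they are not course data.

```stata
* Hypothetical example: four people with eye colour stored as numeric codes
clear
input eyecolour
1
2
3
2
end

* Attach names to the codes; the numbers themselves carry no meaning
label define eyelbl 1 "blue" 2 "brown" 3 "green"
label values eyecolour eyelbl

* Describe the variable with counts and percentages, not a mean
tabulate eyecolour
```

If the codes were reassigned (say 1=brown, 2=green, 3=blue) and the labels changed to match, the table of counts and percentages would carry exactly the same information.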

Dichotomous variables are nominal variables that can only take on two values, for example males and females. They are often coded 0 or 1, for example 0=males, 1=females. Dichotomous variables can either be true dichotomous variables like dead or alive, or they can be continuous, nominal or ordinal variables divided into two categories.

Ordinal variables have categories in which only the ordering counts. For example, we might have 0=no disease, 1=mild disease, 2=severe disease. There is a clear order here. However, the distance between no disease and mild disease might not be the same as the distance between mild disease and severe disease – only the ordering is important.

Numerical data can be counts or continuous variables. Counts are whole numbers starting from zero. Typical variables that are counts are cells on an agar plate, falls in a hospital, or the number of people with a particular disease. Continuous variables are things like blood pressure, height and body temperature. They can take on any value between their minimum and maximum. Continuous variables are sometimes divided into interval variables and ratio variables. In an interval variable the distance between readings is interpreted the same no matter where you are on the continuum. For example, the distance between 3 kg and 5 kg is the same as the distance between 7 kg and 9 kg. Ratio variables are interval variables where zero means the absence of something. For example, height is a ratio variable. On the other hand, temperature in degrees centigrade is not a ratio variable, since zero degrees does not mean the absence of temperature. Since interval and ratio variables are in most cases described and analysed in the same way, from here on, we will simply call them continuous variables.

Test yourself

In 2014, Melbourne had a population of 4,440,328.

Population size is:

a) a nominal variable

b) a ratio variable

c) a count variable

d) an ordinal variable

In terms of population size order, the five biggest cities in Australia in 2014 were: 1=Sydney, 2=Melbourne, 3=Brisbane, 4=Perth, 5=Adelaide.

Population size order is:

a) an interval variable

b) a continuous variable

c) a count variable

d) an ordinal variable

Students on this course come from different ethnic backgrounds. These are coded: 1=Aboriginal or TSI, 2=Caucasian, 3=Asian, 4=Middle Eastern, 5=Other

Ethnicity is:

a) a nominal variable

b) a continuous variable

c) a dichotomous variable

d) an ordinal variable

The average age of students who take this course is 24.3 years.

Age in years is:

The rating of degree of difficulty is:

a) a nominal variable

b) a continuous variable

c) a dichotomous variable

d) an ordinal variable

Describing continuous variables

When collecting data, we end up with a jumble of numbers which somehow we need to make sense of. Our first task is usually to summarise the data. For example, what is a typical value? What is the smallest and largest value? Descriptive statistics refers to this task of summarising a set of data.

One of the easiest ways of starting to understand the collected data is to create a frequency table. For example, the data below are the weights of 50 students in kilograms.

Table 1: Weight of 50 students

37 63 56 54 39 49 55 114 59 55

54 30 107 38 51 31 19 95 87 82

65 38 110 57 64 105 58 55 85 35

64 96 43 56 41 55 50 99 105 28

63 76 65 77 68 55 89 66 66 74

Just looking at the numbers doesn’t really help us decide, for example, what a typical student weighs. Let us now create a frequency table in which we break weight into 10 kg categories, and count how many students fall into each category.

The table below was created using the Stata tab command. The Freq. column is the simple count of how many students are in each weight group. The count describes the shape of the variable, and the table is often called the frequency distribution of the variable. In most health data, the shape of the distribution typically has the highest frequency in the middle, getting smaller as you get further away from the most common values. More of this later.

The Percent column is the count as a percentage of the total. The cumulative percent column, as its name implies, simply sums up the percentages.

Now, simply by looking at the table, we can say that the most common weight group (the one with the highest frequency) is 50-59 kg, and that 76% of all students in our sample weigh less than 80 kilograms.

Table 2: Frequency table of student weights
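A frequency table of this kind can be produced along the following lines. This is only a sketch: the variable names weight and wtgroup are chosen here for illustration, and the commands should be run from a do-file, because the /// line continuation does not work when typed into the Command window.

```stata
* Enter the 50 weights from Table 1 into a variable called weight
clear
set obs 50
generate weight = .
local values 37 63 56 54 39 49 55 114 59 55 ///
             54 30 107 38 51 31 19 95 87 82 ///
             65 38 110 57 64 105 58 55 85 35 ///
             64 96 43 56 41 55 50 99 105 28 ///
             63 76 65 77 68 55 89 66 66 74
local i = 1
foreach v of local values {
    quietly replace weight = `v' in `i'
    local ++i
}

* Cut weight into 10 kg groups: 10-19, 20-29, ..., 110-119
* (each group is labelled by its lower boundary: 10, 20, ...)
egen wtgroup = cut(weight), at(10(10)120)

* Frequency, percent and cumulative percent for each group
tabulate wtgroup
```

Counting the values in Table 1 by hand gives the following frequencies, which the commands above should reproduce:

Weight group (kg)	Freq.	Percent	Cum.
10-19	1	2	2
20-29	1	2	4
30-39	7	14	18
40-49	3	6	24
50-59	14	28	52
60-69	9	18	70
70-79	3	6	76
80-89	4	8	84
90-99	3	6	90
100-109	3	6	96
110-119	2	4	100
Total	50	100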

We can also turn the table into a chart called a histogram. Figure 2 shows a frequency histogram of the above data.

Figure 2: Histogram of the weight of 50 students
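A histogram like this can be drawn with Stata's histogram command. The sketch below assumes the 50 weights are already in memory in a variable named weight (a name chosen here for illustration), and uses 10 kg bars to match the frequency table.

```stata
* Assumes the 50 weights are already in memory in a variable named weight
* Bars are 10 kg wide, start at 10 kg, and show counts on the vertical axis
histogram weight, width(10) start(10) frequency
```

Without the frequency option, Stata plots density rather than counts on the vertical axis; the shape of the histogram is the same either way.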

Measures of Central Tendency

This is really a fancy name for asking what a typical value of the variable is. For example, what is a typical weight of the students? In common language, we call a measure of central tendency an average. However, in statistics there is more than one measure of central tendency or average. The most commonly used ones are the arithmetic mean (often just called the mean) and the median. You may occasionally come across the mode, which is the most frequently occurring value (or, for grouped data, the group with the highest frequency) - for example, in the student weights data set, the modal group was 50-59 kg. However, because the mode is very difficult to handle mathematically, we tend to not use it much.

The Mean

The mean is the most commonly used measure of central tendency or typical value. It is very easy to handle mathematically, simple to calculate, and usually falls in the middle of the data set. To calculate the mean, we simply add up all of the values and divide by the sample size. For our student weights: mean = sum of values / number of values = 3,183 / 50 = 63.7 kg.

In statistics, we often use shorthand to make the formula easier (at least for statisticians) to read. For example, instead of “Sum of values” we use the Greek letter sigma (Σ). Thus you might find the above formula written in a textbook as:

where x represents the variable to be summed.
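In symbols, the shorthand reads as follows. The symbols are the conventional ones and are assumed here: x with a bar over it stands for the sample mean, and n for the sample size.

```latex
\bar{x} = \frac{\sum x}{n}
```

Read aloud, this is simply "the mean equals the sum of the x values divided by how many there are".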

Let us now return to the histogram and the shape of distributions. Have a look again at Figure 2. Imagine if you can, instead of having 10 kg categories for weight we had 1 kg categories, and a sample size of 1,000. Now imagine if we had 1 g categories and a sample size of 100,000. As the category size gets smaller and smaller, and the sample size gets bigger and bigger, the number of bars in Figure 2 would get greater and greater, and the outline would change from a ragged shape to a beautiful smooth curve. Eventually, it might look like the shape you see in Figure 3. This distribution shape is known as the Normal or Gaussian distribution, and is very commonly found in variables related to health.

Figure 3: Normal or Gaussian distribution

Note that if a variable has a Normal distribution, the mean, median (and mode) all fall in exactly the same place, the centre of the distribution. Variables that tend to have a Normal distribution are those that have lots of inputs going into them. For example, blood pressure can be affected by age, weight, obesity, genetics, sex, position of the body etc.

However, some variables we measure in health do not have a symmetrical distribution. These are typically variables with only a few inputs. Blood lead levels in children are a good example. The two key factors driving blood lead levels in children are the child’s age (blood lead peaks at about 12 months) and exposure to a source of lead. Because of this, the distribution of children’s blood lead levels is asymmetrical, or in statistics, we say it is skewed. In the case of blood lead, it is skewed to the right, or positively skewed. This is because most children have very low levels of blood lead, but some (for example those living near a smelter), have very high levels.

When the variable of interest has a skewed distribution, the mean is no longer a good typical value. This is because the extreme values in the long tail drag the mean towards them. This happens because of the way the mean is calculated. Instead, we use another measure of central tendency – the median. To obtain the median, we simply sort all our observations into size order, and take the middle one. This works fine for an odd number of observations. For an even number of observations, we take the midpoint between the two middle ones. Let’s have a look at the weight data sorted into size order, and provided in Table 3.

Table 3: Weight of students sorted by size

19	43	55	65	87
28	49	56	65	89
30	50	56	66	95
31	51	57	66	96
35	54	58	68	99
37	54	59	74	105
38	55	63	76	105
38	55	63	77	107
39	55	64	82	110
41	55	64	85	114

Since there are 50 observations, the median lies midway between the 25th and 26th observations, that is, halfway between 58 and 59, so the median is 58.5.
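The same steps can be sketched in Stata. This assumes the 50 weights are already in memory in a variable named weight (a name chosen here for illustration).

```stata
* Assumes the 50 weights are in memory in a variable named weight
sort weight
list weight in 25/26          // the two middle observations

* With the detail option, summarize reports percentiles; the 50th is the median
summarize weight, detail
display "mean = " r(mean) "   median = " r(p50)
```

The list command should show the two middle values, 58 and 59, and the median reported by summarize is the midpoint between them.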

To demonstrate a skewed distribution and its impact on the mean, I have taken the weight of the students, and deliberately changed some of the higher values to lower values. Figure 4 shows the result.

Figure 4: Example of a skewed distribution

The distribution is clearly skewed with a long right hand tail, i.e., it is positively skewed. For the above data, the mean is 52.9 kg, whereas the median is 43.0 kg, clearly demonstrating that the mean is no longer a typical value for this dataset.

Test yourself

All people aged 70 or over in Broken Hill complete a survey measuring their physical and mental health. The mental health scale ranges from 1 to 100, where 1 represents someone with extremely poor mental health, and 100, someone with no mental health issues. The mean mental health score is much lower than the median. Discuss the implications of this.

Measures of Dispersion

As well as knowing what a typical value is for a variable, we also like to know how spread-out the observations are around that typical value. In other words, are they tightly clustered around the mean, or very scattered? In order to measure this, a sensible question might be “what is the average distance of each observation from the mean?” Let’s have a look at an example. Here are some observations:

4 4 6 2 5 3

The mean of the above observations (hopefully you can calculate it in your head) is 4. To get the average distance from the mean, we subtract the mean from each observation, add up these differences, and divide by 6. Let’s have a go:
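Worked through for the six observations, with the sample mean written as x with a bar over it and the sample size as n (conventional notation, assumed here), the calculation is:

```latex
\begin{aligned}
\sum (x - \bar{x}) &= (4-4) + (4-4) + (6-4) + (2-4) + (5-4) + (3-4) \\
                   &= 0 + 0 + 2 - 2 + 1 - 1 = 0, \\
\frac{\sum (x - \bar{x})}{n} &= \frac{0}{6} = 0.
\end{aligned}
```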

What has happened is that because some observations are greater than the mean, and some less, when you add the differences up, they always come to zero! Hmmm – so not a particularly useful measure of dispersion. To get around this problem, we square the distance of each observation from the mean.

The Standard deviation

The standard deviation (commonly abbreviated to sd), is the usual measure of dispersion or variability that you will see in published papers. Its formula looks complicated, but the idea is relatively straightforward. The variance is the square of the standard deviation, and we will look at how that is calculated first.
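In the usual textbook notation (assumed here: s squared for the sample variance, x with a bar over it for the sample mean, and n for the sample size), the variance is:

```latex
s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}
```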

In other words, subtract the mean from each observation, square the result to get rid of the minus sign, add up all of these values, and divide by (n – 1). In the above equation, n – 1 is called the degrees of freedom (you will often see this abbreviated to df). We use (n – 1) rather than n because we already know the mean value, and this makes one observation redundant. You are probably scratching your head at this!

Suppose I told you that five numbers had a mean of 3. The numbers are:

1 5 4 2 ?

Since the mean is 3, the total must be 5 x 3 = 15. The four known numbers add up to 12, so the missing number has to be 15 – 12 = 3. In other words, if you are given the mean, you only need n – 1 observations to calculate the last one.

If the original observations were in kilograms, then the variance is in units of kg². To get it back to the original units, we take the square root of the variance to arrive at the standard deviation. Thus:
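Written out, with s standing for the sample standard deviation (notation assumed here), and then applied to the six observations 4, 4, 6, 2, 5, 3 used earlier (mean 4), this gives:

```latex
\begin{aligned}
s &= \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \\[4pt]
\text{For } 4,\,4,\,6,\,2,\,5,\,3:\quad
\sum (x - \bar{x})^2 &= 0 + 0 + 4 + 4 + 1 + 1 = 10, \\
s^2 &= \frac{10}{6 - 1} = 2, \qquad s = \sqrt{2} \approx 1.41
\end{aligned}
```

Unlike the plain deviations, the squared deviations cannot cancel each other out, so the result is zero only when every observation equals the mean.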

So when we are describing a set of observations with a Normal distribution shape, we present the mean and standard deviation. For our weights of 50 students shown in Table 1, the mean is 63.7 kg, the variance 549.1 kg², and the standard deviation 23.4 kg. As an aside, things like the mean, standard deviation, median, and proportion obtained from a sample are called sample statistics.
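In practice these sample statistics come straight from the software. A minimal Stata sketch, assuming the 50 weights are in memory in a variable named weight (a name chosen here for illustration):

```stata
* Assumes the 50 weights are in memory in a variable named weight
tabstat weight, statistics(n mean sd variance)

* summarize stores the same quantities; squaring the sd gives the variance
quietly summarize weight
display "sd = " r(sd) "   variance = " r(Var) "   sd^2 = " r(sd)^2
```

The last line is a quick check that the variance really is the square of the standard deviation.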

The range and Interquartile range

We pointed out earlier that the mean is not a good typical value for skewed distributions. However, the formula for the standard deviation includes the mean. Does this imply that the standard deviation is not valid as a measure of variability for skewed distributions? In short, yes!

The median was calculated by first sorting all observations into size order and then taking the middle value. If we sort all observations into size order, then calculate the cumulative percentage for each observation as we go along, the cumulative percentages are known as percentiles. Table 4 shows an example of this process.

Table 4: Percentiles of weights of 50 students

x	Percentile	x	Percentile	x	Percentile	x	Percentile	x	Percentile
19	2	43	22	55	42	65	62	87	82
28	4	49	24	56	44	65	64	89	84
30	6	50	26	56	46	66	66	95	86
31	8	51	28	57	48	66	68	96	88
35	10	54	30	58	50	68	70	99	90
37	12	54	32	59	52	74	72	105	92
38	14	55	34	63	54	76	74	105	94
38	16	55	36	63	56	77	76	107	96
39	18	55	38	64	58	82	78	110	98
41	20	55	40	64	60	85	80	114	100

In order to describe the spread or variability of the variable when it is skewed we usually use either the range or interquartile range (IQR). The range is the difference between the maximum and minimum value. In the table above, this is 114-19 which equals 95 kg. In fact most researchers report the maximum and minimum values rather than the range. The IQR is the difference between the 25th and 75th percentiles. In the above table, this is 76.5-49.5 which is 27 kg.
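These measures can also be requested directly in Stata. The sketch below assumes the 50 weights are in memory in a variable named weight (a name chosen here for illustration).

```stata
* Assumes the 50 weights are in memory in a variable named weight
* min, max and range describe the extremes; p25, p75 and iqr the middle half
tabstat weight, statistics(min max range p25 p75 iqr)
```

Be aware that different software packages use slightly different rules for locating a percentile between two observations, so the quartiles reported may differ a little from those read off a hand-built table.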

Test yourself

1 When the average family income is reported in the news, do they mean the arithmetic mean or median?

2 When reporting blood lead levels in children, which measures of central tendency and dispersion would you use?

Correlation

Up until now, we have only considered descriptive statistics for a single variable. However, what if we have two variables and we are interested in whether or not they are associated? In other words, if one variable goes up, does the other go up with it? The measure of association we use to describe how two variables are related is called the correlation coefficient – yet another sample statistic.

The correlation coefficient (also known as the Pearson correlation coefficient) measures how well two variables are related in a linear (straight line) fashion, and is conventionally denoted r. The value of r lies between -1 and +1. A value of r = -1 means that the two variables are exactly negatively correlated, i.e., as one variable goes up, the other goes down. A value of r = +1 means that the two variables are exactly positively correlated, i.e., as one variable goes up, the other goes up. A value of r = 0 means that the two variables are not linearly related.
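For readers who like to see the formula, the standard definition of the Pearson correlation coefficient is shown below. The notation is assumed here: x and y are the two variables, and a bar denotes a sample mean.

```latex
r = \frac{\sum (x - \bar{x})(y - \bar{y})}
         {\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}
```

The top line is positive when observations above the mean on x also tend to be above the mean on y, and negative when they tend to be below it. The bottom line rescales the result so that r always lies between -1 and +1.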

Figure 5 shows the association between the heights and weights of 100 military recruits. This type of graph is called a scatter diagram.

Figure 5: Scatter diagram of the heights and weights of military recruits

There clearly appears to be a straight line trend between height and weight and the association is positive, that is weight increases with height. In fact for the above data, the Pearson correlation coefficient is r = 0.56.
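A scatter diagram and correlation of this kind can be obtained in Stata as sketched below. The recruit data are not reproduced here, so the sketch assumes a dataset in memory with one row per recruit and variables named height and weight (hypothetical names).

```stata
* Assumes a dataset in memory with one row per recruit and
* variables named height and weight (hypothetical names)
scatter weight height          // first variable on the vertical axis
correlate height weight        // Pearson correlation coefficient
```

It is good practice to look at the scatter diagram before quoting r, since r only measures the straight-line part of an association.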

Test yourself

1 Can you think of an example where you would expect a negative correlation?

2 From a graph, two variables are clearly highly associated. However, the correlation coefficient is close to zero. Why might this be the case?

Correlation

The usual measure of correlation is the Pearson correlation coefficient r. Here are some example data.

We obtain the usual correlation for these two measures using the Stata corr procedure.

corr measure1 measure2

So, r = 0.79, which is reasonably high.

If we believe that these two measures do not come from a normal distribution, we could instead calculate the Spearman rank correlation, which in Stata is the spearman command.

spearman measure1 measure2

We see that the rank correlation is a bit lower than the Pearson statistic.

To demonstrate how this works, let’s turn the above data into ranks. We combine the two measures for this purpose.
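One way to build the rank variables is with egen's rank() function, ranking each measure separately. This is a sketch under assumptions: it presumes measure1 and measure2 are already in memory, and it borrows the names rmeasure1 and rmeasure2 from the corr command that follows. The original analysis may have constructed the ranks differently.

```stata
* Assumes measure1 and measure2 are already in memory
* rank() gives 1 to the smallest value; tied values share the average rank
egen rmeasure1 = rank(measure1)
egen rmeasure2 = rank(measure2)

* Inspect the original values alongside their ranks
list measure1 rmeasure1 measure2 rmeasure2
```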

Now we will calculate the standard Pearson correlation on the ranks.

corr rmeasure1 rmeasure2

and we get the same answer (to 2 decimal places) as the rank correlation.

When one or both variables are either ordinal (not numeric) or have a distribution that is far from normal, the significance test for the Pearson correlation will no longer be valid, and a nonparametric analogue is needed.

With the example above using Stata, we get:

spearman measure1 measure2, stats(rho p)