Overview of statistical software packages: Introduction to Statistical Software

Background

For many students, the thought of having to undertake statistical analyses is uncomfortable. This is often because mathematics and statistics were poorly taught at school and barely covered during undergraduate training. Further, let’s face it, mathematics and statistics are conceptually difficult. However, there really is no need to panic. There is plenty of support available to make you more comfortable with undertaking statistical analyses, including this online course, biostatistical consultants, websites, YouTube tutorials, and MOOCs (massive open online courses).

If you would like face-to-face assistance, then information about biostatistical support can be found here:

http://www.unisa.edu.au/Health-Sciences/Research/Biostatistical-and-epidemiological-support/

In addition, there is a multitude of statistical software packages available that can do a lot of the work for you, and these are the focus of this module. However, before we start looking at these, a question that often arises is “How do I get my data into a statistical package?”. The good news is that most statistical software can read data directly from an Excel spreadsheet, so using Excel is often the easiest solution. Alternatively, you can enter data directly into a statistical package, since nearly all of them have some form of inbuilt spreadsheet. Another solution is to use software like SurveyMonkey (https://www.surveymonkey.com/) to collect the data. SurveyMonkey can export the data as an Excel spreadsheet or in SPSS format. A final solution is to use specialised data entry software. This has the advantage of allowing range checks on data entry fields. For example, if a field should only contain 0 or 1 and you try to enter anything else, the software will not accept it. A really good and free data entry program is EpiData Entry, provided by the EpiData Association in Denmark. It is available from here: http://www.epidata.dk/download.php
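To make the idea of a range check concrete, here is a minimal sketch in Python. It is only an illustration of the principle, not how EpiData Entry or any other package implements it. The field names (smoker, age), their permitted ranges, and the function name check_record are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical rules (assumptions for illustration): field name -> (minimum, maximum)
FIELD_RANGES: Dict[str, Tuple[int, int]] = {
    "smoker": (0, 1),   # binary field: only 0 or 1 is allowed
    "age": (0, 120),    # plausible range for age in years
}


def check_record(record: Dict[str, str]) -> List[str]:
    """Return a list of error messages; an empty list means the record passes."""
    errors = []
    for field, (low, high) in FIELD_RANGES.items():
        raw = record.get(field, "")
        try:
            value = int(raw)
        except ValueError:
            # Covers missing entries and anything that is not a whole number
            errors.append(f"{field}: '{raw}' is not a whole number")
            continue
        if not low <= value <= high:
            errors.append(f"{field}: {value} is outside the range {low} to {high}")
    return errors


print(check_record({"smoker": "1", "age": "34"}))  # prints []
print(check_record({"smoker": "2", "age": "34"}))
# prints ['smoker: 2 is outside the range 0 to 1']
```

Dedicated data entry software applies this kind of rule at the moment a value is typed, so errors are caught before they reach the dataset rather than during analysis.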

There are many commercial statistical packages available, some of which UniSA has licences for. In addition, there are several free statistical packages available from the internet. For example, PSPP is a free program designed to look and work like SPSS, and can be downloaded here:

https://www.gnu.org/software/pspp/get.html

There are also many websites where you can undertake online statistical analyses. A good starting place is:

http://statpages.info/

There are also many specialised software programs for things like graphs, sample size calculations, and genetic analyses. Again, some are commercial, but others can be freely downloaded. A good example is the sample size software G*Power, which can be downloaded here: http://www.gpower.hhu.de/en.html

In fact, the number and diversity of software packages and websites are so large that reviewing all of them would be a full-time job!

However, some software packages are readily available and often used at UniSA, including Microsoft Excel, SPSS, SAS, Stata and R. These are briefly overviewed here, and further details about each package are provided in subsequent modules.

Microsoft Excel

History

Excel is part of the Microsoft Office suite of programs. Version 1.0 was released in 1985, and at the time of writing the latest version is Excel 2016.

Good points

Extremely easy to use, and exchanges data easily with other Microsoft products
Excel spreadsheets can be read by many other statistical packages
Includes an add-in (the Analysis ToolPak) for undertaking basic statistical analyses
Can produce very nice graphs

Bad points

Designed primarily for financial calculations, although it can be used for many other things
Cannot undertake more sophisticated statistical analyses without the purchase of expensive commercial add-ons

Availability

Most computers come with Microsoft software already installed. For blue-plated (UniSA) computers, contact the IT Help Desk to install the latest Microsoft Office software. For your own computer, you can purchase Microsoft Office from a retail store.

SPSS

SPSS stands for Statistical Package for the Social Sciences. It was one of the earliest statistical packages, with Version 1 released in 1968, well before the advent of desktop computers. At the time of writing it is on Version 23.

Good points

Very easy to learn and use
Can be used with either menus or syntax files
Quite good graphics
Excels at descriptive statistics, basic regression analysis, analysis of variance, and some newer techniques such as Classification and Regression Trees (CART)
Has its own structural equation modelling software, AMOS, which integrates with SPSS

Bad points

Focus is on statistical methods mainly used in the social sciences, market research and psychology
Has advanced regression modelling procedures such as linear mixed models (LMM) and generalised estimating equations (GEE), but they are difficult to use because the syntax is very obscure
Has few of the more powerful techniques required in epidemiological analysis, such as competing risk analysis or standardised rates

Availability

SPSS is available on blue-plated (UniSA) computers. If it is not on the one that you use, contact the IT Help Desk to install it. Staff are allowed to use SPSS at home for a cost of $10. Unfortunately, students have no home use rights, but can purchase a nearly full version called a Premium Grad-pack, with a 2-year licence, for approximately $250 from Hearne Software.

SAS

SAS stands for Statistical Analysis System. It was developed at North Carolina State University from 1966, so it is contemporary with SPSS.

Good points

Can be used with either menus or syntax files
Much more powerful than SPSS
Commonly used for data management in clinical trials

Bad points

Harder to learn and use than SPSS

Availability

Health Sciences has a Division licence for SAS 9.4M3, which is available to the Division’s staff and students. To organise installation, contact the IT Help Desk. SAS also has a free version, SAS University Edition. Details are available here: http://www.sas.com/en_us/software/university-edition.html

Stata

Stata is a more recent statistical package, with Version 1 released in 1985. Since then, it has become increasingly popular in the areas of epidemiology and economics, and probably now rivals SPSS and SAS in its user base. At the time of writing it is on Version 14.

Good points

Can be used with either menus or syntax files
Much more powerful than SPSS – probably equivalent to SAS
Excels at advanced regression modelling
Has its own in-built structural equation modelling
Has a good suite of epidemiological procedures
Researchers around the world write their own procedures in Stata, which are then available to all users

Bad points

Harder to learn and use than SPSS
Does not yet have some specialised techniques such as CART or partial least squares regression

Availability

Stata can be installed on blue-plated computers by contacting the IT Help Desk. Students can purchase a full copy with a perpetual licence from the Australian distributors (Survey Design and Analysis) for about $200. The Division is currently examining licensing arrangements.

R

S-PLUS is a commercial statistical programming language developed in Seattle in 1988. R is a free implementation of the same underlying language (S), developed in 1996. Since then, the original team has expanded to include dozens of individuals from all over the globe. Because R is a programming language and environment rather than a menu-based system, it is used by giving the software a series of commands, often saved in text documents called syntax files or scripts. For this reason, it is probably best used by people who are already reasonably expert at statistical analysis, or who have an affinity for computers.

Good points

Very powerful: matches or even surpasses the range of models available in SAS or Stata
Researchers around the world write their own procedures in R, which are then available to all users
Free!

Bad points

Much harder to learn and use than SAS or Stata

Availability

R can be downloaded from here:

http://cran.csiro.au/