Introduction to SAS

Site: learnonline
Course: Research Methodologies and Statistics
Book: Introduction to SAS
Date: Friday, 22 November 2024, 11:16 AM

1. What is SAS

SAS, or Statistical Analysis System, is a software suite developed by the SAS Institute for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics. A review of statistical packages concluded that SAS provides researchers with an extraordinary range of data analysis and data management capabilities; however, it can be difficult to learn compared with SPSS and Stata (although these have fewer analytic abilities without user add-ons).

SAS is commonly used by researchers working with very large data sets because it is a powerful suite that lets you do almost anything you like with your data.

 

2. Navigating the SAS interface

SAS is a Windows program; however, a web-based version, SAS Studio, can be accessed via programs such as Oracle VM VirtualBox for Mac users, or similar for Windows users.

There are three main windows: the program editor window, the log window and the output window. A side panel with two further windows (Explorer and Results) can help you manage your files and output.

The program editor window

Programs can be opened, saved and edited within this window. There are two versions of this window: Program Editor and Enhanced Editor. The latter will colour code your command statements.

The log window

As you run (or execute) your data steps and procedures, they will appear in the log window. Errors in your code will be highlighted in this window. SAS usually points out your mistakes or other issues with warning and error messages so that you can correct them:

  • Note (blue text): information which does not indicate errors. Notes are usually informative and help rule out methodological errors in your programming, e.g. how many observations were read and whether there were any missing values.
  • Warning (green text): points out errors which SAS could correct itself. The execution was performed with these changes, but you should still check that it was done properly. Example: misspelled keywords.
  • Error (red text): serious errors which SAS could not handle. The execution was stopped. These errors must be corrected by you. Examples: forgotten semicolons, invalid options, misspelled variable names, etc.

If you are updating data sets, be especially aware that red errors mean NO update has occurred!

Make a habit of checking the Log window after every execution. Even if SAS has accepted and executed your program, you may have made a mistake in your code.
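As a small illustration of why the log is worth reading even when nothing turns red, the sketch below (the names demo, x and y are made up for this example) does arithmetic on a missing value. SAS runs the step without an error, but the log should include a note that missing values were generated, alongside the usual note giving the number of observations and variables written.

```sas
/* Hypothetical example: the names demo, x and y are illustrative only */
data demo;
    x = .;        /* a missing numeric value */
    y = x + 1;    /* arithmetic on a missing value gives a missing result */
run;
```

The step completes, so only the blue notes in the log reveal that y is missing.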

The output window

The Output window contains the ‘printed’ output from the SAS procedures that you have run. When you run your first program, SAS will by default create a Results window that shows the printed output from your procedure in HTML format. To change this preference so that your output is sent only to the Output window (the choice is yours), click Tools, select Options, then Preferences, open the Results tab, and there remove the tick from Create HTML and tick Create listing.
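The same choice can usually be made in code with ODS statements instead of the menus. A minimal sketch, assuming a SAS session on Windows in which the HTML destination is open by default:

```sas
ods html close;   /* stop sending output to the HTML Results window */
ods listing;      /* send output to the Output (listing) window instead */
```

Running the reverse pair (ods listing close; followed by ods html;) should switch back to HTML output.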

It may be necessary (for example, when running out of memory) or convenient to clear the contents of the output and log windows. To clear the output window, press the blank page button. It’s important to remember that you cannot edit a LOG file or an OUTPUT file in their windows. You can only read them (or delete them). To edit these files you will need to save or copy the contents, then open or paste them in the SAS Editor or Word.
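Clearing can also be done from a program. The line below is a sketch that assumes the SAS windowing environment (not SAS Studio); it uses the DM statement to send display manager commands that clear the Log and Output windows in turn.

```sas
/* Clear the Log window, then the Output window (windowing environment only) */
dm 'log; clear; output; clear;';
```

Placing this at the top of a program gives a clean log for each run, which makes new warnings and errors easier to spot.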

The results tab

This window displays a summary of your output and can help navigate your results.

There are some variations in the user interface between SAS and SAS Studio. If you are using the web-based version, SAS Studio, the following video is a good introductory guide:

3. Getting data in and out of SAS

There are many methods of getting data into SAS. SAS allows us to enter (type) data directly into a program, read in (open) a previously saved SAS data file, read in data from ASCII data files that have the data stored in a variety of formats, or import a data file from several different packages.
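As an example of one of these routes, reading in a previously saved SAS data file can be sketched as follows. The folder path, the libref mylib and the data set name patients are all hypothetical and would need to match your own files.

```sas
/* Hypothetical folder and data set names */
libname mylib "C:\mydata";    /* folder that holds saved SAS data sets */

data work.patients;
    set mylib.patients;       /* reads patients.sas7bdat from that folder */
run;
```

The LIBNAME statement gives the folder a short nickname (a libref), and the SET statement copies the saved data set into the temporary WORK library.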

4. Rules for SAS variable names

  • SAS variable names can be up to 32 characters long.
  • The first character must be an English letter (A, B, C, . . ., Z) or underscore (_). Subsequent characters can be letters, numeric digits (0, 1, . . ., 9), or underscores.
  • You can use upper or lowercase letters.
  • Blanks cannot appear in SAS names.
  • Special characters, except for the underscore, are not allowed.
  • SAS reserves a few names for automatic variables and variable lists, SAS data sets, and librefs.
  • When creating variables, do not use the names of special SAS automatic variables (for example, _N_ and _ERROR_) or special variable list names (for example, _CHARACTER_, _NUMERIC_, and _ALL_).
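The rules above can be illustrated with a short data step. All names here are invented for the example; the invalid ones are kept inside a comment so the step still runs.

```sas
/* Hypothetical variable names used only to illustrate the naming rules */
data names_demo;
    systolic_bp = 120;    /* valid: letters and an underscore */
    _visit2     = 1;      /* valid: starts with an underscore, contains a digit */
    WeightKg    = 70;     /* valid: mixed case; SAS treats weightkg as the same variable */
    /* Invalid under the rules above (each would cause an error):
       2ndvisit = 1;        starts with a digit
       blood pressure = 1;  contains a blank
       weight-kg = 70;      contains a special character */
run;
```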

When entering data manually, it’s important to identify character values so SAS knows to read the data as a string variable. For example, if we wish to enter some data collected from male and female patients, gender must be identified as a character variable. This is done using ‘$’, as summarised below:
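A minimal sketch of such a data step is given here. The variable names and every value are made up for illustration; the point to notice is the $ after gender in the INPUT statement.

```sas
/* Illustrative values only: not real patient data */
data patients;
    /* the $ after gender tells SAS to read it as a character variable */
    input id age gender $ sbp dbp height weight;
    datalines;
1 34 M 120 80 178 82
2 45 F 132 85 165 68
3 29 F 118 76 160 55
4 61 M 145 92 172 90
5 52 F 128 84 158 71
;
run;
```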

This code enters a data set with 5 patients (IDs 1-5), where each patient has their age, gender (as a character), systolic BP, diastolic BP, height and weight recorded. You will notice a semi-colon ‘;’ at the end of each statement. This is critical: if the semi-colon is omitted, the statement will not run correctly. Once the code has been typed in the Editor window, click the "running man" button to execute the code.

Most of the time we will have large data sets in spreadsheets like Excel. You are able to import .csv data saved from Excel directly into SAS. You first need to tell SAS which directory you are using. If I’m working on my desktop I would type:
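One common way to point SAS at a folder is a LIBNAME statement; whether this matches the statement originally shown here is an assumption. The libref mydata and the path are hypothetical and should be replaced with your own.

```sas
/* Hypothetical path: replace with the folder that holds your files */
libname mydata "C:\Users\yourname\Desktop";
```

After this runs, the log should confirm that the libref was successfully assigned.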

The next step is to name the data that you’ll be importing. This allows SAS to associate a name with the stored data so you can easily use data from this set. It also means that you can have multiple data sets open at once, each with a different name. Let’s consider a new example of data from a large study looking at plasma iron levels in patients (file called healthiron2015.csv). Let’s import the data and understand the code:
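A sketch of what such an import step could look like is shown below. The data set name iron, the folder path, and the list and order of columns are assumptions made for illustration (the variable names are taken from those mentioned elsewhere in this book), so check them against the actual file before use.

```sas
/* Sketch only: the path, data set name and column order are assumptions */
data iron;
    infile "C:\Users\yourname\Desktop\healthiron2015.csv"
        firstobs=2 delimiter=',' missover dsd;
    /* read every column as numeric; a character column would need a $ informat */
    informat id sex genotype menopause
             agebase agefup ferrbase ferrfup tsatbase tsatfup best32.;
    input id sex genotype menopause
          agebase agefup ferrbase ferrfup tsatbase tsatfup;
run;
```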

The code above includes a number of options and statements, which we will define:

  • firstobs=2 tells SAS that the actual observations start in the 2nd row of the spreadsheet. Include it if and only if your first row contains column labels such as ID, weight etc.
  • delimiter=',' tells SAS that your data values are separated by commas.
  • missover prevents an INPUT statement from reading a new input data record if it does not find values in the current input line for all the variables in the statement. When an INPUT statement reaches the end of the current input data record, variables without any values assigned are set to missing.
  • dsd tells SAS that two commas in a row should be read as a missing value.
  • informat is an instruction that SAS uses to read data values into a variable. Unless you explicitly define a variable first, SAS uses the informat to determine whether the variable is numeric or character. SAS also uses the informat to determine the length of character variables. In this example we’ve told SAS what each column variable is.
  • input describes the arrangement of values in the input data record and assigns input values to the corresponding SAS variables.

Next you may want to check that the data has been imported correctly. To do this we use a proc print step. This is a really useful procedure, as it tells SAS to return a table (with all or some of the observations). To limit the number of observations printed we can specify obs=10. We are also able to give the table a title. The code would be:
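A minimal sketch, assuming the imported data set is called iron (a hypothetical name) and using an invented title:

```sas
/* Assumes a data set named iron already exists in the WORK library */
proc print data=iron (obs=10);
    title "First 10 observations of the iron data";
run;
```

The obs=10 data set option stops the listing after the tenth observation, which keeps the check quick on a large file.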

The following table is produced.

 

You can also do a 'global' check of the data set which has been imported using:

which returns the following output:
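One procedure that gives this kind of overview is PROC CONTENTS, which lists the number of observations and each variable's name, type, length and format. Whether it is the procedure originally shown here is an assumption, as is the data set name iron.

```sas
/* Assumes a data set named iron already exists in the WORK library */
proc contents data=iron;
run;
```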

Variable names like ‘sex’ with codes 1 and 2 for males and females aren’t very helpful. Creating value labels makes data easier to read and work with. The following code will be used to create value labels for genotype, sex, menopause etc., and these labels will be attached to the imported variables:
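Value labels of this kind are typically built with PROC FORMAT and then attached with a FORMAT statement. In the sketch below the format names sexfmt and menofmt and the data set name iron are hypothetical. The coding 1 = Male and 2 = Female follows this book; the reading of menopause as 0 = No and 1 = Yes is an assumption, and genotype is left out because its codes are not given here.

```sas
/* Define the value labels (format names are hypothetical) */
proc format;
    value sexfmt  1 = "Male"
                  2 = "Female";
    value menofmt 0 = "No"      /* assumed meaning of code 0 */
                  1 = "Yes";    /* assumed meaning of code 1 */
run;

/* Attach the labels to the variables; assumes a data set named iron exists */
data iron;
    set iron;
    format sex sexfmt. menopause menofmt.;
run;
```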

Running the proc print  command will now show Male and Female as opposed to 1 and 2 in addition to text for other variables. The numerical code is still there, but we now see it as text.

It’s often a good idea to check for and identify invalid or missing data. Let’s look at frequency counts for the variable menopause in females and check that men haven’t been coded as menopausal!
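A check like this can be sketched with two PROC FREQ steps, one per sex. The data set name iron and the titles are assumptions; the codes sex=1 (male) and sex=2 (female) follow this book.

```sas
/* Assumes a data set named iron with numeric variables sex and menopause */
proc freq data=iron;
    where sex = 1;                 /* males only */
    tables menopause;
    title "Menopause status recorded for males";
run;

proc freq data=iron;
    where sex = 2;                 /* females only */
    tables menopause;
    title "Menopause status recorded for females";
run;
```

By default PROC FREQ leaves missing values out of the table and reports how many there were underneath it.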

In this example you’ll notice that we have included two proc statements before running the code. You'll also notice we used where sex=1 in the command (remember the numerical code is still there even though we see Male). If you have binary data, you can limit your search to one group using the where command. Once the code has run, the following tables will be produced.

We can see from the output that there are 2 missing data points in the menopause data for females. Let's identify the ID numbers of these patients:
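A sketch of one way to list those patients, assuming the data set is called iron and the identifier variable is called id (both hypothetical names):

```sas
/* Assumes a data set named iron with variables id, sex and menopause */
proc print data=iron;
    where sex = 2 and menopause ne 0 and menopause ne 1;
    var id sex menopause;          /* show only the columns needed */
    title "Females with missing or invalid menopause status";
run;
```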

In this code we used the where sex=2 condition again, but we've added some extra conditions so that only the IDs with missing data are reported. The extra code menopause ne 0 and menopause ne 1 tells SAS to keep only females (sex=2) whose menopause value does not equal (ne) 0 AND does not equal 1. Because 0 and 1 are the only valid codes, this isolates the records with missing (or otherwise invalid) data.

Being able to easily present descriptive statistics is one of the strengths of all statistical packages. In SAS the MEANS procedure provides descriptive statistics for variables across all observations and within groups of observations. The code is proc means. If we want to present the number of observations, number of missing observations, and the minimum and maximum of the variables agebase, agefup, ferrbase, ferrfup, tsatbase, and tsatfup, the code would be as follows:
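A minimal sketch of this request, assuming the data set is called iron (a hypothetical name). The keywords n, nmiss, min and max ask for the number of observations, the number missing, the minimum and the maximum respectively.

```sas
/* Assumes a data set named iron containing the six variables listed */
proc means data=iron n nmiss min max;
    var agebase agefup ferrbase ferrfup tsatbase tsatfup;
run;
```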

In addition to those shown above, the proc means statement outputs a number of descriptive statistics using the following key words: