3. Excel background: data management and best practices
Excel is probably the commonest software for data entry, and probably also for data analysis. There are many other more appropriate software available that are far more appropriate for data analysis – Excel is not designed for data analysis, and you will probably outstrip its capabilities very quickly. We suggest that you browse the “Software” section of this site for more information – with a little effort you’ll be up and running in no time. That being said, Excel is a simple platform for data entry and storage in many cases. It’s not without its issues, but there are some things you should avoid, and other practices you should institute in your data management routine. Some of these practices are concerned with preventing errors – this is not a small issue, as highlighted by a recent paper showing that “the proportion of published articles with Excel files containing gene lists that are affected by gene name errors is 19.6 %.” (http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7). This was caused by Excel “recognising” some gene symbols as dates and floating-point numbers. As a result, the finding of these publications should be called into question. Importantly, many of the affected articles were in very highly regarded journals.
Some of these practices listed below revolve around keeping the data in a form that makes it easy to import your data into other, more appropriate, software for your analyses and/or for the clear interpretation by colleagues with whom you will share your data. This is important, as it’s likely that you will need to get your data into other software for analysis in any case.
Rules for using Excel for data management:
- ONLY enter data. As a minimum, do any analyses in a separate worksheet of the same spreadsheet that you entered the data in. On a related note, NEVER EVER copy/cut and paste the data for use elsewhere in the spreadsheet/workbook/worksheet. Refer to the cell(s) containing the original data.
- TRY to use variable names (ie. column headings) that are no longer than 8 characters. Furthermore, DO NOT use variable names that contain spaces, start with a number, or include slashes or commas. If you need spaces to make the title clear (eg. “Date of birth”) either use underscores (“Date_of_birth”), or capitalisation to indicate new words (eg. “DateOfBirth”) or clear abbreviations (eg. “dob” or “DOB”).
- ALWAYS use the same spelling/format for entering data. For example, don’t use “1990 DEC 22” and “22.12.90”, “22.12.1990”, “22/12/90” and “22/12/1990”. Similarly, don’t use “Female”, ‘female”, “F” and “f”. In the case of categorical variables, it’s often best to use numbers rather than text as it will make reformatting the data easier later on. It’s also best to start with “0” rather than “1”, as this makes regression type analyses easier to interpret. For example, you might use “0” for healthy volunteer, “1” for Rheumatoid arthritis patient and “2” for Osteoarthritis patient. Choose ONE format and stick to it.
- DO NOT use colour coding to classify data. It might look pretty, but other software that you might use to analyse your data cannot access this colour coding, and neither can excel for labelling plots/graphs or tables. Use a new column with a coded number to indicate the classification – see point 3.
- DO NOT use comments (eg. like in Word or PowerPoint documents) under any circumstances. Other software that you might use to analyse your data cannot access these comments, and neither can excel for labelling plots/graphs or tables. If you do feel the need to include “notes” for a given subject for example, then use a “notes” column with an entry in the appropriate row (also refer to point 7). Some people prefer no text in the spreadsheet other than the column headings (see point 2). In this case you will want to keep a companion spreadsheet that clearly relates these notes to the relevant data items/subject. Others just remove this “notes” column during import in to the analysis software.
- DO NOT round off values if they are calculated – you will only compound the rounding errors if you use this new calculated value in a subsequent calculation. In any case, see the comment “Only enter data”.
- When blank cells are acceptable in your data (eg. a missing sample), consider some other descriptive value, such as “na” “NA”, “.”or even a number that is incompatible with the nature of the missing data (eg. “-1” in a weight field). DO NOT use “0” as this is a number and not a missing value. This is somewhat controversial, as many other software that you may use to analyse your data can recognise empty cells appropriately. Whatever your choice BE CONSISTENT.
- DO NOT mix data types within a column (eg. dates/numbers/text).
- DO NOT have “islands” of data spread out over a spreadsheet. Similarly, DO NOT leave entire empty rows or columns.
- DO NOT merge cells in your spreadsheet. This seriously interferes with the row and column formatting of your data making importing into other software very problematic.
See this spreadsheet <Data Storage examples.xlsx> for some examples of bad and good practices. Pretend that each table is actually a separate spreadsheet with the data starting in the uppermost left hand of the spreadsheet (or else I’d be violating rules 1 and 9!). The first two examples on the left are acceptable, while the others are not – see how many of the rules above you can spot being violated.
Other things that are useful if you must use Excel to explore your data.
Exploring your data (“data checkout”) prior to analysing it is a vital aspect of research. Its importance cannot be overstated, and the consequences of not doing so are high. Mistakes in data entry can lead to incorrect conclusions, which at a very minimum will result in an embarrassing correction to your published paper. While other software can provide almost limitless facilities for data checkout, Excel can be useful. However, there are a few tips below for commonly used data checkout techniques which you might find useful to ensure things go well.
- Conditional formatting
This is a very useful tool that allows you to identify potential errors in your data. Essentially you apply rules to cells which change the colour of the entry based upon the content of the cell. This tool is available on the Home” tab of the ribbon. For example you might like to have cells with values > 250 highlighted with a red fill for the column containing weight (in kg). This can also be expanded to specify ranges of values.
An alternative method to prevent errors in data entry is to use the Data Validation function. This tool prevents entering data which is outside of a pre-specified range, and even prompts the user with the appropriate information to be entered. Highlight the column (or row, or group of cells) and choose the Data Validation tool from the Data tab of the ribbon. Then enter the appropriate type of data, its range and even an “input message” that you require for that data item.
- Sorting data
The Sort tool in Excel is very useful to allow you sort data, based upon values (eg. highest to lowest) in a specific column. This is a potentially very dangerous tool though: if done incorrectly it will only sort the data in a single column, rather than the entire data set being re-ordered to maintain the integrity across rows. To avoid this, only click on the heading of a single column when sorting data. If you highlight the entire column, then only that column will be re-ordered, thus breaking the connection with the columns alongside it – which would be bad as now your dataset is broken! One way to make sure that the entire dataset has been re-ordered safely is to highlight a few entire rows at random and fill them with a solid colour prior to sorting the dataset. After sorting the rows should remain highlighted in a consistent horizontal row. If things didn’t go as planned, then the coloured row will be split up and distributed throughout the dataset. Have a play with the spreadsheet <link>: highlight the rosw for ID 4 and 9; click on the weight heading and click on the “AZ->” button of the sort tool in the Data tab of the ribbon. Then undo this sorting (Edit/Undo). Then select the entire weight column and sort again. Excel will come up with a prompt asking you if you mean to do this. Click yes. Now you will see the data has been broken.
- Filtering data
The Sort tool in Excel is very useful to allow you view only some of the data, based upon values in a specific column, or combination of columns. For example you might only with to view the data for male subjects in the spreadsheet <Data Storage examples.xlsx>. In this case click on the any heading cell, then click on the “filter” button on the Data tab of the ribbon. Then click on the newly presented arrow on the heading “Gender”, and make sure only the “M” or “1” option is ticked in the left hand most datasets.
Obviously there are a rang of opinions, but the above tips are generally very common advice. The following links might be useful reading for you, although at times might contradict some of the advice above. The important thing is that you are consistent and methodical!