Biostatistics Data Files Setup

Preparing an Excel Data File for Statistical Analysis

This page is designed to provide guidance on setting up a data file. It may help you avoid common pitfalls and save you and your statistician considerable time.

Data file setup

In general, data is set up in a rectangular array (often recorded in an Excel spreadsheet). Columns represent variables which are different pieces of information; rows represent cases (e.g., data from an individual patient or data from a hospitalization). The first (or top) row should provide variable names for each of the columns, while the first column(s) should contain the values of ID variable(s) for each case. The term cell refers to a particular combination of row and column.

What to include in the file: In general we recommend that you include an ID variable such as MRN or other unique identifier as the first column in the file. This makes it easier to identify and correct errors (such as negative length of stay) and makes it possible to combine different data files with information on the same subjects (provided the ID variable has the same values for a given subject across files). If you have both a name and a numeric ID, place them in separate columns. If you have multiple spreadsheets containing data on the same individuals, include their identifier(s) in each spreadsheet.

If in doubt about whether to include a variable, you might choose to place it in the file; it takes more time to add additional information to the file later. Only include the raw, un-summarized data.

Variable Names

Place the variable names in the first row. Be sure that the names follow these rules:

  • Names must start with a letter.
  • Names may only include letters, numbers, and underscores (_); however, variable names should not start with an underscore. Avoid using blanks or special characters (e.g., % $ # @ ! + * " ).
  • Keep names to a minimal length (just long enough to let you remember what the variable is). A name with no more than 10 characters would be optimal.
  • Be sure that variable names are unique.
  • Make sure that variable names appear in the first row only; statistics programs do not accommodate names that take up multiple rows.

Include a variable catalog with your data.

Data Entered

There are three general data types: numeric, character, and date. For a given column, stick to the same data type. Do not include comments in the same field as a numeric value; you can always have a separate comment column.

Only include one value per cell. For example, use separate fields for systolic and diastolic blood pressure (as opposed to 120/80 in the same field).

Numeric values: Record original values when possible; the computer can be used to create a grouped version. For grouping variables, the possible values need to be mutually exclusive and exhaustive (i.e., each subject has exactly one response); numeric values are simpler. For questions which have more than one possible response (race, side effects, comorbidities), have a separate yes/no column for each (common) answer. Medical history is best handled by separate yes/no columns for each diagnosis of specific interest. For example, include a yes/no column for diabetes, a separate yes/no column for hypertension, etc. For yes/no variables, you do want to be able to distinguish a value of "no" from "unknown" or "not determined".

Dates: You can enter month, date and year in a single field with the values separated by slashes (e.g. 06/14/2013). We recommend using two digits for month and day and four digits for year (e.g. 05/04/2013 rather than 5/4/13). Do not include time in this column if possible.

Time: Use military time (midnight is 00:00, noon is 12:00, 11 p.m. is 23:00). If both date and time are recorded, keep them in separate columns if possible.

Character data: Keep in mind that the computer is very literal. The computer treats "Y", "y", and "yes" as three different values. Character variables should have one consistent entry for values that mean the same thing, with both the spelling and capitalization constant for that entry within that column.

Missing data: Using "NA", "?", etc. for missing data will change a numeric variable into a character variable. We recommend using blanks for missing or unknown values. Please check with a statistician if you have any questions about entering missing data.

Version given to statistician

Save the spreadsheet with values only, not with formulas.

Statistical programs will not recognize underlining, boldface, colors or italics.

What not to include: Blank rows or columns. Extraneous data such as row or column totals, graphs, comments, etc. should be deleted. You can copy only the raw data onto a new worksheet and save that. Multiple repetitions of variable names within the file should be removed.

Information on variables (explanations of numeric codes, treatment of missing data, etc.) can be contained in a Word document (preferred), a separate sheet within your Excel file, or in the first line in your Excel file (in which case the variable names appear in row 2 and the data starts in row 3).