2.3 Building Reproducible Reports

Reproducibility is a key component of all research done in Institutional Research. The odds are very high that someone will ask for the same report or analysis again in the future, so it is critically important that you are able to re-run each report. This takes a concerted effort and requires following a few best practices.

2.3.1 Random Seeds

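Always set your random seed at the top of every program so that anything relying on random numbers, such as sampling or simulation, returns the same result on every run. When in doubt, set the seed.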
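A minimal sketch in R of what setting the seed buys you. The seed value 2018 and the object names are arbitrary choices for illustration, not values prescribed by this guide.

```r
# Set the seed once, at the top of the script (2018 is an arbitrary value).
set.seed(2018)
first_draw <- sample(1:100, size = 5)

# Resetting the same seed reproduces exactly the same random draw.
set.seed(2018)
second_draw <- sample(1:100, size = 5)

identical(first_draw, second_draw)  # TRUE
```

Any fixed integer works as a seed; what matters is that the value is written in the code rather than left to chance.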

In SAS, for example:

seed = 1;

2.3.2 Clear References and Paths

In a Windows environment with shared network drives, not everyone maps the same drive letters to the same locations. Refer to files by their full network path rather than by a mapped drive letter or a location on your own machine. Adding descriptive information to the file name, such as the date or the data provider, also makes references clearer.

An example of this is below:

GOOD

DATAFILE = \\admin2\InstRes\survey\2018_results.xlsx

BAD

J:\survey\2018_results.xlsx
C:\Users\DEWITTME\myfiles.xlsx
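For R users, a short sketch of how the full network path from the GOOD example might be written in code. Backslashes must be doubled inside R strings, so forward slashes are often easier to read; this assumes a Windows machine that can reach the share, and the object names are hypothetical.

```r
# The network path from the GOOD example, written two equivalent ways.
# Forward slashes avoid the need to escape every backslash.
data_file <- "//admin2/InstRes/survey/2018_results.xlsx"
data_file_escaped <- "\\\\admin2\\InstRes\\survey\\2018_results.xlsx"

# Converting backslashes to forward slashes shows both name the same file.
normalized <- gsub("\\", "/", data_file_escaped, fixed = TRUE)
identical(normalized, data_file)  # TRUE
```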

If you are using R, use R projects, which allow you to use relative paths. The here and fs packages are also recommended because they keep file paths consistent across platforms.5
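A minimal sketch of building paths with these packages, assuming both are installed. The data/survey folder layout is hypothetical and only illustrates the idea.

```r
library(here)  # resolves paths from the project root (e.g. the folder holding the .Rproj file)
library(fs)    # builds paths with consistent separators on every platform

# Hypothetical location of the survey file inside the project.
survey_rel <- path("data", "survey", "2018_results.xlsx")

# The same location, resolved against the project root.
survey_file <- here("data", "survey", "2018_results.xlsx")
```

Because both calls assemble the path from its parts, the same script runs unchanged on a colleague's machine as long as the project folder keeps the same internal layout.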

2.3.3 Clean Data Programmatically

Sometimes you need to clean the data: something isn’t formatted correctly, or a few items are missing. Do not ever change the raw data! Raw data represent what you were provided and should not include any cleaning from you. All data cleaning should take place in code, never directly in the raw files. If someone new picks up your analysis, they won’t be able to replicate it if you made changes in the raw data. Write a program to do what you would have done in Excel. It may take longer at first, but it will result in a better product. If you find that you must manipulate the data in Excel, save a copy of the raw data and make your changes in the copy. Add a tab to the Excel file that documents the manipulations that were made. This should be used only as a last resort and is not the preferred method.
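The workflow can be sketched in base R as follows. The folder names, file names, and the tiny made-up data set are all hypothetical; a temporary directory stands in for a real project so that the example is self-contained.

```r
# Hypothetical folder layout; a temporary directory stands in for the project.
project_dir <- tempfile("project_")
raw_dir <- file.path(project_dir, "data-raw")
clean_dir <- file.path(project_dir, "data-clean")
dir.create(raw_dir, recursive = TRUE)
dir.create(clean_dir, recursive = TRUE)

# Made-up raw file standing in for what a data provider sent.
raw_file <- file.path(raw_dir, "2018_results.csv")
writeLines(c("id,score", "1, 4", "2,N/A", "3,5 "), raw_file)
raw_checksum <- tools::md5sum(raw_file)

# Clean in code: read everything as text, then fix the formatting problems.
raw <- read.csv(raw_file, colClasses = "character", strip.white = TRUE)
clean <- raw
clean$id <- as.integer(clean$id)
clean$score[clean$score == "N/A"] <- NA   # recode the provider's missing marker
clean$score <- as.integer(clean$score)

# Save the result to a separate file; the raw file is never overwritten.
write.csv(clean, file.path(clean_dir, "2018_results_clean.csv"), row.names = FALSE)

# The raw file's checksum is unchanged after cleaning.
identical(raw_checksum, tools::md5sum(raw_file))  # TRUE
```

Every cleaning decision is now recorded in the script, so anyone who receives the same raw file can regenerate the cleaned one.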

Figure 2.10: Changing Raw Data

2.3.4 Shut it Down, Run it Again

The best way to verify that you have produced a reproducible analysis is to try to reproduce it yourself. In R, restart your R session and run all of your scripts. If you have a makefile, you can type make clean and then make into the console to rerun all of your programs.6 In SAS, this can be as simple as closing the program, restarting SAS, and rerunning the program. Do not skip these steps. Re-running your analysis ensures that you can get the same results should you resume or update the analysis in the future. It also ensures that nothing depends on objects left over in your session that your code cannot recreate. There is nothing more frustrating than not being able to reproduce an analysis on demand.
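One way to approximate "shut it down, run it again" without leaving your editor is to run a script in a fresh R process. The sketch below is an illustration only: the two-line script is hypothetical, and run_fresh is a helper name invented for this example.

```r
# Hypothetical analysis script; a temporary file stands in for a project script.
script <- tempfile(fileext = ".R")
writeLines(c(
  "set.seed(1)",
  "cat(mean(rnorm(10)))"
), script)

# Run the script in a brand-new R process, so nothing in the current
# session's workspace can leak into the results.
rscript <- file.path(R.home("bin"), "Rscript")
run_fresh <- function(path) {
  system2(rscript, c("--vanilla", shQuote(path)), stdout = TRUE)
}

first_run <- run_fresh(script)
second_run <- run_fresh(script)
identical(first_run, second_run)  # TRUE
```

If the two runs disagree, or the script fails in the fresh process, the analysis depends on something that is not captured in the code.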


  5. The usualsuspects package template uses these functions extensively.

  6. We will talk more about makefiles in subsequent sections.