2.3 Building Reproducible Reports
Reproducibility is a key component of all of the research done in Institutional Research. The odds are very high that someone will ask for the same report or analysis in the future. As such it is critically important that you are able to re-run each report. This takes a concerted effort and requires that you follow a few best practices.
2.3.1 Random Seeds
Always set your random seed at the top of every program. When in doubt, set the seed.
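As a minimal sketch in R, the seed can be fixed with `set.seed()` before any step that draws random numbers. The seed value and the object name `sample_ids` are arbitrary choices made for illustration.

```r
# The seed value itself is arbitrary; what matters is that it is fixed
# and recorded at the top of the program.
set.seed(2018)

# Any random draw after this point is repeatable from run to run.
sample_ids <- sample(1:100, size = 5)
```

Re-running the script from the top with the same seed returns the same five values.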
In SAS:
seed = 1;
2.3.2 Clear References and Paths
In a Windows environment with common network drives, not everyone uses the same drive mapping scheme. As such, it is important to reference files by their full network path rather than by a mapped drive letter or a local user folder. Additionally, adding descriptive information to the file name, such as the date or the data provider, will also help keep references clear.
An example of this is below:
GOOD
DATAFILE = \\admin2\InstRes\survey\2018_results.xlsx
BAD
J:\survey\2018_results.xlsx
C:\Users\DEWITTME\myfiles.xlsx
If you are using R, it is recommended that you use R projects.
When R projects are used, relative paths can be used.
The here and fs packages are also recommended to ensure that file paths are consistent between platforms.[5]
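A relative path built with the here package might look like the sketch below. It assumes the here package is installed and that the script lives inside an R project. The folder names `data` and `survey` are hypothetical.

```r
# Assumes the 'here' package is installed and that this script is part of
# an R project; here() resolves paths from the project root.
library(here)

# Hypothetical project layout: <project root>/data/survey/2018_results.xlsx
survey_path <- here("data", "survey", "2018_results.xlsx")
```

Because the path is assembled from its parts, the same line works regardless of where the project folder sits on a given machine.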
2.3.3 Clean Data Programmatically
Sometimes you need to clean the data: something isn’t formatted correctly, or a few items are missing data. Do not ever change the raw data! The raw data files should never be altered. Raw data represent what you were provided and should not include any cleaning from you. Any data cleaning activities should take place in code and not directly in the raw data. If someone new picks up your analysis, they won’t be able to replicate it if you made changes in the raw data. Write a program to do what you would have done in Excel. It may take you longer at first, but it will result in a better product. If you find that you must manipulate the data in Excel, save a copy of the raw data and make your changes in the copy. Add a tab to the Excel file that documents which manipulations have been done. This should be used only as a last resort and is not the preferred method.
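The idea can be sketched in base R as follows. The file names, column names, and cleaning rules are all hypothetical, and the raw file is created in a temporary folder only so that the example runs on its own.

```r
# Stand-in for a raw file you were given (hypothetical contents).
raw_path <- file.path(tempdir(), "survey_raw.csv")
writeLines(
  c("id,college,score", "1, Arts ,4", "2,Science,NA", "3,arts,5"),
  raw_path
)

# Read the raw data; the file on disk is never edited.
raw <- read.csv(raw_path, stringsAsFactors = FALSE)

# Every cleaning step lives in code, so it can be reviewed and re-run.
clean <- raw
clean$college <- tolower(trimws(clean$college))  # standardize labels
clean <- clean[!is.na(clean$score), ]            # drop rows missing a score

# Write the cleaned data to a separate file, leaving the raw file as provided.
clean_path <- file.path(tempdir(), "survey_clean.csv")
write.csv(clean, clean_path, row.names = FALSE)
```

Anyone who picks up the analysis can run the script against the untouched raw file and arrive at the same cleaned data.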
2.3.4 Shut it Down, Run it Again
The best way to verify that you have produced a reproducible analysis is to try to reproduce it yourself.
In R this can be done by restarting your R session and running all of your scripts. If you have a makefile, you could type make clean and then make into the console to rerun all of your programs.[6]
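One way to approximate "restart and rerun" without a makefile is to run each script in its own fresh R process. The sketch below is an assumption about how that could be done in base R, not a prescribed workflow. The helper name `rerun_all` and the `R` folder in the usage comment are hypothetical.

```r
# Run each script in a new, clean R process so that nothing left over in
# the current session can leak into the results.
rerun_all <- function(scripts) {
  rscript <- file.path(R.home("bin"), "Rscript")
  for (script in scripts) {
    # --vanilla skips saved workspaces and user profiles.
    status <- system2(rscript, c("--vanilla", shQuote(script)))
    if (status != 0) {
      stop("Script failed in a clean session: ", script)
    }
  }
  invisible(TRUE)
}

# Hypothetical usage: rerun every script in the project's R folder, in name order.
# rerun_all(sort(list.files("R", pattern = "\\.R$", full.names = TRUE)))
```

If any script depends on an object that only exists in your interactive session, it fails here, which is exactly the problem this check is meant to surface.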
In SAS this can be as simple as closing the program, restarting SAS, and rerunning the program.
It is important that these steps are completed.
Re-running your analysis ensures that you can get the same results should you resume or update this analysis in the future.
Additionally, it ensures that you don’t have any objects floating in space that cannot be reproduced.
There is nothing more frustrating than not being able to reproduce an analysis on demand.
[5] The usualsuspects package template uses these functions extensively.
[6] We will talk more about makefiles in subsequent sections.