Office of Institutional Research Data Scientist's Handbook
Preface
Welcome
Prerequisites
Our Tool Set
R
SAS
Power BI
PuTTY
Notepad++
Adobe Cloud
Others
Some Notes About the Text
Software information and conventions
About the Authors
Michael DeWitt
1 Introduction
2 Good Programming Practices
2.1 Why Good Programming Practices Are Important
2.2 Our Approach
2.2.1 Each Project Contains a README
2.2.2 Each Script Starts with a Title Block
2.2.3 Include a Purpose Statement
2.2.4 Use ISO8601 Date Formats
2.2.5 Calling Libraries
2.2.6 Naming Variables
2.2.7 SAS and SQL Function Key Words
2.2.8 Structures and Commenting
2.2.9 User Generated Functions or Macros
2.2.10 Data Validation and Unit Checks
2.2.11 Defensive Programming
2.2.12 Build Modular Programs
2.3 Building Reproducible Reports
2.3.1 Random Seeds
2.3.2 Clear References and Paths
2.3.3 Clean Data Programmatically
2.3.4 Shut it Down, Run it Again
2.4 Available Templates
2.4.1 SAS
2.4.2 R
2.5 makefiles
2.5.1 projects
2.5.2 packages
2.5.3 websites
I Data Sources
3 Data At Wake Forest University
3.1 Systems
3.1.1 Admissions Records
3.1.2 Student Records and Financial Aid
3.1.3 Student Activity Participation
3.1.4 HR and Finance
3.1.5 Housing
3.1.6 Learning Management System (LMS)
3.1.7 Job Placement
3.1.8 Academic and Residential Badge Use
3.1.9 Dining Hall Swipes
3.2 Getting Access
3.3 Accessing Data
3.4 Thoughts and Considerations
3.5 External Data Sources
3.5.1 IPEDS Data Center
3.5.2 NCES Elementary and Secondary School Information
3.5.3 National Student Clearinghouse
3.5.4 US Census
4 Longitudinal Student Data Set
4.1 Introduction to the LSDS
4.2 General Guide
4.2.1 Input Files
4.2.2 Output Files
4.3 Deep Dive
4.4 Admissions
4.4.1 Admitted Students
4.4.2 First Generation
4.4.3 Institutional Committee
4.4.4 High School GPA
4.5 Office of the Dean of the College
4.5.1 Office of Academic Advising
4.5.2 Writing Center
4.5.3 Pre-Orientation Participation
4.5.4 Magnolia Scholars
4.5.5 Undergraduate Research Participation
4.6 Registrar
4.6.1 AP/IB Credits
4.6.2 Course History
4.6.3 Degree Completion
4.6.4 Continuing Enrollment Status
4.6.5 First Year Seminar/Writing 111 Scores
4.7 Student Financial Aid
4.7.1 Admitted as Athlete
4.7.2 Financial Aid Data
4.8 Institutional Research
4.8.1 First Year Cohort Tag
4.8.2 Census Enrollment
4.8.3 GRE Scores
4.9 Finance
4.9.1 Deposits
4.9.2 First Payments
4.10 Campus Life
4.10.1 Club Participation
4.10.2 Disciplinary Records
4.10.3 Greek Participation
4.10.4 Greek Registration
4.10.5 Student Housing
4.10.6 Intramural Participation
4.10.7 Living and Learning Communities
4.11 Human Resources
4.11.1 Student Employees
4.12 Advancement
4.12.1 Donations
4.13 Office of Personal and Career Development
4.13.1 First Destination
4.14 Campus Services
4.14.1 Swipes
4.15 Supplemental Programs
4.16 Combining the Data and Generating the LSDS
4.16.1 Combining
4.16.2 Calculating Retention
4.17 Database Formats
4.18 Running the Entire Update Procedure
4.19 Data Validation Procedures
4.20 Adding to the LSDS
5 Data Sharing Agreement
5.1 Introduction
5.2 Data Sharing Documents
5.3 Overview of the Process
5.4 Unit Records
5.5 Legacy of the Process
II Internal Packages
6 Internal Packages
6.1 About Each Package
6.2 Installation
6.2.1 From Source
6.2.2 From Gitlab
6.3 Updating and Modifying
7 irverse
7.1 Why the irverse?
8 wfutemplates
8.1 wfutemplates Document Templates
8.2 Course Website Template
8.3 Installation of wfutemplates
9 wfudata
9.1 Introduction
9.2 Seeing What’s Available
9.3 Accessing the Data
9.4 Writing to Disk
10 irtools
10.1 Introduction to irtools
10.1.1 Getting Started
10.1.2 Our Data
10.1.3 Fitting the Model
10.1.4 Summarising the Results
10.1.5 Displaying the Outputs
10.2 Introduction to compare_ap_ib
10.2.1 Getting Started
10.2.2 Prepping the data
10.2.3 All That’s Fit to Print
10.3 Introduction to compare_chi_square2
10.3.1 When to Use it
10.3.2 Running the Function
10.3.3 Basic Arguments
10.3.4 Advanced Arguments
10.3.5 Alert
10.3.6 Some Examples
10.3.7 The Log
10.3.8 Rinse and Repeat
10.4 Effect Size Definitions
10.4.1 Cohen’s Effect Sizes
10.4.2 Kraft’s Effect Sizes
10.4.3 Summary of irtools
11 usualsuspects
11.1 Introduction to usualsuspects
III Practice
12 Usual Suspects Workflow
12.1 Getting Started with the Usual Suspects
12.2 Directory Structure
12.3 Making it Happen
12.3.1 Create a Project in a New Directory
12.3.2 Make New R Markdown Document from Template
12.3.3 Generating the Remainder of the Usual Suspects Project Template
12.4 Now Do Your Analysis
12.4.1 Import/Munge
12.4.2 Check the Analysis
12.4.3 Verify the Parameters in the R Markdown Document
12.4.4 Then make
13 Updating and Maintaining the Internal IR Website
13.1 Introduction
13.2 Requirements
13.3 How It Works
13.3.1 Location of Files
13.4 Explanation of Files
13.4.1 _site.yml
13.4.2 robots.txt
13.4.3 footer.html
13.4.4 .htaccess
13.4.5 .gitignore
13.4.6 .Rmd files
13.4.7 sub-folders
13.4.8 makefile
13.5 Updating the Website
13.5.1 Update an Existing Page
13.5.2 Create a New Page
13.5.3 Modifying the Site Structure
13.5.4 Setting Security
14 Retention Modeling Method
14.1 The Major Project Locations
14.2 Key Considerations
14.2.1 The Rare Event
14.2.2 Small N
14.2.3 Small Effects and Heterogeneous Effects
14.3 Data
14.4 Setting up testing architecture
14.5 Variable Selection
14.6 Frequentist Modeling
14.6.1 Data Pre-Treatment
14.6.2 Additional Techniques
14.6.3 Running the Models
14.7 Bayesian Modeling Approaches
14.7.1 LOO and Model Averaging
14.8 Model Scoring
14.9 Timing
14.10 Documentation
15 Power BI
15.1 Introduction to Power BI
15.2 Version Control
15.3 Themes
15.4 Add Percent Calculation
15.5 Data Privacy and Masking Small Cell Sizes
16 Completed Projects
16.1 Works Completed
17 Definitions
17.1 Higher Education Specifics
17.2 Wake Forest Data Definitions
17.3 Wake Forest Abbreviations
17.4 Wake Forest Administrative Groups
18 Final Words
References
Wake Forest University
Data Scientist’s Handbook
5.1 Introduction