14.4 Setting up testing architecture

Before you begin modeling, divide your data into three parts rather than fitting your model to the entire data set. The goal of an algorithm is to predict new cases, not to perfectly predict the data it has already seen, and you can only judge that if some data has been kept out of the model fitting. The three parts are:

Training Data – This is yours to model and explore freely.
Testing Data – This data is used to evaluate and tweak your model as you train it.
Validation Data – This is the final hold-out data on which you test your best model. It gives you an idea of how the model will perform “in the field”, that is, in the presence of data it has never seen. Use this data only once.

In the case of student retention modeling, I typically like to hold the most recent year back as my validation data. This allows me to test the model on a new cohort of students. For the previous years, it is up to you how to divide the data into testing and training sets. If you use the “blocking” approach, you can divide the testing and training data by cohort as well and add a fixed effect for the year (this can help train the model to detect trends by year and can help with the assumptions that come with pooling over years). Alternatively, you could randomly sample from the remaining data.
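The cohort-based hold-out described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the field name `cohort_year`, the function names `hold_out_latest_cohort` and `split_by_cohort`, and the sample records are all hypothetical and chosen only for the example.

```python
def hold_out_latest_cohort(records, year_key="cohort_year"):
    """Reserve the most recent cohort as validation data.

    Returns (earlier_records, validation_records). The field name
    `cohort_year` is an assumption made for this sketch.
    """
    latest = max(r[year_key] for r in records)
    earlier = [r for r in records if r[year_key] < latest]
    validation = [r for r in records if r[year_key] == latest]
    return earlier, validation


def split_by_cohort(earlier, test_years, year_key="cohort_year"):
    """Blocking approach: assign whole cohorts to testing or training."""
    test_years = set(test_years)
    training = [r for r in earlier if r[year_key] not in test_years]
    testing = [r for r in earlier if r[year_key] in test_years]
    return training, testing


# Hypothetical student records, for illustration only.
students = [
    {"id": 1, "cohort_year": 2019, "retained": 1},
    {"id": 2, "cohort_year": 2019, "retained": 0},
    {"id": 3, "cohort_year": 2020, "retained": 1},
    {"id": 4, "cohort_year": 2020, "retained": 1},
    {"id": 5, "cohort_year": 2021, "retained": 0},
    {"id": 6, "cohort_year": 2021, "retained": 1},
]

# The 2021 cohort becomes the validation data.
earlier, validation = hold_out_latest_cohort(students)

# Of the earlier cohorts, 2020 is used for testing and 2019 for training.
training, testing = split_by_cohort(earlier, test_years=[2020])
```

Each record keeps its `cohort_year` value, so the year remains available as a fixed effect when the model is fit.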

Typically, a 70/20/10 split is used. However, this depends on what makes sense in your situation. If you do not have enough data to make this split, you can do without the validation data and use cross-validation more extensively when training models (especially leave-one-out cross-validation, or LOO). In Bayesian modeling, LOO is preferred.
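A random 70/20/10 split and a leave-one-out loop can be sketched with the Python standard library alone. The function names `three_way_split` and `leave_one_out` are hypothetical. The example assumes the three fractions map to training, testing, and validation in that order, and that a fixed seed is wanted for reproducibility. In practice a modeling library would usually supply these utilities.

```python
import random


def three_way_split(rows, fractions=(0.7, 0.2, 0.1), seed=42):
    """Randomly split rows into training, testing, and validation lists.

    The order of `fractions` (training, testing, validation) is an
    assumption made for this sketch.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_test = round(n * fractions[1])
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return training, testing, validation


def leave_one_out(rows):
    """Yield (fit_rows, held_out_row) pairs, one pair per observation."""
    for i, held_out in enumerate(rows):
        yield rows[:i] + rows[i + 1:], held_out


# Placeholder data: 100 row identifiers standing in for real records.
rows = list(range(100))
training, testing, validation = three_way_split(rows)

# With very little data, skip the validation set and refit the model
# once per observation, scoring it on the single held-out row each time.
folds = list(leave_one_out(rows[:5]))
```

Leave-one-out fits the model as many times as there are observations, so the loop above becomes expensive on large data sets; that cost is the trade-off for using almost all of the data in every fit.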