14.6 Frequentist Modeling

Student retention is typically a classification problem: a series of Bernoulli trials in which each student has a single trial that results in either a success (staying at Wake Forest) or a failure (leaving Wake Forest). From this you would like to assign each student a probability of retention (or attrition) and use that probability to decide where to intervene.

It is important to choose the correct objective function for your algorithm. By default, most algorithms seek to reduce the Mean Squared Error (MSE) for continuous outcomes, and most classification algorithms optimize for accuracy. You must be cognizant of which metric you are optimizing, and of whether accuracy is truly the best thing to train your model to optimize.

Some options include:

MSE
Accuracy (TP + TN)/(Population)
Precision (TP)/(TP + FP)
Recall (TP)/(TP + FN)

For retention, where attrition is a rare event, recall is often the better metric: the cost of the intervention is low and the benefit of identifying students at high risk is great.
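To make the difference between these metrics concrete, here is a minimal sketch in plain Python (not the caret workflow used elsewhere in this chapter). It assumes the positive class is attrition, so a true positive is a student who was flagged and did leave. The cohort sizes and both models are hypothetical numbers chosen for illustration, not results from real data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical cohort: 1000 students, 100 of whom leave (positive = leaves).
# Model A predicts that every student stays.
always_stay = classification_metrics(tp=0, fp=0, tn=900, fn=100)

# Model B flags 200 students and catches 80 of the 100 leavers.
flagging = classification_metrics(tp=80, fp=120, tn=780, fn=20)

print(always_stay)  # accuracy 0.9, precision 0.0, recall 0.0
print(flagging)     # accuracy 0.86, precision 0.4, recall 0.8
```

In this made-up example the model that never flags anyone has the higher accuracy yet identifies no at-risk students, while the flagging model gives up some accuracy to recover most of the leavers.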

14.6.1 Data Pre-Treatment

If you are dealing with a rare case scenario (the event occurs in fewer than 20% of cases) then you will need to adopt some different modeling strategies. One approach is data pre-treatment to increase the share of rare cases in your training data set. This can be approached through several methods:

Up-Sampling – Increase the ratio of rare cases by duplicating them at a higher rate.
Down-Sampling – Reduce the number of non-rare cases. This is generally not advisable because you discard information that the model could otherwise have used.
Synthetic Data – Through ROSE or SMOTE you can interpolate between rare cases and make data that “looks” like your rare cases.

None of these methods is free. By treating the data you may introduce more stability in your predictors than you really have. This will result in better-looking estimates than may truly exist, especially when dealing with high-dimensional data (a high number of predictors compared to the number of samples).
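The three pre-treatment methods above can be sketched in a few lines of plain Python. This is an illustration only: the function names, the toy student rows (a GPA and a credit load) and the labels are all hypothetical, and the interpolation step is a deliberately simplified stand-in for SMOTE (it pairs rare cases at random, whereas SMOTE interpolates towards nearest neighbours). It assumes at least two rare cases are present.

```python
import random


def up_sample(rows, labels, rare_label, seed=0):
    """Duplicate rare cases (with replacement) until the classes are balanced."""
    rng = random.Random(seed)
    rare = [r for r, y in zip(rows, labels) if y == rare_label]
    n_common = len(rows) - len(rare)
    extra = [rng.choice(rare) for _ in range(max(0, n_common - len(rare)))]
    return list(rows) + extra, list(labels) + [rare_label] * len(extra)


def down_sample(rows, labels, rare_label, seed=0):
    """Keep every rare case and an equally sized random subset of the rest."""
    rng = random.Random(seed)
    rare = [r for r, y in zip(rows, labels) if y == rare_label]
    common = [(r, y) for r, y in zip(rows, labels) if y != rare_label]
    kept = rng.sample(common, min(len(rare), len(common)))
    new_rows = rare + [r for r, _ in kept]
    new_labels = [rare_label] * len(rare) + [y for _, y in kept]
    return new_rows, new_labels


def interpolate_rare(rows, labels, rare_label, n_new, seed=0):
    """Simplified SMOTE-style synthesis: a random point between two rare cases."""
    rng = random.Random(seed)
    rare = [r for r, y in zip(rows, labels) if y == rare_label]
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(rare, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic


# Hypothetical training rows: [GPA, credit hours]; 2 of 10 students leave.
rows = [[3.6, 15], [3.1, 16], [2.9, 14], [3.8, 17], [3.3, 15],
        [3.5, 16], [3.0, 15], [3.7, 18], [2.1, 9], [2.4, 12]]
labels = ["stay"] * 8 + ["leave"] * 2

up_rows, up_labels = up_sample(rows, labels, "leave")
down_rows, down_labels = down_sample(rows, labels, "leave")
synthetic = interpolate_rare(rows, labels, "leave", n_new=3)

print(len(up_rows), up_labels.count("leave"))     # 16 rows, 8 of them "leave"
print(len(down_rows), down_labels.count("stay"))  # 4 rows, 2 of them "stay"
```

Whichever method you use, it is generally safer to treat only the training data and leave any held-out data untouched; otherwise duplicated or synthetic cases can inflate the estimates in exactly the way described above.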

14.6.2 Additional Techniques

Penalised regression can be used. In this case you penalise the algorithm more heavily when it guesses “wrong” and the student leaves than when it wrongly flags a student who stays. The size of the penalty can also be adjusted based on the case.
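One way to picture this asymmetric penalty is as a weighted log loss, sketched below in plain Python. This is an assumption about how the penalty might be expressed, not the specific method used in the src directory; the function name, the 5-to-1 cost ratio and the two sets of predictions are hypothetical.

```python
import math


def weighted_log_loss(y_true, p_leave, cost_missed_leaver=5.0, cost_false_alarm=1.0):
    """Average log loss with a heavier penalty for errors on students who leave.

    y_true:  1 if the student left, 0 if the student stayed.
    p_leave: predicted probability that the student leaves.
    """
    eps = 1e-12  # keep probabilities away from 0 and 1 so log() is defined
    total = 0.0
    for y, p in zip(y_true, p_leave):
        p = min(max(p, eps), 1 - eps)
        if y == 1:
            total += -cost_missed_leaver * math.log(p)
        else:
            total += -cost_false_alarm * math.log(1 - p)
    return total / len(y_true)


# Two hypothetical students: the first leaves, the second stays.
y_true = [1, 0]
underestimates_risk = [0.2, 0.2]  # badly wrong about the leaver
overestimates_risk = [0.8, 0.8]   # badly wrong about the stayer

# With equal costs the two sets of predictions are scored the same.
equal_a = weighted_log_loss(y_true, underestimates_risk, 1.0, 1.0)
equal_b = weighted_log_loss(y_true, overestimates_risk, 1.0, 1.0)

# With the asymmetric costs, missing the leaver is scored as much worse.
weighted_a = weighted_log_loss(y_true, underestimates_risk)
weighted_b = weighted_log_loss(y_true, overestimates_risk)

print(equal_a, equal_b)
print(weighted_a, weighted_b)
```

Under equal costs the two predictions are indistinguishable; once a missed leaver costs more than a false alarm, the loss favours the prediction that errs towards flagging students.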

14.6.3 Running the Models

After you have specified the data you plan to use for training, the type of model (e.g. binary classification), and the metric to optimize, you are ready to run the models. Typically it is relatively cheap to fit a variety of machine learning algorithms to your data.

As you will see in the src directory, there is a series of these different models trained on the training data. Typically, I use the caret package in R, as it lets me control many of the training settings. Additionally, the package has multi-core capability, so it scales well on the DEACCluster. Speaking of the DEACCluster, I strongly recommend that you get access to this tool. It allows you to submit your jobs and run many models at once without taxing your local machine.