14.2 Key Considerations
14.2.1 The Rare Event
The central problem with retention modeling at Wake Forest is the “rare event” problem: the event we are trying to model does not occur very often. For example, a 94% first-year retention rate means that we are looking for the 6% who leave, roughly 81 students out of a cohort of 1,350. Because of this, a model can predict a single class (most commonly, that every student will be retained) and still achieve a high level of accuracy. This is not the desired outcome, because such a model tells us nothing that we don’t already know.
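To make the accuracy trap concrete, here is a minimal sketch that scores a "model" which predicts that every student is retained. The cohort is hypothetical and is built only from the figures quoted above (1,350 students, 6% not retained); the function name and the label coding are assumptions chosen for illustration.

```python
# Hypothetical cohort built from the figures in the text:
# 1,350 first-year students, 94% retained, 6% not retained.
N_STUDENTS = 1350
N_DEPARTED = round(N_STUDENTS * 0.06)  # the rare event


def always_retained_baseline(n_students, n_departed):
    """Score a 'model' that predicts every student is retained.

    Labels: 1 = student left (rare event), 0 = student was retained.
    Returns (accuracy, recall on the students who left).
    """
    actual = [1] * n_departed + [0] * (n_students - n_departed)
    predicted = [0] * n_students  # never flags anyone as leaving

    correct = sum(a == p for a, p in zip(actual, predicted))
    caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

    accuracy = correct / n_students
    recall = caught / n_departed
    return accuracy, recall


accuracy, recall = always_retained_baseline(N_STUDENTS, N_DEPARTED)
print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
```

The baseline is 94% accurate yet identifies none of the students who leave, which is why accuracy alone is a poor yardstick here and a measure such as recall on the rare class is more informative.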
14.2.2 Small N
The other key consideration for data science and applied statistics at Wake Forest is the small number of people. If a cohort is only 1,350 students, and you don’t learn anything new about these students for an entire semester or an entire year, it is difficult to make inferences. Some of this can be addressed by moving to more rapidly generated data such as badge swipes and LMS engagement, but none of this analysis was completed during my time here. Regardless, the small N problem drives us to pool data across years; that is, First Year students from multiple cohorts are combined into a single data set. Pooling erroneously assumes that there were no changes in culture, on-campus programming, or the global climate (social, political, environmental) during those years. More detail on how to deal with some of these key considerations will be provided, but they should be kept in mind during each analysis.
14.2.3 Small Effects and Heterogeneous Effects
In general, small effect sizes are very common in educational settings. Detecting a small effect requires a large sample, and we have already discussed the limit on our sample size and the time it takes to acquire more samples. One should therefore temper expectations and realize that, while some statistically significant effects can be detected, our analyses will typically be under-powered. Conversely, a result that is not statistically significant may still reflect something that is making a difference.
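As a rough illustration of how quickly small effects outrun our sample sizes, the sketch below applies the standard normal-approximation formula for comparing two proportions. The two-point improvement in retention (94% to 96%), the 5% significance level, and the 80% power target are all assumptions chosen for illustration, and `n_per_group` is a hypothetical helper, not part of any existing analysis.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided test
    comparing two proportions (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


# Assumed scenario: an intervention lifts retention from 94% to 96%.
needed = n_per_group(0.94, 0.96)
print(f"students needed per group: {needed}")
```

Under these assumptions, the required number of students in each group is larger than an entire cohort of 1,350, before any students are even split into treated and untreated groups.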
People contain multitudes: they are complex and made up of all kinds of thoughts, opinions, experiences, and emotions. Engaging in intramural sports may foster a sense of belonging and improve retention for one group of students, those who score high on that latent trait, while for others playing intramural sports creates a sense of alienation. This underlines that there will be heterogeneous effects. Returning to intramural sports: even for students whose sense of belonging and social bonds grow through playing, breaking up with a significant other may completely erase those gains (and this information certainly isn’t captured in the data). With small Ns, the law of large numbers and asymptotics don’t necessarily work in our favor, which is why Bayesian modeling is quite attractive. Regardless, we must keep these constraints in mind as we move through any kind of analysis.
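One consequence of heterogeneous effects is that opposite-signed effects in different groups can cancel when everyone is analyzed together. The sketch below uses entirely hypothetical group shares and effect sizes to show a pooled effect of essentially zero, even though no individual student has a zero effect.

```python
# All numbers are hypothetical and chosen only for illustration.
# Each entry: (share of cohort, change in retention probability).
subgroups = {
    "sports build belonging": (0.60, +0.02),
    "sports create alienation": (0.40, -0.03),
}

# The pooled (average) effect is the share-weighted sum of group effects.
pooled_effect = sum(share * effect for share, effect in subgroups.values())

for name, (share, effect) in subgroups.items():
    print(f"{name}: share = {share:.0%}, effect = {effect:+.3f}")
print(f"pooled effect is effectively zero: {abs(pooled_effect) < 1e-12}")
```

A pooled analysis of a cohort like this one would report no effect of intramural sports, which is a reason to look within subgroups where the data allow it.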