Friday, November 22, 2024

Predicting My Students' SGPA: From OLS to Machine Learning

Summary: This article discusses machine learning algorithms alongside traditional econometric models. Using a classic classroom example, it suggests that a student of economics should keep both tools in the modelling toolkit.

While teaching econometrics, the fundamental challenge is choosing the right example or dataset to explain the subject and, at the same time, to convince students of the importance of being present in class. Most of the time, I end up using the classic example of grade points (GPA or SGPA) and how they are affected by attendance, IQ, internal marks, and so on. The example, and the method of proving to my students that these factors are crucial for getting a good grade, have remained the same for the past few years. The mighty ordinary least squares (OLS) regression always does its trick and shows that a student will get lower grades if they perform poorly in the internal exams or have low attendance. However, I have always questioned whether OLS is the best model. In most cases, the students are in their first year; I cannot teach them non-linear equations, time-varying state-space models, or any fancy model that might fit the data perfectly.

Figure 1: SGPA and Average Internal Marks of the students of the Department of Economics


An OLS model seems perfect for the data presented in Figure 1. However, the data show higher dispersion in specific ranges, such as 60 to 70 or 87 to 94, which is a classic sign of heteroskedasticity. One could remove these data points and label them outliers, but then the students would question my intentions. So removing data points or applying a complex model is not an option.
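Heteroskedasticity of this kind can be checked directly by comparing the residual spread of a straight-line fit across ranges of the regressor. A minimal sketch on synthetic data (the real class records are not reproduced here, and the wider-noise 60-70 band is an assumption built into the simulation):

```python
import numpy as np

# Synthetic stand-in for the class data: marks in 40-95, SGPA roughly linear
# in marks, with deliberately larger noise in the 60-70 band.
rng = np.random.default_rng(1)
marks = rng.uniform(40, 95, 200)
noise_sd = np.where((marks >= 60) & (marks <= 70), 1.5, 0.5)  # assumed pattern
sgpa = 0.08 * marks - 0.3 + rng.normal(0, 1, 200) * noise_sd

# Fit a straight line and inspect residual spread inside vs. outside the band
beta = np.polyfit(marks, sgpa, 1)
resid = sgpa - np.polyval(beta, marks)

band = (marks >= 60) & (marks <= 70)
print(f"residual SD in 60-70 band: {resid[band].std():.2f}")
print(f"residual SD elsewhere:     {resid[~band].std():.2f}")
```

A large gap between the two standard deviations is the informal symptom; formal tests such as Breusch-Pagan make the same comparison systematically.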

If a student with an average internal mark of 65 approaches me and wants to know the predicted SGPA, I can use OLS to show, based on the regression result in Table 1, that the student is expected to get an SGPA of 4.9, with a mean squared error (MSE) of 0.93 and R² = 0.81. However, as mentioned above, students in this cluster show higher variation, which means my prediction may be misleading.

Table 1: Simple OLS result of SGPA on Average Internal Marks

Figure 2: OLS prediction of SGPA for the student with 65 average internal marks
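The OLS prediction itself takes only a few lines. A minimal sketch on synthetic marks-SGPA data (the simulated coefficients below are illustrative assumptions, not the estimates reported in Table 1):

```python
import numpy as np

# Hypothetical (internal marks, SGPA) pairs standing in for the class data
rng = np.random.default_rng(0)
marks = rng.uniform(40, 95, 60)
sgpa = 0.08 * marks - 0.3 + rng.normal(0, 0.9, 60)  # assumed linear relation

# OLS via least squares on [1, marks]: solves min ||X b - y||^2
X = np.column_stack([np.ones_like(marks), marks])
beta, *_ = np.linalg.lstsq(X, sgpa, rcond=None)

# Point prediction for a student with average internal marks of 65
pred_65 = beta[0] + beta[1] * 65
print(f"intercept={beta[0]:.2f}, slope={beta[1]:.2f}, prediction at 65={pred_65:.2f}")
```

The fitted line gives a single point prediction at marks = 65; the wide scatter in that band is exactly why the point estimate alone can mislead.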


In the era of data analytics and machine learning, I should also use machine learning techniques to predict my students' SGPAs. One of the most basic methods is the K-Nearest Neighbours (KNN) algorithm. The idea is that we can predict a data point's behaviour by looking at its nearest neighbours. I used 20% of the data for testing and K = 8 nearest neighbours to predict the SGPA of the student with 65 average internal marks. The prediction changed to 5.23, with a mean squared error of 2.0 and R² = 0.49, as depicted in Figure 3. I changed the value of K several times, and the prediction remained above the OLS prediction.

Figure 3: K-Nearest Neighbours prediction of SGPA for a student with 65 average internal marks
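The KNN version of the exercise, again on synthetic data, uses the same 20% test split and K = 8 described above (here via scikit-learn's KNeighborsRegressor; all numbers are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Same kind of hypothetical data as before; the real class records are not public
rng = np.random.default_rng(0)
marks = rng.uniform(40, 95, 60).reshape(-1, 1)
sgpa = 0.08 * marks.ravel() - 0.3 + rng.normal(0, 0.9, 60)

# Hold out 20% for testing, as in the post, and fit K = 8 neighbours
X_tr, X_te, y_tr, y_te = train_test_split(marks, sgpa, test_size=0.2, random_state=0)
knn = KNeighborsRegressor(n_neighbors=8).fit(X_tr, y_tr)

# The KNN prediction at 65 is simply the average SGPA of the 8 nearest students
pred_65 = knn.predict([[65]])[0]
mse = mean_squared_error(y_te, knn.predict(X_te))
print(f"KNN prediction at 65 = {pred_65:.2f}, test MSE = {mse:.2f}")
```

Because KNN averages only the local neighbours, its prediction tracks the cluster around 65 rather than the global line, which is why it can sit above the OLS value.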


The data's clustering behaviour may still lead to wrong predictions. So I used the decision tree algorithm, which is more appropriate when neighbouring clusters display different patterns or the data have a more complex structure. Using a basic decision tree, I predicted that the student with an average internal mark of 65 might get an SGPA of 6.28, with a mean squared error of 2.03 and R² = 0.41, which is well above the OLS prediction (Figure 4).

Figure 4: Decision tree prediction of SGPA for a student with 65 average internal marks
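A decision-tree version on the same kind of synthetic data looks like this (the max_depth value is my assumption to keep the tree readable on a class-sized sample; it is not from the post):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical marks-SGPA data, as in the earlier sketches
rng = np.random.default_rng(0)
marks = rng.uniform(40, 95, 60).reshape(-1, 1)
sgpa = 0.08 * marks.ravel() - 0.3 + rng.normal(0, 0.9, 60)

# A shallow tree partitions the marks axis into intervals and predicts
# the mean SGPA within each leaf interval
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(marks, sgpa)
pred_65 = tree.predict([[65]])[0]
print(f"Decision-tree prediction at 65 = {pred_65:.2f}")
```

The tree's piecewise-constant fit is what lets it respond to local clusters that a single straight line smooths over.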

All three models have strengths and weaknesses; no one can claim that a single model is better in all situations. As the literature notes, there is always a trade-off between bias and variance. So the investigator should be careful when using these models for forecasting or predicting a variable. Although machine learning algorithms are popular, OLS is a powerful and simple technique with a solid theoretical foundation. The overall relationship between internal marks and SGPA, or attendance and SGPA, is positive and significant, as predicted by OLS. And remember: under the assumptions of the classical linear regression model, OLS is still BLUE (the Best Linear Unbiased Estimator).

Please note: don't take this post too seriously. Econometrics is just for fun. (All Python code used here is available from open sources.)


By

Dr. Akash Kumar Baikar

Assistant Professor, Department of Economics, SBSS, MRIIRS
