DESIGN OF AN IMPROVED MODEL USING XGBOOST, LIGHTGBM, AND LSTM FOR PREDICTING GRADUATION AND STRESS RATES IN COLLEGE STUDENTS


Mrs. M. M. Mohod, Dr. P. M. Jawandhiya

Abstract

Understanding and mitigating stress in college students is critical because stress affects both academic performance and general well-being. Most of the literature focuses on specific causes of stress and graduation rates, an approach that is not comprehensive enough to capture such multifaceted issues. Traditional models may obscure complex interactions among variables and fail to make full use of sequential data. This paper presents an integrated, multi-method framework for analyzing the independent variables that influence graduation and stress rates in colleges. The proposed model combines advanced feature engineering, robust machine learning algorithms, and sequence models to support detailed analysis and accurate prediction. The pipeline begins with automated feature engineering in Featuretools, followed by recursive feature elimination with cross-validation (RFECV). This combination automated the generation of new features and selected the most relevant ones, reducing the feature dimensionality from more than 100 raw features to 20-30 optimized ones and improving model accuracy or F1-score by 5-10%. Gradient Boosting Machines, including XGBoost and LightGBM, were then used because they are efficient and accurate on large data sets with complex feature interactions. They achieved a classification accuracy of 85-90% with an AUC-ROC of 0.88-0.92, indicating strong predictive capability. To improve performance further, a stacking ensemble with a meta-learner, such as Logistic Regression, combined the XGBoost, LightGBM, and Random Forest models. This increased accuracy by a further 3-5% and AUC-ROC by a further 0.02-0.05.
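The feature selection and stacking steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic (generated with scikit-learn's make_classification), the Featuretools step is omitted, and scikit-learn's GradientBoostingClassifier stands in for XGBoost and LightGBM so that the sketch depends on a single library. All sizes and hyperparameters are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student data set (assumption, not the authors' data).
X, y = make_classification(
    n_samples=400, n_features=40, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Recursive feature elimination with cross-validation: drop two features per
# round and keep the subset with the best cross-validated AUC-ROC.
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=2,
    cv=3,
    scoring="roc_auc",
    min_features_to_select=5,
)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Stacking: base learners feed out-of-fold predictions to a Logistic
# Regression meta-learner. GradientBoostingClassifier is a stand-in for
# XGBoost/LightGBM here.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gbm", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_train_sel, y_train)

auc = roc_auc_score(y_test, stack.predict_proba(X_test_sel)[:, 1])
print(f"kept {selector.n_features_} of {X.shape[1]} features, AUC-ROC = {auc:.3f}")
```

The selector is fitted on the training split only, so the held-out AUC-ROC is not influenced by feature selection on test data. Swapping in xgboost.XGBClassifier or lightgbm.LGBMClassifier as base estimators should be straightforward, since both expose a scikit-learn compatible interface.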
Long Short-Term Memory (LSTM) networks and Bidirectional LSTMs captured the temporal dependencies in student behavior, yielding an accuracy of 80-85% in predicting future stress levels and an RMSE of 0.15-0.2 for academic performance. For exploratory data analysis, the framework uses t-Distributed Stochastic Neighbor Embedding (t-SNE) and Principal Component Analysis (PCA) to visualize data structure and relationships. PCA explained 90-95% of the variance, while t-SNE clearly separated the clusters of stressed and non-stressed students. The work provides a robust framework for identifying stressed learners and for assessing the educational outcomes of targeted interventions in different use cases. The paper applies state-of-the-art machine learning and deep learning techniques in a comprehensive and practical manner to a highly relevant question in higher education.
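The "variance explained" figure quoted for PCA is the cumulative share of total variance carried by the leading principal components. The sketch below shows how such a figure can be computed with NumPy alone. The data are hypothetical (12 features driven by 3 latent factors plus noise), and the function name explained_variance_ratio is chosen here for illustration; none of this reproduces the authors' data or results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 students, 12 correlated features generated from
# 3 latent factors plus a small amount of noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 12))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 12))


def explained_variance_ratio(X):
    """Return the fraction of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)                  # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, in descending order
    var = s**2                               # proportional to component variances
    return var / var.sum()


ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches 90%.
k = int(np.searchsorted(cumulative, 0.90) + 1)
print(f"{k} components explain {cumulative[k - 1]:.1%} of the variance")
```

A cumulative ratio in the 90-95% range means that the retained components lose only 5-10% of the total variance, which is a common criterion for choosing how many components to keep before plotting or clustering.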
