A data-driven machine learning framework for predicting total agricultural food grain yield

Bibliographic Details
Published in: Results in Engineering
Main Authors: Nirmaladevi S, Jagatheswari S
Format: Article
Language: English
Published: Elsevier 2025-09-01
Online Access: http://www.sciencedirect.com/science/article/pii/S2590123025028543
Description
Summary: India's agricultural sector is experiencing significant shifts due to climate variability, evolving farming practices, and changing crop patterns. Previous studies of crop yield prediction with machine learning commonly adopt image-based techniques, and many relied on manual hyperparameter tuning and rarely applied these techniques to decision-tree-based algorithms, limiting reproducibility and optimization. To address this gap, our study presents a consistent, data-driven approach using automated hyperparameter configuration and a moderate-scale, publicly available dataset from the Reserve Bank of India (1967–2024), covering key crops such as Groundnut, Rapeseed, Mustard, Soybean, Coffee, Sugarcane, and Tea. Four ensemble models (Gradient Boost, XGBoost, LightGBM, and CatBoost) are used to predict total food grain yield. The model pipeline incorporates five-fold cross-validation, GridSearchCV for hyperparameter tuning, early stopping to prevent overfitting, and SHapley Additive exPlanations (SHAP) for interpretability. Model performance is evaluated using Mean Error (ME), Mean Absolute Error (MAE), Mean Percentage Error (MPE), Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and the Coefficient of Determination (R²). Gradient Boost achieved the highest accuracy, followed by CatBoost, XGBoost, and LightGBM. The results demonstrate the reliability and scalability of ensemble models and provide a strong foundation for future integration of geographic and climatic variables into yield forecasting systems.
ISSN: 2590-1230
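
The six evaluation metrics named in the summary can be illustrated with a minimal sketch. This is not the authors' code. The function name `regression_metrics`, the sign convention (error = actual minus predicted), and the sample values are assumptions made for illustration; the article itself may define ME and MPE with the opposite sign.

```python
import math


def regression_metrics(actual, predicted):
    """Compute ME, MAE, MPE, MSE, RMSE and R2 for paired observations.

    Assumption: error is defined as actual - predicted, and no actual
    value is zero (MPE divides by the actual values).
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    me = sum(errors) / n                            # signed bias
    mae = sum(abs(e) for e in errors) / n           # average magnitude
    mpe = 100.0 * sum(e / a for e, a in zip(errors, actual)) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R2 compares the residual sum of squares with the variance of the
    # actual values around their mean; it is undefined for constant data.
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = mse * n
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"ME": me, "MAE": mae, "MPE": mpe,
            "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical yield values, not taken from the article's dataset.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```

ME and MPE keep the sign of the errors, so over- and under-predictions can cancel. That is why they are usually read alongside MAE and RMSE, which measure error magnitude regardless of direction.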