Assessing the performance of different variations of ensembled tree models in chlorophyll concentration prediction
Abstract
Severe microalgal blooms harm both human life and the aquatic ecosystems in which they occur. Chlorophyll concentration is a common parameter used to predict microalgal blooms. In this study, several tree-based ensemble models, namely random forest, extra trees, and the gradient boosting frameworks XGBoost and LightGBM, were implemented to predict chlorophyll concentration in running water. The models were compared on predictive performance, measured by the normalized root mean square error (NRMSE), and on average computing time. The hyperparameters of each model were tuned using random search. Random forest achieved an NRMSE of 0.75 with a computing time of 16.95 ms, XGBoost an NRMSE of 0.71 in 8.28 ms, LightGBM an NRMSE of 0.63 in 2.81 ms, and extra trees an NRMSE of 0.72 in 17.15 ms. Both gradient boosting frameworks outperformed random forest and extra trees, and LightGBM performed best in both predictive performance and computing time. These results identify a faster alternative to the random forest baseline, with similar or better accuracy, for predicting chlorophyll concentration.
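The two comparison criteria named in the abstract, NRMSE and average computing time, can be illustrated with a minimal sketch. The abstract does not state how the RMSE is normalized or which operation is timed, so everything below is an assumption for illustration: the function names `nrmse` and `average_time_ms` are hypothetical, the normalization options (standard deviation, range, or mean of the observed values) are common conventions rather than the study's stated choice, and the timing helper simply averages repeated calls of any callable.

```python
import math
import statistics
import time


def nrmse(y_true, y_pred, normalization="std"):
    """RMSE divided by a scale of the observed values.

    Assumption: the study's normalization is not stated, so three
    common conventions are offered ("std", "range", "mean").
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    rmse = math.sqrt(mse)
    if normalization == "std":
        scale = statistics.pstdev(y_true)      # population standard deviation
    elif normalization == "range":
        scale = max(y_true) - min(y_true)      # observed range
    elif normalization == "mean":
        scale = statistics.fmean(y_true)       # observed mean
    else:
        raise ValueError("normalization must be 'std', 'range', or 'mean'")
    if scale == 0:
        raise ValueError("cannot normalize: scale of observed values is zero")
    return rmse / scale


def average_time_ms(func, inputs, repeats=100):
    """Average wall-clock time of func(inputs) in milliseconds.

    Assumption: func stands in for whichever step is being timed
    (for example, a fitted model's prediction call).
    """
    start = time.perf_counter()
    for _ in range(repeats):
        func(inputs)
    return (time.perf_counter() - start) * 1000.0 / repeats


# Illustrative values only; these are not data from the study.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]
score = nrmse(observed, predicted, normalization="std")
elapsed = average_time_ms(lambda xs: [x * 2.0 for x in xs], observed, repeats=10)
```

A lower NRMSE indicates a better fit under any of these conventions, but values computed with different normalizations are not comparable with each other, so the same convention would need to be applied to every model in a comparison.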