Benzene Pollution Forecasting by Recurrent Neural Networks Tuned with Adapted Elk Herd Optimizer
Abstract:
Benzene is a toxic airborne contaminant and a recognized cancer-causing agent that presents substantial health hazards even at minimal concentrations. The precise prediction of benzene concentrations is crucial for reducing exposure, guiding public health strategies, and ensuring adherence to environmental regulations. Because of benzene's high volatility and prevalence in metropolitan and industrial areas, its atmospheric levels can vary swiftly, influenced by factors like vehicular exhaust, weather patterns, and manufacturing processes. Predictive models, especially those driven by machine learning algorithms and real-time data streams, serve as effective instruments for estimating benzene concentrations with notable precision. This research emphasizes the use of recurrent neural networks (RNNs) for this objective, acknowledging that careful selection and calibration of model hyperparameters are critical for optimal performance. Accordingly, this paper introduces a customized version of the elk herd optimization algorithm, employed to fine-tune RNNs and improve their overall efficiency. The proposed system was tested using real-world air quality datasets and demonstrated promising results for predicting benzene levels in the atmosphere.
1. Introduction
As a widespread environmental problem, atmospheric contamination represents a persistent and alarming danger to many industrialized and emerging countries around the world. Air contamination can be defined as the presence or mixture of chemical, physical, or biological substances that progressively alter the composition of the air. Predominantly triggered by human-related activities, air pollution arises from the combustion of fossil fuels in power plants, automobiles, and residential heating systems, as well as from natural phenomena such as wildfires and volcanic eruptions [1]. Numerous investigations have examined the consequences that atmospheric pollution imposes on human society, notably adverse health effects on individuals, including respiratory ailments, cardiovascular conditions, pulmonary disorders, and early mortality [2], [3]. There are also economic and social repercussions driven by escalating pollution levels. In Shanghai, for instance, declining property values have resulted in significant setbacks within the housing sector [4]. Some research further identifies a connection between population movements and the intensity of pollution. The authors of [5] studied patterns during China’s Lunar New Year celebrations, when the Shanghai population can shrink by up to 60%, leading to a decrease in food preparation and transportation activities and, in turn, to a measurable reduction in specific airborne contaminants. Environmental contamination also exerts a profound influence on natural ecosystems, most notably through its contribution to global warming. Changing climatic conditions, such as increasing global temperatures, can drastically harm and reshape habitats, potentially causing species extinction, melting polar ice, and elevated ocean levels [2].
Persistent air contamination further damages vegetation by altering leaf structures [6], while fauna suffers either through direct contact or indirectly through the consumption of polluted food sources.
Airborne particulate matter is commonly classified into two primary groups, PM10 and PM2.5, referring to particles with diameters smaller than 10 and 2.5 micrometers, respectively. The pollutants most frequently monitored include sulfur dioxide ($\mathrm{SO}_2$), nitrogen oxides (NOx), carbon monoxide (CO), and ozone ($\mathrm{O}_3$) [7]. Although benzene ($\mathrm{C}_6\mathrm{H}_6$) is not classified as particulate matter, it exists alongside these particles as an extremely hazardous atmospheric contaminant and a widely recognized cancer-causing substance. Benzene is a volatile organic compound (VOC) that can rapidly vaporize into the atmosphere and is commonly linked to pollution in both indoor and outdoor settings. The main contributors to benzene emissions include automobile exhaust, particularly from gasoline-powered engines, discharges from petroleum refineries and chemical manufacturing plants, burning of fossil fuels like coal and oil, and natural events such as wildfires. Consequently, developing the ability to forecast the magnitude and temporal occurrence of benzene-related air contamination is of vital significance.
Artificial intelligence techniques, particularly machine learning (ML), are increasingly applied in atmospheric pollution prediction due to their capacity to capture intricate, nonlinear dependencies among environmental factors. Using extensive data sets, including meteorological conditions, vehicular flow, industrial operations and historical contamination records, frameworks such as random forests, support vector machines, and advanced neural networks can effectively estimate concentrations of pollutants like PM2.5, $NO_2$, and benzene. These predictive models consistently surpass conventional statistical approaches, providing timely analyses and early alerts that assist in safeguarding public health and shaping environmental regulations. However, as described by Wolpert's no free lunch theorem [8], no singular ML methodology guarantees optimal results in all forecast scenarios.
An additional challenge arises from the heavy dependence of ML model performance on the selection of hyperparameter settings. Precise calibration of these parameters is therefore essential for achieving reliable predictions. Hyperparameter tuning is widely regarded as an NP-hard optimization problem, for which exhaustive deterministic search is impractical, so probabilistic or stochastic optimization strategies are typically used instead.
In this research, an atmospheric pollution prediction system was introduced, built on recurrent neural networks (RNNs) [9]. These architectures are specifically designed to process time series air quality data and estimate benzene contamination levels by identifying hidden sequential trends. To further enhance the resilience and precision of the model, a customized form of a swarm intelligence metaheuristic was incorporated into the proposed system, namely the elk herd optimization algorithm (EHO) [10]. This refined EHO was used to automatically adjust the hyperparameters of the RNN, so that the network is adapted to the pollution prediction task while maintaining high prediction accuracy.
This study was consequently motivated by three main goals:
Creation of an upgraded version of the EHO algorithm, explicitly designed to surpass the limitations of the original approach and fine-tuned for tackling the benzene pollution forecasting problem.
Design of an RNN-driven framework capable of capturing the complex temporal interdependencies associated with air contamination, enabling dependable predictions of the benzene level.
Integration of the modified EHO mechanism into the RNN-based prediction model to execute hyperparameter tuning, with the aim of achieving superior performance tailored to the specific forecasting challenge.
The remainder of this paper is organized as follows. Section 2 provides an overview of pertinent studies on machine learning applications in air quality monitoring. In addition, it elaborates on hyperparameter tuning techniques and the RNN framework, emphasizing its function in processing sequential atmospheric datasets. Section 3 explains the fundamental principles of the EHO algorithm and introduces a modified variant of this metaheuristic approach. Subsequently, Section 4 details the experimental configuration, while Section 5 showcases the obtained outcomes. Finally, Section 6 reflects on the significance of the results and outlines potential avenues for future investigations.
2. Related Works
Beyond their harmful impacts on human well-being, benzene and its related aromatic compounds are highly reactive substances and serve as primary precursors in the formation of secondary organic aerosols (SOA) and ground-level ozone within the atmosphere. Photochemical transformations involving BTEX (benzene, toluene, ethylbenzene and xylene) are influenced by sunlight exposure and the availability of oxidative agents such as nitrogen oxides and various transient radicals like hydroxyl (OH), alkyl peroxides and hydrogen peroxide radicals [11], [12].
The complex and non-linear behavior of BTEX compounds and the seasonal fluctuations in their gas-particle phase transitions necessitate an interdisciplinary research perspective and sophisticated computational modeling approaches [13]. These advanced tools facilitate the exploration of environmental interconnections, broaden existing scientific understanding, and establish a foundation for future sustainable development. The authors of [14] utilized multiple linear regression to estimate benzene concentrations based on independent variables such as other air pollutants and meteorological parameters, while the study in [15] applied Bayesian hierarchical models to investigate benzene exposure linked to petrochemical industries, relating industrial discharges to pollution incidents and regional mortality disparities.
In general, researchers have used various methodologies, including extreme gradient boosting (XGBoost), Generalized AutoRegressive Conditional Heteroskedasticity (GARCH), artificial neural networks, and the light gradient boosting machine (LightGBM) to forecast concentrations of volatile organic compounds (VOCs), particulate matter (PM), polycyclic aromatic hydrocarbons (PAHs) or haze occurrences (e.g., [16], [17], [18], [19], [20], [21]).
Machine learning techniques, designed to associate pollutant behavior with surrounding environmental factors, require customization for each specific scenario (dataset), a process inherently categorized as a nondeterministic polynomial-hard (NP-hard) problem. Performing this procedure manually is labor intensive and time-consuming. Furthermore, NP-hard tasks cannot be efficiently solved using conventional deterministic methods, as they would require excessive computational resources and impractical timeframes. In contrast, probabilistic algorithms—particularly those based on swarm intelligence metaheuristics—are capable of identifying near-optimal solutions within acceptable time limits.
Consequently, several studies have explored the utilization of metaheuristic strategies to refine and train artificial intelligence models or statistical regression techniques, in order to improve predictive performance and uncover deeper insights into the impacts of air quality on human well-being [22], [23], [24], [25].
Prominent examples of successful hyperparameter calibration using metaheuristic algorithms span various sectors. These approaches have been implemented in healthcare [26], intelligent energy systems [27], software engineering [28], precision agriculture [29] and opinion mining [30]. The cybersecurity domain has also leveraged these techniques for tasks such as intrusion detection [31], insider threat identification [32], along with other applications like meteorological predictions [33] and ecology [34], [35], [36].
In this study, an enhanced version of the EHO algorithm was combined with the RNN architecture, targeting the fine-tuning of RNN hyperparameters to improve the precision and adaptability of benzene concentration forecasting.
RNNs [9] represent a subset of artificial neural models specifically engineered to process sequential datasets. Unlike conventional feedforward architectures, RNNs incorporate recurrent connections that enable the retention of prior information, allowing the model to capture time-based dependencies. In numerous domains, such as natural language modeling and text generation, speech recognition and voice assistants, sentiment analysis of textual documents, and temporal data prediction, RNNs have proven capable of learning intricate patterns that change over time. Practical implementations include financial modeling and stock market forecasting, weather prediction, tracking of patients' vital signs in the medical domain, video and music analysis, and anomaly detection.
Within this framework, the RNN is structured to handle benzene records by preserving a hidden state that stores contextual knowledge from earlier time steps. At every moment $t$, the network processes an input $X_t$ and refreshes its hidden state $h_t$ by integrating the current input and the preceding hidden state $h_{t-1}$. This transition is governed by a nonlinear activation mechanism, specifically the hyperbolic tangent function ($\tanh$). The update operation is mathematically expressed as follows:
$$h_t = \tanh\left(W_{xh} X_t + W_{hh} h_{t-1} + b_h\right) \qquad (1)$$
In this context, $W_{xh}$ denotes the weight matrix linked to the input data, $W_{hh}$ represents the recurrent weight matrix associated with the hidden layer and $b_h$ is the bias component. This recursive structure enables the network to retain information from earlier inputs, which is vital for identifying patterns in benzene concentrations that may extend across several time intervals.
After updating the hidden state, the network computes the output $Y_t$ by applying a linear transformation to the hidden representation:
$$Y_t = W_{hy} h_t \qquad (2)$$
where, $W_{hy}$ signifies the weight matrix that projects the hidden state into the output domain. The streamlined nature of this model ensures efficient training while preserving the ability to capture critical temporal characteristics embedded within the benzene pollution sequences. While modern studies often incorporate attention mechanisms to further boost predictive accuracy, this section concentrates on the core RNN architecture, which serves as the foundation for the prediction system.
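The hidden-state update and linear readout described above can be illustrated with a minimal sketch. It assumes a single recurrent layer, a zero initial hidden state, and toy weight values chosen only for demonstration; the helper names `matvec` and `rnn_forward` are hypothetical and do not come from the study's implementation.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_xh, W_hh, b_h, W_hy):
    """Run a single-layer tanh RNN over a sequence and return every output Y_t."""
    h = [0.0] * len(b_h)  # assumption: the hidden state starts at zero
    ys = []
    for x_t in xs:
        pre = [a + b + c for a, b, c in zip(matvec(W_xh, x_t), matvec(W_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]  # hidden-state update with tanh
        ys.append(matvec(W_hy, h))       # linear readout from the hidden state
    return ys

# Toy example: 2 input features, 3 hidden units, 1 output, 4 time steps.
W_xh = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
W_hh = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1], [0.0, 0.0, 0.1]]
b_h = [0.0, 0.1, -0.1]
W_hy = [[1.0, -1.0, 0.5]]
sequence = [[0.2, 0.4], [0.1, 0.3], [0.0, 0.5], [0.3, 0.3]]
outputs = rnn_forward(sequence, W_xh, W_hh, b_h, W_hy)
```

In practice a deep learning library would supply the recurrent layer and train the weights; the sketch only makes the data flow between $X_t$, $h_{t-1}$, $h_t$ and $Y_t$ explicit.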
3. Methods
This section initially presents the fundamental version of the EHO metaheuristic algorithm. It then highlights the limitations of the standard EHO approach and introduces an enhanced modification, which was subsequently utilized in the conducted experimental studies.
The elk herd optimization (EHO) algorithm [10] is a relatively new swarm-based computational strategy inspired by the reproductive behavior observed in elk populations. The methodology is modeled on two core mating phases: rutting and calving.
During the rutting stage, the herd is fragmented into several family groups of varying sizes. This segregation occurs as dominant male elks compete to form groups containing multiple harems. In the calving stage, offspring are produced from the most dominant bulls and their respective harems. The original EHO model incorporates a control variable $B_r$ denoting the initial bull proportion within the herd.
The algorithmic procedure starts by generating the initial elk herd, structured as a population of individual bulls and harems. This population is mathematically represented as a matrix defined in Eq. (3).
where $N$ in the matrix size $n \times N$ denotes the population size.
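Eq. (4) illustrates the generation of each individual solution $x^j$.
$$x^j = lb + (ub - lb) \times U(0,1) \qquad (4)$$
where $ub$ represents the upper boundary and $lb$ denotes the lower boundary of the search space, and $U(0,1)$ is a uniformly distributed random number drawn independently for each component.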
The elk population is then arranged in ascending order based on their fitness scores. Family groups are formed according to the starting bull ratio $B_r$. The total number of families is determined by $B = |B_r \times N|$. The fitness assessment is used to select the top-performing males from the population. Individuals with the highest fitness values are designated bulls within $B$. These selected bulls then compete to form their harems.
Harem allocation is performed using a roulette-wheel selection strategy, where bulls in $B$ are assigned harems proportionally based on their fitness relative to the total fitness sum. Each bull $x^j$ is assigned a selection probability $p_j$ calculated from its absolute fitness $f(x^j)$ as shown in Eq. (5).
$$p_j = \frac{f(x^j)}{\sum_{k=1}^{B} f(x^k)} \qquad (5)$$
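The initialization, bull selection, and roulette-wheel allocation described above can be sketched as follows. This is an illustrative reading rather than the reference implementation: the function names are hypothetical, lower fitness is treated as better (as with an error measure), the number of bulls is obtained by flooring with at least one bull kept, and fitness values are assumed strictly positive so that the probabilities are well defined.

```python
import random

def init_herd(N, lb, ub, rng):
    """Sample N solutions uniformly inside the box defined by lb and ub."""
    return [[l + (u - l) * rng.random() for l, u in zip(lb, ub)] for _ in range(N)]

def select_bulls(herd, fitness, B_r):
    """Sort indices by fitness (lower is better) and take the B best as bulls."""
    order = sorted(range(len(herd)), key=lambda j: fitness[j])
    B = max(1, int(B_r * len(herd)))  # assumption: floor, but never zero bulls
    return order[:B], order[B:]

def assign_harems(bulls, others, fitness, rng):
    """Roulette wheel: each remaining elk joins bull j with probability p_j."""
    total = sum(fitness[j] for j in bulls)
    probs = [fitness[j] / total for j in bulls]  # fitness share of each bull
    families = {j: [] for j in bulls}
    for i in others:
        chosen = rng.choices(bulls, weights=probs, k=1)[0]
        families[chosen].append(i)
    return families, probs
```

Because the probability is proportional to the raw fitness value, a minimization setting may call for an inverted or rank-based weight so that better bulls attract larger harems; the text does not specify this, so the sketch keeps the proportional form.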
During the calving phase, the offspring of each family is represented as $x^j_i(t+1)$, produced by inheriting traits from both the father bull $x^{h_j}$ and the mother harem $x^j_i(t)$. If the calf $x_i(t+1)$ shares the same index $i$ as the father, its generation follows Eq. (6).
where, $\alpha$ is a random coefficient within $[ 0,1]$, which influences the degree of inherited characteristics of a randomly selected elk $x^k(t)$. Higher $\alpha$ values increase the randomness in the offspring, promoting diversity.
Alternatively, if the calf shares the index with the mother, the offspring $x_i(t+1)$ is generated using both the mother $x^j$ and the corresponding father $x^{h_j}$ as formulated in Eq. (7).
where, $x_i^j(t+1)$ represents the $i$-th component of calf $j$ in generation $t+1$, $h_j$ identifies the father bull of the $j$-th harem, $r$ is the index of a randomly chosen bull from the population.
In line with natural elk behavior, there exists a possibility that the harem female mates with another bull if the dominant bull fails to guard her adequately. The parameters $\gamma$ and $\beta$ are random values within $[ 0,2]$ that determine the proportional influence of the father and random bull on the traits of the offspring.
At the end of each iteration, bulls, harems, and newborn calves are combined in all families. The population is then reclassified by fitness and the top performing individuals are preserved for the subsequent generation.
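One plausible reading of the calving and selection steps is sketched below. The exact update formulas are not reproduced in this text, so the form used here is an assumption: a calf that shares its index with a bull moves toward a randomly selected elk by a factor $\alpha \in [0,1]$, while a calf that shares its index with a mother moves toward its father bull and toward a randomly chosen bull, each scaled by a random coefficient in $[0,2]$. Which of $\beta$ and $\gamma$ scales which term is not fixed by the description, so the code uses the neutral names `c_father` and `c_random`. All function names are hypothetical.

```python
import random

def calve(parent, father, random_elk, parent_is_bull, rng):
    """Create one calf; the update form is an assumed reading of the calving phase."""
    if parent_is_bull:
        alpha = rng.uniform(0.0, 1.0)  # larger alpha pulls more from the random elk
        return [x + alpha * (k - x) for x, k in zip(parent, random_elk)]
    c_father = rng.uniform(0.0, 2.0)   # influence of the father bull
    c_random = rng.uniform(0.0, 2.0)   # influence of a randomly chosen bull
    return [x + c_father * (b - x) + c_random * (r - x)
            for x, b, r in zip(parent, father, random_elk)]

def survivors(population, calves, fitness_fn, N):
    """Merge parents and calves, then keep the N fittest (lower is better)."""
    merged = population + calves
    return sorted(merged, key=fitness_fn)[:N]
```

Calling `survivors` re-evaluates every individual for simplicity; an actual implementation would normally cache fitness values to avoid repeated objective function evaluations.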
Although EHO is a relatively recent optimization technique that has demonstrated strong performance in various fields, there is room for improvement. Specifically, both the exploration and the exploitation phases of the original EHO framework present opportunities for refinement. The main weakness of the algorithm is its susceptibility to premature convergence: in some runs EHO converges too quickly to a local optimum, which is particularly pronounced in high-dimensional search domains. This may happen either due to insufficient exploration power in the early phases or due to overly aggressive exploitation of attractive areas of the search space. Another weakness is slow convergence in some runs, as EHO tries to avoid local optima, especially if the population lacks diversity. To address these obstacles, the current study introduces a dynamic modification of the algorithm designed to improve both components of the search process.
During the initial phase of execution (the first $T/2$ iterations), the focus is on intensifying the exploration. The individual with the lowest fitness score (that is, the weakest candidate solution) is replaced by a newly generated individual created through a hybridization process. This new solution is formed by applying the uniform crossover technique - borrowed from the genetic algorithm (GA) methodology [37] - to a randomly selected pair of individuals from the population.
In the last half of the optimization process (the final $T/2$), the algorithm shifts its emphasis to exploitation. In this stage, the poorest performing individual is replaced with a hybrid offspring generated by combining the elite (best performing) individual with a randomly chosen member of the population, again using the crossover operation. This improved version of the EHO algorithm is termed adaptive EHO (AEHO), and its detailed procedure is presented in Algorithm 1.
Algorithm 1. AEHO metaheuristics pseudo-code
Produce starting population P of N random individuals
t = 0
while (t < T) do
    for (every elk in P) do
        Utilize original EHO search process
    end for
    Arrange individuals in P with respect to their fitness scores
    if (t < T/2) then
        Replace the poorest elk within P by a hybrid between a pair of arbitrary individuals, utilizing crossover mechanism [37].
    else
        Replace the poorest elk within P by a hybrid between the best elk and an arbitrary elk, utilizing crossover mechanism [37].
    end if
    t = t + 1
end while
return Elk with the best fitness score in P
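The two-phase replacement rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: fitness is minimized, uniform crossover picks each component from either parent with probability 0.5, and the names `uniform_crossover` and `replace_worst` are hypothetical.

```python
import random

def uniform_crossover(parent_a, parent_b, rng):
    """Take each component from one of the two parents with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def replace_worst(population, fitness, t, T, rng):
    """Overwrite the worst individual with a hybrid, following the two-phase rule."""
    order = sorted(range(len(population)), key=lambda i: fitness[i])  # lower is better
    best, worst = order[0], order[-1]
    if t < T / 2:
        # First half of the run: hybrid of two randomly chosen individuals.
        i, j = rng.sample(range(len(population)), 2)
    else:
        # Second half of the run: hybrid of the best individual and a random one.
        i, j = best, rng.randrange(len(population))
    population[worst] = uniform_crossover(population[i], population[j], rng)
    return worst  # index of the slot that was replaced
```

The sketch leaves fitness bookkeeping for the replaced slot to the caller, since how the hybrid's score is obtained within the iteration budget is not detailed here.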
The complexity of metaheuristics techniques in terms of fitness function evaluations (FFEs) is a principal metric for measuring their computational efficiency. FFEs represent the count of evaluations of the objective function in a single run. It allows a platform-independent way to perform side by side comparisons of metaheuristics approaches. Considering that this modification does not introduce additional FFEs, which are typically the most computationally intensive operations in metaheuristic algorithms, the proposed AEHO maintains the same computational complexity as the standard EHO in terms of FFEs.
4. Experimental Setup
This research utilized an openly accessible dataset from Kaggle to validate the proposed benzene forecasting framework. The dataset comprises over 9300 entries of hourly mean readings gathered from an array of five metal-oxide gas sensors integrated into an Air Quality Chemical Multisensor module [38]. This instrument was positioned outdoors at street level in a heavily contaminated urban zone within an Italian municipality. Recordings were captured between March 2004 and February 2005, spanning an entire year, and represent the most extensive publicly accessible time series of in-situ air quality chemical sensor outputs. Confirmed ground truth data includes hourly averaged levels of carbon monoxide (CO), non-methane hydrocarbons (NMHC), benzene, total nitrogen oxides (NOx) and nitrogen dioxide ($NO_2$). Benzene ($C_6H_6$) was set as the target in this study. The dataset was split into 70%/10%/20% segments used for training, validation and testing, respectively, as outlined in Figure 1.
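A 70%/10%/20% partition of a time series can be sketched as below. The sketch assumes the split is chronological, that is, the records are kept in time order and not shuffled, which is the usual choice for forecasting but is an assumption here; the function name `chronological_split` is hypothetical.

```python
def chronological_split(series, train=0.7, val=0.1):
    """Split a time-ordered sequence into train, validation and test segments."""
    n = len(series)
    n_train = round(n * train)
    n_val = round(n * val)
    train_part = series[:n_train]
    val_part = series[n_train:n_train + n_val]
    test_part = series[n_train + n_val:]  # the remainder, about 20% by default
    return train_part, val_part, test_part

# Toy example with 100 hourly readings.
readings = list(range(100))
train_set, val_set, test_set = chronological_split(readings)
```

Keeping the test segment at the end of the series means the model is always evaluated on observations that come after everything it was trained on.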

The capabilities of the proposed AEHO algorithm were compared with a set of well-established metaheuristics, including the baseline EHO [10], GA [37], particle swarm optimization (PSO) [39], bat algorithm (BA) [40] and COLSHADE [41]. Competing algorithms were implemented independently in Python, employing the standard configuration settings for their control variables as recommended by their original developers. Every assessed optimizer was allotted a population of 6 candidate solutions ($N=6$) and permitted 8 iterations ($T=8$) to carry out the optimization process. Because metaheuristic methods are stochastic, experiments were conducted over 30 independent executions. All considered algorithms were tasked with improving model performance via hyperparameter tuning. Table 1 lists the optimized RNN configuration variables with their corresponding search intervals.
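The performance of the generated RNN structures was evaluated using a standard set of regression KPIs [42], as outlined with Eqs. (8)-(11), including RMSE, MAE, MSE and $R^2$.
$$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(c_i - \hat{c}_i\right)^2} \qquad (8)$$
$$MAE = \frac{1}{m}\sum_{i=1}^{m}\left|c_i - \hat{c}_i\right| \qquad (9)$$
$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(c_i - \hat{c}_i\right)^2 \qquad (10)$$
$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(c_i - \hat{c}_i\right)^2}{\sum_{i=1}^{m}\left(c_i - \bar{c}\right)^2} \qquad (11)$$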
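A supplementary metric, known as the index of agreement (IoA) [43], was likewise monitored across the experiments, as it offers a more comprehensive insight into the RNN's performance. The IoA is calculated according to Eq. (12).
$$IoA = 1 - \frac{\sum_{i=1}^{m}\left(c_i - \hat{c}_i\right)^2}{\sum_{i=1}^{m}\left(\left|\hat{c}_i - \bar{c}\right| + \left|c_i - \bar{c}\right|\right)^2} \qquad (12)$$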
Within the provided equations, $c_{i}$ and $\hat{c}_{i}$ correspond to the actual and predicted values of the $i$-th sample, $\bar{c}$ denotes the mean of the actual values, and $m$ denotes the number of samples. Throughout the conducted experiments, MSE served as the fitness function to be minimized, while $R^2$ served as the indicator function.
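The evaluation metrics can be computed with a short helper, shown here as a sketch using the standard textbook definitions of MSE, RMSE, MAE, $R^2$ and the index of agreement. The function name `regression_metrics` is hypothetical, and the sketch assumes the actual values are not all identical, so that the denominators are non-zero.

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE, R2 and IoA for paired actual/predicted values."""
    m = len(actual)
    mean_c = sum(actual) / m
    sq_err = sum((c - p) ** 2 for c, p in zip(actual, predicted))
    mse = sq_err / m
    mae = sum(abs(c - p) for c, p in zip(actual, predicted)) / m
    ss_tot = sum((c - mean_c) ** 2 for c in actual)  # total sum of squares
    r2 = 1.0 - sq_err / ss_tot
    ioa_den = sum((abs(p - mean_c) + abs(c - mean_c)) ** 2
                  for c, p in zip(actual, predicted))
    ioa = 1.0 - sq_err / ioa_den
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2, "IoA": ioa}

# Toy example: one of three predictions is off by 1.
scores = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

In an optimization loop, the `"MSE"` entry would play the role of the fitness value, with the remaining entries logged for reporting.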
Table 1. Search intervals of the optimized RNN hyperparameters
Bound | Learning Rate | Dropout | Epoch Number | Cells within Layer | Count of Layers |
Min | 0.0001 | 0.05 | 100 | 100 | 1 |
Max | 0.0100 | 0.20 | 300 | 250 | 2 |
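A metaheuristic typically works on a real-valued vector, which then has to be mapped onto concrete hyperparameter values. The sketch below shows one way to do this for the intervals in the table above. It assumes candidates are encoded in $[0,1]^5$ and that epochs, cells and layers are rounded to integers; the encoding, the dictionary keys and the function name `decode` are illustrative assumptions, not details taken from the study.

```python
# Search intervals copied from the table above (min, max).
BOUNDS = {
    "learning_rate": (0.0001, 0.01),
    "dropout": (0.05, 0.20),
    "epochs": (100, 300),
    "units": (100, 250),
    "layers": (1, 2),
}
INTEGER_KEYS = {"epochs", "units", "layers"}  # assumption: these are rounded

def decode(candidate):
    """Map a vector in [0, 1]^5 onto the hyperparameter search intervals."""
    config = {}
    for gene, (name, (low, high)) in zip(candidate, BOUNDS.items()):
        gene = min(1.0, max(0.0, gene))  # clip genes that left the unit interval
        value = low + gene * (high - low)
        config[name] = int(round(value)) if name in INTEGER_KEYS else value
    return config

# Example: the midpoint of the learning-rate, dropout, epoch and unit ranges.
example = decode([0.5, 0.5, 0.5, 0.5, 0.0])
```

Each decoded configuration would then be used to build and train one RNN, whose validation error serves as the fitness of that candidate.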
5. Simulation Outcomes
Table 2 presents the results of the fitness function (MSE) optimization trials carried out over 30 separate runs, with the top score in each category highlighted in bold. Since MSE is minimized, lower values are better. The proposed AEHO achieved the lowest MSE in the best run (0.002265), with mean and median outcomes of 0.002413 and 0.002422, respectively. COLSHADE achieved the best scores for the worst run, mean and median, and it also demonstrated the most consistent performance across independent runs, recording the lowest standard deviation and variance among the considered optimization algorithms.
Table 2. Fitness function (MSE) outcomes over 30 independent runs
Method | Best | Worst | Mean | Median | Std | Var |
RNN-AEHO | 0.002265 | 0.002563 | 0.002413 | 0.002422 | 0.000113 | 1.28E-08 |
RNN-EHO | 0.002462 | 0.002720 | 0.002545 | 0.002507 | 0.000106 | 1.13E-08 |
RNN-GA | 0.002355 | 0.002615 | 0.002463 | 0.002421 | 0.000095 | 9.04E-09 |
RNN-PSO | 0.002457 | 0.002963 | 0.002625 | 0.002462 | 0.000208 | 4.33E-08 |
RNN-BA | 0.002336 | 0.002517 | 0.002441 | 0.002459 | 0.000075 | 5.55E-09 |
RNN-COLSHADE | 0.002328 | 0.002435 | 0.002374 | 0.002355 | 0.000045 | 1.98E-09 |
Figure 2 illustrates a comparative side by side analysis of the consistency of the examined optimizers over multiple independent runs. The displayed violin plot indicates that the introduced AEHO is not the most reliable algorithm in terms of stability, as it is notably outperformed by several other metaheuristic strategies, like COLSHADE, the baseline EHO and BA. Nonetheless, although these alternatives demonstrated better uniformity in their outcomes, they failed to achieve the best overall performance, which was secured by AEHO. This outcome implies that the alternative methods are more prone to becoming trapped in local optima in comparison to the introduced AEHO. Within the same Figure 2, convergence diagrams of the fitness function (MSE) are also visualized, offering meaningful insight into each method's proficiency in avoiding local minima and converging toward more suitable areas of the search domain. It is evident that the introduced AEHO achieved the most favorable overall solution during the first round of its finest execution, surpassing other considered optimizers, which struggled to escape suboptimal regions. Furthermore, Figure 3 illustrates the box plots alongside the convergence curves of the $R^2$, offering additional perspective on the behavior and efficiency of the evaluated methods.


Table 3 offers a detailed overview of the comprehensive comparative assessment of performance indicators for the finest-performing RNN structures synthesized by each considered optimization technique. The proposed AEHO yielded an RNN configuration that achieved an impressive $R^2$ of 0.918889, MAE of 0.018644, MSE of 0.002265, RMSE of 0.047594 and IoA of 0.978096. It is also evident that the remaining optimizers produced high-quality RNN models as well.
Table 3. Detailed metrics of the best-performing RNN model produced by each optimizer
Method | $R^2$ | MAE | MSE | RMSE | IoA |
RNN-AEHO | 0.918889 | 0.018644 | 0.002265 | 0.047594 | 0.978096 |
RNN-EHO | 0.911850 | 0.019980 | 0.002462 | 0.049616 | 0.975955 |
RNN-GA | 0.915653 | 0.018457 | 0.002355 | 0.048533 | 0.976746 |
RNN-PSO | 0.912029 | 0.021413 | 0.002457 | 0.049565 | 0.975427 |
RNN-BA | 0.916339 | 0.018608 | 0.002336 | 0.048336 | 0.977431 |
RNN-COLSHADE | 0.916638 | 0.019480 | 0.002328 | 0.048249 | 0.977542 |
Figure 4 shows the forecasts made by the best-performing RNN tuned by the proposed AEHO metaheuristic. Lastly, Table 4 lists the hyperparameter configurations of the best RNN models synthesized with each observed optimization method, to support reproducibility of the simulations. The majority of the algorithms opted for structures with one hidden layer, except COLSHADE, which selected an RNN structure with two layers. Moreover, the RNNs with configurations listed in Table 4 attained the results reported in Table 3.

6. Conclusion
The spatiotemporal variability of air pollutant concentrations is remarkably dynamic, making air contamination a phenomenon of regional and global importance and a significant challenge for scientific investigation. The atmospheric behavior of harmful compounds depends on the intensity and nature of the emission sources, along with the meteorological conditions that influence their dispersion, chemical transformation, and eventual removal. Monitoring benzene, a highly volatile and carcinogenic compound, is essential for effective air quality assessment. Accurate prediction of benzene levels demands the deployment of advanced predictive frameworks.
This study explored the effectiveness of methodologies based on artificial intelligence, particularly those that employ RNN architectures, for estimating atmospheric benzene concentrations. To enhance the predictive performance of the RNN models, a modified version of the EHO algorithm was integrated for hyperparameter fine-tuning. The proposed system was evaluated using real-world data sets and produced encouraging results, with the top performing model achieving an MSE of only 0.002265 and an $R^2$ value of 0.918889.
However, despite these promising findings, the research faced several limitations. The vast amount of available data imposed practical constraints on the size of the dataset utilized for training and evaluation phases. Furthermore, the computational demands of the optimization procedures restricted both the population sizes and the number of iterations allowed for the metaheuristic algorithms employed.
Recognizing significant potential for further enhancement, particularly since machine learning models remain insufficiently tested on complex environmental datasets, future work will involve benchmarking alternative machine learning models (both standard and metaheuristically optimized) against the framework developed in this study, while also applying them to other environmental datasets and challenges.
The data used to support the findings of this study are available from the corresponding author upon request.
This research was supported by the Science Fund of the Republic of Serbia, grant No. 7373, characterizing crises-caused air pollution alternations using an artificial intelligence-based framework (crAIRsis), and grant No. 7502, Intelligent Multi-Agent Control and Optimization applied to Green Buildings and Environmental Monitoring Drone Swarms (ECOSwarm).
The authors declare no conflict of interest.
