An Enhanced Convolutional Neural Network for Accurate Classification of Grape Leaf Diseases
Abstract:
Grape leaf diseases can significantly reduce grape yield and quality, making accurate and efficient identification of these diseases crucial for improving grape production. This study proposes a novel classification method for grape leaf disease images based on an improved convolutional neural network. The Xception network serves as the base model, with the original ReLU activation function replaced by Mish to improve classification accuracy. An improved channel attention mechanism is integrated into the network, enabling it to automatically learn the important information in each channel, and the fully connected layer is redesigned for better classification performance. Experimental results demonstrate that the proposed model (MS-Xception) achieves high accuracy with fewer parameters, reaching a recognition accuracy of 98.61% on grape leaf disease images. Compared with other state-of-the-art models such as ResNet50 and the Swin Transformer, the proposed model shows superior classification performance, providing an efficient method for the intelligent diagnosis of grape leaf diseases. The proposed method significantly improves the accuracy and efficiency of grape leaf disease diagnosis and has potential for practical application in grape production.
1. Introduction
Grapes are a globally important cash crop, prized for their sweet taste and nutritional value. However, the incidence of grape leaf diseases has been increasing due to climate and environmental changes, with common diseases such as black rot, whorl spot, and brown spot severely impacting grape yield and quality. Precise identification of grape leaf disease species is essential for targeted treatment of affected leaves. Yet the manual visual identification of disease species currently practiced in most vineyards is inefficient, costly, and prone to error.
In recent years, deep learning has become a popular tool in computer vision applications, owing to its ability to extract image features automatically. Among the various deep learning techniques, convolutional neural networks (CNNs) have made significant strides in image recognition, including object detection [1], [2], [3], [4], image segmentation [5], [6], and autonomous driving [7]. Researchers have also applied CNNs to crop disease recognition, where they have shown promising results.
For instance, Sladojevic et al. [8] proposed a plant disease recognition method using deep convolutional networks, achieving recognition rates between 91% and 98%; this was the first application of deep learning to plant disease classification. Bi et al. [9] developed a leaf disease classification method based on MobileNet, which demonstrated better recognition efficiency for apple leaf diseases than InceptionV3 and ResNet152. Hameed and Üstündağ [10] proposed a method to detect apple leaf disease species using deep neural networks (DNNs), utilizing speeded-up robust features (SURF) for feature extraction and the grasshopper optimization algorithm (GOA) for feature optimization, achieving better classification accuracy. Krishnamoorthy et al. [11] used the InceptionResNetV2 model combined with transfer learning to identify diseases in rice leaf images, achieving 95.67% recognition accuracy. Luo et al. [12] proposed a multi-scale feature fusion-based apple disease classification network by altering the batch normalization and ReLU positions in the ResNet network and optimizing it with methods such as pyramidal convolution in place of 3×3 convolution, achieving 94.24% recognition accuracy. Zhang et al. [13] combined dilated convolution with global pooling and proposed a global-pooling dilated convolutional neural network that can effectively discriminate cucumber diseases. Hu et al. [14] developed GKFENet, a lightweight adaptive feature extraction network model based on SqueezeNet, achieving an average recognition accuracy of 97.90% for tomato diseases.
While previous studies have achieved good results in plant leaf disease classification, the small and variable spots in grape leaf disease images present a unique challenge, making it difficult to obtain accurate classification results. To address this challenge and achieve precise classification of grape leaf disease species, an improved convolutional neural network based on the Xception network [15] is proposed in this study. The main contributions of this article are as follows:
Firstly, the acquired dataset is expanded using data enhancement methods, including brightness adjustment, flipping, and Gaussian noise addition, to prevent model overfitting and improve the robustness of the network.
Secondly, the Mish activation function is used to replace the ReLU activation function of the Xception network, improving the classification accuracy of the network and avoiding the problem of neuron death.
Thirdly, the network is enhanced with the Squeeze-and-Excitation (SE) module, which enables the network to focus more on extracting important information between feature channels, improving the network's ability to extract small lesion features.
Finally, the fully connected layer of the Xception network is improved using 1×1 convolution instead of the fully connected layer, enhancing the classification performance of the network.
2. Data Acquisition and Processing
To accomplish the grape leaf disease classification task, a certain number of grape leaf disease images were collected, followed by data enhancement and division of the dataset into a training set and a test set.
The experimental data for this study were collected from the PlantVillage dataset and consist of 2000 grape leaf images: 500 images each of black rot, whorl spot, brown spot, and healthy leaves.
In real-life situations, grape leaves grow in complex environments, and images taken in such environments may be impacted by various factors such as weather conditions, equipment clarity, and shooting angles. As a result, the photographs of grape leaves taken in real-life situations may exhibit complex backgrounds with varied angles, levels of clarity, etc., which can impact the classification results.
To simulate real-world photography conditions and make the classification task better suited to application in real environments, the data were processed through data enhancement techniques. Specifically, image brightness was both increased and decreased, images were flipped, and Gaussian noise was added to expand the dataset. The resulting processed dataset comprises 10,000 images, which were divided into a training set and a validation set in a 4:1 ratio. Some of the original and processed images are shown in Figure 1.
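As an illustration, such an augmentation pipeline can be sketched with PyTorch and torchvision; the brightness range, flip probability, and noise level below are plausible assumptions, not the exact settings used in this study.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add zero-mean Gaussian noise to a tensor image in [0, 1]."""
    def __init__(self, std=0.05):
        self.std = std

    def __call__(self, img):
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.Resize((299, 299)),           # Xception's input resolution
    transforms.ColorJitter(brightness=0.4),  # randomly brighten or darken
    transforms.RandomHorizontalFlip(p=0.5),  # flipping
    transforms.ToTensor(),
    AddGaussianNoise(std=0.05),              # Gaussian noise
])
```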

3. Classification Model of Grape Leaf Diseases
The basic structure of the Xception network is introduced in this section.
Before 2014, most convolutional neural networks improved their performance by increasing the depth or width of the network, which was computationally expensive and could lead to overfitting. To address this problem, GoogLeNet [16] proposed the Inception module, which runs 1×1 convolution, 3×3 convolution, 5×5 convolution, and 3×3 pooling in parallel, using 1×1 convolutions for dimensionality reduction to cut computation. The network can learn the appropriate combination of convolution branches to use, and the spatial shape of the output feature map remains unchanged. The Inception module increases the network depth while enhancing the network's adaptability to features at different scales and significantly reducing computational effort (see Figure 2).
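For illustration, a minimal PyTorch sketch of an Inception-style module with these four parallel branches is given below; the branch widths are arbitrary choices, not GoogLeNet's exact configuration.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions and 3x3 pooling,
    with 1x1 convolutions used for dimensionality reduction."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.b2 = nn.Sequential(                    # 1x1 reduce, then 3x3
            nn.Conv2d(in_ch, 24, 1), nn.ReLU(),
            nn.Conv2d(24, 32, 3, padding=1))
        self.b3 = nn.Sequential(                    # 1x1 reduce, then 5x5
            nn.Conv2d(in_ch, 8, 1), nn.ReLU(),
            nn.Conv2d(8, 16, 5, padding=2))
        self.b4 = nn.Sequential(                    # 3x3 pool, then 1x1
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 16, 1))

    def forward(self, x):
        # Every branch preserves the spatial size, so the outputs can be
        # concatenated along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x),
                          self.b3(x), self.b4(x)], dim=1)
```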

Depthwise Separable Convolution (DSC) comprises Depthwise Convolution (DC) and pointwise convolution (PC), which significantly reduce the number of parameters and computational costs of the network compared to normal convolution operations.
The DC operation performs spatial convolution for each input channel, and the results are restacked to obtain the output. Each convolution kernel corresponds to a separate channel for convolution. The PC operation performs a second convolution of the feature map obtained after the DC operation, performing channel fusion, changing the number of output channels, and combining the outputs of the DC operation.
Eq. (1) shows the computation required for an ordinary convolution operation:

$N_{std} = H \times W \times f \times f \times C_1 \times C_2$ (1)

where H and W denote the height and width of the feature map, f denotes the size of the convolution kernel, and $C_1$ and $C_2$ denote the numbers of input and output channels, respectively.
The computation required to perform the DSC operation is shown in Eq. (2), where the first term corresponds to the depthwise convolution and the second to the pointwise convolution:

$N_{DSC} = H \times W \times f \times f \times C_1 + H \times W \times C_1 \times C_2$ (2)
The schematic diagram of the DSC operation is shown in Figure 3.

The ratio of the computational effort of the DSC operation to that of ordinary convolution is:

$\frac{N_{DSC}}{N_{std}} = \frac{1}{C_2} + \frac{1}{f^2}$ (3)

Since $C_2$ is typically large and f = 3 in most layers, this ratio is far below 1, showing that depthwise separable convolution significantly reduces computation.
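The following PyTorch sketch shows the standard way to realize a DSC block (groups=in_ch makes the 3×3 convolution depthwise), together with a weight count that matches the ratio in Eq. (3); it is an illustrative block, not the exact layer used in Xception.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # Depthwise: one spatial kernel per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # Pointwise: 1x1 convolution fuses channels and sets out_ch.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Weight count for f=3, C1=64, C2=128 (biases ignored):
#   ordinary conv: 3*3*64*128 = 73,728
#   DSC:           3*3*64 + 64*128 = 8,768
#   ratio ~ 0.119 = 1/128 + 1/9, as in Eq. (3)
```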
The Xception network is a convolutional neural network that replaces the 3×3 convolution in the Inception v3 [17] network with DSC and combines the residual structure of the ResNet [18] network. It comprises three parts: Entry flow, Middle flow, and Exit flow, and includes a total of 14 blocks, each containing the DSC structure. The Xception network processes spatial and channel information separately, which enables it to extract image features more comprehensively.
This section focuses on the improvement made to the Xception network in this study. The experimental results demonstrate that the proposed improved Xception model can effectively extract the disease spot features of grape disease leaves and achieve better classification performance.
The activation function plays a crucial role in convolutional neural networks by mapping the input of neurons to the output and introducing nonlinearity into the network. This enhances the network's ability to fit various nonlinear models and increases the expressiveness of the model, thereby making deep networks effective.
The Xception network uses ReLU as its activation function, given in Eq. (4):

$f(x) = \max(0, x)$ (4)
The ReLU activation function passes positive inputs through unchanged and sets all non-positive inputs to 0. It activates only some neurons at a time, introducing sparsity into the network, improving computational efficiency, and reducing parameter interdependence. In the positive interval, the ReLU function has a constant gradient of 1, which avoids the vanishing gradient problem. However, ReLU can cause the "dying neuron" problem: once a neuron's output becomes zero, its gradient is zero and its weights can no longer be updated.
The Mish function's graph is similar to that of ReLU, but it is smoother and does not flatten to zero in the negative interval. The Mish function is defined in Eq. (5):

$f(x) = x \cdot \tanh(\ln(1 + e^x))$ (5)
The Mish activation function likewise avoids the vanishing gradient problem, and its non-zero gradient in the negative interval prevents neuron death, enabling the network to learn more features. Additionally, Mish can speed up training and improve the network's accuracy: its smoothness allows information to propagate more deeply into the network, yielding better accuracy and generalization. Figure 4 displays the graphs of the two activation functions.
Replacing the ReLU activation function with the Mish activation function improves the recognition accuracy of the classification network.
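A minimal sketch of this replacement is shown below. Mish is implemented directly from Eq. (5) (recent PyTorch releases also provide it as nn.Mish), and a small helper recursively swaps every ReLU in a model for Mish.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    def forward(self, x):
        # x * tanh(ln(1 + e^x)) = x * tanh(softplus(x)), per Eq. (5)
        return x * torch.tanh(F.softplus(x))

def replace_relu_with_mish(model: nn.Module):
    """Swap every nn.ReLU in `model` for Mish, in place."""
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, Mish())
        else:
            replace_relu_with_mish(child)  # recurse into submodules
```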

The Attention Mechanism (AM) is a method of focusing on relevant information by automatically calculating the importance of input information to the network's output. This enables the network to prioritize effective information and ignore irrelevant information, improving the network's efficiency. The channel attention mechanism focuses on relevant information within channels, enabling the network to learn the importance of each channel and use resources more efficiently to extract more effective information.
SENet [19] (Squeeze-and-Excitation Networks) proposed a channel attention structure called the SE module, which comprises two parts: Squeeze and Excitation. Figure 5 shows the structure of the SE module. The SE module is divided into three parts: Squeeze operation, Excitation operation, and Scale operation.
The Squeeze operation performs global average pooling on the feature map of each channel, compressing each channel into a single real number that represents it. The formula for this operation is shown in Eq. (6):

$z_k = F_{sq}(v_k) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} v_k(i, j)$ (6)

where $z_k$ is the result of the Squeeze operation on the k-th channel over the spatial dimensions H×W, $v_k(i, j)$ is the feature map obtained after a series of convolutions, and C is the number of channels of v (k = 1, ..., C).
The Excitation operation learns the feature weight of each channel. It first reduces the dimensionality through a fully connected layer, applies a ReLU activation function, restores the dimensionality through a second fully connected layer, and finally produces a weight coefficient between 0 and 1 through a sigmoid activation function, thereby predicting the importance of each channel. The calculation formula for the Excitation operation is shown in Eq. (7):

$s = F_{ex}(z, W) = \sigma(W_2 \delta(W_1 z))$ (7)

where σ denotes the sigmoid function, δ denotes the ReLU activation function, $W_1 \in R^{\frac{C}{r} \times C}$ and $W_2 \in R^{C \times \frac{C}{r}}$ are the parameters of the two fully connected layers, and r is the reduction ratio used to shrink the dimensionality of the fully connected layers.

The Scale operation applies the weight coefficients obtained from Excitation to the original features channel by channel, marking the importance of each channel. The formula is shown in Eq. (8):

$\tilde{x}_k = F_{scale}(v_k, s_k) = s_k \cdot v_k$ (8)
To extract disease spot information more comprehensively from disease images, this study proposes an improved SE module that uses a parallel structure of global maximum pooling and global average pooling instead of the original global average pooling. This improves the network's classification capability. Figure 6 compares the principle of the SE module before and after the improvement.
The global average pooling and global max pooling are used to process the feature maps separately, compressing them from space (N, H, W, C) to space (N, 1, 1, C). The two compressed feature maps are then fused to fully extract the texture information of the disease images.
The improved SE module is integrated into the Middle flow section of the Xception network to enhance its ability to extract disease spot features from disease images.
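A sketch of the improved SE module follows, assuming element-wise summation as the fusion of the two pooled descriptors and a reduction ratio of r = 16; both are plausible defaults rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class ImprovedSEBlock(nn.Module):
    """SE block with parallel global average and global max pooling."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # W1: reduce dimensionality
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # W2: restore dimensionality
            nn.Sigmoid(),                        # weights in (0, 1), Eq. (7)
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        # Compress each channel to a single value, twice in parallel,
        # then fuse the two descriptors by summation (one plausible choice).
        z = self.avg_pool(x).view(n, c) + self.max_pool(x).view(n, c)
        s = self.fc(z).view(n, c, 1, 1)
        return x * s                             # Scale operation, Eq. (8)
```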
In convolutional neural networks, the fully connected layer can combine local features obtained in the previous layer, reduce the impact of feature position on classification results, and improve the network's robustness. However, using fully connected layers can result in excessive network parameters.
By contrast, a 1×1 convolution can perform the same channel-wise combination at every spatial position while greatly reducing the number of parameters. Using 1×1 convolution instead of fully connected layers effectively reduces the network's parameter count and improves its performance.
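A minimal sketch of such a classification head, assuming Xception's final 2048-channel feature map and the four classes in this dataset:

```python
import torch.nn as nn

# The 1x1 convolution maps the 2048 feature channels to 4 class scores at
# every spatial position; global average pooling then collapses the
# spatial dimensions, yielding (N, 4) logits without a dense layer.
head = nn.Sequential(
    nn.Conv2d(2048, 4, kernel_size=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
```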
The overall structure of the improved network is presented in Figure 7.


4. Experimental Results and Analysis
The experiments were run on Windows 10 with an AMD Ryzen 7 5800H processor (3.20 GHz, with Radeon Graphics), 16 GB of memory, and an NVIDIA GeForce RTX 3060 Laptop GPU. The classification model was built on the PyTorch deep learning framework under Python 3.8, with PyCharm as the development environment.
The experimental parameters were as follows: a batch size of 12, 100 training epochs, the Adam optimizer, weight decay with a coefficient of 0.0002 to suppress overfitting, and cross-entropy as the loss function.
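As a rough sketch, this training configuration might be set up in PyTorch as follows; the model and data are stand-ins so the loop runs end to end, and would be replaced by MS-Xception and the real dataset in practice.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and random data, placeholders for MS-Xception and the
# grape leaf dataset.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 4))
data = TensorDataset(torch.randn(24, 3, 299, 299),
                     torch.randint(0, 4, (24,)))
train_loader = DataLoader(data, batch_size=12, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,  # best rate found
                             weight_decay=2e-4)            # decay coefficient
criterion = nn.CrossEntropyLoss()

for epoch in range(100):                 # 100 training epochs
    for images, labels in train_loader:  # batch size 12
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```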
To investigate the effect of the learning rate on the results, comparison experiments were designed with learning rates of 0.0001, 0.0005, and 0.001; as analyzed below, 0.001 was found to be optimal.
Accuracy, recall (R), precision (P), and F1-score are commonly used metrics to evaluate the effectiveness of a classification model. They assist in determining how well the model is able to correctly identify and classify samples.
Accuracy measures the proportion of correctly predicted samples among all samples. The formula is shown in Eq. (9):

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (9)

Recall, also known as sensitivity or the true positive rate, measures the proportion of actual positive samples that are correctly predicted as positive:

$R = \frac{TP}{TP + FN}$ (10)

Precision measures the proportion of positive predictions that are truly positive, as shown in Eq. (11):

$P = \frac{TP}{TP + FP}$ (11)

F1-score is a combined metric that accounts for both precision and recall, defined as their harmonic mean:

$F1 = \frac{2 \times P \times R}{P + R}$ (12)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
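The metrics of Eqs. (9)-(12) can be computed directly from a multi-class confusion matrix in the one-vs-rest sense; the example matrix below is hypothetical, not the result reported in this paper.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                  # wrongly predicted as class k
    fn = cm.sum(axis=1) - tp                  # class k samples missed
    precision = tp / (tp + fp)                # Eq. (11)
    recall = tp / (tp + fn)                   # Eq. (10)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (12)
    accuracy = tp.sum() / cm.sum()            # Eq. (9), multi-class form
    return accuracy, precision, recall, f1

cm = np.array([[489,   0,   3,   8],          # hypothetical 4-class counts
               [  0, 500,   0,   0],
               [  0,   0, 500,   0],
               [  3,   2,   0, 495]])
acc, p, r, f1 = per_class_metrics(cm)
print(f"accuracy = {acc:.4f}")
```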
The confusion matrix is a widely used evaluation metric in multi-classification tasks. After testing the trained model on a test set of 2000 images, the resulting confusion matrix is presented in Figure 8. From the confusion matrix, the single test accuracy of the model was calculated to be 99.3%. Table 1 shows the recall, precision, and F1-score for each category.
Category | Precision/% | Recall/% | F1-Score
black rot | 99.4 | 97.8 | 0.986
brown spot | 99.6 | 100 | 0.998
healthy | 100 | 100 | 1
whorl spot | 98.2 | 99.4 | 0.988
Table 1 indicates that the trained model achieves a precision of 98% or higher for all three types of diseased leaves, with F1-scores above 0.985, further confirming the effectiveness of the proposed model.

Selecting an appropriate learning rate is crucial because a low learning rate slows down network convergence while a high learning rate may result in the gradient explosion problem and make model convergence difficult. To determine the optimal learning rate, we conducted a comparison experiment with learning rates of 0.0001, 0.0005, and 0.001, respectively. The results are presented in Figure 9, which shows that the classification performance is best when the learning rate is set to 0.001.


To assess the effectiveness of the proposed method, comparative experiments were conducted against the ResNet50, ShuffleNet V2, and Swin Transformer classification models. The experimental results are presented in Figure 10. The classification accuracy of ShuffleNet V2 is comparatively lower: as a lightweight convolutional neural network, it does not match the deeper models in classification performance. ResNet50 and the Swin Transformer both achieve good classification results, but they have more parameters and are computationally complex. The MS-Xception model proposed in this study performs best in the classification task, with faster convergence and higher accuracy, giving it significant advantages over the other classification models.
Ablation experiments were conducted to verify the effectiveness of the proposed improvements. Table 2 presents the results, where P1 denotes replacing the activation function, P2 denotes introducing the improved SE module, and P3 denotes improving the fully connected layer. The experiments demonstrate that the proposed improvements significantly enhance the performance of the original network, raising the average test accuracy by 2.38 percentage points to 98.61%.
Model | Average Test Accuracy/%
Xception | 96.23
Xception+P1 | 97.54
Xception+P1+P2 | 97.87
Xception+P1+P2+P3 | 98.61
The performance of the network was enhanced by using the Mish activation function to mitigate the "dying neuron" problem. The introduction of the improved SE module enabled the network to focus more effectively on the critical information in disease images, improving classification accuracy with only a small increase in computation. Replacing the fully connected layer with 1×1 convolution reduced the number of network parameters and improved classification performance. The proposed MS-Xception network demonstrated better results than the original network on the grape leaf disease classification task.
Feature visualization of convolutional neural networks provides visual insight into the learning ability of classification models. In this study, Grad-CAM [20] was used for feature visualization of the classification model, and the resulting heat map is presented in Figure 11. The darker a region's color, the more strongly the model attends to the features there. The heat map shows that the MS-Xception network focuses its attention on the disease spot regions of the leaf images, validating its ability to identify grape leaf disease features.
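For reference, the Grad-CAM computation can be sketched with forward and backward hooks; model and target_layer below are placeholders for the trained classification network and its last convolutional layer, and libraries such as pytorch-grad-cam package the same idea.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Return a [0, 1]-normalized class activation map for one image."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(v=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(v=go[0]))

    score = model(image.unsqueeze(0))[0, class_idx]  # target class logit
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    # Channel weights: global average pooling of the gradients.
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts["v"]).sum(dim=1))   # weighted feature maps
    return (cam / (cam.max() + 1e-8)).squeeze(0).detach()
```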

5. Conclusion
This study proposes an improved lightweight convolutional network (MS-Xception) for enhancing the accuracy of grape leaf disease species identification. The network is based on the Xception network, and modifications were made to the activation function, the introduction of an attention mechanism, and the improvement of the fully connected layer to enhance the classification accuracy of the network. The experimental results demonstrate that the proposed model outperforms other networks, providing an effective method for grape leaf disease classification and serving as a reference for crop pest and disease identification.
However, the proposed method has certain limitations. The model parameters are relatively large, making it unsuitable for deployment on mobile devices. Furthermore, identifying the degree of grape leaf disease is a crucial task that requires further exploration. Future work will focus on developing lightweight models and improving the identification of disease extent.
Yinglai Huang made significant contributions to the design of the experiment and manuscript revision. Ning Li made significant contributions to the experiment design, data collection and processing, execution of the experiment, and manuscript writing and revision. Zhenbo Liu contributed significantly to the provision of the project and manuscript revision. All authors have read and agreed to the published version of the manuscript.
The data used to support the findings of this study are available from the corresponding author upon request.
The hard work and valuable comments of the anonymous reviewers are greatly appreciated, as they have contributed to the improvement in the quality of this paper.
The authors declare that they have no conflicts of interest.
