RDE-DR: robust deep ensemble CNNs for automated diabetic retinopathy detection from fundus images - Nature

Introduction

Diabetic retinopathy (DR) is a leading cause of vision impairment and preventable blindness worldwide, particularly among individuals with diabetes¹. Early and accurate screening is essential to prevent irreversible vision loss and improve patient outcomes². However, early-stage DR detection remains challenging due to the subtle appearance of pathological features and the reliance on manual interpretation by clinical experts, which is time-consuming and subject to inter-observer variability³. In recent years, advances in artificial intelligence (AI) and deep learning have significantly enhanced automated disease screening, enabling reliable analysis of retinal fundus images⁴. Convolutional neural networks (CNNs), in particular, have demonstrated strong performance in medical image classification and segmentation tasks. Nevertheless, the scarcity of large, well-annotated medical datasets limits the training of high-capacity models from scratch⁵. Transfer learning alleviates this limitation by adapting models pre-trained on large-scale datasets such as ImageNet (for example ResNet50, VGG16, VGG19, and DenseNet121) to domain-specific tasks such as DR detection⁶.

Building upon transfer learning, ensemble learning has emerged as an effective strategy for improving robustness, generalization, and reliability in medical image analysis⁷. Instead of relying on a single model, ensembles integrate complementary predictions from multiple architectures to reduce variance and mitigate individual model bias⁸. Many works report performance improvements using voting or averaging techniques but do not analyze how fusion design influences calibration reliability, decision margins, or operational stability under identical experimental conditions. In addition, fixed decision thresholds are commonly adopted without investigating their impact on accuracy–precision trade-offs, which is critical for medical screening scenarios. Furthermore, comparisons across studies are frequently hindered by inconsistent preprocessing pipelines, training protocols, and evaluation settings, making it difficult to draw reproducible conclusions regarding ensemble effectiveness.

To address these limitations, this work proposes a Robust Deep Ensemble for Diabetic Retinopathy Detection (RDE-DR). This unified and reproducible experimental framework integrates four pretrained CNN backbones (ResNet50, VGG16, VGG19, and DenseNet121) into an automated DR screening system. To systematically exploit model complementarity, seven fusion strategies are investigated: hard voting, soft voting, weighted soft voting, rank-based fusion, Choquet integral, Sugeno integral, and average-logit fusion⁹,¹⁰. These fusion strategies are compared under controlled conditions, together with an analysis of their probabilistic behavior and decision thresholds. The experiments are conducted on the APTOS 2019 Blindness Detection dataset, incorporating contrast-limited adaptive histogram equalization (CLAHE) and data augmentation techniques¹¹.

Rather than pursuing incremental performance gains, the study emphasizes methodological understanding and practical guidance for designing reliable ensemble-based medical image classifiers. The main contributions of this work are as follows:

(1) A unified experimental pipeline integrating CLAHE-based preprocessing, transfer learning with four heterogeneous CNN architectures (ResNet50, VGG16, VGG19, DenseNet121), and seven different ensemble fusion strategies within identical training and evaluation conditions.

(2) A comprehensive comparative analysis of voting-based, rank-based, and fuzzy-integral-based fusion mechanisms, enabling controlled assessment of their robustness, calibration behavior, and metric stability.

(3) Systematic threshold optimization applied consistently across all fusion strategies to study accuracy–precision trade-offs and operational flexibility in medical screening scenarios.

(4) Probabilistic behavior analysis using Kernel Density Estimation (KDE) to characterize decision margins and reliability beyond conventional accuracy reporting.

The remainder of this paper is organized as follows: Sect. 2 reviews related work on DR detection using deep and ensemble learning. Section 3 presents the methodology, including CLAHE preprocessing, model architectures, and fusion strategies. Section 4 describes the experimental setup, evaluation metrics, and results. Section 5 compares the proposed framework performance with similar works. Finally, Sect. 6 concludes the paper and outlines future research directions.

Related work

Several studies using powerful pre-trained convolutional neural networks (CNNs), such as ResNet50, VGG16/19, DenseNet121, and EfficientNet, have reported DR screening accuracies typically in the range of 94–97%, with area under the ROC curve (AUC) values of around 0.97–0.99 on public datasets such as APTOS 2019 and other Kaggle DR challenges. These studies have also reported sensitivities and specificities often exceeding 94%¹². Early approaches combined machine learning with deep features, demonstrating that ensembles of CNN-derived features fed to classical classifiers can outperform standalone models¹³.

Integrating multiple CNN architectures via ensemble or model-fusion approaches is another significant development in the field¹⁴. Several studies report that combining the predictive strengths of different models through fusion techniques produces better results than single-model baselines. For example, an ensemble framework proposed in¹⁵ achieved an accuracy of around 96–97%, F1-scores above 0.96, and an AUC greater than 0.98 for binary DR classification, outperforming its best individual CNN by 2–3 percentage points in terms of both accuracy and sensitivity. Similarly, an ensemble-based DR system using optimized voting and feature-level fusion presented in¹⁶ achieved around 95–96% accuracy and close to 95% sensitivity and specificity, as well as AUC values above 0.97 on benchmark datasets.

Islam et al.¹⁷ presented deep ensembles integrating multiple CNNs via feature-level fusion and probability voting/stacking. The authors evaluated pairwise and tri-fusion of pretrained CNN backbones (ResNet50, EfficientNet-B0, DenseNet121) for binary DR screening. Their results indicate that fused feature representations consistently outperform individual models, while offering promising computational efficiency trade-offs.

Moreover, Lin¹⁸ proposed an accuracy-weighted ensemble framework that combines seven distinct CNN architectures (including ResNet-50, DenseNet variants, EfficientNet, and MobileNet models) using a weighted majority voting scheme coupled with entropy-guided uncertainty estimation. This approach not only improved classification performance (achieving near 99% accuracy post-filtering) but also enabled rejection of low-confidence predictions.

Similarly, soft voting ensembles have been proposed in¹⁹ where posterior class probabilities from several pretrained networks (EfficientNet-B0, ResNet-50, and DenseNet-121) are aggregated to enhance the final decision. The obtained results reported noticeable gains in detection accuracy relative to individual CNN baselines, validating the utility of probabilistic fusion mechanisms in fundus image classification.

More recently, many studies in medical image analysis have increasingly explored advanced architectural paradigms beyond conventional CNNs, including transformer-based backbones, attention mechanisms, hybrid CNN–Transformer architectures, explainable learning frameworks, and secure collaborative infrastructures²⁰,²¹. Hybrid models combining convolutional networks with Vision Transformers and multi-scale fusion have demonstrated improved spatial representation and segmentation accuracy in medical image analysis (brain tumor, chest, cardiomegaly, and fundus images), highlighting the effectiveness of global–local feature integration²²,²³,²⁴,²⁵. Likewise, attention-enhanced architectures such as DeepLabV3 with attention modules and EfficientNet-based explainable frameworks incorporating Grad-CAM have shown improved localization accuracy and interpretability on both public and clinical datasets²⁶.

In parallel, image pre-processing techniques such as Contrast-Limited Adaptive Histogram Equalization (CLAHE) are often used to improve the visibility of vessels and lesions. This can reduce illumination variability and enhance the robustness of downstream models. Kobat et al.²⁷ reported an increase in accuracy from approximately 93% to 96% and in AUC from 0.95 to 0.98 after applying CLAHE and related enhancement steps. Table 1 summarizes the most important works on automated DR detection, highlighting the datasets used, number of images, learning architectures, methodological strategies, and reported performance metrics.

Table 1 Overview of representative studies on automated diabetic retinopathy detection.


Despite the aforementioned advancements, several gaps remain in the literature that motivate the development of more rigorous ensemble fusion frameworks based explicitly on pretrained CNNs. Many existing ensembles rely on empirical combinations of a small number of architectures or simple voting schemes, without systematically exploring feature- and decision-level fusion strategies tailored to DR lesion patterns. Besides, there is still limited work on unified frameworks that jointly optimize transfer learning from multiple pretrained CNNs, fusion mechanisms, and calibration of predicted probabilities for clinical deployment.

The present work builds upon the aforementioned trends by designing an ensemble fusion framework that integrates several pretrained CNN models specialized for retinal imaging, combining their outputs through a carefully designed fusion strategy for robust DR screening. By leveraging the complementary strengths of different pretrained models and explicitly addressing probabilistic prediction behavior and threshold considerations, the proposed framework aims to improve upon current ensemble approaches and contribute a more reliable tool for large-scale DR screening systems.

Materials and methods

Proposed method

Our study proposes a deep ensemble transfer learning framework (RDE-DR) for the automated detection of diabetic retinopathy (DR) using CLAHE-enhanced APTOS 2019 fundus images. The pipeline begins with the APTOS 2019 RGB retinal fundus dataset²⁸, where DR images are resized, normalized, and augmented to improve generalization²⁹. CLAHE is then applied to enhance local contrast and highlight diagnostically relevant structures, including microaneurysms, exudates, and hemorrhages³⁰. The pre-processed images are split into training and testing sets (80/20), and four pre-trained convolutional neural networks, ResNet50, VGG16, VGG19, and DenseNet121, are trained via transfer learning from ImageNet³¹ weights for binary DR classification (No_DR and DR). Seven ensemble fusion strategies are employed to exploit the complementary representations learned by these architectures: hard voting, soft voting, weighted soft voting, rank-based fusion, Choquet-like integral, Sugeno integral, and average logits fusion. These strategies produce a robust aggregated prediction for each image. The overall system is evaluated using accuracy, precision, recall, the F1 score, the ROC-AUC, and confusion matrices. The full RDE-DR pipeline is summarized schematically in Fig. 1.

Dataset

This study uses only the Asia Pacific Tele-Ophthalmology Society (APTOS) 2019 Blindness Detection dataset²⁸ and the diabetic retinopathy (DR) labels it provides. Images labeled with any degree of DR (mild, moderate, severe, or proliferative) are grouped in class 1 (DR), while images without DR are assigned to class 0 (No_DR). This binary reformulation reduces the screening task to distinguishing between diseased and healthy retinas, which is consistent with many automated pre-screening scenarios.

The APTOS 2019 dataset was selected because it provides a thorough representation of retinal abnormalities at various DR severity levels and is widely used in research settings to benchmark computer-aided diagnosis systems. The original dataset comprises 3662 retinal fundus images with ground-truth labels, and an additional 1928 images form a separate test set without public labels. This study only uses the 3662 labeled training images, of which 1805 belong to the No_DR class (class 0) and 1857 to the DR class (class 1). Focusing on a single, high-quality dataset ensures consistent preprocessing, training, and evaluation protocols throughout the study.

Note that the APTOS 2019 dataset does not provide explicit patient identifiers or paired left-right eye metadata. Therefore, it was not possible to perform a strict patient-level split. Figure 2 shows examples of images from both classes, demonstrating the variety of appearances, lighting conditions, and pathologies within the APTOS 2019 dataset.

Data pre-processing

This section outlines the preparation steps applied to the APTOS 2019 images prior to training the model. The original retinal fundus images are high-resolution (typically around 4288 × 2848 pixels) and exhibit significant variability in terms of focus, illumination, and noise³². The dataset includes blurred, overexposed, and underexposed samples. To standardize the input size and reduce the computational cost, all images were resized to 224 × 224 pixels before being fed into the CNN models. The labelled dataset was randomly partitioned into training and test subsets in an 80:20 split. The training set contains 2929 images (1485 DR and 1444 No_DR), while the held-out test set includes 733 images (372 DR and 361 No_DR). This breakdown ensures an almost equal distribution of diseased and non-diseased classes in both subsets, enabling fair performance assessment for each category. Figure 3 illustrates the class distribution in the training and testing sets, showing that both partitions preserve the balance between DR and No_DR images overall.

CLAHE (contrast-limited adaptive histogram equalization)

CLAHE is a local contrast enhancement technique that operates on small, non-overlapping tiles within an image. Within each tile, the histogram is equalized and the contrast is amplified to a limited extent, preventing the over-enhancement of noise and the introduction of unnatural artifacts. Once all the tiles have been processed, bilinear interpolation is applied to merge neighboring regions smoothly and avoid visible grid boundaries³⁰.

This behavior makes CLAHE particularly effective for low-contrast images, such as medical images, where subtle local structures must be enhanced without distorting the overall appearance. In practice, CLAHE is primarily controlled by two parameters: the clipLimit, which determines the maximum allowable contrast amplification, and the tileGridSize, which defines the spatial scale of local enhancement. In most implementations (e.g., in common computer vision libraries), CLAHE can be applied to both grayscale and color images. For color images, it is usually applied to the luminance channel only, preserving the original color information while improving local contrast³³.

The overall CLAHE process can be summarized as follows³⁰:

Image division: The input image is divided into small regions (tiles) of a predefined size.
Histogram computation: A histogram is computed for each tile to represent the distribution of grey levels in that region.
Histogram clipping: The histogram of each tile is clipped at a predefined clip limit to restrict peak values and control contrast amplification.
Histogram equalization: The clipped histogram of each tile is equalized to produce locally enhanced pixel values.
Image reconstruction: The equalized pixel values are then used to reconstruct each enhanced tile. All of the enhanced tiles are subsequently merged to create the final, contrast-enhanced image.
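
As an illustration of the histogram clipping and equalization steps listed above, the following pure-Python sketch processes a single tile. It is not the implementation used in this study: the function names are hypothetical, the clip limit is assumed to be an absolute integer count per bin (library implementations usually take a normalized value), and the division into tiles and the bilinear interpolation between tiles are omitted.

```python
def clip_histogram(hist, clip_limit):
    """Clip each bin at clip_limit and spread the excess evenly over all bins.

    Single-pass redistribution: a bin may end slightly above clip_limit.
    """
    clipped = [min(h, clip_limit) for h in hist]
    excess = sum(hist) - sum(clipped)
    share, remainder = divmod(excess, len(hist))
    clipped = [c + share for c in clipped]
    for i in range(remainder):  # hand out leftover counts, one per bin
        clipped[i] += 1
    return clipped


def equalize_tile(tile, clip_limit, levels=256):
    """Clipped histogram equalization of one tile.

    tile: list of rows of integer grey levels in [0, levels).
    clip_limit: assumed here to be an integer count per histogram bin.
    """
    hist = [0] * levels
    for row in tile:
        for v in row:
            hist[v] += 1
    hist = clip_histogram(hist, clip_limit)
    total = sum(hist)
    cdf, running = [], 0
    for h in hist:
        running += h
        cdf.append(running)
    # Map each grey level through the normalized cumulative distribution.
    lut = [round((levels - 1) * c / total) for c in cdf]
    return [[lut[v] for v in row] for row in tile]
```

A lower clip limit flattens the histogram more aggressively before equalization, which is what bounds the contrast amplification within each tile.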


CNN models

Many researchers, particularly those specializing in medical image processing, use transfer learning (TL) rather than training deep convolutional neural networks (CNNs) from scratch. This is because TL significantly reduces training time and data requirements while improving generalization³⁴. In this study, we adopt a multi-stage deep ensemble transfer learning methodology that combines feature extraction with pre-trained CNNs, aggregation through ensemble fusion, and a comprehensive performance evaluation. In transfer learning, a model is first trained on a large dataset from a related domain³⁵. The model is then fine-tuned on a smaller, domain-specific dataset. This approach leverages low-level and mid-level features learned from the source domain, thereby avoiding the need for random initialization and substantially reducing the risk of overfitting³⁶. A key challenge for CNNs is their reliance on large amounts of annotated training data. The number of parameters and the depth of the model directly influence the minimum dataset size required: networks with more layers require more data to avoid overfitting³⁷. In medical imaging, it is often impractical to obtain sufficiently large and diverse annotated datasets due to privacy regulations, the cost of expert annotation, and the rarity of diseases³⁸. Transfer learning mitigates these limitations by reusing feature representations learned from large public datasets, such as ImageNet, which makes it particularly valuable for medical applications³⁹.

This work selects four state-of-the-art pre-trained CNN architectures: ResNet50, VGG16, VGG19, and DenseNet121. Each model is initialized with ImageNet weights and then retrained on the APTOS 2019 DR dataset for specialization in binary retinal classification.

VGG16 and VGG19 models

In 2014, the Visual Geometry Group (VGG) at the University of Oxford introduced the VGG family of architectures, including variants such as VGG11, VGG13, VGG16, and VGG19 (see Fig. 4)⁴⁰. VGG16 and VGG19 are the most widely adopted versions, particularly in medical imaging applications for the recognition and classification of retinal pathology. VGG16 has 16 learnable layers: 13 convolutional layers grouped into five convolutional blocks, followed by three fully connected (dense) layers⁴¹. VGG19 follows the same layout with 16 convolutional layers. Despite using relatively small 3 × 3 convolutional kernels, VGG16 and VGG19 are computationally intensive and require substantial GPU memory⁴². However, their straightforward architecture and proven efficacy in DR detection make them excellent choices for ensemble learning.

DenseNet121 architecture

DenseNet121 is a densely connected convolutional network that addresses several key challenges in deep learning⁴⁴. It facilitates improved gradient flow through dense skip connections by design, enabling efficient backpropagation during training and reducing the vanishing gradient problem that typically occurs as network depth increases. Its core innovation is that each layer receives inputs from all preceding layers, promoting efficient feature reuse, minimizing feature redundancy, reducing parameter count, and improving computational efficiency⁴⁵. The vanishing gradient problem, whereby error signals decay as they propagate backwards through many layers, is mitigated in DenseNet121 through these dense skip connections, which create direct pathways for gradient flow⁴⁶. Unlike traditional sequential architectures, where information can be lost or diluted as the network deepens, DenseNet121’s dense connectivity pattern ensures that both low- and high-level features are learned and preserved together. This leads to robust representations and improved generalization on medical imaging tasks (Fig. 5)⁴⁷.

ResNet50 architecture

ResNet (Residual Network) is a deep learning architecture that addresses the vanishing gradient problem through skip connections, also known as residual connections or identity mappings⁴⁹. Unlike traditional sequential architectures, where gradients can decay when backpropagating through many layers, ResNet’s skip connections create direct pathways that allow gradients to flow unobstructed through the network. This design enables significantly deeper networks to be trained without performance degradation, making ResNet particularly effective for medical image analysis tasks⁵⁰. ResNet50, a member of the ResNet family, consists of 50 layers organized into five stages (an initial convolutional stem followed by four stages of residual blocks), with skip connections bypassing the stacked layers inside each block. Each residual block combines convolutional layers, batch normalization, and ReLU activations with an identity shortcut, which allows the network to learn residual functions rather than the desired mappings directly. This improves convergence speed, reduces overfitting, and enhances feature extraction from retinal fundus images (Fig. 6)⁵¹.

Hyperparameter selection and optimization strategy

To ensure fair and stable model comparison, a systematic hyperparameter selection strategy was adopted. Hyperparameters were selected based on empirical evaluations of convergence stability, generalization performance, and computational efficiency. Batch sizes of 64 and 128 were tested to balance gradient stability and memory constraints. Learning rates of 1 × 10^-3, 1 × 10^-4, and 1 × 10^-5 were evaluated using SGD, RMSprop, and Adam optimizers. The Adam optimizer with a learning rate of 1 × 10^-4 consistently demonstrated faster convergence, reduced testing loss oscillation, and superior testing accuracy across all CNN backbones. Smaller learning rates slowed convergence without measurable accuracy improvement, while larger learning rates caused unstable training behavior.

Table 2 Justification of hyperparameter choices used in the experiments.


The number of training epochs was fixed at 50 based on early stopping behavior observed during preliminary experiments, where testing performance saturated beyond this point. Data augmentation intensity and CLAHE parameters were empirically validated to enhance lesion visibility and mitigate overfitting. The selected hyperparameter configuration therefore represents an optimal trade-off between performance stability, generalization capability, and computational efficiency. The evaluated hyperparameter ranges and the selected configuration are summarized in Table 2.

Ensemble classifier method

Ensemble learning is a method of machine learning that uses the predictions of several separate models to make a better classification decision⁵³. Rather than relying on a single model, ensemble methods utilize the complementary strengths and diversity of the base learners to reduce variance, mitigate overfitting, and enhance generalization⁵⁴. In medical imaging tasks such as the detection of diabetic retinopathy, ensemble approaches have consistently outperformed individual models, particularly when the base learners have diverse architectures with different feature extraction capabilities⁵⁵.

In this study, we use four pre-trained CNN architectures (ResNet50, VGG16, VGG19, and DenseNet121) as the base learners. Each model is trained independently on the APTOS 2019 DR dataset, and their individual predictions are then combined using seven complementary ensemble fusion strategies (hard voting, soft voting, weighted soft voting, rank-based fusion, Choquet integral, Sugeno integral, and average logits fusion). These strategies operate at the decision level, aggregating output probabilities or class assignments from all base learners to produce a final ensemble-based DR classification. The ensemble method has a number of benefits: it lowers the chance that any one model’s biases or failure modes will affect the final prediction; it makes the model more robust to changes and noise in retinal photos; and it gives more reliable confidence scores for medical decision support. Figure 7 illustrates the core concept of ensemble learning as applied to our proposed RDE-DR system.

Ensemble fusion strategies

The RDE-DR framework employs seven complementary fusion strategies to combine predictions from four base CNN models: ResNet50, VGG16, VGG19, and DenseNet121. Each strategy operates at the decision level, aggregating output probabilities or confidence scores to produce a final binary DR classification.

1. Hard Voting with Threshold Optimization is a majority-rule ensemble method in which each base classifier casts a discrete class vote. The final prediction is assigned to whichever class receives the most votes⁵⁶.

$$\text{class}(I)=\arg\max_{k}\sum_{j=1}^{m}\mathbf{1}\left(\hat{y}_{j}=k\right)$$

(1)

where $\mathbf{1}(\hat{y}_{j}=k)$ denotes the indicator function, which equals 1 if classifier j predicts class k and 0 otherwise, and m is the total number of base classifiers (m = 4 in this study).

With Threshold Optimization:

For binary DR classification, a tunable decision threshold τ ∈ [0,1] is introduced, where τ represents the minimum ensemble probability required to classify an image as DR-positive:

$$\text{class}(I)=\begin{cases}1\ (\text{DR}) & \text{if } \dfrac{1}{m}\sum_{j=1}^{m}\hat{y}_{j}\ge\tau\\ 0\ (\text{No\_DR}) & \text{otherwise}\end{cases}$$

(2)

Threshold optimization involves adjusting $\tau$ to maximize a chosen metric (e.g., F1 score or balanced accuracy) on a testing set.
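
The two voting rules in Eqs. (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the experiments: the function names are hypothetical, and the tie-breaking rule in `hard_vote` (smallest class label wins) is an assumption, since the text does not specify one.

```python
def hard_vote(votes):
    """Majority class among discrete votes (Eq. 1).

    Assumption: ties are broken in favour of the smallest class label.
    """
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    return min(counts, key=lambda k: (-counts[k], k))


def hard_vote_threshold(votes, tau=0.5):
    """Binary rule of Eq. 2: predict DR (1) when the fraction of DR votes is >= tau."""
    return 1 if sum(votes) / len(votes) >= tau else 0


votes = [1, 0, 1, 1]  # hypothetical votes from the four base classifiers
print(hard_vote(votes))                     # 1
print(hard_vote_threshold(votes, tau=0.8))  # 0, because 3/4 < 0.8
```

With m = 4 classifiers the vote fraction can only take the values 0, 0.25, 0.5, 0.75, and 1, so in practice τ selects the minimum number of DR votes required for a positive decision.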

2. Soft Voting with Threshold Optimization.

In soft voting, also known as average probability voting, the predicted probability distributions from all base classifiers are combined by averaging their output probabilities⁵⁷.

$$P_{\text{ensemble}}(k)=\frac{1}{m}\sum_{j=1}^{m}P_{j}(k)$$

(3)

where $P_{j}(k)$ is the predicted probability of class k from classifier j, and the final class is:

$$\text{class}(I)=\arg\max_{k}P_{\text{ensemble}}(k)$$

(4)

With Threshold Optimization:

$$\text{class}(I)=\begin{cases}1\ (\text{DR}) & \text{if } P_{\text{ensemble}}(1)\ge\tau\\ 0\ (\text{No\_DR}) & \text{otherwise}\end{cases}$$

(5)

where τ is optimized on testing data to balance sensitivity and specificity.
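
A minimal sketch of soft voting as defined in Eqs. (3) and (5) is given below. The function names and the probability values are hypothetical and serve only to illustrate the computation.

```python
def soft_vote(probs):
    """Average the class-probability vectors of m classifiers (Eq. 3)."""
    m = len(probs)
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / m for k in range(n_classes)]


def classify(p_dr, tau=0.5):
    """Thresholded decision of Eq. 5 on the ensemble DR probability."""
    return 1 if p_dr >= tau else 0


# Hypothetical [P(No_DR), P(DR)] outputs of the four base classifiers.
probs = [[0.30, 0.70], [0.60, 0.40], [0.20, 0.80], [0.45, 0.55]]
p_ensemble = soft_vote(probs)
print(p_ensemble[1])                     # approximately 0.6125
print(classify(p_ensemble[1], tau=0.5))  # 1
print(classify(p_ensemble[1], tau=0.7))  # 0
```

In this example one classifier votes against DR, yet the averaged probability still exceeds 0.5, which shows how soft voting lets confident models outweigh an uncertain dissenting one.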

3. Weighted Soft Voting with Threshold Optimization.

Weighted soft voting builds on the concept of soft voting by assigning a weight $\omega_{j}\in[0,1]$ to each base classifier j, where $\omega_{j}$ reflects the classifier’s relative reliability or testing performance. The weights satisfy the normalization constraint $\sum_{j=1}^{m}\omega_{j}=1$⁵⁸.

$$P_{\text{weighted}}(k)=\frac{\sum_{j=1}^{m}\omega_{j}\,P_{j}(k)}{\sum_{j=1}^{m}\omega_{j}}$$

(6)

Weights can be assigned based on the accuracy, area under the curve (AUC), or F1-score of individual classifiers on testing data:

$$\omega_{j}=\frac{\text{Score}_{j}}{\sum_{i=1}^{m}\text{Score}_{i}}$$

(7)

Final Classification with Threshold:

$$\text{class}(I)=\begin{cases}1\ (\text{DR}) & \text{if } P_{\text{weighted}}(1)\ge\tau\\ 0\ (\text{No\_DR}) & \text{otherwise}\end{cases}$$

(8)
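
Eqs. (6) and (7) can be illustrated with the short sketch below. The function names are hypothetical, and the per-classifier scores are made-up values chosen for readability; they are not results from this study.

```python
def score_weights(scores):
    """Normalize per-classifier scores into weights that sum to 1 (Eq. 7)."""
    total = sum(scores)
    return [s / total for s in scores]


def weighted_soft_vote(p_dr, weights):
    """Weighted average of the DR probabilities (Eq. 6 for k = 1)."""
    return sum(w * p for w, p in zip(weights, p_dr)) / sum(weights)


scores = [0.9, 0.8, 0.8, 0.7]   # illustrative scores only, not measured values
p_dr = [0.70, 0.40, 0.80, 0.55]  # hypothetical DR probabilities
weights = score_weights(scores)
print(weights)                            # [0.28125, 0.25, 0.25, 0.21875]
print(weighted_soft_vote(p_dr, weights))  # approximately 0.6172
```

When the base classifiers score similarly, the weights from Eq. (7) stay close to 1/m and the result remains close to plain soft voting (about 0.6125 for the same probabilities).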

4. Rank-Based Fusion with Threshold Optimization.

A rank is assigned to each classifier’s output score⁵⁹. These ranks are then aggregated to produce the final decision.
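
The text does not detail how ranks are computed or aggregated, so the sketch below shows only one common variant, under explicit assumptions: each classifier ranks all evaluated images by its DR score, ranks are normalized to [0, 1] with tied scores sharing their average rank, and the fused score of an image is the mean of its normalized ranks. The function names are hypothetical.

```python
def normalized_ranks(scores):
    """Map scores to ranks in [0, 1]; tied scores share their average rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2  # average 0-based position of the tied group
        for t in range(i, j + 1):
            ranks[order[t]] = avg / (n - 1) if n > 1 else 0.5
        i = j + 1
    return ranks


def rank_fusion(score_matrix):
    """score_matrix[j][s] is the DR score of classifier j for image s.

    Assumption: the fused score is the mean normalized rank across classifiers.
    """
    per_model = [normalized_ranks(row) for row in score_matrix]
    m = len(per_model)
    n = len(score_matrix[0])
    return [sum(r[s] for r in per_model) / m for s in range(n)]
```

Because ranks depend only on the ordering of scores, this kind of fusion is insensitive to differences in how individual classifiers scale their output probabilities; the fused value can then be thresholded with τ like any other ensemble score.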

5. Choquet-like Integral with Threshold Optimization.

The Choquet integral is a fuzzy aggregation operator that accounts for interactions among classifiers through a fuzzy measure µ, where µ(S) ∈ [0,1] quantifies the importance of any subset S of classifiers. The measure satisfies the boundary conditions µ(∅) = 0 and µ(N) = 1, where N denotes the full set of classifiers⁶⁰.

Let $a_{1},a_{2},\dots,a_{n}$ denote the classifier confidence scores sorted in non-decreasing order. The Choquet integral with respect to the fuzzy measure µ is defined as:

$$C_{\mu}(a_{1},a_{2},\dots,a_{n})=\sum_{i=1}^{n}\left(a_{i}-a_{i-1}\right)\mu\left(A_{i}\right)$$

(9)

where $a_{0}=0$ and $A_{i}=\{i,i+1,\dots,n\}$ is the set of classifiers whose scores are greater than or equal to $a_{i}$.

For binary diabetic retinopathy (DR) classification with four classifiers, the predicted confidence scores are first sorted: $P_{(1)}\le P_{(2)}\le P_{(3)}\le P_{(4)}$

A fuzzy measure µ is then defined over all subsets of classifiers (typically learned from testing data or manually assigned). The Choquet integral is computed as:

$$C_{\mu}=\left(P_{(1)}-0\right)\mu\left(\{1,2,3,4\}\right)+\left(P_{(2)}-P_{(1)}\right)\mu\left(\{2,3,4\}\right)+\dots$$

(10)

The final decision is obtained using a threshold τ:

$$\text{class}(I)=\begin{cases}1\ (\text{DR}) & \text{if } C_{\mu}\ge\tau\\ 0\ (\text{No\_DR}) & \text{otherwise}\end{cases}$$

(11)
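
The following sketch evaluates Eq. (9) for an arbitrary fuzzy measure. Since the measure actually used is not specified here, the example relies on a hypothetical symmetric measure µ(S) = (|S|/n)^q that depends only on the subset size; both function names and the parameter q are assumptions made for illustration.

```python
def choquet(scores, mu):
    """Choquet integral of the scores with respect to a fuzzy measure (Eq. 9).

    scores: classifier confidences; mu: callable mapping a frozenset of
    classifier indices to a value in [0, 1].
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    total, prev = 0.0, 0.0
    for pos, idx in enumerate(order):
        upper = frozenset(order[pos:])  # classifiers scoring >= the current one
        total += (scores[idx] - prev) * mu(upper)
        prev = scores[idx]
    return total


def cardinality_measure(n, q=1.0):
    """Hypothetical symmetric measure mu(S) = (|S| / n) ** q."""
    return lambda subset: (len(subset) / n) ** q


p_dr = [0.70, 0.40, 0.80, 0.55]  # hypothetical DR confidences
print(choquet(p_dr, cardinality_measure(4, q=1.0)))  # approximately 0.6125
print(choquet(p_dr, cardinality_measure(4, q=2.0)))  # approximately 0.5281
```

With q = 1 the measure is additive and the integral reduces to the arithmetic mean of the scores. With q > 1 small coalitions receive less weight, so the result moves toward the lowest score and a DR decision requires broader agreement among classifiers.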

6. Sugeno Integral with Threshold Optimization.

Another fuzzy aggregation operator that combines classifier outputs is the Sugeno integral, which uses a Sugeno λ-fuzzy measure⁶¹.

Step 1: Initialize the Sugeno λ-fuzzy measure.

The Sugeno λ-fuzzy measure $\mu_{\lambda}$ is defined recursively, where λ ∈ (−1, ∞), λ ≠ 0, is an interaction parameter that controls the degree of redundancy (λ < 0, subadditive measure) or complementarity (λ > 0, superadditive measure) among classifiers. For a single classifier $i$, the fuzzy measure is defined as:

$$\mu_{\lambda}\left(\{i\}\right)=g_{i}$$

(12)

where $g_{i}$ ∈ [0,1] denotes the fuzzy density (importance weight) of classifier $i$. The values $g_{i}$ are normalized and typically estimated from testing performance metrics such as classification accuracy or AUC.

For multiple classifiers, the Sugeno λ-measure satisfies, for any two disjoint subsets A and B:

$$\mu_{\lambda}(A\cup B)=\mu_{\lambda}(A)+\mu_{\lambda}(B)+\lambda\cdot\mu_{\lambda}(A)\cdot\mu_{\lambda}(B)$$

(13)

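Step 2: Solve for $\lambda$

$$1+\lambda=\prod_{i=1}^{n}\left(1+\lambda\,g_{i}\right)$$

(14)

This condition follows from applying Eq. (13) repeatedly over all classifiers and imposing $\mu_{\lambda}(N)=1$. Solve this equation numerically to obtain $\lambda$.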

Step 3: Compute $\:\mu\:$ for All Subsets.

Using the recursive formula and the computed $\lambda$, calculate $\mu_{\lambda}(S)$ for all relevant subsets $S$.

Step 4: Compute Sugeno Integral.

Sort the classifier outputs in non-decreasing order: $a_{1}\le a_{2}\le\dots\le a_{n}$.

$$S_{\mu}(a_{1},\dots,a_{n})=\max_{i=1}^{n}\left[\min\left(a_{i},\mu_{\lambda}\left(A_{i}\right)\right)\right]$$

(15)

where $A_{i}=\{i,i+1,\dots,n\}$ is the set of classifiers with scores $\ge a_{i}$.

With Threshold Optimization:

$$\text{class}(I)=\begin{cases}1\ (\text{DR}) & \text{if } S_{\mu}\ge\tau\\ 0\ (\text{No\_DR}) & \text{otherwise}\end{cases}$$

(16)
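The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names (`solve_lambda`, `lambda_measure`, `sugeno_integral`), the bisection solver, the example densities, the example scores, and the placeholder threshold are all introduced here for illustration. The root is found from the standard normalization condition $1+\lambda=\prod_i(1+\lambda g_i)$, and subset measures use the closed form that results from applying Eq. (13) repeatedly.

```python
import numpy as np

def solve_lambda(g, iters=200):
    """Nonzero root (lam > -1) of prod(1 + lam*g_i) = 1 + lam, by bisection.

    Assumes every density satisfies 0 < g_i < 1.
    """
    g = np.asarray(g, dtype=float)
    total = g.sum()
    if abs(total - 1.0) < 1e-6:
        return 0.0  # densities are (numerically) additive: no interaction term
    f = lambda lam: np.prod(1.0 + lam * g) - (1.0 + lam)
    if total > 1.0:
        lo, hi = -1.0 + 1e-9, -1e-9   # root lies in (-1, 0)
    else:
        lo, hi = 1e-9, 1.0            # root lies in (0, inf)
        while f(hi) < 0:              # widen the bracket until f changes sign
            hi *= 2.0
    sign_lo = np.sign(f(lo))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.sign(f(mid)) == sign_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lambda_measure(g_subset, lam):
    """mu_lambda(S) in closed form (equivalent to repeated use of Eq. 13)."""
    g_subset = np.asarray(g_subset, dtype=float)
    if lam == 0.0:
        return float(g_subset.sum())
    return float((np.prod(1.0 + lam * g_subset) - 1.0) / lam)

def sugeno_integral(scores, g, lam):
    """Eq. (15): max over i of min(a_i, mu_lambda(A_i)) with scores sorted ascending."""
    scores = np.asarray(scores, dtype=float)
    g = np.asarray(g, dtype=float)
    order = np.argsort(scores)        # a_1 <= ... <= a_n
    a = scores[order]
    values = [min(a[i], lambda_measure(g[order[i:]], lam)) for i in range(len(a))]
    return float(max(values))

# Hypothetical densities and DR scores for four classifiers (illustrative only).
g = [0.2, 0.2, 0.2, 0.2]
lam = solve_lambda(g)
s = sugeno_integral([0.90, 0.80, 0.95, 0.70], g, lam)
tau = 0.5                             # placeholder threshold, not a tuned value
label = int(s >= tau)                 # 1 = DR, 0 = No_DR, as in Eq. (16)
```

Because the example densities sum to less than 1, the solver returns a positive $\lambda$. By construction the integral always lies between the smallest and the largest classifier score.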

7. Average Logits Fusion.

This algorithm averages the raw logits (i.e., unnormalized scores) of the classifiers before the softmax function is applied; the softmax of the averaged logits then produces the final probability distribution⁶².

$$L_{avg}=\frac{1}{m}\sum_{j=1}^{m}L_{j}$$

(17)

where $m$ is the number of classifiers and $L_{j}$ ∈ ℝ² denotes the logits vector, defined as the raw, unnormalized output scores produced by classifier $j$ prior to the softmax activation. The components of $L_{j}=[l_{j,0},l_{j,1}]$ correspond to the No_DR and DR classes, respectively.

Then apply softmax:

$$P_{\text{ensemble}}(k)=\frac{\exp\left(L_{avg,k}\right)}{\sum_{k'}\exp\left(L_{avg,k'}\right)}$$

(18)

Final Classification:

$$\text{class}(I)=\begin{cases}1\ (\text{DR}) & \text{if } P_{\text{ensemble}}(1)\ge\tau\\ 0\ (\text{No\_DR}) & \text{otherwise}\end{cases}$$

(19)
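Equations (17) to (19) can be illustrated with a short NumPy sketch. The function name `average_logits_fusion`, the example logits, and the default threshold are assumptions made for illustration; subtracting the maximum logit before exponentiation is a standard numerical-stability step that leaves the softmax output unchanged.

```python
import numpy as np

def average_logits_fusion(logits, tau=0.5):
    """Average per-classifier logits, apply softmax, then threshold P(DR).

    logits: array of shape (m, 2), one [No_DR, DR] logit pair per classifier.
    Returns (probabilities, label) with label 1 = DR and 0 = No_DR.
    """
    logits = np.asarray(logits, dtype=float)
    l_avg = logits.mean(axis=0)            # Eq. (17)
    z = l_avg - l_avg.max()                # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()    # Eq. (18)
    return probs, int(probs[1] >= tau)     # Eq. (19)

# Made-up logits from two classifiers for a single image.
probs, label = average_logits_fusion([[2.0, -1.0], [1.0, 0.0]])
```

Here the averaged logits are [1.5, −0.5], so the No_DR class receives most of the probability mass and the image is labeled 0 at the placeholder threshold of 0.5.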

Threshold optimization

Threshold optimization is a post-hoc calibration step that is applied after ensemble predictions have been computed⁶³. Rather than using the default decision boundary of 0.5, a custom threshold (τ) is optimized on the testing set to maximize a specific performance metric.

Threshold Optimization Procedure⁶⁴:

1. Generate predictions on testing data using each fusion strategy (output: probability scores for class 1, i.e., DR).
2. Sweep threshold values $\tau\in[0,1]$ in small increments (e.g., 0.01).
3. For each threshold, compute performance metrics: accuracy, sensitivity, specificity, F1-score, and balanced accuracy, where

$$\text{Balanced Accuracy}=\frac{\text{Sensitivity}+\text{Specificity}}{2}$$

$$\tau^{*}=\arg\max_{\tau\in[0,1]}\text{F1}(\tau)$$

(20)

where $\tau^{*}$ denotes the threshold that maximizes the F1-score on the testing set.

Apply to test data using the optimal threshold:

$$\text{class}(I)=\begin{cases}1\ (\text{DR}) & \text{if } P_{\text{ensemble}}(I)\ge\tau^{*}\\ 0\ (\text{No\_DR}) & \text{otherwise}\end{cases}$$

(21)
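The sweep in Eq. (20) can be sketched as a simple grid search. The helper names (`f1_at_threshold`, `optimize_threshold`), the tie-breaking rule (the lowest threshold among equal F1 values), and the toy labels and scores are assumptions for illustration; the study does not state how ties are resolved.

```python
import numpy as np

def f1_at_threshold(y_true, scores, tau):
    """F1-score of the rule 'predict DR when score >= tau'."""
    pred = scores >= tau
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def optimize_threshold(y_true, scores, step=0.01):
    """Return (tau_star, best_f1) over the grid 0, step, 2*step, ..., 1."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    taus = np.round(np.arange(0.0, 1.0 + step / 2, step), 10)
    f1s = np.array([f1_at_threshold(y_true, scores, t) for t in taus])
    best = int(np.argmax(f1s))   # argmax keeps the first (lowest) tau on ties
    return float(taus[best]), float(f1s[best])

# Toy example: two No_DR and two DR images with made-up ensemble scores.
tau_star, best_f1 = optimize_threshold([0, 0, 1, 1], [0.1, 0.3, 0.6, 0.9])
pred = (np.array([0.1, 0.3, 0.6, 0.9]) >= tau_star).astype(int)  # Eq. (21)
```

In this toy case any threshold above 0.3 and up to 0.6 separates the classes perfectly, so the sweep returns a value in that interval with an F1-score of 1.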

Overfitting prevention

Several strategies were employed to mitigate overfitting and improve generalization. All CNN models were initialized using ImageNet pre-trained weights and retrained on the target dataset, reducing reliance on limited training samples. Data augmentation was applied online during training, including random rotations, flips, scaling, and brightness variations, to increase data diversity. Dropout layers and weight decay regularization were used to prevent feature co-adaptation. Batch normalization stabilized gradient propagation and improved convergence. Early stopping based on testing loss was applied to avoid over-training, and the best-performing model weights were retained using checkpointing. Finally, ensemble fusion aggregated predictions from multiple independently trained models, reducing variance and improving robustness.
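The early stopping and checkpointing logic described above can be sketched independently of any deep learning framework. The class name `EarlyStopping`, the `patience` and `min_delta` parameters, and the loss values below are hypothetical and are not taken from the study's configuration; in a real training loop the model weights would be saved at the point marked in the comment.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.wait = 0

    def step(self, epoch, loss):
        """Record one epoch; return True when training should stop."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.best_epoch = epoch
            self.wait = 0
            # A checkpoint of the current weights would be written here.
        else:
            self.wait += 1
        return self.wait >= self.patience

# Made-up per-epoch losses: improvement stalls after epoch 2.
losses = [0.90, 0.70, 0.60, 0.65, 0.66, 0.64]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
```

With a patience of three epochs, training halts at epoch 5 and the weights from epoch 2, the epoch with the lowest loss, are the ones retained.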

Results

This section provides a thorough evaluation of the RDE-DR framework for binary diabetic retinopathy classification. The performance of individual base learners (ResNet50, VGG16, VGG19, and DenseNet121) is evaluated, as is that of all seven ensemble fusion strategies (hard voting, soft voting, weighted soft voting, rank-based fusion, Choquet integral, Sugeno integral, and average logits fusion), each with threshold optimization. The results, which are reported on the held-out test set, use relevant metrics: accuracy, precision, recall (sensitivity), specificity, F1-score, and area under the receiver operating characteristic curve (ROC-AUC). Additionally, we provide confusion matrices and comparative analyses with state-of-the-art methods from the literature to contextualize the performance of RDE-DR.

The APTOS 2019 dataset was divided into training (2,929 images) and test (733 images) sets using an 80:20 split, with balanced class distributions in both subsets (see Fig. 3). All results were computed on the test set, ensuring no information leaked from the training data.
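A class-balanced 80:20 partition of this kind is commonly produced with a stratified split. The sketch below is one way to do it and is not the study's code: the function name `stratified_split`, the per-class rounding rule, and the seed are assumptions, so it is not guaranteed to reproduce the exact 2,929/733 partition.

```python
import numpy as np

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling test_fraction of each class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    test_idx = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # indices of this class
        rng.shuffle(idx)                      # in-place random permutation
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
    test_idx = np.sort(np.array(test_idx))
    train_idx = np.setdiff1d(np.arange(len(labels)), test_idx)
    return train_idx, test_idx

# Synthetic binary labels (0 = No_DR, 1 = DR), 50 images per class.
labels = np.array([0] * 50 + [1] * 50)
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
```

Sampling within each class keeps the class ratio of the test subset equal to that of the full dataset, and the disjoint index sets keep test images out of training.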

Experimental protocol

The experimental analysis of diabetic retinopathy classification was conducted on a local workstation dedicated to high-performance deep learning tasks and configured to handle extensive image processing and training workloads efficiently. Table 3 reports the hardware configuration of this workstation.

The experiments were implemented in Python 3.7 using the Jupyter Notebook interface on the local machine. Keeping code development, model training, hyperparameter tuning, evaluation, and visualization in a single environment supported an efficient, reproducible deep learning workflow.

Evaluation metrics

Selecting suitable evaluation metrics is essential for reliably assessing model performance in the classification of diabetic retinopathy. Metrics such as accuracy, recall, precision, F1_score, and specificity are frequently employed, as each provides a distinct viewpoint regarding various aspects of predictive capability. The mathematical definitions of these metrics are provided in Eqs. (22–26), which serve as the foundation for the performance analysis conducted in this study.

Table 3 Hardware configuration of the local workstation used for diabetic retinopathy classification experiments.


Accuracy: the ratio of correctly predicted instances to the total number of cases⁶⁵. Here TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

$$\text{Accuracy (Acc)}=\frac{TP+TN}{TP+TN+FP+FN}\times 100\%$$

(22)

Recall: also referred to as sensitivity, quantifies the classifier’s ability to correctly identify all actual positive cases within the dataset. It represents the proportion of true positive instances that are accurately detected by the model⁵⁰.

$$\text{Recall (Sensitivity)}=\frac{TP}{TP+FN}\times 100\%$$

(23)

True Negative Rate (TNR): also referred to as specificity, quantifies the classifier's ability to correctly identify negative cases. It represents the proportion of actual negative instances that are correctly classified as negative⁶⁶.

$$\text{TNR (Specificity)}=\frac{TN}{TN+FP}\times 100\%$$

(24)

Precision: Precision quantifies the classifier’s ability to correctly identify only the relevant positive instances. It represents the proportion of predicted positive cases that are actually true positives⁶⁷.

$$\text{Precision (Pre)}=\frac{TP}{TP+FP}\times 100\%$$

(25)

F1_score: is a metric that quantifies the balance between precision and recall by calculating their harmonic mean. This score considers both false positives and false negatives, providing a single value that reflects the trade-off between correctly identified positive cases and errors made by the classifier⁶⁸.

$$\text{F1}_{\text{score}}=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}\times 100\%$$

(26)
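Equations (22) to (26) can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name `classification_metrics` is an assumption. In the example call, TP = 369 and FN = 3 follow from the three false negatives among 372 DR cases mentioned in the text, whereas TN = 354 and FP = 7 are placeholder values chosen only so that the counts sum to 733; they are not taken from Table 5.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, recall, specificity, precision and F1, all in percent."""
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)   # Eq. (22)
    recall = 100.0 * tp / (tp + fn)                      # Eq. (23)
    specificity = 100.0 * tn / (tn + fp)                 # Eq. (24)
    precision = 100.0 * tp / (tp + fp)                   # Eq. (25)
    # Harmonic mean of two percentages is already a percentage, Eq. (26).
    f1 = 2.0 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# tp and fn from the text; tn and fp are illustrative placeholders.
metrics = classification_metrics(tp=369, tn=354, fp=7, fn=3)
```

With these counts the recall is 369/372, which rounds to 99.19%.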

All training, optimization, and data augmentation hyperparameters used in this study are summarized in Table 4.

Table 4 Summary of hyperparameters used in experiments.


Results of individual classification models

Table 5 presents the performance of the four retrained CNN architectures (VGG16, VGG19, DenseNet121, and ResNet50) evaluated on the held-out APTOS 2019 test set (733 images) using the training configuration detailed in Table 4. All models achieved high and tightly clustered performance, with accuracies ranging from 97.95% to 98.64% and AUC values above 99.48%. This narrow variation suggests that performance gains are primarily driven by transfer learning, CLAHE preprocessing, and systematic hyperparameter optimization rather than architectural differences.

VGG16 achieved the highest accuracy (98.64%) with a recall of 99.19% and only three false negatives, indicating strong sensitivity for DR screening. VGG19 produced comparable accuracy (98.36%) and one of the highest AUC values (99.83%), reflecting stable discrimination across thresholds. DenseNet121 and ResNet50 demonstrated similarly balanced behavior, with accuracies of 98.36% and 98.23% and AUC values of 99.59% and 99.63%, respectively, supported by low misclassification rates in their confusion matrices. Overall, the four architectures exhibit consistent and robust performance, establishing a stable baseline for subsequent ensemble fusion analysis.

Table 5 Performance metrics of classification models on the APTOS 2019 dataset including hyper-parameter settings and confusion matrices.


All four models exhibit consistently strong performance, with VGG16 and VGG19 achieving the highest recall—an essential criterion in large-scale screening applications to minimize missed DR cases. The uniformly low false negative rates (3–4 misclassified DR cases out of 372) highlight the reliability of the retrained models for automated detection tasks. The close agreement across architectures indicates that transfer learning, CLAHE preprocessing, and optimized hyperparameters contribute more substantially to performance than architectural differences alone.

Figures 8 and 9 further confirm these findings. The training and validation curves demonstrate stable convergence with minimal divergence, suggesting effective overfitting control. The confusion matrices and ROC curves corroborate the high discriminative capacity of all models, with near-perfect class separation on the test set.

Ensemble fusion results and comprehensive analysis

Table 6 summarizes the performance of the seven ensemble fusion strategies evaluated on the APTOS 2019 test set (733 images). Overall, the ensembles demonstrate consistently high and stable performance across all evaluation metrics. The nearly identical results obtained with hard, soft, and weighted soft voting indicate strong methodological robustness. This convergence across different aggregation mechanisms suggests that performance improvements arise primarily from complementary feature representations learned by the base models rather than from fusion-specific optimization effects.

Table 6 Performance metrics of ensemble fusion methods on APTOS 2019 test set.


ROC–AUC analysis further confirms the discriminative strength of the proposed approach. While individual CNNs already achieved high AUC values (99.59%–99.83%), ensemble strategies maintained or slightly enhanced this performance. The highest AUC (99.80%) was obtained with rank-based and Choquet-like fusion methods, indicating excellent threshold-independent separability. In contrast, the Sugeno integral method yielded comparatively lower performance (AUC = 98.13%), suggesting reduced aggregation effectiveness under this formulation.

Generally, ensemble fusion preserves the strong baseline discrimination of individual models while providing marginal but consistent stability improvements, supporting its suitability for robust automated DR detection.
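The threshold-independent nature of ROC-AUC can be made concrete with its rank interpretation: the AUC equals the probability that a randomly chosen DR image receives a higher score than a randomly chosen No_DR image, with ties counted as one half. The sketch below uses this pairwise definition; the function name `roc_auc` and the toy scores are assumptions, and the evaluation code used in the study is not specified.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the fraction of (DR, No_DR) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]      # all positive-negative score gaps
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

y = [0, 0, 1, 1]
s = np.array([0.1, 0.4, 0.35, 0.8])         # made-up scores
auc = roc_auc(y, s)                         # 3 of 4 pairs are ordered correctly
auc_compressed = roc_auc(y, 0.2 * s)        # same ranking, narrower score range
```

Because only the ordering of scores matters, compressing the scores into a narrow range leaves the AUC unchanged. A fusion method can therefore show a compressed probability range and still retain a high AUC.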

Figure 10 presents the ROC curves of representative individual CNN models and the best-performing ensemble configuration. The ensemble curve closely overlaps with—and slightly dominates—the strongest individual models across most operating regions, confirming stable threshold-independent discrimination. The best ensemble achieves a ROC–AUC of 99.80%, indicating excellent class separability and consistent calibration behavior. In contrast, the Sugeno integral method shows visible degradation, aligning with its lower AUC value reported in Table 6.

Altogether, the ROC analysis corroborates that ensemble fusion preserves the high discriminative capacity of the base models while providing improved stability across decision thresholds.

Figure 11 presents kernel density estimation (KDE) plots of the predicted probability distributions for each ensemble fusion method applied to the test set. Each subplot shows the No_DR class in blue on the left and the DR class in orange on the right, with the optimal decision threshold, determined through testing set optimization, marked by a black dashed line. Hard voting, soft voting, and weighted soft voting demonstrate exceptional class separation with thresholds at 0.61; rank-based fusion employs a threshold of 0.66; and average logits fusion uses 0.52. The fuzzy integral methods (Choquet-like: τ = 0.20; Sugeno: τ = 0.25) demonstrate compressed probability ranges. The wide decision margins (> 0.4 probability units) of the top-performing methods indicate robust calibration and flexibility for tuning the threshold, whereas the narrow margin of the Sugeno integral (overlap region) explains its suboptimal precision and accuracy.

The seven KDE distributions reveal three distinct calibration patterns:

Excellent separation (hard voting, soft voting, weighted soft voting, rank-based fusion, and average logits fusion): The distributions are clearly bimodal, with No_DR concentrated near 0.0 and DR concentrated near 1.0, creating a decision margin of 0.4 to 0.6 probability units. This wide gap enables flexible threshold placement without performance degradation.

Moderate separation (Choquet-like integral): A compressed probability range of approximately 0.2 units with adequate class separation but limited threshold flexibility. Despite the compressed probabilities, the method maintains a strong AUC of 99.80%, indicating that its discriminative power is preserved.

Severe misalignment (Sugeno integral): An extremely narrow probability range (less than 0.20 units) with near-overlapping class distributions. The No_DR class spans from −0.05 to 0.15 with moderate dispersion. In contrast, the DR class forms a narrow spike at 0.25. This creates ambiguous decision regions, which explains the precision degradation of 2.79 percentage points (95.61% vs. 98.40% for the best-performing methods).

Figure 12 demonstrates that voting-based ensemble methods (hard voting, soft voting, and weighted soft voting) achieve consistent top-tier performance across all critical metrics (see Table 6 for exact values). The near-identical results across these three fusion strategies with metric variance below 0.3% reflect methodological stability rather than over-optimization, confirming that diverse aggregation mechanisms converge to reliable outcomes. Alternative fusion approaches (Rank-Based and Average Logits Fusion) preserve this robustness while achieving marginally superior probability calibration, further validating that ensemble fusion enhances diagnostic reliability across multiple evaluation dimensions critical for clinical deployment.

Figure 13 shows the heatmap, which uses color intensity to visualize multi-metric performance:

Voting methods (hard, soft, and weighted soft voting): All cells display bright yellow/light green (0.984–0.992), indicating excellent performance across all metrics without any trade-offs. These methods achieve near-uniform metric values (variance of less than 0.3%).

Rank-based and average logits fusion: Predominantly light/medium green (0.981–0.989) with slight color variation indicates excellent, albeit slightly heterogeneous, performance (metric variance of less than 0.4%).

Choquet-like integral: Mixed green shades (0.979–0.987), with darker accuracy/F1 cells and a brighter AUC. This demonstrates an accuracy-calibration trade-off (with metric variance of less than 0.6%).

Sugeno integral: Highly heterogeneous, with the darkest precision cell (0.956, dark blue) and the brightest recall cell (0.995, yellow). This exhibits the largest metric range (3.9 percentage points) and represents a critical precision–recall imbalance (2.8% metric variance).

Ablation study

To quantify the contribution of each major component in the proposed RDE-DR framework, an ablation analysis was conducted using controlled experimental comparisons. The evaluated components include (i) ensemble fusion versus individual CNNs, (ii) impact of fusion strategy selection, and (iii) threshold optimization. All experiments were evaluated on the same test split under identical training conditions (Table 7).

First, comparison between individual CNNs and ensemble fusion demonstrates that ensemble aggregation consistently improves metric stability and reduces performance variance across models. While individual models achieved accuracies between 97.95% and 98.64%, ensemble methods maintained comparable or improved accuracy with enhanced calibration behavior (AUC up to 99.80%).

Second, comparison across fusion strategies reveals that voting-based and rank-based methods yield stable and balanced performance, whereas fuzzy-integral approaches exhibit larger variability, particularly in precision. This confirms that fusion design significantly influences reliability rather than accuracy alone.

Third, threshold optimization improves operational flexibility and calibration by enabling adjustment of accuracy–stability trade-offs beyond the default 0.5 threshold. Probability density analysis further illustrates improved decision margins for optimized ensembles.

Although a complete ablation removing each preprocessing component independently was not conducted due to computational constraints, these results provide quantitative insight into the relative contribution of ensemble aggregation and calibration strategies.

Table 7 Partial ablation analysis of the RDE-DR framework on the APTOS 2019 dataset.


Table 7 compares the best-performing individual CNN models with all evaluated ensemble fusion strategies under different preprocessing and evaluation conditions. The results highlight the positive effect of ensemble aggregation and fusion design on classification accuracy, calibration (AUC), and metric stability. All ensemble configurations employ CLAHE preprocessing. The performance degradation observed for the Sugeno integral-based fusion illustrates the sensitivity of ensemble reliability to the choice of fusion mechanism.

The Sugeno integral outcome can be explained by its theoretical characteristics: it is a non-compensatory aggregator designed primarily for ordinal or qualitative inputs, relying on min/max operations rather than summation or averaging. Consequently, it discards fine-grained probabilistic information from CNN outputs, is highly sensitive to outliers, and requires carefully tuned fuzzy measures to capture classifier interactions. In contrast, averaging-based methods and Choquet-like integrals preserve continuous confidence scores, allow partial compensation among classifiers, and are more robust to noise, making them better suited for probabilistic outputs in multiclass or binary deep learning ensembles.

Comparison with SOTA methods

To position the proposed RDE-DR framework within the context of recent advances in automated diabetic retinopathy detection, Table 8 presents a comparative overview of state-of-the-art methods reported between 2020 and 2025.

Table 8 Comparison between recent state-of-the-art diabetic retinopathy detection methods and the proposed RDE-DR framework on benchmark datasets.


Table 8 shows that recent state-of-the-art DR detection methods achieved strong performance on benchmark datasets, particularly on APTOS 2019, with reported accuracies generally ranging between 93% and 98.9%. Most previous works relied on transfer learning, attention mechanisms, data balancing techniques, or simple ensemble strategies to enhance performance, often excelling in specific metrics such as accuracy or precision.

In comparison, the proposed RDE-DR framework demonstrates competitive and balanced performance across all evaluation metrics. By integrating CLAHE preprocessing, transfer learning, and seven complementary ensemble fusion strategies (including hard/soft voting, weighted soft voting, rank-based fusion, and the Choquet and Sugeno fuzzy integrals), RDE-DR achieves 98.64% accuracy, 98.92% recall, 98.66% F1-score, and an AUC of 99.78% on APTOS 2019. Notably, the very high AUC and recall indicate strong discriminative ability and reliable detection of positive DR cases, which are clinically critical. Overall, while previous studies achieved strong results using specific architectural or preprocessing enhancements, the proposed framework provides a more comprehensive and robust fusion strategy that ensures consistently high performance across multiple evaluation criteria.

Conclusions

This study presented RDE-DR, a structured framework for systematically evaluating ensemble fusion strategies in automated diabetic retinopathy detection. Rather than emphasizing a single best-performing configuration, the framework enables controlled comparison of seven heterogeneous fusion mechanisms under identical preprocessing, training, and calibration conditions. The results demonstrate that voting-based, rank-based, and average-logit fusion strategies converge to similarly high and stable performance on the APTOS 2019 dataset, while fuzzy-integral approaches exhibit distinct calibration behaviors and trade-offs between precision and accuracy.

A key contribution of this work lies in the integration of ensemble fusion to aggregate the complementary strengths of CNNs, threshold optimization to calibrate decision boundaries, and the combination of these with CLAHE preprocessing and transfer learning. The observed consistency within this proposed pipeline provides insight into its operational flexibility and reliability, beyond conventional accuracy metrics, within the limits of the evaluated dataset.

While the experimental results indicate strong performance on a public benchmark, the conclusions are restricted to retrospective evaluation on a single dataset. Future work will focus on cross-dataset validation with strict patient-level split, multi-class severity grading, explainability integration, and prospective clinical evaluation to further assess generalizability and clinical applicability.

Data availability

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author. The retinal images used in this work are from the publicly available APTOS 2019 Blindness Detection dataset (Asia Pacific Tele-Ophthalmology Society), accessible via Kaggle at: https://www.kaggle.com/competitions/aptos2019-blindness-detection/data.

References

Sivaprasad, S. et al. Diabetic retinal disease. Nat. Reviews Disease Primers. 11 (1), 62 (2025).
Madhu, S. et al. Accurate diabetic retinopathy segmentation and classification model using gated recurrent unit with residual attention network. Biomed. Signal Process. Control. 102, 107348 (2025).
Zhang, Z., Deng, C. & Paulus, Y. M. J. B. Advances in structural and functional retinal imaging and biomarkers for early detection of diabetic retinopathy. Biomedicines 12 (7), 1405 (2024).
Grzybowski, A. et al. Retina fundus photograph-based artificial intelligence algorithms in medicine: a systematic review. Ophthalmol. therapy. 13 (8), 2125–2149 (2024).
Birjais, R. Challenges and future directions for segmentation of medical images using deep learning models. In Deep learning applications in medical image segmentation: overview, approaches, and challenges, pp. 243–264. (2025).
Prethija, G. & Katiravan, J. J. P. Delving into transfer learning within U-Net for refined retinal vessel segmentation: An extensive hyperparameter analysis. Photodiagn. Photodyn. Therapy 104620. (2025).
Li, M. et al. Medical image analysis using deep learning algorithms. Front. Public. Health. 11, 1273253 (2023).
Tiwari, S. & Shukla, A. Ensemble Deep Learning for DR Identification: Integrating DenseNet121 and VGG19 Architectures. In. 1st International Conference on Innovative Engineering Sciences and Technological Research (ICIESTR). 2024. IEEE. (2024).
Spooner, A. et al. Benchmarking ensemble machine learning algorithms for multi-class, multi-omics data integration in clinical outcome prediction. Brief. Bioinform. 26 (2), 116 (2025).
Karczmarek, P. et al. Choquet integral-based aggregation for the analysis of anomalies occurrence in sustainable transportation systems. IEEE Trans. Fuzzy Syst. 31 (2), 536–546 (2022).
Vallukappully, S., van der Linde, I. & Chakraborty, A. J. I. Early detection and classification of diabetic retinopathy by transfer learning of NASNet-large and ResNet-50 convolutional neural networks. Informatics Med. Unlocked : 101688. (2025).
Arora, L. et al. Ensemble deep learning and EfficientNet for accurate diagnosis of diabetic retinopathy. Sci. Rep. 14 (1), 30554 (2024).
Ejaz, M. S. & Innovation, S. A comprehensive study on the automatic identification of diabetic retinopathy. Int. J. Res. Sci. Innov. 12 (5), 1467–1475 (2025).
Sanamdikar, S. T. et al. Enhanced Detection of Diabetic Retinopathy Using Ensemble Machine Learning: A Comparative Study. Ingenierie des. Systemes d’Information. 28 (6), 1663 (2023).
Ghosh, S. & Chatterjee, A. Transfer-ensemble learning based deep convolutional neural networks for diabetic retinopathy classification. In 2023 3rd International Conference on Advancement in Electronics & Communication Engineering (AECE). IEEE. (2023).
Mishra, A., Pandey, M. & Singh, L. J. I. VEnDR-Net: voting ensemble classifier for automated diabetic retinopathy detection. Informatica 49(32) (2025).
Rafid Islam, M. et al. Balancing accuracy and efficiency: CNN fusion models for diabetic retinopathy screening. arXiv: 2512.21861. (2025).
Lin, J. J. Selective diabetic retinopathy screening with accuracy-weighted deep ensembles and entropy-guided abstention. (2025).
Barwal, A. et al. Diabetic retinopathy detection using deep learning ensemble with soft voting. In International Conference on Data Analytics & Management. Springer. (2024).
Agarwal, D. K. & Nehra, M. S. Optimized detection of diabetic retinopathy through image preprocessing and ensemble models. 28(4), 387–410 (2025).
Safarpour, H. et al. Explainable deep learning framework for brain tumor segmentation using vision transformer and conditional random fields. Multimedia Syst. 32 (1), 19 (2026).
Ranjbarzadeh, R. et al. A Hybrid UNet and Vision Transformer Architecture with Multi-scale Fusion for Brain Tumor Segmentation. In International Conference on Medical Imaging and Computer-Aided Diagnosis. Springer. (2024).
Yanar, E. K. et al. A comparative analysis of the mamba, transformer, and CNN architectures for multi-label chest X-ray anomaly detection in the NIH ChestX-Ray14 dataset. Diagnostics 15 (17), 2215 (2025).
Yanar, E. & Ayturan, H. F. CELM: an ensemble deep learning model for early cardiomegaly diagnosis in chest radiography. Diagnostics 15 (13), 1602 (2025).
Yanar, E., Hardalaç, F. & Ayturan, K. PELM: a deep learning model for early detection of pneumonia in chest radiography. Appl. Sci. 15 (12), 6487 (2025).
Ranjbarzadeh, R., Crane, M. & Bendechache, M. J. The impact of backbone selection in Yolov8 Models on brain tumor localization. Iran J. Comput. Sci. 1–23 (2025).
Kobat, S. G. et al. Automated diabetic retinopathy detection using horizontal and vertical patch division-based pre-trained DenseNET with digital fundus images. 12(8), 1975 (2022).
APTOS 2019 Blindness Detection. Available from: https://www.kaggle.com/c/aptos2019-blindness-detection/overview/evaluation.
Saleem, M. A. et al. Enhancing stroke risk prediction through class balancing and data augmentation with CBDA-ResNet50. Sci. Rep. 15 (1), 24553 (2025).
Mohammed, I. M. & Isa, N. A. M. Contrast limited adaptive local histogram equalization method for poor contrast image enhancement. IEEE Access (2025).
Okazaki, S. et al. RadImageNet and ImageNet as Datasets for Transfer Learning in the Assessment of Dental Radiographs: A Comparative Study. J. imaging Inf. Med. 38 (1), 534–544 (2025).
Conquer, V. et al. Comprehensive Review of Open-Source Fundus Image Databases for Diabetic Retinopathy Diagnosis. Sensors 25 (18), 5658 (2025).
Yoshimi, Y. et al. Image preprocessing with contrast-limited adaptive histogram equalization improves the segmentation performance of deep learning for the articular disk of the temporomandibular joint on magnetic resonance images. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 138 (1), 128–141 (2024).
Salehi, A. W. et al. A study of CNN and transfer learning in medical imaging: Advantages, challenges, future scope. Sustainability 15 (7), 5930 (2023).
Chaudhary, G. et al. Transfer learning in building dynamics prediction. Energy Build. 330, 115384 (2025).
Ghoneim, O., Dobias, P. & Romain, O. J. Survey of neural network optimization methods for sustainable AI: From data preprocessing to hardware acceleration. Mach. Learn. Appl. 100762 (2025).
Dawson, H. L. et al. Impact of dataset size and convolutional neural network architecture on transfer learning for carbonate rock classification. Comput. Geosci. 171, 105284 (2023).
Alabduljabbar, A. et al. Medical imaging datasets, preparation, and availability for artificial intelligence in medical imaging. J. Alzheimer’s Disease Rep. 8 (1), 1471–1483 (2024).
Dharmik, A. J. J. M. COVID-19 Pneumonia Diagnosis Using Medical Images: Deep Learning–Based Transfer Learning Approach. Comput. Methods Programs Biomed. 6, e75015 (2025).
Karacı, A. J. VGGCOV19-NET: automatic detection of COVID-19 cases from X-ray images using modified VGG19 CNN architecture and YOLO algorithm. Neural Comput. Appl. 34 (10), 8253–8274 (2022).
Rani, R. et al. VGG-EffAttnNet: Hybrid Deep Learning Model for Automated Chili Plant Disease Classification Using VGG16 and EfficientNetB0 With Attention Mechanism. Food Sci. Nutr. 13 (7), e70653 (2025).
Anand Kumar, P. & Sountharrajan, S. J. Insurance claims estimation and fraud detection with optimized deep learning techniques. Sci. Rep. 15 (1), 27296 (2025).
Goutam, B. et al. A comprehensive review of deep learning strategies in retinal disease diagnosis using fundus images. IEEE Access. 10, 57796–57823 (2022).
Zhang, Y., Ning, C. & Yang, W. J. An automatic cervical cell classification model based on improved DenseNet121. Sci. Rep. 15 (1), 3240 (2025).
Sadiq, S. S. Improving CBIR techniques with deep learning approach: An ensemble method using NASNetMobile, DenseNet121, and VGG12. J. Rob. Control (JRC). 5 (3), 863–874 (2024).
Srinivasan, D. & Kalaiarasan, C. J. Gradient Propagation Based DenseNet121 with ResNet50 Feature Extraction for Lymphoma Classification. J. Institution Eng. (India): Ser. B. 106 (4), 1183–1195 (2025).
Zhou, T. et al. Dense convolutional network and its application in medical image analysis. BioMed Res. Int. 1, 2384830 (2022).
Deepak, V. & Sarath, R. J. Cascaded regression with dual CNN frame work for time effective detection of gliomas cancers. Intelligence-Based Med. 10, 100168 (2024).
Borawar, L. & Kaur, R. ResNet: Solving vanishing gradient in deep networks. In Proceedings of International Conference on Recent Trends in Computing: ICRTC 2022. Springer. (2023).
Moustari, A. M. et al. Two-stage deep learning classification for diabetic retinopathy using gradient weighted class activation mapping. 65(3), 1284–1299 (2024).
Oladimeji, O. O. A.O.J. Brain tumor classification using ResNet50-convolutional block attention module. Appl. Comput. Inf. (2023).
Sriram Ganesh, G. et al. Detecting Monkeypox skin lesions with deep learning: A promising approach for early diagnosis. In International Conference on Computers, Management & Mathematical Sciences. Springer. (2023).
Liu, Z. L. Ensemble learning. In Artificial Intelligence for Engineers: Basics and Implementations. Springer. 221–242 (2025).
Fan, Z. et al. Diverse models, united goal: A comprehensive survey of ensemble learning. CAAI Trans. Intell. Technol. (2025).
Khan, U. S. Boost diagnostic performance in retinal disease classification utilizing deep ensemble classifiers based on OCT. Multimedia Tools Appl. 84 (19), 21227–21247 (2025).
Khafaga, D. S. et al. Voting classifier and metaheuristic optimization for network intrusion detection. Comput. Mater. Contin. 74 (2) (2023).
Jabbar, H. G. J. Advanced threat detection using soft and hard voting techniques in ensemble learning. J. Rob. Control (JRC). 5 (4), 1104–1116 (2024).
Zahrouri, A., Mazouzi, S. & Benaboud, R. J. I. TSO-optimized weighted soft voting ensemble of pretrained CNNs for MRI-based brain tumor classification. Informatica 49 (6) (2025).
Asif, S. et al. BREAST-RANKNet: a fuzzy rank-based ensemble of CNNs with residual learning for enhanced breast cancer detection from ultrasound and mammogram images. J. Big Data. 12 (1), 194 (2025).
Zhang, X. et al. New classifier ensemble and fuzzy community detection methods using POP Choquet-like integrals. Fractal Fract. 7 (8), 588 (2023).
Asif, S. et al. SFI-ensemble: Sugeno fuzzy integral-based ensemble of CNN models with meta-heuristic fuzzy measures for mouth and oral disease detection. Artif. Intell. Rev. 58 (11), 353 (2025).
Zhou, Q. et al. Democratizing AI through model fusion: A comprehensive review and future directions. Nexus (2025).
You, H. et al. MSTNet: A prostate imaging diagnosis algorithm based on feature similarity dynamic fusion and threshold optimization. Inf. Fusion 103825 (2025).
Staňková, M. J. Artificial factors within the logit bankruptcy model with a moved threshold. Comput. Econ. 66 (2), 1107–1135 (2025).
Aiche, I. et al. Transfer learning for diabetic retinopathy detection. In 2022 International Conference of Advanced Technology in Electronic and Electrical Engineering (ICATEEE). IEEE. (2022).
Nadir, C. et al. A sequential combination of convolution neural network and machine learning for finger vein recognition system. Signal. Image Video Process. 18 (11), 8267–8278 (2024).
Brik, Y. et al. Deep learning-based framework for automatic diabetic retinopathy detection. In 2022 32nd International Conference on Computer Theory and Applications (ICCTA). IEEE. (2022).
Roy, P. S. & Kukreja, V. J. Vision transformers for rice leaf disease detection and severity estimation: A precision agriculture approach. J. Saudi Soc. Agricultural Sci. 24 (3), 3 (2025).
Mondal, S. S. et al. EDLDR: An ensemble deep learning technique for detection and classification of diabetic retinopathy. Diagnostics 13 (1), 124 (2022).
Fayyaz, A. M. et al. Analysis of diabetic retinopathy (DR) based on the deep learning. Information 14 (1), 30 (2023).
Nahiduzzaman, M. et al. Diabetic retinopathy identification using parallel convolutional neural network based feature extractor and ELM classifier. Expert Syst. Appl. 217, 119557 (2023).
Jena, P. K. et al. A novel approach for diabetic retinopathy screening using asymmetric deep learning features. Big Data Cogn. Comput. 7 (1), 25 (2023).
Khalifa, N. E. M. et al. Deep transfer learning models for medical diabetic retinopathy detection. Acta Informatica Med. 27 (5), 327 (2019).
Aftab, S. & Akhtar, S. Diabetic retinopathy severity classification using data fusion and ensemble transfer learning. J. Softw. Eng. Appl. 18 (1), 1–23 (2025).
Wang, Z. et al. Diabetic retinopathy classification using a multi-attention residual refinement architecture. Sci. Rep. 15 (1), 29266 (2025).
Ahmed, F. J. Addressing high class imbalance in multi-class diabetic retinopathy severity grading with augmentation and transfer learning. (2025).
Pamungkas, Y. et al. Enhancing Diabetic Retinopathy Classification in Fundus Images using CNN Architectures and Oversampling Technique. J. Rob. Control (JRC). 6 (1), 413–425 (2025).
Kavuru, A. K., Patjoshi, R. K. & Panigrahi, R. J. A hybrid CNN-RBF approach for classification of diabetic retinopathy. Traitement du Signal 42 (5) (2025).
Guerbai, Y. et al. Deep learning techniques for diabetic retinopathy classification: a focus on VGG16 and EfficientNetB0. South Fla. J. Dev. 5(10), e4517–e4517.

Acknowledgements

This work was supported by the Princess Nourah bint Abdulrahman University Researchers Supporting Project (number PNURSP2026R754), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Funding

Open access funding provided by Royal Institute of Technology. This research received no external funding.

Author information

Authors and Affiliations

Laboratory of Signal and Systems Analysis (LASS), Department of Electronics, Faculty of Technology, University of M’sila, Ichebilia, PO Box 166, M’sila, 28000, Algeria
Ishaq Aiche & Oussama Bouguerra
Laboratory of Telecommunication and Smart Systems (LTSS), Faculty of Science and Technology, University of Djelfa, PO Box 3117, Djelfa, 17000, Algeria
Abdelaziz Rabehi
Civil and Architectural Engineering, KTH Royal Institute of Technology, Teknikringen, 78, 11428, Stockholm, Sweden
Mustapha Habib
Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, 11671, Saudi Arabia
Doaa Sami Khafaga
Department for Communications and Electronics, Delta Higher Institute of Engineering and Technology, Mansoura, 35511, Egypt
El-Sayed M. El-kenawy
Jadara Research Center, Jadara University, Irbid, 21110, Jordan
El-Sayed M. El-kenawy
Laboratory of Electrical Engineering (LGE), Faculty of Technology, University of M’sila, PO Box 166 Ichebilia, 28000 M’sila, Algeria
Youcef Brik & Bilal Attallah

Authors

Ishaq Aiche
Youcef Brik
Bilal Attallah
Oussama Bouguerra
Abdelaziz Rabehi
Mustapha Habib
Doaa Sami Khafaga
El-Sayed M. El-kenawy

Contributions

Conceptualization: M.H., S.M.A., E.M.E.; Methodology: M.H., S.M.A., E.M.E.; Software: I.A., Y.B., B.A., A.B.; Validation: E.M.E., Y.B., B.A., A.B.; Formal analysis: I.A., Y.B., B.A., A.B., O.B.; Investigation: M.H., S.M.A., E.M.E.; Resources: M.H., S.M.A., E.M.E.; Data curation: I.A., Y.B., B.A., A.B.; Writing – original draft: M.H., E.M.E., I.A.; Writing – review and editing: M.H., S.M.A., E.M.E.; Visualization: M.H., S.M.A., E.M.E.; Supervision: M.H., S.M.A., E.M.E.; Project administration: M.H., S.M.A., E.M.E.; Funding acquisition: M.H. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Mustapha Habib.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

This material is the author’s original work, which has not been previously published elsewhere. All authors have been personally and actively involved in substantial work leading to the paper and will take public responsibility for its content. The paper properly credits the meaningful contributions of all the co-authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Aiche, I., Brik, Y., Attallah, B. et al. RDE-DR: robust deep ensemble CNNs for automated diabetic retinopathy detection from fundus images. Sci Rep 16, 15226 (2026). https://doi.org/10.1038/s41598-026-48669-y

Received: 22 December 2025
Accepted: 09 April 2026
Published: 17 May 2026
Version of record: 17 May 2026
DOI: https://doi.org/10.1038/s41598-026-48669-y
