Objective: Our objective was to create an effective ensemble tool that accurately predicts the pathogenicity of MEFV gene variants and to determine the threshold value for pathogenicity based on the optimal distribution.
Methods: First, we extracted a dataset from the Infevers database [https://infevers.umai-montpellier.fr/web/search.php?n=1]. Second, we merged the variant classifications into 2 categories: likely benign and likely pathogenic. Third, we implemented our high-sensitivity model to identify disease-causing variants. Fourth, we performed curve estimation analysis to determine which curve best fit our variant distribution, followed by receiver operating characteristic (ROC) curve analysis to select suitable in silico tools for logistic regression. Fifth, we repeated outlier detection analysis until no outliers were detected. Finally, we used ensemble tree-based machine learning models to test the statistical model.
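The repeated outlier detection step can be illustrated with a minimal sketch. The abstract does not state which outlier rule was used, so the per-class z-score rule, the cutoff of 2.0, the function names, and the toy scores below are all assumptions made for illustration; they are not the study's data or method. The sketch removes outliers within each class until a pass finds none, then compares the ROC AUC before and after cleaning.

```python
from statistics import mean, stdev

def roc_auc(scores, labels):
    """ROC AUC as the probability that a random pathogenic variant
    outscores a random benign one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def remove_outliers_repeatedly(scores, labels, z_cut=2.0):
    """Assumed rule: within each class, drop points with |z| > z_cut,
    and repeat until a full pass removes nothing."""
    data = list(zip(scores, labels))
    while True:
        keep = []
        for cls in (0, 1):
            group = [s for s, y in data if y == cls]
            m, sd = mean(group), stdev(group)
            keep += [(s, cls) for s in group
                     if sd == 0 or abs(s - m) <= z_cut * sd]
        if len(keep) == len(data):  # no outliers detected in this pass
            return [s for s, _ in keep], [y for _, y in keep]
        data = keep

# Hypothetical in silico scores (0 = likely benign, 1 = likely pathogenic).
scores = [0.05, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.95,
          0.70, 0.75, 0.78, 0.80, 0.82, 0.85, 0.88, 0.90, 0.08]
labels = [0] * 9 + [1] * 9

auc_before = roc_auc(scores, labels)
clean_scores, clean_labels = remove_outliers_repeatedly(scores, labels)
auc_after = roc_auc(clean_scores, clean_labels)
print(f"AUC before: {auc_before:.3f}, after: {auc_after:.3f}")
```

In this toy example, one discordant score per class is removed in the first pass and the second pass finds nothing further, so the loop stops. A different outlier criterion would change which variants are excluded and therefore the resulting AUC.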
Results: After outliers were removed, both the Revel and BayesDel algorithms showed markedly higher ROC AUC scores (0.982 [0.967-0.998], P < .001 for the combined model; 0.982 [0.967-0.998], P < .001 for Revel; and 0.933 [0.889-0.977], P < .001 for BayesDel). AdaBoost was the most accurate machine learning model, with a ROC AUC score of 0.982.
Conclusion: Our study revealed that the implementation of outlier and anomaly detection techniques can enhance the accuracy of statistical models and yield more precise outcomes in machine learning datasets.
Cite this article as: Alay MT. An ensemble model based on combining BayesDel and Revel scores indicates outstanding performance: Importance of outlier detection and comparison of models. Cerrahpaşa Med J. 2024;48(2):179-184.