LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer.
Although modern methods of whole genome DNA methylation analysis have a wide range of applications, they are not suitable for clinical diagnostics due to their high cost and complexity and due to the large amount of sample DNA required for the analysis. Therefore, it is crucial to be able to identify a relatively small number of methylation sites that provide high precision and sensitivity for the diagnosis of pathological states. We propose an algorithm for constructing limited subsamples from high-dimensional data to form diagnostic panels. We have developed a tool that utilizes different methods of selection to find an optimal, minimum necessary combination of factors using cross-entropy loss metrics (LogLoss) to identify a subset of methylation sites. We show that the algorithm can work effectively with different genome methylation patterns using ensemble-based machine learning methods. Algorithm efficiency, precision and robustness were evaluated using five genome-wide DNA methylation datasets (totaling 626 samples), and each dataset was classified into tumor and non-tumor samples. The algorithm produced an AUC of 0.97 (95% CI: 0.94-0.99, 9 sites) for prostate adenocarcinoma and an AUC of 1.0 (from 2 to 6 sites) for urothelial bladder carcinoma, two types of kidney carcinoma and colorectal carcinoma. For prostate adenocarcinoma we showed that identified differential variability methylation patterns distinguish cluster of samples with higher recurrence rate (hazard ratio for recurrence = 0.48, 95% CI: 0.05-0.92; log-rank test, p-value
Список научных проектов, где отмечена публикация
- Белки и пептиды в постгеномную эру. Структурно-функциональные исследования для решения фундаментальных задач и направленного конструирования инновационных лекарственных средств (6 Января 2014 года 31 Декабря 2018 года). . Грант, РНФ.