Assessing Machine Learning for Diagnostic Classification of Hypertension Types Identified by Ambulatory Blood Pressure Monitoring

Background: Inaccurate blood pressure (BP) classification results in inappropriate treatment. We tested whether machine learning (ML), using routine clinical data, can serve as a reliable alternative to ambulatory BP monitoring (ABPM) in classifying BP status.
Methods: This study employed a multicentre approach involving 3 derivation cohorts from Glasgow, Gdańsk, and Birmingham, and a fourth independent evaluation cohort. ML models were trained using office BP, ABPM, and clinical, laboratory, and demographic data, collected from patients referred for hypertension assessment. Seven ML algorithms were trained to classify patients into 5 groups, named as follows: Normal/Target; Hypertension-Masked; Normal/Target-White-Coat (WC); Hypertension-WC; and Hypertension. The 10-year cardiovascular outcomes and 27-year all-cause mortality risks were calculated for the ML-derived groups using the Cox proportional hazards model.
Results: Overall, extreme gradient boosting (using XGBoost open source software) showed the highest area under the receiver operating characteristic curve, of 0.85-0.88, across the derivation cohorts: Glasgow (n = 923; 43% female; age 50.7 ± 16.3 years), Gdańsk (n = 709; 46% female; age 54.4 ± 13 years), and Birmingham (n = 1222; 56% female; age 55.7 ± 14 years). However, accuracy (0.57-0.72) and F1 (harmonic mean of precision and recall) scores (0.57-0.69) were low across the 3 patient cohorts. The evaluation cohort (n = 6213; 51% female; age 51.2 ± 10.8 years) indicated elevated 10-year risks of composite cardiovascular events in the Normal/Target-WC and the Hypertension-WC groups, with heightened 27-year all-cause mortality observed in all groups, except the Hypertension-Masked group, compared to the Normal/Target group.
Conclusions: ML has limited potential in accurate BP classification when ABPM is unavailable. Larger studies including diverse patient groups and different resource settings are warranted.


Compared to office BP (oBP), ABPM is a superior predictor of hypertension-mediated organ damage, and of cardiovascular disease (CVD) morbidity and mortality [3], and it can identify patients with sustained hypertension, white-coat (WC) hypertension (BP overestimation by oBP measurement), and masked hypertension (BP underestimation by oBP measurement). Those with WC and masked hypertension comprise at least a third of the at-risk population, and masked hypertension is more prevalent in populations of African ancestry [4]. WC and masked hypertension are associated with an increased risk of progression to sustained hypertension [5], highlighting the importance of early identification for more intensive follow-up, even if pharmacologic therapy is not initiated.
However, ABPM has several drawbacks, including limited access to devices (often available only in secondary care); significantly higher device costs compared to oBP measurement devices; staff training; patient discomfort; and sleep disruption [6,7]. Any method that can determine true hypertension status without incurring the expense and discomfort of ABPM would enable effective and timely management in both primary and secondary care.
We hypothesized that machine learning (ML) applied to routine clinical data can replace the need for ABPM. Our primary objective was to develop an ML algorithm capable of accurately classifying patients without ABPM measurements into distinct BP risk groups that are as informative as classifications made using ABPM. Our secondary objective was to demonstrate the clinical utility of the ML classification by comparing the risk of CVD morbidity and mortality between these groups. A more extensive evaluation of the additional information that ABPM provides (such as BP variability and circadian variation) was beyond the scope of this study. A validated ML tool that performs well in different settings and is free from algorithmic bias would help hypertension management in both resource-poor and resource-rich settings, allowing clinicians and nurses to risk-stratify hypertensive patients without added cost or inconvenience to either the clinic or the patients.

Study datasets
This retrospective study included the following patients: those referred for hypertension assessment to the Glasgow Blood Pressure Clinic [8] at the Queen Elizabeth University Hospital, Glasgow, United Kingdom (UK) between 2017 and 2019 (Glasgow cohort); patients participating in the CARE NORTH study, a prospective study of hypertensive patients from the outpatient specialist clinic at the Medical University of Gdańsk, Poland (Gdańsk cohort); and patients attending hypertension clinics at Birmingham Heartlands Hospital, Birmingham, UK, between 2001 and 2020 (Birmingham cohort). A temporally distinct cohort for ML-model evaluation was extracted from the Glasgow Blood Pressure Clinic database, comprising patients referred to the clinic between 1985 and 2011 and followed up until 2013 (Glasgow non-ABPM cohort; nonoverlapping with the Glasgow cohort). Details of the cohorts and ethics statements are provided in Supplemental Appendix S1.

BP groups
The BP groups are described in Table 1 and Supplemental Appendix S1. If both office systolic BP (oSBP) and the 24-hour ambulatory systolic BP average (aSBP) meet nonhypertensive thresholds, this indicates normal SBP in the absence of treatment, or at-target SBP if on treatment; these patients are grouped together as "Normal/Target." The group "Hypertension" defines sustained hypertension. The group "Hypertension-Masked" defines hypertensive aSBP in the presence of normal oSBP. Those with a WC effect are divided into 2 groups: those with normal aSBP in the presence of hypertensive oSBP are in the Normal/Target-WC group, whereas those with hypertensive aSBP and oSBP, with oSBP ≥ 15 mm Hg higher than aSBP, are in the Hypertension-WC group. Any difference < 15 mm Hg between the oSBP and the aSBP of an individual with hypertensive oSBP and aSBP qualifies the individual for the Hypertension group. We assigned each patient to 1 of 5 BP groups based on the definitions in Table 1 for the Glasgow, Gdańsk, and Birmingham cohorts. ML models were derived using these group labels.
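The grouping rules described above can be sketched as a small Python function. This is only an illustration: the office and ambulatory thresholds (140 and 130 mm Hg) and the use of "greater than or equal to" for the hypertensive cut-off are assumptions made for the example, because the study's actual thresholds are given in Table 1 and are not reproduced in the text. The 15 mm Hg white-coat difference is taken from the description above; the function and parameter names are hypothetical.

```python
# Assumed, illustrative thresholds; the study's real cut-offs are in its Table 1.
OFFICE_SBP_THRESHOLD = 140.0      # mm Hg (assumption)
AMBULATORY_SBP_THRESHOLD = 130.0  # mm Hg (assumption)
WC_DIFFERENCE = 15.0              # mm Hg (from the text)


def classify_bp_group(osbp: float, asbp: float,
                      office_thr: float = OFFICE_SBP_THRESHOLD,
                      ambulatory_thr: float = AMBULATORY_SBP_THRESHOLD,
                      wc_diff: float = WC_DIFFERENCE) -> str:
    """Assign one of the 5 BP groups from office (osbp) and ambulatory (asbp) systolic BP."""
    office_high = osbp >= office_thr
    ambulatory_high = asbp >= ambulatory_thr

    if not office_high and not ambulatory_high:
        return "Normal/Target"
    if not office_high and ambulatory_high:
        return "Hypertension-Masked"
    if office_high and not ambulatory_high:
        return "Normal/Target-WC"
    # Both readings are hypertensive: split on the size of the office excess.
    if osbp - asbp >= wc_diff:
        return "Hypertension-WC"
    return "Hypertension"
```

Under these assumed thresholds, a patient with oSBP 170 and aSBP 150 mm Hg would fall into the Hypertension-WC group (office excess of 20 mm Hg), whereas oSBP 150 and aSBP 140 mm Hg would fall into the Hypertension group.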

Clinical features
Demographics, CVD, antihypertensive therapy, and blood chemistry test results were used to develop the model (see Supplemental Table S1). Features were obtained at the time of ABPM, or within a 3-month window preceding ABPM, from the standard clinical assessment of patients with hypertension in primary or secondary care who underwent ABPM. However, cohort-specific variations were present in the available features, reflecting differences in real-world clinical practice. The Glasgow cohort has 14 features; the Gdańsk cohort has 12 features; and the Birmingham cohort has 10 features (see Supplemental Tables S1-S4).

ML models
All data analyses and ML modelling were performed in Python 3.8 using the Scikit-learn (https://scikit-learn.org/), eXtreme Gradient Boosting (XGBoost open source software, https://github.com/dmlc/xgboost), and Matplotlib (https://matplotlib.org/) software libraries. Missing data were imputed using the k-nearest neighbor (kNN) algorithm. The Shapiro-Wilk test was used to examine normality, and the χ² test was used to examine the independence of categorical variables.
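To make the kNN imputation step concrete, the following is a minimal pure-Python sketch of the idea: a missing value is replaced by the average of that feature among the k most similar patients. It is not the study's implementation (Scikit-learn provides a `KNNImputer` class for this purpose, and the settings the authors used are not stated here); the distance rescaling, the choice of k, and all names are assumptions for illustration, and features are assumed to be numeric and on comparable scales.

```python
import math
from typing import List, Optional

Row = List[Optional[float]]  # None marks a missing value


def _distance(a: Row, b: Row) -> float:
    """Euclidean distance over features observed in both rows, rescaled for missing ones."""
    shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no common features, so the rows cannot be compared
    return math.sqrt(sum(shared) * len(a) / len(shared))


def knn_impute(rows: List[Row], k: int = 2) -> List[Row]:
    """Fill each missing value with the mean of that feature among the k nearest donor rows."""
    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which this feature was observed.
            donors = sorted(
                (_distance(row, other), other[col])
                for j, other in enumerate(rows)
                if j != i and other[col] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:  # leave the value missing if no comparable donor exists
                completed[i][col] = sum(nearest) / len(nearest)
    return completed
```

Distances are always computed on the original rows, so an imputed value never influences another imputation.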
We derived 7 ML models from clinical features (excluding ABPM, but including oBP measurements): multinomial logistic regression, support vector machine, kNN, naïve Bayes, decision tree, random forest, and tree-based XGBoost [10]. Because each cohort has a distinct set of features, separate ML models were developed for each of the 3 cohorts using stratified 5-fold cross-validation, and performance metrics were reported for each cohort (see Supplemental Appendix S1).
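Stratification matters here because some BP groups are small, and a purely random split could leave a fold with almost no patients from a minority group. The sketch below shows one simple way to build stratified folds in plain Python; it illustrates the principle only and is not the study's code (Scikit-learn's `StratifiedKFold` is the usual tool). The function names and the round-robin assignment are assumptions for the example.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[str], n_folds: int = 5, seed: int = 0) -> List[List[int]]:
    """Return n_folds lists of sample indices with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of this class.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def train_test_splits(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices), holding out each fold once."""
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx
```

Each model would then be fitted on the training indices and scored on the held-out fold, with the 5 scores combined afterwards.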

Performance metrics
Performance of models was reported using the area under the receiver operating characteristic curve (AUROC) and measures calculated from confusion matrices generated for each BP group: accuracy, precision (positive predictive value [PPV]; proportion of relevant instances among the retrieved instances), recall (sensitivity), specificity, F1-score (the harmonic mean of precision and recall), negative predictive value (NPV), and number needed to misdiagnose (NNM; the number of patients who need to be tested in order for 1 to be misdiagnosed by the test). Calibration (the degree of similarity between observed and predicted probability) was assessed by calibration plot and Brier score, as recommended by the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines (see Supplemental Appendix S1) [11]. The performance of a model was obtained by averaging the performance across all 5 folds. Results were reported as mean ± standard deviation. In all cases, the significance level was P < 0.05.
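As an illustration of how the per-group measures follow from a multiclass confusion matrix, the sketch below treats one BP group as "positive" and all other groups as "negative" (a one-vs-rest reading, which is an assumption about how the per-group values were derived). NNM is computed with the commonly used definition 1 / (1 − accuracy). The fold summary uses the sample standard deviation, which is also an assumption; all function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def one_vs_rest_metrics(confusion: List[List[int]], k: int) -> Dict[str, float]:
    """Metrics for class k, where confusion[i][j] counts true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                  # class k predicted as something else
    fp = sum(row[k] for row in confusion) - tp   # other classes predicted as class k
    tn = total - tp - fn - fp

    def ratio(num: float, den: float) -> float:
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    accuracy = ratio(tp + tn, total)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * precision * recall, precision + recall),
        "nnm": ratio(1.0, 1.0 - accuracy),
    }


def summarize_folds(values: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a metric across cross-validation folds."""
    return mean(values), stdev(values)
```

For a hypothetical 3-class matrix `[[8, 1, 1], [2, 6, 2], [0, 2, 8]]`, class 0 has precision and recall of 0.8, specificity and NPV of 0.9, and an NNM of 7.5, meaning roughly 1 misclassification for every 7 to 8 patients assessed for that class.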

Predicting BP group in the Glasgow non-ABPM cohort
Models derived from the Glasgow, Gdańsk, and Birmingham cohorts were used to predict model-specific BP groups for the 7812 patients in the Glasgow non-ABPM cohort. The hypertension duration and the number of antihypertensive medications were not available for the Glasgow non-ABPM cohort, so these features were omitted from the models.

Survival analysis
Survival analysis was conducted on the Glasgow non-ABPM cohort, by applying models derived from the Glasgow, Gdańsk, and Birmingham ABPM cohorts. The outcomes analyzed were all-cause mortality and a CVD composite outcome (defined as time to first CVD admission or CVD death from myocardial infarction, ischemic heart disease, cerebrovascular accident, heart failure, or peripheral vascular disease; see Supplemental Appendix S1). The patients were followed from their first BP clinic visit until death, emigration, or April 1, 2011 (the end of follow-up). Multivariable Cox proportional hazards models were used to assess the prognostic effect of the ML-derived BP groups on all-cause mortality and composite CVD events after adjustment for baseline variables: age, sex, body mass index, cholesterol, smoking status, Charlson comorbidity index, and a variable for year-of-first-visit strata (epochs) to adjust for secular trends in mortality. Schoenfeld residuals were used to test the proportional hazards assumption. Multiple imputation by chained equations (MICE) was performed for variables with < 10% of values missing (body mass index and cholesterol). Ten imputation datasets were generated, and pooled estimates from Cox regression are reported.
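The text does not state how the estimates from the 10 imputed datasets were pooled. A common choice is Rubin's rules, sketched below for a single Cox coefficient (a log hazard ratio); this is an assumption about the method, not a description of the authors' code. The confidence interval uses a normal approximation (z = 1.96) for simplicity, whereas a full implementation would normally use a t-distribution with adjusted degrees of freedom. All names are hypothetical.

```python
import math
from statistics import mean, variance
from typing import List, Tuple


def pool_rubin(estimates: List[float], std_errors: List[float]) -> Tuple[float, float]:
    """Pool per-imputation log hazard ratios and standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(estimates)                   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


def hazard_ratio_ci(log_hr: float, se: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Convert a log hazard ratio and its standard error to (HR, lower, upper)."""
    return math.exp(log_hr), math.exp(log_hr - z * se), math.exp(log_hr + z * se)
```

The between-imputation term widens the interval to reflect uncertainty about the missing body mass index and cholesterol values, which is why pooled intervals are typically wider than those from any single imputed dataset.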

Results
Table 2 and Supplemental Tables S2-S4 summarize the clinical, laboratory, and demographic characteristics of the 3 study datasets. The proportion of patients on antihypertensive medication varied across cohorts (54% in Glasgow, 96% in Gdańsk, and 42% in Birmingham). The Glasgow cohort had the highest oBP and ABPM readings, and the Gdańsk cohort had the lowest. Supplemental Tables S2-S4 present the distribution of patients across the 5 BP groups, and a summary of the clinical features of the 3 cohorts. The BP groups with the smallest proportion of patients were the Hypertension-Masked group in the Birmingham cohort (2.1%) and the Glasgow cohort (5.6%), and the Hypertension-WC group in the Gdańsk cohort (6.8%). The Normal/Target-WC group had the highest proportion of patients in all 3 cohorts (Glasgow, 37.7%; Gdańsk, 46.3%; Birmingham, 44.8%).

Model performance
The performance measures are presented in Table 3 and Supplemental Table S5. Table 3 shows the performance of the XGBoost ML model in classifying patients into distinct BP groups, as defined by ABPM data, which we consider to be the gold standard in this context. Overall, the XGBoost and random forest models performed best, with AUROC values between 0.85 and 0.88 (see Supplemental Figs. S1 and S2). For XGBoost, the accuracy ranged between 0.57 and 0.72, and F1 ranged between 0.57 and 0.69 across the 3 patient cohorts. The simplest kNN model (Glasgow and Gdańsk cohorts) and the naïve Bayes model (Glasgow and Birmingham cohorts) performed the worst. The confusion matrices and the corresponding performance metrics for XGBoost classification of patients into the 5 BP groups in the 3 cohorts are depicted in Table 3 and Supplemental Figure S3. Across all 3 cohorts, XGBoost correctly classified 79%-99% of patients in the Normal/Target group, and 60%-80% of patients in the Normal/Target-WC group. The classification accuracies for the Hypertension and the Hypertension-Masked groups were variable across the 3 cohorts, with misclassification into the Normal/Target-WC group and the Normal/Target group, respectively. Supplemental Figure S4 depicts XGBoost calibration.

Feature importance
The XGBoost model revealed that in all cohorts, oSBP, age, cholesterol, and creatinine level were the most influential predictors (see Supplemental Fig. S5). Substantial variations were present in influential characteristics among models and among cohorts (see Supplemental Fig. S6). XGBoost demonstrated a balanced contribution from all features, whereas other models exhibited a predominance of 1 to 3 features in determining model performance (see Supplemental Fig. S6).

Survival analysis
The demographic characteristics of the Glasgow non-ABPM cohort are summarized in Supplemental Table S6. The results of Cox regression analysis are displayed in Table 4 and Supplemental Figure S7. A significantly higher 10-year risk of composite CVD events was evident for the Normal/Target-WC group (hazard ratio [HR] 1.3 [95% CI: 1.09; 1.55], P = 0.003) and the Hypertension-WC group (HR 1.38 [95% CI: 1.14; 1.67], P = 0.001), compared to that for the Normal/Target group, when predicted by the model derived from the Glasgow cohort. When heterologous models (that is, the Birmingham or Gdańsk XGBoost models) were applied to the Glasgow non-ABPM cohort, the outcomes were largely consistent. Hypertension-WC groups predicted by the Birmingham (HR 1.42 [95% CI: 1.21; 1.66], P < 0.0001) and Gdańsk (HR 1.33 [95% CI: 1.12; 1.57], P = 0.001) XGBoost models were associated with a higher 10-year CVD risk, compared to that for the Normal/Target group. The predicted Normal/Target-WC groups did not show an increased risk of 10-year cardiovascular events for the Birmingham cohort (HR 1.16 [95% CI: 0.99; 1.36], P = 0.059) or the Gdańsk cohort (HR 1.16 [95% CI: 0.96; 1.39], P = 0.127). Notably, the Glasgow and Gdańsk models did not reveal an increased risk of CVD outcomes in the Hypertension group.
The Normal/Target-WC and the Hypertension-WC groups showed significantly higher all-cause mortality compared to the Normal/Target group for all 3 models. The Hypertension group showed significantly increased mortality risk with the Glasgow and Gdańsk models. The Hypertension-Masked group showed significantly increased mortality risk with only the Glasgow model. For all-cause mortality outcomes, the Normal/Target-WC group (Glasgow: HR 1.42 [95% CI: 1.17; 1.73], P = 0.0004; Birmingham: HR 1.2 [95% CI: 1.01; 1.42], P = 0.036; Gdańsk: HR 1.25 [95% CI: 1.03; 1.52], P = 0.025) and the Hypertension-WC group (Glasgow: HR 1.69 [95% CI: 1.38; 2.07], P < 0.0001; Birmingham: HR 1.5 [95% CI: 1.27; 1.76], P < 0.0001; Gdańsk: HR 1.39 [95% CI: 1.17; 1.66], P = 0.0002) showed significant associations regardless of the classification model used. All-cause mortality for the

Discussion
In our proof-of-concept investigation, we explored the viability of using ML applied to routine clinical data as a substitute for ABPM in the clinical management of hypertension. Despite the modest accuracy of the ML classifications, our analysis of associations with clinical outcomes revealed that certain ML-derived categories correlate with a higher risk, compared to that for the reference category of normal/target BP. This observation suggests that, even with current accuracy limitations, ML classifications can provide preliminary insights into risk stratification that merit further investigation. The implications of our results, despite the acknowledged low accuracy levels, are 2-fold. First, the study demonstrates the potential clinical relevance of ML classifications in identifying risk groups, which could be refined and validated in larger, more diverse datasets. Second, the study underscores the importance of enhancing accuracy and reliability in future research. The observed associations, despite the current limitations, indicate a promising direction for leveraging ML in hypertension management, especially in contexts in which ABPM is not feasible.
An interesting insight from our study is the potential generalizability of the ML model, regardless of the originating cohort, in consistently identifying individuals exhibiting the WC effect, with consequent increased cardiovascular risk. Our approach to classifying individuals into 5 risk groups differs from previous efforts, which focused primarily on deducing ABPM BP values from oBP measurements and other clinical parameters [12].
Our ML algorithms exhibited some degree of misclassification across all 3 cohorts, particularly between the Normal/Target and the Hypertension-Masked groups, and within the trio of the Normal/Target-WC, the Hypertension-WC, and the Hypertension groups. Although the models showed high AUROCs, these misclassifications resulted in suboptimal calibration metrics.
Table 3. Summary of XGBoost performance in the 3 cohorts for each blood pressure (BP) group
Table 4 footnote. The values represent adjusted hazard ratios and 95% confidence intervals. WC, white coat; XGBoost, eXtreme Gradient Boosting open source software.
The XGBoost model detected the Hypertension group (40%) and the Hypertension-Masked group (54%) better in the Glasgow cohort than in the Gdańsk and Birmingham cohorts (20% and 9% for the Hypertension group, and 20% and 0% for the Hypertension-Masked group, respectively). The relatively higher detection rate for the smaller Hypertension-Masked and Hypertension groups in the Glasgow cohort may be attributable to the Glasgow cohort having more input features (Supplemental Table S1) and a more balanced distribution of groups than the other 2 cohorts (Supplemental Tables S2-S4). The advantage of XGBoost models is that they harmonize a range of covariates, including demographic and laboratory values, to construct the prediction, with no single covariate predominating, in contrast to logistic regression (see Supplemental Fig. S6). Nevertheless, the development of ML models from unbalanced data can result in predictions that are skewed toward the majority group and in deceptively high performance. This phenomenon is typically referred to as the accuracy paradox. Consequently, despite the Glasgow cohort's superior predictive ability for smaller subgroups, its model produced a lower level of accuracy, and lower F1 and AUROC scores, than those of the Gdańsk and Birmingham cohorts. Performance matrices (Supplemental Table S5), which provide a more in-depth analysis of the model's classification accuracy, demonstrated that the Glasgow cohort's model has the best performance in detecting clinically significant smaller subgroups. Oversampling implicitly increases the prior probability of smaller subgroups, which could lead to misclassifications of patients from the majority group.
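The accuracy paradox mentioned above can be shown with a toy example. The label counts below are hypothetical (they are not the study's data), and macro-averaged recall is used here only as one convenient imbalance-aware summary; it is not a metric the study reports under that name.

```python
from collections import Counter
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of patients assigned to their true group."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-group recall, so small groups count as much as large ones."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[group] / n for group, n in support.items()) / len(support)


# Hypothetical, deliberately unbalanced cohort: 90 majority-group and 10 minority-group patients.
y_true = ["Normal/Target-WC"] * 90 + ["Hypertension-Masked"] * 10
# A degenerate "model" that always predicts the majority group.
y_pred = ["Normal/Target-WC"] * 100

print(accuracy(y_true, y_pred), macro_recall(y_true, y_pred))
```

Here the degenerate model reaches an accuracy of 0.90 while detecting none of the minority-group patients, giving a macro-averaged recall of 0.50. This is why per-group confusion matrices are more informative than overall accuracy when some BP groups are small.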
ABPM is considered the gold standard for diagnosing hypertension [2,3,9,13]. ABPM provides a more precise hypertension diagnosis, which benefits both the individual patient and the healthcare system. Avoiding unnecessary antihypertensive treatment in patients has resulted in a 3%-14% decrease in treatment costs. However, due to the requirements for technology and other resources, the immediate cost of ABPM is significantly higher than that of routine clinic BP measurement [6,7]. ABPM is also more taxing on patients, as the BP cuff inflates and deflates at least twice or thrice hourly: one third of patients report having pain or bruising during the 24-hour monitoring process, and two thirds report having sleep disturbances.
The primary strengths of our study are the following: the development of multiple ML models in 3 temporally and geographically distinct cohorts, which strengthens the study's conclusions; the use of cross-validation to generate robust confidence intervals for our estimates; the demonstration of an impact on long-term outcomes; and adherence to the current Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis-Artificial Intelligence (TRIPOD-AI) guidelines for reporting prediction models [11]. However, our study has limitations. In addition to the misclassification and poor calibration metrics noted above, some of the BP groups were underrepresented due to the small sample size, most notably the Hypertension-Masked group, and the majority of patients were White Europeans, with little ethnic diversity. We were unable to fully account for secular effects among cohorts in the survival analyses, and we used only the year of inclusion as a covariate to address this issue. The Glasgow non-ABPM cohort comprised hypertensive patients attending the Glasgow BP clinic; thus, all were managed in a broadly similar way at a specialist clinic. The discrepancy in the cardiovascular outcomes risk seen in the Hypertension group and the WC group, compared to that in the Normal/Target BP group, on applying models derived from the 3 cohorts to outcomes in the Glasgow non-ABPM cohort, may reflect an existing unmet need, with those patients with the WC effect still being undertreated. We surmise that further research using larger datasets will help determine the predictive power of the ML models and clarify the implications for long-term outcomes.
This study serves as a foundational proof-of-concept, demonstrating the potential of ML to categorize patients into clinically pertinent risk groups when ABPM is unavailable, although it underscores that ML cannot supplant ABPM completely at this juncture. Our findings advocate for the conduct of more extensive prospective studies, with the inclusion of a diverse array of racial and ethnic groups, and a variety of resource settings, alongside adequate representation across all BP classifications. Furthermore, the successful and ethical deployment of ML in healthcare, particularly in resource-poor settings, requires a participatory approach that includes patients, healthcare providers, and other stakeholders from these communities from the outset. In conclusion, a comprehensive evaluation is essential to determine the true potential of ML as either a direct substitute for or a significant complement to ABPM. By integrating ML into the clinical pathway, we have the opportunity to revolutionize the management of hypertension, enabling timely and effective interventions across diverse healthcare environments.
may be granted to those who meet the criteria for confidential access, but data governance regulations prevent the data from being accessible to the public. The leads for each cohort (S.P., K.N., and I.D.) can provide guidance on data-access procedures, which may necessitate separate applications to the respective health boards. The 3 XGBoost models, corresponding to the 3 cohorts used in this study, have been made freely available at https://github.com/Tran031194/abpmML.


Table 1. Definition of 5 blood pressure (BP) groups
Pressures are given in mm Hg. NA indicates not applicable for that group. aSBP, ambulatory BP monitoring 24-hour systolic BP average; oSBP, office systolic BP; WC, white-coat.

Table 4. Results of Cox regression analysis for 27-year all-cause mortality and for 10-year composite cardiovascular events