Generalizability Assessment of AI Models Across Hospitals in a Low-Middle and High Income Country – Nature Communications
Adopting AI in Low- and Middle-Income Countries (LMICs) faces significant infrastructural and capacity-building challenges. These include power outages, unreliable internet connectivity, cybersecurity concerns, inadequate digital infrastructure (such as data management and storage systems), and a shortage of skilled AI professionals. Consequently, prioritizing AI solutions may divert resources from more urgent foundational needs. These issues also touch upon the broader concern of AI governance, which remains a challenge even in High-Income Countries (HICs) and is likely even more challenging in LMICs. Therefore, while AI holds promise, its adoption in LMICs necessitates a careful, context-sensitive approach to address these underlying challenges.
In this study, clinical data combined with deidentified demographic details for patients across hospitals in the UK and Vietnam were used. United Kingdom National Health Service (NHS) approval via the national oversight/regulatory body, the Health Research Authority (HRA), has been granted for the use of routinely collected clinical data to develop and validate artificial intelligence models to detect COVID-19 (CURIAL; NHS HRA IRAS ID: 281832). The study was restricted to working with deidentified data, extracted retrospectively; hence, explicit patient consent was not required, as covered by the HRA approval. All necessary consents have been obtained, and the appropriate institutional forms archived.
The ethics committees of the Hospital for Tropical Diseases (HTD) and the National Hospital for Tropical Diseases (NHTD) in Vietnam approved the use of their datasets for COVID-19 diagnosis. The study only worked with deidentified data collected as part of ongoing audits; therefore, OxTREC, NHTD, and HTD ethics committees waived the need for individual informed consent. The methods comply with the relevant guidelines and regulations.
From the UK, data were used from hospital emergency departments under Oxford University Hospitals NHS Foundation Trust (OUH), University Hospitals Birmingham NHS Trust (UHB), Bedfordshire Hospitals NHS Foundation Trust (BH), and Portsmouth Hospitals University NHS Trust (PUH). For these datasets, NHS HRA approval was granted for the development and validation of AI models to detect COVID-19. In Vietnam, data were used from the intensive care units (ICUs) of the HTD and the NHTD, approved by the respective ethics committees.
To maintain consistency with earlier studies, models were trained using the same cohorts as those used previously. Specifically, patient presentations from OUH were exclusively used for the training and validation sets. Two data extracts from OUH were obtained, corresponding to the first (December 1, 2019, to June 30, 2020) and second waves (October 1, 2020, to March 6, 2021) of the COVID-19 epidemic in the UK. During the first wave, incomplete testing and the imperfect sensitivity of the PCR test resulted in uncertainty about the viral status of untested or negative-tested patients. A simulated disease prevalence of 5% was created by matching positive COVID-19 presentations in the training set with negative controls based on age, aligning with the actual COVID-19 prevalences observed at the four UK sites during data extraction (4.27% to 12.2%). A sensitivity analysis was conducted to assess the effect of this uncertainty in viral status on model performance.
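The age-matched control sampling described above can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' code: the function name, the greedy nearest-age matching strategy, and the equal allocation of controls per positive case are all assumptions.

```python
import numpy as np

def age_matched_controls(pos_ages, neg_ages, target_prevalence=0.05, rng=None):
    """Greedily pick, for each positive case, the closest-aged unused
    negative controls so that positives make up `target_prevalence` of
    the resulting cohort. Returns indices into `neg_ages`.

    Illustrative sketch only; the matching strategy is an assumption.
    """
    rng = np.random.default_rng(rng)
    n_pos = len(pos_ages)
    # number of negatives so that n_pos / (n_pos + n_neg) == target_prevalence
    n_neg = int(round(n_pos * (1 - target_prevalence) / target_prevalence))
    per_pos = n_neg // n_pos  # controls allocated per positive (remainder ignored)
    available = np.ones(len(neg_ages), dtype=bool)
    chosen = []
    for age in rng.permutation(pos_ages):  # random order to avoid ordering bias
        idx = np.flatnonzero(available)
        order = idx[np.argsort(np.abs(neg_ages[idx] - age))]
        take = order[:per_pos]
        chosen.extend(take.tolist())
        available[take] = False
    return np.array(chosen)
```

With 5 positives and a target prevalence of 5%, the function selects 95 distinct controls, giving a 100-patient cohort at exactly the simulated prevalence.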
The development dataset included 114,957 patient presentations from OUH predating the global COVID-19 outbreak, ensuring they were COVID-free. Additionally, 701 positive COVID-19 presentations, confirmed by PCR tests, were included to ensure the accuracy of COVID-19 status labels during training.
Validation of the model occurred across four UK cohorts (OUH wave two, UHB, PUH, BH) comprising 72,223 admitted patients (4600 COVID-19 positive), and two Vietnam cohorts (HTD and NHTD) with 3431 admitted patients (2413 COVID-19 positive). A summary of respective cohorts is outlined in Table 1. Inclusion and exclusion criteria are provided in the Supplementary Material.
For OUH, all patients presenting and admitted to the emergency department were included. For PUH, UHB, BH, HTD, and NHTD, all admitted patients were included. COVID-19 status was determined at the UK sites and HTD through confirmatory PCR testing, while NHTD used PCR and/or rapid antigen testing. However, for NHTD, the specific test type was often unrecorded. To maximize testing coverage, cases with an unspecified test type were labeled based on clinical indications and COVID-19 severity, as confirmed by specialist infectious diseases clinicians.
Uniformity in measurement units for identical features was first ensured. All features were then standardized to a mean of 0 and a standard deviation of 1 to aid convergence of the neural network models. For missing values in the UK datasets, population median imputation was used, and the same imputation method was applied to matched features in the Vietnamese datasets. A sensitivity analysis using the XGBoost model, which handles missing values natively, showed performance improvements, with AUROC scores increasing on both the HTD and NHTD datasets.
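The preprocessing steps above (median imputation followed by z-score standardization) can be sketched as below. This is a generic illustration, assuming the imputation medians and standardization statistics are fitted on the training set only and reused on validation data to avoid information leakage; function names are hypothetical.

```python
import numpy as np

def fit_preprocessor(X_train):
    """Learn imputation medians and z-score parameters from the training
    set only, so that external cohorts are transformed with the same
    statistics (avoiding information leakage)."""
    medians = np.nanmedian(X_train, axis=0)          # population medians per feature
    X_filled = np.where(np.isnan(X_train), medians, X_train)
    mean, std = X_filled.mean(axis=0), X_filled.std(axis=0)
    std[std == 0] = 1.0                              # guard against constant features
    return medians, mean, std

def transform(X, medians, mean, std):
    """Impute missing values with the fitted medians, then standardize."""
    X = np.where(np.isnan(X), medians, X)
    return (X - mean) / std
```

Applying `transform` to the training matrix itself yields columns with mean 0 and standard deviation 1, as required for stable neural network training.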
To evaluate model generalizability, three model architectures were investigated: logistic regression, XGBoost, and a standard neural network. Logistic regression is a linear model widely accepted in clinical communities. XGBoost, a tree-based model, is known for strong performance on tabular data. A standard neural network can benefit from transfer learning and has shown superior performance for COVID-19 diagnosis.
The performance of trained models was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC). These metrics were accompanied by 95% confidence intervals (CIs), computed using 1000 bootstrap samples from the test set. Tests of significance (p-values) comparing model performances were conducted, with 0.05 as the threshold for statistical significance.
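The bootstrap procedure for the confidence intervals can be illustrated as follows, using AUROC as the example metric. This is a minimal numpy sketch under standard assumptions (percentile bootstrap, resamples discarded when they contain only one class); it is not the authors' implementation.

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC as the probability that a random positive outscores a
    random negative (ties count half); pairwise form for clarity."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def bootstrap_ci(y_true, scores, metric=auroc, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI (for alpha=0.05) around a metric."""
    rng = np.random.default_rng(seed)
    n, stats = len(y_true), []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)           # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                          # need both classes for AUROC
        stats.append(metric(y_true[idx], scores[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, scores), (lo, hi)
```

The same wrapper applies unchanged to sensitivity, specificity, PPV, NPV, or AUPRC by swapping the `metric` argument.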
A grid search adjusted the sensitivity/specificity trade-off for identifying COVID-19 cases, optimizing the decision threshold to achieve a sensitivity of 0.85 (±0.05). This surpassed the sensitivity of lateral flow device (LFD) tests and aligned with the sensitivities of real-time PCR, ensuring effective COVID-19 detection by the AI models.
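The threshold grid search can be sketched as below. The tie-breaking rule (among thresholds whose sensitivity lands within the target band, prefer the one with the highest specificity) is an assumption, as is the use of observed scores as the candidate grid.

```python
import numpy as np

def pick_threshold(y_true, scores, target_sens=0.85, tol=0.05):
    """Grid-search the decision threshold: among thresholds whose
    sensitivity falls within target_sens ± tol, return the one with the
    best specificity. Returns (threshold, specificity, sensitivity),
    or None if no threshold meets the sensitivity target."""
    best = None
    for t in np.unique(scores):               # candidate grid: observed scores
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        fp = np.sum(pred & (y_true == 0))
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        if abs(sens - target_sens) <= tol and (best is None or spec > best[1]):
            best = (t, spec, sens)
    return best
```

In practice the threshold would be selected on the validation set and then frozen before evaluating the external cohorts.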
For training, the reduced feature set from prior studies was used, including specific laboratory blood tests and vital signs recorded during patient admissions. The UK and Vietnam datasets were matched on features, applying nearest neighbor (NN) imputation and Geometrically-Aggregated Training Samples (GATS) techniques to address missing data.
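The nearest-neighbor imputation step can be sketched as below; GATS is a bespoke augmentation technique and is not reproduced here. In this illustrative version, a row with missing matched features borrows those values from the most similar complete row (Euclidean distance over the features that row does have); the function name and distance choice are assumptions.

```python
import numpy as np

def nn_impute(target, reference):
    """Fill missing entries in each `target` row with the corresponding
    values from the nearest `reference` row, where nearness is measured
    over the features observed in that target row. Reference rows are
    assumed complete. Illustrative sketch only."""
    out = target.copy()
    for i, row in enumerate(target):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        d = np.linalg.norm(reference[:, obs] - row[obs], axis=1)
        out[i, miss] = reference[np.argmin(d), miss]
    return out
```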
Transfer learning was investigated by fine-tuning the network weights of UK-trained models on subsets of HTD and NHTD data, customizing the models to Vietnam's local context. Internal validation was conducted on the remaining patient data, and external validation was performed on the other hospital's cohort. Additionally, baseline neural network models were trained locally at each Vietnamese hospital for direct comparison.
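The fine-tuning idea reduces to continuing gradient descent from pretrained weights, with a smaller learning rate, on the local cohort. The toy sketch below uses a single-layer logistic model and synthetic data purely to make the mechanics concrete; the actual study fine-tuned a multi-layer neural network, and all data and hyperparameters here are invented.

```python
import numpy as np

def train_logreg(X, y, w=None, b=0.0, lr=0.1, epochs=200):
    """Gradient-descent logistic regression. Passing pretrained (w, b)
    with a small `lr` fine-tunes the model rather than training from
    scratch, which is the essence of the transfer-learning setup."""
    n, d = X.shape
    w = np.zeros(d) if w is None else w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / n            # cross-entropy gradients
        b -= lr * (p - y).mean()
    return w, b

rng = np.random.default_rng(0)
# synthetic "UK" pretraining cohort
X_uk = rng.normal(size=(500, 3))
y_uk = (X_uk[:, 0] > 0).astype(float)
w, b = train_logreg(X_uk, y_uk)
# synthetic local cohort with a shifted decision boundary ("Vietnam")
X_vn = rng.normal(size=(100, 3)) + 0.5
y_vn = (X_vn[:, 0] > 0.5).astype(float)
# fine-tune: start from pretrained weights, smaller lr, fewer epochs
w_ft, b_ft = train_logreg(X_vn, y_vn, w=w, b=b, lr=0.02, epochs=50)
```

The locally trained baselines mentioned above correspond to calling `train_logreg` on the Vietnamese data alone, without the pretrained starting point.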
This study illustrates the potential for HICs to help LMICs through dataset augmentation, improving outcomes when applying AI models across different settings. Further details on research design can be found in the Nature Portfolio Reporting Summary linked to this article.