A frustrating report of a clinical trial

In 2020, I did the analysis for a couple of studies of a deep learning prediction model developed to screen patients for atrial fibrillation (AF) based on an electrocardiogram (ECG). The idea is that when a patient has an ECG, it’s read by a human who looks for AF, among other irregularities; if a human can see AF, then the prediction model isn’t needed. On the other hand, if the ECG shows a normal sinus rhythm, then the patient might still be experiencing episodes of AF or might be at high risk of AF in the near future; all that’s known is that AF wasn’t occurring when the ECG was taken. The prediction model, called AI-ECG, is applied to normal sinus rhythm ECGs and outputs a probability that the patient has had AF or will have AF in the near future. The deep learning model presumably identifies AF-associated patterns in the ECG that are too subtle for a human to detect. Patients labeled as high-risk by AI-ECG could then be monitored more closely.

The AI-ECG model was recently evaluated in a prospective study, the Batch Enrollment for an AI-Guided Intervention to Lower Neurologic Events in Patients with Undiagnosed Atrial Fibrillation (BEAGLE) trial.¹ Patients who underwent an ECG at Mayo Clinic were invited to participate. Their ECGs were run through the AI-ECG model to obtain a risk score for each patient. Patients who enrolled in the study were given a device to continuously monitor their heart rhythms for up to 30 days. The heart rhythm monitor provided a “gold standard” diagnosis of incident AF.

In the Introduction of the article reporting the trial in The Lancet, the trial is described as addressing two questions:

First, it is unclear whether this AI algorithm offers additional risk stratification power beyond advanced age and other clinical risk factors, which are the traditional approaches to defining monitoring populations.

…

Second, it is unknown whether, or how much, monitoring in patients AI selected as being at high risk addresses the underdiagnosis of atrial fibrillation in comparison with usual care. Even if the AI successfully selects patients more likely to have atrial fibrillation, the programme is only valuable if it identifies new atrial fibrillation cases that would not have come to clinical attention under usual care without active screening.

Reading the Methods and Results sections to understand what the trial results actually say in answer to these questions, I was frustrated on a number of points.

Validation of AI-ECG risk stratification

The Methods section says that sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV) were calculated for a pre-designated cutoff of the AI-ECG output that classifies patients as high- vs low-risk, with the target being detection of AF lasting \(\geq 30\) seconds. But the Results section explicitly labels only the NPV.
- The PPV (7.6%) is reported but without calling it PPV — why? Is it because PPV 7.6% “looks bad”?
- Sensitivity and specificity can be calculated from the numbers in Table 2, where 48 of the 633 high-risk participants and 6 of the 370 low-risk participants had AF detected. Sensitivity is \(48/(48+6)=89\%\), and specificity is \((370-6) / [(370-6) + (633-48)]=38\%\). Again, why not state these outright? Is it because the low specificity looks bad (especially compared to the 79.5% specificity of AI-ECG in the data used to train the model)?
- Or rather, were sensitivity and specificity not reported because the study design didn’t allow them to be estimated in a simple manner? An article² describing the study design says that the cohort will be enriched with patients who have higher-risk AI-ECG scores. The Lancet article doesn’t confirm whether this was actually done, though there does seem to be a large number of “high risk” patients (633 high-risk vs 370 low-risk). If it was, then the study participants who actually had AF observed in the study wouldn’t be representative of eligible patients with AF; the study participants with AF would be expected to be enriched with high-scoring patients, and therefore sensitivity of AI-ECG would be inflated. Similarly, specificity would be deflated.
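
To make the arithmetic in the bullets above easy to check, here is a minimal Python sketch that rebuilds the 2×2 table from the Table 2 counts and computes all four measures. The function names are my own, and `high_weight` is a purely hypothetical enrichment factor (how many times more likely a high-risk patient was to be enrolled than a low-risk one); neither article reports such a number, so the reweighted figures are illustrative only.

```python
# Counts taken from Table 2 of the Lancet article, as used in the bullets above.
HIGH_RISK_N, HIGH_RISK_AF = 633, 48
LOW_RISK_N, LOW_RISK_AF = 370, 6


def two_by_two(high_n, high_af, low_n, low_af):
    """Return (tp, fp, fn, tn), treating 'high risk' as a positive test."""
    tp = high_af
    fp = high_n - high_af
    fn = low_af
    tn = low_n - low_af
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Standard accuracy measures for a binary test."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def reweighted_metrics(tp, fp, fn, tn, high_weight):
    """Down-weight the high-risk row by a HYPOTHETICAL enrichment factor.

    If high-risk patients were `high_weight` times as likely to be enrolled
    as low-risk patients, dividing the high-risk counts by that factor gives
    a rough picture of the table in the unenriched eligible population.
    """
    return metrics(tp / high_weight, fp / high_weight, fn, tn)


observed = metrics(*two_by_two(HIGH_RISK_N, HIGH_RISK_AF, LOW_RISK_N, LOW_RISK_AF))
adjusted = reweighted_metrics(
    *two_by_two(HIGH_RISK_N, HIGH_RISK_AF, LOW_RISK_N, LOW_RISK_AF),
    high_weight=2.0,  # assumed value, for illustration only
)

for name in observed:
    print(f"{name:12s} observed={observed[name]:.3f}  adjusted={adjusted[name]:.3f}")
```

With the assumed factor of 2, sensitivity falls from 0.889 to 0.800 and specificity rises from 0.384 to 0.554, while PPV and NPV do not move at all. Sampling that depends only on the AI-ECG result leaves the predictive values intact but distorts sensitivity and specificity, which is the direction of bias described in the last bullet.
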

The question posed in the Introduction was whether AI-ECG has value in further stratifying patients in terms of AF risk beyond the risk stratification provided by age and other clinical factors. CHARGE-AF is a risk score based on clinical factors, and the Lancet article reports on the discriminative accuracy of CHARGE-AF for comparison, but there’s this puzzling sentence: “Adding CHARGE-AF on top of the AI-ECG risk score did not improve the discrimination compared with the AI alone.” Shouldn’t they be looking at whether adding AI-ECG on top of CHARGE-AF improves discrimination compared to CHARGE-AF alone?

Comparison of AI-ECG screening to usual care

This is the part that really baffles me. Propensity scoring was used to match each study participant with an eligible-but-not-enrolled patient who received “usual care” (as a result of not being enrolled). The purpose of the propensity score matching would be to create a kind of simulated randomized trial, where patients were randomized to the trial intervention or usual care. The probability of AF detection/diagnosis is then compared between those receiving the trial intervention and those receiving usual care. The Lancet article’s analysis does this comparison within groups defined by AI-ECG risk (high, low). The trial participants had a significantly higher rate of AF diagnosis than the usual care patients among those with high AI-ECG risk, but not among those with low AI-ECG risk.
- But what is the trial intervention for which the effect on AF diagnosis is being estimated here? It’s monitoring a patient with a continuous heart monitor. Of course more cases of AF will be diagnosed if every patient is monitored that way! But the whole point of this investigation is that it’s too cumbersome to screen a large number of patients with this kind of continuous monitoring.
- A news article³ describing the study results says, “An AI-guided targeted screening strategy is effective in detecting new cases of atrial fibrillation that would not have come to attention in routine clinical care.” No — continuous heart monitoring detected new cases of AF that wouldn’t have come to attention in routine care.
- Any analysis that looks at the effectiveness of a screening strategy for identifying clinically actionable AF cannot be based on a simple objective of maximizing the yield of AF cases — that’s easy, just do intensive monitoring of everyone. There has to be some kind of cost-benefit analysis that also considers the number of patients who get more intensive monitoring or evaluation.
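
As a rough illustration of the simplest ingredient of such a cost-benefit analysis, the sketch below uses the Table 2 counts (48 AF cases among 633 high-risk participants, 6 among 370 low-risk participants) to compute how many patients had to wear a monitor per AF case found under three strategies. The strategy labels and function name are mine, and because the cohort may have been enriched with high-risk patients, the "monitor everyone" row describes the trial cohort rather than the eligible population. It says nothing about costs, harms, or whether the detected AF is clinically actionable.

```python
def monitored_per_case(n_monitored, n_af):
    """Patients monitored for each AF case detected."""
    return n_monitored / n_af


# (patients monitored, AF cases detected), from Table 2 of the Lancet article.
strategies = {
    "monitor AI-ECG high-risk only": (633, 48),
    "monitor AI-ECG low-risk only": (370, 6),
    "monitor everyone in the cohort": (633 + 370, 48 + 6),
}

total_cases = 48 + 6

for label, (n_monitored, n_af) in strategies.items():
    per_case = monitored_per_case(n_monitored, n_af)
    missed = total_cases - n_af
    print(
        f"{label:32s} monitored={n_monitored:5d}  cases={n_af:3d}  "
        f"monitored per case={per_case:5.1f}  cases missed={missed}"
    )
```

Monitoring only the high-risk group needs about 13 monitored patients per case found, compared with about 19 when everyone in the cohort is monitored, at the price of missing 6 of the 54 cases. A trade-off of this kind, rather than raw yield, is what an evaluation of the screening strategy would need to weigh.
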

1. Noseworthy, PA, Attia, ZI, Behnken, EM, et al. (2022), “Artificial intelligence-guided screening for atrial fibrillation using electrocardiogram during sinus rhythm: a prospective non-randomised interventional trial,” The Lancet 400, 1206–1212, https://doi.org/10.1016/S0140-6736(22)01637-3
2. Yao, X, Attia, ZI, Behnken, EM, et al. (2021), American Heart Journal 239, 73–79, https://doi.org/10.1016/j.ahj.2021.05.006
3. https://newsnetwork.mayoclinic.org/discussion/ai-guided-screening-uses-ecg-data-to-detect-a-hidden-risk-factor-for-stroke/