Key Takeaways
Extended Takeaways
- Performance improved as training data accumulated across the 3 studies, with AUC rising from 0.77 (95% CI, 0.679–0.859) in the pilot dataset to between 0.85 and 0.98 across the available test sets when the model was trained on all available data.
- At the operating point reported in the discussion, the model yielded sensitivity of 0.820 and specificity of 0.821, suggesting a balanced threshold for remote triage rather than a one-sided rule-in or rule-out tool.
- Interrater agreement among trained reviewers was only fair before adjudication, with average Cohen κ of 0.37 ± 0.05 and Fleiss κ of 0.35 in Study 2. The model’s Cohen κ of 0.51 on the same data, increasing to 0.61 with the full dataset, suggests that AI may reduce the variability inherent in visual TD assessments.
- Feasibility in unsupervised settings remains a practical issue: in Study 3, 17% of participants were excluded because of poor-quality video, so workflows using this approach should include real-time capture guidance and a fallback plan for in-person or telehealth AIMS when video is inadequate.
- The algorithm predicts a continuous total AIMS-based risk score, which can be grouped into binary or multilevel outputs. This may help clinicians tailor monitoring intensity for patients with different pretest probabilities of TD rather than relying on a single yes/no screen.
- This protocol does not directly assess legs, feet, or toes, so clinicians should be cautious about false negatives when lower-extremity movements are the main manifestation and should maintain in-person examination when suspicion remains high despite a low-risk video result.
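
For readers who want to see how the statistics in these takeaways are computed, the following is a minimal Python sketch, not the study's implementation. It shows sensitivity and specificity at one threshold on a continuous score, Cohen κ for two raters, and the grouping of a continuous score into risk tiers. The function names, the cutoffs (2.0 and 5.0), the tier labels, and all data values are hypothetical and chosen only for illustration; none of them come from the study.

```python
from bisect import bisect_right
from collections import Counter


def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when score >= threshold is flagged.

    labels: 1 = TD present per the reference standard, 0 = absent.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


def cohen_kappa(rater_a, rater_b):
    """Cohen kappa: agreement between two raters, corrected for chance.

    Undefined (division by zero) if chance agreement is exactly 1.
    """
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


def risk_tier(score, cutoffs=(2.0, 5.0), names=("low", "intermediate", "high")):
    """Map a continuous score to a tier; cutoffs here are hypothetical."""
    return names[bisect_right(cutoffs, score)]


# Toy data for illustration only (not study data).
toy_scores = [0.5, 1.0, 3.0, 6.0, 7.5, 2.5, 4.0, 8.0]
toy_labels = [0, 0, 1, 1, 1, 0, 0, 1]
sens, spec = sensitivity_specificity(toy_scores, toy_labels, threshold=3.5)

toy_rater_a = [1, 1, 0, 0, 1, 0, 1, 0]
toy_rater_b = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(toy_rater_a, toy_rater_b)

tiers = [risk_tier(s) for s in toy_scores]
```

Moving the threshold in `sensitivity_specificity` trades sensitivity against specificity, which is why a single reported operating point describes only one of many possible triage configurations. The multilevel grouping in `risk_tier` is one simple way to present the same continuous score with more than two monitoring categories.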