Frequently Asked Questions
-
The model showed high discrimination for suspected tardive dyskinesia (TD) when trained on all available data, with AUC values ranging from 0.85 to 0.98 across the available test sets. Performance improved as more training data were added, rising from an initial AUC of 0.77 (95% CI, 0.679-0.859) in Study 1 to 0.98 on the Study 1 test data using the full model trained on all 3 studies.
At the operating point reported in the discussion, the model's sensitivity was 0.820 and specificity was 0.821 when the threshold was set at 5.1. The study's primary comparison standard was TD presence or absence defined by AIMS ratings from trained raters.
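As a rough illustration of how an operating point like this is computed, the sketch below binarizes a continuous score at a threshold and derives sensitivity and specificity against reference labels. The function name, the use of `>=` as the cut rule, and the toy scores and labels are assumptions made for illustration; they are not the study's data or code.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float],
    has_td: Sequence[bool],
    threshold: float = 5.1,
) -> Tuple[float, float]:
    """Binarize scores at `threshold` and compare with reference labels.

    Assumption: a score at or above the threshold counts as "suspected TD".
    The source reports the threshold (5.1) but not the exact comparison rule.
    """
    if len(scores) != len(has_td):
        raise ValueError("scores and labels must have the same length")

    tp = fn = tn = fp = 0
    for score, truth in zip(scores, has_td):
        predicted = score >= threshold
        if truth and predicted:
            tp += 1
        elif truth and not predicted:
            fn += 1
        elif not truth and predicted:
            fp += 1
        else:
            tn += 1

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one TD and one non-TD reference label")

    sensitivity = tp / (tp + fn)  # share of reference TD cases flagged
    specificity = tn / (tn + fp)  # share of reference non-TD cases cleared
    return sensitivity, specificity


# Hypothetical toy data, not taken from the study.
toy_scores = [7.2, 5.5, 4.9, 6.1, 2.0, 3.3, 5.3, 1.1]
toy_labels = [True, True, True, True, False, False, False, False]
sens, spec = sensitivity_specificity(toy_scores, toy_labels)
```

Raising the threshold in a sketch like this trades sensitivity for specificity, and lowering it does the reverse.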
-
Yes. In these 3 studies, self-administered smartphone-recorded videos provided data that an AI model used to identify suspected tardive dyskinesia in people taking antipsychotic medications. The final dataset included 351 participants and 3,979 video responses, and the model achieved AUC values of 0.85 to 0.98 across available test sets when trained on all available data.
The authors frame this as a remote screening and monitoring approach, not a standalone diagnostic method. They state that when the algorithm identifies suspected TD, a psychiatrist should perform the diagnostic evaluation and determine next steps.
-
No. The article explicitly states that the remote AI method does not replace an in-person assessment by a physician for definitive diagnosis and management. It is intended to augment initial remote screening and help identify patients who should receive clinician follow-up.
The authors also note that a health care professional's evaluation is essential to confirm the diagnosis required for prescribing treatment. In practice, a positive AI result is presented as a prompt for psychiatrist assessment rather than a final diagnosis.
-
In this study, the AI model showed agreement that matched or exceeded trained human raters. In Study 2, reviewers' initial agreement on binary TD status was limited, with an average Cohen κ of 0.37 ± 0.05 and a Fleiss κ of 0.35; after discussion and reassessment, average Cohen κ improved to 0.57 ± 0.03 and Fleiss κ to 0.58.
Using the same data, the machine learning model achieved a Cohen κ of 0.51, and its Cohen κ increased to 0.61 when the full dataset was used. The authors interpret this as stronger and more reliable agreement than the human raters' initial assessments.
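For readers unfamiliar with the agreement statistic, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. The sketch below is a minimal implementation of that standard formula; the function name and the two toy rating lists are hypothetical and do not reproduce the study's ratings.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary TD ratings (1 = TD, 0 = no TD) for ten videos.
toy_rater_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
toy_rater_b = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
toy_kappa = cohen_kappa(toy_rater_a, toy_rater_b)
```

An average Cohen κ such as the ones quoted above would be the mean of this value over rater pairs; Fleiss κ is a separate multi-rater statistic and is not covered by this sketch.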
-
The studies used smartphone videos collected through an app that guided participants through a standardized protocol. In Study 3, which simplified the protocol for home use, the assessment included 4 steps: 15 seconds of tapping a hand on the shoulder, 30 seconds of opening the mouth and sticking out the tongue followed by 30 seconds of sitting still, and answers to 2 open-ended questions.
Studies 1 and 2 also captured standard AIMS-related video elements and open-ended speech responses, with the number of open-ended questions reduced from 6 in Study 1 to 3 in Study 2. The model analyzed video and audio data from these responses.
-
According to the article, model performance was consistently high across the demographic subgroups analyzed. The authors assessed heterogeneity of treatment effect across gender, ethnicity, and age, and reported low heterogeneity and uniformly high predictiveness across those splits.
The paper does not present this as proof of no bias in every setting, but it does report that performance remained strong across the demographic groups examined in these studies.
-
- Video quality was a major practical limitation. In Study 3, 72 participants, or 17% of the evaluation dataset, were excluded because video quality was below the threshold required for analysis.
- The protocol does not directly assess legs, feet, or toes, so isolated lower-extremity movements may be missed.
- The tool cannot function independently as a diagnostic test; clinician evaluation is still required to confirm TD.
- Some users may have unresolvable problems related to network, camera, environment, or ability to follow instructions, in which case the article recommends referral for an in-person or telehealth AIMS.
-
The model predicted a continuous AIMS-based risk score rather than only a binary result. Specifically, it predicted the total AIMS score, defined in the paper as the sum of the 7 observed body regions, and that continuous score was then converted into a binary TD versus no-TD prediction for reporting AUC, sensitivity, and specificity.
The authors note that this continuous output could also support multiple thresholds, such as low, medium, and high risk. They suggest this flexibility may allow calibration toward sensitivity or specificity depending on the clinical population's pretest probability of TD.
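The multiple-threshold idea can be sketched as a simple mapping from the continuous score to a risk tier. The cut points below (3.0 and 7.0) and the function name are purely hypothetical placeholders; the source does not report tier boundaries, and real values would need to be calibrated for the target population.

```python
def risk_tier(score: float, low_cut: float = 3.0, high_cut: float = 7.0) -> str:
    """Map a continuous predicted score to a coarse risk tier.

    Assumption: both cut points are illustrative defaults, not values
    from the study. Scores below `low_cut` are "low", scores from
    `low_cut` up to but not including `high_cut` are "medium", and
    scores at or above `high_cut` are "high".
    """
    if low_cut >= high_cut:
        raise ValueError("low_cut must be smaller than high_cut")
    if score < low_cut:
        return "low"
    if score < high_cut:
        return "medium"
    return "high"


# Hypothetical predicted scores for three participants.
toy_tiers = [risk_tier(s) for s in (1.0, 5.1, 9.0)]
```

Lowering the cut points would flag more people for follow-up (favoring sensitivity), while raising them would flag fewer (favoring specificity), which is the kind of calibration the authors describe.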
-
This study matters because tardive dyskinesia monitoring is a standard of care, but routine in-person assessments are difficult to deliver at the recommended frequency for every patient taking antipsychotics. The authors note that clinicians may need to perform TD monitoring as often as 2 to 4 times per year, and later discuss the challenge of providing the 4 to 6 annual assessments needed to meet standard care expectations.
In that context, a remote video-based AI screen may help identify patients who need more urgent clinician AIMS evaluation, especially in telemedicine-first care settings where time-consuming safety monitoring is often harder to maintain.