
Data released on March 07, 2017

Supporting data for "The need to approximate the use-case in clinical machine learning"

Saeb S; Lonini L; Jayaraman A; Mohr DC; Kording KP (2017): Supporting data for "The need to approximate the use-case in clinical machine learning". GigaScience Database.

The availability of smartphone and wearable sensor technology is leading to a rapid accumulation of human subject data, and machine learning is emerging as a technique to map those data into clinical predictions. As machine learning algorithms are increasingly used to support clinical decision making, it is vital to reliably quantify their prediction accuracy. Cross-validation, in which an algorithm's accuracy is evaluated on data it has not seen during training, is the standard approach. However, for this procedure to be meaningful, the relationship between the training and validation sets should mimic the relationship between the training set and the data expected in clinical use. Here we compared two popular cross-validation methods: record-wise and subject-wise. The subject-wise procedure mirrors the clinically relevant use-case scenario of diagnosing/identifying patterns in newly recruited subjects; the record-wise strategy has no such interpretation.
Using both a publicly available dataset and a simulation, we found that record-wise cross-validation often massively overestimates the prediction accuracy of the algorithms. We also conducted a systematic review of the relevant literature, and found that this overly optimistic method is used by almost half of the retrieved studies that used accelerometers, wearable sensors, or smartphones to predict clinical outcomes.
As we move towards an era of machine-learning-based diagnosis and treatment, using proper methods to evaluate the accuracy of these algorithms is crucial, as overly optimistic results can mislead both clinicians and data scientists.


Additional information:

  • Funding body - National Institutes of Health; Award ID - 5R01NS063399
  • Funding body - National Institutes of Health; Award ID - P20MH090318
  • Funding body - National Institutes of Health; Award ID - R01MH100482

Files: (FTP site)



File Type         File Format  Size       Release Date
GitHub archive    archive      232.63 KB  2017-02-28
tabular data      EXCEL        18.81 KB   2017-02-28
Matlab data file  MATLAB       9.19 KB    2017-02-28
Matlab data file  MATLAB       9.95 KB    2017-02-28
Matlab data file  MATLAB       10.44 KB   2017-02-28
Readme            TEXT         2.95 KB    2017-02-28

Displaying 1-6 of 6 file(s).


