By Ryan Yost
Why brainwave-based sleep tracking is more accurate than other wearables
If you're reading this blog, you've likely purchased a sleep tracker in the past (a watch, a ring, or a band, to name a few). But are all sleep trackers created equal? What kinds of signals from the body are needed to track your sleep accurately?
The gold standard for sleep tracking is polysomnography (PSG) [1, 2], which is recorded in a sleep lab and later scored by a trained sleep technician. A PSG setup measures brainwaves (EEG), heart rate (ECG and/or PPG), movement (accelerometer), muscle activity (EMG), blood oxygen levels (pulse oximeter), eye movement (EOG), breathing rate (chest straps), and sometimes even more signals. I’m willing to bet most of you don’t have or want this setup: it’s prohibitively expensive (at least $10,000 for the device, plus around $40 per hour for a technician to score the data) and cumbersome to sleep with. See for yourself below… look at all those wires!
Elemind’s VP of Science and Research, Dr. Ryan Neely, demonstrates the cumbersome nature of a full PSG system, the gold standard for sleep data [1].
So, how do tech companies and researchers score a user’s sleep without a full PSG setup and without sending the data off to be professionally scored? They record PSG data alongside wearable device data, then train an algorithm to mimic the scoring done by a sleep technician.
How good are these algorithms really, and are some wearables better than others? When evaluating wearable sleep algorithms, there are two important factors to consider: the number of stages and the Cohen’s Kappa score. The number of stages tells you how much resolution the model provides: can it only distinguish between wake and sleep, or can it differentiate all five stages of sleep (wake, N1, N2, N3 (deep sleep), and REM)? Cohen’s Kappa is a measure of the algorithm’s ‘accuracy’, i.e., how well the model distinguishes between sleep stages. More precisely, it measures the rate of agreement: how closely the algorithm’s predicted sleep stages align with the true sleep stages as determined by the sleep technician.
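For reference, here is the standard definition (general statistics, not specific to any one sleep paper): if p_o is the observed agreement between the algorithm and the technician, and p_e is the agreement expected by chance given each rater’s overall stage distribution, then

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

A Kappa of 1 means perfect agreement, while 0 means the algorithm does no better than guessing stages at their base rates.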
Note: Looking at accuracy alone isn’t sufficient. For example, in a sleep study with 500 epochs, a participant may spend 400 epochs in light sleep, 60 in deep sleep, 30 in REM, and 10 awake. If a silly algorithm predicts ‘light sleep’ for every epoch, it achieves a high accuracy (80%) simply because light sleep is the majority stage, yet it completely misses critical stages like deep sleep, REM, and wake, which are vital for assessing sleep quality. Its Kappa, by contrast, works out to exactly 0, because the observed agreement (0.80) equals the agreement expected by chance (0.80). That is why Cohen’s Kappa is the better measure for evaluating sleep algorithms [3]; it rewards agreement across all stages of sleep, not just the easiest ones to identify.
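Here’s a minimal runnable check of that example in Python (using scikit-learn’s standard metrics; the epoch counts are the hypothetical ones above):

```python
# Why accuracy alone is misleading: the hypothetical 500-epoch night above.
from sklearn.metrics import accuracy_score, cohen_kappa_score

truth = ["light"] * 400 + ["deep"] * 60 + ["rem"] * 30 + ["wake"] * 10
silly = ["light"] * 500  # always predict the majority stage

print(accuracy_score(truth, silly))     # 0.80 -- looks respectable
print(cohen_kappa_score(truth, silly))  # 0.0  -- no agreement beyond chance
```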
The first generation of popular sleep wearables consisted of wrist-worn devices that rely solely on movement, such as the Fitbit Classic and Philips Actiwatch. However, being motionless does not always mean you're asleep, so these devices often don’t align well with expert scorers. Due to this limitation, they only aim to differentiate between sleep and wake, grouping all sleep stages together. On the algorithm side, they typically use basic arithmetic: a weighted sum of movement over a window of epochs, compared against a threshold [4]. Essentially, if there is significant movement, the device labels the epoch as wake; if movement is minimal, it labels it as sleep.
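As a rough illustration, here is a sketch of that style of scoring (the window weights and threshold below are invented for the example, loosely in the spirit of classic actigraphy formulas; real devices use calibrated coefficients):

```python
import numpy as np

def actigraphy_sleep_wake(counts, weights=(0.04, 0.2, 1.0, 0.2, 0.04), threshold=1.0):
    """Label each epoch wake (1) or sleep (0) from wrist activity counts.

    A weighted sum over a centered window of epochs is compared to a
    threshold: lots of surrounding movement means wake, little means sleep.
    """
    padded = np.pad(np.asarray(counts, dtype=float), 2, mode="edge")
    windowed = np.convolve(padded, np.asarray(weights)[::-1], mode="valid")
    return (windowed > threshold).astype(int)

# A burst of movement mid-recording gets labeled wake, the rest sleep.
print(actigraphy_sleep_wake([0, 0, 0, 8, 9, 7, 0, 0, 0]))
# -> [0 0 1 1 1 1 0 0 0]
```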
The next generation of sleep wearables incorporated heart rate metrics and are typically worn on the wrist or finger. Devices like the WHOOP, OURA Ring, Apple Watch, Garmin, and others use photoplethysmography (PPG) or electrocardiography (ECG) to track heart rate, combining it with motion data to estimate sleep stages. These devices can distinguish between four stages of sleep: wake, light, deep, and REM. Many use a gradient boosting machine [5], a relatively lightweight machine learning model, for this classification. However, since heart rate (lower heart rates indicate a higher likelihood of sleep) and motion-based metrics aren’t perfect indicators of sleep, these devices still have their limitations (more on this later).
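In outline, such a pipeline might look like the sketch below (the three features and the synthetic data are placeholders for illustration, not any vendor’s actual feature set):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(seed=0)

# Placeholder per-epoch features: mean heart rate, heart-rate variability
# (e.g., RMSSD from PPG inter-beat intervals), and a movement count.
X = rng.normal(size=(2000, 3))
y = rng.integers(0, 4, size=2000)  # 0 = wake, 1 = light, 2 = deep, 3 = REM

# In practice X and y would come from nights recorded with simultaneous PSG,
# with y taken from the technician's scoring.
model = GradientBoostingClassifier(n_estimators=100, max_depth=3)
model.fit(X, y)
predicted_stages = model.predict(X[:10])  # one label per 30-second epoch
```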
The future of sleep wearables is brainwave-based. EEG devices, like the Elemind headband or the Muse, can distinguish between all five stages of sleep (wake, N1, N2, N3, and REM) with far greater accuracy than wrist-worn devices. Brainwaves are the most critical component of a PSG system, offering the rich data that algorithms need to precisely predict sleep stages. This EEG data is often processed through deep neural networks, such as convolutional neural networks (CNNs), to predict sleep stages, with a hidden Markov model (HMM) providing temporal context for even more accurate predictions [6].
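To make that architecture concrete, here is a toy sketch (PyTorch; not Elemind’s or Muse’s actual model, and the layer sizes, sampling rate, and transition matrix are all invented). A small 1-D CNN scores each 30-second EEG epoch on its own, then a Viterbi pass over an HMM transition matrix adds temporal context, so an implausible one-epoch jump (say, N3 straight to wake and back) can be overruled by its neighbors:

```python
import numpy as np
import torch
import torch.nn as nn

N_STAGES = 5  # wake, N1, N2, N3, REM

class EpochCNN(nn.Module):
    """Per-epoch stage logits from one channel of raw EEG (toy sizes)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=50, stride=6), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=8, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, N_STAGES),
        )

    def forward(self, x):  # x: (n_epochs, 1, samples_per_epoch)
        return self.net(x)

def viterbi(log_emissions, log_trans, log_prior):
    """Most likely stage sequence given per-epoch log-probabilities."""
    T, S = log_emissions.shape
    dp = log_prior + log_emissions[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = dp[:, None] + log_trans  # cand[i, j]: score of stage i -> j
        back[t] = cand.argmax(axis=0)
        dp = cand.max(axis=0) + log_emissions[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy run: 20 epochs of 30 s EEG at 125 Hz; "sticky" transitions favor
# staying in the current stage, smoothing out one-epoch blips.
eeg = torch.randn(20, 1, 30 * 125)
log_probs = torch.log_softmax(EpochCNN()(eeg), dim=1).detach().numpy()
sticky = np.full((N_STAGES, N_STAGES), 0.05 / (N_STAGES - 1))
np.fill_diagonal(sticky, 0.95)
stages = viterbi(log_probs, np.log(sticky), np.log(np.full(N_STAGES, 0.2)))
```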
The Elemind headband not only tracks sleep with state-of-the-art accuracy, it also helps you fall asleep using closed-loop auditory neuromodulation [7].
To demonstrate just how much better brainwave-based algorithms are compared to their heart rate and movement-based counterparts, I found a review paper [3] from a peer-reviewed scientific journal that analyzed hundreds of algorithms used to track sleep. The findings were clear: not only can brainwave-based algorithms distinguish between all five sleep stages (wake, N1, N2, N3, REM) instead of just four (wake, light, deep, REM), but they also achieve significantly higher agreement with expert scorers. The nine brainwave-based algorithms had an average Cohen’s Kappa score of 0.70 (standard deviation of 0.10), compared to an average Kappa score of 0.54 (standard deviation of 0.09) for the eight heart rate and movement-based algorithms [3].
To put this in perspective: “Cohen suggested the Kappa result be interpreted as follows: values ≤ 0 as indicating no agreement and 0.01–0.20 as none to slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as almost perfect agreement.” [8] This means that brainwave-based algorithms have substantial agreement with expert scorers, while heart rate and movement-based algorithms only show moderate agreement.
Brainwave-based sleep algorithms not only distinguish between all five sleep stages (wake, N1, N2, N3, REM) compared to just four (wake, light, deep, REM) detected by heart rate and movement-based devices, but they also achieve greater agreement with expert scorers (brainwave algorithms: avg Kappa = 0.70, std = 0.10, n = 9; HR + movement algorithms: avg Kappa = 0.54, std = 0.09, n = 8) [3].
So, how important is having substantially accurate sleep tracking to you? Want to fall asleep significantly faster too [7]? Elemind may be right for you! Check it out for yourself at elemindtech.com.
WORKS CITED
[1] Birrer, V., Elgendi, M., Lambercy, O. et al. Evaluating reliability in wearable devices for sleep staging. npj Digit. Med. 7, 74 (2024). https://doi.org/10.1038/s41746-024-01016-9
[2] “Polysomnography,” MedlinePlus Medical Encyclopedia. [Online]. Available: https://medlineplus.gov/ency/article/003932.htm. [Accessed: 17-Feb-2023]
[3] Imtiaz SA. A Systematic Review of Sensing Technologies for Wearable Sleep Staging. Sensors. 2021; 21(5):1562. https://doi.org/10.3390/s21051562
[4] Palotti, J., Mall, R., Aupetit, M. et al. Benchmark on a large cohort for sleep-wake classification with machine learning techniques. npj Digit. Med. 2, 50 (2019). https://doi.org/10.1038/s41746-019-0126-9
[5] Altini M, Kinnunen H. The Promise of Sleep: A Multi-Sensor Approach for Accurate Sleep Stage Detection Using the Oura Ring. Sensors. 2021; 21(13):4302. https://doi.org/10.3390/s21134302
[6] Nguyen, A., Pogoncheff, G., Dong, B.X. et al. A comprehensive study on the efficacy of a wearable sleep aid device featuring closed-loop real-time acoustic stimulation. Sci Rep 13, 17515 (2023). https://doi.org/10.1038/s41598-023-43975-1
[7] Bressler, S., Neely, R., Yost, R.M. et al. A randomized controlled trial of alpha phase-locked auditory stimulation to treat symptoms of sleep onset insomnia. Sci Rep 14, 13039 (2024). https://doi.org/10.1038/s41598-024-63385-1
[8] McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). 2012;22(3):276-82. PMID: 23092060; PMCID: PMC3900052.