Open Access
Article  |   October 2024
The visual experience dataset: Over 200 recorded hours of integrated eye movement, odometry, and egocentric video
Author Affiliations & Notes
Footnotes:
BJB, MDL, and PRM contributed equally to this work and are listed in alphabetical order.
KB, PAH, RM, BS, and CBS contributed equally to this work and are listed in alphabetical order.
KC, SH, HH, MJ, AL, AM, AM, IN, EP, PR, BS, MWS, WS, BS, JMT, and EW contributed equally to this work and are listed in alphabetical order.
Journal of Vision October 2024, Vol.24, 6. doi:https://doi.org/10.1167/jov.24.11.6

      Michelle R. Greene, Benjamin J. Balas, Mark D. Lescroart, Paul R. MacNeilage, Jennifer A. Hart, Kamran Binaee, Peter A. Hausamann, Ronald Mezile, Bharath Shankar, Christian B. Sinnott, Kaylie Capurro, Savannah Halow, Hunter Howe, Mariam Josyula, Annie Li, Abraham Mieses, Amina Mohamed, Ilya Nudnou, Ezra Parkhill, Peter Riley, Brett Schmidt, Matthew W. Shinkle, Wentao Si, Brian Szekely, Joaquin M. Torres, Eliana Weissmann; The visual experience dataset: Over 200 recorded hours of integrated eye movement, odometry, and egocentric video. Journal of Vision 2024;24(11):6. https://doi.org/10.1167/jov.24.11.6.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

We introduce the Visual Experience Dataset (VEDB), a compilation of more than 240 hours of egocentric video combined with gaze- and head-tracking data that offer an unprecedented view of the visual world as experienced by human observers. The dataset consists of 717 sessions, recorded by 56 observers ranging from 7 to 46 years of age. This article outlines the data collection, processing, and labeling protocols undertaken to ensure a representative sample and discusses the potential sources of error or bias within the dataset. The VEDB's potential applications are vast, including improving gaze-tracking methodologies, assessing spatiotemporal image statistics, and refining deep neural networks for scene and activity recognition. The VEDB is accessible through established open science platforms and is intended to be a living dataset with plans for expansion and community contributions. It is released with an emphasis on ethical considerations, such as participant privacy and the mitigation of potential biases. By providing a dataset grounded in real-world experiences and accompanied by extensive metadata and supporting code, the authors invite the research community to use and contribute to the VEDB, facilitating a richer understanding of visual perception and behavior in naturalistic settings.

Introduction and motivation
Visual perception is shaped by experience in multiple ways. For example, the tuning properties of visual neurons reflect color and orientation statistics of natural visual environments (Lee, Wachtler, & Sejnowski, 2002; Olshausen & Field, 1996). Detection and recognition performance depends on the relative frequency of specific stimuli in an observer's experience, for example, colors (Juricevic & Webster, 2009; Webster, Mizokami, & Webster, 2007), faces (Meissner & Brigham, 2001), and object configurations within scenes (Biederman, Mezzanotte, & Rabinowitz, 1982; Davenport & Potter, 2004; Greene, Botros, Beck, & Fei-Fei, 2015). These findings support the hypothesis that the brain leverages the statistical redundancies in stimuli to enable efficient behavior (Attneave, 1954; Barlow, 2001). Thus, a complete characterization of the mechanisms of our visual system relies on a thorough understanding of the natural statistics of visual experience. 
The basic statistics of photographs (contrast, edge orientations, luminance, color, and Fourier amplitude spectra) have been extensively studied (Geisler, 2008; Hansen, Essock, Zheng, & DeFord, 2003; Harrison, 2022; Howe & Purves, 2004; Lee et al., 2002; Olshausen & Field, 1996; Tolhurst, Tadmor, & Chao, 1992; Torralba & Oliva, 2003). Additional work has sought to characterize mid-level statistics such as texture, scale, and depth (Held, Cooper, & Banks, 2012; Long, Konkle, Cohen, & Alvarez, 2016; Nishida, 2019; Portilla & Simoncelli, 2000; Ruderman, 1994; Sato, Kingdom, & Motoyoshi, 2019; Su, Cormack, & Bovik, 2013; Torralba & Oliva, 2002), as well as higher-order statistics, such as distributions of objects and their locations in the image plane (Greene, 2013; Zhou, Lapedriza, Khosla, Oliva, & Torralba, 2018) and in depth (Adams et al., 2016; Song, Lichtenberg, & Xiao, 2015). 
Despite these successes, several methodological limitations constrain the insights we can obtain from these studies. First, the vast majority of work on natural scene statistics is on static photographs, while visual experience unfolds over time. Although some studies of spatiotemporal scene statistics exist (Dong & Atick, 1995; DuTell, Gibaldi, Focarelli, Olshausen, & Banks, 2024; Rao & Ballard, 1998), we know comparatively little about the statistics of the motion experienced by the visual system, the extent to which this motion results from self-movement or the movement of other objects and surfaces, or the extent to which motion cues may contribute to perceptual invariants for recognition (Gibson, 1986). Second, many extant datasets are comparatively small or narrowly sample a small number of environments (e.g., Betsch, Einhäuser, Körding, & König, 2004). These datasets do not capture the full range of our visual experiences and thus may miss features of the natural world that affect perception. More troublingly, small datasets are particularly subject to bias (Tommasi, Patricia, Caputo, & Tuytelaars, 2017; Torralba & Efros, 2011). That is, datasets capture unintended covariance related to the chosen semantic labels, as well as image-specific preferences such as viewpoint and lighting conditions. A related concern is that many datasets are sampled broadly from Internet images (Deng et al., 2009; Greene, 2013; Krishna et al., 2017; Lin et al., 2014; Xiao, Ehinger, Hays, Torralba, & Oliva, 2014; Zhou et al., 2018). Internet-derived datasets have been critiqued for representing a narrow range of geographical locations (DeVries, Misra, Wang, & van der Maaten, 2019), viewpoints (Tseng, Carmi, Cameron, Munoz, & Itti, 2009), and socioeconomic circumstances (Tolia-Kelly, 2016); for containing harmful content such as pornography (Prabhu & Birhane, 2020); and for over-representing content that perpetuates race and gender stereotypes (Hirota, Nakashima, & Garcia, 2022; Kay, Matuszek, & Munson, 2015; Otterbacher, Bates, & Clough, 2017; Zhao, Wang, & Russakovsky, 2021). Finally, human vision is an active process. To fully characterize natural scene statistics, it is necessary to determine what observers fixate in the world rather than simply what falls in front of their head. 
Although several large video datasets have been developed by scientists in computer vision (Table 1), none have been designed with these issues at the forefront. Many existing datasets provide dynamic views of a range of human actions, for example, but most are not taken from an egocentric perspective, and very few include the location of gaze or head (or camera) position. Thus, despite the impressive scope of some datasets (Grauman et al., 2022), it is unclear from these data, for example, where objects typically fall in the visual field, because little or no gaze data were collected. By contrast, several studies have collected samples of mobile eye tracking data (e.g., Hayhoe & Ballard, 2014; Kothari et al., 2020; Peterson, Lin, Zaun, & Kanwisher, 2016; Sprague, Cooper, Tošić, & Banks, 2015), but these have tended to focus on eye movements in particular circumstances, such as specific environments (Matthis, Yates, & Hayhoe, 2018), specific tasks such as making food or coffee (Fathi, Hodgins, & Rehg, 2011; Hayhoe & Ballard, 2005), or specific stimuli such as faces (Peterson et al., 2016). Other mobile eye tracking datasets have been collected for developing robust gaze tracking pipelines (Fuhl, Kasneci, & Kasneci, 2021; Kothari et al., 2020). Although these datasets have provided useful insights into how gaze behavior varies across tasks, environments, and observers in constrained settings, they are comparatively small in scope and narrowly focused on the specific questions they set out to answer. With a larger and more diverse dataset, we can examine many more general questions: How many faces, bodies, and animals do people regularly see? How many vertical surfaces do we see compared with oblique or horizontal surfaces? How many different types of objects do people typically encounter, and where in the visual field do these objects land? How do we move our heads and eyes during different types of activities? How does this affect our visual input? What does the world, as experienced by acting human observers, really look like? 
Table 1.
Comparing related works. Symbols in the table indicate whether each feature is present, partially present, or absent.
In this work, we introduce the Visual Experience Dataset (VEDB), a source of more than 240 hours of egocentric video combined with gaze and head tracking. This dataset was collected across two universities and funded by the National Science Foundation (1920896 to M.R.G.). Following the recommendations of Gebru et al. (2021), we have organized this article to provide potential end users with information on the dataset's composition, collection, and processing and labeling procedures. We describe current and possible uses for this dataset and detail the distribution of the data and code that comprise it, as well as our plans to maintain the dataset. Ultimately, we hope that by sharing this resource with the vision science community, progress can be made on several topics at the intersection of visual ecology, embodied cognition, and visual perception. 
Methods
Dataset composition
The VEDB is composed of 717 egocentric video sessions recorded by 56 individual observers ranging in age from 7 to 46 years (22 female, 33 male, and 1 nonbinary). The Institutional Review Boards of Bates College, North Dakota State University, and University of Nevada Reno approved the recording protocols, and all participants (or their guardians, in the case of children) provided written informed consent. Participants were informed that recordings would be released publicly and were encouraged to get assent from others, such as family members or friends, who might appear in the recordings. This project started during the Severe Acute Respiratory Syndrome Coronavirus 2 pandemic when outside persons were prohibited on our campuses. Therefore, a sizeable number of recordings were made by the authors of this article, trainees in our laboratories, and the persons in our pandemic bubbles. Trainees were compensated their normal hourly rate or salary for data collection. Outside observers were compensated at rates determined by the common standard of each institution. 
The videos were recorded between October 2020 and August 2023 and ranged from 1 to 73 minutes in length (mean, 19 minutes). Each session is composed of three primary sensor streams: 1) first-person egocentric video from a head-mounted camera, 2) videos of the left and right eye for use in gaze tracking, and 3) information from a tracking camera, including accelerometry, odometry, and gyroscope for use in head tracking. In general, VEDB sessions include all three of these data streams, each of which may be of interest to researchers individually or may be combined to examine the interaction of visual experience with gaze and head and body movements. In addition, we provide processed gaze information for most sessions and an annotation file detailing the tasks and environments depicted in each session. 
There is considerable variability in the activities that were recorded. We recorded 351 sessions indoors and 278 outdoors. In total, 407 sessions were deemed active, with observers walking, jogging, skateboarding, or playing other sports, and 222 sessions depicted sedentary activities. Twelve of the 16 top-level categories from the American Time Use Survey were represented. These include personal care, household activities, caring for others, work, education, consumer activities, professional services, eating and drinking, leisure, sports, volunteer work, and travel. Figure 1B shows a word cloud of all recorded activities. The locations where our participants recorded also varied considerably. Of the 365 scene categories in the Places Database (Zhou et al., 2017), 124 are represented in the VEDB (Figure 1A). Because participants chose what to record, there is nothing in the dataset that, if viewed directly, is likely to be considered offensive, insulting, or threatening. 
Figure 1.
 
(A) Word cloud of recorded locations. (B) Word cloud of verbs from recorded tasks. In each, word size is roughly proportional to the number of recorded hours.
Collection process
Hardware
Eye tracking
We used an eye tracking module provided by the Pupil-Labs Core system, an open-source and portable device used in various research and commercial settings.1 The Core system includes binocular eye cameras with a 400 × 400 pixel resolution and a frame rate of 120 frames per second. Eye camera position can be adjusted per participant via a sliding extension arm attached to the device frame, and the Core system headset is lightweight (approximately 23 g). 
World camera
The Pupil-Labs Core system is available with a small world camera attached to the device frame. However, we attached a different world camera to the Core frame to support higher-resolution first-person video recording with a wider field of view and a global shutter to enable V-SLAM algorithms. A higher resolution video with a wider field of view contributes to richer recordings of first-person experience, including more of the visual periphery and the potential to measure high spatial frequency features in natural scenes. By using a camera with a global shutter, we can support more robust measurement of spatiotemporal features in first-person video, such as optic flow or linear filters tuned to specific passbands in the spatial and temporal domains. By comparison, a rolling shutter captures appearance over time by scanning the image row by row, leading to distortions in the appearance of objects that move or otherwise change over time, especially when those changes occur within the scan time of a single frame. We incorporated a FLIR Chameleon 3 (Wilsonville, OR) camera into the eye tracking device to meet these goals.2 This camera increases the headset's weight by 30 to 40 g, but addresses our spatial and temporal video resolution goals. The world camera recorded at a rate of 30 Hz at an image resolution of 3.1 megapixels (2,048 × 1,536 pixels). Some sessions (n = 166) used a 3.6-mm 1/2.5” lens with this camera. This lens enabled a horizontal field of view of 125°, but produced heavy fisheye distortion and lens vignetting. Other sessions (n = 551) used a 4-mm 1/1.8” lens, with a horizontal extent of 101°. This provided a more natural-looking video while still recording a larger field of view than is typical (Table 1). Figure 2 shows the relative vignetting of both lenses. We recorded 5 seconds of video of a uniform white wall, converted these frames to grayscale, and examined the relative pixel intensities across the visual field. Given the heavy vignetting of the 3.6-mm lens, we do not recommend measuring natural scene statistics in the periphery for these sessions. That said, the 3.1-megapixel resolution provides a sufficient number of pixels to compute image statistics from center-cropped portions. 
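A minimal sketch of this vignetting check is shown below: it averages grayscale intensity over rows of white-wall frames to obtain an intensity profile across horizontal image position. The file path is hypothetical, and the exact aggregation used for Figure 2 may differ.

```python
# Minimal sketch: average grayscale intensity of white-wall frames as a
# function of horizontal image position. The video path is hypothetical.
import cv2
import numpy as np

cap = cv2.VideoCapture("white_wall.mp4")
profiles = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(float)
    profiles.append(gray.mean(axis=0))          # average over rows: one value per column
cap.release()

profile = np.mean(profiles, axis=0)             # mean intensity per horizontal position
x = np.arange(profile.size) - profile.size / 2  # 0 = horizontal center of the image
print(profile[profile.size // 2], profile[0])   # center vs. left-edge intensity
```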
Figure 2.
 
Lens vignetting for 3.6- and 4.0-mm lenses. Five seconds of video of a uniform, white wall was recorded with each lens. The video frames were converted to grayscale, and the pixel intensities were plotted as a function of horizontal location. Zero represents the center of the image plane.
Tracking camera
We wished to record head movements in addition to egocentric video and eye movements to allow research questions linking natural scene statistics to head movements associated with walking, running, or other aspects of body movement related to visual exploration or the completion of everyday tasks. We used the Intel RealSense T265 as our head tracking module to record odometry information.3 This device has stereo wide-angle monochrome cameras and an inertial measurement unit consisting of a gyroscope and an accelerometer. Data from these sensors are used to infer the device (head) pose in the world via a proprietary visual-inertial simultaneous localization and mapping algorithm, which estimates linear and angular head position at 200 Hz; these estimates can be differentiated into velocity and acceleration. Furthermore, the T265 provides supplemental estimates of angular velocity and linear acceleration from the triaxial gyroscope and accelerometer on the device, with sampling rates of approximately 200 and 60 Hz, respectively. The T265 added 55 g to the weight of the recording device. 
Headset designs
We needed to design custom headsets to include each of these sensor components, mount them rigidly with respect to one another, and fit on observers’ heads stably and comfortably, allowing children and adults to wear them for an extended period while performing everyday activities. Through iterative design and redesign, we found that no single solution worked best for all participants in all activities. Adult participants varied substantially in head size and face shape. Further, children required smaller, simpler headset designs to accommodate their size and patience. Thus, the final dataset reflects data collected from four different headset designs. We briefly describe each, highlighting each design's possible advantages and disadvantages. 
Hard plastic mount
We used a ratchet-adjustable hard hat suspension (3M, St. Paul, MN) as the base of this headset (Figure 3, left). We designed custom 3D-printed pieces to hold the world camera (FLIR), the tracking camera (Intel), and the eye tracker (Pupil) rigidly to one another. We used multiple 3D-printed clips to ensure that the weight of the world and tracking cameras was distributed evenly over the device and secure enough to remain stable when a participant was in motion. Although this design achieved the goal of rigidly positioning the cameras relative to one another and to the participant's head, the design did not flexibly accommodate all head sizes and face shapes. Multiple observers found this design uncomfortable because the device had to be angled upward to keep the observer's eyes in the pupil camera frame. To keep the device stable in this orientation, the headband had to be very securely tightened, decreasing the comfort of the headset. 
Figure 3.
 
(Left) Hard plastic mount. (Middle) Elastic strap mount. (Right) Virtual reality headset mount. The right mount also shows the custom ball joint eye extenders.
Elastic strap mount
We used elastic headbands designed for GoPro cameras (Figure 3, center) for this design. As before, we used custom 3D-printed parts to mount the cameras rigidly relative to one another and to the elastic mount system. This system allowed users to adjust both the circumference and the over-the-head length of the mount and was thus more flexible and more comfortable. However, this system was difficult to adjust, and the elastic mounting may have allowed the system to slip or bounce during vigorous movement. Further, the silicone coating of the elastic straps pulled the hair of some observers. We created a variant of this mounting system to mitigate this discomfort by pulling the elastic straps over a standard bicycle helmet. We used duct tape to affix the Pupil Core frame to the helmet. Although this variant was comfortable to wear, the increased camera height decreased the visibility of the lower visual field. 
Virtual reality headset mount
We used a Quest 2 Elite mount (Meta, https://www.meta.com/quest/accessories/quest-2-elite-strap/) combined with custom 3D-printed parts (Figure 3, right). This mounting system included a ratchet-adjustable plastic strap that was wider than the previous plastic mount, increasing comfort for some observers. Further, unlike the previous plastic system, the Quest 2 mount included an elastic strap running over the top of the head from front to back, increasing the system's comfort by distributing the weight of the front-facing cameras more evenly. Although some observers found this model comfortable and stable, there were fewer degrees of freedom to adjust the system, and not all face sizes and shapes were accommodated easily. 
Custom pupil camera extenders
The Pupil Core system is equipped with sliding extenders that allow the eye cameras to be adjusted to different face shapes. However, these extenders are at a fixed angle, and we found that this angle accommodated only some observers. With this system, many observers were forced to place the entire device at an unusual angle for their eyes to stay in the camera frame, which could be uncomfortable or unstable. We designed and printed a new extender that used a ball joint mechanism, enabling each user to customize the fit of their device and leading to higher quality data. This joint was susceptible to wear and loosened with use, requiring frequent replacement. 
Peripherals
The eye tracking, world, and tracking cameras were each plugged into a laptop computer via USB, and the computer was placed in a backpack. Given the intensive battery demands of recording four video streams simultaneously, we used a backup computer battery for longer sessions. Participants were provided an Amerteer Bluetooth clicker module to pause and resume recording if needed. Calibration and validation targets were provided on a printed display (as described elsewhere in this article) or a smartphone. 
Acquisition software
We developed custom code to record the four video streams as well as odometry and inertial measurement unit data from the tracking camera. We recorded timestamps for each stream on a common clock to account for the different frame rates of each camera. The acquisition code was based on Pupil Labs' open-source code for their Pupil Capture software (https://github.com/vedb/pupil_recording_interface), modified to incorporate the FLIR and Intel T265 cameras. To economize file sizes, all video data were compressed at the point of saving to disk using the h264 codec, with a constant rate factor of 18. This parameter value is commonly considered to avoid introducing perceptible compression artifacts (e.g., Wang, Katsavounidis, Huang, Zhou, & Kuo, 2017), while decreasing the amount of data saved to disk for each session from approximately 24 GB/min to approximately 0.4 GB/min, a 60× decrease in file size. (Note that the benefits of compression will vary with the frame-to-frame differences in different videos.) 
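For readers unfamiliar with constant-rate-factor encoding, the sketch below shows an equivalent offline re-encode with ffmpeg at CRF 18. The file names are hypothetical, and the actual acquisition code compresses frames as they are captured rather than re-encoding a saved raw file.

```python
# Minimal sketch: re-encode a video with libx264 at constant rate factor 18,
# the setting used for the VEDB. Requires ffmpeg on the system path.
import subprocess

def compress_crf18(raw_path: str, out_path: str) -> None:
    """Re-encode a video with h264 at CRF 18 (visually near-lossless)."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", raw_path,        # input video (hypothetical file name)
            "-c:v", "libx264",     # h264 encoder
            "-crf", "18",          # constant rate factor
            "-preset", "medium",
            out_path,
        ],
        check=True,
    )

compress_crf18("world_raw.mp4", "world.mp4")
```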
The variable illumination of VEDB recording locations presented a substantial technical challenge. To account for these differences, we allowed the frame rate of our world and eye camera acquisition to vary over individual recording sessions, collecting light for slightly longer in dimmer conditions. Frame times also slowed for stretches of some sessions owing to low battery levels or processing demands on the recording laptop. We documented this variability by collecting frame acquisition timestamps for all video streams. These are saved separately for each session and stream as a NumPy file. These files are critical for basic analyses, such as computing motion statistics in the world camera or defining saccades based on a velocity threshold. The code for the acquisition software is available on GitHub (https://github.com/vedb/pupil_recording_interface and https://github.com/vedb/ved-capture). 
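As a minimal sketch (with a hypothetical file name), the per-stream timestamp arrays can be loaded to inspect how the effective frame rate varied within a session:

```python
# Minimal sketch: load per-frame timestamps for one stream and summarize the
# instantaneous frame rate over the session.
import numpy as np

timestamps = np.load("world_timestamps.npy")   # seconds, on the common session clock
dt = np.diff(timestamps)                       # interframe intervals
inst_rate = 1.0 / dt                           # instantaneous frame rate (Hz)

print(f"median rate ~{np.median(inst_rate):.1f} Hz, "
      f"min {inst_rate.min():.1f} Hz, max {inst_rate.max():.1f} Hz")
```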
Acquisition procedure
Each participant provided written informed consent (or, in the case of minors, assent with parental written consent). To ensure safety, we instructed participants not to drive and to remain aware of their surroundings. To protect the rights of bystanders, we instructed observers to obtain permission from acquaintances who were filmed in private spaces and to stop recording immediately if requested by someone in a public place. We informed observers that recordings would be made available on a public-facing website and that they could delete any session or section for any reason. 
We trained observers to properly wear and adjust the recording equipment, use the custom recording software, and complete the calibration and validation procedures. This training typically took between 20 minutes and 2 hours, and observers were compensated for their training time. 
The eye tracking calibration target consisted of a standard concentric circle pattern provided by Pupil Labs. This target was either printed on paper and affixed to a 3D-printed wand (7.5 cm × 13.0 cm in size) or viewed on a 6.5 cm × 13.0 cm smartphone screen. The validation target was a checkerboard pattern printed and affixed to the opposite side of the calibration wand or viewed on a smartphone. Some recordings used a 7 × 9 checkerboard pattern, whereas others used 4 × 7. 
We instructed observers to hold the calibration target at arm's length and to reveal the target at 25 locations, evenly spaced in a 5 × 5 grid in the world camera's field of view. We directed observers to hold the calibration target steady, keeping a steady gaze at the center of the concentric circle pattern for 1 to 2 seconds in each location. Because observers had no real-time feedback of their calibration target locations, some targets were missed because they fell outside the world camera's field of view. To avoid spurious detections of the calibration target, observers were advised to rotate the calibration target by 90° (to view the calibration wand edge-on) while moving the target between locations. The validation procedure was identical, but used the checkerboard target instead of the concentric circle target. To calibrate the tracking camera, observers nodded slowly five times and then rotated their heads right and left five times while regarding the validation target. These movements may be used to identify a head-centered kinematic reference frame that is common across observers and independent of how the device is mounted on the head; they also provide a basis for joint analysis of head and eye movements. Not all sessions contained tracking camera calibration. Some sessions also contained additional validation procedures at other stages of recording to evaluate the stability of the initial eye and head tracking calibration. After the session, observers were invited to view their recordings before uploading them to a centralized server. The debriefing form that participants received after the experiment provided contact information for the principal investigator should they wish to withdraw their data at any time. 
Potential sources of error or bias
Omission
In total, 968 sessions (283 hours of data) were recorded from 70 unique individuals, with three universities participating in data collection. Excluding sessions that had corrupted recordings (n = 59), contained only pilot data (n = 76), or were disallowed by an institutional review board (n = 116), we have 717 sessions (244 hours) from 56 unique individuals. 
Video recording errors
Sixteen sessions were included despite minor recording errors. These include a blurry world camera (n = 3), partial occlusion by a cord or other part of the recording apparatus (n = 6), and an inability to maintain a sampling rate near 30 Hz on the world camera (n = 5). 
Eye tracking errors
We successfully calibrated gaze tracking in 548 of the 717 sessions (76%). Of the 169 sessions without calibration, 16% were due to the participant omitting the calibration procedure; gaze will not be recoverable from these sessions. Another six sessions had corrupted timestamp files for one or more of the cameras, which are also nonrecoverable errors. The remainder of the errors are due to difficult lighting or backgrounds during calibration and may be recoverable in the future through improved marker detection algorithms or manual annotation of the calibration target locations. We report gaze error as the error measured during the independent validation sequences. To compute a representative error metric, we used a series of procedures to omit marker detections that were likely to be spurious. We removed marker detections that were very brief; these were often due to false alarms with the scene background or to views depicting a target that was not intended to be part of a validation sequence (such as a target left on a table). In addition, we spatially clustered validation marker locations and only computed error for sessions with a sufficient number of validation points. However, the clustering procedure identified zero clusters for some sessions. Therefore, the final validated gaze error values are from 458 sessions (64% of the total, 84% of sessions with successful calibration). We annotated the reasons for each failure. For 110 sessions (65% of failures), insufficient information was recorded to recover gaze. These include omitting calibration or validation (n = 52), corrupted timestamp files for any of the video streams (n = 8), pupils that were not entirely visible to the camera (n = 22), or validation that was not executed correctly (n = 28). The remaining errors were due to challenging backgrounds and lighting conditions that made it difficult to detect the validation target automatically. Figure 4 shows a diagram of the validation failures and their reasons. 
Figure 4.
 
Sankey diagram indicating reasons for validation failures. Approximately one-third of these failures may be recoverable with additional annotation.
The extent to which the pupil can be identified confidently limits gaze tracking accuracy. We measured the proportion of frames with PupilLabs confidence ratings of greater than 0.6 in each session. We found that, on average, 65% of frames met this confidence threshold. However, the type of recording significantly affected confidence ratings, as shown in Figure 5. Indoor sessions led to more retained frames than outdoor sessions (68% vs. 62%; Welch's two-sample t test, t(593.7) = 4.78, p < 0.001), and sedentary sessions led to more confident pupil detections than active sessions (70% vs. 63%; t(347.4) = −4.03, p < 0.001). 
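A minimal sketch of this comparison, with hypothetical placeholder values for the per-session retained-frame proportions, is shown below.

```python
# Minimal sketch: compare per-session proportions of high-confidence pupil
# frames (confidence > 0.6) between session types with a Welch's two-sample
# t test. The values below are hypothetical placeholders, not VEDB results.
import numpy as np
from scipy import stats

indoor_retained = np.array([0.71, 0.66, 0.70, 0.64, 0.69])    # placeholder proportions
outdoor_retained = np.array([0.63, 0.60, 0.65, 0.58, 0.61])   # placeholder proportions

t, p = stats.ttest_ind(indoor_retained, outdoor_retained, equal_var=False)  # Welch's test
print(f"t = {t:.2f}, p = {p:.3f}")
```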
Figure 5.
 
(A) Survival plot of sessions, plotting the proportion of sessions on the y axis as a function of frames retained (frames with pupil detection confidence of >0.6). Indoor sessions led to significantly higher pupil confidence than outdoor sessions. (B) Survival plot for active and sedentary sessions. Sedentary activities led to significantly higher pupil confidence.
The validity of the initial calibration requires that the cameras remain in the same location relative to one another and in the same position on the observer's head. Although the mounting systems guarded against the first issue, any device instability on an observer's head could lead to inaccurate gaze tracking. To assess the extent to which stability limited gaze accuracy over time, we used ordinary least squares regression to fit a model of each pupil's y position over time and saved the slope parameter. Figures 6A and 6B show histograms of slopes for the left and right eyes. These distributions are centered on zero, with very small slope values. However, small differences in slope could lead to large errors in gaze tracking, particularly in long sessions. Thus, to quantify the absolute magnitude of drift from beginning to end of each session, we averaged the two eyes and converted the difference in average position between the beginning and end of each session to degrees of visual angle (DVA). As shown in Figure 6C, 95% of sessions contained less than 5° of drift (70% of sessions drifted by <2°). Active sessions tended toward greater drift than passive sessions (1.83 vs. 1.51 DVA; t(396.1) = 1.93, p = 0.054). We observed no significant difference in drift between indoor and outdoor sessions, t(510.7) < 1. Finally, we reasoned that if slippage were a serious issue in many sessions, then it should get progressively worse with longer sessions. To test this, we computed a correlation between session length and amount of drift. We observed a small negative correlation between session length and the amount of drift (r = −0.09, p < 0.05), indicating that slippage did not get worse with longer sessions. It is possible that drift followed nonlinear patterns, with sudden shifts in headset position over short times, which our linear regression estimation did not capture fully. Estimation and correction of this level of slippage is beyond the scope of the current article, but the linear regression and drift magnitude analyses presented here, taken together, suggest that for a clear majority of sessions, gaze calibration remained stable over time. 
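A minimal sketch of the drift estimate, using a synthetic placeholder pupil trace rather than actual VEDB pupil data:

```python
# Minimal sketch: estimate drift as (1) the OLS slope of pupil y position over
# time and (2) the difference between the average position early vs. late in a
# session. The signal below is a synthetic placeholder.
import numpy as np

t = np.linspace(0, 20 * 60, 144_000)                            # seconds (20 min at 120 Hz)
pupil_y = 0.5 + 1e-5 * t + 0.01 * np.random.randn(t.size)       # placeholder pupil y trace

slope, intercept = np.polyfit(t, pupil_y, deg=1)                # OLS fit of y position vs. time
drift = np.mean(pupil_y[-1000:]) - np.mean(pupil_y[:1000])      # last vs. first 1,000 frames

print(f"slope = {slope:.2e} per second, early-to-late drift = {drift:.4f}")
```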
Figure 6.
 
(A and B) Slope of detected pupil positions over time in units of degrees of visual angle (DVA) per minute. (C) Difference in average pupil position between early (first 1,000 frames) and late (last 1,000 frames) in each session, expressed in DVA.
Finally, we assessed the consistency of temporal sampling in each stream by computing the mean squared error of sample timestamps relative to an assumed constant frame rate. Mean squared error values were generally very low (specific values can be found in Table 2), indicating that our sampling was stable over time. 
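As a minimal sketch (one reasonable reading of the metric reported in Table 2; the file name and nominal rate are hypothetical placeholders):

```python
# Minimal sketch: quantify timing regularity as the mean squared error of the
# interframe intervals relative to the nominal (constant) frame rate.
import numpy as np

timestamps = np.load("world_timestamps.npy")      # seconds, one value per frame
nominal_rate = 30.0                               # Hz (world camera)

intervals = np.diff(timestamps)                   # observed interframe intervals
mse = np.mean((intervals - 1.0 / nominal_rate) ** 2)
print(f"interframe-interval MSE: {mse:.2e} s^2")
```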
Table 2.
Mean squared error values of the interframe intervals for each data stream
Confidentiality and privacy
Our lived experiences contain many potentially sensitive moments. We have mitigated this risk to the extent possible by allowing each observer to choose what activities to record. Further, the use of the Bluetooth pause button allowed participants to pause recordings during sensitive times, such as entering a password or passing through a room with a family member who did not wish to be recorded. Finally, observers could note sessions or session sections that they wished to be removed for any reason. Nonetheless, it is possible that some potentially sensitive information may exist in the dataset. 
We have taken measures to reduce the identifiability of individuals within the dataset. Even simple demographic information such as zip code, gender, and date of birth can be used to identify individuals uniquely (Sweeney, 2000). We have decreased these risks by recording each participant's date of birth as January 1 of their birth year and by not recording the location of any specific recording. However, we recognize that the distinctive landscapes of our recording locations make educated guesses about location possible. 
Finally, the eye videos themselves may be considered sensitive. Near-infrared videos, such as those used by the Pupil Core, have been shown to contain diagnostic iris features that can be used for biometric identification (John, Koppal, & Jain, 2019). Because the eye videos contain the raw data for researchers advancing new pupil detection and gaze tracking methods, the privacy interests of the participants are in tension with the scientific usefulness of these videos. One study has shown that a mild degree of Gaussian blurring of these videos can remove diagnostic iris information without degrading pupil tracking performance (John et al., 2019). Therefore, we have released blurred versions of the eye videos to preserve the usefulness of the dataset without compromising participant privacy. Unmodified eye videos are made available to researchers with a Databrary account. 
Sampling strategy
Our sampling strategy considered several tradeoffs. First, we wanted data to be representative of daily experience but also to capture the breadth of activities that are common across many groups. According to the American Time Use Survey, an annual telephone survey conducted by the U.S. Bureau of Labor Statistics, most of people's days are spent on work (7.69 hours, on average, in 2022) and sedentary entertainment such as watching television or playing computer games (>3 hours, on average). Because some occupations cannot be recorded for privacy or safety reasons, we limited the work task to common office computer work in the dataset. Television and video games are common but visually homogenous activities, so we avoided sampling them in proportion to the time most people allot to these experiences. 
Second, we wanted to show a variety of people engaging in various tasks in various environments without breaking the natural covariation between people, tasks, and environments. To accomplish this aim, we allowed participants to determine what they wanted to record in consultation with experimenters to assess the safety and feasibility of each activity. 
Third, we wanted to record from various observers and tasks while also enabling researchers to study interesting repeated measures, such as similarities within individuals or across individuals performing the same task in the same locations. To accomplish this, we included some structured instructions within our sampling. For example, 12 observers were instructed to walk around a 0.25-mile pond path on the Bates College campus once or twice a week for the duration of 2022. These 135 sessions can be useful in assessing eye and head movement variability within and across individual observers. Figure 7 shows example frames of representative activities in the VEDB. 
Figure 7.
 
Representative frames within the VEDB. (Top row) Public activities. (Second row) Cooking. (Third row) Playing music. (Bottom row) Outdoor activities.
Data processing
Preprocessing
Using custom software, we hand-annotated the start and end times for eye movement calibration, validation, and odometry calibration procedures. These times were translated into timestamp indices and saved in a .yaml file (marker_times.yaml). 
Gaze pipeline
Calibration target detection
We applied the circle detection algorithm from Pupil Labs to detect the concentric circle pattern in the world videos between the saved calibration start and end times. The location, size (for filtering spurious detections), and time stamp of each detected circle were saved as NumPy arrays. 
Pupil detection
We applied the 2D pupil detection method from Pupil Labs to locate the pupil in both eye videos. We saved the detected pupil ellipses and their diameters, locations, luminance, and time stamps in Python dictionaries. Further, the confidence of each detection (0–1) was saved as a NumPy array. 
Calibration
The concentric circle calibration target locations were filtered for spurious detections before the calibration step. Detected locations that were too small, at an inappropriate aspect ratio, or detected very briefly (<0.3 seconds) were omitted as likely to be spurious. Target locations were not completely steady because calibration targets were held by hand. We clustered the remaining locations using DBSCAN clustering to increase the signal-to-noise ratio. This method was chosen because the number of clusters does not need to be specified in advance and because it is robust to outliers. Pupil locations were filtered by confidence, with only confidence values of greater than 0.6 included in the calibration. To account for unsteady hands and occasional behavioral gaze errors (e.g., participants momentarily looking away from the target owing to wind or another real-world distraction), we used the median location within each cluster, for both calibration marker detections and detected pupils, to compute gaze calibration. Next, the saved calibration target arrays and pupil arrays were passed to custom Python code that fit the pupil locations to each calibration location using cross-validated thin plate spline regression. Model fitting was performed monocularly. 
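The sketch below illustrates the general shape of this step under stated assumptions: synthetic stand-in marker and pupil detections, illustrative DBSCAN parameters, and a thin plate spline fit via scipy's RBFInterpolator rather than the exact cross-validated routine in the released pipeline.

```python
# Minimal sketch: cluster calibration-marker detections with DBSCAN, reduce
# each cluster to its median, and fit a thin plate spline from pupil space to
# world-camera space. All data here are synthetic placeholders.
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)

# Stand-in data: 25 target clusters (a 5 x 5 grid) of noisy marker detections,
# with time-matched pupil detections (already filtered by confidence > 0.6).
grid = np.stack(np.meshgrid(np.linspace(0.1, 0.9, 5),
                            np.linspace(0.1, 0.9, 5)), axis=-1).reshape(-1, 2)
marker_pts = np.repeat(grid, 40, axis=0) + rng.normal(0, 0.005, (grid.shape[0] * 40, 2))
pupil_pts = 0.5 * marker_pts + 0.2 + rng.normal(0, 0.005, marker_pts.shape)

# Cluster marker detections; DBSCAN labels noise points as -1.
labels = DBSCAN(eps=0.03, min_samples=10).fit_predict(marker_pts)
clusters = [k for k in sorted(set(labels)) if k >= 0]

# Per-cluster medians reduce noise from unsteady hands and momentary gaze errors.
marker_medians = np.array([np.median(marker_pts[labels == k], axis=0) for k in clusters])
pupil_medians = np.array([np.median(pupil_pts[labels == k], axis=0) for k in clusters])

# Thin plate spline regression from pupil coordinates to world-camera coordinates.
gaze_model = RBFInterpolator(pupil_medians, marker_medians,
                             kernel="thin_plate_spline", smoothing=1e-6)
print(gaze_model(pupil_medians[:3]))  # predicted gaze for three cluster medians
```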
Computing validation error
We detected the locations of the validation markers using the checkerboard detector from the OpenCV library and stored these as NumPy arrays. As with the calibration markers, we filtered these detected locations by size and duration and clustered the locations using DBSCAN. Validation target locations over four standard deviations away from cluster medians were omitted as outliers. The error (in DVA) between the modeled gaze and the location of the checkerboard's center was computed and stored. A histogram of validation error is shown in Figure 8B. A total of 168 sessions have a validated gaze error of less than 2°, and 332 sessions have a validated error of less than 5°. Indoor and outdoor recordings had similar levels of validated error (indoor, 2.4 DVA; outdoor, 2.6 DVA; Mann–Whitney U test, p = 0.15). Similarly, we observed no significant difference between physically active and passive sessions (active, 2.6 DVA; passive, 2.4 DVA; p = 0.48). 
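As a minimal sketch (hypothetical frame path, pattern size, and gaze point; a crude linear pixels-to-degrees conversion stands in for the calibrated camera geometry used by the released pipeline):

```python
# Minimal sketch: detect the validation checkerboard in one world frame and
# express the offset between modeled gaze and the target center in DVA.
import cv2
import numpy as np

frame = cv2.imread("world_frame.png", cv2.IMREAD_GRAYSCALE)
found, corners = cv2.findChessboardCorners(frame, (7, 9))  # inner corners per row/column

if found:
    target_px = corners.reshape(-1, 2).mean(axis=0)        # checkerboard center (pixels)
    gaze_px = np.array([1024.0, 768.0])                    # hypothetical modeled gaze point

    deg_per_px = 101.0 / 2048.0   # rough scale for the 4-mm lens (101 deg over 2,048 px)
    error_dva = np.linalg.norm(gaze_px - target_px) * deg_per_px
    print(f"validation error: {error_dva:.2f} DVA")
```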
Computing immediate calibration error
The checkerboard pattern was not robustly detected in some environments and lighting conditions, so we also computed the error at the calibration targets themselves to estimate the extent of gaze error for these sessions. This quantity is overfitted to the data, but can be used to identify sessions with acceptable gaze calibration. Using the detected calibration markers, we used affinity propagation to cluster marker locations and then computed the median location of each cluster. We identified the time stamps associated with each cluster and computed the median gaze position during that same time interval. We then computed the error between them (measured in DVA). As shown in Figure 8A, the median calibration error in most sessions was minimal (59% were <1° and 80% were <5°), although there was a long tail of very high error clusters. There are numerous reasons for these failures, including spurious pupil detections, pupil occlusion by eyelashes, or an eye that was not centered in the frame. We observed a slight tendency for outdoor sessions to have a higher median calibration error than indoor sessions (0.65 DVA vs. 0.40 DVA; Mann–Whitney U test, p < 0.05). This was driven by the additional challenges of recording outdoors, including participants squinting into the sun. We observed a lower median calibration error for physically passive sessions compared with active sessions (0.32 DVA vs. 0.63 DVA, p = 0.006). The code that performs this gaze pipeline can be found on GitHub (https://github.com/vedb/vedb_gaze_1.0). 
Figure 8.
 
(A) Histogram of initial calibration error (degrees of visual angle [DVA]). (B) Histogram of validation error (DVA) across 423 sessions.
Head tracking
Odometry, gyroscope, and accelerometer data are provided as is without further processing. These raw data indicate the position and movement of the tracking camera relative to a world reference frame that is initialized when recording starts. Precision and accuracy of this tracking data relative to in-lab optical tracking has been assessed previously (Hausamann, Sinnott, Daumer, & MacNeilage, 2021). For the purposes of comparing or compiling head movement data across participants, it is necessary to transform the data into an anatomical or kinematic reference frame that is the same across participants and independent of how the device is mounted on the head (Sinnott, Hausamann, & MacNeilage, 2023). Code for finding and applying this transformation for a given session can be found on GitHub (https://github.com/bszek213/odopy). Similarly, for joint analysis of head and eye movements, or for reconstructing how the eye is moving relative to the world, the spatial relationship between the reference frame of the tracking camera and the world camera is necessary because eye movements are measured in the reference frame of the world camera (Hausamann, Sinnott, & MacNeilage, 2020). Code for finding and applying the eye–head reference frame transformation for a given session can be found on GitHub (https://github.com/bszek213/head_eye). 
Labeling
For each session, we extracted frames from the world camera videos at five-second intervals. Researchers in our groups examined these frames and, for each session, created a .csv file that labeled all changes in scene location (using scene categories from the Places database) or the observer's task in the session. This provides not only ground truth place and task labels but also the approximate time points of each scene or task transition. 
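A minimal sketch of the frame extraction step (hypothetical paths; the labeling itself was done by hand on the extracted frames):

```python
# Minimal sketch: save one world-camera frame every 5 seconds for hand labeling.
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("world.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0     # fall back to the nominal 30 Hz
step = int(round(5 * fps))                  # number of frames per 5-second interval

frame_idx = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % step == 0:
        cv2.imwrite(f"frames/frame_{frame_idx:07d}.png", frame)
        saved += 1
    frame_idx += 1
cap.release()
print(f"saved {saved} frames")
```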
Preserving second-order privacy
We created custom software to blur bystanders' faces in our videos. Each video frame was read in and downsampled to 50% size for speed. We applied the RetinaFace detector (Deng, Guo, Ververas, Kotsia, & Zafeiriou, 2020) to each frame. We created a tight ellipse around each bounding box returned by the detector and blurred the pixels within the ellipse with a Gaussian blur (standard deviation = 149 pixels). We then rewrote each altered frame to a new file. We found that RetinaFace provided the best detection efficacy of several leading detectors. Specifically, although most detectors underperformed on non-White and non-adult faces, we found that RetinaFace was robust across age, race, and the presence of face masks. An internal audit of approximately 3,000 of the most challenging frames found that the classifier's sensitivity was more than 87%, and that it remained robust across different ethnicities, ages, viewpoints, and occluders such as hats, sunglasses, and face masks. A sample of 150 typical face frames yielded a sensitivity of greater than 99%. The script that generated these videos is found on GitHub (https://github.com/vedb/protect_privacy). 
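A minimal sketch of the elliptical blurring step is shown below. The face-detection call is abstracted away (the released pipeline uses RetinaFace); the frame path and bounding box are hypothetical placeholders.

```python
# Minimal sketch: given face bounding boxes from a detector, blur only the
# pixels inside a tight ellipse fit to each box.
import cv2
import numpy as np

def blur_faces(frame: np.ndarray, boxes, sigma: float = 149.0) -> np.ndarray:
    """Blur an elliptical region inside each (x1, y1, x2, y2) face box."""
    out = frame.copy()
    blurred = cv2.GaussianBlur(frame, (0, 0), sigmaX=sigma)     # heavy Gaussian blur
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    for (x1, y1, x2, y2) in boxes:
        center = ((x1 + x2) // 2, (y1 + y2) // 2)
        axes = ((x2 - x1) // 2, (y2 - y1) // 2)
        cv2.ellipse(mask, center, axes, 0, 0, 360, 255, thickness=-1)
    out[mask > 0] = blurred[mask > 0]                           # blur only inside the ellipses
    return out

frame = cv2.imread("world_frame.png")        # hypothetical frame
boxes = [(800, 400, 900, 520)]               # hypothetical detector output
cv2.imwrite("world_frame_blurred.png", blur_faces(frame, boxes))
```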
Results and discussion
Use cases for the VEDB
The VEDB is appropriate for studies in natural scene statistics, examinations of gaze behavior during common tasks, and studies of how head and eye movements combine to orient overt attention and gaze. The VEDB also provides a large dataset of high-resolution egocentric video together with head and eye movement data. Alongside investigations of natural scene statistics, these data can be used to create and validate new or extant methods of motion or position estimation, object classification, human activity recognition, machine learning-based eye tracking, and more. We provide our plans for this dataset and our recommendations for others using the dataset. 
One unique feature of the VEDB is that scene locations and observers’ tasks have been annotated at a fine temporal scale. We have been working to assess the relative success of pretrained deep neural networks in classifying scene examples in the VEDB. We have found that these networks are impaired in classifying the VEDB samples and that this effect is particularly strong for scenes within private homes (Greene, Hart, & Mohamed, 2022). Thus, VEDB frames may be useful for fine-tuning deep neural networks to decrease their biases. Further, the fine-grained annotations of tasks can be useful in training systems to recognize human activities from an egocentric perspective, which can be useful in human–computer interaction and robotics. 
The majority of the literature on natural scene statistics has considered only static photographs. Thus, the VEDB is especially appropriate for studying spatiotemporal image statistics. Subsampling world camera images around the point of gaze yields retinocentric videos that approximate moment-to-moment retinal image motion. Furthermore, by combining the world camera feed with the odometry from the tracking camera, one can begin to assess the statistics of self-generated and externally generated visual motion. The statistics of head orientation and movement have important implications for understanding sensory processing across vestibular, visual, and auditory modalities (MacNeilage, 2020; Sinnott et al., 2023), and joint statistics of natural head and eye movements can provide valuable and novel insight into head–eye coordination during everyday behavior. Because each session is named with a recording time stamp (YYYY_MM_DD_HH_MM_SS), we have been examining how local environmental conditions (temperature and time until sunset) alter some standard natural scene statistics (Greene, Hart, & Balas, 2023). 
Gaze information is available for most VEDB sessions, with 168 sessions (approximately 63 hours of content) containing initial gaze accuracies of under 2 DVA, making them appropriate for many eye tracking applications. For research questions that do not require precise gaze tracking, 333 sessions (approximately 122 hours) have median gaze accuracies of under 5 DVA. We are providing blurred eye videos in the dataset, which previous work suggests is an effective method for protecting observers’ identities while allowing pupil detection (John et al., 2019). More robust pupil detection methods will improve any future gaze pipeline applied to these data. In particular, machine learning methods show great promise for improving gaze tracking pipelines (Yiu et al., 2019). 
The VEDB highlights the challenges faced by researchers who wish to study gaze in active situations with a diverse set of participants. In particular, the VEDB contains gaze data recorded in challenging outdoor conditions with dynamic lighting, which require careful sensor parameter selection with model-based tracking methods (Binaee, Sinnott, Capurro, MacNeilage, & Lescroart, 2021). Further, the VEDB contains data from participants of many racial and ethnic backgrounds, a known challenge for model-based eye tracking methods (Blignaut & Wium, 2014). Thus, the VEDB provides an opportunity for researchers to develop tracking methods that are robust to these features. Finally, by combining gaze, egocentric video, and positional data, researchers can construct and validate gaze-contingent methods of human activity recognition, a topic of increasing interest in the wearable and augmented reality communities (Bektaş et al., 2023). 
VEDB case study
Given the critical role of faces in our social lives, the “face diet” of human observers has been of particular interest. Developmental work using head-mounted cameras has shown that human infants see more faces than hands in the first few months of life and that this ratio switches as children become toddlers and can engage more interactively with the world (Fausey, Jayaraman, & Smith, 2016). Smaller-scale studies of adults have examined the frequency of encountering faces, along with their viewpoints and depth profiles (Oruc, Shafai, Murthy, Lages, & Ton, 2019). Here, we demonstrate the usefulness of the VEDB by expanding this type of analysis, using gaze information in addition to frequency statistics from egocentric video. 
We applied the Google MediaPipe (Lugaresi et al., 2019) face and hand detectors to each video frame after calibration and validation. We omitted calibration and validation segments because hands were continuously visible during those procedures. This analysis totaled more than 22.5 million frames of content. Heat maps showing the spatial distribution of faces and hands in the world video are shown in Figure 9 (top). Our findings revealed a distinct spatial distribution of these body parts: the median center of faces was located 13° above the video center, while the median hand position was 23° below (Wilcoxon rank-sum, p < 0.0001). 
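A minimal sketch of the face-detection pass using MediaPipe's legacy "solutions" API is shown below; the detector settings, bin counts, and file path are illustrative rather than the exact pipeline configuration, and the hand detector (mp.solutions.hands) would follow the same pattern.

```python
# Minimal sketch: run the MediaPipe face detector over a world video and
# accumulate a heat map of detected face-center locations.
import cv2
import numpy as np
import mediapipe as mp

H_BINS, V_BINS = 64, 48
heatmap = np.zeros((V_BINS, H_BINS))

cap = cv2.VideoCapture("world.mp4")
with mp.solutions.face_detection.FaceDetection(model_selection=1,
                                               min_detection_confidence=0.5) as detector:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        results = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        for det in results.detections or []:
            box = det.location_data.relative_bounding_box
            cx = box.xmin + box.width / 2.0     # normalized face-center x
            cy = box.ymin + box.height / 2.0    # normalized face-center y
            col = int(np.clip(cx * H_BINS, 0, H_BINS - 1))
            row = int(np.clip(cy * V_BINS, 0, V_BINS - 1))
            heatmap[row, col] += 1
cap.release()
np.save("face_heatmap.npy", heatmap)
```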
Figure 9.
 
(Top) Heat maps showing the location of faces and hands in 22.5 million frames of egocentric video. (Bottom) Gaze-centered heatmaps of faces and hands.
We next applied our gaze data to show the spatial distribution of faces and hands relative to fixation. For each frame with a detected hand or face, we found the gaze location and, when possible, cropped a 512 × 512 image segment centered on the gaze location. The resulting gaze-centered heat maps are shown in the bottom row of Figure 9. Interestingly, the highest intensity location on the gaze-centered face map was 2.8° above fixation, and the highest intensity location on the hand map was 3.9° below fixation. Taken together, these findings suggest that, although both faces and hands tend to be frequently fixated by human observers, the retinotopic experience of these objects still differs spatially. This finding suggests the hypothesis that face and hand detection sensitivity may vary accordingly across the visual field; we are currently testing this hypothesis. Furthermore, characterizing the retinotopic visual experience of various categories can provide additional insight into the functional organization of higher-level cortex (e.g., Weiner & Grill-Spector, 2013). 
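A minimal sketch of the gaze-centered cropping (normalized gaze coordinates and the edge handling are assumptions; the full analysis may treat border cases differently):

```python
# Minimal sketch: crop a 512 x 512 patch centered on the gaze point, as used
# for the gaze-centered heat maps. The frame and gaze values are placeholders.
import numpy as np

def gaze_centered_crop(frame: np.ndarray, gaze_norm, size: int = 512):
    """Return a size x size crop centered on normalized (x, y) gaze, or None."""
    h, w = frame.shape[:2]
    cx, cy = int(gaze_norm[0] * w), int(gaze_norm[1] * h)
    half = size // 2
    if cx - half < 0 or cy - half < 0 or cx + half > w or cy + half > h:
        return None                      # gaze too close to the image border
    return frame[cy - half:cy + half, cx - half:cx + half]

frame = np.zeros((1536, 2048, 3), dtype=np.uint8)   # placeholder world frame
patch = gaze_centered_crop(frame, (0.55, 0.48))
print(None if patch is None else patch.shape)       # (512, 512, 3)
```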
Access and distribution
We are releasing most VEDB materials publicly. There are two main repositories for the materials. Databrary (https://nyu.databrary.org/volume/1612) hosts the deidentified videos from the world camera and the blurred eye videos, as well as the location and task annotation files. We have also hosted the unblurred eye videos on Databrary, but with limited access to protect participant privacy; researchers who have a Databrary account will be able to view this content. Each session is named by timestamp (YYYY_MM_DD_HH_MM_SS) and is tagged with participant information (participant code, approximate age, gender, race, and ethnicity), a condition label that indicates whether the gaze tracking pipeline was successful, a basic task label, and a filming location label. End users may download all content or filter according to their specific needs using the tags. The Open Science Framework (https://osf.io/2gdkb/) hosts the processed gaze files and tracking data (odometry, accelerometry, and gyroscope) because Databrary only supports video and csv file types. Both repositories name sessions by timestamp (YYYY_MM_DD_HH_MM_SS) so that corresponding files can be found. We have also released a suite of code tools for data acquisition, processing, and analysis on GitHub (https://github.com/vedb), along with introductory notebooks for using the data (https://github.com/vedb/vedb-demos). The CC-BY Creative Commons license covers all content. 
Maintenance and sustainability
We have chosen to host the dataset on existing large-scale open science platforms (Databrary and OSF) to improve this dataset's longevity and increase its visibility in the community. We cannot guarantee the ongoing software development necessary to keep all code up-to-date. However, the CC-BY license empowers end users to use our code as a project starting point. 
We intend the VEDB to be a living dataset and encourage the community to contribute to the project to increase its usefulness to all.
Acknowledgments
The authors thank Maggie Diamond-Stanic, Jackson Skinner, and Amber Fuller for their administrative support. Thanks to Dan Gu and Robert Spellman for technical support. Annotation support was provided by Adya Agarwal, Hannah Burdick, Avery Cadorette, Olivia Cuneo, Yuleibi De Los Santos, Mavy Ho, Joey Ireland, Kaitlyn O'Shaughnessy, Colin Pierce, Samantha Simmons, and Madeline Smith. 
Author contributions: Greene: conceptualization, methodology, software, investigation, data curation, writing original draft, visualization, supervision, project administration, funding acquisition. Balas: conceptualization, methodology, writing review and editing, supervision, funding acquisition. Lescroart: conceptualization, methodology, software, investigation, writing review and editing, supervision, funding acquisition. MacNeilage: conceptualization, methodology, investigation, writing review and editing, supervision, funding acquisition. Hart: methodology, investigation, data curation, project administration. Binaee: methodology, software, validation, investigation, writing review and editing. Hausamann: software. Mezile: methodology, investigation, resources, data curation. Shankar: methodology, investigation. Sinnott: methodology, investigation, writing review and editing. Capurro: investigation. Halow: methodology, investigation. Howe: investigation. Josyula: data curation, investigation. Li: software, investigation. Mieses: software, investigation. Mohamed: investigation. Nudnou: investigation. Parkhill: investigation. Riley: software, investigation. Schmidt: investigation. Shinkle: methodology, investigation. Si: data curation, investigation. Szekely: methodology, investigation. Torres: investigation. Weissmann: investigation.
Commercial relationships: none. 
Corresponding author: Michelle R. Greene. 
Email: mgreene@barnard.edu. 
Address: Department of Psychology, Barnard College, Columbia University, 3009 Broadway, New York, NY 10027, USA. 
References
Adams, W. J., Elder, J. H., Graf, E. W., Leyland, J., Lugtigheid, A. J., & Muryy, A. (2016). The Southampton-York Natural Scenes (SYNS) dataset: Statistics of surface attitude. Scientific Reports, 6, 35805, https://doi.org/10.1038/srep35805. [CrossRef] [PubMed]
Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61(3), 183–193. [CrossRef] [PubMed]
Barlow, H. (2001). The exploitation of regularities in the environment by the brain. Behavioral and Brain Sciences, 24(04), 602–607, https://doi.org/10.1017/S0140525X01000024. [CrossRef]
Bektaş, K., Strecker, J., Mayer, S., Garcia, K., Hermann, J., Jenß, K. E., ... Solèr, M. (2023). GEAR: Gaze-enabled augmented reality for human activity recognition. Proceedings of the 2023 Symposium on Eye Tracking Research and Applications, 9, 1–9, https://doi.org/10.1145/3588015.3588402.
Betsch, B. Y., Einhäuser, W., Körding, K. P., & König, P. (2004). The world from a cat's perspective – statistics of natural videos. Biological Cybernetics, 90(1), 41–50, https://doi.org/10.1007/s00422-003-0434-6. [CrossRef] [PubMed]
Biederman, I., Mezzanotte, R. J., & Rabinowitz, J. C. (1982). Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 14(2), 143–177, https://doi.org/10.1016/0010-0285(82)90007-X. [CrossRef] [PubMed]
Binaee, K., Sinnott, C., Capurro, K. J., MacNeilage, P., & Lescroart, M. D. (2021). Pupil tracking under direct sunlight. ACM Symposium on Eye Tracking Research and Applications, 18, 1–4, https://doi.org/10.1145/3450341.3458490.
Blignaut, P., & Wium, D. (2014). Eye-tracking data quality as affected by ethnicity and experimental design. Behavior Research Methods, 46(1), 67–80, https://doi.org/10.3758/s13428-013-0343-0. [CrossRef] [PubMed]
Damen, D., Doughty, H., Farinella, G. M., Furnari, A., Kazakos, E., Ma, J., ... Wray, M. (2022). Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. International Journal of Computer Vision, 130(1), 33–55, https://doi.org/10.1007/s11263-021-01531-2. [CrossRef]
Davenport, J., & Potter, M. C. (2004). Scene consistency in object and background perception. Psychological Science, 15(8), 559–564. [CrossRef] [PubMed]
De la Torre Frade, F., Hodgins, J. K., Bargteil, A., Artal, X. M., Castells Collando, I., & Beltran, J. (2009). Guide to the Carnegie Mellon University multimodal activity (CMU-MMAC) database (Technical Report 08–22). Pittsburgh, PA: Robotics Institute, Carnegie Mellon University. Retrieved from https://www.ri.cmu.edu/publications/guide-to-the-carnegie-mellon-university-multimodal-activity-cmu-mmac-database/.
Deng, J., Guo, J., Ververas, E., Kotsia, I., & Zafeiriou, S. (2020). RetinaFace: Single-shot multi-level face localisation in the wild. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5203–5212. Retrieved from https://openaccess.thecvf.com/content_CVPR_2020/html/Deng_RetinaFace_Single-Shot_Multi-Level_Face_Localisation_in_the_Wild_CVPR_2020_paper.html.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, pp. 248–255, https://doi.org/10.1109/CVPR.2009.5206848.
DeVries, T., Misra, I., Wang, C., & van der Maaten, L. (2019). Does object recognition work for everyone? arXiv:1906.02659 [Cs]. http://arxiv.org/abs/1906.02659.
Dong, D. W., & Atick, J. J. (1995). Statistics of natural time-varying images. Network: Computation in Neural Systems, 6(3), 345–358, https://doi.org/10.1088/0954-898X_6_3_003. [CrossRef]
DuTell, V., Gibaldi, A., Focarelli, G., Olshausen, B. A., & Banks, M. S. (2024). High-fidelity eye, head, body, and world tracking with a wearable device. Behavior Research Methods, 56(1), 32–42, https://doi.org/10.3758/s13428-022-01888-3. [CrossRef] [PubMed]
Fathi, A., Hodgins, J. K., & Rehg, J. M. (2012). Social interactions: A first-person perspective. 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 1226–1233, https://doi.org/10.1109/CVPR.2012.6247805.
Fathi, A., Li, Y., & Rehg, J. M. (2012). Learning to recognize daily actions using gaze (pp. 314–327). Berlin, Heidelberg: Springer, https://doi.org/10.1007/978-3-642-33718-5_23.
Fathi, A., Ren, X., & Rehg, J. M. (2011). Learning to recognize objects in egocentric activities. CVPR 2011. Colorado Springs, CO, USA, 2011, pp. 3281–3288, https://doi.org/10.1109/CVPR.2011.5995444.
Fausey, C. M., Jayaraman, S., & Smith, L. B. (2016). From faces to hands: Changing visual input in the first two years. Cognition, 152, 101–107, https://doi.org/10.1016/j.cognition.2016.03.005. [CrossRef] [PubMed]
Fouhey, D. F., Kuo, W., Efros, A. A., & Malik, J. (2017). From lifestyle vlogs to everyday interactions. arXiv:1712.02310 [Cs], http://arxiv.org/abs/1712.02310.
Fuhl, W., Kasneci, G., & Kasneci, E. (2021). TEyeD: Over 20 million real-world eye images with pupil, eyelid, and iris 2d and 3d segmentations, 2d and 3d landmarks, 3d eyeball, gaze vector, and eye movement types. 2021 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Bari, Italy, 2021, pp. 367–375, https://doi.org/10.1109/ISMAR52148.2021.00053.
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé, H., III, & Crawford, K. (2021). Datasheets for Datasets (arXiv:1803.09010). arXiv, https://doi.org/10.48550/arXiv.1803.09010.
Geisler, W. S. (2008). Visual perception and the statistical properties of natural scenes. Annual Review of Psychology, 59(1), 167–192, https://doi.org/10.1146/annurev.psych.58.110405.085632. [CrossRef] [PubMed]
Gibson, J. J. (1986). The ecological approach to visual perception. Mahwah, NJ: Lawrence Erlbaum Associates.
Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., ... Memisevic, R. (2017). The “something something” video database for learning and evaluating visual common sense. 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 5843–5851, https://doi.org/10.1109/ICCV.2017.622.
Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., ... Malik, J. (2022). Ego4D: Around the world in 3,000 hours of egocentric video. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18995–19012. Retrieved from https://openaccess.thecvf.com/content/CVPR2022/html/Grauman_Ego4D_Around_the_World_in_3000_Hours_of_Egocentric_Video_CVPR_2022_paper.html.
Greene, M., Hart, J., & Balas, B. (2023). Viewpoint and seasonal variations in natural scene statistics. Journal of Vision, 23(9), 5940, https://doi.org/10.1167/jov.23.9.5940. [CrossRef]
Greene, M. R. (2013). Statistics of high-level scene context. Frontiers in Psychology, 4, 777, https://doi.org/10.3389/fpsyg.2013.00777.
Greene, M. R., Botros, A. P., Beck, D. M., & Fei-Fei, L. (2015). What you see is what you expect: Rapid scene understanding benefits from prior experience. Attention, Perception & Psychophysics, 77(4), 1239–1251, https://doi.org/10.3758/s13414-015-0859-8.
Greene, M. R., Hart, J. A., & Mohamed, A. (2022). What we don't see in image databases. Journal of Vision, 22(14), 3204, https://doi.org/10.1167/jov.22.14.3204. [CrossRef]
Hansen, B. C., Essock, E. A., Zheng, Y., & DeFord, J. K. (2003). Perceptual anisotropies in visual processing and their relation to natural image statistics. Network (Bristol, England), 14(3), 501–526. [CrossRef] [PubMed]
Harrison, W. J. (2022). Luminance and contrast of images in the THINGS database. Perception, 51(4), 244–262, https://doi.org/10.1177/03010066221083397. [CrossRef] [PubMed]
Hausamann, P., Sinnott, C. B., Daumer, M., & MacNeilage, P. R. (2021). Evaluation of the Intel RealSense T265 for tracking natural human head motion. Scientific Reports, 11(1), Article 1, https://doi.org/10.1038/s41598-021-91861-5.
Hausamann, P., Sinnott, C., & MacNeilage, P. R. (2020). Positional head-eye tracking outside the lab: An open-source solution. ACM Symposium on Eye Tracking Research and Applications, Article No.: 14, 1–5, https://doi.org/10.1145/3379156.3391365.
Hayhoe, M., & Ballard, D. (2005). Eye movements in natural behavior. Trends in Cognitive Sciences, 9(4), 188–194, https://doi.org/10.1016/j.tics.2005.02.009. [CrossRef] [PubMed]
Hayhoe, M., & Ballard, D. (2014). Modeling task control of eye movements. Current Biology, 24(13), R622–R628, https://doi.org/10.1016/j.cub.2014.05.020. [CrossRef] [PubMed]
Held, R. T., Cooper, E. A., & Banks, M. S. (2012). Blur and disparity are complementary cues to depth. Current Biology, 22(5), 426–431, https://doi.org/10.1016/j.cub.2012.01.033. [CrossRef]
Hirota, Y., Nakashima, Y., & Garcia, N. (2022). Gender and racial bias in visual question answering datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1280–1292, https://doi.org/10.1145/3531146.3533184.
Howe, C. Q., & Purves, D. (2004). Size contrast and assimilation explained by the statistics of natural scene geometry. Journal of Cognitive Neuroscience, 16(1), 90–102, https://doi.org/10.1162/089892904322755584. [CrossRef] [PubMed]
Idrees, H., Zamir, A. R., Jiang, Y.-G., Gorban, A., Laptev, I., Sukthankar, R., & Shah, M. (2017). The THUMOS challenge on action recognition for videos “in the wild.” Computer Vision and Image Understanding, 155, 1–23, https://doi.org/10.1016/j.cviu.2016.10.018. [CrossRef]
John, B., Koppal, S., & Jain, E. (2019). EyeVEIL: Degrading iris authentication in eye tracking headsets. Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications, 37, 1–5, https://doi.org/10.1145/3314111.3319816.
Juricevic, I., & Webster, M. A. (2009). Variations in normal color vision. V. Simulations of adaptation to natural color environments. Visual Neuroscience, 26(1), 133–145, https://doi.org/10.1017/S0952523808080942. [CrossRef] [PubMed]
Kay, M., Matuszek, C., & Munson, S. A. (2015). Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (pp. 3819–3828). Association for Computing Machinery, https://doi.org/10.1145/2702123.2702520.
Kothari, R., Yang, Z., Kanan, C., Bailey, R., Pelz, J. B., & Diaz, G. J. (2020). Gaze-in-wild: A dataset for studying eye and head coordination in everyday activities. Scientific Reports, 10(1), Article 1, https://doi.org/10.1038/s41598-020-59251-5.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., ... Fei-Fei, L. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32–73, https://doi.org/10.1007/s11263-016-0981-7. [CrossRef]
Lee, T.-W., Wachtler, T., & Sejnowski, T. J. (2002). Color opponency is an efficient representation of spectral properties in natural scenes. Vision Research, 42(17), 2095–2103, https://doi.org/10.1016/S0042-6989(02)00122-0. [CrossRef] [PubMed]
Lee, Y. J., Ghosh, J., & Grauman, K. (2012). Discovering important people and objects for egocentric video summarization. 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 1346–1353, https://doi.org/10.1109/CVPR.2012.6247820.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Fleet, D., Pajdla, T., Schiele, B., & Tuytelaars, T. (Eds.), Computer Vision – ECCV 2014 (pp. 740–755). New York: Springer International Publishing.
Long, B., Konkle, T., Cohen, M. A., & Alvarez, G. A. (2016). Mid-level perceptual features distinguish objects of different real-world sizes. Journal of Experimental Psychology: General, 145(1), 95–109, https://doi.org/10.1037/xge0000130. [CrossRef] [PubMed]
Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., ... Grundmann, M. (2019). MediaPipe: A framework for building perception pipelines (arXiv:1906.08172). arXiv, https://doi.org/10.48550/arXiv.1906.08172.
MacNeilage, P. (2020). Characterization of natural head movements in animals and humans. In Reference Module in Neuroscience and Biobehavioral Psychology. New York: Elsevier, https://doi.org/10.1016/B978-0-12-809324-5.24190-4.
Matthis, J. S., Yates, J. L., & Hayhoe, M. M. (2018). Gaze and the control of foot placement when walking in natural terrain. Current Biology, 28(8), 1224–1233.e5, https://doi.org/10.1016/j.cub.2018.03.008. [CrossRef]
Meissner, C. A., & Brigham, J. C. (2001). Thirty years of investigating the own-race bias in memory for faces: A meta-analytic review. Psychology, Public Policy, and Law, 7(1), 3–35, https://doi.org/10.1037/1076-8971.7.1.3. [CrossRef]
Ng, E., Xiang, D., Joo, H., & Grauman, K. (2020). You2Me: Inferring body pose in egocentric video via first and second person interactions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9890–9900. Available from https://openaccess.thecvf.com/content_CVPR_2020/html/Ng_You2Me_Inferring_Body_Pose_in_Egocentric_Video_via_First_and_CVPR_2020_paper.html.
Nishida, S. (2019). Image statistics for material perception. Current Opinion in Behavioral Sciences, 30, 94–99, https://doi.org/10.1016/j.cobeha.2019.07.003. [CrossRef]
Northcutt, C. G., Zha, S., Lovegrove, S., & Newcombe, R. (2023). EgoCom: A multi-person multi-modal egocentric communications dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6), 6783–6793, https://doi.org/10.1109/TPAMI.2020.3025105. [CrossRef] [PubMed]
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609, https://doi.org/10.1038/381607a0. [CrossRef] [PubMed]
Oruc, I., Shafai, F., Murthy, S., Lages, P., & Ton, T. (2019). The adult face-diet: A naturalistic observation study. Vision Research, 157, 222–229, https://doi.org/10.1016/j.visres.2018.01.001. [CrossRef] [PubMed]
Otterbacher, J., Bates, J., & Clough, P. (2017). Competent men and warm women: Gender stereotypes and backlash in image search results. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (pp. 6620–6631). Association for Computing Machinery, https://doi.org/10.1145/3025453.3025727.
Peterson, M. F., Lin, J., Zaun, I., & Kanwisher, N. (2016). Individual differences in face-looking behavior generalize from the lab to the world. Journal of Vision, 16(7), 12–12, https://doi.org/10.1167/16.7.12. [CrossRef] [PubMed]
Pirsiavash, H., & Ramanan, D. (2012). Detecting activities of daily living in first-person camera views. 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 2847–2854, https://doi.org/10.1109/CVPR.2012.6248010.
Portilla, J., & Simoncelli, E. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40(1), 49–71. [CrossRef]
Prabhu, V. U., & Birhane, A. (2020). Large image datasets: A pyrrhic win for computer vision? arXiv:2006.16923 [Cs, Stat], http://arxiv.org/abs/2006.16923.
Rao, R. P. N., & Ballard, D. H. (1998). Development of localized oriented receptive fields by learning a translation-invariant code for natural images. Network: Computation in Neural Systems, 9(2), https://doi.org/10.1088/0954-898x_9_2_005.
Rohrbach, M., Amin, S., Andriluka, M., & Schiele, B. (2012). A database for fine grained activity detection of cooking activities. 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 1194–1201, https://doi.org/10.1109/CVPR.2012.6247801.
Ruderman, D. (1994). The statistics of natural images. Network: Computation in Neural Systems, 5, 517–548, https://doi.org/10.1088/0954-898X/5/4/006. [CrossRef]
Sato, H., Kingdom, F. A. A., & Motoyoshi, I. (2019). Co-circularity opponency in visual texture. Scientific Reports, 9(1), Article 1, https://doi.org/10.1038/s41598-018-38029-w. [PubMed]
Sigurdsson, G. A., Gupta, A., Schmid, C., Farhadi, A., & Alahari, K. (2018). Charades-ego: A large-scale dataset of paired third and first person videos. arXiv:1804.09626 [Cs], http://arxiv.org/abs/1804.09626.
Singh, K. K., Fatahalian, K., & Efros, A. A. (2016). KrishnaCam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks. 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 2016, pp. 1–9, https://doi.org/10.1109/WACV.2016.7477717.
Sinnott, C. B., Hausamann, P. A., & MacNeilage, P. R. (2023). Natural statistics of human head orientation constrain models of vestibular processing. Scientific Reports, 13(1), Article 1, https://doi.org/10.1038/s41598-023-32794-z. [CrossRef] [PubMed]
Song, S., Lichtenberg, S. P., & Xiao, J. (2015). SUN RGB-D: A RGB-D scene understanding benchmark suite. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 567–576. Retrieved from https://openaccess.thecvf.com/content_cvpr_2015/html/Song_SUN_RGB-D_A_2015_CVPR_paper.html.
Sprague, W. W., Cooper, E. A., Tošić, I., & Banks, M. S. (2015). Stereopsis is adaptive for the natural environment. Science Advances, 1(4), e1400254, https://doi.org/10.1126/sciadv.1400254. [CrossRef] [PubMed]
Su, C.-C., Cormack, L. K., & Bovik, A. C. (2013). Color and depth priors in natural images. IEEE Transactions on Image Processing, 22(6), 2259–2274, https://doi.org/10.1109/TIP.2013.2249075.
Su, Y.-C., & Grauman, K. (2016). Detecting engagement in egocentric video. In Leibe, B., Matas, J., Sebe, N., & Welling, M. (Eds.), Computer vision – ECCV 2016 (pp. 454–471). New York: Springer International Publishing, https://doi.org/10.1007/978-3-319-46454-1_28.
Sweeney, L. (2000). Simple demographics often identify people uniquely. Health, 671, 1–34, https://doi.org/10.1184/R1/6625769.v1.
Tolhurst, D. J., Tadmor, Y., & Chao, T. (1992). Amplitude spectra of natural images. Ophthalmic and Physiological Optics, 12(2), 229–232, https://doi.org/10.1111/j.1475-1313.1992.tb00296.x. [CrossRef]
Tolia-Kelly, D. P. (2016). Visuality/materiality: Images, objects and practices. London: Routledge, https://doi.org/10.4324/9781315547930.
Tommasi, T., Patricia, N., Caputo, B., & Tuytelaars, T. (2017). A deeper look at dataset bias. In Csurka, G. (Ed.), Domain adaptation in computer vision applications (pp. 37–55). New York: Springer International Publishing, https://doi.org/10.1007/978-3-319-58347-1_2.
Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 2011, pp. 1521–1528, https://doi.org/10.1109/CVPR.2011.5995347.
Torralba, A., & Oliva, A. (2002). Depth estimation from image structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1–13. [CrossRef]
Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network (Bristol, England), 14(3), 391–412. [CrossRef] [PubMed]
Tseng, P.-H., Carmi, R., Cameron, I. G. M., Munoz, D. P., & Itti, L. (2009). Quantifying center bias of observers in free viewing of dynamic natural scenes. Journal of Vision, 9(7), 1–16, https://doi.org/10.1167/9.7.4. [CrossRef] [PubMed]
Wang, H., Katsavounidis, I., Huang, Q., Zhou, X., & Kuo, C.-C. J. (2017). Prediction of Satisfied User Ratio for Compressed Video (arXiv:1710.11090). arXiv, https://doi.org/10.48550/arXiv.1710.11090.
Webster, M. A., Mizokami, Y., & Webster, S. M. (2007). Seasonal variations in the color statistics of natural images. Network: Computation in Neural Systems, 18(3), 213–233, https://doi.org/10.1080/09548980701654405. [CrossRef]
Weiner, K. S., & Grill-Spector, K. (2013). Neural representations of faces and limbs neighbor in human high-level visual cortex: Evidence for a new organization principle. Psychological Research, 77(1), 74–97, https://doi.org/10.1007/s00426-011-0392-x. [CrossRef] [PubMed]
Xiao, J., Ehinger, K. A., Hays, J., Torralba, A., & Oliva, A. (2014). SUN database: Exploring a large collection of scene categories. International Journal of Computer Vision, 119, 1–20, https://doi.org/10.1007/s11263-014-0748-y.
Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., & Fei-Fei, L. (2018). Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision, 126(2), 375–389, https://doi.org/10.1007/s11263-017-1013-y.
Yiu, Y.-H., Aboulatta, M., Raiser, T., Ophey, L., Flanagin, V. L., zu Eulenburg, P., & Ahmadi, S.-A. (2019). DeepVOG: Open-source pupil segmentation and gaze estimation in neuroscience using deep learning. Journal of Neuroscience Methods, 324, 108307, https://doi.org/10.1016/j.jneumeth.2019.05.016. [CrossRef] [PubMed]
Zhao, D., Wang, A., & Russakovsky, O. (2021). Understanding and evaluating racial biases in image captioning. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14830–14840. Available from https://openaccess.thecvf.com/content/ICCV2021/html/Zhao_Understanding_and_Evaluating_Racial_Biases_in_Image_Captioning_ICCV_2021_paper.html.
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2), 1452–1464, https://doi.org/10.1109/TPAMI.2017.2723009. [PubMed]
Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., & Torralba, A. (2018). Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127, 302–321, https://doi.org/10.1007/s11263-018-1140-0. [CrossRef]
Figure 1. (A) Word cloud of recorded locations. (B) Word cloud of verbs from recorded tasks. In each, word size is roughly proportional to the number of recorded hours.
Figure 2. Lens vignetting for 3.6- and 4.0-mm lenses. Five seconds of video of a uniform, white wall was recorded with each lens. The video frames were converted to grayscale, and the pixel intensities were plotted as a function of horizontal location. Zero represents the center of the image plane.
Figure 3. (Left) Hard plastic mount. (Middle) Elastic strap mount. (Right) Virtual reality headset mount. The right mount also shows the custom ball joint eye extenders.
Figure 4. Sankey diagram indicating reasons for validation failures. Approximately one-third of these failures may be recoverable with additional annotation.
Figure 5. (A) Survival plot of sessions, plotting the proportion of sessions on the y axis as a function of frames retained (frames with pupil detection confidence of >0.6). Indoor sessions led to significantly higher pupil confidence than outdoor sessions. (B) Survival plot for active and sedentary sessions. Sedentary activities led to significantly higher pupil confidence.
Figure 6. (A and B) Slope of detected pupil positions over time in units of degrees of visual angle (DVA) per minute. (C) Difference in average pupil position between early (first 1,000 frames) and late (last 1,000 frames) in each session, expressed in DVA.
Figure 7. Representative frames within the VEDB. (Top row) Public activities. (Second row) Cooking. (Third row) Playing music. (Bottom row) Outdoor activities.
Figure 8. (A) Histogram of initial calibration error (degrees of visual angle [DVA]). (B) Histogram of validation error (DVA) across 423 sessions.
Figure 9. (Top) Heat maps showing the location of faces and hands in 22.5 million frames of egocentric video. (Bottom) Gaze-centered heat maps of faces and hands.
Table 1. Comparison with related works. For each feature, a symbol indicates presence, partial presence, or absence.
Table 2. Mean squared error of the interframe intervals for each data stream.