The deployment of eye movements to complex spatiotemporal stimuli likely involves a variety of cognitive factors. However, eye movements to movies are surprisingly reliable both within and across observers. We exploited and manipulated that reliability to characterize observers' temporal viewing strategies while they viewed naturalistic movies. Introducing cuts and scrambling the temporal order of the resulting clips systematically changed eye movement reliability. We developed a computational model that exhibited this behavior and provided an excellent fit to the measured eye movement reliability. The model assumed that observers searched for, found, and tracked a point of interest and that this process reset when there was a cut. The model did not require that eye movements depend on temporal context in any other way, and it managed to describe eye movements consistently across different observers and two movie sequences. Thus, we found no evidence for the integration of information over long time scales (greater than a second). The results are consistent with the idea that observers employ a simple tracking strategy even while viewing complex, engaging naturalistic stimuli.

*Children of Men* (Universal Pictures, 2006). The experiment was also conducted with a 3-min scene from the film *Russian Ark* (the State Hermitage Museum, 2002). Both scenes were shot as single takes without any cuts.

*Children of Men* was subdivided into short clips, each of equal duration. This process was repeated for five different durations (0.5 s, 1 s, 2 s, 5 s, and 30 s), which we refer to as “scramble durations.” We pooled all of these clips together, randomly shuffled their order, and concatenated them, resulting in a 30-min movie composed of interleaved clips of varying lengths, with cuts at the transition between clips (Figure 1A). Randomly interleaving clips of different durations prevented anticipatory eye movements to predictable cut onsets, which might have occurred if observers viewed separate sequences containing clips of the same scramble duration. We refer to the scrambled movie as “interleaved” and the original 6-min movie as “intact.” The same manipulation was applied to the *Russian Ark* scene to make an interleaved movie of 15 min.
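The scrambling procedure can be sketched in a few lines. This is only an illustration of the bookkeeping, not the authors' stimulus code: the function name `make_interleaved`, the `(duration, start_time)` clip representation, and the random seed are assumptions introduced here. It shows why a 6-min scene cut at five scramble durations yields a 30-min interleaved movie.

```python
import random

def make_interleaved(scene_len_s, scramble_durations_s, seed=0):
    """Cut the scene once per scramble duration, pool all clips, and shuffle.

    Returns a list of (duration, start_time) pairs in presentation order.
    Hypothetical helper; assumes each duration divides the scene evenly.
    """
    clips = []
    for d in scramble_durations_s:
        n_clips = round(scene_len_s / d)
        clips.extend((d, i * d) for i in range(n_clips))
    random.Random(seed).shuffle(clips)  # random interleaving across durations
    return clips

# 6-min scene, five scramble durations: each duration covers the scene once.
clips = make_interleaved(360.0, [0.5, 1.0, 2.0, 5.0, 30.0])
total_min = sum(d for d, _ in clips) / 60.0
```

Because every scramble duration covers the whole scene exactly once, the interleaved movie is five times as long as the intact scene.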

*Children of Men* experiment. Three of these observers, and one additional observer, participated in the *Russian Ark* experiment. Some observers had seen *Children of Men* before the experiment, but there were no qualitative differences in the results between those observers and the observers who had not seen the movie. None of the observers had seen *Russian Ark* before the experiment. For each experiment, each observer viewed the intact movie twice and the interleaved movie once (*Children of Men* shown in two consecutive parts, ∼15 min at a time; *Russian Ark* shown in whole). For all data reported for the main experiments, the observer viewed the intact movie first, then the interleaved movie, then the intact movie again. To verify that our conclusions did not rely on this ordering of conditions, we collected data from two additional observers who had not seen *Children of Men* before the experiment. These observers viewed the interleaved scene of *Children of Men* twice (on two separate days) before finally viewing the intact scene.

*Children of Men* stimuli were shown at 1037 × 560 resolution (35.5° × 19.5° of viewing angle) and the *Russian Ark* stimuli were shown at 1037 × 585 resolution (35.5° × 20.4° of viewing angle). All stimuli were shown without sound, so as to avoid potential artifacts from temporally scrambling the soundtrack and to specifically identify eye movements induced by a visual stimulus (rather than a combined audiovisual stimulus). Both scenes evoked highly reliable eye movements despite the lack of sound.

*n* = 15 observers, combining data from both movies). One observer for the *Children of Men* experiment was excluded from further analysis because the variance of his eye positions (both horizontal and vertical) for the interleaved movie and for one of the intact movie measurements was two standard deviations below that of the rest of the observers. Thus, all subsequent analyses for the main experiments were based on data from 10 observers for the *Children of Men* experiment and 4 observers for the *Russian Ark* experiment.

^{2}. The configuration was relatively conservative (hence, insensitive to noise) and ignored most microsaccades. On average, 1.7 ± 0.3 saccades/s (mean ± standard deviation across *n* = 10 observers) were detected for the *Children of Men* experiment, and 2.1 ± 0.3 saccades/s were detected for the *Russian Ark* experiment (*n* = 4 observers).

*g* and *h*, the sample cross-covariance is defined as

*k* is the time lag between the two signals, *N* is the number of samples, and *μ*_{g} and *μ*_{h} are the sample means of the two signals. Both *g* and *h* were zero-padded so that the sum was always over *N* samples. For some analyses, we used each observer's own eye movements for the intact movie, but in other analyses, we used the median (across observers) eye movement time course for the intact movie. The median eye movement time course was computed by aligning all eye movement time courses (2 repeats of each intact movie per observer; 20 in total for *Children of Men* stimuli, and 8 for *Russian Ark*) to the same set of sampling time points and taking the median at each time point. The covariance of two time courses is the value of the cross-covariance function at a time lag of *k* = 0. Covariance is often normalized by the product of the standard deviation of the time courses, yielding the familiar Pearson's correlation coefficient. We observed that eye position variances were not constant across scramble durations (see Eye movement reliability decreased with shorter scramble durations section; Figures 3E and 3F). Trying to account for how variances depend on scramble duration would have made the model intractable. Our principal analysis, therefore, was to compute unnormalized covariance.
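As a rough illustration of the quantities described above, the following sketch computes an unnormalized sample cross-covariance at an integer lag. It assumes the conventional form, in which the mean-subtracted signals are multiplied at offset *k*, terms falling outside the record contribute zero (the effect of zero-padding), and the sum is divided by *N*. The sign convention for the lag, the 1/*N* normalization, the function name, and the example values are assumptions, not details taken from the text.

```python
def cross_covariance(g, h, k):
    """Unnormalized sample cross-covariance of g and h at lag k (in samples).

    Assumed convention: (1/N) * sum_n (g[n] - mean_g) * (h[n + k] - mean_h),
    with terms outside the record treated as zero (zero-padding).
    """
    n = len(g)
    if len(h) != n:
        raise ValueError("signals must have the same length")
    mu_g = sum(g) / n
    mu_h = sum(h) / n
    total = 0.0
    for i in range(n):
        j = i + k
        if 0 <= j < n:  # padded samples add nothing to the sum
            total += (g[i] - mu_g) * (h[j] - mu_h)
    return total / n

# Hypothetical normalized eye positions for two viewings of the same scene.
g = [0.50, 0.52, 0.61, 0.58, 0.49]
h = [0.51, 0.50, 0.53, 0.62, 0.57]
covariance = cross_covariance(g, h, 0)   # value at lag 0
lagged = cross_covariance(g, h, 1)       # value at a lag of one sample
```

Leaving the result unnormalized, as in the principal analysis, keeps the measure sensitive to changes in eye position variance that a correlation coefficient would divide out.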

*p* value as the fraction of the null distribution that was as large or larger than the covariance observed without randomization.

*λ* (Figure 5C) that the observer would find and lock on to the point of interest after each saccade. Assuming that the saccades were statistically independent after each cut made the model analytically tractable, but violations of this independence assumption would not have qualitatively changed the predictions of the model (see Integration of visual information across fixations during search section). We also defined *Q*_{H} and *Q*_{V} to be the “maximal” covariances attainable (in horizontal and vertical eye positions, respectively) between an intact eye movement time course and the trajectory of the point of interest. For a particular scramble duration, we expressed the predicted covariance between an unscrambled eye movement time course, *S*^{d}, and the point of interest time course, *S*, as a function of *λ*, *Q*_{H}, and *Q*_{V} (see Equation A12 in 1).

*n* = 10) for the intact movie as an estimate for the point of interest, which served as a prediction for the unscrambled eye movement time courses for the interleaved movie. Covariances were computed between individual observers' unscrambled eye movements, *S*^{d}, and the median time course, *S* (see Covariance analysis section above). We fit the model to the data by finding parameters that minimized the squared error between the predicted covariance (from Equation A12 in 1) and measured covariance. First, we estimated the parameters for the inter-saccade interval distribution of each observer. Specifically, parameters *μ* and *σ* (Equation A1 in 1) were determined by fitting a lognormal distribution (using maximum likelihood, *lognfit* function in MATLAB) to each observer's inter-saccade intervals for the intact movie. Fitted values of *μ* and *σ* did not vary substantially across observers. Second, with *μ* and *σ* fixed for each observer, we then estimated the parameters that best accounted for the covariance values for that observer. The covariance was computed between that observer's unscrambled eye movement time course, *S*^{d}, for each scramble duration *d*, and the point of interest time course, *S*, separately for horizontal and vertical eye positions. We used the median eye movements for the intact movie as an estimate of *S* because it is robust to outliers; using the mean eye movement time course produced similar results. The fit was performed simultaneously for all scramble durations and simultaneously for horizontal and vertical eye positions. We accounted for individual variation in maximal covariance in both horizontal and vertical eye positions with the free parameters *Q*_{H} and *Q*_{V}. A constrained non-linear optimization routine (*fmincon* function) in MATLAB was used to numerically solve for the values of the three free parameters (*λ*, *Q*_{H}, and *Q*_{V}) that minimized the squared error between the predicted and measured covariances (10 data points). In the fit, *λ* was constrained to be between 0 and 1 and *Q*_{H} and *Q*_{V} to be greater than 0. Hence, there were a total of 5 free parameters: *μ* and *σ* were fit to the inter-saccade interval distributions, and *λ*, *Q*_{H}, and *Q*_{V} were fit to the measured covariances.
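The first fitting step, estimating *μ* and *σ* from inter-saccade intervals, has a simple closed form under maximum likelihood: the mean and standard deviation of the log intervals. The sketch below is a Python stand-in for that step and is not the authors' MATLAB code; the function names and the example intervals are hypothetical, and the variance normalization (dividing by *n* rather than *n* − 1) may differ from the routine the authors used.

```python
import math

def fit_lognormal(intervals_s):
    """Maximum-likelihood lognormal fit: mean and SD of the log intervals.

    Illustrative helper; divides by n (the ML estimate), which is an
    assumption about normalization.
    """
    logs = [math.log(x) for x in intervals_s]  # intervals must be positive
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma

def lognormal_mean(mu, sigma):
    """Mean of a lognormal distribution with log-scale parameters mu, sigma."""
    return math.exp(mu + 0.5 * sigma ** 2)

# Hypothetical inter-saccade intervals in seconds.
intervals = [0.31, 0.42, 0.55, 0.60, 0.95]
mu_hat, sigma_hat = fit_lognormal(intervals)
mean_interval = lognormal_mean(mu_hat, sigma_hat)
```

With *μ* and *σ* held fixed, only *λ*, *Q*_{H}, and *Q*_{V} remain to be estimated from the covariances in the second step.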

*λ* (Figure 5C). For each observer, eye movement epochs of 30 s were randomly sampled with replacement from the eye movement time course for the intact movie and concatenated to obtain an eye movement time course with length equivalent to the length of the original scene (6 min for *Children of Men* and 3 min for *Russian Ark*). Corresponding epochs were extracted from the five unscrambled eye movement time courses for the interleaved movie, such that for each 30-s epoch, eye positions for the 30-s scramble duration were derived from a single clip, and eye positions for the remaining four scramble durations were derived from clips that had been unscrambled to match the content of that 30-s clip. After each resampling, covariances were recomputed and the fit was performed to reestimate *λ*. This procedure was repeated 1000 times, and the 2.5th and 97.5th percentiles of the resulting distribution of *λ* values provided a 95% confidence interval (equivalent to two standard deviations if the distributions were normally distributed).
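The resampling logic can be sketched generically as a percentile bootstrap over epochs. In the sketch below the statistic is a simple mean standing in for the full refit of *λ*, which would require the model equations; the function name, the per-epoch values, and the percentile indexing rule are assumptions made for illustration.

```python
import random

def bootstrap_ci(epochs, statistic, n_boot=1000, seed=0):
    """Percentile bootstrap: resample epochs with replacement, recompute
    the statistic each time, and return the 2.5th and 97.5th percentiles.
    """
    rng = random.Random(seed)
    n = len(epochs)
    estimates = []
    for _ in range(n_boot):
        sample = [epochs[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(sample))
    estimates.sort()
    lo = estimates[int(0.025 * (n_boot - 1))]
    hi = estimates[int(0.975 * (n_boot - 1))]
    return lo, hi

def mean(values):
    return sum(values) / len(values)

# Twelve hypothetical per-epoch values (a 6-min scene holds twelve 30-s epochs).
epochs = [0.71, 0.80, 0.78, 0.85, 0.69, 0.90, 0.76, 0.82, 0.74, 0.88, 0.79, 0.81]
ci_low, ci_high = bootstrap_ci(epochs, mean)
```

Resampling whole 30-s epochs rather than individual samples preserves the temporal structure within each epoch.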

*λ*, *Q*_{H}, and *Q*_{V}) to predict covariances on the remaining half of data and the fitted parameter *λ* was used to compare model predictions with actual covariances for the other half of the data (testing). The cross-validation was unstable in individual observers due to the occasional occurrence (for some training and testing splits) of large differences in asymptotic covariances *Q*_{H} and *Q*_{V} between the training and testing data. We therefore performed this analysis only after concatenating data across all observers, which stabilized estimates of maximal reliability. This procedure was performed 1000 times to obtain a 95% confidence interval on the goodness-of-fit measure *r*^{2} (coefficient of determination or percentage variance explained by the fit) for the combined data.

*G*(*t*) = E[(*S*^{d}(*t*) − *S*(*t*))^{2}], was computed for each observer (Figure 6B), where *S*^{d}(*t*) was the unscrambled eye movement time course for clip duration *d* from the interleaved movie, *S*(*t*) was the median eye movement time course for the intact movie, and *t* ranged from 0 to *d* for each *S*^{d} of a particular duration *d* (i.e., from the beginning to the end of each clip). *G*(*t*) was computed by averaging across all clips from all scramble durations for that observer, aligned to each cut. *G*(*t*) computed separately for each scramble duration *d* yielded similar curves, therefore justifying averaging across durations, resulting in more averaging for smaller values of *t*.

*v*_{Sd}(*t*), was estimated as a function of time after a cut (Figure 6A). The sample mean eye position time course, E[*S*^{d}(*t*)], averaged across clips, was ∼0.5 for both horizontal and vertical dimensions (center of the screen) at any time *t*. The variance *v*_{Sd}(*t*), therefore, reflected the fact that eye positions tended to cluster near the center of the screen shortly after a cut and then gradually expand outward over time (Figure 6A). *v*_{Sd}(*t*) was computed separately for each observer across all clips from all scramble durations for that observer; all observers showed the same tendency.

*G*_{0}(*t*), was computed as the sum of the variance (across clips) of the unscrambled eye movements, *v*_{Sd}(*t*), and the variance (over time) of the median eye movement time course (see 21 section in 1). Intuitively, *G*_{0}(*t*) reflected how eye position error would have evolved over time after a cut if the unscrambled eye movements never locked on to the point of interest. *G*_{0}(*t*) was not constant over time, as would be expected if the variance of *S*^{d}(*t*) was stationary, confirming that it would have been inappropriate to use *G*(*t*) by itself to infer the temporal dependence of *S*^{d}(*t*) on the point of interest.

*G*(*t*)/*G*_{0}(*t*) (Figure 6C), which could be interpreted as an estimate for the probability (across clips) that the unscrambled eye position was locked on to the point of interest (the median eye position) as a function of time after a cut (see 1 for derivation).
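The eye position error analysis can be sketched as follows, assuming cut-aligned clips stored as lists of samples. The sketch computes the measured error *G*(*t*), the variance across clips *v*_{Sd}(*t*), the maximal error *G*_{0}(*t*) = *v*_{Sd}(*t*) + *v*_{S}, and one minus their ratio. The function name, the data layout, and the truncation of all clips to the shortest clip length are simplifying assumptions; the text instead averages over however many clips are available at each *t*.

```python
def fractional_explained_variance(eye_clips, poi_clips, poi_variance):
    """1 - G(t) / G0(t) as a function of sample index t after a cut.

    eye_clips[c][t]: unscrambled eye position for clip c (hypothetical layout).
    poi_clips[c][t]: point-of-interest position for the same clip and time.
    poi_variance:    variance over time of the point of interest (v_S > 0).
    """
    n_clips = len(eye_clips)
    n_t = min(len(c) for c in eye_clips)  # simplification: common clip length
    result = []
    for t in range(n_t):
        eye = [eye_clips[c][t] for c in range(n_clips)]
        mean_eye = sum(eye) / n_clips
        v_sd = sum((e - mean_eye) ** 2 for e in eye) / n_clips
        g = sum((eye_clips[c][t] - poi_clips[c][t]) ** 2
                for c in range(n_clips)) / n_clips
        g0 = v_sd + poi_variance  # error expected if never locked on
        result.append(1.0 - g / g0)
    return result

# Hypothetical clips in which the eye coincides with the point of interest.
poi = [[0.2, 0.4], [0.6, 0.5]]
locked = fractional_explained_variance(poi, poi, poi_variance=0.04)
```

When the eye position coincides with the point of interest, the measured error is zero and the fraction equals 1; dividing by *G*_{0}(*t*) removes the dependence on how eye position variance changes after a cut.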

*Children of Men*. Each observer viewed the scene twice. The movie stimulus evoked reliable eye movements across repeated presentations within an individual observer and across observers (Figure 2A). We quantified the degree of reliability using cross-covariance (see Covariance analysis section), separately for horizontal and vertical eye movements. For each observer, cross-covariance was computed between eye movement time courses for two presentations of the intact movie (for that observer) and between eye movements for one presentation of the intact movie (for that observer) and the median eye movement time course across the other 9 observers (Figure 2B). In both cases, cross-covariance was maximal at a time lag of zero, suggesting that correlated changes in eye position were time locked to stimulus events. The width of the peak indicated the temporal precision of the time locking. The magnitude of the peak at a time lag of zero (i.e., the covariance) provided a measure of the reliability of eye movements for that observer, given instrument noise and the observer's internal cognitive and motor variability across repeated measurements. The cross-covariance for time lags far from zero provided a qualitative baseline for spurious covariance due to chance. In general, covariance was high (well above the baseline for all observers, and highly statistically significant: *p* < 0.001 for all observers, phase randomization test; see Covariance analysis section).

*p* < 0.025 for the 0.5-s scramble duration for all observers in horizontal eye position and for 8 out of 10 observers in vertical eye position; *p* < 0.025 for all other scramble durations for all observers in both horizontal and vertical; phase randomization test, see Covariance analysis section). Covariance increased monotonically with scramble duration, for each of the 10 observers (Figures 3A and 3B). Covariances were computed by comparing a single observer's unscrambled eye movements with the median intact eye movements across observers. Covariances computed by comparing eye movements within an observer were similar. The covariance between the eye movements for two presentations of the intact movie indicated maximal reliability attainable for an observer, in the absence of scrambling. This covariance was computed, for each observer, between each intact eye movement time course from the individual observer (two per observer) and the median time course across the other observers. Covariances were then averaged between the two estimates per observer and across all observers (Figures 3A and 3B, dashed lines). The fact that all observers were similarly affected by the scrambling manipulation suggests a behavioral strategy or computation that was common across observers.

*p* > 0.1 for all 6 comparisons; randomization test, whether or not corrected for multiple comparisons). This suggests that eye movement reliability did not depend significantly on repeated viewings of the same scene.

*p* > 0.05 for horizontal and vertical covariance values in all scramble durations; randomization test, corrected for multiple comparisons). Furthermore, for both of the additional observers, covariance values were qualitatively similar across the two repeated presentations of the interleaved scene, validating our earlier observation that reliability measurements did not depend substantially on the order of presentation or the experience of prior presentations.

*λ* that the fixation following each saccade would lock on to the point of interest. Thus, the probability that the observer locks on at a given time after a cut is a weighted sum: The first term is the probability that the first saccade occurs at that time and finds the point of interest, the second is the probability that the second saccade occurs at that time and finds the point of interest (given that the previous saccade did not), and so on. The mean of this probability distribution corresponds to the average time that it takes for an observer to find the point of interest following a cut. For small values of *λ*, it becomes increasingly likely that the point of interest will be found only after a long period of time (Figure 4B). The probability of having locked on to the point of interest within any particular time after a cut likewise depends on *λ* (Figure 4C). The function rises more slowly for a smaller *λ*, because it takes more time to accumulate probability of having locked on.

*λ*, the probability of finding the point of interest at each fixation following a saccade (Figure 4D). We assumed that, while the observer locked on to some true point of interest, covariance with that point of interest was maximal. However, until he or she locked on, covariance was 0. Under this assumption, covariance over the course of the movie was proportional to the relative amount of time during which the observer was locked on (see Equation A9 in 1). For example, when scramble durations were long, an observer spent most of the time locked on, and covariance was nearly maximal. However, when scramble durations were short (and there were many cuts), the observer spent less time locked on and more time searching for points of interest, so covariance was smaller. By such reasoning, we derived a closed-form expression for the covariance expected at different scramble durations (see Equation A12 in 1). The relationship depends only on the frequency of saccades, the maximal obtainable covariance, and the free parameter *λ*, which describes the probability that an observer found the point of interest at each fixation following a saccade. Values for these parameters were found by numerically minimizing the squared error between the observed covariances and the predicted covariances. Parameterizing saccade times using a lognormal distribution yielded a closed-form solution, but the qualitative predictions of the model did not depend on the specific form of the saccade time distribution.
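The reasoning above (zero covariance while searching, maximal covariance once locked on) can be illustrated with a small Monte Carlo sketch in place of the closed-form expression, which is not reproduced here. The function names, the lognormal parameters (−0.7 and 0.5), the value of *λ*, and the unit maximal covariance are illustrative assumptions, not fitted values from the study.

```python
import random

def sample_lock_on_time(lam, mu, sigma, rng):
    """Time after a cut at which the observer locks on: saccades arrive at
    lognormal intervals and each one succeeds with probability lam."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must be in (0, 1]")
    t = 0.0
    while True:
        t += rng.lognormvariate(mu, sigma)  # next inter-saccade interval
        if rng.random() < lam:
            return t

def predicted_covariance(d, lock_times, q_max):
    """Maximal covariance scaled by the mean fraction of a clip of duration d
    spent locked on (zero if the clip ends before lock-on)."""
    locked = sum(max(0.0, d - t) for t in lock_times)
    return q_max * locked / (d * len(lock_times))

rng = random.Random(0)
lock_times = [sample_lock_on_time(0.8, -0.7, 0.5, rng) for _ in range(5000)]
durations = (0.5, 1.0, 2.0, 5.0, 30.0)
curve = [predicted_covariance(d, lock_times, 1.0) for d in durations]
```

The resulting curve rises toward the maximal covariance as scramble duration grows, because the search time becomes a smaller fraction of each clip.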

*λ* and maximal covariances (horizontal and vertical) that best predicted the measured covariances across all scramble durations. The median (across observers) eye movement time course for the intact movie provided an estimate for the point of interest trajectory, and covariance was computed, for each observer, between each of the unscrambled eye movement time courses and this point of interest. The model fit the data well; when fit to the covariances combined across observers (see Modeling fitting section), *r*^{2} was 0.88 (cross-validated 2.5th–97.5th percentiles = 0.61–0.98). The fitted value of *λ* was 0.79, corresponding to an expected time of 0.73 s (bootstrapped 2.5th–97.5th percentiles = 0.63–0.83 s) within which observers were able to find and lock on to the point of interest. The model was also separately fit to the data for each individual observer, and again accounted for most of the variance in the data from each observer (Figures 5A and 5B). Fitted values of *λ* for individual observers were between 0.5 and 1 (mean *λ* = 0.82 across 10 observers, Figure 5C), corresponding to an expected time of 0.75 ± 0.16 s (mean ± standard deviation, *n* = 10) for locking on to the point of interest. Although there may have been systematic individual differences in *λ*, our data did not have sufficient sensitivity or statistical power to explore them; values of *λ* varied somewhat across observers, but the confidence intervals for the most part overlapped. We fit the model to data from the two additional observers who viewed the scenes in a different order (see Stimuli and experimental procedure section). The model provided a good fit for those observers as well, with fitted values of *λ* = 0.82, 0.73 (additional observer 1, two separate presentations of the interleaved scene) and 0.56, 0.99 (additional observer 2), comparable to those obtained for the original 10 observers (*p* = 0.35, randomization test).

*λ* was sensitive to the parameter estimates of the saccade latencies used during the fit, the expected time to find the point of interest (computed from a combination of the fitted value of *λ* and inter-saccade interval parameters) did not depend on the specific parameterization of saccade times. Compared to the overall distribution of saccade latencies, inter-saccade intervals tended to be shorter immediately after a cut. Therefore, our use of saccade parameters derived from eye positions from the overall intact scene was an oversimplification. Using shorter inter-saccade intervals for the fit yielded smaller values of *λ* than reported above. However, we verified that the overall expected time to find the point of interest remained the same. This means that when saccade latencies were shorter, the probability of finding the point of interest after each saccade was consequently lower, resulting in a greater number of saccades to reach the point of interest.

*λ* and saccade latencies) using a complementary and independent analysis (Figure 6). Deviations between unscrambled eye movements and the estimated point of interest trajectory (“eye position error”) were computed as a function of time after each cut (Figure 6B). Eye position error started high right after a cut, decreased sharply, and then increased gradually over time. However, the eye position error at any time point after a cut depended not only on the difference between the unscrambled eye position and the estimated point of interest but also on the variance of unscrambled eye movements (Figure 6A), which showed a similar decrease and then increase over time. To isolate the component of the eye position error that was independent of eye movement variance, we first computed the maximal eye position error, which reflected how eye position error would have evolved if the unscrambled eye movements never locked on to the point of interest (error was always maximal). This maximal eye position error was computed from the variance (across clips) of the unscrambled eye movements (Figure 6A) and the variance (over time) of the point of interest (see 21 section in 1). One minus the ratio between the measured and maximal eye position errors (“fractional explained variance,” Figure 6C) indicated how well the trajectory of the point of interest predicted the measured eye movements, independent of the eye movement variance (see 21 section in 1). The fractional explained variance started to increase about 0.2 s after cuts and flattened out after 0.5–0.8 s. This shows that the point of interest predicted the unscrambled eye movements poorly right after a cut but did better given enough time, consistent with the model's prediction that eye movements start out uncorrelated with the point of interest and then converge.

Time courses of fractional explained variance computed separately for each scramble duration were nearly identical, consistent with the model's assumption that convergence on to the point of interest, on average, depended only on the amount of time the observer had to view the clip after a cut. The results are also consistent with the idea that the search-and-track process reset following each cut.

*λ* and the best-fitting lognormal parameters of saccade latencies (*μ* and *σ*; see Equation A1 in 1) showed a similar fractional explained variance (Figure 6C, inset), which also started to increase at 0.2 s after cuts and achieved asymptote around 1 s. The simulated fractional variance showed a more gradual rise, which might be due to our imperfect assumption that each saccade was independent and had a fixed probability of finding the point of interest (see Integration of visual information across fixations during search section). Despite this difference, the probability of finding the point of interest averaged over the initial few fixations was similar for both the measurement and the simulation, consistent with the predictions of our model. This analysis also revealed the fine-grained temporal dynamics of locking on, an aspect of the results (and the model) not fully captured by the covariance analysis.

*Russian Ark*, 2002). Eye movements for this movie showed a similar relationship between scramble duration and covariance (Figure 7), confirming that the dependence of eye movement reliability on scramble duration was not specific to the choice of film. The model fit the eye movements well (*r*^{2} = 0.91, cross-validated 2.5th–97.5th percentiles = 0.66–0.98), yielding values of *λ* qualitatively similar to those estimated with the other movie (compare Figures 5C and 7C) and an expected time of 0.85 ± 0.40 s (mean ± standard deviation, *n* = 4) within which observers were able to find the point of interest. It remains to be tested whether the model would yield substantially different results for other classes of movies. A starting assumption of the model is that the unperturbed eye movements are reliable (high covariance between eye movements for the intact movie). Since reliability depends on the content of a movie (Dorr, Martinetz, Gegenfurtner, & Barth, 2010; Hasson, Landesman et al., 2008; Shepherd et al., 2010), e.g., differences in the degree to which the stimulus engages an observer, it is possible that we would observe different results for a substantially different choice of film (e.g., a static scene without action or movement or a scene with many cuts). Nonetheless, the fact that our model provided a good fit for two very different movie stimuli is consistent with its content-free nature and suggests some degree of generalizability.

*λ* in Equation A3, see 1 and Figure 5C). We could not, however, exclude the possibility that this probability increased across fixations during the period before locking on (i.e., that information was accumulated across fixations about the likely location of the point of interest). Such a framework in which the observer uses prior information to search for relevant points of interest bears some resemblance to visual search (Treisman & Gelade, 1980). Human behavior during search has been modeled by assuming that the observer chooses where to look to maximize information about the location of the target (Najemnik & Geisler, 2005). Accordingly, visual information is integrated across fixations and updated iteratively. There is also empirical evidence for the accrual of visual information across the first two fixations during search (Caspi, Beutter, & Eckstein, 2004), within the time frame that observers typically find the point of interest for our movie stimuli. Note that visual search models indicate that human performance does not significantly depend on information integrated beyond a relatively short time scale of two fixations (Najemnik & Geisler, 2005). If the observer indeed integrates information across fixations to optimally locate the point of interest, the probability of locking on should increase with every fixation. In that case, the fitted values of *λ* (Figures 5C and 7C) can be thought of as an average probability of finding the point of interest over those fixations. However, this would not change the model's prediction of the average time required to find the point of interest after a cut. As such, this elaboration would not affect the model's prediction for how covariance with the point of interest depends on scramble duration. Indeed, the fractional explained variance analysis revealed that the probability (across clips) of locking on rose more sharply than predicted by the model (Figure 6C), possibly suggesting integration of visual information within the first couple of saccades (e.g., the first saccade had a lower probability of locking on than predicted, and the second saccade had a higher probability).

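*t*. The intervals at which an observer makes saccades are well described by a lognormal distribution, with parameters that can be estimated directly from our data (Figure 4A). The lognormal probability density function with parameters *μ* and *σ* is defined as

$$f(t \mid \mu, \sigma) = \frac{1}{t\,\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\ln t - \mu)^{2}}{2\sigma^{2}}\right), \quad t > 0.$$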

*j*th saccade is the sum of *j* random variables, each with a probability density distribution *f*(*t* ∣ *μ*, *σ*). We define the probability density distribution for the time of the *j*th saccade as *z*_{j}(*t*). The pdf *z*_{j}(*t*) is the convolution of lognormal pdf *f*(*t*) with itself *j* − 1 times. For *j* > 1, this expression has no closed form, so we used an approximation. The convolution of *j* identical lognormal functions *f* with parameters *μ* and *σ* is commonly approximated by another lognormal distribution, *f*(*t* ∣ *μ*_{j}, *σ*_{j}), where the mean and variance of *f*(*t* ∣ *μ*_{j}, *σ*_{j}) were matched to *j* times those of *f*(*t* ∣ *μ*, *σ*) (Fenton–Wilkinson method; Fenton, 1960). We verified in simulation that this approximation was accurate to within 1% error for the range of *μ*, *σ*, and *j* used in our calculations.
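The moment matching behind the Fenton–Wilkinson approximation can be written out explicitly. A lognormal with parameters *μ* and *σ* has mean exp(*μ* + *σ*²/2) and variance (exp(*σ*²) − 1)·exp(2*μ* + *σ*²); requiring the approximating lognormal to have *j* times that mean and *j* times that variance gives the expressions in the sketch below. The function name and the example parameter values are assumptions for illustration.

```python
import math

def fenton_wilkinson(mu, sigma, j):
    """Lognormal parameters (mu_j, sigma_j) whose mean and variance equal
    j times those of a lognormal with parameters (mu, sigma)."""
    s2 = sigma ** 2
    # Matching the variance-to-squared-mean ratio fixes sigma_j.
    sigma_j2 = math.log(1.0 + (math.exp(s2) - 1.0) / j)
    # Matching the mean, j * exp(mu + s2 / 2), then fixes mu_j.
    mu_j = math.log(j) + mu + 0.5 * s2 - 0.5 * sigma_j2
    return mu_j, math.sqrt(sigma_j2)

# Hypothetical inter-saccade parameters; approximate the time of the 3rd saccade.
mu_3, sigma_3 = fenton_wilkinson(-0.7, 0.5, 3)
mean_3 = math.exp(mu_3 + 0.5 * sigma_3 ** 2)
```

For *j* = 1 the expressions reduce to the original *μ* and *σ*, and *σ*_{j} shrinks as *j* grows, reflecting the narrowing relative spread of a sum of independent intervals.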

Notation | Type | Definition |
---|---|---|
L | Constant | Duration of the original scene |
d | Constant | Duration of each clip used to evenly divide up the scene |
S | Time course | Point of interest time course (median eye movement time course for intact movie) |
S^{d} | Time course | Unscrambled eye movement time course for scramble duration d |
f(μ, σ) | Function | Lognormal probability density function describing inter-saccade intervals; depends on μ and σ |
z_{j} | Function | Probability density function describing the time of the jth saccade |
p_{T} | Function | Probability density function describing the likelihood of finding a point of interest as a function of time after a cut onset; depends on f and λ |
P_{T} | Function | Cumulative distribution function of p_{T} |
C | Function | Covariance between two eye movement time courses |
T | Variable | A continuous random variable with pdf p_{T} |
t | Variable | Time after a cut (a value for random variable T with pdf p_{T}; Pr(t < T < t + dt) = p_{T}(t)dt for an infinitesimally small interval dt) |
τ | Variable | A random variable describing the amount of time an observer is not locked on to the point of interest within a cut; depends on T and d |
μ, σ | Parameters | Parameters governing the shape of the lognormal pdf f |
μ_{j}, σ_{j} | Parameters | Parameters governing the shape of the lognormal pdf for the time of the jth saccade (an approximation for z_{j}(t)) |
λ | Parameter | Probability of finding and locking on to a point of interest on each fixation after a saccade |
Q | Parameter | Maximal covariance between an intact eye movement time course and the trajectory of the point of interest, as limited by noise |
N | Empirical measure | Total number of samples in the time courses (corresponding to a total time duration of L) |
N_{0} | Empirical measure | Number of samples (out of N) that the observer is not locked on to the point of interest |
S_{0} | Time course | Random point of interest with the same distribution as entries in S, i.e., with mean and variance μ_{S} and v_{S} |
μ_{Sd} | Function | Mean of unscrambled eye movements, i.e., mean of S^{d} (stationary with respect to time after a cut) |
μ_{S} | Function | Mean of the point of interest, i.e., mean of S (stationary with respect to time after a cut) |
v_{Sd}(t) | Function | Variance of unscrambled eye movements after a cut, i.e., variance of S^{d}(t) |
v_{S} | Function | Variance of the point of interest, i.e., variance of S (stationary with respect to time after a cut) |
G(t) | Function | Eye position error between the unscrambled time courses and the point of interest after a cut; G(t) = E[(S^{d}(t) − S(t))^{2}] |
G_{0}(t) | Function | Maximal eye position error or the eye position error expected between the unscrambled time courses and an uncorrelated, random point of interest after a cut; G_{0}(t) = E[(S^{d}(t) − S_{0})^{2}] = v_{Sd}(t) + v_{S} |

On each fixation after a saccade, the observer has a probability $\lambda$ of finding and locking onto the point of interest. Thus, the probability of finding the point of interest precisely on the $j$th fixation is $\lambda(1 - \lambda)^{j-1}$. This is the probability of finding the point of interest on the $j$th fixation times the probability of not finding it on all previous fixations.

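We define a probability density function $p_T(t)$ for a continuous random variable $T$, which describes the probability of finding a point of interest over time after a cut:

$$p_T(t) = \sum_{j=1}^{\infty} \lambda (1 - \lambda)^{j-1} f(t \mid \mu_j, \sigma_j), \tag{A3}$$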

where $f(t \mid \mu_j, \sigma_j)$ is the lognormal approximation for $z_j(t)$, the probability density function for the time of the $j$th saccade, and parameters $\mu_j$ and $\sigma_j$ are related to $\mu$ and $\sigma$ as in Equations A2. Note that for $j = 1$, $z_1(t) = f(t \mid \mu, \sigma)$. Consistent with standard probability notation, lowercase $t$ denotes a specific value for the random variable $T$:

$$\Pr(t < T < t + dt) = p_T(t)\,dt.$$

$p_T(t)$ in Equation A3 is a sum of a series of lognormal distributions. Each lognormal distribution describes the time of an individual saccade, and each distribution is weighted by the probability of finding the point of interest following that saccade. When $\lambda = 1$, the observer always fixates the point of interest after the first saccade, and $p_T(t)$ is equal to a lognormal distribution describing the time of that saccade (all terms in the summation where $j > 1$ equal 0). For smaller $\lambda$, more saccades are required to find the point of interest, and the shape of $p_T(t)$ changes to have larger probabilities associated with later saccades (Figure 4B).
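The weighted sum of lognormal densities described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the function names, the truncation of the infinite sum (`max_saccades`, `min_weight`), and the parameter values are all assumptions.

```python
import math


def lognormal_pdf(t, mu, sigma):
    """Lognormal density f(t | mu, sigma); zero for t <= 0."""
    if t <= 0.0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))


def fenton_wilkinson(mu, sigma, j):
    """Moment-matched lognormal parameters for the time of the jth saccade."""
    sigma_j_sq = math.log((math.exp(sigma ** 2) - 1.0) / j + 1.0)
    mu_j = math.log(j) + mu + 0.5 * (sigma ** 2 - sigma_j_sq)
    return mu_j, math.sqrt(sigma_j_sq)


def p_T(t, mu, sigma, lam, max_saccades=500, min_weight=1e-12):
    """Density of the time at which the point of interest is found.

    The sum over saccades is truncated once the geometric weight
    lam * (1 - lam)**(j - 1) becomes negligible (an assumption of this
    sketch; very small lam would need a larger max_saccades).
    """
    total = 0.0
    for j in range(1, max_saccades + 1):
        weight = lam * (1.0 - lam) ** (j - 1)
        if weight < min_weight:
            break
        mu_j, sigma_j = fenton_wilkinson(mu, sigma, j)
        total += weight * lognormal_pdf(t, mu_j, sigma_j)
    return total


# Hypothetical values: with lam = 1 the density reduces to f(t | mu, sigma).
print(p_T(0.4, -1.0, 0.5, 1.0), lognormal_pdf(0.4, -1.0, 0.5))
```

Lowering `lam` in this sketch shifts probability mass toward later saccades, which is the qualitative behavior described for smaller $\lambda$.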

The cumulative distribution function of $p_T(t)$ is

$$P_T(t) = \int_0^t p_T(x)\,dx.$$

$P_T(t)$ increases to 1 as $t$ increases ($P_T(t) \to 1$ as $t \to \infty$), which means that the probability of finding the point of interest converges to 1 as the amount of time allotted to find it increases (Figure 4C). For larger values of $\lambda$, the slope is steeper, i.e., $P_T(t)$ converges to 1 more quickly. When there is only a finite amount of time in a clip (i.e., $t$ is bounded), and especially when $\lambda$ is small, $P_T(t)$ may still be far from 1 even when $t$ achieves its maximum value. That is, there is a non-trivial probability that the observer will not have found the point of interest before the end of the clip.

Suppose a scene of duration $L$ is divided evenly into clips, each of length $d$, whose order may be randomly scrambled. Over the entire movie, there are $L/d$ clips in total. For each clip, the above probability distributions are used to estimate the average value of a random variable $\tau$, which describes the time during which the observer is not locked on to the point of interest for that clip. For any given clip, the maximal value that $\tau$ can take is $d$. When $\tau$ is less than $d$, the value of $\tau$ depends on the value of the random variable $T$ with probability density function given by Equation A3. Thus, a natural choice is to define the variable $\tau$ piecewise:

$$\tau = \begin{cases} T, & T < d, \\ d, & T \ge d. \end{cases} \tag{A6}$$

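The expected value of $\tau$ is given by

$$\mathrm{E}(\tau) = \mathrm{E}(\tau \mid T < d)\Pr(T < d) + \mathrm{E}(\tau \mid T \ge d)\Pr(T \ge d),$$

where $T$ is the random variable with density $p_T(t)$ and cumulative distribution $P_T(t)$ as given above. The two expectations on the right-hand side are

$$\mathrm{E}(\tau \mid T < d) = \frac{1}{\Pr(T < d)} \int_0^d t\,p_T(t)\,dt, \qquad \mathrm{E}(\tau \mid T \ge d) = d.$$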

$S(t)$ is the “correct” eye position (as a function of time $t$) corresponding to the point of interest in the intact movie, and $S^d(t)$ is the unscrambled eye movement time course for scramble duration $d$. The expected duration of $S^d(t)$ that is not correlated with $S(t)$ is therefore $\mathrm{E}(\tau)$ summed over all $L/d$ cuts: $(L/d)\,\mathrm{E}(\tau)$.

We define $Q$ to be the covariance between an intact eye movement time course made by the observer and the trajectory of the point of interest. If the observer's eye positions matched the location of the point of interest perfectly when locked on (i.e., there was no noise or variability), $Q$ would simply be the variance of $S(t)$. The actual value of $Q$ depends on both the measurement noise and the observer's cognitive and motor variability (including exploratory eye movements to look for a new point of interest; see Model section). We interpret $Q$ as the maximal covariance attainable for that observer.

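We assume that the covariance equals the maximal covariance, $Q$, times 1 minus the fraction of time during which the eye movements are uncorrelated (see the section below). By this assumption, the covariance $C$ between $S^d(t)$ and $S(t)$ is given by

$$C = Q\left[1 - \frac{(L/d)\,\mathrm{E}(\tau)}{L}\right]. \tag{A9}$$

Substituting the expression for $\mathrm{E}(\tau)$, we obtain

$$C = Q\left[1 - \frac{L}{d} \cdot \frac{\mathrm{E}(\tau \mid T < d)\Pr(T < d) + \mathrm{E}(\tau \mid T \ge d)\Pr(T \ge d)}{L}\right].$$

Because $\Pr(T \ge d) = 1 - \Pr(T < d)$, and $\Pr(T < d)$ is the cumulative distribution $P_T(t)$ evaluated at $d$, substituting in Equation A8 (with $\mathrm{E}(\tau \mid T < d)\Pr(T < d) = \int_0^d t\,p_T(t)\,dt$ and $\mathrm{E}(\tau \mid T \ge d) = d$) and canceling out $L$ yields

$$C = Q\left[1 - \frac{1}{d}\int_0^d t\,p_T(t)\,dt - \frac{d}{d}\bigl(1 - P_T(d)\bigr)\right].$$

Simplifying (canceling $d$ in the second term and the two 1s and rearranging terms) gives

$$C = Q\left[P_T(d) - \frac{1}{d}\int_0^d t\,p_T(t)\,dt\right],$$

where $p_T(t)$ is the pdf defined in Equation A3 and its cumulative distribution $P_T(t)$ may be computed through numerical integration. Note that this derivation is independent of the specific parameterization of saccade times. Any distributional form for saccade times can be plugged in to the equations for $p_T(t)$ and $P_T(t)$ to obtain an expression for predicted covariance. We used the lognormal distribution, which we observed to be a good description of the inter-saccade interval distribution (Figure 4A).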
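Because the predicted covariance depends on $p_T(t)$ only through $P_T(d)$ and $\int_0^d t\,p_T(t)\,dt$, it can be evaluated for any density by numerical integration. The sketch below uses the trapezoid rule and, for brevity, the $\lambda = 1$ case in which $p_T$ is a single lognormal. The function names, the step count, and the parameter values are assumptions for illustration.

```python
import math


def lognormal_pdf(t, mu, sigma):
    """Lognormal density f(t | mu, sigma); zero for t <= 0."""
    if t <= 0.0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))


def predicted_covariance(pdf, d, Q, n_steps=2000):
    """Q * [P_T(d) - (1/d) * integral_0^d t p_T(t) dt], trapezoid rule.

    `pdf` is any callable density for T (an assumption of this sketch:
    it must return a finite value at t = 0).
    """
    h = d / n_steps
    cdf = 0.0           # accumulates P_T(d)
    first_moment = 0.0  # accumulates the integral of t * p_T(t)
    prev_t, prev_p = 0.0, pdf(0.0)
    for i in range(1, n_steps + 1):
        t = i * h
        p = pdf(t)
        cdf += 0.5 * h * (prev_p + p)
        first_moment += 0.5 * h * (prev_t * prev_p + t * p)
        prev_t, prev_p = t, p
    return Q * (cdf - first_moment / d)


# Hypothetical parameters; lambda = 1, so p_T is a single lognormal.
mu, sigma, Q = -1.0, 0.5, 1.0
pdf = lambda t: lognormal_pdf(t, mu, sigma)
curve = {d: predicted_covariance(pdf, d, Q) for d in (0.5, 1.0, 2.0, 5.0, 30.0)}
for d, c in curve.items():
    print(d, round(c, 3))
```

In this sketch the predicted covariance rises toward $Q$ as the scramble duration grows, because a smaller fraction of each clip is spent searching.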

Our model assumed that the covariance was equal to the maximal covariance, $Q$, times 1 minus the fraction of time during which the eye movements were uncorrelated (Equation A9). Here, we provide mathematical intuition for why this relationship holds and show that it is a reasonable assumption for our data.

The covariance between $S^d$ and $S$ is computed as

$$C = \frac{1}{N} \sum_{k=1}^{N} \bigl(S^d(k) - \mu_{Sd}\bigr)\bigl(S(k) - \mu_S\bigr), \tag{A13}$$

where $\mu_{Sd}$ and $\mu_S$ are the sample means of $S^d$ and $S$, index $k$ indicates individual measurement samples, and $N$ is the total number of samples in the time courses (corresponding to a total time duration of $L$).

Suppose that the observer is not locked on for $N_0$ measurement samples (uncorrelated) and is locked on for $N - N_0$ samples (with maximal covariance). Additionally, assume that the individual samples of $S^d$ and $S$ in Equation A13 are independent and that the sample mean $\mu_{Sd}$ does not change as a function of $N_0$. It follows that for any value of $N_0$, the product $(S^d(k) - \mu_{Sd})(S(k) - \mu_S)$ summed over $N_0$ out of the $N$ terms will be approximately 0 (because $S^d$ and $S$ are uncorrelated for those terms), and the remaining $N - N_0$ samples will constitute $1 - N_0/N$ of the maximal covariance $Q$. Thus, for a finite sample,

$$C \approx Q\left(1 - \frac{N_0}{N}\right). \tag{A14}$$

Here, $N_0$ corresponds to the time $(L/d)\,\mathrm{E}(\tau)$ during which the observer is not locked on, and the total number of samples $N$ corresponds to the total time $L$. Thus, if Equation A14 holds for our data, it validates the assumption of the model as expressed in Equation A9.
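The scaling in Equation A14 can be checked with a small simulation. The sketch below is illustrative only: it assumes independent Gaussian samples, a unit-variance point of interest, and additive Gaussian noise while locked on. None of these values come from the data.

```python
import random


def sample_covariance(x, y):
    """Covariance of two equal-length sequences, normalized by N."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def simulate_covariance(n_total, n_unlocked, rng, noise_sd=0.5):
    """Covariance between a simulated S^d and S when the first
    n_unlocked samples of S^d are unrelated to S."""
    s = [rng.gauss(0.0, 1.0) for _ in range(n_total)]
    s_d = []
    for k, s_k in enumerate(s):
        noise = rng.gauss(0.0, noise_sd)
        if k < n_unlocked:
            # Not locked on: a position from the same distribution,
            # but drawn independently of the point of interest.
            s_d.append(rng.gauss(0.0, 1.0) + noise)
        else:
            # Locked on: point of interest plus noise.
            s_d.append(s_k + noise)
    return sample_covariance(s_d, s)


rng = random.Random(0)
N = 40000
Q_hat = simulate_covariance(N, 0, rng)       # always locked on
C_hat = simulate_covariance(N, N // 4, rng)  # unlocked for N0 = N/4 samples
print(Q_hat, C_hat, Q_hat * (1 - 0.25))
```

With these assumed settings, the covariance for $N_0 = N/4$ should fall near three quarters of the always-locked value, up to sampling error.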

For this derivation to hold, the sample mean of $S^d$ needs to be independent of $N_0$ (the number of samples for which $S^d$ and $S$ are uncorrelated). In fact, the variance in eye position was smaller for shorter scramble durations (Figures 3E and 3F). However, the sample mean of $S^d$ was approximately invariant (near the center of the screen) for unscrambled eye movement time courses, as assumed in the derivation of Equation A14.

The maximal eye position error is denoted $G_0(t)$. This value is expressed as $G_0(t) = \mathrm{E}[(S^d(t) - S_0(t))^2]$, where $t$ is time after a cut, $S^d(t)$ is unscrambled eye movements for scramble duration $d$, and $S_0(t)$ is a random point of interest from the same distribution as the actual point of interest $S(t)$ but not correlated with $S^d(t)$. We show here that $G_0(t)$ is equal to the summed variances of the two underlying variables, $S^d(t)$ and $S_0(t)$.

Assume that $S^d(t)$ and $S_0(t)$ are normally distributed at time $t$; $S^d(t)$ has variance $v_{Sd}(t)$ and mean $\mu_{Sd}$, and $S_0(t)$ has variance $v_S$ and mean $\mu_S$. Furthermore, $\mu_{Sd} = \mu_S$ for all time points $t$. Note that treating $v_S$ and $\mu_S$ as stationary with respect to $t$ is reasonable because $S_0(t)$ has the same mean and variance as the point of interest $S(t)$; we would not expect the statistics of $S$ to change as a function of time $t$ after a cut from the manipulations of the interleaved movie.

Let $S^{d\prime}(t) = S^d(t) - \mu_{Sd}$ and $S_0'(t) = S_0(t) - \mu_S$. Because $\mu_{Sd} = \mu_S$, we can substitute the variables in the expression of $G_0(t)$ with their mean-subtracted versions:

$$G_0(t) = \mathrm{E}\bigl[(S^{d\prime}(t) - S_0'(t))^2\bigr] = \mathrm{E}\bigl[S^{d\prime}(t)^2\bigr] + \mathrm{E}\bigl[S_0'(t)^2\bigr] - 2\,\mathrm{E}\bigl[S^{d\prime}(t)\,S_0'(t)\bigr].$$

The first two terms are the variances $v_{Sd}(t)$ and $v_S$, respectively. The cross term $2\,\mathrm{E}[S^{d\prime}(t)\,S_0'(t)] \approx 0$ because $S^{d\prime}(t)$ and $S_0'(t)$ are uncorrelated, zero mean, and normally distributed. Therefore,

$$G_0(t) = v_{Sd}(t) + v_S. \tag{A16}$$

$G_0(t)$ depends on the trajectory of the unscrambled eye position variance $v_{Sd}(t)$; if $v_{Sd}(t)$ were constant irrespective of time after a cut, then the maximal eye position error $G_0(t)$ would also be constant.
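The identity $G_0(t) = v_{Sd}(t) + v_S$ at a single time point can be illustrated with independent Gaussian draws that share a mean. The function name, standard deviations, and shared mean below are assumptions for the example, not measured values.

```python
import random


def max_position_error(sd_unscrambled, sd_poi, mean_pos, n, rng):
    """Monte Carlo estimate of E[(S^d - S_0)^2] for independent Gaussian
    S^d and S_0 that share the same mean."""
    total = 0.0
    for _ in range(n):
        s_d = rng.gauss(mean_pos, sd_unscrambled)  # unscrambled eye position
        s_0 = rng.gauss(mean_pos, sd_poi)          # random point of interest
        total += (s_d - s_0) ** 2
    return total / n


rng = random.Random(1)
# Hypothetical standard deviations of 2 and 3 (arbitrary units).
g0 = max_position_error(2.0, 3.0, 10.0, 50000, rng)
expected = 2.0 ** 2 + 3.0 ** 2  # sum of the two variances
print(g0, expected)
```

If the two means differed, the squared mean difference would add to the estimate, which is why the equal-means assumption matters for the derivation.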

$S^d(t)$ was computed by aligning the unscrambled eye movements for a particular scramble duration $d$ to each cut in that scramble duration. The variance of $S^d(t)$ computed separately for each $d$ yielded similar curves as a function of $t$. Therefore, at each time point $t$, $v_{Sd}(t)$ may be estimated using the variance of $S^d(t)$ across all clips ($n = 1344$ clips from all 5 scramble durations for $t$ = 0–0.5 s; $n = 624$ clips from the 4 longest scramble durations for $t$ = 0.5–1 s; and so on). We used the median eye movements across observers for the intact movie as an estimate for the point of interest $S(t)$. The variance of the median time course $S(t)$ across time provided an estimate for the variance $v_S$, which was equivalent to computing the variance across clips for each $t$ under the assumption that variance was stationary with respect to time. We verified in our data that our assumptions for the derivation were reasonable, i.e., that $S^d(t)$ (across clips) and $S(t)$ (across time) were well approximated as Gaussian and that $\mu_{Sd} \approx \mu_S$ for all $t$ (i.e., the mean eye position across all clips was the same for the unscrambled and median intact time courses and near the center of the screen). Furthermore, simulations of $G_0(t)$, computed with randomly permuted values of $S(t)$ as $S_0(t)$, yielded values close to $v_{Sd}(t) + v_S$, as predicted by the derivation.

The fractional explained variance was computed as $1 - G(t)/G_0(t)$, where $G(t)$ was the measured position error (see Eye position error, variance in eye position, and fractional explained variance section) and $G_0(t)$ was the maximal position error (computed using Equation A16). Here, we show that this quantity can be thought of as an approximate empirical estimate for the fraction of the time the observer was locked on to the actual point of interest (rather than at a position that was random with respect to it) as a function of time. Suppose that at each time point $t$, the observer has a probability $P_E(t)$ of being locked on to the point of interest. When the observer is locked on, $\mathrm{E}[(S^d(t) - S(t))^2] \approx 0$. When the observer is not locked on (with probability $1 - P_E(t)$), $S(t)$ will be random with respect to $S^d(t)$, so $\mathrm{E}[(S^d(t) - S(t))^2] \approx \mathrm{E}[(S^d(t) - S_0(t))^2]$. Therefore, at any time point $t$,

$$G(t) \approx \bigl(1 - P_E(t)\bigr)\,G_0(t),$$

where $G_0(t) = \mathrm{E}[(S^d(t) - S_0(t))^2]$. It follows that $1 - G(t)/G_0(t) \approx 1 - (1 - P_E(t)) = P_E(t)$. Consequently, the quantity $1 - G(t)/G_0(t)$ approximates $P_E(t)$ and corresponds to the probability (across clips) at time $t$ after a cut that the unscrambled time course was locked on to the point of interest (median intact eye position across observers).
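The relationship $1 - G(t)/G_0(t) \approx P_E(t)$ can be illustrated at a single time point with a toy simulation across clips. The sketch assumes that a locked-on eye position coincides exactly with the point of interest and that an unlocked position is an independent draw from the same unit-variance Gaussian. Both are simplifying assumptions for the example.

```python
import random
from statistics import pvariance


def fractional_explained_variance(p_locked, n_clips, rng):
    """Estimate 1 - G/G_0 at one time point across simulated clips."""
    s = [rng.gauss(0.0, 1.0) for _ in range(n_clips)]  # point of interest
    # Locked on with probability p_locked; otherwise an unrelated position.
    s_d = [s_k if rng.random() < p_locked else rng.gauss(0.0, 1.0) for s_k in s]
    g = sum((a - b) ** 2 for a, b in zip(s_d, s)) / n_clips  # position error
    g0 = pvariance(s_d) + pvariance(s)  # maximal error: v_Sd + v_S
    return 1.0 - g / g0


rng = random.Random(3)
estimate = fractional_explained_variance(0.7, 50000, rng)
print(estimate)  # should be close to the assumed lock-on probability, 0.7
```

Adding tracking noise to the locked-on samples would make the estimate fall somewhat below the assumed probability, consistent with the approximation signs in the derivation.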

We simulated the position error $G(t)$, for $t$ up to 5 s, by generating artificial epochs of an unscrambled eye movement time course, $S^d(t)$, and comparing these epochs of $S^d(t)$ to the corresponding portions of the median eye movement time course, $S(t)$. All samples of simulated $S^d(t)$ were drawn from a measured eye movement time course for the intact movie (out of two repeats per observer). To simulate the fact that each epoch of $S^d(t)$ contained samples that were uncorrelated with $S(t)$ right after a cut, we determined a random time $\Delta$ in each epoch after which the observer was presumed to lock on. Specifically, for $\Delta < t \le 5$, samples of $S^d(t)$ corresponded to the same segment of the movie as those in the median time course $S(t)$, such that the covariance between $S^d(t)$ and $S(t)$ was maximal (as determined by that observer). For $t \le \Delta$, samples of $S^d(t)$ were set to those from a random portion of the intact time course, such that $S^d(t)$ still contained actual positions on the screen but was unrelated to $S(t)$. The value of $\Delta$ was determined by the model. Specifically, it was drawn according to the distribution of a random variable that described the time during which an observer was not locked on to the point of interest for a clip ($\tau$ in Equation A6). This random variable was determined using the fit parameter $\lambda$ and lognormal parameters of saccade latencies for that observer (Equation A3), subject to the constraint $\Delta \le 5$ s ($d = 5$ in Equation A6). The maximal position error $G_0(t)$ was computed using Equation A16, in which $v_{Sd}(t)$ was the variance of the simulated $S^d(t)$ (or the variance of the intact time courses), which was constant over time. We then computed the fractional explained variance $1 - G(t)/G_0(t)$. The simulation was performed independently 1000 times for each observer, and the value $1 - G(t)/G_0(t)$, averaged across simulations, yielded the model's prediction for the fractional explained variance (Figure 6C, inset).
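One way to draw the lock-on time $\Delta$ is sketched below. Note that this sketch swaps in a different technique from the one described above: it simulates the saccade sequence directly, summing lognormal inter-saccade intervals and testing for lock-on with probability $\lambda$ after each saccade, rather than sampling from the Fenton–Wilkinson approximation in Equation A3. The function name and all parameter values are assumptions.

```python
import math
import random


def sample_unlocked_time(mu, sigma, lam, d, rng):
    """Draw one value of tau: the time after a cut before lock-on,
    capped at the clip duration d."""
    t = 0.0
    while True:
        t += rng.lognormvariate(mu, sigma)  # time of the next saccade
        if t >= d:
            return d  # clip ended before the point of interest was found
        if rng.random() < lam:
            return t  # point of interest found on this fixation


rng = random.Random(2)
mu, sigma, d = -1.0, 0.5, 5.0  # hypothetical values; d = 5 s as in the text

# lam = 1: lock-on at the first saccade, so Delta follows f(t | mu, sigma).
deltas_first = [sample_unlocked_time(mu, sigma, 1.0, d, rng) for _ in range(20000)]
# Smaller lam: more saccades are needed, so Delta tends to be longer.
deltas_search = [sample_unlocked_time(mu, sigma, 0.3, d, rng) for _ in range(20000)]

mean_first = sum(deltas_first) / len(deltas_first)
mean_search = sum(deltas_search) / len(deltas_search)
print(mean_first, math.exp(mu + 0.5 * sigma ** 2), mean_search)
```

Each sampled value could then serve as the boundary between the uncorrelated and locked-on portions of a simulated epoch; averaging over many epochs gives the predicted time course of the fractional explained variance.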