Eye movements are an important data source in vision science. However, the vast majority of eye movement studies ignore sequential information in the data and utilize only first-order statistics. Here, we present a novel application of a temporal-difference learning algorithm to construct a scanpath successor representation (SR; P. Dayan, 1993) that captures statistical regularities in temporally extended eye movement sequences. We demonstrate the effectiveness of the scanpath SR on eye movement data from participants solving items from Raven's Advanced Progressive Matrices Test. Analysis of the SRs revealed individual differences in scanning patterns captured by two principal components that predicted individual Raven scores much better than existing methods. These scanpath SR components were highly interpretable and provided new insight into the role of strategic processing on the Raven test. The success of the scanpath SR in terms of prediction and interpretability suggests that this method could prove useful in a much broader context.

*scanpaths*; Stark & Ellis, 1981) often contain valuable information about underlying cognitive processes, they are difficult to quantify and interpret, and this has traditionally prevented eye-tracking researchers from including them in their analyses.

^{5}) scanpaths of length 6. The challenge is to tame this combinatorial explosion without losing the sequential information in the process. Existing methods for doing so fall into two broad classes. One approach represents scanpaths as strings of letters and uses string-editing distance—the number of insertions and deletions necessary to turn one sequence of letters into another—as a dissimilarity metric (e.g., Brandt & Stark, 1997; Myers & Schoelles, 2005). Yet, string-editing measures have a number of limitations, the most critical being that they are best suited for comparing short sequences of similar length, making it difficult to infer cognitive states or strategies in temporally extended tasks or to compare across participants or trials that differ in duration.
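A string-editing distance of the kind these methods use can be sketched as a standard dynamic program. This is an illustrative implementation only; the function name and the letter coding of AOIs are our own conventions, not part of the methods cited above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two scanpath strings (a sketch).

    Scanpaths are coded as letter strings such as "ABCA"; the distance is
    the minimum number of single-letter insertions, deletions, and
    substitutions needed to turn one string into the other.
    """
    # Dynamic programming over prefixes: prev[j] holds the distance
    # between the current prefix of `a` and the first j letters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]
```

Note how the metric degrades for sequences of very different lengths: the distance is bounded below by the length difference, so a short and a long scanpath are always "far apart" regardless of their structure.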

*successor representation* (SR; Dayan, 1993; White, 1995) of an eye movement sequence that keeps the simplicity of the fixed-size transition matrix and extends the event horizon. The key idea is that upon observing a transition from one AOI to another, instead of simply updating the transition probability from the first to the second AOI, we associate the first AOI with the second AOI *and* all expected subsequent AOIs based on prior visits to the second AOI. This is equivalent to learning to predict future scanpaths based on past scanpaths. After traversing the entire fixation sequence for a trial, the resulting SR can be conceptualized as having extracted the statistical regularities in temporally extended scanpaths, collapsing the information into a fixed-size matrix. Specifically, an SR matrix contains, for each AOI, the temporally discounted number of expected future fixations to all AOIs (Dayan, 1993). Given their uniform size, the SR matrices from different observers and/or trials can be analyzed using standard statistical methods to identify significant regularities for various comparisons of interest. The new method is thus very well suited for exploratory data analysis.

*constructive matching*, the participant tries to formulate the missing element based exclusively on matrix information and then looks for that element in the response area. In *response elimination*, each alternative is inspected in turn and evaluated for whether it fits into the empty matrix slot. The former strategy tends to occur in high-scoring individuals and/or on easier problems, the latter in low-scoring individuals and/or on difficult problems (Bethell-Fox, Lohman, & Snow, 1984; Vigneau, Caissie, & Bors, 2006). We will show that the scanpath SR identifies the degree to which participants apply each of these two strategies. This can then be used to predict individual scores on the Raven task.

^{1} (*F*(2,32) = 0.15, *p* = 0.86). Therefore, we analyzed the test data of all 35 participants together. The paper-and-pencil data and the verbal protocols are beyond the scope of this article.

*R*) covered the entire response area so that the spatial layout of the answers could not be used to predict the participants' scores.

**M**. The matrix is initialized with zeros and then updated for each transition in the sequence. Consider a transition from state *i* to state *j*. The *i*th column of the matrix—the column corresponding to the “sender” AOI—is updated according to

*M*_{ i } ← *M*_{ i } + *α* (*I*_{ j } + *γ* *M*_{ j } − *M*_{ i }),  (1)

where **I** is the identity matrix, each subscript picks a column in a matrix, *α* is a learning rate parameter (0 < *α* < 1), and *γ* is a temporal discount factor (0 < *γ* < 1). In words, upon observing a transition *i* → *j*, the set of expected successors (*M*_{ i }) for the sender *i* is updated to include the receiver *j* (represented as a unit column vector *I*_{ j }) and the predicted set of successors (*M*_{ j }) for the new location *j*, discounted by *γ*. The latter term is the key to extending the event horizon to encompass both immediate and long-range transitions—it includes the discounted future states in the prediction from the current state. For example, suppose a participant scans the top row of a Raven problem systematically from left to right: 1→2→3→1→2…. Then, the successors of location 1 will include *both* location 2 and, weighted by *γ*, location 3. By contrast, a first-order transition matrix would include only the association between 1 and 2. After traversing the whole scanpath, the estimated SR matrix approximates the ideal SR matrix, which contains the temporally discounted number of expected future fixations on all AOIs (rows), given the participant just fixated on any individual AOI (column). Note that the entries in the SR matrix are not probabilities. They are (discounted, expected) *numbers of visits*, and thus, the sum across each column of the ideal SR matrix equals 1/(1 − *γ*).
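The update in Equation 1 is compact enough to sketch directly in code. In this illustration the function name and the 0-based AOI indexing are our own conventions; the default parameter values are the best-fitting values reported later in the text.

```python
import numpy as np

def scanpath_sr(fixations, n_aois, alpha=0.233, gamma=0.255):
    """TD learning of a scanpath successor representation (Equation 1).

    fixations: sequence of 0-based AOI indices for one trial.
    alpha, gamma: learning rate and temporal discount factor; the
    defaults are the best-fitting values reported in the text.
    """
    M = np.zeros((n_aois, n_aois))  # initialized with zeros
    I = np.eye(n_aois)
    # Walk through successive transitions i -> j in the scanpath.
    for i, j in zip(fixations, fixations[1:]):
        # Move the "sender" column i toward the unit vector for j plus
        # the discounted successors already predicted for location j.
        M[:, i] += alpha * (I[:, j] + gamma * M[:, j] - M[:, i])
    return M
```

Running this on the systematic scan 1→2→3→1→2… from the example above (AOIs 0, 1, 2 in 0-based indexing) yields a matrix in which location 3 appears as a discounted successor of location 1, exactly the long-range association a first-order transition matrix would miss.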

*α* and *γ*, the algorithm produced one 10 × 10 SR matrix per participant per trial. Averaging across the 28 trials for each participant, we were left with 35 individual matrices. Each matrix summarized the eye fixation patterns of the corresponding participant. To reduce the dimensionality of the space, we performed a principal component analysis (PCA; Everitt & Dunn, 2001) of the successor representations. Each SR matrix was reshaped to a vector of 100 features. The whole data set thus occupied a matrix of size 35 × 100. Following standard PCA practice, we rescaled each feature (column) to zero mean and unit variance across the 35 participants. The first 20 principal components retained over 90% of the variance in the SR data. Conceptually, these components represent dimensions of individual differences in fixation patterns. Mathematically, they are orthogonal basis vectors in the 100-dimensional SR space. Each participant was characterized by 20 projections onto this rotated basis.
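The reshape–standardize–project pipeline can be sketched with a plain SVD-based PCA. This is an illustrative reconstruction under the assumptions stated in the text (trial-averaged 10 × 10 matrices, z-scored features); the function name and shapes are ours.

```python
import numpy as np

def sr_principal_components(sr_matrices, n_components=20):
    """PCA of per-participant SR matrices (a sketch).

    sr_matrices: array of shape (n_participants, 10, 10), one
    trial-averaged SR matrix per participant. Assumes no feature is
    constant across participants (otherwise z-scoring divides by zero).
    """
    # Reshape each 10 x 10 matrix into a 100-feature vector.
    X = sr_matrices.reshape(len(sr_matrices), -1)
    # Rescale each feature to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # PCA via SVD of the standardized data matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    projections = U[:, :n_components] * s[:n_components]  # per-participant scores
    components = Vt[:n_components]                        # orthogonal basis vectors
    explained = s**2 / np.sum(s**2)                       # variance fractions
    return projections, components, explained
```

The rows of `components` are the orthogonal basis vectors in the 100-dimensional SR space; each participant's 20 `projections` onto this rotated basis are the features carried forward to the regression analyses.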

*M* = 21.9, *SD* = 3.7).^{2} We performed a hierarchical linear regression to assess how much of this variance can be explained on the basis of the SR principal component projections. Two of the projections correlated very strongly with the scores, whereas the third best predictor was not statistically significant. Therefore, we used two predictors in all regressions. We implemented a two-tier algorithm to maximize the fit to the Raven scores. In the inner loop, it calculated the SR matrices for given parameters *α* and *γ* (Equation 1), then calculated the first 20 principal components and the corresponding projections for each participant, picked the two projections that correlated most strongly with the scores, and constructed a linear regression model with these two predictors. In the outer loop, a Nelder–Mead optimization routine searched for the *α* and *γ* that maximized the multiple regression coefficient of the inner loop model. The best fit (*R*^{2} = 0.56) was achieved with learning rate *α** = 0.233 and discount factor *γ** = 0.255. Figure 2d reports this optimal model. To our knowledge, this is the most accurate prediction of Raven scores based on eye-tracking data reported to date.
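The inner loop of this two-tier procedure can be sketched as follows. This is a simplified illustration: the helper name is ours, and the outer Nelder–Mead search over *α* and *γ* (e.g., via `scipy.optimize.minimize` with `method="Nelder-Mead"`) is omitted.

```python
import numpy as np

def inner_loop_r2(projections, scores, n_predictors=2):
    """Inner loop of the two-tier fitting algorithm (a sketch).

    Picks the n_predictors PCA projections that correlate most strongly
    with the Raven scores and returns the R^2 of the resulting ordinary
    least-squares fit.
    """
    # Rank projections by squared correlation with the scores.
    r = np.array([np.corrcoef(p, scores)[0, 1] for p in projections.T])
    best = np.argsort(-r**2)[:n_predictors]
    # Ordinary least squares with an intercept term.
    X = np.column_stack([np.ones(len(scores)), projections[:, best]])
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    resid = scores - X @ beta
    return 1 - resid.var() / scores.var()
```

The outer loop simply treats this returned *R*^{2} (negated) as the objective function of the (*α*, *γ*) search.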

*α** and *γ**. Figure 3c plots the average of these matrices. It represents a “pseudo-observer” who consistently follows the systematic strategy on all trials. The diagonal box structure is clearly visible (cf. Figure 2a). Note that the cells along the main diagonal have positive values even though the fixation sequences contained no transitions from any AOI directly back to itself. This illustrates an important difference between the successor representation and a transition probability matrix. Despite the absence of immediate repetitions in the sequence, there are plenty of round-trip scanpaths, which give rise to the positive SR values along the diagonal. We also generated 28 sequences of length 150 according to the toggling strategy. They contained multiple transitions to and from the response area (Figure 3b). The corresponding trial-averaged SR matrix (Figure 3d) has high values along the bottom and right edges, corresponding to scanpaths *ending in* and *starting from* *R*, respectively. Figures 3f and 3g plot the deviations from the grand mean (Figure 3e). This approximates the PCA algorithm, which reorganizes the *variance* of the individual feature vectors. As our simplified illustration has only two cases, both patterns merge into a single “pseudo-component” that merely changes sign. The behavioral data set contained 35 cases whose SR matrices mixed the systematic pattern with the toggling pattern (and other idiosyncratic patterns) in different proportions. The SR projections quantify the degree to which these two strategies are expressed in the scanpaths of each individual participant. The systematic projection was positively correlated with the Raven scores, whereas the toggling projection was negatively correlated.
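The claim that round trips alone produce a positive diagonal is easy to verify numerically. Below is a minimal sketch using the TD update of Equation 1 on a hypothetical alternating scanpath; the sequence and parameter values are illustrative, not taken from the behavioral data.

```python
import numpy as np

# A toy alternating scanpath 1 -> 2 -> 1 -> 2 ... (0-based AOI indices
# 0 and 1) that contains no immediate self-transitions.
path = [0, 1] * 20
alpha, gamma, n = 0.25, 0.25, 2

M, I = np.zeros((n, n)), np.eye(n)
for i, j in zip(path, path[1:]):
    # Equation 1: update the sender column toward the receiver's unit
    # vector plus the receiver's discounted successors.
    M[:, i] += alpha * (I[:, j] + gamma * M[:, j] - M[:, i])

# Round trips alone make the diagonal positive, even though the
# sequence never transitions from an AOI directly back to itself.
assert M[0, 0] > 0 and M[1, 1] > 0
```

A first-order transition probability matrix estimated from the same sequence would have an exactly zero diagonal.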

*R*^{2} = 0.51 (corrected down to 0.48) for predicting Raven scores with a linear combination of the matrix time distribution index (defined in the Methods section), the number of toggles on easy items, and the latency on easy items. When applied to our data, however, these variables achieved a much lower uncorrected *R*^{2} = 0.16 (Table 1). The most that can be achieved with linear regression on any 3 dwell time predictors on our data is *R*^{2} = 0.21 (Table 1, bottom row).

| | *R*^{2} | *R*_{ cv }^{2} |
|---|---|---|
| Successor representation (with PCA) | 0.56 | 0.41 |
| Variables used by Vigneau et al. (2006) | | |
| Proportional time on matrix (PTM) | 0.17 | 0.09 |
| Latency to first toggle (FT) | 0.02 | 0.01 |
| Latency on easy items (LEz) | 0.11 | 0.04 |
| Number of toggles on easy items (NT) | 0.01 | 0.00 |
| Toggle rate on easy items (TR) | 0.12 | 0.04 |
| Matrix time distribution index (MTDI) | 0.02 | 0.01 |
| Vigneau et al.'s model (MTDI + NT + LEz) | 0.16 | 0.03 |
| Best traditional model (PTM + TR + LEz) | 0.21 | 0.09 |
| Transition probability matrices (with PCA) | | |
| First-order transitions, 2 components | 0.29 | 0.01 |
| First-order transitions, 4 components | 0.51 | 0.07 |
| Second-order transitions, 2 components | 0.42 | 0.19 |
| Second-order transitions, 4 components | 0.57 | 0.26 |

*training set* of 34 participants and a *test set* of 1 participant. We ran our two-tier algorithm on the training set. The parameters *α* and *γ* optimized on the training set were then used to calculate the SR matrix for the fixation sequences in the test set. Finally, we calculated the model's prediction of the test Raven score by multiplying the test SR matrix by the weight matrix estimated from the training set. We repeated this process 35 times, testing on the data from each participant in turn. This produced 35 predicted scores, each one based on a model that had no access to the data that were subsequently used to test it. The squared correlation between these cross-validated predictions and the observed scores was *R*_{ cv }^{2} = 0.41. This is a much better estimate of generalization performance than the goodness-of-fit *R*^{2} on the training set (Haykin, 2009). The latter is inflated because it reflects not only the genuine regularities in the population, which will generalize to new cases, but also the idiosyncrasies of the training sample, which will not. This explains the drop from *R*^{2} = 0.56 to *R*_{ cv }^{2} = 0.41. Note that this is still very respectable *cross-validated* performance, which sets a new benchmark for Raven score prediction. For comparison, the corresponding values for the best model based on dwell time variables were *R*^{2} = 0.21 and *R*_{ cv }^{2} = 0.09 (Table 1). This suggests that the SR algorithm can extract reliable regularities from the data much better than traditional dwell time methods. The SR advantage comes from the sequential information in scanpaths and from the data-smoothing properties of the temporal-difference learning algorithm.
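The leave-one-out scheme can be sketched as follows. This simplified version refits only the regression weights inside the loop; in the full procedure described above, the parameters *α*, *γ* and the PCA basis are also re-estimated from each training set. The function name is ours.

```python
import numpy as np

def loo_cv_r2(features, scores):
    """Leave-one-out cross-validation of a linear model (a sketch).

    Each participant is held out in turn; regression weights are
    estimated from the remaining participants only. Returns the squared
    correlation between held-out predictions and observed scores.
    """
    n = len(scores)
    preds = np.empty(n)
    for k in range(n):
        train = np.arange(n) != k  # hold out participant k
        X = np.column_stack([np.ones(train.sum()), features[train]])
        beta, *_ = np.linalg.lstsq(X, scores[train], rcond=None)
        # Predict the held-out participant from weights they never influenced.
        preds[k] = np.concatenate(([1.0], features[k])) @ beta
    return np.corrcoef(preds, scores)[0, 1] ** 2
```

Because each prediction comes from a model that never saw the held-out case, the resulting *R*_{ cv }^{2} estimates generalization to new participants rather than fit to the sample.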

*α* = 0.236 (*SD* = 0.02), mean *γ* = 0.259 (*SD* = 0.05). The stability of the temporal discount factor *γ* suggests that the scanpath patterns have regularities with a characteristic time scale.

*R*^{2} = 0.29 on the full training set but did not cross-validate (*R*_{ cv }^{2} = 0.01). Adding variables to the regression model improved the fit only marginally (e.g., *R*_{ cv }^{2} = 0.07 with 4 components). This suggests that first-order transition matrices are too myopic to support robust prediction of Raven scores. It also demonstrates that the excellent performance of the SR method cannot be attributed to the PCA-based dimensionality reduction algorithm.

*R*_{ cv }^{2} = 0.26) was achieved with 4 predictor variables. While quite respectable and much better than the *R*_{ cv }^{2} achievable with traditional measures, this falls far short of the SR-based prediction. Moreover, unlike the SR-based components (Figure 2), the second-order components were extremely hard to visualize and interpret.

*γ* parameter. The second conclusion is that the probability estimates need to be smoothed. There are not enough data to populate the matrices by simple counting, particularly in the second-order case. This scarcity of data (rather than computational constraints) appears to be the limiting factor in scanpath analysis in general. The SR learning algorithm (Equation 1) updates a whole column of the matrix after each transition, thereby smoothing the estimates. Stated differently, each cell in the SR matrix aggregates a whole class of observations. For example, cell (1, 1) would be updated after observing any of the following subsequences: 121, 131, …, 1R1; 1231, 1241, …. This reuses the data and reduces the variance of the estimates. This smoothing effect contributed to the stability of the SR components during leave-one-out cross-validation. By contrast, the first-order probability estimates were apparently noisier, and the PCA solution was unstable even though it involved matrices of the same shape estimated from the same data.

*fundamental matrix* in the theory of Markov chains (Kemeny & Snell, 1976). More recently, Gershman, Moore, Todd, Norman, and Sederberg (under revision) identified a formal connection between the SR and an influential model of episodic and semantic memory, the Temporal Context Model (e.g., Howard & Kahana, 2002; Sederberg, Howard, & Kahana, 2008).

*i*, our version does not include this same visit in the total (temporally discounted) number of visits to *i*. Assuming a first-order Markov chain with transition probability matrix **T**, our SR matrix **M** is based on the power series: **M** = **T**(**I** + γ**T** + γ^{2}**T**^{2} + …) = **T**(**I** − γ**T**)^{−1}. To revert to the standard formulation of the SR learning algorithm, the term *I*_{ j } in Equation 1 must be replaced by *I*_{ i }. In the special case when *γ* = 0, our algorithm tracks the transition matrix **T** instead of the identity matrix **I**.
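This closed form can be checked numerically. The sketch below uses the expected TD update (Equation 1 averaged over transitions from a column-stochastic chain) rather than sampled fixations, so that convergence to the fixed point is deterministic; the specific chain is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, alpha = 3, 0.3, 0.1

# Random column-stochastic transition matrix: T[j, i] = P(j | i).
T = rng.random((n, n))
T /= T.sum(axis=0)

# Closed form: M = T (I - gamma T)^(-1).
M_ideal = T @ np.linalg.inv(np.eye(n) - gamma * T)

# Expected TD update: every column moves toward T + gamma * M T,
# which is Equation 1 averaged over receivers j drawn from T.
M = np.zeros((n, n))
for _ in range(500):
    M += alpha * (T + gamma * M @ T - M)

assert np.max(np.abs(M - M_ideal)) < 1e-8
# Each column of the ideal SR sums to 1 / (1 - gamma).
assert np.allclose(M_ideal.sum(axis=0), 1 / (1 - gamma))
```

The final assertion also confirms the column-sum property stated earlier: since every power of a column-stochastic matrix has unit column sums, the discounted series sums to 1/(1 − γ) in every column.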

**M** (White, 1995) is a direct application of more general convergence proofs about TD(*λ*) learning in the reinforcement learning literature (Dayan, 1992; Jaakkola, Jordan, & Singh, 1994; Sutton, 1988). To ensure convergence, it is necessary to decrease the learning rate *α* as the data accumulate. The technical conditions include Σ *α*_{ n } = ∞ and Σ *α*_{ n }^{2} < ∞, where *n* is the number of observations (Dayan & Sejnowski, 1993, cited in White, 1995).

*α* for all sequences regardless of length. It would be interesting to explore parameterizations that reduce the effective learning rate for longer sequences. The clipping of sequences longer than 100 fixations (described in the Methods section) is a crude way of regularizing the sequence length. Our present results indicate that, even with a fixed learning rate, the learning algorithm can accommodate substantial variability in length. As mentioned earlier, this is a major advantage over string-editing methods for comparing scanpaths. Varying the learning rate as a function of sequence length could provide additional robustness and reduce the variance of the estimates. This is a promising topic for future research.

*t*(34) = 3.48, *p* < 0.001), replicating published results (Bors & Vigneau, 2003; Denney & Heidrich, 1990).

Dayan, P. (1992). The convergence of TD(*λ*) for general *λ*. Machine Learning, 8, 341–362.