Open Access
Article  |   November 2024
Modeling the dynamics of contextual cueing effect by reinforcement learning
Author Affiliations
  • Yasuhiro Hatori
    Research Institute of Electrical Communication, Tohoku University, Sendai, Japan
    National Institute of Occupational Safety and Health, Japan, Tokyo, Japan
    hatori@s.jniosh.johas.go.jp
  • Zheng-Xiong Yuan
    Research Institute of Electrical Communication, Tohoku University, Sendai, Japan
    lvp0526@gmail.com
  • Chia-Huei Tseng
    Research Institute of Electrical Communication, Tohoku University, Sendai, Japan
    CH_Tseng@alumni.uci.edu
  • Ichiro Kuriki
    Research Institute of Electrical Communication, Tohoku University, Sendai, Japan
    Graduate School of Science and Engineering, Saitama University, Saitama, Japan
    ikuriki@mail.saitama-u.ac.jp
  • Satoshi Shioiri
    Research Institute of Electrical Communication, Tohoku University, Sendai, Japan
    shioiri@riec.tohoku.ac.jp
Journal of Vision November 2024, Vol.24, 11. doi:https://doi.org/10.1167/jov.24.12.11
Abstract

Humans use environmental context to facilitate object search. The benefit of context for visual search requires learning. Modeling how context is learned for efficient processing is vital to understanding visual function in everyday environments. We proposed a model that accounts for the contextual cueing effect, which refers to the learning of scene context that identifies the location of a target item. The model extracted the global feature of a scene and gradually strengthened the association between the global feature and the target location with repeated observations. We compared the model with human performance in two visual search experiments (letter arrangements on a gray background or on a natural scene background). The proposed model successfully simulated the faster reduction in the number of saccades required before target detection for the natural scene background compared with the uniform gray background. We further tested whether the model replicated known characteristics of the contextual cueing effect in terms of local learning around the target, the effect of the ratio of repeated to novel stimuli, and the superiority of natural scenes.

Introduction
Searching for objects in a busy visual scene is an important daily task, in which knowledge of the environment is used for efficient attentional selection. In the real world, it is difficult to find a situation in which humans visually perceive objects without context. Appropriate contexts facilitate visual processing, such as detecting objects, whereas inappropriate contexts impair it (Biederman, 1972; Davenport & Potter, 2004; Hayhoe & Ballard, 2005; Shioiri & Ikeda, 1989). For example, a coffee cup is usually placed on a desk and not on the floor; therefore, one may attend to the top of a desk when searching for a cup. These benefits likely come from strengthening the association between the target location and the environment in the internal representation through experience.
It is well documented that context information facilitates visual search when it is encountered repeatedly. Chun and Jiang (1998) had participants search for a target letter, T, embedded among letter distractors (Ls) and found that search improved through repeated exposure to the same layouts even though participants were not consciously aware of this acquisition. This improvement is referred to as the contextual cueing effect (Chun & Jiang, 1998). The contextual cueing effect has been investigated for various types of context (Sisk, Remington, & Jiang, 2019), including object shape (Endo & Takeda, 2004), background color and texture (Kunar, Flusberg, & Wolfe, 2006), three-dimensional layouts (Kawahara, 2003; Shioiri, Kobayashi, Matsumiya, & Kuriki, 2018; Tsuchiai, Matsumiya, Kuriki, & Shioiri, 2012; Zang, Shi, Müller, & Conci, 2017), exposure duration (Ogawa & Kumada, 2008), and naturalistic backgrounds (Brockmole, Castelhano, & Henderson, 2006; Brockmole & Henderson, 2006; Rosenbaum & Jiang, 2013). With naturalistic backgrounds, the contextual cueing effect has different characteristics: the effect is larger than with randomly distributed letters on homogeneous backgrounds, and participants tend to be aware of the repetition, suggesting that the learning is partly explicit (Brockmole & Henderson, 2006; Brockmole et al., 2006; Rosenbaum & Jiang, 2013).
The learning process has also been investigated by modeling the effect (e.g., Backhause, Heinke, & Humphreys, 2005; Brady & Chun, 2007). Brady and Chun (2007) proposed a two-layer neural network to simulate the contextual cueing effect. During learning, the model gradually strengthened the connections between the positions of the target and each of the distractors (i.e., element-wise distractor–target association). Exposure to repeated layouts led the model to detect a target efficiently for repeated stimuli. Beesley, Vadillo, Pearson, and Shanks (2016) reported the role of the configuration of distractors in contextual cueing as a framework competing with element-wise learning. They created stimuli from combinations of subpatterns in which each subpattern was associated with the same target location. A larger contextual cueing effect emerged for subpatterns combined consistently across blocks (their relative locations were the same) than for those combined inconsistently (their relative locations varied across blocks). This result could not be explained by element-wise learning, because each subpattern was associated with the same target location and should thus yield the same effect for consistently and inconsistently combined subpatterns. Based on such results, Beesley and colleagues (2016) proposed a model that memorized global configurations of letter arrangements and associated each configuration with a target position. This configuration model explained the contextual cueing effect for consistently repeated configurations.
Although the previous models explained contextual cueing for repeated letter arrangements on a uniform background (Backhause et al., 2005; Beesley et al., 2016; Brady & Chun, 2007), they used letter positions as given information. However, it is not usually possible to express image contexts as a set of item positions, especially in natural scenes. Many objects of different sizes and shapes exist in natural scenes, and it is impossible to identify contexts only from the positions of several objects. It should also be noted that the contextual cueing effect requires a relatively short time to appear. Electroencephalography studies have reported that the difference in brain activity for repeated and non-repeated stimuli emerges within 100 ms after stimulus presentation (Chaumon, Drouet, & Tallon-Baudry, 2008; Chaumon, Hasboun, Baulac, Adam, & Tallon-Baudry, 2009; Zinchenko, Conci, Töllner, Müller, & Geyer, 2020). Behavioral studies have also shown the effect at similarly short durations (Kobayashi & Ogawa, 2020; Peterson & Kramer, 2001). A processing time of 100 ms is likely not long enough to analyze the details of a scene; therefore, a biologically plausible model of the contextual cueing effect must incorporate a mechanism to extract context from natural scenes within a time as short as 100 ms.
One study has proposed a model that associates natural scenes with items of interest. Torralba, Oliva, Castelhano, and Henderson (2006) proposed a model that guided visual attention in natural scenes for a specific task (e.g., finding a person) without contextual cueing, using a scene dataset that included target (people) locations. Their model applied spatial filtering to a natural scene to obtain global features of the scene (i.e., the spatial envelope) (Oliva & Torralba, 2001). Global features are obtained by pooling the outputs of multiscale oriented filters, which represent the selectivity of neurons in the primary visual cortex (Hubel & Wiesel, 1968; Jones & Palmer, 1987). Human psychophysical studies also support the idea of neural processes tuned for orientation and scale (Blakemore & Campbell, 1969; Pantle & Sekuler, 1968). Spatial pooling has been widely reported to explain psychophysical and physiological findings (e.g., Groen, Ghebreab, Prins, Lamme, & Scholte, 2013; Parkes, Lund, Angelucci, Solomon, & Morgan, 2001; Sakai & Tanaka, 2000). Based on these findings, we assume that spatial pooling of orientation information takes place and that global features can be generated in the brain. Because global features can be computed without access to details such as segmentation and object identification yet can still be used to classify scene categories (Oliva & Torralba, 2001), the set of global features is an appropriate index of scene context for studies of the contextual cueing effect. The model first extracts a set of global features to represent the scene and then learns the link between scenes (identified by their global features) and the target location, expressed as a spatial map of probabilities of target presence. Probability maps for predicting a target are learned for each scene and used as prior knowledge for visual search tasks after the scene has been identified. The model was built to predict a target location from existing knowledge, so learning processes were not included. The development of knowledge (i.e., its improvement over time) is an essential factor in understanding human learning, because humans use partially learned information in everyday conditions. A model that includes a learning process may provide a detailed understanding of how much information has been learned at a given point of stimulus repetition.
The present study aimed to build a model of visual and memory processes that accounts for the contextual cueing effect with both random letter arrangements and natural scene backgrounds. Visual search depends on low-level visual attributes such as visual salience (Itti, Koch, & Niebur, 1998) and on scene knowledge or memory (Torralba et al., 2006). We hypothesized that visual salience is modulated by the probability of the target location estimated from the global feature of a scene. If the association between global features and target position is learned through repeated exposure to stimuli and target locations, visual search becomes faster with repetition, as the contextual cueing effect shows. The crucial feature of the model is the learning process for the association between global features and target position. In the present study, we investigated the time course of learning in addition to the learning effect itself to examine the dynamic process (Bergmann, Koch, & Schubö, 2019; Bergmann, Tünnermann, & Schubö, 2020; Chun & Jiang, 2003; Tseng & Lleras, 2013). The contextual cueing effect grows gradually, as shown by the progressive shortening of reaction time, indicating that the effect accumulates through repeated observation of scenes. In everyday life, there are perhaps many occasions in which people benefit from such partially learned knowledge. We focused on learning speed as a parameter of the present model to control this accumulation effect.
We followed the model of Torralba and colleagues (2006) to extract global features of scenes, calculate a prediction map for a given target, and associate the two. The present model controls the learning-speed parameter within a reinforcement learning framework to simulate the cumulative learning effect (Glimcher, 2011). We conducted two psychophysical experiments using two different contexts (randomly arranged letters and natural scenes). To compare the model simulation with human performance, we counted the number of saccades needed before reaching the target location as an index of the time required to detect the target, similar to previous studies (Beesley et al., 2016; Brady & Chun, 2007). The reason for adopting saccade counts instead of reaction times is that it is difficult to precisely simulate the response process, such as recognizing a target, generating a motor command, and executing the command, as their temporal characteristics vary from saccade to saccade and trial to trial. If we estimate the number of saccades needed to find the target from the model output, no assumption other than that attention deployment improves through learning is necessary to compare performance between human participants and model simulations. Simulations were successful for both experiments, showing different time courses and speeds: Learning was faster with natural scenes than with letter arrangements on a homogeneous background. We also applied the proposed model to other characteristics of the contextual cueing effect, such as local learning around the target (Brady & Chun, 2007), the effect of the ratio of repeated to novel stimuli (Zinchenko, Conci, Müller, & Geyer, 2018), and the dominance of natural scenes over letter arrangements (Rosenbaum & Jiang, 2013), in order to examine the generality of the proposed model.
Psychophysical experiments
We first describe two experiments that explored the contextual cueing effect with two different types of image context (Figure 1). Experiment 1 used a homogeneous gray background with the target and distractor letters randomly distributed on the display. Experiment 2 used natural scenes with target and distractor letters. The task was to find a target letter, T. Eye movements were measured during search in addition to the time to find the target. The number of saccades before finding the target serves as an index of performance level.
Figure 1.
 
Trial sequence for Experiment 1 (left) and Experiment 2 (right). A fixation cross was displayed for 700 ms after the participant pressed a key to initiate the trial. A search display was presented until the participant's response or for up to 10 seconds. The search display consisted of the target, T, and distractors, Ls, in Experiment 1. In Experiment 2, the target was presented on a natural scene with seven Ls randomly distributed on the scene to avoid the target popping out as a single letter in the scene.
Experiment 1: Learning randomly arranged letter layouts
Participants
Sixteen participants were recruited as volunteers (mean age, 24 ± 2.2 years). The sample size was estimated from a power analysis with G*Power (Faul, Erdfelder, Buchner, & Lang, 2009; Faul, Erdfelder, Lang, & Buchner, 2007). Fourteen participants were necessary to reach an α error of 0.05 and a power of 0.80 together with a medium effect size (ηp² = 0.25). The number of participants in this experiment was comparable to previous studies (e.g., Brady & Chun, 2007; Chun & Jiang, 1998; Rosenbaum & Jiang, 2013). All participants had normal or corrected-to-normal vision and were naïve about the purpose of the experiment. All of them were briefed before the experiment and signed consent forms. This study was approved by the Ethics Committee of the Research Institute of Electrical Communication, Tohoku University. The work was carried out in accordance with the tenets of the Declaration of Helsinki.
Apparatus
Stimuli were presented on a 17-inch liquid-crystal display (Dell Technologies, Round Rock, TX) with a resolution of 1280 × 1024 pixels and a refresh rate of 60 Hz. The observer's head was fixed using a chin rest. The distance between the observer and the display was 60 cm, at which the display subtended 33° × 26° in visual angle. Eye movements were measured and recorded using an eye-tracking device (Tobii EyeX; Tobii AB, Stockholm, Sweden) with an accuracy of 0.4° and a sampling rate of 60 Hz. 
Stimuli and procedure
The whole display was divided by invisible grid lines into 12 (horizontal) × 8 (vertical) cells, each of which subtended 2.8° × 3.2°. Target and distractor letters were placed at the centers of randomly chosen cells. The orientation of the target T was randomly chosen from 90° or 270°, and the orientation of each distractor L was randomly chosen from 0°, 90°, 180°, or 270°. Figure 1 shows an example of the stimulus display. Participants searched for the target among 15 distractors (Figure 1, left). There were 30 blocks of 24 trials each; among the 24 trials of each block, 12 trials used repeated layouts and the other 12 used novel layouts. In the repeated layouts, the positions of all letters were identical throughout the experiment. In the novel layouts, the positions of the letters were chosen randomly. The presentation order of these layouts was randomized within each block.
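As a concrete illustration of this layout construction, the following Python sketch (our own illustrative code, not the authors'; all names are ours) generates one search display as a target T plus 15 distractor Ls placed at randomly chosen cell centers of the 12 × 8 grid.

```python
# Illustrative sketch of stimulus layout generation (not the authors' code).
import random

GRID_W, GRID_H = 12, 8      # invisible cells (horizontal x vertical), each 2.8 x 3.2 deg
N_DISTRACTORS = 15

def make_layout(rng):
    """Return one search layout: a target T and 15 distractor Ls at random cell centers."""
    cells = [(col, row) for col in range(GRID_W) for row in range(GRID_H)]
    chosen = rng.sample(cells, N_DISTRACTORS + 1)
    return {
        "target": {"cell": chosen[0], "orientation": rng.choice([90, 270])},      # T
        "distractors": [{"cell": c, "orientation": rng.choice([0, 90, 180, 270])}  # Ls
                        for c in chosen[1:]],
    }

rng = random.Random(0)
repeated_layouts = [make_layout(rng) for _ in range(12)]   # fixed across the experiment
novel_layout = make_layout(rng)                            # regenerated on every novel trial
```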
Before each block, the participant sequentially fixated nine circles on the display to calibrate the eye tracker. Each trial began with a fixation display when the participant pressed a key to initiate the trial. The search display appeared 700 ms after the onset of the fixation display, and the participant began to search for the target T. When the participant found the target, they reported its orientation by pressing one of two arrow keys as quickly and accurately as possible. The stimulus display remained on the screen until the participant's response was received. The next trial began immediately after the response keypress, starting with the 700-ms fixation display, which was long enough to prepare for the next trial. Trials in which the participant did not respond within 10 seconds were terminated and excluded from data analysis. The total number of trials discarded from the analysis was less than 1%.
Results
We aggregated reaction time (RT) results from three blocks into one epoch (i.e., 36 trials each for repeated and novel layouts per epoch). Figure 2a depicts RT as a function of epoch for repeated and novel layouts, and the results show the typical contextual cueing effect (i.e., shorter RT for the repeated layouts when compared with the novel layouts). We conducted a two-way, repeated-measures, within-subject analysis of variance (ANOVA) for the average RT for the first and last epochs between the two layouts (Beesley et al., 2016; Chun & Jiang, 1998; Rosenbaum & Jiang, 2013). There were significant main effects of epochs, F(1, 15) = 6.30, p < 0.05, and of layouts, F(1, 15) = 31.34, p < 0.01. Their interaction was also significant, F(1, 15) = 15.45, p < 0.01, suggesting larger RT shortening through epochs for the repeated layouts. We applied a t-test for the difference in RTs between repeated and novel layouts in each epoch. RTs at epochs 1 and 2 were virtually identical in the repeated and novel layouts (p = 0.95 for epoch 1 and p = 0.84 for epoch 2), but significant differences appeared across later epochs (all p < 0.05), except for epoch 7 (p = 0.17). 
Figure 2.
 
Results of Experiment 1. Reaction time (a) and saccade number (b) at each epoch for repeated (black) and novel (gray) layouts (averages of all participants with standard errors of the mean).
Similarly, the eye-movement results (i.e., the number of saccades) showed a clear contextual cueing effect. Figure 2b shows the average number of saccades across all participants needed for target detection. We conducted a two-way, repeated-measures, within-subject ANOVA on the number of saccades for the first and last epochs between the two layouts (Harris & Remington, 2017; Peterson & Kramer, 2001). There were significant main effects of epoch, F(1, 15) = 7.96, p < 0.05, and of layout, F(1, 15) = 14.09, p < 0.01. Their interaction was also significant, F(1, 15) = 11.66, p < 0.01, suggesting a larger reduction in the number of saccades through epochs for the repeated layouts. We applied a t-test to the difference in the number of saccades between repeated and novel layouts in each epoch. The number of saccades at epochs 1 and 2 was virtually identical for the repeated and novel layouts (p = 0.92 for epoch 1 and p = 0.90 for epoch 2), but significant differences appeared across later epochs (all p < 0.05), except for epoch 4 (p = 0.10). A decrease in the number of saccades could explain the major portion of the RT reduction, as suggested previously (Harris & Remington, 2017; Peterson & Kramer, 2001; Tseng & Li, 2004).
Experiment 2: Learning natural scene backgrounds
In Experiment 2, we used natural scenes as the background (Brockmole et al., 2006; Rosenbaum & Jiang, 2013) to investigate how context or surrounding information in everyday-life conditions influences learning effects. Natural scenes were overlaid with the same type of letter arrangement used in Experiment 1 (T and Ls). The distractors were necessary to prevent participants from simply looking for letter-like features to perform the task; if only one letter were displayed, the target might occasionally pop out of the display because of its letter-like visual features. The letter arrangement was not a part of the context in this experiment, and the arrangement of L-shaped distractors was renewed on every trial, even for the repeated natural scenes. Therefore, the contextual cueing effect in Experiment 2 reflects the association of the target location with the natural scene.
Participants
Fifteen new participants (mean age, 24.8 ± 0.7 years) volunteered for Experiment 2. The sample size calculation was the same as in Experiment 1. All participants had normal or corrected-to-normal vision and were naïve about the purpose of the experiment. One subject was excluded from analysis due to the loss of eye-tracking data. 
Stimuli and procedure
Stimuli and procedures were identical to those in Experiment 1 except for the four points described below. First, the search display background was chosen from among 500 crossroad scenes in Japanese cities collected from Google Street View (https://www.google.com/maps/streetview). These were everyday scenes familiar to our participants, but they were unlikely to have seen these particular scenes before because the scenes were from cities outside the prefecture in which they lived. Although we did not conduct a formal interview, no participant reported encountering a known scene during the experiment. In a repeated layout, the same scene was used across all blocks with the same target (letter T) location. In a novel layout, a new scene was always used. The distractor (letter L) locations were always updated in both the repeated and novel conditions. Features of the natural scenes, and not of the letter arrangements, were the focus of interest in this experiment.
Second, we decreased the number of distractors (letter Ls) in each trial to seven because their primary role was to prevent the target (letter T) from popping out of the scene. We derived the number of distractors from a pilot experiment to match task difficulty with Experiment 1. 
Third, participants were given an additional recognition task after completing all blocks of the visual search experiment. The scenes used as repeated scenes were mixed with the same number of novel scenes chosen from the unused crossroad scenes and presented in random order. The orientation of each letter L was randomly chosen from 0°, 90°, 180°, or 270°; Ls were placed at both the target and distractor locations in the repeated scenes and at eight randomly selected locations in the novel scenes. Participants first answered whether they had seen each image in the search experiment, and they then indicated the most likely target location by clicking one of the Ls, based on either memory or guessing if they simply did not know.
Finally, the number of trials differed from Experiment 1. Each participant completed 20 blocks of 40 trials each. Half of the trials (20) used repeated scenes, and the other half used novel scenes within a block. When natural images were used as context, participants could memorize the stimuli explicitly; therefore, the number of repeated stimuli was increased to avoid the stimuli being memorized too quickly to observe the learning process.
Results
We aggregated RT results from two blocks into one epoch for the analysis (i.e., 40 trials for repeated and novel scenes in one epoch) to equate, as far as possible, the number of trials per epoch when comparing the effects of the different types of contextual information. Figure 3a shows reaction times as a function of epoch for repeated and novel scenes, and the results show the contextual cueing effect of shorter RTs for repeated scenes than for novel scenes. We conducted a two-way, repeated-measures, within-subject ANOVA on the average RTs for the first and last epochs between the two scene types (Beesley et al., 2016; Chun & Jiang, 1998; Rosenbaum & Jiang, 2013). There were significant main effects of epoch, F(1, 13) = 9.62, p < 0.01, and of scene, F(1, 13) = 14.92, p < 0.01. The interaction was also significant, F(1, 13) = 18.38, p < 0.01, suggesting larger RT shortening through epochs for the repeated scenes. We applied a t-test to the difference in RTs between repeated and novel scenes in each epoch. There was no significant difference in RTs at epoch 1 between repeated and novel scenes (p = 0.09), but significant differences appeared across later epochs (p < 0.05).
Figure 3.
 
Results of Experiment 2. Reaction time (a) and saccade number (b) at each epoch for repeated (black) and novel (gray) scenes (averages of all participants with standard errors of the mean).
The eye-movement results also showed a clear contextual cueing effect. Figure 3b shows the average number of saccades across all participants needed for target detection. We conducted a two-way, repeated-measures, within-subject ANOVA on the number of saccades for the first and last epochs between the two scene types (Harris & Remington, 2017; Peterson & Kramer, 2001). There were significant main effects of epoch, F(1, 13) = 6.98, p < 0.05, and of scene, F(1, 13) = 4.88, p < 0.05. The interaction was also significant, F(1, 13) = 10.08, p < 0.01, suggesting a larger reduction in the number of saccades through epochs for the repeated scenes. We applied a t-test to the difference in the number of saccades between repeated and novel scenes in each epoch. There was no significant difference at epoch 1 between repeated and novel scenes (p = 0.18), but significant differences appeared across later epochs (all p < 0.05).
We compared the learning rates of the contextual cueing effect in Experiments 1 and 2 using the difference in the number of saccades for the repeated and novel stimuli (Figure 9). We fitted a power function, e(x) = a + bx^c, to the difference in the number of saccades (Bergmann et al., 2020; Brooks, Rasmussen, & Hollingworth, 2010; Chun & Jiang, 2003). In the power function, a is the asymptotic value of the contextual cueing effect, b is the magnitude of the contextual cueing effect (the difference between the initial and asymptotic values), x is the epoch, and c is the slope of the function. For each experiment, we fitted the power function to 1,000 bootstrap-sampled datasets. The mean values of c were 0.19 (95% confidence interval [CI], 0.11–0.27) in Experiment 1 and 0.40 (95% CI, 0.31–0.50) in Experiment 2. We subtracted the slope (c) in Experiment 1 from that in Experiment 2 and found that only 4.5% of the samples took negative values, indicating that learning was significantly faster when naturalistic backgrounds were used as context.
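The following sketch shows one way to perform this fit and bootstrap comparison (an assumed implementation using SciPy, not the authors' code; the data layout, starting values, and function names are ours).

```python
# Sketch of the learning-rate comparison: fit e(x) = a + b*x**c to the per-epoch
# contextual cueing effect and bootstrap the slope parameter c.
import numpy as np
from scipy.optimize import curve_fit

def power_fn(x, a, b, c):
    return a + b * x**c

def bootstrap_slope(effect_per_subject, n_boot=1000, seed=0):
    """effect_per_subject: (n_subjects, n_epochs) array of saccade-count differences."""
    rng = np.random.default_rng(seed)
    n_sub, n_epoch = effect_per_subject.shape
    epochs = np.arange(1, n_epoch + 1)
    slopes = []
    for _ in range(n_boot):
        # Resample participants with replacement and refit the averaged curve.
        resampled = effect_per_subject[rng.integers(0, n_sub, n_sub)].mean(axis=0)
        params, _ = curve_fit(power_fn, epochs, resampled,
                              p0=(0.0, 1.0, 0.3), maxfev=10000)
        slopes.append(params[2])
    return np.array(slopes)

# The fraction of (slope_exp2 - slope_exp1) bootstrap differences below zero
# serves as the significance criterion described in the text.
```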
The recognition test showed a recognition rate of 64.8% for the repeated scenes, which was significantly different from chance performance (50%), t(13) = 20.57, p < 0.05, indicating that the learning effect in this experiment was at least partially obtained by an explicit process. This result is consistent with previous studies with natural scenes (Brockmole & Henderson, 2006; Brockmole et al., 2006; Rosenbaum & Jiang, 2013).
To analyze the effect of explicit knowledge on the contextual cueing effect, we divided the repeated scenes into recognized and unrecognized scenes based on the recognition test. For each, we calculated the difference in the number of saccades between the repeated and novel scenes (Figure 4) as an index of the contextual cueing effect. We conducted a two-way, repeated-measures, within-subject ANOVA on the average difference in the number of saccades between the first and last epochs for the two scene groups. Neither the main effects nor the interaction was significant: recognition, F(1, 48) = 2.72, p = 0.11; epoch, F(1, 48) = 0.87, p = 0.36; recognition × epoch, F(1, 48) = 0.084, p = 0.77. We also compared the contextual cueing effect in the last epoch as the most conservative case, because it was impossible to know when the participants explicitly remembered the scenes. There was no statistically significant difference between recognized and unrecognized repeated stimuli, even in the last epoch, t(17) = 1.14, p = 0.14.
Figure 4.
 
The contextual cueing effect (the difference in the number of saccades) for repeated stimuli with recognition (black) or without recognition (gray) (averages of all participants with standard errors of the mean).
We further investigated whether participants with higher recognition accuracy showed a stronger contextual cueing effect. Figure 5 shows the contextual cueing effect plotted as a function of recognition accuracy. Pearson's correlation coefficient between the difference in the number of saccades for repeated and novel scenes at the last epoch and recognition accuracy was –0.31, but the value was not statistically significant, t(12) = –1.11, p = 0.29 (Figure 5, left). The tendency was the same when the contextual cueing effect was calculated as the difference in RTs between repeated and novel scenes, r = –0.31, t(12) = –1.14, p = 0.28 (Figure 5, right). Although explicit memory may affect the contextual cueing effect to some extent, the statistical analyses of the present results do not support this notion.
Figure 5.
 
Differences in the number of saccades (left) and reaction times (right) between repeated and novel stimuli at the last epoch are plotted as a function of recognition accuracy.
Discussion
Differences in the conditions between two experiments
The use of natural scenes in Experiment 2 was the largest difference from Experiment 1, but there were also two additional differences in experimental conditions: the number of repeated stimuli and the number of letters embedded in the stimuli. The faster learning in Experiment 2 than in Experiment 1 can be explained by the use of natural scenes, as previous studies have reported (Brockmole & Henderson, 2006; Rosenbaum & Jiang, 2013). The present results, using crossroad scenes, provide another piece of evidence that learning is faster for natural images. We speculate that a larger number of repeated stimuli would, if anything, make learning slower rather than faster, although, to our knowledge, no study has investigated the effect of the number of repeated stimuli on the contextual cueing effect. Thus, the larger number of repeated layouts in Experiment 2 cannot explain the difference in learning speeds between Experiments 1 and 2. Similarly, the randomly distributed letters embedded in the stimuli could only impair learning, because the layout information changed from block to block, so this could not explain the difference in learning speeds between Experiments 1 and 2 either.
Roles of explicit memory
The role of explicit memory in the contextual cueing effect is controversial. Some studies have reported that explicit memory facilitates contextual cueing (Kroell, Schlagbauer, Zinchenko, Müller, & Geyer, 2019; Shioiri et al., 2018). Other studies have reported that explicit memory has no effect (Chun & Jiang, 2003; Westerberg, Miller, Reber, Cohen, & Paller, 2011) or even reduces the benefit of contextual cueing (Spaak & de Lange, 2020). If explicit memory had any effect on attention deployment, the correlation between the contextual cueing effect and recognition accuracy would be expected to be significant. However, there was no positive support for a benefit of explicit memory on contextual cueing in Experiment 2 (Figure 5). Geyer, Rostami, Sogerer, Schlagbauer, and Müller (2020) used letter-arrangement stimuli as context with two different recognition tests (yes/no recognition and target quadrant report). The correlation was significant when participants performed the target quadrant report task, suggesting a relationship between explicit memory and contextual cueing. It would be interesting to investigate the effect of explicit memory with natural scene stimuli by using the target quadrant report task as the recognition test (Meyen, Vadillo, Von Luxburg, & Franz, 2023).
One factor that may differentiate recognized from unrecognized stimuli is the time participants spent looking at them. A comparison of mean reaction times at the last epoch for recognized and unrecognized stimuli (recognized, 1281.4 ± 290.6 ms; unrecognized, 961.8 ± 142.6 ms) revealed a statistically significant difference, t(19) = 3.67, p < 0.01. In Experiment 2, the stimuli were not strictly controlled in terms of local contrast around the letters or the distance of the target from the center of the stimulus. It is possible that targets in some of the stimuli were less visible than in others. For such stimuli, participants had to observe the images carefully and spend a longer time than for the others, which could in turn have helped them memorize the images. Further investigation is necessary to clarify the effect of explicit memory on the contextual cueing effect.
Simulation
Model structure
We developed a computational model (Figure 6) to understand the learning process of spatial layouts revealed in the contextual cueing effect. The model assumes that the visual system extracts global features from each scene to associate with the target location. Global features were defined from an ensemble (or combination) of six orientations and four spatial scales (Oliva & Torralba, 2001; Torralba et al., 2006). Following the method of Torralba et al. (2006), 24 Gabor filters (six orientations and four scales) were applied to each of the 4 × 4 subregions into which each image was spatially divided. These 384 filter outputs (6 orientations × 4 scales × 4 × 4 subregions) were used as the elements of a vector that we defined as the global feature (spatial envelope) of the image (Figure 6a). The peak spatial frequency sensitivities of the four scales were 1.7, 3.4, 6.8, and 13.4 cycles/degree, respectively. These sensitivities are in line with psychophysical reports (Blakemore & Campbell, 1969; Pantle & Sekuler, 1968). The global feature vector is used to identify the image or scene representation in memory that corresponds to the image input. We refer to this representation in memory as the image class.
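A minimal sketch of this global-feature computation is given below (our reading of the Oliva–Torralba spatial envelope, not the authors' code; the kernel size, Gaussian bandwidth, and cycles/pixel frequencies are placeholder assumptions, since the paper specifies sensitivities in cycles/degree).

```python
# Sketch: 6 orientations x 4 scales of Gabor filtering, energy pooled over a
# 4 x 4 grid of subregions, concatenated into a 384-dimensional feature vector.
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(frequency, theta, size=31, sigma=4.0):
    """Real part of a Gabor filter; frequency in cycles/pixel, theta in radians."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * frequency * xr)

def global_feature(image, n_orient=6, n_scale=4, grid=4):
    """image: 2-D grayscale array. Returns a vector of length n_orient*n_scale*grid*grid."""
    feats = []
    for s in range(n_scale):
        frequency = 0.05 * (2 ** s)        # placeholder scale spacing (cycles/pixel)
        for o in range(n_orient):
            theta = o * np.pi / n_orient
            response = np.abs(fftconvolve(image, gabor_kernel(frequency, theta), mode="same"))
            h, w = response.shape
            # Pool the rectified filter response within each of the grid x grid subregions.
            for i in range(grid):
                for j in range(grid):
                    block = response[i * h // grid:(i + 1) * h // grid,
                                     j * w // grid:(j + 1) * w // grid]
                    feats.append(block.mean())
    return np.asarray(feats)               # length 6 * 4 * 16 = 384
```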
Figure 6.
 
(a) Two examples of global feature (bottom) obtained from a letter arrangement on a gray background (left) and natural scene background (right). A lighter color in the bottom vector indicates a higher value. The dynamic range is the same for the two global features. (b) A schematic illustration for gaze prediction. The saliency map obtained from the stimulus is modulated by learned context (the target map) through simulations.
The possible location of the target within a scene was expressed as a topographical map of probability (target map). The target map models the effect of learning the association between the target location and the image class. The target map was expressed as a 6 × 4 matrix, in which each cell corresponded to 2 × 2 of the virtual cells used for letter placement in a stimulus. This matrix size was determined based on a pilot simulation of the number of saccades in the first epoch of the experiments. Varying the matrix size changes the number of saccades needed to reach the target cell; a smaller matrix results in fewer saccades because of the lower spatial resolution for the target position. The pilot simulation showed that the number of saccades to detect the target was about 10 with a 6 × 4 matrix. The value of each cell in a target map indicated the estimated probability of the target appearing there. The target map was initialized so that all cells had the same value (i.e., a uniform probability distribution) at the beginning of the search experiment, and the map was updated after each trial during the learning process, as described below.
We used a temporal difference algorithm of reinforcement learning (Sutton & Barto, 1981) to update the target maps, which incrementally strengthened the association between each image class and the target location. On each trial, the global feature vector of the input image was compared with those of the image classes stored in memory. The similarity value (si) was computed as the normalized correlation (cosine similarity) between the current global feature (C) and the global feature of the ith memorized image class (Gi), as follows:
\begin{equation}
s_i = \frac{\sum_{j=1}^{n} C_j\, G_j^i}{\sqrt{\sum_{j=1}^{n} \left( C_j \right)^2}\ \sqrt{\sum_{j=1}^{n} \left( G_j^i \right)^2}}
\tag{1}
\end{equation}
where j is the index of an element of the vector and n is the size of the global feature vector (i.e., 384 in the present model). The similarity value (si) is used as the weight for updating the ith target map. When all similarity values were below a preset threshold (described later), a new image class was generated with a uniform probability map. Otherwise, all target maps with a similarity higher than the threshold were updated with the following formula:
\begin{equation}
\boldsymbol{M}_i^{\prime} = \boldsymbol{M}_i + \eta\, s_i \left( \boldsymbol{T}_i - \boldsymbol{M}_i \right)
\tag{2}
\end{equation}
where Mi and Mi′ represent the target map of the ith image class before and after updating. The constant η is a learning rate that controls the learning speed. Ti is the target location matrix of the given trial, that is, one at the target location and zero elsewhere. The target map was updated based on the error between the current map and the answer. Learning based on the error between a current output and the answer has been used in modeling behavior, such as classical conditioning (Rescorla & Wagner, 1972), as well as contextual cueing (Brady & Chun, 2007). Functional magnetic resonance imaging (fMRI) studies on humans have reported brain activities related to error computation (e.g., O'Doherty, Dayan, Friston, Critchley, & Dolan, 2003); therefore, we used this formula for updating the target map. The target map was normalized after updating so that it summed to 1. After repeated exposure to the same images, the target map gradually developed a peak at the target location, with which the target can be localized soon after identification of the image class for a given image. The target map is updated through repetition, but the global feature is not; that is, what the model learns is the association between the global feature and the target position. If the learning rate η is equal to 1, the peak can be achieved after a single trial. The value of η was determined for each experiment by minimizing the difference between the model performance and the experimental results.
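The following Python sketch puts Equations 1 and 2 together (a minimal assumed implementation, not the authors' released code; the class and variable names are ours).

```python
# Sketch of the memory update: each image class stores a global feature vector
# and a 6 x 4 target map updated by a TD-style rule and then renormalized.
import numpy as np

MAP_SHAPE = (4, 6)   # 6 (horizontal) x 4 (vertical) cells, stored as rows x cols

class ContextMemory:
    def __init__(self, s_th, eta):
        self.s_th = s_th                # similarity threshold
        self.eta = eta                  # learning rate
        self.features = []              # G_i: global feature of each image class
        self.maps = []                  # M_i: target map of each image class

    @staticmethod
    def similarity(c, g):
        # Equation 1: cosine similarity between the current feature and a stored class.
        return float(np.dot(c, g) / (np.linalg.norm(c) * np.linalg.norm(g)))

    def update(self, feature, target_cell):
        t = np.zeros(MAP_SHAPE)
        t[target_cell] = 1.0            # T_i: 1 at the target location, 0 elsewhere
        sims = [self.similarity(feature, g) for g in self.features]
        matched = [i for i, s in enumerate(sims) if s >= self.s_th]
        if not matched:                  # no sufficiently similar class: create a new one
            self.features.append(feature)
            self.maps.append(np.full(MAP_SHAPE, 1.0 / t.size))
            return
        for i in matched:                # Equation 2, weighted by the similarity s_i
            m = self.maps[i]
            m += self.eta * sims[i] * (t - m)
            m /= m.sum()                 # renormalize so the map sums to 1
```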
Model evaluation
Learning speed
We used the number of saccadic eye movements to enable a direct comparison of performance between human participants and model simulations, similar to previous studies (Beesley et al., 2016; Brady & Chun, 2007). The reason for adopting saccade counts instead of RTs is that it is difficult to precisely simulate the response process, such as recognizing a target, generating a motor command, and executing the command, as their temporal characteristics vary from saccade to saccade and trial to trial. If we estimate the number of saccades needed to find the target from the model output, no assumption other than that attention deployment improves through learning is necessary to compare performance between human participants and model simulations. As shown in the psychophysical results (Figures 2 and 3), the number of saccades and RTs yield essentially the same learning curves. Therefore, the number of saccades is an appropriate measure for validating the model's performance.
To estimate eye movements during visual search, we implemented the contextual cueing effect as an additional component of a model of bottom-up attention, the saliency map (Itti et al., 1998). The saliency map is a topographical map of conspicuity computed from low-level visual features. It is a well-recognized model of bottom-up attention that has been used, for example, to estimate gaze shifts while playing video games (Peters & Itti, 2007), while watching movies recorded from a first-person point of view (Hiratani, Nakashima, Matsumiya, Kuriki, & Shioiri, 2013), and while memorizing scenes with eye and head coordination (Nakashima et al., 2015), and it is applicable to a variety of fields such as robot navigation (Chang, Siagian, & Itti, 2010) and video compression (Li, Qin, & Itti, 2011). The map used to predict gaze locations (gaze prediction map) was derived by weighting the saliency map with the target map to obtain probabilities of gaze locations for the visual search experiments. The gaze prediction map is computed by cell-by-cell multiplication of the saliency map and the target map. In the simulation, the saliency map for each stimulus image (randomly arranged letters on a homogeneous background in Experiment 1 or a natural scene in Experiment 2) was calculated first, and then the saliency value at each location was multiplied by the target map (Figure 6b), which was updated after each trial.
We assumed that gaze shifts to the location with the highest probability of target presence among the 6 × 4 cells and that the probability of a cell becomes zero after the information there has been processed, to prevent the same location from being scanned again, as suggested by the phenomenon of inhibition of return (Klein, 2000; Klein & MacInnes, 1999; Posner, Rafal, Choate, & Vaughan, 1985). Inhibition of return refers to the inhibitory effect that arises at an attended location several hundred milliseconds after attention has been directed there. Under this assumption, gaze follows the rank of probability from high to low until reaching the target cell. Figure 7 shows an example of gaze shift predictions. The probability rank of the target cell in the gaze prediction map is fourth from the top out of 24 cells at the first epoch, whereas it becomes first at the last epoch; that is, the model requires four saccades to locate the target at the initial epoch but only one saccade after learning through 10 epochs.
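Reading out the number of saccades from the model can be sketched as follows (our assumed implementation; because visited cells are zeroed and gaze always moves to the current maximum, the saccade count equals the rank of the target cell in the gaze prediction map).

```python
# Sketch: weight the saliency map by the learned target map cell by cell, then
# count how many cells rank at or above the target cell.
import numpy as np

def count_saccades(saliency_map, target_map, target_cell):
    """saliency_map, target_map: arrays of the same 4 x 6 shape; target_cell: (row, col)."""
    gaze_prediction = saliency_map * target_map              # cell-by-cell weighting
    order = np.argsort(gaze_prediction, axis=None)[::-1]     # cells from highest to lowest
    target_index = np.ravel_multi_index(target_cell, gaze_prediction.shape)
    return int(np.where(order == target_index)[0][0]) + 1    # rank of the target cell
```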
Figure 7.
 
Example of an input stimulus (top left) and its target map (two right figures in the top row), and gaze prediction map (two right figures in the bottom row) for the first epoch (middle) and last epoch (right). A gaze prediction map was obtained from a saliency map (bottom left) and a target map (see Figure 6). Brighter colors in the gaze prediction map indicate higher probability. The number of saccades required to locate the target is based on the ranking of a value on the gaze prediction map. A cell with a target is denoted by the red outline. The yellow grids indicate the boundaries of the cells and are for presentation purposes only.
The model has two free parameters for simulating the contextual cueing effect: the similarity threshold (sth) and the learning rate (η). The similarity threshold (sth) decides whether the global feature of a given stimulus belongs to one of the image classes already in memory during the learning process. If we choose a value close to one, the similarity between a given global feature and each of the image classes tends to fall below the threshold; in this case, each global feature constitutes its own class. Because there is no theory to predict these parameters, we determined their values as follows. We assumed that the number of classes was equal to the total number of repeated images used in the experiments (12 in Experiment 1 and 20 in Experiment 2) or larger than but as close as possible to it. We determined the similarity threshold (sth) with a trial-and-error approach for each experiment (Figure 8). Figure 8a shows the root-mean-square errors (RMSEs) between the model performance and the experimental results for sth values ranging from 0.05 to 0.95 in steps of 0.05 and η values ranging from 1.0 × 10⁻⁴ to 2.5 × 10⁻³ in steps of 1.0 × 10⁻⁵. As can be seen in Figure 8, because the RMSEs between the model performance and the experimental results did not change much for thresholds smaller than 0.95, we adopted one of the thresholds that gave similarly small RMSEs (sth = 0.70 for Experiment 1 and sth = 0.88 for Experiment 2). The average number of classes across participants at these thresholds was 21.5 ± 2.3 for Experiment 1 and 30.6 ± 2.6 for Experiment 2.
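The parameter search can be sketched as a plain grid search over the ranges stated above (an assumed procedure, not the authors' code; `simulate` is a hypothetical callable standing in for a full run of the model on the experimental stimuli).

```python
# Sketch: sweep the similarity threshold and learning rate, simulate the
# experiment for each pair, and keep the pair with the smallest RMSE between
# the simulated and observed contextual cueing effect.
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

def grid_search(simulate, observed_effect):
    """simulate(s_th, eta) -> per-epoch effect predicted by the model (hypothetical callable)."""
    thresholds = np.arange(0.05, 0.951, 0.05)
    learning_rates = np.arange(1.0e-4, 2.5e-3 + 1e-9, 1.0e-5)
    best = (None, None, np.inf)
    for s_th in thresholds:
        for eta in learning_rates:
            err = rmse(simulate(s_th, eta), observed_effect)
            if err < best[2]:
                best = (s_th, eta, err)
    return best   # (s_th, eta, RMSE)
```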
Figure 8.
 
The RMSEs for the different simulation parameters. (a) RMSEs are plotted as a function of the thresholds and learning rates. The yellow color indicates larger RMSEs, and the red circles indicate the smallest RMSEs. (b) RMSEs are plotted as a function of the learning rate. The threshold was fixed to the value used in the subsequent simulation (0.75 for Experiment 1 and 0.88 for Experiment 2).
The learning rate η modulated the degree of updating of a given target map; that is, learning progresses faster or slower with a larger or smaller η, respectively. We determined the value of η by minimizing the least-squares error over repeated simulations while systematically varying the value (Figure 8). The best-fit values of η were 3.0 × 10⁻⁴ with an sth of 0.70 in Experiment 1 and 1.3 × 10⁻³ with an sth of 0.88 in Experiment 2.
We defined the learning effect from contextual cueing as the difference in the number of saccades between repeated and novel layouts. Figure 9 compares the contextual cueing effect between the experiments and the simulations (Figure 9a for Experiment 1 and Figure 9b for Experiment 2). Simulation results were averaged across 16 runs (Experiment 1) or 14 runs (Experiment 2), matching the number of participants in each experiment. The simulated contextual cueing effect (Figure 9, gray lines) was similar to the experimental results (Figure 9, black lines). Fitting errors quantified as RMSEs were 0.37 for Experiment 1 and 0.79 for Experiment 2. These errors were 63% smaller than the data deviations from the average (0.99 for Experiment 1 and 2.00 for Experiment 2), which served as the baseline for data deviation. Because the model explains the current results, we next attempted to predict the contextual cueing effect in several conditions reported previously.
Figure 9.
 
Learning effect of repeated exposure to repeated stimuli. The model result is an average of simulations repeated with the same number of participants for each experiment. The mean differences in the number of saccades between repeated and novel layouts are plotted as a function of epoch. Error bars are standard errors of the mean.
Local learning around the target
Not all distractors are equally associated with the target position when letter arrangement is used as the only context (Brady & Chun, 2007; Olson & Chun, 2002). Specifically, participants learned the association between the target and the distractors near it. Brady and Chun (2007) reported that a model can explain such learning of local arrangements. We examined whether the proposed model could also predict the contextual cueing effect when only the distractor letters near the target were repeated. Simulations were conducted under a condition in which four letters were placed in each quadrant, and only the positions of the three distractors in the quadrant containing the target were repeated. Other conditions were identical to Experiment 1 of this study. We repeated the simulation 16 times, the same as the number of participants in Experiment 1. The model parameters, the threshold and learning rate, were set to 0.75 and 0.1, respectively.
Figure 10 shows the difference in the number of saccades between repeated and novel stimuli. The model reproduced the contextual cueing effect when the distractors around the target were repeated, such that the search for repeated stimuli became more efficient as the epochs progressed (Figure 10, black line). We found a statistically significant difference in the contextual cueing effect between epochs 1 and 10, t(15) = 4.39, p < 0.01. Thus, the proposed model, which has no explicit spatial constraint, succeeded in predicting the contextual cueing effect for layouts in which only a small portion of distractors near the target was repeated. However, the contextual cueing effect also emerged when the positions of the four distractors in the quadrant diagonal to the quadrant containing the target were repeated, t(15) = 7.92, p < 0.01 (Figure 10, gray line), which is not consistent with the psychophysical results (Brady & Chun, 2007). Although this is a logical consequence of determining a class from features over the whole display, it is worth simulating with the actual stimulus conditions used in experiments to make sure that there is no unknown artifact specific to the actual layout displays. These results suggest the necessity of applying spatial constraints around the target at the learning stage.
Figure 10.
 
Learning effect for quadrant-predictive stimuli. Distractors in the same quadrant (black line) or diagonal quadrant (gray line) containing the target were repeated. The mean difference in the number of saccades between repeated and novel layouts is plotted as a function of epoch. Error bars are the standard errors of the mean.
Effect of similarity threshold
In the present model, the similarity threshold can be set arbitrarily between 0 and 1, and different similarity thresholds yield some variability in fitting errors (Figure 8). As can be seen in Figure 8, the best fits (smallest RMSEs) were found with a threshold of 0.65 and a learning rate of 3.0 × 10⁻⁴ for Experiment 1 and with a threshold of 0.60 and a learning rate of 2.3 × 10⁻³ for Experiment 2. The simulation results thus indicate appropriate learning rates for Experiments 1 and 2 but rather ambiguous results for the similarity threshold. Because the RMSEs do not change much for thresholds smaller than 0.95, we adopted a threshold from those that provided similarly small RMSEs, assuming that the number of memorized layouts is close to the number of repeated layouts in the experiment (i.e., not too many layouts are memorized). Accordingly, the similarity thresholds were chosen as 0.70 for the letter arrangements in Experiment 1 and 0.88 for the natural scene backgrounds in Experiment 2. We confirmed that these thresholds did not yield a one-to-one class representation for either stimulus type. The number of classes (Experiment 1, 21.5 ± 2.3; Experiment 2, 30.6 ± 2.6) was within approximately twice the number of repeated stimuli (Experiment 1, 12; Experiment 2, 20).
We further evaluated the effect of the choice of similarity threshold with an additional simulation. Zinchenko et al. (2018) reported that contextual cueing was absent when the proportion of repeated to novel stimuli was small. If each stimulus constituted its own class, contextual cueing effects should occur regardless of the ratio between repeated and novel stimuli, because the class corresponding to each repeated stimulus can signal the target position. However, this is not the case if similar stimuli constitute one class, because novel stimuli similar to the repeated stimuli exist, and the global features of those stimuli can be associated with different target locations. We tested this prediction in a simulation experiment. We set the numbers of repeated and novel stimuli to 8 and 32, in line with Zinchenko et al. (2018), and ran simulations under two conditions: one with a low similarity threshold (sth = 0.60) and one with a high threshold (sth = 0.90). The low similarity threshold was chosen as the maximum threshold that kept the number of classes approximately within twice the number of repeated stimuli. The learning rate (η) was set to 1.0 × 10⁻⁴ for both conditions. The simulation was repeated 16 times, the same as the number of participants in Experiment 1 and larger than the number of participants in the previous study (n = 13 in Zinchenko et al., 2018). The average numbers of classes across simulations were 13.8 ± 2.5 and 686.6 ± 29.4 for the low and high similarity thresholds, respectively. Figure 11 shows the difference in the number of saccades between repeated and novel stimuli. In the low-threshold condition (Figure 11, gray line), the contextual cueing effect did not occur, t(15) = 1.56, p = 0.14, replicating the result of the previous study (Zinchenko et al., 2018). In contrast, the contextual cueing effect emerged in the high-threshold condition, t(15) = 3.42, p < 0.01 (Figure 11, black line). These results support the existence of class-like memory representations. It should be noted that, although our model can explain the experimental results, the parameter settings were based on rather small datasets in the present study. Future studies, under more general conditions, should further investigate the similarity threshold, which may relate to how images are memorized in the human brain.
Figure 11.
 
Learning effect with low (sth = 0.60; gray line) and high (sth = 0.90; black line) similarity thresholds. The mean differences in the number of saccades between repeated and novel layouts are plotted as a function of epoch. Error bars are the standard errors of the mean.
Superiority of natural scenes over letter arrangement
In the study by Rosenbaum and Jiang (2013), participants searched for a target letter presented among distractor letters on a natural image. In that experiment, both the natural image and the distractor letters were associated with the target location. The contextual cueing effect disappeared when the association between the natural images and the target location was removed, indicating that natural scene backgrounds dominate over letter arrangements even when both contexts are predictive of the target position (Rosenbaum & Jiang, 2013). We examined whether the proposed model could simulate this superiority of natural images over letter arrangements. The simulation conditions were the same as in Experiment 2 except for the following points. A transfer phase of two blocks followed the learning phase, as in the experiment by Rosenbaum and Jiang (2013). In the transfer phase, the letter arrangement in the repeated condition remained the same, but the background natural images were replaced with crossroad scenes not used in the learning phase. Repeated and novel scenes were used in 20 trials for each block. We repeated the simulation 16 times, the same as the number of participants in the previous study (Rosenbaum & Jiang, 2013). The model parameters, similarity threshold and learning rate, were set to 0.88 and 1.3 × 10–3, respectively. 
The black line in Figure 12 shows the simulation results for the proposed model, where two blocks are grouped together as one epoch. In the learning phase (epochs 1–10), the difference between the repeated and novel stimuli increased steadily with epoch. In the transfer phase at the 11th epoch, however, the difference was about zero, and no statistically significant difference was found between epochs 1 and 11, t(15) = 0.33, p = 0.75. This is similar to the experimental results reported by Rosenbaum and Jiang (2013). 
Figure 12.
 
Learning effect of repeated exposure to repeated stimuli (epochs 1–10) and that of the transfer phase (epoch 11). (a) Example stimuli used in the simulation. In the transfer phase, the natural scene was changed but the letter positions were the same for repeated stimuli. In contrast, both the natural scene and the letter positions were changed for novel stimuli. (b) Mean differences in the number of saccades between the repeated and novel layouts are plotted as a function of epoch. Error bars are the standard errors of the mean.
We also applied the model by Beesley and colleagues (2016) for comparison, in which the context is defined by the positions of the letters (Figure 12, gray line). To simulate the Beesley et al. model, the global feature in the present model was replaced by a vector in which elements corresponding to cells containing letters took the value of 1 and all other elements were 0. The simulation parameters, similarity threshold and learning rate, were set to 0.88 and 1.0 × 10–3, respectively. The model by Beesley et al. showed a significant contextual cueing effect during the transfer phase, t(15) = 8.86, p < 0.01 between epochs 1 and 11, whereas the psychophysical experiment did not show the effect (Rosenbaum & Jiang, 2013). This indicates that a configural model that considers the letter arrangement alone does not replicate the superiority of natural images (Rosenbaum & Jiang, 2013). 
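A minimal sketch of the context vector used for this comparison is shown below (Python). The grid size and coordinates are illustrative; only the coding rule, 1 for cells containing a letter and 0 elsewhere, follows the description above.

```python
import numpy as np

def letter_occupancy_vector(letter_cells, grid_shape=(8, 6)):
    """Binary context vector for the Beesley-style comparison: cells that
    contain a letter are set to 1 and all other cells to 0.
    letter_cells is a list of (row, col) grid positions; grid_shape is illustrative."""
    v = np.zeros(grid_shape)
    for r, c in letter_cells:
        v[r, c] = 1.0
    return v.ravel()

# e.g., a display with one target and seven distractor letters
context = letter_occupancy_vector(
    [(0, 1), (1, 4), (2, 2), (3, 5), (4, 0), (5, 3), (6, 1), (7, 4)]
)
```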
Learning effect for letter arrangement on non-repeated natural scene
Random arrangements of letters should contain various orientations and spatial frequencies (or scales), as random dot patterns do. Natural scenes can also be considered assemblies of a variety of orientations and spatial frequencies, with a bias toward lower frequencies. The lines below the stimulus illustrations in Figure 6a show the values of each feature vector by lightness, where a lighter color indicates a larger value. Although light lines are visible for both types of stimuli, the lines tend to be lighter, and the differences in lightness among lines larger, for natural scenes than for letter arrangements. These larger feature values may be one reason why the learning effect of repetition is larger for natural scenes than for letter arrangements. There is also a report of a contextual cueing effect driven by repeated letter arrangements when the natural scene background changed on every repetition: Rosenbaum and Jiang (2013) reported that a contextual cueing effect based on the letter arrangement was established when the arrangement was repeated but the natural scene background was not. We performed additional simulations under conditions similar to those of Rosenbaum and Jiang (2013) to test whether the proposed model can reproduce the contextual cueing effect for repeated letter arrangements on novel scene backgrounds. The simulation conditions were the same as in the previous simulation (Figure 12) except for the following points: (1) in the repeated condition, only the letter arrangement remained the same, and the background natural images were changed randomly across blocks; and (2) the transfer phase was excluded. Repeated and novel scenes were used in 16 trials for each block. We repeated the simulation 16 times, the same as the number of participants in the previous study (Rosenbaum & Jiang, 2013). The model parameters, similarity threshold and learning rate, were set to 0.84 and 1.7 × 10–3, respectively. 
Although the difference between epochs 1 and 10 was only marginally significant, t(15) = 1.87, p = 0.081, the model reproduced the tendency toward a reduction in the number of saccades needed to find a target even when the letter arrangement was repeated on randomly changing natural scenes (Figure 13). This result suggests that the global feature used in this study also encoded information about the letter stimuli and that learning of the relationship between this repeated information and the target location took place. 
Figure 13.
 
Learning effect of repeated letter arrangements on a randomly chosen natural scene. Mean differences in the number of saccades between the repeated and novel layouts are plotted as a function of epoch. Error bars are the standard errors of the mean.
Gaze prediction
The model's visual search performance reproduced the reduction in the number of saccades as the experiment progressed (Figure 9). If human observers learned the probabilistic map of the target position, then they might look at locations with a high probability of containing the target. To test this, we investigated whether the present model predicts fixation locations as well as the reduction in the number of saccades. We evaluated model performance in terms of gaze prediction accuracy at the last epoch, where the maximal contextual cueing effect was expected in each experiment. 
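The combination of the bottom-up saliency map with the learned target map (Figures 6b and 7) can be sketched as follows. The multiplicative form used here is an assumption for illustration; the exact combination rule is given in the model description.

```python
import numpy as np

def gaze_prediction_map(saliency, target_map, eps=1e-12):
    """Combine a bottom-up saliency map with a learned target map into a gaze
    prediction map (illustrative multiplicative modulation, not the exact rule)."""
    s = saliency / (saliency.sum() + eps)    # normalize to a probability-like map
    t = target_map / (target_map.sum() + eps)
    g = s * (1.0 + t)                        # learned context boosts likely target locations
    return g / (g.sum() + eps)
```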
We used the conventional normalized scanpath saliency (NSS) metric to evaluate gaze prediction accuracy (Bylinskii, Judd, Oliva, Torralba, & Durand, 2019; Peters, Iyer, Itti, & Koch, 2005). The model output was z-normalized to have a mean of zero and unit standard deviation, and NSS was calculated as the average of the z scores at the fixation locations:
\begin{equation}
NSS = \frac{1}{N}\sum_{i=1}^{N} M\left( x_i, y_i \right), \tag{3}
\end{equation}
where N is the number of fixations, (x_i, y_i) is the ith fixation position, and M is the normalized model output. A value of zero means that the model prediction is at chance level, and positive NSS values indicate that the prediction is better than chance. 
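In code, Equation 3 amounts to z-normalizing the prediction map and averaging it at the fixation locations. The sketch below (Python) uses synthetic values for illustration; the map size and fixation coordinates are arbitrary.

```python
import numpy as np

def normalized_scanpath_saliency(pred_map, fixations):
    """NSS (Equation 3): z-normalize the prediction map and average its values
    at the observed fixations, given as (x, y) pixel coordinates."""
    m = (pred_map - pred_map.mean()) / (pred_map.std() + 1e-12)
    return float(np.mean([m[int(y), int(x)] for x, y in fixations]))  # rows index y

# Zero means chance-level prediction; positive values are better than chance.
rng = np.random.default_rng(1)
pred = rng.random((600, 800))
print(normalized_scanpath_saliency(pred, [(120, 300), (415, 255), (640, 90)]))
```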
Figure 14 shows NSS values with and without the contextual cueing effect for Experiments 1 and 2. We compared model performance with the Wilcoxon signed-rank test. In Experiment 1, model performance was significantly improved for pooled results of repeated and novel layouts with consideration of the contextual cueing effect when compared with the original salience map alone (0.92 ± 0.033 vs. 0.88 ± 0.031; z = 3.52, p < 0.01). When repeated and novel layouts were evaluated separately, the performance of our gaze prediction map showed significant improvement over the simple saliency map in repeated layouts (0.99 ± 0.047 vs. 0.92 ± 0.044; z = 3.52, p < 0.01), but not in novel layouts (0.84 ± 0.027 vs. 0.84 ± 0.027; z = −0.78, p = 0.46). Results were similar for Experiment 2. Although the performance of the model did not show significant improvement when contextual cueing was used rather than the salience map alone (0.60 ± 0.033 vs. 0.59 ± 0.030; z = 1.41, p = 0.17) for repeated and novel images combined, the performance of our gaze prediction map showed significant improvement over the saliency map alone in repeated images (0.65 ± 0.055 vs. 0.60 ± 0.048; z = 2.48, p < 0.05), but not in novel images (0.56 ± 0.030 vs. 0.58 ± 0.031; z = –1.85, p = 0.068). These results suggest that the learning effect of the model significantly improved the prediction of gaze and attention locations for repeated letter arrangements and natural images. 
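For reference, the paired comparison reported above corresponds to a Wilcoxon signed-rank test over per-participant NSS values. The data in this snippet are synthetic; only the test itself is as described.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(2)
# Hypothetical per-participant NSS scores (n = 16) with and without the learned context
nss_saliency_only = rng.normal(0.92, 0.12, 16)
nss_with_context = nss_saliency_only + rng.normal(0.07, 0.05, 16)

stat, p = wilcoxon(nss_with_context, nss_saliency_only)  # paired, non-parametric test
print(f"W = {stat:.1f}, p = {p:.3f}")
```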
Figure 14.
 
Model performance (normalized scanpath saliency) for the salience map only (white) and the gaze prediction map (i.e., saliency map and learned context, shown in gray).
Discussion
We developed a model of the learning process underlying the contextual cueing effect, which improves target localization through repeated visual searches of the same displays. The model successfully simulated the general trend of the learning effect observed in human behavior. Here, we discuss the differences in learning speed and the possible underlying neural mechanisms. 
We showed that the learning rate (η) was larger in Experiment 2 (η = 1.3 × 10–3) than in Experiment 1 (η = 3.0 × 10–4). A larger η produces a faster reduction in the number of saccades required to find a target (Figure 9); that is, learning is faster with natural scenes than with random letter arrangements. We also compared the learning rates at the best-fitted values (Figure 8a, red circles) and found a larger learning rate in Experiment 2 (η = 2.3 × 10–3) than in Experiment 1 (η = 3.0 × 10–4), confirming the faster learning in Experiment 2. Similarly, a previous study reported that learning with upside-down natural images was slower than with right-side-up images (Brockmole & Henderson, 2006). Another study using indoor scenes suggested the importance of global scene information (e.g., room identity) for contextual cueing effects in natural scenes, which have rich contextual information (Brockmole et al., 2006). The difference in learning speed may arise because natural scenes carry useful or rich contextual information that is used in everyday life. The medial temporal lobe (MTL) is involved in scene memory (Montaldi, Spencer, Roberts, & Mayes, 2006), and the hippocampus is a neural basis for contextual cueing effects (Chun & Phelps, 1999; Geyer, Baumgartner, Müller, & Pollmann, 2012; Giesbrecht et al., 2013; Manns & Squire, 2001; Preston & Gabrieli, 2008; Spaak & de Lange, 2020). Blood oxygenation level–dependent (BOLD) activity in the hippocampus differed significantly between repeated and novel displays during the learning phase of a contextual cueing task using letter arrangements (Giesbrecht et al., 2013; Spaak & de Lange, 2020). Although no fMRI study has investigated contextual cueing effects with natural scenes, the hippocampus has been reported to play a role in learning contextual information such as the meaning of scenes (Bar & Aminoff, 2003; Bar, Aminoff, & Schacter, 2008). A random arrangement of the letters L and T provides little semantic information, whereas natural scene stimuli do, so hippocampal activity is expected to differ between these two types of stimuli. It may be the case that contextual information processed in the MTL, including the hippocampus, contributes to the faster learning speed for natural scenes. 
Previous studies have reported that repeated stimuli associated with high reward are learned faster than those associated with low or no reward (Bergmann et al., 2019; Bergmann et al., 2020; Tseng & Lleras, 2013), suggesting that reward increases the magnitude of the contextual cueing effect. If, for example, we assume that searching for a target in a natural scene is more enjoyable, so that success yields a larger reward than with letter arrangements, then the larger learning rate in the present model would reflect the faster learning observed with natural scene backgrounds. The present model also showed a larger contextual cueing effect for natural scene backgrounds in terms of the difference between repeated and novel layouts at the last epoch. Although the contextual cueing effect at the last epoch had not reached its asymptote in the model, the model predicts a larger learning effect with natural scene backgrounds in general. 
The choice of the similarity threshold (sth) is as crucial as that of the learning rate. In the current study, the threshold was chosen based on the number of repeated stimuli. Although the RMSE depends only weakly on the threshold, other simulation results can depend strongly on its value (e.g., Figure 11). Because the number of repeated stimuli is difficult to identify in advance, the threshold would have to be modulated depending on the environment that humans are encountering. Future studies should investigate the neural mechanisms of such modulation. 
In the present study, the model learned the association between the target location and the entire global feature vector (i.e., configural learning) rather than between the target location and each element of the vector (i.e., element-wise learning). Learning of global information has been reported for letter arrangements (Beesley et al., 2016; Xie, Chen, & Zang, 2020) and natural images (Brockmole et al., 2006). It should be noted that using configural learning in the model does not exclude the existence of element-wise learning; the two types of learning are not mutually exclusive and can operate simultaneously. In fact, psychophysical studies suggest that both element-wise and configural learning occur (e.g., Jiang & Wagner, 2004; Zheng & Pollmann, 2019). Clarifying the mechanisms and the respective roles of local and global learning is an interesting direction for future research. 
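The distinction between the two learning schemes can be made concrete with a toy delta-rule sketch. The update form, dimensions, and names below are illustrative assumptions, not the model's actual equations; the point is only that configural learning keys the association to a whole-configuration unit (a class), whereas element-wise learning distributes it over individual feature elements.

```python
import numpy as np

n_cells, feat_dim, eta = 48, 384, 1e-3   # illustrative sizes and learning rate

# Configural learning: one weight vector per memorized class (whole configuration).
W_config = np.zeros((10, n_cells))       # 10 classes x target cells
def update_configural(class_id, target_cell):
    target = np.zeros(n_cells)
    target[target_cell] = 1.0
    W_config[class_id] += eta * (target - W_config[class_id])

# Element-wise learning: weights from every feature element to every target cell.
W_elem = np.zeros((feat_dim, n_cells))
def update_elementwise(feature, target_cell):
    target = np.zeros(n_cells)
    target[target_cell] = 1.0
    error = target - feature @ W_elem    # prediction error over target cells
    W_elem += eta * np.outer(feature, error)
```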
General discussion
We performed two experiments to investigate the contextual cueing effect with different contexts (random letter arrangements and natural scenes), and we developed a model that successfully simulated the time courses of behavioral learning in both cases. Because the contextual cueing effect reflects attentional guidance to the target location after learning (Chaumon et al., 2008; Chun & Jiang, 1998; Geyer, Zehetleitner, & Müller, 2010; Kobayashi & Ogawa, 2020; Luck & Hillyard, 1994; Woodman & Luck, 1999), the proposed model captures aspects of an attention process as well as a learning process. Both endogenous (top-down) and exogenous (bottom-up) attention modulate attentional focus (Carrasco, 2011; Connor, Egeth, & Yantis, 2004; Ogawa & Komatsu, 2004), as described by the spotlight or zoom-lens metaphors (Downing & Pinker, 1985; Eriksen & St. James, 1986; Matsubara, Shioiri, & Yaguchi, 2007; Palmer & Moore, 2009; Shioiri, Honjyo, Kashiwase, Matsumiya, & Kuriki, 2016; Shioiri, Yamamoto, Oshida, Matsubara, & Yaguchi, 2010). The present model, however, assumes that bottom-up saliency is modulated by context learned through repetition, which facilitates finding targets in familiar environments. On this view, the contextual cueing effect can be understood as a bottom-up attention mechanism influenced by a memory process. The underlying neural mechanisms of this model are of interest, and we discuss a possible mechanism of the modulation process here. 
Previous studies have reported that the MTL, including the hippocampus (Chun & Phelps, 1999; Geyer et al., 2012; Giesbrecht, Sy, & Guerin, 2013; Manns & Squire, 2001; Preston & Gabrieli, 2008; Spaak & de Lange, 2020), and the prefrontal cortex (Spaak & de Lange, 2020) are involved in the contextual cueing effect. Spaak and de Lange (2020) found different roles for the medial temporal and frontal lobes in contextual cueing: The hippocampus contributes to learning the association between context and target position, whereas the frontal lobe is engaged in deploying attention based on the learned association. It has also been reported that attentional modulation in the sensory cortex is mediated by feedback from higher order areas (Debes & Dragoi, 2023; Kirchberger, Mukherjee, Self, & Roelfsema, 2023). The frontal lobe may therefore modulate activity in the sensory cortices based on the learned context–target association to enhance salience at the target location. 
It has been reported that learned context affects visual processing in the occipital lobe (visual cortex) in contextual cueing experiments as early as 100 ms after stimulus presentation (Chaumon et al., 2008; Zinchenko et al., 2020), which indicates that the modulation occurs before the voluntary shift of attention toward the target item (Johnson, Woodman, Braun, & Luck, 2007; Schankin & Schubö, 2009). Indeed, the processing time of around 100 ms is suggested to be dominated by a feed-forward sweep (Lamme & Roelfsema, 2000). One possible mechanism to account for this early modulation is that saliency coding is updated through learning with consideration of context. 
There are at least two candidate mechanisms for the learning effects observed in the early stages of visual processing. The first is perceptual learning (Watanabe, Náñez, & Sasaki, 2001; Zohary, Celebrini, Britten, & Newsome, 1994); for example, experience with detecting an oriented bar among similarly oriented distractors (i.e., low saliency) changed the activity of V1 neurons and made the detection task easier (Yan, Zhaoping, & Li, 2018). Given the retinotopic organization of visual cortices (Engel et al., 1994; Silver & Kastner, 2009), the experience of finding a target in repeated stimuli could change the activity of V1 neurons at the target location, resulting in faster detection of the target in repeated stimuli than in novel stimuli. This effect, however, is in general independent of global features, so it is unlikely to underlie the context-dependent saliency modulation that the present model assumes. 
The second is feedback from the perirhinal cortex in the MTL. Physiological studies have reported that the perirhinal cortex responds to repeated stimuli within 100 ms after stimulus onset (Gamond et al., 2011; Lehky & Tanaka, 2007) and modulates activities in visual areas (e.g., V2) when familiar objects are presented (Barense, Ngo, Hung, & Peterson, 2012; Peterson, Cacciamani, Barense, & Scalf, 2012). In addition, amnesic participants with damage to the MTL, including the perirhinal cortex, have shown impairment of the contextual cueing effect (Chun & Phelps, 1999; Manns & Squire, 2001). All of these reports suggest that the perirhinal cortex contributes to modulating bottom-up saliency based on familiarity with the presented stimuli. 
The present model used the same learning rule for associating target locations with contexts extracted from random letter arrangements and from natural scenes. This may not be appropriate, because only the associations between targets and natural scene backgrounds are learned when both the natural scenes and the letter arrangements are predictive of target locations (Rosenbaum & Jiang, 2013), which suggests that different mechanisms operate in implicit and explicit learning. However, we used a single rule for the two experiments based on the finding that implicit and explicit learning (at least partially) share a common neural mechanism (Turk-Browne, Yi, & Chun, 2006). Although explicit and implicit processes are defined differently, the influence of the MTL (or other higher level brain processes) on sensory processing does not have to differ between implicit and explicit learning. The simulation results for the proposed model in the present study reproduced the superiority of natural scenes in learning target–context associations (Figure 12), suggesting that different mechanisms are not required to explain the present results. Of course, this does not preclude the existence of distinct contextual cueing mechanisms related to implicit and explicit learning, and future studies should examine the mechanisms of both. 
Conclusions
We developed a model to capture the learning dynamics of the contextual cueing effect and evaluated it by the reduction in the number of saccades required to find targets in visual search. The model successfully simulated the behavioral learning time courses for both random letter arrangements and natural scenes. The different time courses for the two context conditions can be interpreted as a difference in learning speed in the model. The mechanisms that differentiate learning speeds in learning systems are an interesting and important topic for further study. 
Acknowledgments
This study was partially supported by the Ministry of Education, Culture, Sports, Science and Technology, Japan (19H01111 and 24H00700 to SS; 19K20640 to YH). 
Commercial relationships: none. 
Corresponding author: Satoshi Shioiri. 
Email: shioiri@riec.tohoku.ac.jp. 
Address: Research Institute of Electrical Communication, Tohoku University, Sendai, Miyagi 980-0812, Japan. 
References
Backhaus, A., Heinke, D., & Humphreys, G. W. (2005). Contextual Learning in the Selective Attention for Identification model (CL-SAIM): Modeling contextual cueing in visual search tasks. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Bar, M., & Aminoff, E. (2003). Cortical analysis of visual context. Neuron, 38(2), 347–358. [CrossRef] [PubMed]
Bar, M., Aminoff, E., & Schacter, D. L. (2008). Scenes unseen: The parahippocampal cortex intrinsically subserves contextual associations, not scenes or places per se. Journal of Neuroscience, 28(34), 8539–8544. [CrossRef]
Barense, M. D., Ngo, J. K. W., Hung, L. H. T., & Peterson, M. A. (2012). Interactions of memory and perception in amnesia: The figure–ground perspective. Cerebral Cortex, 22, 2680–2691. [CrossRef]
Beesley, T., Vadillo, M. A., Pearson, D., & Shanks, D. R. (2016). Configural learning in contextual cuing of visual search. Journal of Experimental Psychology: Human Perception and Performance, 42(8), 1173–1185. [PubMed]
Bergmann, N., Koch, D., & Schubö, A. (2019). Reward expectation facilitates context learning and attentional guidance in visual search. Journal of Vision, 19(3):10, 1–10, https://doi.org/10.1167/19.3.10. [CrossRef]
Bergmann, N., Tünnermann, J., & Schubö, A. (2020). Reward-predicting distractor orientations support contextual cueing: Persistent effects in homogeneous distractor contexts. Vision Research, 171, 53–63. [CrossRef] [PubMed]
Biederman, I. (1972). Perceiving real-world scenes. Science, 177(4043), 77–80. [CrossRef] [PubMed]
Blakemore, C., & Campbell, F. W. (1969). On the existence of neurones in the human visual system selectively sensitive to the orientation and size of retinal images. The Journal of Physiology, 203(1), 237–260. [CrossRef] [PubMed]
Brady, T. F., & Chun M. M. (2007). Spatial constraints on learning in visual search: Modeling contextual cuing. Journal of Experimental Psychology: Human Perception and Performance, 33(4), 798–815. [PubMed]
Brockmole, J. R., Castelhano, M. S., & Henderson, J. M. (2006). Contextual cueing in naturalistic scenes: Global and local contexts. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32(4), 699–706. [PubMed]
Brockmole, J. R., & Henderson, J. M. (2006). Using real-world scenes as contextual cues for search. Visual Cognition, 13(1), 99–108. [CrossRef]
Brooks, D. I., Rasmussen, I. P., & Hollingworth, A. (2010). The nesting of search contexts within natural scenes: Evidence from contextual cuing. Journal of Experimental Psychology: Human Perception and Performance, 36, 1406–1418. [PubMed]
Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., & Durand, F. (2019). What do different evaluation metrics tell us about saliency models? IEEE Transactions Pattern Analysis and Machine Intelligence, 41, 740–757. [CrossRef]
Carrasco, M. (2011). Visual attention: The past 25 years. Vision Research, 51(13), 1484–1525. [CrossRef] [PubMed]
Chang, C. K., Siagian, C., & Itti, L. (2010). Mobile robot vision navigation & localization using gist and saliency. In IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems (IROS) (pp. 4147–4154). Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Chaumon, M., Drouet, V., & Tallon-Baudry, C. (2008). Unconscious associative memory affects visual processing before 100 ms. Journal of Vision, 8(3):10, 1–10, https://doi.org/10.1167/8.3.10. [CrossRef] [PubMed]
Chaumon, M., Hasboun, D., Baulac, M., Adam, C., & Tallon-Baudry, C. (2009). Unconscious contextual memory affects early responses in the anterior temporal lobe. Brain Research, 1285, 77–87. [CrossRef] [PubMed]
Chun, M. M., & Jiang, Y. (1998). Contextual cueing: Implicit learning and memory of visual context guides spatial attention. Cognitive Psychology, 36(1), 28–71. [CrossRef] [PubMed]
Chun, M. M., & Jiang, Y. (2003). Implicit, long-term spatial contextual memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 224–234. [PubMed]
Chun, M. M., & Phelps, E. A. (1999). Memory deficits for implicit contextual information in amnesic subjects with hippocampal damage. Nature Neuroscience, 2, 844–847. [CrossRef] [PubMed]
Connor, C. E., Egeth, H. E., & Yantis, S. (2004). Visual attention: Bottom-up versus top-down. Current Biology, 14(19), 850–852. [CrossRef]
Davenport, J. L., & Potter, M. C. (2004). Scene consistency in object and background perception. Psychological Science, 15(8), 559–564. [PubMed]
Debes, S. R., & Dragoi, V. (2023). Suppressing feedback signals to visual cortex abolishes attentional modulation. Science, 379, 468–473. [PubMed]
Downing, C. J. & Pinker, S. (1985). The spatial structure of visual attention. In Posner, M. I. & Marin, O. S. M. (Eds.), Attention and Performance XI: Mechanisms of attention and visual search (pp. 171–188). London: Routledge.
Endo, N., & Takeda, Y. (2004). Selective learning of spatial configuration and object identity in visual search. Perception & Psychophysics, 66(2), 293–302. [PubMed]
Engel, S. A., Rumelhart, D. E., Wandell, B. A., Lee, A. T., Glover, G. H., Chichilnisky, E.-J., ... Shadlen, M. N. (1994). fMRI of human visual cortex. Nature, 369, 525. [PubMed]
Eriksen, C. W., & St. James, J. D. (1986). Visual attention within and around the field of focal attention. Perception & Psychophysics, 40(4), 225–240. [PubMed]
Faul, F., Erdfelder, E., Buchner, A., & Lang, A. G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149–1160. [PubMed]
Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175–191. [PubMed]
Gamond, L., George, N., Lemaréchal, J.-D., Hugueville, L., Adam, C., & Tallon-Baudry, C. (2011). Early influence of prior experience on face perception. NeuroImage, 54, 1415–1426. [PubMed]
Geyer, T., Baumgartner, F., Müller, H. J., & Pollmann, S. (2012). Medial temporal lobe-dependent repetition suppression and enhancement due to implicit vs. explicit processing of individual repeated search displays. Frontiers in Human Neuroscience, 6, 272. [PubMed]
Geyer, T., Rostami, P., Sogerer, L., Schlagbauer, B., & Müller, H. J. (2020). Task-based memory systems in contextual-cueing of visual search and explicit recognition. Scientific Reports, 10, 16527. [PubMed]
Geyer, T., Zehetleitner, M., & Müller, H. J. (2010). Contextual cueing of pop-out visual search: When context guides the deployment of attention. Journal of Vision, 10(5):20, 1–11, https://doi.org/10.1167/10.5.20.
Giesbrecht, B., Sy, J. L., & Guerin, S. A. (2013). Both memory and attention systems contribute to visual search for targets cued by implicitly learned context. Vision Research, 85, 80–89. [PubMed]
Glimcher, P. W. (2011). Understanding dopamine and reinforcement learning: The dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences, USA, 108(supplement 3), 15647–15654.
Groen, I. I. A., Ghebreab, S., Prins, H., Lamme, V. A. F., & Scholte, H. S. (2013). From image statistics to scene gist: Evoked neural activity reveals transition from low-level natural image structure to scene category. Journal of Neuroscience, 33(48), 18814–18824.
Harris, A. M., & Remington, R. (2017). Contextual cueing improves attentional guidance, even when guidance is supposedly optimal. Journal of Experimental Psychology: Human Perception and Performance, 43(5), 926–940. [PubMed]
Hayhoe, M., & Ballard, D. (2005). Eye movements in natural behavior. Trends in Cognitive Sciences, 9(4), 188–194. [PubMed]
Hiratani, A., Nakashima, R., Matsumiya, K., Kuriki, I., & Shioiri, S. (2013). Considerations of self-motion in motion saliency. 2013 Second IAPR Asian Conference on Pattern Recognition (pp. 783–787). Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex, Journal of Physiology, 195(1), 215–243.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis & Machine Intelligence, 20(11), 1254–1259.
Jiang, Y., & Wagner, L. C. (2004). What is learned in spatial contextual cuing— configuration or individual locations? Perception & Psychophysics, 66, 454–463. [PubMed]
Johnson, J. S., Woodman, G. F., Braun, E., & Luck, S. J. (2007). Implicit memory influences the allocation of attention in visual cortex. Psychonomic Bulletin & Review, 14, 834–839. [PubMed]
Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6), 1187–1211. [PubMed]
Kawahara, J. I. (2003). Contextual cueing in 3D layouts defined by binocular disparity. Visual Cognition, 10(7), 837–852.
Kirchberger, L., Mukherjee, S., Self, M. W., & Roelfsema, P. R. (2023). Contextual drive of neuronal responses in mouse V1 in the absence of feedforward input. Science Advances, 9, eadd2498. [PubMed]
Klein, R. M. (2000). Inhibition of return. Trends in Cognitive Sciences, 4, 138–147. [PubMed]
Klein, R. M., & MacInnes, W. J. (1999). Inhibition of return is a foraging facilitator in visual search. Psychological Science, 10, 346–352.
Kobayashi, H., & Ogawa, H. (2020). Contextual cueing facilitation arises early in the time course of visual search: An investigation with the speed-accuracy tradeoff task. Attention, Perception, & Psychophysics, 82(6), 2851–2861. [PubMed]
Kroell, L. M., Schlagbauer, B., Zinchenko A., Müller, H. J., & Geyer, T. (2019). Behavioural evidence for a single memory system in contextual cueing. Visual Cognition, 27(5–8), 551–562.
Kunar, M. A., Flusberg, S. J., & Wolfe, J. M. (2006). Contextual cuing by global features. Perception & Psychophysics, 68(7), 1204–1216. [PubMed]
Lamme, V. A. F., & Roelfsema, P. R. (2000). The distinct modes of vision offered by feedforward and recurrent processing. Trends in Neurosciences, 23, 571–579. [PubMed]
Lehky, S. R., & Tanaka, K. (2007). Enhancement of object representations in primate perirhinal cortex during a visual working-memory task. Journal of Neurophysiology, 97, 1298–1310. [PubMed]
Li, Z., Qin, S., & Itti, L. (2011). Visual attention guided bit allocation in video compression. Image and Vision Computing, 29(1), 1–14.
Luck, S. J., & Hillyard, S. A. (1994). Spatial filtering during visual search: Evidence from human electrophysiology. Journal of Experimental Psychology: Human Perception and Performance, 20, 1000–1014. [PubMed]
Manns, J. R., & Squire, L. R. (2001). Perceptual learning, awareness, and the hippocampus. Hippocampus, 11, 776–782. [PubMed]
Matsubara, K., Shioiri, S., & Yaguchi, H. (2007). Spatial spread of visual attention while tracking a moving object. Optical Review, 14(1), 57–63.
Meyen, S., Vadillo, M. A., Von Luxburg, U., & Franz, V. H. (2023). No evidence for contextual cueing beyond explicit recognition. Psychonomic Bulletin & Review, 31(3), 907–930. [PubMed]
Montaldi, D., Spencer, T. J., Roberts, N., & Mayes, A. R. (2006). The neural system that mediates familiarity memory. Hippocampus, 16(5), 504–520. [PubMed]
Nakashima, R., Fang, Y., Hatori, Y., Hiratani, A., Matsumiya, K., Kuriki, I., ... Shioiri, S. (2015). Saliency-based gaze prediction based on head direction. Vision Research, 117, 59–66. [PubMed]
O'Doherty, J. P., Dayan, P., Friston, K., Critchley, H., & Dolan, R. J. (2003). Temporal difference models and reward-related learning in the human brain. Neuron, 38, 329–337. [PubMed]
Ogawa, H., & Kumada, T. (2008). The encoding process of nonconfigural information in contextual cuing. Perception & Psychophysics, 70(2), 329–336. [PubMed]
Ogawa, T., & Komatsu, H. (2004). Target selection in area V4 during a multidimensional visual search task. Journal of Neuroscience, 24(28), 6371–6382.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
Olson, I. R., & Chun, M. M. (2002). Perceptual constraints on implicit learning of spatial context. Visual Cognition, 9, 273–302.
Palmer, J., & Moore, C. M. (2009). Using a filtering task to measure the spatial extent of selective attention. Vision Research, 49(10), 1045–1064. [PubMed]
Pantle, A., & Sekuler, R. (1968). Size-detecting mechanisms in human vision. Science, 162(3858), 1146–1148. [PubMed]
Parkes, L., Lund, J., Angelucci, A., Solomon, J. A., & Morgan, M. (2001). Compulsory averaging of crowded orientation signals in human vision. Nature Neuroscience, 4(7), 739–744. [PubMed]
Peters, R. J., & Itti, L. (2007). Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 1–8). Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Peters, R. J., Iyer, A., Itti, L., & Koch, C. (2005). Components of bottom-up gaze allocation in natural images. Vision Research, 45, 2397–2416. [PubMed]
Peterson, M. A., Cacciamani, L., Barense, M. D., & Scalf, P. E. (2012). The perirhinal cortex modulates V2 activity in response to the agreement between part familiarity and configuration familiarity. Hippocampus, 22, 1965–1977. [PubMed]
Peterson M. S., & Kramer A. F. (2001). Attentional guidance of the eyes by contextual information and abrupt onsets. Perception & Psychophysics, 63(7), 1239–1249. [PubMed]
Posner, M. I., Rafal, R. D., Choate, L. S., & Vaughan, J. (1985). Inhibition of return: Neural basis and function. Cognitive Neuropsychology, 2, 211–228.
Preston, A. R., & Gabrieli, J. D. E. (2008). Dissociation between explicit memory and configural memory in the human medial temporal lobe. Cerebral Cortex, 18(9), 2192–2207.
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In Black, A. H., & Prokasy, W. F. (Eds.), Classical conditioning II: Current theory and research (pp. 64–99). New York: Appleton-Century-Crofts.
Rosenbaum, G. M., & Jiang, Y. V. (2013). Interaction between scene-based and array-based contextual cueing. Attention, Perception, & Psychophysics, 75(5), 888–899. [PubMed]
Sakai, K., & Tanaka, S. (2000). Spatial pooling in the second-order spatial structure of cortical complex cells. Vision Research, 40(7), 855–871. [PubMed]
Schankin, A. & Schubö, A. (2009). Cognitive processes facilitated by contextual cueing: Evidence from event-related brain potentials. Psychophysiology, 46, 668–679. [PubMed]
Shioiri, S., Honjyo, H., Kashiwase, Y., Matsumiya, K., & Kuriki, I. (2016). Visual attention spreads broadly but selects information locally. Scientific Reports, 6, 35513. [PubMed]
Shioiri, S., & Ikeda, M. (1989). Useful resolution for picture perception as a function of eccentricity, Perception, 18, 347–361. [PubMed]
Shioiri S., Kobayashi, M., Matsumiya, K., & Kuriki, I. (2018). Spatial representations of the viewer's surroundings. Scientific Reports, 8: 7171, 1–9. [PubMed]
Shioiri, S., Yamamoto, K., Oshida, H., Matsubara, K., & Yaguchi, H. (2010). Measuring attention using flash-lag effect. Journal of Vision, 10(10):10, 1–13, https://doi.org/10.1167/10.10.10.
Silver, M. A., & Kastner, S. (2009). Topographic maps in human frontal and parietal cortex. Trends in Cognitive Sciences, 13, 488–495. [PubMed]
Sisk, C. A., Remington, R. W., & Jiang, Y. V. (2019). Mechanism of contextual cueing: A tutorial review. Attention, Perception, & Psychophysics, 81(8), 2571–2589. [PubMed]
Spaak, E., & de Lange F. P. (2020). Hippocampal and prefrontal theta-band mechanisms underpin implicit spatial context learning. Journal of Neuroscience, 40(1), 191–202.
Sutton, R. S., & Barto, A. G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88(2), 135–170. [PubMed]
Torralba, A., Oliva, A., Castelhano, M. S., & Henderson, J. M. (2006). Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113(4), 766–786. [PubMed]
Tseng, Y. C., & Li, C. S. R. (2004). Oculomotor correlates of context-guided learning in visual search, Perception & Psychophysics, 66(8), 1363–1378. [PubMed]
Tseng, Y. C., & Lleras, A. (2013). Rewarding context accelerates implicit guidance in visual search. Attention, Perception, & Psychophysics, 75, 287–298. [PubMed]
Tsuchiai, T., Matsumiya, K., Kuriki, I., & Shioiri, S. (2012). Implicit learning of viewpoint-independent spatial layouts. Frontiers in Psychology, 3, 207. [PubMed]
Turk-Browne, N. B., Yi, D. J., & Chun, M. M. (2006). Linking implicit and explicit memory: Common encoding factors and shared representations. Neuron, 49(6), 917–927. [PubMed]
Watanabe, T., Náñez, J. E., & Sasaki, Y. (2001). Perceptual learning without perception. Nature, 413, 844–848. [PubMed]
Westerberg, C. E., Miller, B. B., Reber, P. J., Cohen, N. J., & Paller, K. A. (2011). Neural correlates of contextual cueing are modulated by explicit learning. Neuropsychologia, 49, 3439–3447. [PubMed]
Woodman, G. F., & Luck, S. J. (1999). Electrophysiological measurement of rapid shifts of attention during visual search. Nature, 400, 867–869. [PubMed]
Xie, X., Chen, S., & Zang, X. (2020). Contextual cueing effect under rapid presentation. Frontiers in Psychology, 11, 1–14. [PubMed]
Yan, Y., Zhaoping, L., & Li, W. (2018). Bottom-up saliency and top-down learning in the primary visual cortex of monkeys. Proceedings of the National Academy of Sciences, USA, 115(41), 10499–10504.
Zang, X., Shi, Z., Müller, H. J., & Conci, M. (2017). Contextual cueing in 3D visual search depends on representations in planar-, not depth-defined space. Journal of Vision 17(5):17, 1–17, https://doi.org/10.1167/17.5.17.
Zheng, L., & Pollmann, S. (2019). The contribution of spatial position and rotated global configuration to contextual cueing. Attention, Perception, & Psychophysics, 81, 2590–2596. [PubMed]
Zinchenko, A., Conci, M., Müller, H. J., & Geyer, T. (2018). Predictive visual search: Role of environmental regularities in the learning of context cues. Attention, Perception, & Psychophysics, 80, 1096–1109. [PubMed]
Zinchenko, A., Conci, M., Töllner, T., Müller, H. J., & Geyer, T. (2020). Automatic guidance (and misguidance) of visuospatial attention by acquired scene memory: Evidence from an N1pc polarity reversal. Psychological Science, 31, 1531–1543. [PubMed]
Zohary, E., Celebrini, S., Britten, K. H., & Newsome, W. T. (1994). Neuronal plasticity that underlies improvement in perceptual performance. Science, 263, 1289–1292. [PubMed]