Several recent studies indicate that the visual system predicts visual features across saccades based on learned transsaccadic associations between peripheral and foveal input. However, the stimuli that were used in these studies were simple and artificial, and predictions were made only within one feature dimension: for example, shape (
Herwig et al., 2015;
Köller et al., 2020;
Paeye et al., 2018), size (
Bosco et al., 2015;
Valsecchi et al., 2020;
Valsecchi & Gegenfurtner, 2016), or spatial frequency (
Herwig & Schneider, 2014;
Herwig et al., 2018). The two experiments of the present study extended these findings by demonstrating that humans can also make transsaccadic predictions about more complex stimuli (fruits/balls and faces), which are then reflected in a biased perception. Furthermore,
Experiment 1 showed that numerous new associations can be learned within a short time span.
In the first experiment, transsaccadic associations between balls and fruits were established during the acquisition phase. In the test phase, participants were able to identify the correct object that was peripherally presented to them in the majority of cases. But sometimes (in 10–15% of the cases), they made judgment errors. These errors occurred more often for objects that had previously been swapped during the acquisition than for objects that had not been swapped. Furthermore, participants chose the wrong category significantly more often for swapped objects than for non-swapped objects. Importantly, these category errors occurred mainly because participants chose, at a higher-than-chance rate, exactly the transsaccadically associated foveal counterpart of the presented peripheral object. This latter finding might shed some light on the kind of processing that underlies transsaccadic learning and prediction. One possibility is that certain rules are learned that govern how peripheral input needs to be processed to predict the foveal percept. This process would be rather computationally intensive, but one would expect an easy transfer to other situations. The other possibility is that individual object instance associations are stored and then retrieved during perception. This process would be quite memory intensive, and transfer to novel situations should be more difficult. The finding that category errors occurred mainly because participants chose the previously associated foveal counterpart suggests that associations were not generalized into a rule about category changes (e.g., “fruits change to balls during the saccade”). Instead, individual object instances were associated, and hence the learning in
Experiment 1 can be considered object specific rather than rule based. This resembles perceptual learning, which has also been argued to rely on an object-specific mechanism (
Furmanski & Engel, 2000). Admittedly, such processing, in which all stimulus instances are stored separately, is highly memory intensive (
Dill & Fahle, 1997), especially considering the vast number of different objects that can be encountered in everyday life and for which, by extension, transsaccadic associations would potentially have to be memorized as well. Nevertheless, it seems possible, as long-term associative memory has an estimated capacity of several thousand associations (
Voss, 2009). Other studies about visual long-term memory further demonstrate its massive capacity for storing even details of objects (
Brady, Konkle, Alvarez, & Oliva, 2008) and suggest that there is virtually no limit for the retention of item-specific visual information (
Standing, 1973). Another theory that potentially fits well with this object-specific and memory-intensive learning is the instance theory of automatization (
Logan, 1988). It presents a learning mechanism called automatization in which algorithmic processing transitions into memory-based processing through the encoding, storing, and retrieval of separate stimulus encounters.
Experiment 1 further revealed that participants’ confidence was not affected by the status of the presented object (i.e., whether it was a normal or a swapped object). Thus, learning new transsaccadic associations for some items did not produce a perceived difference in confidence for those items. Confidence reflected only the correctness of responses: Participants were more confident when they made correct responses and less confident when they made incorrect responses. This fits well with other studies showing a strong correlation between confidence and performance (
Kotowicz et al., 2010) or even between participants’ confidence and the precision of their working memory for items (
Rademaker, Tredway, & Tong, 2012). There are different possible reasons why participants give a low confidence rating. The first is that they temporarily do not pay attention and consequently must guess the answer. Participants are aware of this and are therefore not confident in their answer. This would result in equal frequencies across all response options. The second scenario is that what they perceived was ambiguous to them, so they cannot decide between certain items. Again, this results in participants having low confidence in their response, but in this case, they do not decide randomly between all options. Because the results showed that participants picked the associated item at a higher rate than if they had simply guessed, the latter scenario must have been true (at least in some cases). This further demonstrates that participants likely perceived a mixture of the peripherally presented item and the prediction associated with it. And in some of these ambiguous cases, they relied more on the latter.
The second experiment tested whether transsaccadic predictions can also be learned for more complex stimuli, using a metric response judgment. With this type of response, it is possible to assess to what extent perception is biased toward previously associated foveal input. Therefore,
Experiment 2 used morphed pictures of human faces that ranged from female to male. The results showed that participants can learn transsaccadic changes in the gender of faces. After the acquisition phase, their perception of the gender of peripherally presented faces was biased in the direction of the learned foveal association. That is, participants for whom the faces of the changed sequence changed from male to female during acquisition perceived the median gender face of that sequence as more female than that of the sequence in which no gender change occurred during acquisition. Conversely, participants for whom the transsaccadic change occurred from female to male during acquisition thereafter perceived the median face of the swapped sequence as more male compared to that of the normal sequence.
Interestingly, the size of the biasing effect is comparable to previous studies that used simple stimuli. Averaged across participants, the judgment difference between the normal and the swapped median face, relative to the change size during acquisition, reflects an 8.88% contribution of the newly acquired foveal prediction to the peripheral percept. This is exactly the maximum relative contribution that was found for shape changes in
Köller et al. (2020). The fact that there were some participants who did not show any strong effect might be due to the difficulty of the task (evident in the large variances in the response data as well as in the reports of participants). The peripheral viewing time in our experiment was limited to 500 ms, which is longer than in previous experiments on simple visual features (350 ms) but still shorter than in other studies about face perception in which a method of adjustment was used (750 ms) (e.g.,
Liberman, Fischer, & Whitney, 2014). Thus, at least for some participants, more encoding time might have been necessary for these complex stimuli.
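The relative-contribution metric mentioned above can be illustrated with a minimal sketch: the judgment difference between swapped and normal median faces is expressed as a percentage of the transsaccadic change size applied during acquisition. All numeric values below are illustrative placeholders, not data from the study.

```python
def relative_contribution(judgment_swapped, judgment_normal, change_size):
    """Judgment difference between the swapped and normal median faces,
    expressed as a percentage of the acquisition change size.
    A sketch of the metric described in the text; values are hypothetical."""
    return (judgment_swapped - judgment_normal) / change_size * 100.0

# Hypothetical example: the swapped median face is judged 0.355 morph
# units away from the normal one, and the transsaccadic change during
# acquisition spanned 4 morph steps:
bias = relative_contribution(2.855, 2.5, 4.0)
print(f"{bias:.2f}%")  # about 8.88%, matching the magnitude reported above
```

This is only a restatement of the arithmetic in the text; the actual analysis in the study may have averaged differently across participants and sequences.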
For the second experiment, the perceptual bias could also indicate an adaptation-like effect. Each participant only saw three out of the four extreme faces (A1, A5, B1, B5) during the acquisition phase and might have adapted to these images. Consequently, the perception of the neutral face in the normal sequence would be biased away from the seen face and toward the unseen face. Multiple studies have shown this kind of aftereffect for faces. For example, faces are perceived as distorted in a direction opposite to the adapting distortion (
Webster & MacLin, 1999), or gender perception of previously ambiguous faces is biased away from the adapting gender (
Webster, Kaping, Mizokami, & Duhamel, 2004). Thus, this alternative explanation cannot be completely ruled out. However, in these (and other) adaptation studies, the presentation duration of the adaptation stimulus was much longer (mostly several seconds to minutes), with additional repetitions (again often several seconds) before each test trial. In comparison, in our study, the presentation times were well below a second, and the potential adaptation was not repeated in the test phase.
Leopold, Rhodes, Müller, and Jeffery (2005) showed that face identity aftereffects increase logarithmically with adaptation time, although their adapting times ranged from 1 to 16 s. It is therefore unlikely that our short presentation times led to an adaptation effect. Furthermore, studies on transsaccadic changes of object size involving only a “swapped” object also demonstrated a recalibration of perception toward the postsaccadic foveal association (
Valsecchi et al., 2020;
Valsecchi & Gegenfurtner, 2016).
Taken together, the visual system was able to learn or update multiple new transsaccadic associations for complex and more realistic stimuli within a short time frame. Predictions based on these associations were integrated presaccadically with the actual peripheral input, resulting in a biased perception. This demonstrates that the human brain constantly keeps track of certain statistics of our environment to make accurate predictions about our surroundings. Of course, the stimuli used in the present experiments were far from depicting natural scenes, but these complex stimuli are nevertheless a step closer to realistic everyday objects. Therefore, our findings might be taken as a first tentative hint at the functional relevance of this transsaccadic prediction mechanism outside the laboratory.
Previous studies have shown that the prediction mechanism for peripheral stimuli does not depend on the execution of a saccade (
Paeye et al., 2018;
Valsecchi & Gegenfurtner, 2016). Instead, it has been suggested that the mechanism reflects a more general function of the visual system, which allows the prediction of detailed foveal information given coarse peripheral input. If predictions are made for multiple and complex objects in our periphery, this could create the impression of perceptual homogeneity across our field of view. Nevertheless, the predictive mechanism seems to profit from saccadic eye movements, as stronger biasing effects are observed with saccades than under fixation conditions (
Paeye et al., 2018). Possibly the prediction system is optimized for saccades, because they are the most common event that leads to the acquisition of peripheral-foveal associations.
The presented results further extend findings made in the study by
Cox, Meier, Oertelt, and DiCarlo (2005), which showed predictable object confusions across the saccade for “greeble” stimuli. These are also complex but, in comparison to our study, very artificial and unfamiliar stimuli. The objects only changed slightly during the learning phase and did not change their semantic category like the stimuli in our experiments. Object predictions in their study were inevitably linked to the objects’ presentation side. This was not the case in the present study, where predictions were made based on the peripheral image irrespective of the presentation side. Furthermore, in our second experiment, we used a metric response mode, which allowed us to estimate the relative strength of the predictions. The study by
Cox et al. (2005), on the other hand, used a same–different task, which leaves open the question of how exactly the objects that were judged as “different” were perceived.
Previous studies have suggested that transsaccadic learning is very specific to its retinotopic location (
Herwig et al., 2018) and that transsaccadic prediction is therefore likely to take place in low- or mid-level visual areas where a classical and finer retinotopy is prevalent (
Gardner, Merriam, Movshon, & Heeger, 2008;
Grill-Spector, 2003;
Hadjikhani, Dale, Liu, Cavanagh, & Tootell, 1998). However, this location specificity might have occurred only because these studies used simple stimuli that are represented in low- or mid-level visual areas. The complex stimuli used in the present study are also represented as semantic objects (fruits or balls) or faces in high-level brain areas such as the inferior temporal (IT) cortex (
Kravitz, Saleem, Baker, Ungerleider, & Mishkin, 2013;
Tanaka, 1993). Consequently, it could be assumed that transsaccadic predictions about objects might also originate in these high-level brain regions where the objects are represented. Support for this idea can be found in a study by
Li and DiCarlo (2008), in which, similar to the present study, high-level objects were swapped out during the saccade. After enough experience, this led to a decrease in the initial object selectivity of IT neurons (in the primate brain). Thus, a higher level of the ventral visual stream (e.g.,
Rousselet, Thorpe, & Fabre-Thorpe, 2004) was affected by the transsaccadic manipulation.
If transsaccadic predictions are based on high-level object representations, the question arises whether they would still be location specific. For example, the fusiform face area within IT, which shows an increased activity to the presentation of face stimuli (
Kanwisher, McDermott, & Chun, 1997), has been described as a nonretinotopic region of the ventral stream (
Halgren et al., 1999). Accordingly, it is conceivable that predictions for faces (or other high-level representations) are not retinotopically location specific but generalize to other locations. This is something that could be investigated in the future.
On the other hand, the assumption that the predictions must originate in these higher-level areas is not necessarily warranted. One cannot rule out the possibility that participants were able to identify or predict the objects based on certain low-level features within the complex stimuli. It is also conceivable that high-level areas govern the reweighting of primary cortex inputs, and that these weights are changed during the learning process, as proposed in a rule-based learning model for perceptual learning by
Zhang et al. (2010). Thus, based on the current study, one cannot conclude whether the prediction was made before or after a semantic representation of the objects and faces was created. Hence, more research is needed to further differentiate when and where exactly in the brain the presaccadic integration of prediction and peripheral input takes place. From the neural study by
Edwards, Vanrullen, and Cavanagh (2018), it is known that peripheral presaccadic stimulus information is available after the saccade and influences postsaccadic processing (e.g., congruent presaccadic input facilitates the processing).
Huber-Huber, Buonocore, Dimigen, Hickey, and Melcher (2019) suggested that first a prediction about the target is generated, and then the integration of presaccadic and postsaccadic information takes place around 50 to 90 ms after fixation onset, followed by facilitated categorization. By using a similar methodology (combined electroencephalography and eye tracking), it might be possible to narrow down the timeline of presaccadic processing even further.