Open Access
Article  |   October 2019
The influence of sequential predictions on scene-gist recognition
Author Affiliations
  • Maverick E. Smith
    Department of Psychological Sciences, Kansas State University, Manhattan, KS, USA
    ms1434@ksu.edu
  • Lester C. Loschky
    Department of Psychological Sciences, Kansas State University, Manhattan, KS, USA
    loschky@ksu.edu
Journal of Vision October 2019, Vol.19, 14. doi:https://doi.org/10.1167/19.12.14

      Maverick E. Smith, Lester C. Loschky; The influence of sequential predictions on scene-gist recognition. Journal of Vision 2019;19(12):14. doi: https://doi.org/10.1167/19.12.14.



      © ARVO (1962-2015); The Authors (2016-present)

Abstract

Past research suggests that recognizing scene gist, a viewer's holistic semantic representation of a scene acquired within a single eye fixation, involves purely feed-forward mechanisms. We investigated whether expectations can influence scene categorization. To do this, we embedded target scenes in more ecologically valid, first-person-viewpoint image sequences, along spatiotemporally connected routes (e.g., an office to a parking lot). We manipulated the sequences' spatiotemporal coherence by presenting them either coherently or in random order. Participants identified the category of one target scene in a 10-scene-image rapid serial visual presentation. Categorization accuracy was greater for targets in coherent sequences. Accuracy was also greater for targets with more visually similar primes. In Experiment 2, we investigated whether targets in coherent sequences were more predictable and whether predictable images were identified more accurately in Experiment 1 after accounting for the effect of prime-to-target visual similarity. To do this, we removed targets and had participants predict the category of the missing scene. Images were more accurately predicted in coherent sequences, and both image predictability and prime-to-target visual similarity independently contributed to performance in Experiment 1. To test whether prediction-based facilitation effects were solely due to response bias, participants performed a two-alternative forced-choice task in which they indicated whether the target was an intact or a phase-randomized scene. Critically, predictability of the target category was irrelevant to this task. Nevertheless, results showed that sensitivity, but not response bias, was greater for targets in coherent sequences. Predictions made prior to viewing a scene facilitate scene-gist recognition.

Introduction
When navigating through our world, we typically have expectations about the kinds of scenes we will see from one moment to the next. For instance, through repeated experiences of walking from the kitchen to the living room in your own and other people's homes, you may expect that kitchens and living rooms appear near one another. It is possible, although unknown, that we use such knowledge accrued from a lifetime of experiences to generate predictions that aid in identifying everyday scene categories (Bar, 2007). It is intuitive to assume that unpredictable scene changes in our visual experience would be difficult to perceive, such as when, expecting a living room after a home kitchen, you instead find yourself in an office cubicle. But despite the surprise that would be elicited by such an unpredictable scene category, observers can readily identify a scene's meaning, namely its gist, from a brief glimpse, even when the scene category changes unexpectedly from one moment to the next (Potter, 1976).¹ Indeed, this is the standard way of investigating scene-gist processing: presenting scenes in randomized sequences. This is despite the fact that we almost never experience scenes in this way in the real world (except when channel surfing on TV). Thus, the current study asks a novel question: Is scene-gist recognition affected by sequential expectations that one might experience in the real world?
Scene-gist processing
We define scene gist as a holistic semantic representation of a scene acquired within a single eye fixation (Larson, Freeman, Ringer, & Loschky, 2014). Scene-gist recognition, typically measured in terms of scene categorization, occurs extremely rapidly (Biederman, Rabinowitz, Glass, & Stacy, 1974; Greene & Oliva, 2009). In terms of retinal image-processing time, scene categorization typically reaches asymptote with a 100-ms target-image-to-mask stimulus onset asynchrony (SOA), and the inflection point of the SOA function is typically at 40–50 ms (Bacon-Mace, Mace, Fabre-Thorpe, & Thorpe, 2005; Joubert, Rousselet, Fize, & Fabre-Thorpe, 2007; Loschky et al., 2007; Rousselet, Joubert, & Fabre-Thorpe, 2005). In terms of reaction time, medians are around 400 ms, and can be as fast as 250 ms (Joubert et al., 2007). Likewise, studies using electroencephalography or magnetoencephalography to measure brain activity to briefly flashed scenes have shown that higher order ventral visual areas distinguish scene categories by 100–250 ms poststimulus (Ramkumar, Hansen, Pannasch, & Loschky, 2016; Thorpe, Fize, & Marlot, 1996). 
Given the speed of gist perception, the information underlying gist recognition is necessarily holistic. Computational models have shown that scenes can be categorized using spatially localized information from the image's amplitude spectrum (i.e., the distribution of spatial frequencies and orientations in the image; Oliva & Torralba, 2001). A scene's amplitude spectrum can be used to describe its spatial envelope (naturalness, openness, roughness, expansion, ruggedness), which can be used to categorize scenes at both the superordinate and basic levels. Furthermore, images that share perceptual properties are also typically semantically similar. When viewers do a rapid scene-categorization task, spatial-envelope properties have been shown to be useful in decoding those viewers' brain activity in terms of different scene categories (Ramkumar et al., 2016). 
Feed-forward information extraction
In considering the number of processing steps from the retina to higher level cortical areas, the results just discussed appear to challenge the intuition noted earlier that unexpected visual changes may be harder to perceive. Instead, rapid scene categorization may rely upon massively parallel mechanisms within the initial pass of feed-forward information flow, from the retina through the higher level visual cortex, devoid of top-down influences (Fabre-Thorpe, Delorme, Marlot, & Thorpe, 2001; VanRullen & Thorpe, 2002). Likewise, studies using artificial neural networks modeled from the constraints of the visual system have found that most of the stimulus-relevant information used in recognition can be extracted from the bottom-up input alone (Cichy, Khosla, Pantazis, Torralba, & Oliva, 2016; Greene & Hansen, 2018; Serre, Oliva, & Poggio, 2007). Nevertheless, these models assign little importance to the many bidirectional connections between brain structures, which may exceed the number of feed-forward connections (Salin & Bullier, 1995). Reciprocal connectivity may provide the neural infrastructure to support top-down interactions between prediction and perception (Kveraga et al., 2011).
Perceptual predictions and their influence on information extraction
As noted, studies of rapid scene categorization to date have typically presented participants with scenes from multiple categories in randomized sequences so that participants cannot predict a scene's category from one trial to the next. However, the semantic and structural relationships between the scenes we interact with in our day-to-day lives are not random. In our experience with the world, scene categories change in predictable ways along the paths we take (e.g., walking from your office to a parking lot after a day of work). Hallways, not parking lots, are usually on the other side of office doorways. By limiting the number of possible scene categories that one may expect to see from moment to moment, the visual system may preactivate those representations (e.g., of hallways after multiple views of an office), facilitating perception as opposed to reinterpreting the gist of a scene from scratch on each fixation. To our knowledge, no studies have yet investigated this issue in the context of scene-gist perception. However, a number of separate lines of research speak to it. 
Scene categorization is influenced by prior knowledge (e.g., schemas). Categorization is harder if scenes contain information that violates expectations (e.g., a boulder in a living room as opposed to a boulder on a mountain; Greene, Botros, Beck, & Fei-Fei, 2015). Likewise, expectations generated from a scene's context can provide information to facilitate object recognition. Objects are more accurately recognized when in consistent settings (e.g., a chicken in a farmyard vs. a mixer in a farmyard; Bar, 2004; Bar & Ullman, 1996; Biederman, Mezzanotte, & Rabinowitz, 1982). Except for a few studies, the time course of such scene-object context effects remains unclear (Ganis & Kutas, 2003; Truman & Mudrik, 2018; Võ & Wolfe, 2013). Top-down factors may influence perceptual analysis very early (e.g., predictions may facilitate detection of low-level perceptual features; Biederman et al., 1982; Kveraga et al., 2011; Reinitz, Wright, & Loftus, 1989) or their influence may occur later (e.g., predictions may only facilitate matching of perceptual tokens to semantic memory representations; Bar, 2004; Bar & Ullman, 1996). Alternatively, expectations may influence only postperceptual processing or how participants respond in the task (Hollingworth & Henderson, 1998). Nevertheless, no one has yet investigated how the perception of one scene can enable a person to generate predictions that facilitate rapid categorization of subsequent spatiotemporally connected scenes. 
A related and important issue is that of priming of scene perception. Studies have shown perceptual priming between scenes having similar layouts (Sanocki & Epstein, 1997), across different views of the same scene (Castelhano & Pollatsek, 2010), and between sequential pairs of visually similar scenes (Caddigan, Choo, Fei-Fei, & Beck, 2017). Other studies have shown conceptual priming of scene identification by words (Reinitz et al., 1989). However, these priming studies have not been done using more ecologically valid spatiotemporally coherent scene sequences, as we see scenes in our everyday life. 
Scene Perception And Event Comprehension Theory
Our interest in this problem comes from the recently developed theoretical framework of the Scene Perception & Event Comprehension Theory (SPECT; Loschky, Hutson, Smith, Smith, & Magliano, 2018; Loschky, Larson, Smith, & Magliano, 2019; Magliano, Loschky, Clinton, & Larson, 2013). SPECT considers scene-gist recognition as one part of the larger problem of perceiving and understanding scenes and events in the real world. According to SPECT, the extraction of a scene's gist is essential to understanding an event (i.e., a period of time with a distinct beginning and ending). SPECT proposes a distinction between front-end and back-end mechanisms. Front-end mechanisms are involved with processing information within single eye fixations, including visual processing of a scene's gist (i.e., information extraction). Back-end mechanisms are involved in processing information in memory across multiple fixations, particularly in constructing the current event model in working memory (i.e., one's online representation of what is happening now). Scene gist plays an important role in both front- and back-end processes. The gist of a scene is recognized within a single eye fixation during the process of information extraction in the front end (Greene & Oliva, 2009; Larson, 2012), which is important for laying the foundation of the current event model in the back end (Loschky et al., 2019). Importantly, SPECT proposes that back-end processes involved in event-model construction influence front-end processes involved in information extraction. Let's suppose a viewer is watching a handheld video of someone walking from their office to a parking lot. Within the first fixation, the viewer recognizes the gist of the first scene category—for instance, that it is an office. Will the viewer need to extract the same gist on each fixation (office, office, office, etc.) even when the scene category repeats? 
As noted earlier, many have claimed that scene-gist recognition processes are automatic and feed-forward (e.g., Joubert et al., 2007); therefore, observers may extract the gist of the scene anew on every fixation. We call this hypothesis the feed-forward gist hypothesis. Nevertheless, it would seem to be a waste of cognitive resources to extract the same gist on many consecutive fixations. Thus, fewer resources may be required to extract the scene location on subsequent fixations if the scene information is consistent with what one expects (i.e., when the scene category is not expected to change). We call this hypothesis the scene-gist priming hypothesis.
The scene-gist priming hypothesis suggests further hypotheses. Scene-gist priming could depend upon the number of previous fixations on a repeated scene, with priming increasing as the number of fixations on the scene increases. We will call this the within-scene category priming hypothesis. Furthermore, we can hypothesize that scene-gist priming may occur between spatiotemporally interrelated but different basic-level scene categories. For instance, if the handheld video shows the person with the camera walking to the door of the office, the viewer may predict the scene category on the other side of the office door based on schemas for what scene categories connect with offices (for example, that it is a hallway). If the viewer's prediction is correct, it could facilitate gist extraction relative to seeing a less predictable scene category, such as a parking lot. We will call this the between-scenes category priming hypothesis. Between-scenes and within-scene category priming differ in important ways. Within-scene category priming, across multiple fixations of the same scene category, may be explained by a combination of perceptual priming due to the overlap in bottom-up processing of similar visual features across different views of the same scene and conceptual priming due to the same gist concept shared across multiple fixations (e.g., “office”) in the current event model. Alternatively, priming between different but expected scene categories (e.g., an office and a hallway) primarily shares only conceptual information, though it sometimes shares some perceptual information (e.g., offices and hallways share indoor scene features such as rectilinearity). Based on this consideration, we could hypothesize that the degree of within-scene category priming would be greater than between-scenes category priming. We will call this the within > between category priming hypothesis.
Experiment 1
To test these hypotheses, we created 24 spatiotemporally connected image base sequences from a starting location to a destination (e.g., going from an office to a parking lot). All the stimuli used in the experiments are available at https://osf.io/83sjx/. We also provided a full image sequence in the supplementary materials (Supplementary File S1). Participants' task was to identify the category of one target scene embedded within a 10-scene image sequence, presented in rapid serial visual presentation (RSVP). To manipulate the spatiotemporal coherence of the scenes presented on each trial, targets were presented in either spatiotemporally coherent or randomized sequences. In coherent sequences, higher level expectations may prime upcoming scene categories. This may result in more accurate categorization of targets in coherent sequences than in randomized sequences, in which scene categories cannot be predicted from one moment to the next. 
Method
Participants
There were 48 participants (16 female, 32 male; age: M = 19.57, SD = 2.01) from Kansas State University's undergraduate Psychological Sciences research pool, who participated in the experiment for course credit. In determining our sample size, we could not carry out a power analysis since the research question and experimental method were novel. Thus, we could estimate the effect size neither based on prior studies nor based on our own pilot data, which were from three lab members who were aware of the research hypotheses. We therefore based the sample size on the experimental design. We blocked the order in which participants viewed each of the 24 image sequences using a 24 × 24 Williams Latin square (Williams, 1949). We had two replicates per block (i.e., 48 participants). 
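The counterbalancing scheme described above can be illustrated with a short sketch. The article does not say how its 24 × 24 Williams square was generated, so the construction below (the standard cyclic one for an even number of conditions) and the rule for assigning participants to rows are assumptions made for illustration; the function names are hypothetical.

```python
def williams_square(n):
    """Build an n x n Williams Latin square for even n.

    The first row interleaves low and high labels (0, 1, n-1, 2, n-2, ...);
    every later row adds 1 (mod n) to the row above it.
    """
    if n % 2 != 0:
        raise ValueError("this simple construction assumes an even n")
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    return [[(label + shift) % n for label in first] for shift in range(n)]


def order_for_participant(square, participant_index):
    """Assumed assignment rule: participants cycle through the rows,
    so 48 participants give two replicates of each row."""
    return square[participant_index % len(square)]


square = williams_square(24)
```

In a square built this way, each base sequence appears once in every serial position, and each ordered pair of base sequences appears in adjacent positions exactly once, which is the property that balances first-order carryover effects.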
Stimuli and design
Examples of a coherent and a randomized sequence are shown in Figure 1. Figure 1a is a simplified version of Figure 1b. The target in both the coherent and randomized sequences in the example is the third image. The spatiotemporally connected image sequences were photographed from a first-person viewpoint, so they appeared as if the observer were navigating through the environment from one destination to another. We collected 480 total photographs from Kansas State University's campus and the local Manhattan, Kansas, metropolitan area. Two hundred forty images were taken on campus, composed of eight different basic-level scene categories: four indoor categories (office, classroom, hallway, and stairwell) and four outdoor (parking lot, courtyard, sidewalk, and lawn). Of the eight categories, two indoor categories (office, classroom) and two outdoor (parking lot and courtyard) were chosen as starting points and destinations. These are the locations observers appeared to be navigating to and from. The remaining four were transitional scene categories (scene categories between starting points and destinations). We created forward versions of each of the spatiotemporally connected scene sequences (e.g., office to a parking lot, courtyard to a classroom), as well as their reverses (e.g., parking lot to an office, classroom to a courtyard), from each of the four destination locations to each of the others, but not along the same pathways (i.e., the office-to-classroom sequence vs. the classroom-to-office sequence were not the same office or classroom, nor did they contain the same transitional scene categories). By crossing starting points and destinations, and by using multiple different exemplars for each category, we prevented participants from being able to predict which destination scene category they would appear to be navigating, given the category of the first image on each trial. 
Figure 1
 
Trial schematic of the sequence of events within two example trials. (a) A simplified version of a trial, showing the sequence of screens up to the response. The sequence of scenes in (i) is coherent, beginning with an office and ending with a scene of a parking lot; in (ii) the same images are shown in an example randomized sequence. (b) A more complete version of a trial, showing the continuation of the sequences after participants responded. Such a continuation occurred for any sequence in which the target was less than the 10th image. The continuation was shown so that viewers always saw a full 10-image sequence, regardless of which image in the sequence was the target.
The same number of off- and on-campus images were used. There were eight different off-campus categories: four indoor categories (store interior, bedroom, stairwell, and hallway) and four outdoor (park, city center, sidewalk, and alley). Store interior, bedroom, park, and city center were determined as starting points and destination locations, and stairwell, hallway, sidewalk, and alley were chosen as transitional categories. Destination categories for the videos that were taken on versus off campus were not intermixed into spatiotemporally connected sequences (e.g., a store interior never appeared in the same sequence as an office, nor did it appear as an option in the alternatives for a sequence taken from on campus). 
Each of the 24 base sequences was composed of 20 images. There were five scene categories per first-person base sequence and four images per scene category (e.g., four offices, four hallways, four stairwells, four sidewalks, four parking lots). Of the 20 total images in each base sequence, participants saw 10 in the coherent and 10 in the randomized conditions. We randomly determined which of the 20 images participants were shown in the coherent and randomized conditions independently for each participant, with the stipulation that each participant saw at least one image from each of the five categories within a trial. We wanted our coherent sequences to be predictable based on semantic knowledge but not based on artifacts due to the manner in which we constructed our sequences. Thus, we created subsequences in which we randomly chose to show one to three images from each category before a category shift (e.g., three offices, one hallway, two stairwells, etc.). Specifically, our goal in doing so was to reduce the possibility that people could make an intelligent guess of when there would be a shift to a new category. If we always showed the same number of images from each category before a shift, then a viewer could predict when shifts to different categories would occur. Importantly, this would be strictly based on the artifactual nature of the sequence structure.² Furthermore, we randomly determined the temporal location (1–10) of the target on each trial. That is, the target could be any image in a 10-image sequence.
Images within each of the five categories that made up a 10-image sequence were randomly assigned for each participant with two stipulations: An image never repeated in both coherent and randomized versions of the sequences within participants, and both the coherent and the randomized versions had 10 images. Lastly, in the coherent condition, if the number of images shown from a scene category was greater than one, images were presented in their coherent spatiotemporal order (e.g., 1, 3, not 3, 1). This could only happen by chance in the randomized condition. 
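The sampling constraints described above can be sketched in a few lines. This is an illustrative reading, not the authors' stimulus-generation code: the rejection sampling of run lengths, the choice to give the randomized version exactly the 10 images left over from the coherent version, and all names are assumptions.

```python
import random

# Example category order for one base sequence (taken from the text's example).
CATEGORIES = ["office", "hallway", "stairwell", "sidewalk", "parking lot"]
IMAGES_PER_CATEGORY = 4


def sample_run_lengths(rng, n_categories=5, total=10, min_run=1, max_run=3):
    """Draw one to three images per category until the runs sum to `total`."""
    while True:
        runs = [rng.randint(min_run, max_run) for _ in range(n_categories)]
        if sum(runs) == total:
            return runs


def build_sequences(rng):
    """Return (coherent, randomized) lists of (category, image_index) pairs."""
    runs = sample_run_lengths(rng)
    coherent, leftover = [], []
    for category, run in zip(CATEGORIES, runs):
        # Sorting keeps same-category images in their spatiotemporal order.
        chosen = sorted(rng.sample(range(IMAGES_PER_CATEGORY), run))
        coherent.extend((category, i) for i in chosen)
        leftover.extend(
            (category, i) for i in range(IMAGES_PER_CATEGORY) if i not in chosen
        )
    randomized = leftover[:]
    rng.shuffle(randomized)  # order is arbitrary; coherence can arise only by chance
    return coherent, randomized


coherent, randomized = build_sequences(random.Random(0))
```

Because each category contributes one to three of its four images to the coherent version, the leftover set also holds one to three images per category, so both versions contain 10 images, cover all five categories, and share no image.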
All images were presented in color on 17-in. Samsung SyncMaster 957 MBS monitors (with an image resolution of 1,024 × 768) running at an 85-Hz refresh rate. A chin rest was used to stabilize head position at a viewing distance of 53.3 cm from the screen. Single pixels subtended 0.0369° of visual angle. 
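As a rough check on the viewing geometry, the reported per-pixel visual angle can be converted back into a physical pixel size and an overall display extent. The sketch below uses only the values stated above (53.3 cm, 0.0369° per pixel, 1,024 × 768 pixels); treating the full pixel grid as the displayed area is an assumption, and the function names are hypothetical.

```python
import math

VIEWING_DISTANCE_CM = 53.3
DEG_PER_PIXEL = 0.0369
RESOLUTION = (1024, 768)  # assumed to be the full displayed area


def implied_pixel_pitch_cm(deg_per_pixel, distance_cm):
    """Physical size of one pixel implied by its visual angle."""
    return distance_cm * math.tan(math.radians(deg_per_pixel))


def extent_deg(n_pixels, pitch_cm, distance_cm):
    """Visual angle of a centered span of n_pixels, using the exact
    2 * atan(half_size / distance) formula rather than linear scaling."""
    half_size_cm = (n_pixels / 2) * pitch_cm
    return 2 * math.degrees(math.atan(half_size_cm / distance_cm))


pitch = implied_pixel_pitch_cm(DEG_PER_PIXEL, VIEWING_DISTANCE_CM)
width_deg = extent_deg(RESOLUTION[0], pitch, VIEWING_DISTANCE_CM)
height_deg = extent_deg(RESOLUTION[1], pitch, VIEWING_DISTANCE_CM)
```

Under these assumptions, the full pixel grid would subtend somewhat less than the linear estimate of 1,024 × 0.0369°, because visual angle grows more slowly than physical size away from the line of sight.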
Ten 1/f-amplitude color noise masks were used to mask target images. Target-mask pairings were randomized across all trials and participants. 
Procedure
The experiment was a 2 (spatiotemporal coherence: coherent vs. random order) × 2 (image location: on campus vs. off campus) within-subject design. Image location was treated as a nuisance variable, as it was not of interest. The manipulation of the spatiotemporal coherence of the scenes was blocked and counterbalanced across participants. We used blocking to increase the likelihood that participants would perceive the coherence of spatiotemporally coherent sequences. In coherent trials, subsequences of one to three images from the same category came before a change to a different category (e.g., one, two, or three offices before a shift to a hallway). Thus, priming could occur within categories (e.g., viewers might see an office after one or more offices) as well as between categories (e.g., viewers might see a hallway after one or more offices). For within-category priming, the target could be the first, second, or third image within a subsequence. If the target was the second or third image within a subsequence (e.g., the second or third office), then it was preceded by one or two within-category primes. If the target was the first image within a subsequence (e.g., the first office or the first hallway), then it would have zero primes from within its category. If it was the first image within a trial, then it was not primed at all. Otherwise, it was primed by one or more between-categories primes—in either situation, it would have zero within-category primes. For between-category primes, the number of primes was always equal to the number of scenes from the immediately preceding category (e.g., the first hallway image could be preceded by one, two, or three offices). The numeric values of within-category primes (i.e., 0, 1, or 2) were thus equal to those of the between-categories primes (i.e., 1, 2, or 3) minus one. 
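The prime-counting rules in this paragraph can be made concrete with a small helper. This is a sketch of one reading of the text, not the authors' analysis code; the function name and the convention of reporting zero between-categories primes whenever a within-category prime exists are assumptions.

```python
def count_primes(categories, target_index):
    """Return (within, between) prime counts for the target at target_index.

    within:  number of consecutive same-category images immediately
             preceding the target.
    between: if within == 0 and the target is not the first image, the
             length of the immediately preceding run of a different
             category; otherwise 0 (assumed convention).
    """
    target_category = categories[target_index]
    i = target_index - 1

    within = 0
    while i >= 0 and categories[i] == target_category:
        within += 1
        i -= 1

    between = 0
    if within == 0 and i >= 0:
        preceding_category = categories[i]
        while i >= 0 and categories[i] == preceding_category:
            between += 1
            i -= 1
    return within, between


# Example coherent trial: three offices, one hallway, two stairwells, ...
trial = (["office"] * 3 + ["hallway"] + ["stairwell"] * 2
         + ["sidewalk"] * 2 + ["parking lot"] * 2)
```

With this example trial, a target at the third office has two within-category primes, while a target at the hallway has no within-category primes and three between-categories primes (the three offices).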
Participants completed 10 practice trials to familiarize themselves with both the task and the speed of image presentation. None of the practice images were repeated in the experimental trials. There were 48 total experimental trials per participant: 24 coherent and 24 randomized sequences. Prior to beginning each trial, participants saw a list of the scene-category labels they were about to view in a randomized order within a 4 × 2 grid. Participants initiated a trial by fixating a dot in the center of a neutral gray background while pressing a button using the computer mouse. Ten scene images from each sequence were shown in RSVP. Each 10-image sequence included one target and nine primes (or nontarget images). Some targets were preceded by zero, one, or two primes that were of the same category as the target (within-scene category priming), while others were not. In other cases, primes were from a different but semantically related scene category (between-scenes category priming). Primes were given 300 ms of processing time (24-ms presentation + a 276-ms neutral gray interstimulus interval [ISI]) because that allows participants to both identify the images in an RSVP sequence and store them in conceptual short-term memory (Potter, 1976). The single target image in the sequence was also presented for 24 ms, but was immediately followed (i.e., after a 0-ms ISI) by a 48-ms 1/f noise mask to limit processing time, thus making the target harder to identify than the primes (Hansen & Loschky, 2013). We presented an alerting tone through headphones simultaneously with the onset of the target. Doing so in an RSVP stream has been shown to reduce the attentional blink (Kranczioch & Thorne, 2013). Immediately following presentation of the target and mask, the participant was shown an eight-alternative forced-choice (8-AFC) array of scene-category labels and asked to select the category that matched the target. 
The 8-AFC included all eight categories from the matching on-campus or off-campus base sequence. We randomized the location of the labels in the 4 × 2 array on every trial to avoid location-based response biases in guessing (e.g., consistently responding in the top left corner). The temporal position of the target image within each 10-image sequence was randomly determined, but it was equalized across participants, so participants could not guess when the target image would appear in each trial. The remainder of the sequence of images was presented immediately after participants made a response, as shown in Figure 1b, with processing times for these nontarget images being identical to the primes (i.e., 300 ms). Thus, participants saw the entirety of each 10-image sequence on each trial, regardless of the position of the target in the sequence. 
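The trial timing described in this section can be laid out as a simple event list. The durations come from the text (24-ms images, 276-ms interstimulus interval, 48-ms mask); the function, the event labels, and the decision to stop the timeline at the response screen are illustrative assumptions, and the nominal durations are used without adjusting for the monitor's refresh cycle.

```python
PRIME_MS = 24    # prime image duration
ISI_MS = 276     # neutral gray interval after each prime
TARGET_MS = 24   # target image duration
MASK_MS = 48     # 1/f noise mask, shown with a 0-ms ISI after the target


def trial_timeline(target_position):
    """List (onset_ms, event, duration_ms) up to the response screen.

    target_position is 1-based (1-10), matching the text.
    """
    if not 1 <= target_position <= 10:
        raise ValueError("target_position must be between 1 and 10")
    events = []
    t = 0
    for _ in range(target_position - 1):
        events.append((t, "prime", PRIME_MS))
        t += PRIME_MS
        events.append((t, "blank", ISI_MS))
        t += ISI_MS
    events.append((t, "target + alerting tone", TARGET_MS))
    t += TARGET_MS
    events.append((t, "mask", MASK_MS))
    t += MASK_MS
    events.append((t, "8-AFC response screen", None))
    return events


timeline = trial_timeline(3)
```

For a target in the third position, the two preceding primes each occupy 300 ms, so the target appears 600 ms after the trial starts and the mask follows 24 ms later.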
Results
We conducted all analyses using R (Version 3.1.1) with the lme4 library (Bates, Maechler, Bolker, & Walker, 2014). We used a logistic multilevel model (Jaeger, 2008) to predict the probability of correctly identifying the target scene from the predictors of spatiotemporal coherence of the sequence (coherent vs. randomized), image location (on-campus vs. off-campus), and their interaction. We treated the participant and image intercepts as random effects. Spatiotemporal coherence was effect-coded as coherent = 1 and randomized = −1 in all analyses. Image location was effect-coded as off-campus = 1 and on-campus = −1. Consistent with the scene-gist priming hypothesis, targets presented in coherent sequences (M = 53.5%, SE = 1.47%) were categorized more accurately than targets in randomized sequences (M = 33.85%, SE = 1.39%), β = 0.47, z = 9.57, p < 0.0001, ηp² = 0.62.³ Of less interest to the study, images taken from the on-campus location (M = 46.31%, SE = 1.47%) were identified more accurately than images taken from the off-campus location (M = 41.06%, SE = 1.45%), β = −0.14, z = −2.25, ηp² = 0.10, p = 0.02. Importantly, however, the effect of the spatiotemporal coherence was the same for both locations, β = 0.008, z = 0.16, ηp² = 0.00, p = 0.87.
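To help interpret the coefficients, note that with ±1 effect coding the two levels of a predictor differ by two units, so the log-odds difference between coherent and randomized sequences is 2β. The sketch below works through that arithmetic with the reported values; it is not a refit of the authors' lme4 model, and the function names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def implied_odds_ratio(beta):
    """Odds ratio between the +1 and -1 levels of an effect-coded predictor."""
    return math.exp(2 * beta)


BETA_COHERENCE = 0.47    # reported coefficient for spatiotemporal coherence
ACC_COHERENT = 0.535     # reported mean accuracy, coherent sequences
ACC_RANDOMIZED = 0.3385  # reported mean accuracy, randomized sequences

model_log_odds_diff = 2 * BETA_COHERENCE
model_odds_ratio = implied_odds_ratio(BETA_COHERENCE)
raw_log_odds_diff = logit(ACC_COHERENT) - logit(ACC_RANDOMIZED)
```

The model-based difference (0.94 log-odds units) is a little larger than the difference computed from the raw mean accuracies. A gap in that direction is typical when a model includes random intercepts, because its coefficients are conditional on participant and image rather than averaged over them.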
We conducted two different analyses to examine within-scene and between-scenes category priming. Sequential expectations for upcoming scenes in coherent sequences can be generated both within the same category (e.g., seeing one view of an office may lead one to anticipate another) and between scene categories (e.g., seeing multiple first-person views of someone navigating through an office toward a closed door may prime expectations for a hallway). The first analysis included only instances when the target image was of the same scene category as the preceding primes (e.g., an office after zero, one, or two views of the same office scene).⁴ We refer to this as within-scene category priming. The second analysis included only instances when the target image was of a different category than the preceding primes (e.g., a target hallway after one, two, or three offices). We refer to this type of priming as between-scenes category priming. We treated the number of sequential images of the priming category continuously in both sets of analyses. Using multilevel modeling, one can treat continuous variables as repeated measures (Cohen, Cohen, West, & Aiken, 2003). We also fitted two models, to examine where asymptotic recognition performance was reached for both within-scene and between-scenes category priming. These analyses are included in the supplementary materials available at https://osf.io/83sjx/.
Consistent with the within-scene category priming hypothesis, as shown in Figure 3a, performance improved as the target was preceded by more primes from its category, β = 0.72, z = 8.16, ηp² = 0.59, p < 0.001. Recognition of the target image was better when it was preceded by two primes of its own category (M = 68.63%, SE = 3.26%) than when preceded by one (M = 65.49%, SE = 2.48%) or zero (M = 40.59%, SE = 2.04%). There is a large amount of overlap in the perceptual and conceptual properties of a scene from one fixation to the next when the scene category repeats across multiple views. This overlap improved scene-gist categorization. Furthermore, participants performed better when the target was preceded by one or two primes of its own category in comparison to zero. This finding is also consistent with the within > between category priming hypothesis because targets that were not primed by scenes of their own category were primed by scenes of a different category (i.e., between-categories priming).
We next analyzed the effect of the number of primes from a different category immediately preceding the target image. Consistent with the between-scenes category priming hypothesis, as the number of primes increased (e.g., a target picture of a hallway after multiple views of an office), the degree of between-categories scene facilitation increased, β = 0.46, z = 2.55, η2p = 0.07, p = 0.01. As shown in Figure 3b, target images of a different category from their primes were more accurately identified if preceded by three primes (M = 42.16%, SE = 4.91%) than if preceded by two (M = 29.25%, SE = 4.44%) or one (M = 25.00%, SE = 5.00%). This is important because it shows that scene spatiotemporal-coherence priming is not limited to a within-category effect (i.e., an effect analogous to repetition priming), since facilitation was found both within and between scene categories. 
Target images in coherent sequences were categorized more accurately than targets shown in randomized sequences. Furthermore, we found evidence of both within-scene and between-scenes category facilitation. This suggests that a prediction of a scene category made prior to seeing it may influence its recognition. Alternatively, it is also possible that the effects we observed were not due to facilitation of the current event model on scene-gist perception but instead were the result of participants being unable to distinguish the prime from the target category because primes had a longer SOA than targets. As noted earlier, primes were given 300 ms of processing time. This is long enough to be both detected and stored in conceptual short-term memory (Potter, 1976). However, targets were given only 24 ms of processing time. Thus, participants may have responded with the last scene category they readily perceived, namely the prime. Such a response bias could explain the accuracy advantage for target scenes in coherent sequences based on the within-category facilitation effect, but it could not explain the between-categories facilitation effect we observed. To more carefully examine whether participants were responding with the category label that matched the target, the prime, or one of the other category labels that was neither the prime nor target, we conducted an exploratory series of analyses. If participants confused the prime shown immediately prior to the target with the target itself, then they should have responded with the prime category label regardless of whether targets were in coherent or randomized sequences. Alternatively, if participants responded based on their perception of the target, rather than that of the prime, and their perception of the target was facilitated in the coherent sequence condition, then they should have been more likely to select the category label of the target than the prime in the coherent condition. 
Exploratory analysis of category responses
To identify whether participants confused the prime with the target's category, we first removed all instances when the target was the first image in a trial, since those targets were not primed. Furthermore, only cases when the prime and target category differ can be used to address this important question. If the prime and target are from the same category, then confusing the target's category with that of the prime yields the same response as responding to the target's category. Therefore, we included only instances when the prime and target categories differed in our analysis (i.e., only cases of between-categories priming). This resulted in the removal of 1,463 observations so that only 840 remained. We then conducted three multilevel logistic regressions. Only the outcome variable differed between analyses. In the first analysis, we included only responses with target and prime labels. In the second analysis, we included only responses with target and other nonprime category labels. In the third and final analysis, we examined errors by including only responses with prime and other category labels. We used the spatiotemporal-coherence manipulation as a predictor in all three analyses. The outcome variable in each model was the probability of a participant's response (i.e., target vs. prime, target vs. other, and prime vs. other). All models included the random intercept of participant. In the first analysis, a response with the category that matched the target was coded as a 1 and a response with the category that matched the prime image was coded as a 0. Participants were significantly more likely to respond with the category label that matched the target than with the category of the prime when the sequence was coherent, β = 0.17, z = 2.30, p = 0.02. In the second analysis, responses with the prime label were removed, and a response with the category of the target was again coded as a 1 and a response with one of the other category labels was coded as a 0. 
Again, results showed that when sequences were coherent, participants were significantly more likely to respond with the category label that matched the target image than one of the other nonprime scene-category labels, β = 0.23, z = 3.01, p = 0.003. Lastly, we examined participant errors. Participants were no more likely in coherent than in randomized sequences to select the scene category that matched the prime versus one of the other category labels, β = 0.05, z = 0.73, p = 0.47. In sum, participants were better able to distinguish the category of the target image from both the immediately preceding prime and the other scene categories in coherent sequences, and were no more likely to select the prime than any of the other scene-category labels in coherent versus randomized sequences. Thus, responding to the prime rather than the target cannot explain our priming effects. 
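The three outcome codings described above can be sketched as follows. This is only an illustration of the coding scheme, not the authors' analysis script; the function name and the analysis labels are hypothetical, and coding the prime as 1 in the third analysis is an assumption, since the text does not state which error type received which code.

```python
from typing import Optional


def code_response(response: str, target: str, prime: str, analysis: str) -> Optional[int]:
    """Return the 1/0 outcome for one analysis, or None if the trial is excluded."""
    if target == prime:
        # Only between-categories trials (prime != target) entered these analyses.
        return None
    is_target = response == target
    is_prime = response == prime
    if analysis == "target_vs_prime":
        # Keep only target and prime responses: target = 1, prime = 0.
        if is_target:
            return 1
        return 0 if is_prime else None
    if analysis == "target_vs_other":
        # Drop prime responses: target = 1, any other label = 0.
        if is_prime:
            return None
        return 1 if is_target else 0
    if analysis == "prime_vs_other":
        # Errors only: prime = 1 (assumed coding), any other label = 0.
        if is_target:
            return None
        return 1 if is_prime else 0
    raise ValueError(f"unknown analysis: {analysis}")
```

Each coded column would then serve as the outcome of a multilevel logistic regression with spatiotemporal coherence as the fixed effect and a random intercept for participant.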
We also explored another alternative explanation for the advantage for coherent over randomized sequences, and the greater within-category than between-categories priming. As we mentioned earlier when introducing within-category priming, scenes may have been primed not by their expectation but by the amount of visual featural overlap between images regardless of conceptual influences (though images that share layout information also tend to belong to the same scene category; Oliva & Torralba, 2001). Repetition priming enhances perception (Bar & Biederman, 1998; Sperber, McCauley, Ragain, & Weil, 1979). For instance, two different views of the same office share overlapping objects, textures, and perhaps even the same spatial layout. Furthermore, even as one navigates from an office into a hallway, many more visual features across that pair of images will be similar (rectilinear shapes, the presence of right angles, etc.) than between randomly paired scene categories in the randomized sequences. We assume that when information from an image is extracted, various feature detectors become activated along the ventral visual pathway. Thus, if a subsequent scene activates the same, or very similar, feature detectors, the combined activity from the first and second images may facilitate identification of the second image, producing the priming effect we observed. Furthermore, the same mechanism could explain the greater within-category than between-categories priming, since there is greater featural overlap within the same scene category than between scene categories. Thus, regardless of any predictions potentially generated by the posited event model, targets may have been identified more accurately in coherent sequences because primes and targets may have been more likely to activate similar feature detectors at each stage in visual processing (Bar & Biederman, 1998). 
Exploratory analysis of conceptual priming and image similarity
While we hypothesized both perceptual and conceptual priming effects a priori, our method of measuring visual similarity was chosen post hoc. We examined the effects of perceptual similarity between the target and its immediately preceding prime versus conceptual priming on scene-categorization performance. We estimated similarity using the spatial-envelope model (Oliva & Torralba, 2001). Spectral information can be used to categorize and describe scenes according to their spatial properties derived from a relatively small set of perceptual dimensions (e.g., roughness, naturalness, openness, expansion, ruggedness). This model has previously been used to investigate priming between scenes presented on different trials (Caddigan et al., 2017). 
We tested priming by examining the perceptual similarity between the target image and its immediately preceding prime regardless of its semantic category. We extracted spectral information over four fixed windows of size 256 × 192 for each image. Within each window, we calculated the response to Gabor filters at four spatial frequencies and eight orientations. Filter responses were concatenated to obtain a feature vector for each image. This resulted in 8 × 4 × 4 × 4 = 512 image features for each image. The Euclidean distance between pairs of images provided a measure of similarity between target and prime. 
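As a rough sketch of the bookkeeping behind this similarity measure, the snippet below counts the features implied by the product given in the text and computes the Euclidean distance between two feature vectors. It does not implement the Gabor filtering itself. Reading the last two factors of 8 × 4 × 4 × 4 as a 4 × 4 grid of windows is an assumption, and both function names are hypothetical.

```python
import math
from typing import Sequence


def feature_count(n_orientations: int = 8, n_frequencies: int = 4,
                  grid_rows: int = 4, grid_cols: int = 4) -> int:
    """Number of filter responses per image (grid layout is an assumed reading)."""
    return n_orientations * n_frequencies * grid_rows * grid_cols


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance between two feature vectors; smaller means more similar."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For a prime and target, the two arguments would be their 512-element feature vectors.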
We first analyzed image similarity between coherent and randomized sequences using a linear multilevel model with the random effect of image. Prior to analysis, we took the reciprocal of image similarity so that greater values corresponded to larger similarity between pictures. Due to the positive skew of image similarity, we also took its natural log prior to entering it into any analyses. Target images that were first in a sequence were removed from the analysis, since they were not primed. As predicted, image similarity was significantly higher in coherent (M = 6.82, SE = 0.22) than in randomized (M = 4.14, SE = 0.10) sequences, β = 0.21, t = 16.84, p < 0.001. If the similarity between the target and prime is the reason we observed greater recognition performance in the coherent compared to the randomized scene sequences, then image similarity would be expected to predict recognition performance above and beyond the manipulation of the spatiotemporal coherence when included in a multiple regression analysis. Alternatively, if both image similarity and expectations for scene categories influenced recognition performance, then the spatiotemporal coherence would be expected to remain a significant predictor of performance after we controlled for the effect of image similarity. 
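The transformation of the distance measure described above can be illustrated in a few lines. The order of operations (reciprocal first, then natural log) is an assumption based on the wording of the text, and `log_similarity` is a hypothetical name.

```python
import math


def log_similarity(distance: float) -> float:
    """Natural log of the reciprocal distance, so larger values mean more similar."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return math.log(1.0 / distance)
```

Under this reading, a prime and target separated by a smaller Euclidean distance receive a larger log-similarity score, which is the direction needed for a positive coefficient to indicate facilitation.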
To test these alternative hypotheses, we fitted a series of hierarchical multilevel logistic regressions to recognition performance. More details about each of the models we conducted are provided in Supplementary File S1. Each model contained the random effects of participant and image; they differed only in the structure of their fixed effects. We used Akaike information criteria (AICs) and Bayesian information criteria (BICs)—with smaller values indicating better model fit—as well as chi-square tests of significance to compare models (Agresti, 2007; Wagenmakers & Farrell, 2004). A difference in AIC and BIC values of 2–6 can be accepted as moderate evidence for the model, with the smaller value having the better fit, while differences of 6–10 suggest a strong difference between models, and a difference of >10 indicates a very strong difference (Burnham & Anderson, 2004; Raftery, 1995). 
The first model contained only the fixed effect of log of similarity. Scene-categorization accuracy was greater when the scene immediately preceding the target and the target were similar than when they were not, β = 0.87, z = 9.84, p < 0.001, AIC = 2,635.30, BIC = 2,657.80, consistent with the hypothesis that perceptual similarity improved categorization performance to the target. In a second analysis, we included main effects of both image similarity and the spatiotemporal-coherence manipulation. Importantly, the degree of coherence between scene images predicted unique variance not accounted for by image similarity. When included in the model together, both image similarity and the spatiotemporal-coherence manipulation uniquely predicted recognition performance (log of similarity: β = 0.70, z = 7.71, p < 0.001; spatiotemporal coherence: β = 0.37, z = 6.68, p < 0.001, AIC = 2,591.90, BIC = 2,620.00). The model containing both main effects was a significantly better one than the model containing the log of similarity alone, χ2 = 45.36, p < 0.001, ΔAIC = 43.40, ΔBIC = 37.80. As shown in Figure 4, images that had very visually similar primes embedded in coherent sequences were identified more accurately than images that were very similar but shown in randomized sequences. Facilitation was also found for targets with visually similar primes. These results are consistent with hypotheses generated from SPECT that the extent to which facilitation will be found depends upon the degree of spatiotemporal coherence between the contents of the current event model and new scene information (Loschky et al., 2019). 
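The model comparison can be retraced from the reported fit indices. The sketch below recomputes the AIC and BIC differences and maps them onto the evidence bands described earlier. The "weak" label for differences under 2 and the treatment of the band boundaries at exactly 6 and 10 are assumptions, and the variable names are hypothetical.

```python
def evidence_label(delta: float) -> str:
    """Map an AIC or BIC difference onto the bands cited in the text."""
    d = abs(delta)
    if d > 10:
        return "very strong"
    if d >= 6:
        return "strong"
    if d >= 2:
        return "moderate"
    return "weak"  # label for differences below 2 is an assumption


# Fit indices as reported for the two Experiment 1 models.
similarity_only = {"aic": 2635.30, "bic": 2657.80}
similarity_plus_coherence = {"aic": 2591.90, "bic": 2620.00}

delta_aic = round(similarity_only["aic"] - similarity_plus_coherence["aic"], 2)
delta_bic = round(similarity_only["bic"] - similarity_plus_coherence["bic"], 2)
print(delta_aic, evidence_label(delta_aic))
print(delta_bic, evidence_label(delta_bic))
```

Both differences exceed 10, so by the cited criteria the model that adds spatiotemporal coherence is very strongly preferred.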
Discussion
The roles that top-down and bottom-up processing play in image recognition remain critical to theories of cognition and perception, and the unique contributions of each remain largely unknown and heavily disputed (Firestone & Scholl, 2016; Kveraga et al., 2011; Ullman, 1995). According to data-driven, feed-forward processing accounts, expectations for upcoming scenes should not influence their recognition, as perception of scene gist is accomplished by purely feed-forward mechanisms (Fabre-Thorpe et al., 2001; Rousselet et al., 2005). If perception is purely feed-forward, then it is independent of expectations, because visual processing is largely driven by the patterns of light hitting the eyes, not by what one expects to see. Nevertheless, we found that sequential expectations for scene categories influenced recognition performance, and they did so when the same scene category repeated across multiple views (i.e., Figure 3a) as well as when the target image changed to an expected but different scene category (i.e., Figure 3b). Furthermore, priming from similar images based upon shared roughly localized spectral-amplitude information between target and prime enhanced accuracy more for targets within coherent sequences than randomized ones (i.e., Figure 4). These results are consistent with a view that perception is the result of a synthesis between bottom-up and top-down factors (Brandman & Peelen, 2017; Brewer & Loschky, 2005; Rao & Ballard, 1999; Ullman, 1995). Nevertheless, it remains unknown whether images presented in coherent sequences were actually more predictable than images presented in randomized sequences, since we did not measure image predictability in Experiment 1. It also remains unknown whether predictions generated from one's event model while viewing the sequences of images contributed to rapid scene-gist categorization performance independent of target and prime similarity. 
The results of the exploratory analysis suggest that both similarity between scene views and predictions generated from one's event model contribute to gist perception; however, we did not explicitly measure image predictability in Experiment 1.
Experiment 2
Experiment 2 served two purposes. The first was to determine the extent to which the scene categories in coherent and randomized sequences were predictable. One possibility, according to SPECT, is that the presentation of the first scene in a coherent image sequence lays the foundation for the current event model in working memory (Loschky et al., 2019). Once an event model is constructed, upcoming scene categories become predictable. This contrasts with having no event model, as would be the case when the scene sequence order is random. The second purpose of Experiment 2 was to investigate the extent to which scene-category predictability influenced the recognition performance found in Experiment 1, and whether it did so uniquely, as compared to image similarity between target and prime. Two hypotheses therefore suggest themselves. First, if target images presented in coherent sequences were predictable, then prediction accuracy should be greater for target scene categories embedded in spatiotemporally coherent sequences than those in randomized ones. Second, if target images in randomized sequences are less predictable, they should be predicted at no better than chance level, since event models are unable to predict upcoming scenes accurately when the sequence is random.5 If these two hypotheses are supported, other more interesting hypotheses are suggested. If participants in Experiment 1 used predictions of upcoming scene categories to correctly respond in the 8-AFC task, then the predictability of targets in Experiment 2 should predict Experiment 1 performance. An interesting question would then be whether target predictability contributes to recognition performance independent of the effect of visual similarity between the prime and target. 
If the effects of visual similarity are primarily facilitating low-level information-extraction processes in the front end, and target-predictability effects are primarily due to higher level back-end event-model processes influencing front-end information extraction, then their influence on categorization performance would be expected to be independent. Both would thus be expected to contribute to gist-recognition accuracy in a multiple regression analysis. Alternatively, the sharing of activation of feature detectors between prime and target could influence categorization performance, but predictions generated from one's understanding of the sequence may not (Firestone & Scholl, 2016). 
Method
Participants
Fifty-nine participants from Kansas State University's undergraduate research pool who did not participate in Experiment 1 participated in Experiment 2 for course credit (35 female, 24 male; age: M = 19.63, SD = 2.00). We targeted obtaining data from the same number of participants in both experiments (i.e., 48); however, we collected data from more participants than we had targeted due to scheduling constraints.6 Participants' vision was tested prior to participation and was 20/30 or better as determined by the Freiburg Visual Acuity and Contrast Test (Bach, 2006). All participants were unaware of the purpose of the experiment and signed informed consent prior to participating. After participants completed the experiment, they were instructed not to discuss the experiment with fellow classmates, because we asked about their awareness of the coherence manipulation at the end of the experiment. 
Procedure
We yoked the stimuli and design in Experiment 2 to those in Experiment 1. The only procedural differences between the two experiments were as follows. First, the target image in each trial was removed and replaced by a neutral gray screen for 24 ms. Importantly, the sequences for each participant number were identical across experiments, except for the presence or absence of target images (e.g., the image sequence in Trial 1 of Experiment 1 for Subject 1 was identical to that in Trial 1 of Experiment 2 for Subject 1, except that the target was absent). A second difference (necessitated by the first) was that participants' task was to identify the scene category that they predicted would have been presented prior to the perceptual mask (since there was no target to identify). They did so by selecting it from the 8-AFC array of scene-category labels. A third difference was that after the experiment, participants were asked a series of questions related to whether they noticed the differences in coherence between images in the experiment. Participants were asked the following four questions: 1) “Did anyone tell you anything about this study?” 2) “Did you notice anything in the experiment?” 3) “Did you notice anything about the sequence the images were shown in?” and 4) “Did you notice that some of the images were unexpected?” If participants responded positively to any of the questions, they were asked to type their answers to the questions in a text box on the computer screen. Two raters independently judged from the responses whether each participant had reported anything regarding the coherence of the image sequences. Interrater reliability was strong (Cohen's κ = 0.98). Discrepancies between raters were resolved through discussion to produce the final coding of participant responses. Surprisingly, only 38.98% of participants reported that they noticed the coherence of the coherent scene presentations. 
We also investigated whether noticing the coherence was a requirement of good prediction accuracy. That analysis is included in Supplementary File S1.
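For readers unfamiliar with the interrater statistic reported for the awareness coding, Cohen's κ compares observed agreement with the agreement expected by chance from each rater's marginal frequencies. The sketch below uses the standard formula; the function name and any ratings passed to it are hypothetical, not the study's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who each label the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of the raters' marginal proportions.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)
```

Here each list would hold one yes/no judgment per participant (noticed coherence or not) from one rater.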
Results
To test the first hypothesis, that images presented in coherent sequences were predictable, we assessed prediction accuracy using a multilevel logistic model. Consistent with our first hypothesis, prediction accuracy was found to be reliably greater for images presented in coherent sequences (M = 26.64%, SE = 1.18%) than those in randomized sequences (M = 15.47%, SE = 0.96%), β = 0.35, z = 7.13, η2p = 0.34, p < 0.001. Surprisingly, even though the on-campus images were recognized more accurately in Experiment 1 (i.e., Figure 2), prediction accuracy was found to be greater for off-campus (M = 23.38%, SE = 1.13%) than on-campus image sequences (M = 18.73%, SE = 1.04%), β = 0.14, z = 2.61, η2p = 0.13, p = 0.009. Spatiotemporal coherence and image location did not interact, β = 0.06, z = 1.25, η2p = 0.07, p = 0.21. We observed the same effects using data from the first 48 participants (i.e., those who saw the same image sequences, minus the targets, as participants in Experiment 1). Importantly, consistent with our second hypothesis, prediction accuracy for images presented in randomized sequences was found to be very close to chance level (12.5%). The estimated mean for the off-campus randomized images was equal to 15.63%, 95% confidence interval [12.95%, 18.75%], and for on-campus randomized images it was estimated to be 13.77%, 95% confidence interval [11.25%, 16.74%]. Prediction is difficult when no event models can be constructed to accurately represent the ongoing event. 
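The chance level and its relation to the reported intervals can be checked directly. The sketch below derives chance from the eight response alternatives and tests whether each reported 95% confidence interval covers it; the dictionary keys and the function name are hypothetical.

```python
N_CATEGORIES = 8                # response alternatives in the 8-AFC task
CHANCE = 100.0 / N_CATEGORIES   # chance accuracy in percent


def interval_contains(lower: float, upper: float, value: float) -> bool:
    """True if value lies within the closed interval [lower, upper]."""
    return lower <= value <= upper


# 95% confidence intervals (percent) as reported for randomized sequences.
reported = {
    "off-campus randomized": (12.95, 18.75),
    "on-campus randomized": (11.25, 16.74),
}
contains_chance = {
    name: interval_contains(lo, hi, CHANCE) for name, (lo, hi) in reported.items()
}
print(CHANCE, contains_chance)
```

On these reported values, the on-campus interval covers 12.5%, while the lower bound of the off-campus interval sits just above it.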
Figure 2
 
Experiment 1: Rapid-scene-gist categorization accuracy as a function of the spatiotemporal coherence of the scene sequences and image location (on-campus vs. off-campus). Error bars represent 95% confidence intervals around the estimated mean. The probability of correctly identifying the scene by chance on any given trial was 12.5%, as represented by the dashed line.
Figure 3
 
Experiment 1: Rapid-scene-gist categorization accuracy as a function of the number of primes. (a) The example target is the office prior to the mask. The target was preceded by either zero, one, or two primes from the same category as the target (i.e., within-category priming). (b) The example target is the hallway. It was preceded by either one, two, or three primes that were of a different category than the target (i.e., between-categories priming). (Note that the case of zero within-category primes was, logically, either the first image in the 10-image sequence or a between-categories prime.) Error bars represent 95% confidence intervals around the estimated mean probability correct.
Figure 4
 
Experiment 1: Probability of correctly categorizing a scene image as a function of image similarity and the spatiotemporal coherence of scene sequences.
Exploratory analysis of category responses
As in Experiment 1, we conducted a series of analyses examining participant responses. Because primes were processed for 300 ms, participants may have responded with the category label that matched the prime immediately prior to the mask, rather than that of the intended but unseen target. Specifically, if participants made predictions as to the target's identity by selecting the category label that matched the prime, this should be true regardless of the spatiotemporal coherence of sequences. Thus, responses with the category label of the prime should not differ between coherent and randomized sequences. Alternatively, if participants used the spatiotemporal coherence of the scene images to generate predictions of future scene categories, then they should have been more likely to select the category label of the target than the prime in coherent sequences. 
We ran three multilevel logistic regression models with the category label of the participant's response (target vs. prime, target vs. other, prime vs. other) as the outcome variable. As in Experiment 1, we removed all cases when the target was the first image in a sequence and all instances when the category of the prime matched the target. We modeled responses using a fixed effect of spatiotemporal coherence and the random effect of participant. As in Experiment 1, the three models differed only in the outcome variable. In the first analysis, a response with the target was dummy-coded as a 1 and one with the prime category was coded as a 0. Consistent with the recognition results of Experiment 1, participants successfully discriminated predicted (but unseen) targets from primes (which they did see), β = 0.19, z = 2.28, p = 0.02. Thus, even though the target images were never shown in the task, participants were more likely in coherent sequences to respond with the category label that matched the target scene than they were to respond with the label that matched the prime (which they did see). In the second analysis, a response with the target was dummy-coded as a 1 and one with one of the other (nonprime) labels was coded as a 0. Consistent with the recognition results of Experiment 1, participants were significantly more likely to select the category label that matched the target than one of the other six categories in coherent sequences, β = 0.44, z = 6.32, p < 0.0001. Finally, we examined errors by assessing whether participants responded more frequently with either the prime category label or one of the other category labels in coherent versus randomized sequences. In contrast to the results of Experiment 1, when participants made errors they were more likely to do so by responding with the prime than with one of the other six categories in coherent sequences, β = 0.25, z = 3.80, p < 0.001. 
Although this finding is inconsistent with the recognition results from Experiment 1, it is not surprising. Specifically, participants were more likely to see multiple views of the same scene category sequentially in the coherent than in the randomized sequences. This would make them more likely to predict that they would see the same image as the prime in a coherent sequence than in a randomized sequence. 
Analysis of image predictability and image similarity
The sequences for each participant number were identical across Experiments 1 and 2, with the exception that the targets of Experiment 1 were removed in Experiment 2 and participants were asked to predict what the images would be if they were present. The design therefore allows us to investigate whether targets that were predictable in one sample were recognized more accurately in an independent sample. We conducted a series of hierarchical multilevel logistic regressions to evaluate the extent to which prediction for a scene category contributed to recognition performance independent of prime-to-target-image visual similarity. We again compared models using AIC and BIC values as well as chi-square tests of significance. Further details about each of the models we considered are provided in Supplementary File S1. All scene images were presented in the same sequences in Experiments 1 and 2, with the same target scenes matched at the level of the participant number. Whether target images were predictable (yes or no) in Experiment 2 was effect-coded as either predictable = −1 or not = 1. Participant and image were treated at their intercept as random effects.7 For this analysis, data from the last 11 participants (49–59) in Experiment 2 were removed so that we could match the target images in Experiments 1 and 2. Each target within the experiment was categorized either accurately or not in Experiment 1 and predicted either accurately or not in Experiment 2. As shown in Figure 5, images that were accurately predicted by participants in Experiment 2 (M = 51.88%, SE = 2.36%) were more accurately identified in Experiment 1 than those that were not predictable (M = 39.50%, SE = 1.22%), β = −0.28, z = −4.54, p < 0.001, η2p = 0.25, AIC = 2,758.00, BIC = 2,774.90. 
Despite differences in the tasks between experiments, and the potentially different cognitive sets that participants had while performing those tasks, targets that were more predictable from one sample of participants were more recognizable to another. 
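Because predictable targets were effect-coded as −1 and unpredictable targets as 1, the negative coefficient corresponds to higher accuracy for predictable targets. The sketch below makes the arithmetic explicit for a generic logistic model with a single ±1 predictor; it ignores the random effects, so the result is only an approximate, conditional reading of the reported β. The function names are hypothetical.

```python
import math


def effect_code(predictable: bool) -> int:
    """Effect coding used in the text: predictable = -1, not predictable = 1."""
    return -1 if predictable else 1


def log_odds_difference(beta: float) -> float:
    """Log-odds for predictable targets minus log-odds for unpredictable ones."""
    return beta * effect_code(True) - beta * effect_code(False)


beta_predictability = -0.28  # coefficient reported for the predictability-only model
difference = log_odds_difference(beta_predictability)
odds_ratio = math.exp(difference)
print(difference, odds_ratio)
```

With ±1 coding the two groups are two units apart on the predictor, so the difference in log-odds is −2β rather than β.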
Figure 5
 
Rapid-scene-categorization accuracy from Experiment 1 as a function of whether the target image was predictable in Experiment 2 (yes vs. no).
Model 2 contained only the log of image similarity, and Model 3 contained main effects of both log of image similarity and whether the target was predictable (yes vs. no). Consistent with the results of Experiment 1, log image similarity significantly contributed to recognition accuracy, β = 0.87, z = 9.84, p < 0.001, AIC = 2,635.30, BIC = 2,657.80. Importantly, when both were included in the model together, both significantly predicted categorization accuracy (log of similarity: β = 0.84, z = 9.46, p < 0.001; target predictability: β = −0.21, z = −3.30, p = 0.001; AIC = 2,626.40, BIC = 2,654.60). Critically, Model 3, which contained both main effects, was a significantly better model than Model 2, which contained only the main effect of log of image similarity, χ2 = 10.84, p = 0.001, ΔAIC = 8.80, ΔBIC = 3.20. As shown in Figure 6, when images were predictable and visually similar to their immediately preceding prime image, recognition accuracy was better than when they were not predictable and visually similar. Importantly, participants in Experiment 2 never saw the target image, but were asked to identify what it would have been if it were present. The independence between predictability and image visual similarity suggests that image similarity influences rapid scene categorization by sharing the activation of feature detectors (captured by spatial-envelope-model scene descriptors) between both prime and target. This is primarily a front-end perceptually driven process. Nevertheless, predictions generated from one's event model influence recognition performance from the top down. Furthermore, through treating the predictability (yes vs. no) as a categorical variable, image similarity appears to be the stronger predictor of categorization performance. As shown in Figure 6, the estimated difference in performance between predicted versus nonpredicted images was relatively small given the size of the estimated confidence intervals around the mean estimates. 
Furthermore, the log-of-visual-similarity predictor had a larger z-test statistic. 
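The comparison between Model 2 and Model 3 can be checked with a few lines of arithmetic. The sketch below takes the reported likelihood-ratio statistic and assumes that Model 3 differs from Model 2 by exactly one estimated parameter (the fixed effect of target predictability); the function and variable names are illustrative and are not taken from the original analysis scripts.

```python
import math


def lrt_p_value_df1(chi_sq: float) -> float:
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom.

    With 1 df, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_sq / 2.0))


chi_sq = 10.84     # reported likelihood-ratio statistic, Model 3 vs. Model 2
extra_params = 1   # assumption: Model 3 adds a single fixed effect

p = lrt_p_value_df1(chi_sq)

# AIC charges 2 per extra parameter, so the AIC gain is chi_sq - 2 * k.
delta_aic = chi_sq - 2 * extra_params

print(f"p = {p:.4f}")
print(f"delta AIC = {delta_aic:.2f}")
```

Under these assumptions the p value comes out near the reported 0.001, and the AIC gain is close to the reported ΔAIC of 8.80 (small discrepancies are expected from rounding of the published values).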
Figure 6
 
Probability of correctly rapidly categorizing a scene image in Experiment 1 as a function of image visual similarity and whether the target was predictable in Experiment 2.
Perhaps one reason that the effect of visual similarity was found to be larger than that of image predictability is that the former was a continuous variable, while the latter was categorical. Thus, we made a continuous variable of image predictability as well and investigated the unique contributions of image visual similarity and image predictability in influencing rapid-scene-categorization accuracy through a series of partial correlations. In this analysis, image predictability was treated as a continuous variable by calculating the aggregate of image-predictability accuracy across all 24 sequences from Experiment 2. Across all 59 participants, image-predictability accuracy was aggregated for each of the sequences of images presented in coherent and randomized sequences separately. Rapid-scene-categorization accuracy was also aggregated across all 24 of the sequences using rapid-scene-categorization performance in Experiment 1. This results in each of the randomized and coherent sequences being associated with a value that corresponds to how predictable images were in it and how accurately scene images were categorized in it. Likewise, the average visual similarity between the target and its immediately preceding prime within each of the 24 sequences was also aggregated separately for both coherent and randomized sequences. Zero-order correlations are given in Table 1. Controlling for prediction accuracy, image similarity was found to significantly correlate with recognition performance, r(45) = 0.36, p = 0.01. The more visually similar images were to their primes, the more accurate rapid scene categorization was, even when the effect of image predictability was partialed out. Interestingly, after image similarity was held constant, prediction accuracy remained significantly correlated with rapid-scene-categorization performance, r(45) = 0.33, p = 0.02. 
Thus, even when the effect of image visual similarity was accounted for, prediction accuracy remained associated with recognition performance, and it did so almost to the same extent as image visual similarity. Similar results were observed when we conducted the partial correlations using data from the first 48 participants. These results suggest that the degree of image overlap influences priming from the bottom up due to the degree that the prime and target activate similar feature detectors. However, the extent to which a prime and target activate similar feature detectors alone does not account for the results we found in Experiment 1. Scene spatiotemporal-coherence priming is the result of scene-category predictions made prior to viewing a scene as well as the visual overlap between prime and target. 
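For readers unfamiliar with partial correlations, the computation can be sketched from zero-order correlations alone. The zero-order values below are hypothetical placeholders, not the values in Table 1; only the partial correlation of 0.36 and its 45 degrees of freedom come from the text. The degrees of freedom are consistent with 48 sequence-level observations (24 sequences in each of two coherence conditions) minus 3.

```python
import math


def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation between x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


def t_from_r(r: float, df: int) -> float:
    """t statistic for testing a (partial) correlation against zero."""
    return r * math.sqrt(df / (1 - r**2))


# HYPOTHETICAL zero-order correlations, for illustration only.
r_cat_sim = 0.50    # categorization accuracy vs. image similarity
r_cat_pred = 0.48   # categorization accuracy vs. prediction accuracy
r_sim_pred = 0.40   # image similarity vs. prediction accuracy

sim_given_pred = partial_corr(r_cat_sim, r_cat_pred, r_sim_pred)
pred_given_sim = partial_corr(r_cat_pred, r_cat_sim, r_sim_pred)

n_obs = 48          # 24 sequences x 2 coherence conditions
df = n_obs - 3      # one control variable -> N - 3 = 45

# t statistic implied by the reported partial correlation r(45) = 0.36
t_reported = t_from_r(0.36, df)
print(f"t({df}) = {t_reported:.2f}")
```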
Table 1
 
Zero-order correlations between rapid scene categorization, image prediction accuracy, and image similarity. Notes: Image similarity is the reciprocal of the difference between the gist descriptor of the target image and its immediately preceding prime as output from the results of the spatial-envelope model (Oliva & Torralba, 2001). M = mean; SE = standard error.
Discussion
In Experiment 2, we found that images presented in coherent sequences were more predictable and that prediction accuracy for images presented in randomized sequences was at chance. Importantly, predictions for upcoming scenes in the sequence accounted for unique variance in recognition performance beyond that of visual similarity between the target and its immediately preceding prime, despite differences between participants and the task. This suggests that the accuracy advantage for scenes in coherent sequences found in Experiment 1 cannot be explained solely by low-level perceptual processes involved in processing similar features between the prime and target. These results also suggest that participants in Experiment 1 were generating predictions of scene categories to be presented, rather than passively processing the primes. The finding that image predictability influenced recognition performance when the effect of image similarity was controlled for may reflect the influence of the current event model on scene-gist extraction after both small featural changes (e.g., the first and then second view of an office) and large ones (e.g., the last view of an office and then the first view of a hallway). 
There are, however, other possible explanations of our results. For example, the priming effect we observed may be the result of changes in later decision-making stages or even purely intelligent guessing. Nevertheless, we rather doubt the intelligent-guessing explanation, since roughly 60% of participants in Experiment 2 were unaware of the coherence manipulation. On the other hand, if one makes a prediction that one will see a hallway after being shown multiple views of an office, one's prediction may influence how one responds in the task, independent of the target image shown. This possibility was recognized long ago in the word-recognition literature (Goldiamond & Hawkins, 1958). Such an effect would be consistent with prior research finding that objects in consistent scenes (e.g., a mixer in a kitchen) are not identified with greater sensitivity than objects in inconsistent scenes (e.g., a mixer in a farmyard), but that participants have a bias to respond to category labels consistently with their schemas when performing a forced-choice recognition task (Hollingworth & Henderson, 1998). Based on the 8-AFC paradigm we used in Experiment 1, we cannot disambiguate whether the coherent-sequence and predictability effects were due to changes in visual processing of the scenes versus postperceptual processes involved in making a response. 
Furthermore, if event-model construction processes make subsequent scenes more perceptible, the locus of such an effect remains unknown. For instance, predictions have been shown to influence perception as early as V1 (Muckli et al., 2015). Alternatively, top-down effects may operate later in processing. Higher level predictions may operate at the matching stage of recognition by lowering the threshold amount of activation in semantic memory needed to match the perceptual input to a stored representation (Bar & Ullman, 1996). If we conceptualize this matching process as a selection process, whereby the visual system scans representations in memory to find a good match, predictions may serve to reduce the size of the search space. Therefore, rather than influencing visual processing of the target scenes at the level of basic visual properties (e.g., spatial frequencies, line orientations, layout), priming from expectations may make the relevant semantic-memory representations easier to retrieve from semantic memory (Firestone & Scholl, 2015). We conducted Experiment 3 to address these possibilities. 
Experiment 3
We conducted Experiment 3 to evaluate how expectations for upcoming scene categories affect participants' ability to see the target versus what they say about it. Consistent with Hollingworth and Henderson (1998), we used signal-detection theory to evaluate the effects of expectations on sensitivity, a change operating during information extraction, and response bias, a change during selection of a response. Changes in these two parameters reflect the operation of qualitatively different mechanisms (i.e., perception vs. decision-making; Green & Swets, 1966). Importantly, an increase in sensitivity to targets in coherent sequences would suggest that targets were more clearly encoded. Alternatively, a failure to find an increase in sensitivity would suggest that the effects we observed in Experiment 1 were due to changes in how participants responded, regardless of encoding efficiency (Hollingworth & Henderson, 1998). 
To examine these competing hypotheses, we needed a task requiring perception of the target scene images that is independent from expectations as to their scene categories. For this, we adopted a task previously used in studies of rapid scene categorization, which involves discriminating whether a target image is intact or phase-randomized (Caddigan et al., 2017; Greene et al., 2015). Using this task, we can separate identification of the scene from postperceptual changes in response bias. Previous research has shown that perceptual sensitivity (d′) in distinguishing real from phase-randomized scenes is greater for probable than improbable scenes (Greene et al., 2015), and greater for scenes that are good exemplars of their category than poor exemplars (Caddigan et al., 2017). Phase-randomizing the images preserves Fourier amplitude-spectrum information but makes the scenes unrecognizable (Loschky & Larson, 2008; Loschky et al., 2007). Importantly, in contrast to the 8-AFC scene categorization task used in Experiment 1, there is no plausible reason why expectations regarding the upcoming scene category generated by the coherent sequences should produce a bias to respond either “intact” or “phase-randomized,” since primes do not provide information to inform observers of the target type. In fact, the category of the target is irrelevant to the task. This method thus enables us to establish whether the benefits in Experiment 1 in identifying scenes presented in coherent sequences originated from predictions influencing perceptual analysis of the scenes, which would produce an increase in d′, or from a later response-related decision-making stage, which would not influence d′. 
Method
Participants
Fifty-one participants (33 female, 18 male; age: M = 20.01, SD = 2.15) participated in the experiment for course credit. Data from three participants were replaced. One participant had insufficient visual acuity, and another participant dropped out of the experiment before completing it. We replaced data from a third participant because the experimenter made an error when selecting the participant number to begin the experiment (i.e., the order in which the participant was shown the stimuli was not the same as the order in which the corresponding participant in Experiment 1 saw the stimuli). This was a concern because, as in Experiment 2, we yoked the stimuli in Experiment 3 to Experiment 1 (e.g., the image sequence for Trial 1 of Experiment 1 for Subject 1 was identical to that in Trial 1 of Experiment 3 for Subject 1, except that half of the targets were phase-randomized). Analyses were conducted on 48 participants after replacing data from the three we removed. 
Stimuli and design
The stimuli and design were identical to Experiment 1 with the following exceptions: As shown in Figure 7, images were reduced to gray scale before they were fully phase-randomized. Phase randomization reduces image contrast, so we normalized all images and masks to share the same mean luminance (102 grayscale value) and root mean square contrast (0.20; for details, see Hansen & Hess, 2007). In this way, participants could not discriminate intact from phase-randomized images based simply on their mean luminance or root mean square contrast, and instead would need to discriminate them based on their image phase structure. 
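The luminance and contrast equalization described above can be illustrated with a short sketch. It assumes that root mean square contrast is defined as the standard deviation of pixel values divided by their mean, which the text does not specify (see Hansen & Hess, 2007, for the definition actually used); the pixel values are made up, and no clipping to the 0 to 255 range is applied.

```python
from statistics import fmean, pstdev


def normalize_luminance_contrast(pixels, target_mean=102.0, target_rms=0.20):
    """Rescale grayscale values to a target mean and RMS contrast.

    Assumption: RMS contrast = SD of pixel values / mean pixel value.
    """
    mean = fmean(pixels)
    sd = pstdev(pixels)
    if sd == 0:
        raise ValueError("a uniform image has no contrast to rescale")
    target_sd = target_rms * target_mean
    # Center on the old mean, rescale the spread, then shift to the new mean.
    return [target_mean + (p - mean) * (target_sd / sd) for p in pixels]


# Made-up grayscale values standing in for one flattened image.
image = [12, 40, 200, 255, 90, 130, 60, 180]
normalized = normalize_luminance_contrast(image)

print(round(fmean(normalized), 2))
print(round(pstdev(normalized) / fmean(normalized), 2))
```

Because intact and phase-randomized images would share the same mean and the same contrast after this step, neither statistic could serve as a cue for the discrimination task.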
Figure 7
 
Examples of the stimuli used in Experiment 3. All images were equalized for both mean luminance and root mean square contrast. This equalization resulted in a contrast reduction for most images.
Procedure
The experiment was composed of two parts: thresholding and experimental trials. As in previous studies using the intact versus phase-randomized image-discrimination task (Caddigan et al., 2017; Greene et al., 2015), we first mitigated individual differences in visual processing speed by thresholding the SOA of target images to 75% accuracy, individually for each participant, using the single-interval adjustment matrix (Kaernbach, 1990). More details of the thresholding procedure are provided in Supplementary File S1. In the experimental trials, each participant's SOA for the target images was fixed based on their threshold, ranging across participants from 48 to 152 ms (M = 72 ms). As in Experiments 1 and 2, target images in the experimental trials were presented in either coherent or randomized sequences. Likewise, as in Experiment 2, the image sequences across trials for the corresponding participant numbers were identical to Experiment 1. As in Experiments 1 and 2, primes and nontargets were presented with a 300-ms SOA (48-ms image + 252-ms ISI). Targets were presented for 48 ms, followed by a blank screen for a fixed ISI for each participant (which ranged from 0 to 104 ms), and then a mask for 96 ms (i.e., a mask:target duration ratio of 2:1, as in Experiments 1 and 2). Note that in Experiments 1 and 2 we presented all images for 24 ms; however, we doubled the image presentations to 48 ms in Experiment 3 because pilot testing showed that the mean luminance and contrast normalization (together with grayscaling) of images made them more difficult to perceive at a 24-ms duration. 
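The timing values in this paragraph follow from one relation, SOA = image duration + ISI. A minimal sketch of that arithmetic, with illustrative names that are not from the original experiment code:

```python
TARGET_MS = 48    # target duration in Experiment 3
MASK_RATIO = 2    # mask:target duration ratio of 2:1
PRIME_MS = 48     # prime and nontarget duration
PRIME_ISI_MS = 252


def target_timing(soa_ms: int) -> dict:
    """Split a thresholded target SOA into target, blank ISI, and mask durations."""
    if soa_ms < TARGET_MS:
        raise ValueError("SOA cannot be shorter than the target duration")
    return {
        "target_ms": TARGET_MS,
        "isi_ms": soa_ms - TARGET_MS,
        "mask_ms": MASK_RATIO * TARGET_MS,
    }


prime_soa_ms = PRIME_MS + PRIME_ISI_MS

# The shortest and longest thresholded SOAs reported across participants.
print(target_timing(48))
print(target_timing(152))
print(prime_soa_ms)
```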
As noted, instead of participants having to identify the basic-level category of the target as in Experiment 1, their task in this experiment was to indicate whether the target scene in each trial was intact or phase-randomized. Thus, the category of the target was irrelevant to the task. For this task, on half of the experimental trials the target image was a phase-randomized scene from a category different from that of the intact target it replaced. 
After participants completed the experiment, they were asked six questions about the spatiotemporal coherence of the scene images. They were first asked the same four relatively broad questions as in Experiment 2, and then two additional narrower questions: “Did you notice that some of the sequences appeared as if you were navigating from one scene category to another?” and “Did you notice that some of the sequences were in a random order?” As in Experiment 2, if a participant responded positively to any of the questions, they were asked to type answers to the questions in a text box. Two raters independently judged from the participant responses whether each participant reported that some of the image sequences were coherent. Raters had good interrater reliability (Cohen's κ = 0.86), and discrepancies were resolved afterwards through thoughtful discussion to determine the final coding of participants' responses. Compared to the 38.98% of participants who reported that they noticed the coherence in Experiment 2, we found that 68.75% of participants noticed the spatiotemporal coherence of the coherent sequences in Experiment 3. We again investigated whether those who noticed the coherence performed better than those who did not. In short, as in Experiment 2, they did not. See Supplementary File S1 for details of those results. 
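Cohen's κ corrects the raters' observed agreement for the agreement expected by chance. The sketch below shows the computation on hypothetical codes for ten participants (1 = reported coherence, 0 = did not); these are not the study's ratings.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one categorical code per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per code.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# HYPOTHETICAL ratings, for illustration only.
rater_1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
rater_2 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))  # 0.8
```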
Results
To estimate perceptual sensitivity and response bias, we used a probit multilevel model approach to signal detection (DeCarlo, 1998; Wright, Horry, & Skagerberg, 2009). In this method, the likelihood of a response (intact vs. phase-scrambled) was treated as the outcome variable. It was estimated as a function of the type of stimulus (intact vs. phase-randomized), the spatiotemporal coherence (coherent vs. randomized), and their interaction. A response that the scene was intact was coded as a 1 and a response that the scene was not intact was coded as a 0. There are a few keys to interpreting signal-detection analysis conducted using probit multilevel models. The intercept of the model is the overall response bias. By converting its value to a probability from a z score, one can approximate the probability of a yes response across all participants. The first predictor is the stimulus type (intact vs. phase-randomized). Its slope is used to calculate overall sensitivity, because it can be used to estimate the difference between the hit and false-alarm rates. Importantly, the main effect of spatiotemporal coherence reflects differences in response bias between coherent and randomized sequences, and the Stimulus type × Spatiotemporal coherence interaction represents the change in sensitivity between coherent and randomized sequences. 
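The mapping between model terms and signal-detection quantities can be written out explicitly. The following is a sketch of the fixed-effects part only (random effects are omitted), and it assumes a particular predictor coding that the text does not state: stimulus type S = +0.5 (intact) or −0.5 (phase-randomized), and coherence C = +1 (coherent) or −1 (randomized). Under other codings the same relations hold up to constant factors.

```latex
P(\text{respond ``intact''}) = \Phi\!\left(\alpha + \beta_{S} S + \beta_{C} C + \beta_{SC}\, S C\right)

% Sensitivity: difference between z-transformed hit and false-alarm rates
d'(C) = z(\mathrm{HR}) - z(\mathrm{FAR}) = \beta_{S} + \beta_{SC}\, C

% Criterion: negative mean of the z-transformed hit and false-alarm rates
c(C) = -\tfrac{1}{2}\left[z(\mathrm{HR}) + z(\mathrm{FAR})\right] = -\left(\alpha + \beta_{C}\, C\right)
```

Read this way, the interaction coefficient carries the change in sensitivity across coherence conditions, and the coherence main effect carries the change in response bias.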
Overall, participants were able to discriminate intact from phase-scrambled scenes (i.e., a significant main effect of stimulus type), β = 1.18, z = 20.46, p < 0.0001. Participants overall had a slight liberal bias to respond that the scene was intact (M = 0.54), α = 0.12, z = 2.27, p = 0.02; however, they were no more likely to respond that the scene was intact when the target was in a coherent (c = −0.11) than in a randomized sequence (c = −0.13), β = −0.008, z = −0.30, p = 0.77. Thus, sequence coherence did not affect response bias. Importantly, however, as shown in Figure 8, we observed a significant Stimulus type × Spatiotemporal coherence interaction, β = 0.20, z = 3.53, p < 0.001. Specifically, consistent with the scene-gist priming hypothesis, intact scenes were more efficiently discriminated from phase-randomized scenes in coherent (d′ = 1.38) than in randomized (d′ = 0.98) sequences, partial η² = 0.16. (Note that the average d′ across conditions was predetermined by initially thresholding all participants' SOAs to 75% accuracy.) By probing the interaction, we found that coherent and randomized sequences differed in estimates of their false-alarm rates (coherent: FAR = 0.28; randomized: FAR = 0.36), β = −0.22, z = −2.77, p = 0.01, but only marginally differed in their hit rates (coherent: HR = 0.79; randomized: HR = 0.73), β = 0.18, z = 2.24, p = 0.05. Thus, participants perceived scene structure in coherent sequences more readily than they did in randomized sequences. 
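The reported hit rates, false-alarm rates, d′, and criterion values can be approximately recovered from the four fixed-effect coefficients. The sketch below assumes that stimulus type was coded ±0.5 and coherence ±1; this coding is an inference from how the reported numbers fit together, not something stated in the text, and random effects are ignored.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# Reported fixed-effect estimates (probit scale).
ALPHA = 0.12      # intercept
B_STIM = 1.18     # stimulus type (intact vs. phase-randomized)
B_COH = -0.008    # spatiotemporal coherence
B_INTER = 0.20    # stimulus type x coherence


def p_intact(stim: float, coh: float) -> float:
    """Probability of an 'intact' response under the assumed coding."""
    eta = ALPHA + B_STIM * stim + B_COH * coh + B_INTER * stim * coh
    return _STD_NORMAL.cdf(eta)


def sdt_summary(coh: float) -> dict:
    """Hit rate, false-alarm rate, d', and criterion c for one coherence level."""
    hr = p_intact(+0.5, coh)    # intact target, 'intact' response
    far = p_intact(-0.5, coh)   # phase-randomized target, 'intact' response
    z_hr, z_far = _STD_NORMAL.inv_cdf(hr), _STD_NORMAL.inv_cdf(far)
    return {"HR": hr, "FAR": far, "d": z_hr - z_far, "c": -0.5 * (z_hr + z_far)}


coherent = sdt_summary(+1)      # assumed code for coherent sequences
randomized = sdt_summary(-1)    # assumed code for randomized sequences

for label, s in (("coherent", coherent), ("randomized", randomized)):
    print(label, {k: round(v, 2) for k, v in s.items()})
```

Under this coding the sketch yields d′ values of 1.38 and 0.98 and rates within rounding of the reported ones, which also makes clear that the two conditions differ mainly in sensitivity while the criterion stays nearly constant.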
Figure 8
 
Experiment 3: The probability of responding that the target scene was intact as a function of the stimulus type and spatiotemporal coherence. Estimated values of d′ calculated from the graph on the left are plotted on the right. Error bars represent 95% confidence intervals around the estimated means.
We also evaluated within-scene and between-scenes category priming as in Experiment 1. We ran two separate models, as in Experiment 1. Target scenes were preceded by zero, one, or two images that were in the same category as the target. To evaluate how perceptual sensitivity changed as a function of within-scene category priming, we fitted a probit model with the main effect of stimulus type, the number of within-category primes, and their interaction. Overall, participants were able to discriminate intact from phase-randomized scenes, β = 1.12, z = 9.10, p < 0.001. Bias to respond that the target was intact did not change as the target was preceded by more primes (i.e., there was no main effect of number of within-category primes), β = 0.02, z = 0.31, p = 0.76. Importantly, however, consistent with the within-scene category priming hypothesis, participants became more sensitive as the number of primes from the same scene category as the target increased, β = 0.36, z = 2.97, p = 0.003. Consistent with the results of Experiment 1, perceptual sensitivity was greater if the target was primed by two scenes of its same category (d′ = 1.84) than by one (d′ = 1.48) or zero (d′ = 1.12). 
To evaluate between-categories priming, all cases when the target was of the first scene category in the sequence were removed prior to the analysis. According to this analysis, participants were not sensitive (i.e., there was not a main effect of stimulus type), β = 0.13, z = 0.29, p = 0.77. Response bias did not change as a function of between-scenes category priming, β = −0.02, z = −0.14, p = 0.89. Importantly, however, consistent with the between-scenes category priming hypothesis, participants became increasingly more sensitive in discriminating intact from phase-randomized scenes as the number of primes from the preceding category increased, β = 0.46, z = 2.18, p = 0.03. As shown in Figure 9, target scenes within coherent sequences were more accurately discriminated if they were primed by three scenes from a different category than the target (d′ = 1.51) compared to two (d′ = 1.05) or one (d′ = 0.59). Together with the results for within-scene category priming, these results also support the within > between category priming hypothesis because d′ was larger for within-scene than between-scenes categories. 
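The d′ values reported for the two priming analyses follow a simple linear pattern in the number of primes. The sketch below assumes that, in each probit model, the stimulus-type coefficient equals d′ at zero primes and the interaction coefficient equals the change in d′ per additional prime, with the number of primes entered as a raw count; this reading is inferred from the reported values rather than stated in the text.

```python
def d_prime(intercept: float, slope: float, n_primes: int) -> float:
    """Estimated d' as a linear function of the number of primes."""
    return intercept + slope * n_primes


# (d' at zero primes, change in d' per prime), from the reported coefficients.
WITHIN = (1.12, 0.36)    # within-scene category priming
BETWEEN = (0.13, 0.46)   # between-scenes category priming

within = [round(d_prime(*WITHIN, k), 2) for k in (0, 1, 2)]
between = [round(d_prime(*BETWEEN, k), 2) for k in (1, 2, 3)]

print(within)   # [1.12, 1.48, 1.84]
print(between)  # [0.59, 1.05, 1.51]
```

On this reading, the nonsignificant stimulus-type effect in the between-categories model (β = 0.13) corresponds to the extrapolated sensitivity with zero primes from the preceding category, not to an overall lack of sensitivity.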
Figure 9
 
Estimated d′ as a function of within-category and between-categories priming in Experiment 3.
Discussion
The results of Experiment 3 suggest that the priming results in Experiment 1 were not solely due to response bias or intelligent guessing (e.g., Hollingworth & Henderson, 1998). Nevertheless, response bias could still have contributed to scene-gist categorization in Experiment 1. However, we can say that in Experiment 3 there is no easy explanation for how a response bias for a given scene category (e.g., hallway) would map onto a response of either “intact” or “phase-randomized.” And indeed, consistent with that argument, response bias did not differ between coherent and randomized sequences in Experiment 3, both showing very nearly zero response bias. Thus, response bias seems not to have played a role in the results of Experiment 3, which therefore shows that not only do predictions made prior to seeing a scene improve rapid-scene-categorization performance (Experiment 1), but such predictions also facilitate being able to see whether a scene is intact versus phase-randomized (Figure 8). Indeed, scenes shown in randomized sequences in Experiment 3 were more likely to be mistaken for phase-randomized images than scenes shown in coherent sequences. Furthermore, consistent with the results of Experiment 1, discrimination sensitivity increased as the number of primes prior to the target increased (Figure 9). This applies to both when viewers saw multiple views of the same scene and when the scene category changed to an expected different scene category. Nevertheless, consistent with the within > between category priming hypothesis, priming was stronger within the same category, with this advantage likely being explained in terms of a lower level mechanism (e.g., residual activation of feature detectors from a previously viewed similar image). 
Our results suggest that expectations generated from the current event model in the back end facilitate information extraction in the front end, and the locus of such an effect is not primarily due to predictions influencing response bias. Instead, expectations influence scene-gist recognition at a perceptual level. 
General discussion
In our day-to-day lives, the scenes we see are typically spatiotemporally primed. We rarely, if ever, find ourselves confronted with unexpected scene categories. Instead, we tend to either have prior knowledge of the scene categories we will encounter from one moment to the next before seeing them or make unconscious predictions about what we will see. This was the first series of experiments to demonstrate that sequential predictions made prior to viewing a scene can influence scene-gist perception. 
According to SPECT, one's understanding in the back end of what is happening in the current event (e.g., navigating through one's office toward the doorway) influences information extraction in the front end (Loschky et al., 2018; Loschky et al., 2019). In Experiment 1, we found that scenes presented in coherent sequences, which enabled observers to create a spatiotemporally coherent event model, were identified more accurately than scenes presented in randomized sequences. Experiment 2 showed that scenes in coherent sequences are indeed more predictable, and that such predictions contributed to the coherent-sequence advantage when the influence of low-level similarity between prime and target was controlled for. Experiment 3 showed that the coherent-sequence advantage is not simply due to response bias or intelligent guessing. Instead, we found that perceptual sensitivity for scenes was greater when the sequence was coherent. In addition, the results of Experiments 1 and 3 both showed evidence for priming both within and between scene categories, which appear to be driven by both lower level processes involved in information extraction and higher level priming processes. The within-scene category priming effect suggests that information extraction does not begin anew on every fixation when the scene category is not expected to change. 
These results raise the important question of how predictions generated from the event model in the back end facilitate recognition in the front end. A series of processing stages are involved in constructing a scene representation. Predictions may facilitate perception in multiple ways. At a high level, predictions from the current event model may prime relevant representations in semantic memory, facilitating recognition when a scene's constructed visual description is matched to some prestored knowledge representation. This mechanism is consistent with explanations of previous object- and scene-consistency effects (Bar, 2004; Bar & Ullman, 1996). 
Alternatively, it is possible that perceptual predictions may influence perception by facilitating early perceptual analysis (i.e., the extraction of features or the construction of a perceptual token). The mechanism could be as early as V1 (Muckli et al., 2015). Perceptual predictions may aid low-level analysis by filling in missing information or by compensating feature detectors along the ventral visual pathway, making information that is predictable more perceptible. 
It is also plausible that priming lowers recognition thresholds, which then feed back to influence processing at earlier stages. Priming should occur at a level where different exemplars of a category activate similar neural representations. As mentioned previously, scenes that share the same concept also share similar low-level visual features (Oliva & Torralba, 2001). Scene categories can successfully be decoded from neural activity that processes these low-level visual features (Choo & Walther, 2016; Ramkumar et al., 2016). Expectations for an upcoming scene category may therefore lower recognition thresholds, which then feed back and prime category-specific feature information at earlier stages. Such an argument is also consistent with the finding that there is a considerable amount of overlap in the time course of perceptual processing of scene gist and when information becomes available for a response (Caddigan et al., 2017; Ramkumar et al., 2016). 
An interesting question is whether we can infer what features or information is involved in the priming effects we observed in the intact-versus-randomized discrimination task. In Experiment 3, phase-randomized images were from a different category than the target. We assume that the information or features that produce a priming effect must be useful for recognizing a scene. Most importantly here, the feature of global configuration or layout of scenes is removed by phase randomization, while the features of dominant orientation and amplitude slope of scenes remain. One possibility is that amplitude slope and orientation information are primed, and that priming such information facilitates gist recognition when the scene is intact (Guyader, Chauvin, Peyrin, Hérault, & Marendaz, 2004). When the scene is phase-randomized, the mismatch between primed features and the input (note that phase-randomized scenes were from a category different from the target) may have resulted in the greater sensitivity difference between scenes presented in coherent versus randomized sequences. However, viewers are unable to categorize fully phase-randomized scenes at either the basic level or superordinate level, showing that the features of dominant orientation or amplitude slope are insufficient for rapid scene-gist categorization (Hansen & Loschky, 2013; Joubert, Rousselet, Fabre-Thorpe, & Fize, 2009; Loschky, Hansen, Sethi, & Pydimari, 2010; Loschky et al., 2007). Furthermore, categorization is at floor once the phase randomization exceeds 50%, and it doesn't show further decreases as phase randomization increases even up to 100% (Loschky & Larson, 2008). This suggests that priming of features in 100% phase-randomized images cannot occur, because categorization will not rise above floor performance (i.e., chance). 
If so, then the fact that the phase-randomized images were from a different category than the target does not tell us what features were primed, because nothing will have been primed; however, this remains an empirical question. 
There is good evidence that the global configuration or layout of a scene is both necessary and sufficient for categorization of a scene, and so predictions and image similarity may prime the perception of scene layout. Evidence of priming of layout has been repeatedly shown in studies by Sanocki and colleagues (Sanocki, 2003; Sanocki & Epstein, 1997). Demonstration of the necessity of layout for scene categorization has been shown by many studies finding that manipulations that remove layout make it impossible (or nearly impossible) to categorize scenes. These manipulations include phase randomization, texture synthesis from a scene based on treating it as a whole, and contour shifting (Loschky et al., 2010; Loschky et al., 2007; Walther & Shen, 2014). Demonstration of the sufficiency of layout for scene categorization has been shown by studies that greatly change pixel content but maintain layout and preserve viewers' ability to categorize scenes with at least moderate performance. These manipulations include low-pass filtering scenes, which removes higher-spatial-frequency details but leaves low-spatial-frequency layout (Schyns & Oliva, 1994), and “unbound spatial layout representations” synthesized from noise coerced to match the statistics of scenes captured by the spatial-envelope model (Oliva, 2005), which completely disrupts local details but maintains global layout. 
Future research could examine scene-gist priming using the temporal resolution of electroencephalography. The time course of priming effects can be used to infer the level at which they occur. Unexpected objects in scenes are harder to identify (Bar, 2004; Biederman et al., 1982; Davenport & Potter, 2004), and they have been found to elicit a larger N300/N400 (Ganis & Kutas, 2003; Mudrik, Lamy, & Deouell, 2010; Truman & Mudrik, 2018; Võ & Wolfe, 2013). The N300 is thought to index structural and amodal semantic mismatches generated by the visual system before semantic and associative integration take place (Hamm, Johnson, & Kirk, 2002). On the other hand, the N400 indexes access to semantic memory. It reflects the integration of meaning of an input with a preceding context (Laszlo & Federmeier, 2011). It is possible that predictions may influence even earlier event-related potential components implicated in encoding and analysis of perceptual features (P1, N1, P2). One could manipulate how well a scene category is predicted from expectations generated by the event model while holding the similarity between the prime and target constant, and use various neural markers to evaluate when predictions influence perception. 
Conclusion
Regardless of the precise mechanism, we have shown that expectations made prior to seeing a scene influence one's ability to perceive it. This influence can occur for briefly flashed scenes at the level of rapidly categorizing the scene or even simply perceiving nonrandom structure in it. According to the Scene Perception and Event Comprehension Theory, the foundation of a new event model is laid based on the gist of the first view of a scene (Loschky et al., 2018; Loschky et al., 2019). As more information is mapped onto the event model across multiple views or eye fixations, the event model generates predictions of what will be seen in the near future (Zacks, Speer, Swallow, Braver, & Reynolds, 2007). Because prediction was found to affect simple visual discrimination of scene structure, these results suggest that the event model in the back end (i.e., in working memory) influences information extraction in the front end (i.e., during single eye fixations). Future research should evaluate the time course of such top-down effects on perception. 
Acknowledgments
We thank Thomas Hinkel, Katie Tran, Megan Steele, Yuhang Ma, Kenzie J. Kriss, and Katherine E. Kolze for help in stimulus creation and data collection. We thank Adam M. Larson for a discussion that influenced our priming and masking paradigm. 
This study is based on the MS thesis of Maverick E. Smith, for which Lester C. Loschky served as Chair and Heather Bailey and Thomas Sanocki served as Committee Members. Results of the current study were previously reported in posters presented at the Vision Sciences Society Annual Meetings in 2017, 2018, and 2019, with the abstracts subsequently published in the Journal of Vision.
Commercial relationships: none. 
Corresponding authors: Maverick E. Smith; Lester C. Loschky. 
Address: Psychological Sciences, Kansas State University, Manhattan, KS, USA. 
References
Agresti, A. (2007). Introduction to categorical data analysis (2nd ed.). Hoboken, NJ: Wiley.
Bach, M. (2006). The Freiburg Visual Acuity Test—Variability unchanged by post-hoc re-analysis. Graefe's Archive for Clinical and Experimental Ophthalmology, 245 (7), 965–971, https://doi.org/10.1007/s00417-006-0474-4.
Bacon-Mace, N., Mace, M. J., Fabre-Thorpe, M., & Thorpe, S. J. (2005). The time course of visual processing: Backward masking and natural scene categorisation. Vision Research, 45, 1459–1469.
Bar, M. (2004). Visual objects in context. Nature Reviews Neuroscience, 5 (8), 617–629.
Bar, M. (2007). The proactive brain: Using analogies and associations to generate predictions. Trends in Cognitive Sciences, 11 (7), 280–289.
Bar, M., & Biederman, I. (1998). Subliminal visual priming. Psychological Science, 9 (6), 464–468, https://doi.org/10.1111/1467-9280.00086.
Bar, M., & Ullman, S. (1996). Spatial context in recognition. Perception, 25 (3), 343–352.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2014). lme4: Linear mixed-effects models using Eigen and S4 (Version 1.1-7) [Computer software]. Retrieved from http://CRAN.R-project.org/package=lme4
Biederman, I., Mezzanotte, R., & Rabinowitz, J. (1982). Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 14, 143–177.
Biederman, I., Rabinowitz, J., Glass, A., & Stacy, E. (1974). On the information extracted from a glance at a scene. Journal of Experimental Psychology, 103, 597–600.
Brandman, T., & Peelen, M. V. (2017). Interaction between scene and object processing revealed by human fMRI and MEG decoding. The Journal of Neuroscience, 37 (32), 7700–7710.
Brewer, W. F., & Loschky, L. C. (2005). Top-down and bottom-up influences on observation: Evidence from cognitive psychology and the history of science. In A. Raftopoulos (Ed.), Cognitive penetrability of perception: Attention, action, strategies, and bottom-up constraints (pp. 31–47). Hauppauge, NY: Nova Science Publishers.
Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods & Research, 33 (2), 261–304.
Caddigan, E., Choo, H., Fei-Fei, L., & Beck, D. M. (2017). Categorization influences detection: A perceptual advantage for representative exemplars of natural scene categories. Journal of Vision, 17 (1): 21, 1–11, https://doi.org/10.1167/17.1.21.
Castelhano, M. S., & Pollatsek, A. (2010). Extrapolating spatial layout in scene representations. Memory & Cognition, 38 (8), 1018–1025.
Choo, H., & Walther, D. B. (2016). Contour junctions underlie neural representations of scene categories in high-level human visual cortex. NeuroImage, 135, 32–44.
Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports, 6, 27755, https://doi.org/10.1038/srep27755.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. Mahwah, NJ: Lawrence Erlbaum Associates Publishers.
Davenport, J. L., & Potter, M. C. (2004). Scene consistency in object and background perception. Psychological Science, 15 (8), 559–564.
DeCarlo, L. T. (1998). Signal detection theory and generalized linear models. Psychological Methods, 3 (2), 186–205.
Fabre-Thorpe, M., Delorme, A., Marlot, C., & Thorpe, S. (2001). A limit to the speed of processing in ultra-rapid visual categorization of novel natural scenes. Journal of Cognitive Neuroscience, 13 (2), 171–180.
Firestone, C., & Scholl, B. J. (2015). Enhanced visual awareness for morality and pajamas? Perception vs. memory in “top-down” effects. Cognition, 136, 409–416.
Firestone, C., & Scholl, B. J. (2016). Cognition does not affect perception: Evaluating the evidence for “top-down” effects. Behavioral and Brain Sciences, 39, 1–77.
Ganis, G., & Kutas, M. (2003). An electrophysiological study of scene effects on object identification. Cognitive Brain Research, 16 (2), 123–144.
Goldiamond, I., & Hawkins, W. F. (1958). Vexierversuch: The log relationship between word-frequency and recognition obtained in the absence of stimulus words. Journal of Experimental Psychology, 56 (6), 457–463.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics (Vol. 1). New York: Wiley.
Greene, M. R., Botros, A. P., Beck, D. M., & Fei-Fei, L. (2015). What you see is what you expect: Rapid scene understanding benefits from prior experience. Attention, Perception & Psychophysics, 77 (4), 1239–1251, https://doi.org/10.3758/s13414-015-0859-8.
Greene, M. R., & Hansen, B. C. (2018). Shared spatiotemporal category representations in biological and artificial deep neural networks. PLoS Computational Biology, 14 (7), e1006327, https://doi.org/10.1371/journal.pcbi.1006327.
Greene, M. R., & Oliva, A. (2009). The briefest of glances: The time course of natural scene understanding. Psychological Science, 20 (4), 464–472.
Guyader, N., Chauvin, A., Peyrin, C., Hérault, J., & Marendaz, C. (2004). Image phase or amplitude? Rapid scene categorization is an amplitude-based process. Comptes Rendus Biologies, 327, 313–318.
Hamm, J. P., Johnson, B. W., & Kirk, I. J. (2002). Comparison of the N300 and N400 ERPs to picture stimuli in congruent and incongruent contexts. Clinical Neurophysiology, 113 (8), 1339–1350.
Hansen, B. C., & Hess, R. F. (2007). Structural sparseness and spatial phase alignment in natural scenes. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 24 (7), 1873–1885.
Hansen, B. C., & Loschky, L. C. (2013). The contribution of amplitude and phase spectra defined scene statistics to the masking of rapid scene categorization. Journal of Vision, 13 (13): 21, 1–21, https://doi.org/10.1167/13.13.21.
Hollingworth, A., & Henderson, J. M. (1998). Does consistent scene context facilitate object perception? Journal of Experimental Psychology: General, 127 (4), 398–415.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59 (4), 434–446, https://doi.org/10.1016/j.jml.2007.11.007.
Joubert, O. R., Rousselet, G. A., Fabre-Thorpe, M., & Fize, D. (2009). Rapid visual categorization of natural scene contexts with equalized amplitude spectrum and increasing phase noise. Journal of Vision, 9 (1): 2, 1–16, https://doi.org/10.1167/9.1.2.
Joubert, O. R., Rousselet, G. A., Fize, D., & Fabre-Thorpe, M. (2007). Processing scene context: Fast categorization and object interference. Vision Research, 47 (26), 3286–3297, https://doi.org/10.1016/j.visres.2007.09.013.
Kaernbach, C. (1990). A single-interval adjustment-matrix (SIAM) procedure for unbiased adaptive testing. The Journal of the Acoustical Society of America, 88 (6), 2645–2655.
Kranczioch, C., & Thorne, J. D. (2013). Simultaneous and preceding sounds enhance rapid visual targets: Evidence from the attentional blink. Advances in Cognitive Psychology, 9 (3), 130–142, https://doi.org/10.5709/acp-0139-4.
Kveraga, K., Ghuman, A. S., Kassam, K. S., Aminoff, E. A., Hämäläinen, M. S., Chaumon, M., & Bar, M. (2011). Early onset of neural synchronization in the contextual associations network. Proceedings of the National Academy of Sciences, USA, 108 (8), 3389–3394.
Larson, A. M. (2012). Recognizing the setting before reporting the action: Investigating how visual events are mentally constructed from scene images (Unpublished doctoral dissertation). Kansas State University.
Larson, A. M., Freeman, T. E., Ringer, R. V., & Loschky, L. C. (2014). The spatiotemporal dynamics of scene gist recognition. Journal of Experimental Psychology: Human Perception and Performance, 40 (2), 471–487, https://doi.org/10.1037/a0034986.
Laszlo, S., & Federmeier, K. D. (2011). The N400 as a snapshot of interactive processing: Evidence from regression analyses of orthographic neighbor and lexical associate effects. Psychophysiology, 48 (2), 176–186.
Loschky, L. C., Hansen, B. C., Sethi, A., & Pydimari, T. (2010). The role of higher-order image statistics in masking scene gist recognition. Attention, Perception & Psychophysics, 72 (2), 427–444.
Loschky, L. C., Hutson, J. P., Smith, M. E., Smith, T. J., & Magliano, J. P. (2018). Viewing static visual narratives through the lens of the Scene Perception and Event Comprehension Theory (SPECT). In J. Laubrock, J. Wildfeuer, & A. Dunst (Eds.), Empirical comics research: Digital, multimodal, and cognitive methods (pp. 217–238). New York, NY: Routledge.
Loschky, L. C., & Larson, A. M. (2008). Localized information is necessary for scene categorization, including the Natural/Man-made distinction. Journal of Vision, 8 (1): 4, 1–9, https://doi.org/10.1167/8.1.4.
Loschky, L. C., Larson, A. M., Smith, T. J., & Magliano, J. P. (2019). The Scene Perception and Event Comprehension Theory (SPECT). Topics in Cognitive Science. Advance online publication. https://doi.org/10.1111/tops.12455.
Loschky, L. C., Sethi, A., Simons, D. J., Pydimari, T., Ochs, D., & Corbeille, J. (2007). The importance of information localization in scene gist recognition. Journal of Experimental Psychology: Human Perception & Performance, 33 (6), 1431–1450.
Magliano, J. P., Loschky, L. C., Clinton, J. A., & Larson, A. M. (2013). Is reading the same as viewing? An exploration of the similarities and differences between processing text- and visually based narratives. In B. Miller, L. Cutting, & P. McCardle (Eds.), Unraveling the behavioral, neurobiological, and genetic components of reading comprehension (pp. 78–90). Baltimore, MD: Brookes.
Muckli, L., De Martino, F., Vizioli, L., Petro, L. S., Smith, F. W., Ugurbil, K.,… Yacoub, E. (2015). Contextual feedback to superficial layers of V1. Current Biology, 25 (20), 2690–2695, https://doi.org/10.1016/j.cub.2015.08.057.
Mudrik, L., Lamy, D., & Deouell, L. Y. (2010). ERP evidence for context congruity effects during simultaneous object–scene processing. Neuropsychologia, 48 (2), 507–517.
Oliva, A. (2005). Gist of a scene. In L. Itti, G. Rees, & J. K. Tsotsos (Eds.), Neurobiology of attention (pp. 251–256). Burlington, MA: Elsevier Academic Press.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42 (3), 145–175.
Pezdek, K., Whetstone, T., Reynolds, K., Askari, N., & Dougherty, T. (1989). Memory for real-world scenes: The role of consistency with schema expectation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15 (4), 587–595.
Potter, M. C. (1976). Short-term conceptual memory for pictures. Journal of Experimental Psychology: Human Learning & Memory, 2 (5), 509–522.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological methodology, (pp. 111–163). Washington, DC: American Sociological Association.
Ramkumar, P., Hansen, B. C., Pannasch, S., & Loschky, L. C. (2016). Visual information representation and rapid-scene categorization are simultaneous across cortex: An MEG study. NeuroImage, 134, 295–304, https://doi.org/10.1016/j.neuroimage.2016.03.027.
Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2 (1), 79–87.
Rayner, K. (2009). Eye movements and attention in reading, scene perception, and visual search. Quarterly Journal of Experimental Psychology, 62 (8), 1457–1506, https://doi.org/10.1080/17470210902816461.
Reinitz, M. T., Wright, E., & Loftus, G. R. (1989). Effects of semantic priming on visual encoding of pictures. Journal of Experimental Psychology: General, 118 (3), 280–297.
Rousselet, G. A., Joubert, O. R., & Fabre-Thorpe, M. (2005). How long to get to the “gist” of real-world natural scenes? Visual Cognition, 12 (6), 852–877.
Salin, P. A., & Bullier, J. (1995). Corticocortical connections in the visual system: Structure and function. Physiological Reviews, 75 (1), 107–154.
Sanocki, T. (2003). Representation and perception of spatial layout. Cognitive Psychology, 47, 43–86.
Sanocki, T., & Epstein, W. (1997). Priming spatial layout of scenes. Psychological Science, 8 (5), 374–378.
Schyns, P. G., & Oliva, A. (1994). From blobs to boundary edges: Evidence for time- and spatial-scale-dependent scene recognition. Psychological Science, 5, 195–200.
Serre, T., Oliva, A., & Poggio, T. (2007). A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences, USA, 104 (15), 6424–6429, https://doi.org/10.1073/pnas.0700622104.
Sperber, R. D., McCauley, C., Ragain, R. D., & Weil, C. M. (1979). Semantic priming effects on picture and word processing. Memory & Cognition, 7 (5), 339–345.
Thorpe, S. J., Fize, D., & Marlot, C. (1996, June 6). Speed of processing in the human visual system. Nature, 381 (6582), 520–522.
Torralba, A., Oliva, A., Castelhano, M. S., & Henderson, J. M. (2006). Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113 (4), 766–786.
Truman, A., & Mudrik, L. (2018). Are incongruent objects harder to identify? The functional significance of the N300 component. Neuropsychologia, 117, 222–232.
Tversky, B., & Hemenway, K. (1983). Categories of environmental scenes. Cognitive Psychology, 15 (1), 121–149.
Ullman, S. (1995). Sequence seeking and counter streams: A computational model for bidirectional information flow in the visual cortex. Cerebral Cortex, 5 (1), 1–11.
VanRullen, R., & Thorpe, S. J. (2002). Surfing a spike wave down the ventral stream. Vision Research, 42 (23), 2593–2615.
Võ, M. L.-H., & Wolfe, J. M. (2013). Differential electrophysiological signatures of semantic and syntactic scene processing. Psychological Science, 24 (9), 1816–1823, https://doi.org/10.1177/0956797613476955.
Wagenmakers, E.-J., & Farrell, S. (2004). AIC model selection using Akaike weights. Psychonomic Bulletin & Review, 11 (1), 192–196.
Walther, D. B., & Shen, D. (2014). Nonaccidental properties underlie human categorization of complex natural scenes. Psychological Science, 25 (4), 851–860, https://doi.org/10.1177/0956797613512662.
Williams, E. (1949). Experimental designs balanced for the estimation of residual effects of treatments. Australian Journal of Chemistry, 2 (2), 149–168.
Wright, D. B., Horry, R., & Skagerberg, E. M. (2009). Functions for traditional and multilevel approaches to signal detection theory. Behavior Research Methods, 41 (2), 257–267.
Zacks, J., Speer, N., Swallow, K., Braver, T., & Reynolds, J. (2007). Event perception: A mind-brain perspective. Psychological Bulletin, 133 (2), 273–293, https://doi.org/10.1037/0033-2909.133.2.273.
Footnotes
1  Scene gist is an important theoretical construct in theories of scene perception (Rayner, 2009). The gist of a scene influences attentional selection (Torralba, Oliva, Castelhano, & Henderson, 2006), object recognition (Davenport & Potter, 2004), and long-term memory for a scene's contents (Pezdek, Whetstone, Reynolds, Askari, & Dougherty, 1989). Following the convention of prior research, we operationalized gist in terms of rapid basic-level scene categorization (Tversky & Hemenway, 1983). Importantly, the theoretical construct of gist implies more than how it is measured; therefore, when discussing the theoretical construct we refer to scene gist, and when discussing the method and results we use rapid scene categorization or recognition.
2  No more than three images from a repeated category were ever shown to participants; therefore, participants could still predict that there would be a category shift after viewing the third image from a repeated category. They could not predict when there would be a category shift after viewing the first or second image from a repeated category.
3  To provide the reader with an estimate of effect size, we reran our analyses using a repeated-measures analysis of variance with proportion correct as the dependent measure. We calculated η2p using the sum of squares for the effect divided by the sum of the sum-of-squares error and the sum of squares for the effect.
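The effect-size formula described in this footnote can be sketched in a few lines. This is a minimal illustration only: the function name and the example sums of squares are hypothetical and are not values from the study.

```python
def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Partial eta squared: SS_effect / (SS_effect + SS_error)."""
    if ss_effect < 0 or ss_error < 0:
        raise ValueError("sums of squares must be non-negative")
    denominator = ss_effect + ss_error
    if denominator == 0:
        raise ValueError("effect and error sums of squares are both zero")
    return ss_effect / denominator


# Hypothetical sums of squares, chosen only to illustrate the arithmetic.
example_effect_size = partial_eta_squared(ss_effect=2.0, ss_error=6.0)
print(example_effect_size)
```

Because the denominator contains only the effect and its own error term, the value describes the proportion of variance attributable to the effect after excluding variance explained by other effects in the design.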
4  Note that the accuracy for zero within-scene category primes is a control condition. Instances of zero within-scene category primes were either the first image within a trial, and thus not primed, or were primed by between-scenes categories.
5  One way in which predictions would be greater than chance in the randomized condition is if a participant were to know the full range of possible categories and the total sequence length, and to keep track of how many instances of each category were shown in the RSVP stream in each trial. One would then have to use this knowledge to calculate the probability that a remaining image would be one of each of the categories. This would have to be done within the RSVP image-sequence presentation rate of one image every 300 ms. This possibility seems highly unlikely.
6  Data collection is scheduled one week at a time, and there is variability between the number of participants who chose to sign up for available times and the number who actually came at the times they signed up for. This degree of uncertainty necessarily creates variability in the total number of participants whose data is collected when targeting a specific number of participants.
7  Ideally, one would like to obtain prediction and recognition data from the same participants; however, doing so would produce order effects, such that if the prediction task was first, those predictions could affect the later recognition task, or vice versa. Thus, a within-subject design would be fatally flawed, and so it was necessary to have different participant groups for the two experiments.
Figure 1
 
Trial schematic of the sequence of events within two example trials. (a) A simplified version of a trial, showing the sequence of screens up to the response. The sequence of scenes in (i) is coherent, beginning with an office and ending with a scene of a parking lot; in (ii) the same images are shown in an example randomized sequence. (b) A more complete version of a trial, showing the continuation of the sequences after participants responded. Such a continuation occurred for any sequence in which the target appeared before the 10th image. The continuation was shown so that viewers always saw a full 10-image sequence, regardless of which image in the sequence was the target.
Figure 2
 
Experiment 1: Rapid-scene-gist categorization accuracy as a function of the spatiotemporal coherence of the scene sequences and image location (on-campus vs. off-campus). Error bars represent 95% confidence intervals around the estimated mean. The probability of correctly identifying the scene by chance on any given trial was 12.5%, as represented by the dashed line.
Figure 3
 
Experiment 1: Rapid-scene-gist categorization accuracy as a function of the number of primes. (a) The example target is the office prior to the mask. The target was preceded by either zero, one, or two primes from the same category as the target (i.e., within-category priming). (b) The example target is the hallway. It was preceded by either one, two, or three primes that were of a different category than the target (i.e., between-categories priming). (Note that the case of zero within-category primes was, logically, either the first image in the 10-image sequence or a between-categories prime.) Error bars represent 95% confidence intervals around the estimated mean probability correct.
Figure 4
 
Experiment 1: Probability of correctly categorizing a scene image as a function of image similarity and the spatiotemporal coherence of scene sequences.
Figure 5
 
Rapid-scene-categorization accuracy from Experiment 1 as a function of whether the target image was predictable in Experiment 2 (yes vs. no).
Figure 6
 
Probability of correctly rapidly categorizing a scene image in Experiment 1 as a function of image visual similarity and whether the target was predictable in Experiment 2.
Figure 7
 
Examples of the stimuli used in Experiment 3. All images were equalized for both mean luminance and root mean square contrast. This equalization resulted in a contrast reduction for most images.
Figure 8
 
Experiment 3: The probability of responding that the target scene was intact as a function of the stimulus type and spatiotemporal coherence. Estimated values of d′ calculated from the graph on the left are plotted on the right. Error bars represent 95% confidence intervals around the estimated means.
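For readers unfamiliar with how sensitivity and response bias are separated, the following sketch shows the classic equal-variance signal-detection computation from response proportions. It is an assumption-laden illustration, not the estimation procedure used in the study: the function name and the example rates are hypothetical, a "hit" is taken to be an "intact" response to an intact scene, and a "false alarm" is taken to be an "intact" response to a phase-randomized scene.

```python
from statistics import NormalDist


def dprime_and_criterion(hit_rate: float, false_alarm_rate: float) -> tuple[float, float]:
    """Equal-variance signal detection: d' = z(H) - z(FA), c = -(z(H) + z(FA)) / 2."""
    for rate in (hit_rate, false_alarm_rate):
        # Rates of exactly 0 or 1 have no finite z score and need a correction first.
        if not 0.0 < rate < 1.0:
            raise ValueError("rates must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    sensitivity = z_hit - z_fa
    criterion = -0.5 * (z_hit + z_fa)
    return sensitivity, criterion


# Hypothetical response proportions, not data from Experiment 3.
d_prime, c = dprime_and_criterion(hit_rate=0.8, false_alarm_rate=0.2)
print(round(d_prime, 3), round(c, 3))
```

In this framing, a change in d′ with an unchanged criterion indicates better discrimination of intact from phase-randomized scenes rather than a shift in willingness to respond "intact".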
Figure 9
 
Estimated d′ as a function of within-category and between-categories priming in Experiment 3.
Table 1
 
Zero-order correlations between rapid scene categorization, image prediction accuracy, and image similarity. Notes: Image similarity is the reciprocal of the difference between the gist descriptor of the target image and that of its immediately preceding prime, as output by the spatial-envelope model (Oliva & Torralba, 2001). M = mean; SE = standard error.
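The similarity measure in the table note can be sketched as follows. The note does not say how the "difference" between two gist descriptors is computed, so this sketch assumes Euclidean distance; the function name and the toy descriptor vectors are likewise hypothetical and stand in for real spatial-envelope model output.

```python
import math
from typing import Sequence


def image_similarity(target_gist: Sequence[float], prime_gist: Sequence[float]) -> float:
    """Reciprocal of the distance between two gist descriptors (Euclidean distance assumed)."""
    if len(target_gist) != len(prime_gist):
        raise ValueError("gist descriptors must have the same length")
    distance = math.dist(target_gist, prime_gist)
    if distance == 0:
        # Identical descriptors would give an infinite similarity.
        raise ValueError("descriptors are identical; similarity is undefined")
    return 1.0 / distance


# Toy two-dimensional descriptors; real gist descriptors have many more dimensions.
print(image_similarity([0.0, 0.0], [3.0, 4.0]))
```

Taking the reciprocal means that larger values indicate a prime that is more visually similar to the target.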
Supplement 1