Open Access
Article  |   August 2024
Feature binding is slow: Temporal integration explains apparent ultrafast binding
Author Affiliations
  • Lucija Blaževski
    Department of Psychology, University of Amsterdam, Amsterdam, The Netherlands
    blazevskilucija@gmail.com
  • Timo Stein
    Department of Psychology, University of Amsterdam, Amsterdam, The Netherlands
    timo@timostein.de
  • H. Steven Scholte
    Department of Psychology, University of Amsterdam, Amsterdam, The Netherlands
    h.s.scholte@uva.nl
Journal of Vision August 2024, Vol. 24(8), 3. https://doi.org/10.1167/jov.24.8.3
Abstract

Visual perception involves binding of distinct features into a unified percept. Although traditional theories link feature binding to time-consuming recurrent processes, Holcombe and Cavanagh (2001) demonstrated ultrafast, early binding of features that belong to the same object. The task required binding of orientation and luminance within an exceptionally short presentation time. However, because visual stimuli were presented over multiple presentation cycles, their findings can alternatively be explained by temporal integration over the extended stimulus sequence. Here, we conducted three experiments manipulating the number of presentation cycles. If early binding occurs, one extremely short cycle should be sufficient for feature integration. Conversely, late binding theories predict that successful binding requires substantial time and improves with additional presentation cycles. Our findings indicate that task-relevant binding of features from the same object occurs slowly, supporting late binding theories.

Introduction
Within the retina, the visual world is captured by a heterogeneous array of receptors that sample the impact of photons over time. Subsequently, this information is integrated into more complex local features (Wandell, 1995), resulting in a diverse set of spatiotemporal receptive fields (Olshausen & Field, 1996; Rust, Schwartz, Movshon, & Simoncelli, 2005), among them neurons with a preferential tuning for color, motion, and a range of shapes (Desimone & Schein, 1987; Pasupathy & Connor, 2002). The binding problem was first formulated as the question of how different features (color, shape) can subsequently be associated with each other if they belong to the same object (e.g., Rosenblatt, 1961; von der Malsburg, 1994). Treisman and Schmidt (1982) extended the scope of this question by reframing it as how distinct features (color, shape) are correctly and unambiguously assigned to different objects. 
The timing, mechanism, and nature of the binding process remain the subject of ongoing debate, but there is broad agreement that how locally sampled information is combined to form an integrated percept is a central question of vision science. Traditional late binding theories propose that binding occurs at later stages of visual processing, involving attention-dependent recurrent processes in high-level visual and parietal cortex that “glue” together information distributed in the brain (Bouvier & Treisman, 2010; Di Lollo, Enns, & Rensink, 2000; Hochstein & Ahissar, 2002; Koivisto & Silvanto, 2012; Lamme & Roelfsema, 2000; Roelfsema, 2006; Treisman, 1996). According to these theories, low-level feature detectors in early visual cortex initially project to object-processing areas in higher visual cortex in a purely feedforward manner. Conjunctions of features are believed to be generated by subsequent recurrent interactions between these object-processing areas and early visual cortex. This implies that features must be retraced to early visual cortex to ensure accurate binding (Roelfsema, 2006). From this perspective, the hallmark of late binding theories is that binding is a temporally slow process. 
In contrast, early binding theories propose that feature binding can occur in a purely feedforward manner at early, pre-attentive stages of visual processing, including primary visual cortex (Blaser, Papathomas, & Vidnyánszky, 2005; Holcombe & Cavanagh, 2001; Nakayama & Motoyoshi, 2019; Seymour, Clifford, Logothetis, & Bartels, 2009). In a landmark study providing evidence for early binding, Holcombe and Cavanagh (2001) found that participants accurately reported conjunctions of orientation and luminance for rapidly alternating gratings presented for very short presentation times (∼14 ms) in a continuous stimulus stream. This was interpreted as evidence that feature conjunctions were processed instantaneously on detection of those features at a very early, pre-attentive stage of visual processing, eliminating the need for explicit binding processes. In contrast to this effect, substantially longer presentation times were required when the features were spatially segregated, highlighting the difference between binding features within one object and across multiple objects (Holcombe & Cavanagh, 2001). This view of early, rapid binding aligns with the observation that we can perform object detection fast (Thorpe, Fize, & Marlot, 1996). However, one concern with Holcombe and Cavanagh's interpretation is that the apparent ultrafast feature binding may have resulted from temporal integration over the continuous stimulus stream, rather than reflecting instantaneous binding after a single presentation. Although masking would have interfered with integration by interrupting recurrent processing (Fahrenfort, Scholte, & Lamme, 2007), a mask was presented only at the end of the whole presentation sequence of multiple cycles. Furthermore, masking disrupts but does not necessarily terminate recurrent processing. Indeed, in one study adopting a similar experimental paradigm, Nakayama and Motoyoshi (2019) used only two consecutive cycles and found that much longer presentation times (∼67 ms instead of ∼14 ms) were required. Because binding was independent of attentional oscillations in the alpha range, these results were nevertheless interpreted as evidence for early, pre-attentive feature binding. 
In this article, we focus on the claim from Holcombe and Cavanagh (2001) that binding is fast for features that belong to the same object. To address this question more systematically, we used a modified version of Holcombe and Cavanagh's (2001) paradigm, manipulating the number of presentation cycles. The early-binding hypothesis predicts accurate feature integration with one short presentation cycle. By contrast, the late-binding hypothesis predicts that substantially more presentation time is required with one presentation cycle than with multiple cycles and that accuracy improves with more cycles, reflecting recurrent processing over an extended presentation sequence. To foreshadow our results, across three experiments we show that many alternations are needed before participants can perform this task reliably at short presentation times, providing strong support for a late-binding account. These findings challenge previous interpretations and highlight the importance of recurrent processes in perceptual binding. 
Experiment 1
Experiment 1 tested whether accuracy in reporting feature conjunctions increased with the number of presentation cycles, as predicted by late binding theories, or whether accurate binding could occur even with a single cycle of very briefly presented stimuli (∼17 ms), as predicted by early binding theories. To further test whether binding would be possible based on purely feedforward processing, for the single-cycle condition we also compared the standard condition, in which a backward mask terminated the presentation, with an unmasked condition. Masking should leave feedforward processing largely intact while disrupting recurrent processing (Fahrenfort et al., 2007). In addition, we manipulated the stimuli's spatial frequency to differentially engage the magnocellular (sensitive to lower spatial frequencies) and parvocellular (sensitive to higher spatial frequencies) pathways (Kauffmann, Ramanoël, & Peyrin, 2014). Given the faster transmission along the magnocellular pathway (Maunsell & Gibson, 1992), we expected more accurate binding for stimuli with low spatial frequencies (SFs). 
Methods
Participants
For all experiments, we recruited participants through the University of Amsterdam subject pool. To meet the inclusion criteria, participants had to be fluent in English and have normal or corrected-to-normal vision. All experimental procedures were approved by the University of Amsterdam Ethics Review Board, following the Declaration of Helsinki. Informed consent was obtained following the approved procedures. Eight participants (Mage = 23.25 years, five female) took part in Experiment 1. 
Procedure
In Experiment 1 we investigated the effects of the number of cycles and spatial frequency on task performance. We studied their impact on the processing of the conjunction of luminance and orientation in an experimental paradigm similar to the spatially superimposed condition of Holcombe and Cavanagh (2001). The number of cycles refers to the number of times the luminance-orientation conjunctions were presented, with eight levels: one, two, three, four, eight, 16, 32, or 64 cycles. Because we combined two features (luminance and orientation) with two levels each (black/white and left/right), each cycle consisted of the presentation of two alternating stimuli. In the one-cycle condition, a black oriented grating and a white oriented grating were each flashed only once; in the two-cycle condition, this was repeated, and so on. On a given trial, both stimuli had the same SF (high or low; see Materials for a detailed description of the stimuli). To verify that the task could be solved, we additionally included a one-cycle condition in which the mask was absent. 
Participants fixated on a cross at the center of the screen (Figure 1). The duration of the fixation period varied between one and two seconds. Next, two luminance-orientation pairs were presented in rapid succession. Each stimulus in a pair was presented for the fixed duration of two frames (i.e., 16.7 ms); thus, each presentation cycle lasted 33.3 ms. The presentation sequence contained a fixed number of cycles. In all trials, except in the one-cycle-unmasked condition, the presentation sequence was followed by a white-noise pattern mask for 250 ms. In the one-cycle-unmasked condition, a blank screen was presented instead. Half of the trials consisted of a display alternating between a black grating tilted to the left and a white grating tilted to the right. In the other half of the trials, the pairing of luminance and orientation was reversed. The number of cycles, spatial frequency, presentation locations (upper/lower visual field), and order (first/second pair) were randomized across trials. At the end of a trial, participants reported whether black was paired with the right or left orientation. No feedback was provided. Trials were separated by a 250-ms blank screen. Every participant completed a total of 900 trials, distributed evenly across conditions, with each condition comprising 50 trials. 
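To make the trial timing concrete, the following minimal sketch (illustrative only, not the authors' PsychoPy script; all names, and any value not stated above, are assumptions) shows how a given number of cycles translates into a per-frame stimulus schedule at the 120 Hz refresh rate.

FRAME_MS = 1000 / 120          # one frame at 120 Hz, ~8.33 ms
FRAMES_PER_STIMULUS = 2        # two frames = ~16.7 ms per grating

def trial_schedule(n_cycles, black_orientation="left"):
    """Return one (luminance, orientation) entry per frame of the stimulus sequence.

    Each cycle shows a black grating and a white grating with complementary
    orientations, each for FRAMES_PER_STIMULUS frames (one cycle = ~33.3 ms).
    """
    white_orientation = "right" if black_orientation == "left" else "left"
    pair = [("black", black_orientation), ("white", white_orientation)]
    frames = []
    for _ in range(n_cycles):
        for luminance, orientation in pair:
            frames.extend([(luminance, orientation)] * FRAMES_PER_STIMULUS)
    return frames

# Example: a two-cycle trial occupies 8 stimulus frames (~66.7 ms),
# followed by the 250 ms mask (or a blank screen in the unmasked condition).
frames = trial_schedule(n_cycles=2, black_orientation="left")
print(len(frames), "frames,", round(len(frames) * FRAME_MS, 1), "ms")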
Figure 1.
 
Trial layout. (A) In Experiment 1, participants fixated on a central cross for 1000–2000 ms, followed by rapid, 16.67 ms presentations of luminance-orientation pairs for a fixed number of cycles (N cycles). A 250 ms white-noise pattern mask was then shown, followed by a 250 ms blank screen. In the unmasked condition, a blank screen was immediately presented. Participants indicated whether black was paired with a left or right orientation using arrow keys. Experiment 2 followed the same procedure but with variable stimulus presentation times and a mask in all trials. (B) In Experiment 3, the layout resembled Experiment 2, but with medium spatial frequency stimuli and a dynamic noise mask of high or low SF.
Before the main experimental session started, participants completed three practice sessions, during which they received immediate feedback. In the first, “easy,” practice session, participants had to respond correctly on 10 consecutive trials (out of a total of 30 trials) to proceed. This session included only the one-cycle condition, with individual stimuli presented for 730 ms each. All participants completed this practice session on the first attempt. In the second practice session, trials included more presentation cycles (i.e., two cycles, four cycles, etc.), with stimulus durations fixed at 182 or 364 ms. The practice session ended when participants achieved 80% correct responses on a minimum of 10 and a maximum of 30 trials. All participants completed this practice session on the first attempt. The third practice session closely resembled the actual experiment, except that participants received feedback and stimuli were presented for 33.3 ms. Participants completed eight trials with randomly selected numbers of cycles. 
Materials
The study was conducted in the University of Amsterdam's behavioral laboratories using a 27″ monitor with a frame rate of 120 Hz (resolution of 2560 × 1440 pixels), at a viewing distance of approximately 75 cm. The experiment was created using PsychoPy (Peirce et al., 2019). Stimuli were two luminance-orientation pairs consisting of square-wave grating patterns at full contrast. They were presented on a gray background (45.53 cd/m2). The size of the stimuli was 2.2° of visual angle. SFs of the square-wave gratings were 1 or 5 cycles per degree (cpd). The spatial phase was randomized across trials. The gratings were either black (0.27 cd/m2) or white (193.69 cd/m2) and oriented either leftward (135°) or rightward (45°). Stimuli were offset 0.3° of visual angle above the center of the fixation point. 
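As an illustration of such stimuli, the following NumPy sketch constructs a square-wave pattern of a given spatial frequency and orientation; the mapping of the bars onto the black or white luminance values and the gray background, as well as all parameter names, are assumptions rather than the authors' stimulus code.

import numpy as np

def square_wave_pattern(size_px=256, size_deg=2.2, cycles_per_deg=1.0,
                        orientation_deg=45.0, phase_rad=0.0):
    """Binary square-wave pattern: 1 where bars are drawn, 0 elsewhere.

    The bars would then be rendered at the grating luminance (black or white)
    on the gray background by the display code.
    """
    coords = np.linspace(-size_deg / 2, size_deg / 2, size_px)
    x, y = np.meshgrid(coords, coords)
    theta = np.deg2rad(orientation_deg)
    u = x * np.cos(theta) + y * np.sin(theta)   # coordinate along the modulation axis
    return (np.sin(2 * np.pi * cycles_per_deg * u + phase_rad) > 0).astype(float)

rng = np.random.default_rng()
low_sf = square_wave_pattern(cycles_per_deg=1, orientation_deg=135,
                             phase_rad=rng.uniform(0, 2 * np.pi))  # randomized phase
high_sf = square_wave_pattern(cycles_per_deg=5, orientation_deg=45,
                              phase_rad=rng.uniform(0, 2 * np.pi))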
Analyses
We calculated d′, with hits defined as correct trials in which black was paired with leftward orientation, and false alarms as incorrect trials in which black was paired with rightward orientation. Hit and false alarm rates equal to 0 or 1 were adjusted to 1/(2N) and 1 − 1/(2N), respectively, where N represents the total number of trials on which the proportion is based (Macmillan & Creelman, 2005). 
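This sensitivity measure can be summarized in a short sketch (illustrative, not the authors' analysis script; the example trial counts are hypothetical):

from scipy.stats import norm

def dprime(n_hits, n_signal, n_fa, n_noise):
    """d' with hit/false-alarm rates of 0 or 1 adjusted to 1/(2N) and 1 - 1/(2N)."""
    def adjust(rate, n):
        if rate == 0:
            return 1 / (2 * n)
        if rate == 1:
            return 1 - 1 / (2 * n)
        return rate
    hit_rate = adjust(n_hits / n_signal, n_signal)
    fa_rate = adjust(n_fa / n_noise, n_noise)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Hypothetical condition with 50 "black-left" and 50 "black-right" trials:
print(round(dprime(n_hits=45, n_signal=50, n_fa=10, n_noise=50), 2))  # d' ≈ 2.12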
We report both frequentist statistics and Bayes factors (BFs), calculated in JASP software (version 0.18; JASP Team, 2024) with the default prior (Cauchy distribution with scale 0.707). Results were visualized using R (version 4.2.3; R Core Team, 2023). For analyses of variance (ANOVAs), degrees of freedom were Greenhouse-Geisser corrected whenever Mauchly's test indicated a violation of the sphericity assumption. 
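The ANOVAs and Bayes factors reported below were computed in JASP; for readers working in Python, an equivalent frequentist repeated-measures ANOVA with Greenhouse-Geisser correction could be sketched as follows (illustrative only; the file name and column labels are assumptions, and this one-way example covers only the number-of-cycles factor):

import pandas as pd
import pingouin as pg

# Hypothetical long-format table with columns: participant, cycles, sf, dprime
df = pd.read_csv("experiment1_dprime.csv")
aov = pg.rm_anova(data=df, dv="dprime", within="cycles",
                  subject="participant", correction=True)
print(aov)  # ANOVA table including a Greenhouse-Geisser corrected p-value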
Data, materials, and software availability
The experiment scripts, data, and analyses for all experiments are available in Blaževski, Stein, and Scholte (2024). 
Results and discussion
A repeated-measures ANOVA with the factors number of cycles and SF revealed that sensitivity increased with the number of cycles, F(2.79, 19.51) = 126.90, p < 0.001, ηp2 = 0.948, BF10 = 5.96e24 (Figure 2). Thus, reporting feature conjunctions appears to benefit from integration over time. Furthermore, sensitivity in the one-cycle masked condition was substantially lower than that reported by Holcombe and Cavanagh (2001). Together, these findings strongly indicate that their finding of ultrafast binding was driven by the use of a continuous stimulus presentation sequence. The ANOVA also revealed that sensitivity was higher for low SFs, F(1, 7) = 50.50, p < 0.001, ηp2 = 0.878, BF10 = 33.04, possibly reflecting faster magnocellular transmission, and that the interaction between the number of cycles and spatial frequency was significant, F(7, 49) = 8.02, p < 0.001, ηp2 = 0.534, BF10 = 5.30e5. As can be seen in Figure 2, this interaction reflected a greater advantage of low over high SF at intermediate numbers of presentation cycles than at the shortest and longest cycles, possibly reflecting floor and ceiling effects, respectively. 
Figure 2.
 
Sensitivity (d′) as a function of the number of cycles, the spatial frequency of stimuli, and the mask in Experiment 1 (N = 8). Note: Error bars represent the standard error of the mean across participants (SEM).
For the one-cycle condition, a repeated-measures ANOVA with the presence of the mask and SF as factors revealed better performance in the unmasked condition, F(1, 7) = 49.44, p < 0.01, ηp2 = 0.876, BF10 = 152.77. Thus, as expected, a mask, which interrupts recurrent processing and the formation of afterimages, interferes with the binding process. Furthermore, here too performance was better for low SFs, F(1, 7) = 8.16, p = 0.024, ηp2 = 0.538, BF10 = 3.08. The interaction was not significant, F(1, 7) = 4.152, p = 0.08, ηp2 = 0.372, BF10 = 1.44. These results provide initial evidence against the early binding hypothesis and support the role of recurrent processing in feature binding. 
Experiment 2
To determine the time required for feature binding, in Experiment 2 we fixed the performance level at 75% and used adaptive staircases to measure how temporal thresholds for feature binding depended on the number of presentation cycles and SF. 
Methods
Participants
We recruited 11 participants. The staircase procedure did not converge for one participant (see Staircase), so the final sample consisted of 10 participants (Mage = 21.2 years, seven female). 
Procedure
The general procedure and trial layout followed Experiment 1, except that presentation time was controlled by ten adaptive staircases (crossing one, two, three, four, or six presentation cycles with low and high SF). 
Materials
The apparatus and stimuli were identical to those in Experiment 1, except that the monitor refresh rate was set to 165 Hz. 
Staircase
We aimed to determine the presentation time needed to reach 75% accuracy using the weighted 3-up-1-down method (Kaernbach, 1991). Correct responses shortened the stimulus duration by one frame, whereas incorrect responses extended it by three frames. The staircase procedure had two phases: the first phase ended after five reversals, and the second phase continued until 25 reversals or 100 trials were reached for each condition. The final threshold value was calculated as the median of the stimulus durations at the reversals in the second phase. 
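The logic of this procedure can be sketched as follows (illustrative only; the starting duration, the simulated observer, and all names are assumptions, not the authors' code):

import statistics

FRAME_MS = 1000 / 165            # one frame at 165 Hz, ~6.06 ms

def run_staircase(respond, start_frames=30, max_trials=100,
                  phase1_reversals=5, phase2_reversals=25):
    """respond(duration_frames) -> True if the response was correct.

    Correct responses shorten the stimulus by one frame, incorrect responses
    lengthen it by three frames, converging on ~75% correct (Kaernbach, 1991).
    The threshold is the median duration at the reversals of the second phase.
    """
    duration = start_frames
    last_direction = 0           # +1 lengthening, -1 shortening
    reversals = []
    for _ in range(max_trials):
        correct = respond(duration)
        direction = -1 if correct else +1
        if last_direction and direction != last_direction:
            reversals.append(duration)
        last_direction = direction
        duration = max(1, duration + (-1 if correct else +3))  # floor of one frame
        if len(reversals) >= phase1_reversals + phase2_reversals:
            break
    second_phase = reversals[phase1_reversals:]
    return statistics.median(second_phase) * FRAME_MS if second_phase else None

# Example with a crude simulated observer (correct above 12 frames, guessing below):
import random
print(run_staircase(lambda d: d >= 12 or random.random() < 0.5))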
One participant completed substantially fewer reversals in the low spatial frequency conditions: seven, five, and zero reversals in the second phase of the three-, four-, and six-cycle conditions, respectively. Because a reliable threshold could not be calculated, this participant was excluded from the analysis. 
The remaining participants completed at least 15 reversals in the second phase, except in the six-cycle low-SF condition, where the staircase procedure reached the floor. Because of the high accuracy and the constraint that the stimulus duration could not be shorter than one frame (6.06 ms at 165 Hz), it was not feasible to calculate a reliable threshold for achieving 75% accuracy in this condition. The six-cycle low-SF condition was therefore dropped from the analysis, together with the six-cycle high-SF condition, to ensure a balanced design. 
Results and discussion
Results largely replicated the findings from Experiment 1. Increasing the number of cycles significantly decreased the minimum stimulus duration, F(1.46, 13.12) = 32.33, p < 0.001, ηp2 = 0.782, BF10 = 3.82e6, and low SF was associated with shorter presentation times, F(1, 9) = 8.06, p = 0.019, ηp2 = 0.472, BF10 = 4.01 (Figure 3). The interaction was not significant, F(2.03, 18.27) = 0.87, p = 0.44, ηp2 = 0.088, BF10 = 0.26. 
Figure 3.
 
Minimum stimulus duration as a function of the number of cycles and the spatial frequency of stimuli in Experiment 2 (N = 10). Note: Error bars represent the standard error of the mean across participants (SEM).
For every number of cycles, our thresholds were notably higher than those reported by Holcombe and Cavanagh (2001), emphasizing the role of temporal integration across multiple cycles in their study. For a single cycle, our threshold was 100 ms, in contrast to the 14 ms reported for the previous study's continuous cycles. Moreover, the threshold in the one-cycle condition was considerably higher than in all other cycle conditions, all t > 5.43, pholm < 0.001, Cohen's d > 1.536, BF10 > 3.65e3. These results demonstrate that completing the task requires either an extended presentation duration or multiple presentation cycles to allow for temporal integration. We also again observed that the task took longer to solve with high-SF stimuli. This may stem from the fact that neurons sensitive to low spatial frequencies transmit information more rapidly than their counterparts tuned to high spatial frequencies. Alternatively, it could be attributed to the mask's efficacy, as the mask primarily comprised high spatial frequencies. 
Experiment 3
Experiment 3 was designed as a replication of Experiment 2 and to further investigate the significance of spatial frequency (SF). Additionally, we aimed to explore whether our findings might be contingent on the spatial frequency of the masking stimulus. We therefore added stimuli with medium SF and manipulated the SF of the mask to rule out potential interactions between stimulus and mask SF. 
Methods
Participants
Twelve participants took part in Experiment 3 (Mage = 22.83 years; nine female). 
Procedure
The general procedure, trial layout and staircase procedure were the same as in Experiment 2, but three factors were manipulated: the number of presentation cycles (1, 2, or 3), the SF of the stimuli (low, medium, high), and the SF of the mask (low and high). 
Materials
Spatial frequencies of the square wave gratings were 1, 3, or 5 cpd. To create a dynamic noise mask of either high or low SF, we generated images using the method by Gatys, Ecker, and Bethge (2015). A convolutional neural network trained on object recognition (VGG-19, Simonyan & Zisserman, 2014) was used to obtain textures from the original stimuli for each combination of color, orientation, and SF. These synthesized images were spatially constrained by VGG-19’s conv1_1 layer. Each dynamic mask was made of four different synthesized images of the same spatial frequency (Figure 1B), which were presented in a random sequence for 41 frames (248 ms), with each image being presented for one frame. 
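For readers unfamiliar with this masking approach, single-layer texture synthesis in the spirit of Gatys et al. (2015) can be sketched as follows; this is an illustration rather than the authors' pipeline, and details such as the optimizer, the number of steps, and the omission of ImageNet normalization are assumptions.

import torch
import torch.nn.functional as F
from torchvision import models

# conv1_1 + ReLU of a pretrained VGG-19
vgg_conv1_1 = models.vgg19(weights="IMAGENET1K_V1").features[:2].eval()
for p in vgg_conv1_1.parameters():
    p.requires_grad_(False)

def gram_matrix(feats):
    _, c, h, w = feats.shape
    f = feats.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def synthesize_mask(source, steps=300, lr=0.05):
    """source: 1 x 3 x H x W tensor in [0, 1]. Starting from uniform noise,
    optimize an image whose conv1_1 Gram statistics match those of the source."""
    target = gram_matrix(vgg_conv1_1(source))
    img = torch.rand_like(source, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(gram_matrix(vgg_conv1_1(img)), target)
        loss.backward()
        opt.step()
        with torch.no_grad():
            img.clamp_(0, 1)
    return img.detach()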
Results and discussion
A three-way repeated-measures ANOVA with the number of cycles, the SF of the stimuli, and the SF of the mask as factors revealed a decrease in the minimum presentation time with an increasing number of cycles, F(1.35, 14.85) = 33.62, p < 0.001, ηp2 = 0.753, BF10 = 7.62e4, faster binding for stimuli with lower SFs, F(2, 22) = 29.87, p < 0.001, ηp2 = 0.731, BF10 = 3.44e4, and somewhat faster binding for low-SF masks, F(1, 11) = 13.39, p = 0.004, ηp2 = 0.549, but BF10 = 1.097 (Figure 4). There were no significant interactions, all F < 1.76, p > 0.20, ηp2 < 0.138, BF10 < 0.41. Post hoc pairwise comparisons controlling the familywise error rate with Holm's procedure confirmed the direction of these effects, with all t tests comparing cycles and SFs being significant, all t > 3.22, pholm < 0.004, Cohen's d > 0.389, BF10 > 1.23e3, except between the two- and three-cycle conditions, t = 1.49, pholm = 0.15, Cohen's d = 0.302, but BF10 = 22.92. 
Figure 4.
 
Stimulus duration as a function of the number of cycles, the spatial frequency of stimuli, and the spatial frequency of masks in Experiment 3 (N = 12). Note: Error bars represent the standard error of the mean across participants (SEM).
These results again support the late-binding hypothesis; in the one-cycle condition, thresholds were well above 100 ms. Furthermore, these results indicate that the advantage of low over high SF stimuli cannot simply be explained by backward masks interfering more with high SF processing. 
General discussion
Contrary to theories proposing rapid feature binding, our findings showed that binding requires a substantial amount of time. As the number of presentation cycles increased, the window for temporal integration increased, leading to higher accuracy in reporting feature conjunctions and a shorter required presentation duration for individual stimuli. Manipulating the speed of feedforward processing with low- and high-SF stimuli showed that spatial frequency influences processing time but does not interact with the number of cycles. Together, these observations indicate that binding in this situation is not resolved via feedforward processing. Moreover, backward masking impaired binding performance. Therefore, these results strongly support late-binding theories, which posit that long-lasting recurrent processing is a critical mechanism in feature binding (Bouvier & Treisman, 2010; Di Lollo et al., 2000; Hochstein & Ahissar, 2002; Koivisto & Silvanto, 2012; Lamme & Roelfsema, 2000; Roelfsema, 2006; Treisman, 1996), even for features that belong to the same object. 
Re-evaluating the existing literature on binding, we can understand how Holcombe and Cavanagh (2001) could have interpreted their results as favoring early binding: temporal integration was taking place over cycles that were, by themselves, short. Other experiments that have been interpreted as supporting fast binding, such as Seymour et al. (2009), show that binding can be decoded from V1 with BOLD fMRI. These results, obtained with fMRI, are agnostic concerning the timing of the binding process and could easily result from recurrent processing (Lamme, 1995; Roelfsema, Lamme, & Spekreijse, 1998). 
One can argue that both early-stage feedforward and later-stage recurrent processing would benefit from longer stimulus durations. However, based on our findings and late-binding theories in general, we believe that the substantial improvements in binding performance are primarily associated with recurrent processing. Specifically, we believe that our results do not reflect solely feedforward accumulation over time, for two reasons. First, after the initial response, longer stimulus durations would lead to the dispersion of activity to neurons higher up in the feedforward pathway. This dispersion activates lateral connections, presenting a form of recurrence beyond simple feedforward mechanisms, albeit with interactions with lower-tier areas. Second, in Experiment 1 we demonstrated that masking disrupts binding performance when stimuli are presented for one cycle. Given that masking predominantly interferes with recurrent processing and spares feedforward mechanisms (Fahrenfort et al., 2007), this disruption suggests the involvement of an upper-tier integration mechanism. 
But how could the binding of local features be slow if object detection is fast (Thorpe et al., 1996)? Object detection differs from the current task in two potentially crucial respects. First, for the detection of an object, it is not necessary to explicitly integrate the features of an object and segregate them from the background; this task can be solved based on the likelihood of the features in an object (Zhang, Jin, & Zhou, 2010) or even using the statistics of the rest of the scene (Loke et al., 2024; Seijdel, Tsakmakidis, de Haan, Bohte, & Scholte, 2020). From this perspective, the rapid detection of objects (e.g., animal vs. non-animal) could result from the feedforward detection of relevant feature constellations of the scene or background, with more recurrent processing resulting in better solutions (Sörensen, Bohté, De Jong, Slagter, & Scholte, 2023). In contrast, solving the task in the current study requires explicit binding. Second, although humans have extensive experience with natural images and may learn to recognize certain feature combinations that occur frequently during development, in our and similar experiments the features and their binding do not occur frequently in the natural world. Although previous research has shown that humans are capable of learning to bind even arbitrary features effectively with practice (Frank, Reavis, Tse, & Greenlee, 2014; Yashar & Carrasco, 2016), this was not a factor in the current study. 
It is worth emphasizing that the suggested involvement of recurrent processes in feature binding for features within an object does not necessarily reflect specific attentional processes “gluing” features together. Such slow attentional processes seem to be required for binding when features need to be sampled at different locations. Indeed, when orientation and luminance features are presented separately above and below fixation rather than superimposed as in the present study, binding is an order of magnitude slower (Nakayama & Motoyoshi, 2019), even for continuous presentation sequences (Holcombe & Cavanagh, 2001). Interestingly, for spatially separate features, binding performance slowly fluctuates at rates of around 8 Hz (Nakayama & Motoyoshi, 2019), similar to the attentional sampling of visual information in other tasks, whereas no such periodicity has been observed for the binding of superimposed features. Thus, although our results indicate a role of time-consuming recurrent processes for feature binding, binding of superimposed features seems independent of attentional fluctuations, consistent with the notion that recurrent processing and attention are fully dissociable. 
However, in the current setup, we did not explicitly separate feature detection from feature binding. We presented stimuli at the center of fixation, where feature detection is relatively easy (Carrasco, Evert, Chang, & Katz, 1995; Wolfe, O'Neill, & Bennett, 1998). Furthermore, subjects did not report problems with the visibility of the features. Finally, performance without masking was substantially higher, in line with the idea that the detection of simple features should not be hampered by masking, whereas binding might be. Regardless, we did not explicitly test whether subjects failed at feature detection. To attribute errors more accurately to either binding or detection issues, especially at wider eccentricities, future studies should additionally test detection performance across different stimulus durations. 
Given that stimuli were presented close to fixation, our findings on feature binding times are specific to the central visual field. The central location of the stimulus eliminated any advantage of executing eye movements. Moreover, the brief stimulus duration prevented the planning and execution of eye movements following its presentation, precluding the integration over eye movements. Notably, existing research indicates that feature binding is significantly worse outside the fovea (Yashar, Wu, Chen, & Carrasco, 2019). Therefore we expect that while results may vary with eccentricity, they would reveal even slower binding times in the periphery and further support our conclusion that feature binding is not a rapid process. 
Moreover, we focused on evaluating the accuracy of participants' responses. Although the paradigm is effective for assessing error rates, it is not suitable for evaluating speed-accuracy trade-offs, as participants could only respond after a blank screen, introducing delays that complicated the interpretation of reaction time data. To address these trade-offs more effectively, future research should use paradigms that allow for assessing both speed and accuracy, enabling participants to respond immediately after the stimulus presentation. Thus our findings should be understood primarily in terms of accuracy, with further studies needed to explore speed-accuracy trade-offs using different setups. 
We also found that the success of the binding process increased as the SF of the stimuli decreased. Further research is necessary to clarify the implications of this finding. One suggestion could be that the speed of the feedforward response (via parvo- vs. magnocellular pathways) also determined the total time necessary for binding. Given that parvo- and magnocellular pathways have significant overlap in their spatial frequency tuning (Edwards, Goodhew, & Badcock, 2021), future studies could more selectively activate these pathways by using isoluminant colors (Derrington & Lennie, 1984) and by manipulating the location of the stimuli within the visual field (Dacey, 1993; Merigan & Maunsell, 1993; Perry & Cowey, 1985). 
The current results support the idea that local visual feature binding happens over a long timescale, at least for artificial stimuli. Based on these findings, we suggest that this type of binding does not occur in the early stages of visual processing but involves (local) recurrent processing. 
Acknowledgments
Commercial relationships: none. 
Corresponding author: Lucija Blaževski. 
Email: blazevskilucija@gmail.com. 
Address: Department of Psychology, University of Amsterdam, Nieuwe Achtergracht 129B, Amsterdam, 1001 NK, The Netherlands. 
References
Blaser, E., Papathomas, T., & Vidnyánszky, Z. (2005). Binding of motion and colour is early and automatic. European Journal of Neuroscience, 21(7), 2040–2044, https://doi.org/10.1111/j.1460-9568.2005.04032.x. [PubMed]
Blaževski, L., Stein, T., & Scholte, H.S. (2024). Feature binding is slow: Temporal integration explains apparent ultrafast binding, https://doi.org/10.17605/OSF.IO/GEK5X.
Bouvier, S., & Treisman, A. (2010). Visual feature binding requires reentry. Psychological Science, 21(2), 200–204, https://doi.org/10.1177/0956797609357858. [PubMed]
Carrasco, M., Evert, D., Chang, I. & Katz, S. M. (1995). The eccentricity effect: Target eccentricity affects performance on conjunction searches. Perception & Psychophysics, 57, 1241–1261, https://doi.org/10.3758/BF03208380. [PubMed]
Dacey, D. M. (1993). The mosaic of midget ganglion cells in the human retina. The Journal of Neuroscience, 13(12), 5334–5355, https://doi.org/10.1523/JNEUROSCI.13-12-05334.1993. [PubMed]
Derrington, A. M., & Lennie, P. (1984). Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. The Journal of Physiology, 357, 219–240, https://doi.org/10.1113/jphysiol.1984.sp015498. [PubMed]
Desimone, R., & Schein, S. J. (1987). Visual properties of neurons in area V4 of the macaque: Sensitivity to stimulus form. Journal of Neurophysiology, 57(3), 835–868, https://doi.org/10.1152/jn.1987.57.3.835. [PubMed]
Di Lollo, V., Enns, J. T., & Rensink, R. A. (2000). Competition for consciousness among visual events: The psychophysics of reentrant visual processes. Journal of Experimental Psychology: General, 129(4), 481–507, https://doi.org/10.1037/0096-3445.129.4.481. [PubMed]
Edwards, M., Goodhew, S. C., & Badcock, D. R. (2021). Using perceptual tasks to selectively measure magnocellular and parvocellular performance: Rationale and a user's guide. Psychonomic Bulletin & Review, 28(4), 1029–1050, https://doi.org/10.3758/s13423-020-01874-w. [PubMed]
Fahrenfort, J. J., Scholte, H. S., & Lamme, V. A. (2007). Masking disrupts reentrant processing in human visual cortex. Journal of Cognitive Neuroscience, 19(9), 1488–1497, https://doi.org/10.1162/jocn.2007.19.9.1488. [PubMed]
Frank, S. M., Reavis, E. A., Tse, P. U., & Greenlee, M. W. (2014). Neural mechanisms of feature conjunction learning: Enduring changes in occipital cortex after a week of training. Human Brain Mapping, 35(4), 1201–1211, https://doi.org/10.1002/hbm.22245. [PubMed]
Gatys, L., Ecker, A. S., & Bethge, M. (2015). Texture synthesis using convolutional neural networks. Advances in Neural Information Processing Systems, 28. https://proceedings.neurips.cc/paper/2015/hash/a5e00132373a7031000fd987a3c9f87b-Abstract.html.
Hochstein, S., & Ahissar, M. (2002). View from the top: Hierarchies and reverse hierarchies in the visual system. Neuron, 36(5), 791–804, https://doi.org/10.1016/S0896-6273(02)01091-7. [PubMed]
Holcombe, A., & Cavanagh, P. (2001). Early binding of feature pairs for visual perception. Nature Neuroscience, 4, 127–128, https://doi.org/10.1038/83945. [PubMed]
JASP Team. (2024). JASP (Version 0.18.3) [Computer software].
Kaernbach, C. (1991). Simple adaptive testing with the weighted up-down method. Perception & Psychophysics, 49(3), 227–229, https://doi.org/10.3758/BF03214307. [PubMed]
Kauffmann, L., Ramanoël, S., & Peyrin, C. (2014). The neural bases of spatial frequency processing during scene perception. Frontiers in Integrative Neuroscience, 8. https://www.frontiersin.org/articles/10.3389/fnint.2014.00037.
Koivisto, M., & Silvanto, J. (2012). Visual feature binding: The critical time windows of V1/V2 and parietal activity. NeuroImage, 59(2), 1608–1614, https://doi.org/10.1016/j.neuroimage.2011.08.089. [PubMed]
Lamme, V. A. (1995). The neurophysiology of figure-ground segregation in primary visual cortex. The Journal of Neuroscience, 15(2), 1605–1615, https://doi.org/10.1523/JNEUROSCI.15-02-01605.1995.
Lamme, V. A., & Roelfsema, P. R. (2000). The distinct modes of vision offered by feedforward and recurrent processing. Trends in Neurosciences, 23(11), 571–579, https://doi.org/10.1016/s0166-2236(00)01657-x. [PubMed]
Loke, J., Seijdel, N., Snoek, L., Sörensen, L. K. A., van de Klundert, R., Cappaert, N., … Scholte, H. S. (2024). Human visual cortex and deep convolutional neural network care deeply about object background. Journal of Cognitive Neuroscience, 36(3), 551–566, https://doi.org/10.1162/jocn_a_02098. [PubMed]
Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A user's guide (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates Publishers.
Maunsell, J. H., & Gibson, J. R. (1992). Visual response latencies in striate cortex of the macaque monkey. Journal of Neurophysiology, 68(4), 1332–1344, https://doi.org/10.1152/jn.1992.68.4.1332. [PubMed]
Merigan, W. H., & Maunsell, J. H. (1993). How parallel are the primate visual pathways? Annual Review of Neuroscience, 16, 369–402, https://doi.org/10.1146/annurev.ne.16.030193.002101. [PubMed]
Nakayama, R., & Motoyoshi, I. (2019). Attention periodically binds visual features as single events depending on neural oscillations phase-locked to action. The Journal of Neuroscience, 39(21), 4153–4161, https://doi.org/10.1523/JNEUROSCI.2494-18.2019.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609, https://doi.org/10.1038/381607a0. [PubMed]
Pasupathy, A., & Connor, C. E. (2002). Population coding of shape in area V4. Nature Neuroscience, 5(12), Article 12, https://doi.org/10.1038/972.
Peirce, J. W., Gray, J. R., Simpson, S., MacAskill, M. R., Höchenberger, R., Sogo, H., … Lindeløv, J. (2019). PsychoPy2: Experiments in behavior made easy. Behavior Research Methods, https://doi.org/10.3758/s13428-018-01193-y.
Perry, V. H., & Cowey, A. (1985). The ganglion cell and cone distributions in the monkey's retina: implications for central magnification factors. Vision Research, 25(12), 1795–1810, https://doi.org/10.1016/0042-6989(85)90004-5. [PubMed]
R Core Team. (2023). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, https://www.R-project.org/.
Roelfsema, P. R., Lamme, V. A., & Spekreijse, H. (1998). Object-based attention in the primary visual cortex of the macaque monkey. Nature, 395(6700), 376–381, https://doi.org/10.1038/26475. [PubMed]
Roelfsema, P. R. (2006). Cortical algorithms for perceptual grouping. Annual Review of Neuroscience, 29, 203–227, https://doi.org/10.1146/annurev.neuro.29.051605.112939. [PubMed]
Rosenblatt, F. (1961). Principles of neurodynamics: Perceptrons and the theory of brain mechanisms (Vol. 55). Washington, DC: Spartan Books.
Rust, N. C., Schwartz, O., Movshon, J. A., & Simoncelli, E. P. (2005). Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6), 945–956, https://doi.org/10.1016/j.neuron.2005.05.021. [PubMed]
Seijdel, N., Tsakmakidis, N., de Haan, E. H. F., Bohte, S. M., & Scholte, H. S. (2020). Depth in convolutional neural networks solves scene segmentation. PLOS Computational Biology, 16(7), e1008022, https://doi.org/10.1371/journal.pcbi.1008022. [PubMed]
Seymour, K., Clifford, C. W., Logothetis, N. K., & Bartels, A. (2009). The coding of color, motion, and their conjunction in the human visual cortex. Current Biology, 19(3), 177–183, https://doi.org/10.1016/j.cub.2008.12.050.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sörensen, L. K., Bohté, S. M., De Jong, D., Slagter, H. A., & Scholte, H. S. (2023). Mechanisms of human dynamic object recognition revealed by sequential deep neural networks. PLOS Computational Biology, 19(6), e1011169, https://doi.org/10.1371/journal.pcbi.1011169. [PubMed]
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522, https://doi.org/10.1038/381520a0. [PubMed]
Treisman, A. (1996). The binding problem. Current Opinion in Neurobiology, 6(2), 171–178, https://doi.org/10.1016/s0959-4388(96)80070-5. [PubMed]
Treisman, A., & Schmidt, H. (1982). Illusory conjunctions in the perception of objects. Cognitive Psychology, 14(1), 107–141, https://doi.org/10.1016/0010-0285(82)90006-8. [PubMed]
von der Malsburg, C. (1994). The correlation theory of brain function. In Domany, E., van Hemmen, J. L., Schulten, K. (Eds.), Models of Neural Networks: Temporal Aspects of Coding and Information Processing in Biological Systems (pp. 95–119). Berlin: Springer, https://doi.org/10.1007/978-1-4612-4320-5_2.
Wandell, B. A. (1995). Foundations of vision. Sunderland, MA: Sinauer Associates.
Wolfe, J. M., O'Neill, P., & Bennett, S. C. (1998). Why are there eccentricity effects in visual search? Visual and attentional hypotheses. Perception & Psychophysics, 60(1), 140–156, https://doi.org/10.3758/bf03211924. [PubMed]
Yashar, A., & Carrasco, M. (2016). Rapid and long-lasting learning of feature binding. Cognition, 154, 130–138, https://doi.org/10.1016/j.cognition.2016.05.019. [PubMed]
Yashar, A., Wu, X., Chen, J., & Carrasco, M. (2019). Crowding and binding: Not all feature dimensions behave in the same way. Psychological Science, 30(10), 1533–1546, https://doi.org/10.1177/0956797619870779. [PubMed]
Zhang, Y., Jin, R., & Zhou, Z.-H. (2010). Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics, 1(1), 43–52, https://doi.org/10.1007/s13042-010-0001-0.