Object recognition is often thought to be feedforward and hierarchical (DiCarlo, Zoccolan, & Rust, 2012; Hubel & Wiesel, 1962; Hung, Kreiman, Poggio, & DiCarlo, 2005; Riesenhuber & Poggio, 1999; Serre, Kouh, Cadieu, & Knoblich, 2005; Serre, Kreiman, et al., 2007; Serre, Oliva, & Poggio, 2007; Thorpe, Delorme, & Van Rullen, 2001). The analysis of a visual scene starts with the extraction of basic features (e.g., lines and contours) in the early visual cortex and proceeds to increasingly complex features (e.g., shapes, faces, and objects) in higher visual areas. Complex feature detectors are created by pooling the outputs of more basic feature detectors. For example, a hypothetical square-detecting neuron receives input from neurons sensitive to its constituent vertical and horizontal lines. Accordingly, receptive field sizes increase step by step along the processing hierarchy, simply because a square covers more space than its constituent lines. One consequence of pooling is that neurons are sensitive to context. Hence, such models predict that elements neighboring a target element impair target processing: features of the target and the flankers are pooled, and target information is thereby lost. Indeed, this is the case in crowding (Flom, Heath, & Takahashi, 1963; Levi, 2008; Strasburger & Wade, 2015; Whitney & Levi, 2011). For this reason, pooling models have become the standard in crowding research (Balas, Nakano, & Rosenholtz, 2009; Dakin, Cass, Greenwood, & Bex, 2010; Freeman, Chakravarthi, & Pelli, 2012; Freeman & Simoncelli, 2011; Greenwood, Bex, & Dakin, 2009, 2010; Parkes, Lund, Angelucci, Solomon, & Morgan, 2001; van den Berg, Roerdink, & Cornelissen, 2010; Wilkinson, Wilson, & Ellemberg, 1997).
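The pooling logic described above can be caricatured in a few lines of code. The sketch below is purely illustrative and corresponds to no specific published model: a downstream unit simply averages the feature values (here, orientations in degrees) of all elements falling inside its receptive field. When flankers share the pooling region with the target, the pooled response no longer reflects the target's orientation, which is the basic crowding prediction.

```python
def pooled_response(features_in_region):
    """Caricature of a pooling unit: average the feature values
    (e.g., local orientations in degrees) inside its receptive field.
    Illustrative only; real pooling models are far more elaborate."""
    return sum(features_in_region) / len(features_in_region)

# Target alone in the pooling region: a vertical element (90 deg).
alone = pooled_response([90.0])             # -> 90.0, target orientation preserved

# Target flanked by two horizontal elements (0 deg) in the same region.
crowded = pooled_response([0.0, 90.0, 0.0]) # -> 30.0, target orientation lost
```

In the crowded case the pooled value (30 degrees) matches neither the target nor the flankers, so a decoder reading only the pooled output cannot recover the target's identity.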