Work from Rosenholtz and colleagues (
Balas et al., 2009;
Ehinger & Rosenholtz, 2016;
Keshvari & Rosenholtz, 2016;
Rosenholtz, Huang, & Ehinger, 2012;
Rosenholtz, Huang, Raj, et al., 2012;
X. Zhang, Huang, Yigit-Elliot, & Rosenholtz, 2015) has developed and tested an HD pooling model that we call the Texture Tiling Model (TTM).
The model consists of two stages. In the first stage, TTM implements a V1-like representation consisting of responses to oriented, multiscale feature detectors. In the second stage, the model computes a large set of second-order correlations from the responses of the first stage, averaged over local pooling regions (TTM also computes more basic first-order summary statistics within each color band;
Balas et al., 2009). These pooling regions grow linearly with eccentricity, in accord with Bouma's law, and overlap and tile the visual field. The information encoded in the second stage, where pooling happens, has been associated with the information encoded physiologically, post-V1 (e.g.,
Freeman, Ziemba, Heeger, Simoncelli, & Movshon, 2013;
Yamins & DiCarlo, 2016). In addition, standard models of hierarchical visual processing (e.g.,
Fukushima, 1980;
Riesenhuber & Poggio, 1999) often have as a second stage the computation of co-occurrences of combinations of features from the first stage; second-order correlations are simply co-occurrence computations pooled over significantly larger regions. The set of statistics we measure is that identified by
Portilla and Simoncelli (2000), because that set has been successful at capturing the appearance of textures for human perception. Specifically, textures synthesized using this set of statistics are often difficult to discriminate from the original (
Balas, 2006). Mounting evidence supports TTM as a good candidate HD pooling model for the peripheral encoding underlying crowding. We have shown that it predicts performance on a range of peripheral recognition tasks involving arrays of letters and other symbols (
Balas et al., 2009;
Keshvari & Rosenholtz, 2016;
Rosenholtz, Huang, & Ehinger, 2012;
Rosenholtz, Huang, Raj, et al., 2012). The same model predicts the difficulty of getting the gist of a scene when fixating (i.e., when forced to rely on extrafoveal vision), compared with free-viewing that scene (
Ehinger & Rosenholtz, 2016). Using the same image statistics but a somewhat different arrangement of pooling regions,
Freeman and Simoncelli (2011) predicted the critical spacing of crowding. They also showed that equating those local summary statistics yields synthetic metamer images that are difficult to distinguish from one another when viewed with the same fixation as assumed by the model (though see
Wallis, Bethge, & Wichmann, 2016). Although in all of these studies some variance remains unexplained by the model, leaving room for improvement, these HD pooling models have so far proven quite powerful at capturing crowding and related visual phenomena.
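To make the two-stage structure concrete, the following is a minimal toy sketch, not the actual model: simple directional derivatives stand in for TTM's oriented, multiscale filter bank, and mean pairwise products of filter responses stand in for the much richer Portilla–Simoncelli statistic set. The function names, the grid of pooling centers, and the proportionality constant are illustrative assumptions; only the linear growth of pooling-region size with eccentricity follows Bouma's law as described above.

```python
import numpy as np

def oriented_filter_responses(image, n_orientations=4):
    # First stage (toy stand-in): directional-derivative responses in place
    # of TTM's V1-like oriented, multiscale feature detectors.
    gy, gx = np.gradient(image.astype(float))
    thetas = [k * np.pi / n_orientations for k in range(n_orientations)]
    return [np.cos(t) * gx + np.sin(t) * gy for t in thetas]

def bouma_pooling_radius(eccentricity, b=0.5):
    # Pooling-region size grows linearly with eccentricity (Bouma's law);
    # b ~ 0.4-0.5 is the classic critical-spacing estimate (assumed here).
    return max(1.0, b * eccentricity)

def pooled_second_order_stats(responses, fixation, centers):
    # Second stage (toy stand-in): within each pooling region, average the
    # pairwise products of filter responses -- a crude proxy for the
    # second-order correlation statistics the text describes.
    H, W = responses[0].shape
    stats = []
    for cy, cx in centers:
        ecc = np.hypot(cy - fixation[0], cx - fixation[1])
        r = int(round(bouma_pooling_radius(ecc)))
        y0, y1 = max(0, cy - r), min(H, cy + r + 1)
        x0, x1 = max(0, cx - r), min(W, cx + r + 1)
        patch = np.stack([resp[y0:y1, x0:x1].ravel() for resp in responses])
        corr = patch @ patch.T / patch.shape[1]  # mean pairwise products
        stats.append(corr[np.triu_indices(len(responses))])
    return np.array(stats)  # one summary-statistic vector per pooling region
```

Note how regions near fixation pool over small neighborhoods while peripheral regions pool over much larger ones; in the real model the regions also overlap and tile the full visual field, and the statistics are computed on a steerable-pyramid decomposition rather than raw gradients.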