Our visual world is composed of complex information that is continually changing from moment to moment. Any given scene contains a wealth of visual information—pebbles on a beach, leaves on a tree, faces in a crowded room—yet limitations on our attention and short-term memory prevent us from processing every detail (Duncan, Ward, & Shapiro, 1994; Luck & Vogel, 1997; Myczek & Simons, 2008). One way the visual system processes this information efficiently is by extracting summary statistics (e.g., the average) of a given stimulus feature across an array of objects, a process known as ensemble perception (for reviews, see Alvarez, 2011; Fischer & Whitney, 2011; Haberman, Harp, & Whitney, 2009; Haberman & Whitney, 2011). A large body of evidence has shown that the visual system can rapidly extract the mean of stimulus features such as orientation (Ariely, 2001; Dakin & Watt, 1997; Parkes, Lund, Angelucci, Solomon, & Morgan, 2001), size (Ariely, 2001; Carpenter, 1988; Chong & Treisman, 2003), and motion direction (Watamaniuk & Sekuler, 1992). More recent research has demonstrated that observers can also perceive mean features of complex objects, such as crowd heading from point-light walkers (Sweeny, Haroz, & Whitney, 2013), emotions from sets of faces (Haberman et al., 2009; Haberman & Whitney, 2007; Ji et al., 2013; Ji, Chen, & Fu, 2014; Jung et al., 2013; Yang, Yoon, Chong, & Oh, 2013), facial identity (de Fockert & Wolfenstein, 2009; Haberman & Whitney, 2007; Yamanashi Leib et al., 2014; Yamanashi Leib et al., 2012), crowd gaze direction (Cornelissen, Peters, & Palmer, 2002; Sweeny & Whitney, 2014), and auditory tone (Piazza, Sweeny, Wessel, Silver, & Whitney, 2013). However, it remains debated whether ensemble perception of high-level visual stimuli, such as faces, can be accomplished covertly or whether it requires overt, sequential foveation of objects before an ensemble representation can be extracted.