The visual system groups similar features, objects, and motion (e.g., Gestalt grouping). Recent work suggests that the computation underlying perceptual grouping may be one of summary statistical representation. Summary representation occurs for low-level features, such as size, motion, and position, and even for high level stimuli, including faces; for example, observers accurately perceive the average expression in a group of faces (J. Haberman & D. Whitney, 2007, 2009). The purpose of the present experiments was to characterize the time-course of this facial integration mechanism. In a series of three experiments, we measured observers' abilities to recognize the average expression of a temporal sequence of distinct faces. Faces were presented in sets of 4, 12, or 20, at temporal frequencies ranging from 1.6 to 21.3 Hz. The results revealed that observers perceived the average expression in a temporal sequence of different faces as precisely as they perceived a single face presented repeatedly. The facial averaging was independent of temporal frequency or set size, but depended on the total duration of exposed faces, with a time constant of ∼800 ms. These experiments provide evidence that the visual system is sensitive to the ensemble characteristics of complex objects presented over time.


*p* = 0.20). Thus, while emotions may unfold nonlinearly in emotion space (Russell, 1980), this did not affect the discriminability of our stimulus set. Figure 3 also verifies that the set members (separated by at least six emotional units from one another) were discriminable.

*F*(3, 6) = 0.12, *p* > 0.5), suggesting that participants were equally sensitive to the set mean regardless of the rate at which faces were presented (Figure 4B). Set size trended toward significance, *F*(2, 4) = 5.72, *p* = 0.09, suggesting that the perception of ensemble facial expression may be more precise for larger set sizes.

*t*-tests examining set size 4, set size 20, and collapsed across set size (the closest test to significance was set size 4, *t*(2) = 1.68, *p* = 0.23). Thus, the mean representation derived from a set of sequentially presented different faces can be as precise as that derived from a set of identical faces. These results also confirm that the unique facial expressions in the sets (Figure 2) were discriminable.

*χ*^{2}(1) = 1.33, *p* = 0.25). Despite losing or lacking constituent information, observers were still able to derive an accurate representation of the mean expression (although three observers had not participated in Experiment 1, all of them participated in subsequent experiments and had precise mean representations). This reveals an efficient heuristic at work, one that favors the computationally simple extraction of the mean over the more cumbersome (although equally valid) representation of every individual set member.

*t*(6) = 0.95, *p* = 0.38). This suggests a precise representation of mean expression (Figure 7B).

*f*(*x*) = *a*[exp(−*bx*)] + *c*, to performance as a function of overall set duration. Because Experiments 1 and 2 both measured mean discrimination performance (albeit on slightly different tasks) and showed comparable levels of performance, we fit the decay function to the combined data set, collapsed across set sizes (Figure 7C). This procedure allowed us to identify the time constant of the temporal integration process (1/*b* is the time constant, tau, the time it takes for the threshold to cover 63% of the distance to its asymptote). The fit of the decay function was significant (*r*^{2} = 0.29, *p* < 0.01), suggesting that longer exposure to the set generally improved sensitivity to average facial expression. The time constant of ensemble face perception was 818 ms, an integration period comparable to that required for biological motion discrimination (Blake & Shiffrar, 2007; Kourtzi, Krekelberg, & van Wezel, 2008; Neri, Morrone, & Burr, 1998).
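The relation between the decay rate *b* and the time constant can be illustrated with a small sketch. It is not the authors' fitting procedure: the data are synthetic and noise-free, the threshold values and the helper names (`decay`, `fit_decay`) are hypothetical, and durations are assumed to be in seconds. A simple grid search over *b*, with *a* and *c* solved by ordinary least squares at each step, stands in for whatever nonlinear fitting routine was actually used.

```python
import math

def decay(x, a, b, c):
    """Exponential decay f(x) = a * exp(-b * x) + c."""
    return a * math.exp(-b * x) + c

def fit_decay(xs, ys, b_grid):
    """Least-squares fit: grid-search b; for each b, a and c follow from
    a simple linear regression of y on z = exp(-b * x)."""
    n = len(xs)
    y_mean = sum(ys) / n
    best = None
    for b in b_grid:
        zs = [math.exp(-b * x) for x in xs]
        z_mean = sum(zs) / n
        szz = sum((z - z_mean) ** 2 for z in zs)
        if szz == 0.0:
            continue  # regressor is constant; slope undefined
        szy = sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, ys))
        a = szy / szz
        c = y_mean - a * z_mean
        sse = sum((decay(x, a, b, c) - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]

# Synthetic data (NOT the paper's data): thresholds generated from a decay
# with an assumed time constant of 0.818 s, chosen only to mirror the text.
TRUE_A, TRUE_B, TRUE_C = 20.0, 1 / 0.818, 10.0
durations = [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0]  # seconds (assumed)
thresholds = [decay(x, TRUE_A, TRUE_B, TRUE_C) for x in durations]

b_grid = [i / 1000 for i in range(1, 5001)]  # candidate rates, 0.001..5.0 per s
a_hat, b_hat, c_hat = fit_decay(durations, thresholds, b_grid)
tau = 1 / b_hat  # time constant

# Fraction of the total drop (from f(0) down to the asymptote c) completed at x = tau.
covered = (decay(0, a_hat, b_hat, c_hat) - decay(tau, a_hat, b_hat, c_hat)) / (
    decay(0, a_hat, b_hat, c_hat) - c_hat
)
print(f"tau = {tau:.3f} s, fraction covered at tau = {covered:.3f}")
```

Whatever the fitted values of *a* and *c*, the fraction covered at *x* = 1/*b* is 1 − e^{−1}, which is where the 63% figure comes from.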

*n* units away from the actual set mean. We fit a Von Mises curve to the response distribution to characterize observer performance. The Von Mises is a circular analogue of the Gaussian; given our circle of emotions, this is the appropriate distribution to use. The Von Mises equation was formalized as *f*(*x*) = exp[*k* cos(*x* − *a*)] / [2π*I*_{0}(*k*)], where (*a*) was the location of the peak (i.e., where along the circle the points clustered), (*k*) was the concentration (i.e., inversely related to standard deviation, so the larger the number, the more concentrated the distribution), and *I*_{0} is the modified Bessel function of order 0. We used the standard deviation of the curve (derived from *k*) as an estimate of the precision with which observers represented the set mean: the smaller the standard deviation, the more precise the representation. Observers could precisely adjust to the mean expression of a set of sequentially presented faces, as indicated by the small standard deviations of the Von Mises curves (see Figure 9A for an example curve). Additionally, the *a* parameter was not significantly different from 0 (i.e., the mean) in 3 of the 4 observers (TH had a slight bias, *M* = −3.61, *t*(4) = 10.71, *p* < 0.001), suggesting that they were adjusting the test face to the mean expression of the set and not some other point on the distribution.
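As a rough illustration of how a standard deviation can be derived from the concentration *k*, the sketch below evaluates the standard Von Mises density and one common definition of circular standard deviation, sqrt(−2 ln[*I*_{1}(*k*)/*I*_{0}(*k*)]). Whether the authors used this particular conversion is an assumption, and the function names and the `n_units` argument (the number of steps around the emotion circle, used to convert radians to stimulus units) are hypothetical.

```python
import math

def bessel_i(order, k, terms=80):
    """Modified Bessel function of the first kind, I_order(k), via its power series."""
    return sum(
        (k / 2) ** (2 * m + order) / (math.factorial(m) * math.factorial(m + order))
        for m in range(terms)
    )

def von_mises_pdf(x, a, k):
    """Standard Von Mises density on the circle; x and a are in radians."""
    return math.exp(k * math.cos(x - a)) / (2 * math.pi * bessel_i(0, k))

def circular_sd(k):
    """Circular standard deviation in radians for concentration k > 0."""
    r = bessel_i(1, k) / bessel_i(0, k)  # mean resultant length
    return math.sqrt(-2 * math.log(r))

def sd_in_units(k, n_units):
    """Convert the circular SD to stimulus units (n_units is a hypothetical
    count of steps around the full circle)."""
    return circular_sd(k) * n_units / (2 * math.pi)

# Larger k means a tighter distribution, i.e. a more precise mean representation.
for k in (1.0, 4.0, 16.0):
    print(f"k = {k:5.1f}  circular SD = {circular_sd(k):.3f} rad")
```

For large *k* the circular standard deviation approaches 1/sqrt(*k*), matching the intuition that *k* behaves like an inverse variance.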

*F*(2, 9) = 1.86, *p* = 0.21. If there is any improvement in sensitivity to the average facial expression with larger set sizes, this is unlikely to be due to the higher probability of a face occurring in a particular location, because we controlled the probability of a face occurring within a given area (equating average separation among faces in all sets). Therefore, our results cannot be attributed to larger set sizes containing more information in a specific region of the screen than smaller set sizes.

*F*(2, 9) = 0.16, *p* = 0.85). This suggests that overall set duration was a more important factor than the number of faces presented. Consistent with Figure 7C, increasing overall set duration seemed to improve the precision of the mean representation. This is not to say that different set sizes are all processed in the same manner. It is conceivable that observers could extract more information from the multiple viewings of the faces in a larger set. However, any such effect appears to be outweighed by the effect of overall set duration.