Research using the multiple-object tracking (MOT) paradigm has demonstrated that people are able to track four or more targets simultaneously (see Cavanagh & Alvarez, 2005, for a review). This finding is consistent with the observation that, in everyday environments, people are able to divide their attention among multiple regions of interest. The research has also generated less intuitive findings, particularly concerning the dynamic binding of identity and spatiotemporal information. Objects are commonly distinguished by their features. Yet relative to object locations, object features such as color and shape are often easily lost during tracking (Bahrami, 2003; Saiki, 2003a, 2003b). Intuitively, individuating objects by color should facilitate tracking. However, tracking uniquely colored objects does not always have an advantage over tracking physically identical objects (Klieger, Horowitz, & Wolfe, 2004). Although more recent evidence does show that unique object shapes facilitate tracking, identity recognition remains worse than tracking performance (Horowitz et al., 2007). Consistent with this poor binding of object features and location, research has found that people are poor at identifying correctly tracked objects (Pylyshyn, 2004). This seems rather puzzling because successful tracking requires correct target tagging. To explain this paradoxical phenomenon, Pylyshyn suggests that multiple-object tracking may be implemented by low-level vision, where information about individual identity is encapsulated and inaccessible to higher level cognition.
While identity–location binding often has little consequence in multiple-object tracking, it can be vital in social interactions. It is the responsibility of a caretaker to scrutinize the activities of children in the playground, noting the specific events linked with each individual. An eyewitness to a crime has to correctly associate the acts and locations with each suspect involved, although this is by no means an easy task. A key motivation behind MOT studies is to understand how identity and spatiotemporal events are bound together in situations of this kind. To address this issue, in this study we employ the MOT paradigm to examine the binding of facial identity and location. Although every part of the human body carries useful clues to identity, the human face is arguably the most salient source of information for person identification. The principal objective of this study is to investigate whether processing of facial identity is to some extent obligatory in multiple-face tracking and how attentional resources are used for identity processing and location tracking. It is difficult to infer the relationship between tracking and processing of facial identity from the existing MOT literature because most studies to date have employed nonface stimuli such as simple geometric shapes and line drawings of objects. Identity processing for faces and objects may not be the same. Research has shown that face identification relies on both featural and configural information, and there are notable differences between entry-level object recognition and face recognition (Bruce & Humphreys, 1994).
To our knowledge, only a recent study by Oksama and Hyönä (2008) has employed face stimuli in MOT. The main purpose of their study was quite different from ours. They examined how familiar/famous faces affect tracking performance and found that familiar faces were easier to track than pseudo-faces. Our study, on the other hand, used unfamiliar faces as stimuli. There is evidence that identity processing for familiar and unfamiliar faces demands different levels of attentional resources (Jackson & Raymond, 2006). The main difference, however, is that our study focuses on whether identity processing of unfamiliar faces is to some extent spontaneous or mandatory in the absence of deliberate intention, and whether voluntary control modulates the outcome of dynamic binding of location and identity. In addition, instead of a small number of line drawings, we employed a large number of photographic images to improve the generalizability of the findings. Line drawings may be of limited use for understanding face processing because they lack the reflectance cues and surface information that are important for face recognition (Bruce, Hanna, Dench, Healey, & Burton, 1992; Russell, Biederman, Nederhouser, & Sinha, 2007; Vuong, Peissig, Harrison, & Tarr, 2005). Line drawings are also known to impair configural processing in face recognition (Leder, 1996).
Although the role of attention in processing facial identity has been studied extensively in recent years, no study to date has employed the MOT paradigm for this purpose. However, the paradigm has obvious advantages for the study of attention in face processing. In reality, the location of a face is rarely fixed. Moreover, it is often necessary to achieve dynamic binding of multiple faces and locations. The role of attention in multiple-face tracking is clearly important for understanding social interactions. Because current theories make different predictions about the role of attention in this type of task, we first briefly outline the main theoretical positions.