Abstract
Predicting motion in noisy environments is essential to everyday behavior, for instance when navigating traffic. Although many objects provide multisensory information, it remains unknown how humans use that information to track moving objects, and how this depends on sensory interruption or interference (e.g., occlusion). In four experiments, we systematically investigated localization performance for auditory, visual, and audiovisual targets in three situations: (1) locating static targets, (2) locating moving targets, and (3) predicting the location of targets moving under occlusion. Performance for audiovisual targets was compared with the predictions of Maximum Likelihood Estimation (MLE). In Experiment 1, participants showed a substantial multisensory benefit when localizing static audiovisual targets, consistent with near-optimal audiovisual integration. In Experiment 2, localizing moving audiovisual targets yielded no multisensory precision benefit, yet localization estimates were still in line with MLE predictions. In Experiment 3A, moving targets were occluded by an audiovisual occluder at an unpredictable time point, and participants had to infer the final target location from target speed and occlusion duration. Here, participants relied exclusively on the visual component of the audiovisual target, even though the auditory component demonstrably provided useful location information when presented in isolation. In contrast, when a visual-only occluder was used in Experiment 3B, participants relied primarily on the auditory component of the audiovisual target (which remained audible during visual occlusion), even though the visual component demonstrably provided useful location information during occlusion when presented in isolation. In sum, observers use both hearing and vision when tracking moving objects and localizing static objects, but rely on a single sense when predicting motion under occlusion, perhaps to minimize short-term memory load. Moreover, observers can flexibly prioritize one sense over the other in anticipation of modality-specific interference.
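
For reference, the MLE benchmark mentioned above is the standard reliability-weighted cue-combination model; a textbook sketch is given below. The symbols \hat{x} and \sigma for the unisensory location estimates and their standard deviations are our notation for illustration, not taken from the experiments themselves.

% Reliability-weighted (MLE) audiovisual cue combination.
% \hat{x}_A, \hat{x}_V: unisensory location estimates;
% \sigma_A, \sigma_V: standard deviations of those estimates.
\[
  \hat{x}_{AV} = w_A \hat{x}_A + w_V \hat{x}_V,
  \qquad
  w_A = \frac{1/\sigma_A^{2}}{1/\sigma_A^{2} + 1/\sigma_V^{2}},
  \quad
  w_V = 1 - w_A
\]
% The optimally combined estimate is at least as precise as the best single cue,
% which is the multisensory precision benefit tested in Experiments 1 and 2:
\[
  \sigma_{AV}^{2} = \frac{\sigma_A^{2}\,\sigma_V^{2}}{\sigma_A^{2} + \sigma_V^{2}}
  \;\le\; \min\bigl(\sigma_A^{2},\, \sigma_V^{2}\bigr)
\]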