In everyday life, our brains decide about the relevance of huge amounts of sensory input. Further complicating this situation, this input is distributed over different modalities. This raises the question of how different sources of information interact for the control of overt attention during free exploration of the environment under natural conditions. Different modalities may work independently or interact to determine the consequent overt behavior. To answer this question, we presented natural images and lateralized natural sounds in a variety of conditions and we measured the eye movements of human subjects. We show that, in multimodal conditions, fixation probabilities increase on the side of the image where the sound originates showing that, at a coarser scale, lateralized auditory stimulation topographically increases the salience of the visual field. However, this shift of attention is specific because the probability of fixation of a given location on the side of the sound scales with the saliency of the visual stimulus, meaning that the selection of fixation points during multimodal conditions is dependent on the saliencies of both auditory and visual stimuli. Further analysis shows that a linear combination of both unimodal saliencies provides a good model for this integration process, which is optimal according to information-theoretical criteria. Our results support a functional joint saliency map, which integrates different unimodal saliencies before any decision is taken about the subsequent fixation point. These results provide guidelines for the performance and architecture of any model of overt attention that deals with more than one modality.

*watch and listen carefully to the images and sounds*. No information about the presence of the speakers at both sides of the monitor or the lateralization of the auditory stimuli was provided to the subjects.

^{2}, a velocity above 30 deg/s, a motion threshold of 0.1°, and a duration of more than 4 ms. The intervening episodes were defined as fixation events. The result of applying these parameters was plotted and visually assessed to check that they produced reasonable results.

*p*_{s,i,c}(*x*, *y*), for a given subject *s*, image *i*, and condition *c* was calculated as in Equation 1,

*δ*(*x*) as the discrete Dirac function (the Dirac function is equal to zero unless its argument is zero, and it has a unit integral). *x*_{f}, *y*_{f}, and *t*_{f} are the coordinates and time of the *f*th fixation. *F* is the total number of fixations. We distinguish three different pdfs for a given condition with respect to how these individual pdfs were averaged: *subject*, *image*, and *spatiotemporal* pdfs. Subject pdfs *p*_{s,c}(*x*, *y*) for a given subject *s* and condition *c* were built by averaging all the pdfs obtained from a given subject over the images, without mixing the conditions, according to Equation 2,

*p*(*x*, *y*)) and spatiotemporal pdfs (*p*(*x*, *t*)) were similarly computed by averaging over the appropriate dimensions. Image pdfs inform us about consistent biases that influence subjects' gaze and are therefore empirically determined saliency maps specific to a given image.
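
The construction and averaging of fixation pdfs described above can be sketched as follows. This is a minimal illustration, assuming fixations are given as integer pixel coordinates on a small grid; the function names `fixation_pdf` and `average_pdfs`, the grid size, and the toy fixations are hypothetical and not taken from the study. Smoothing is omitted.

```python
def fixation_pdf(fixations, width, height):
    """Bin (x, y) fixations into a height x width grid that sums to 1."""
    if not fixations:
        raise ValueError("at least one fixation is required")
    grid = [[0.0] * width for _ in range(height)]
    for x, y in fixations:
        grid[y][x] += 1.0  # one unit of mass per fixation
    total = float(len(fixations))
    return [[value / total for value in row] for row in grid]


def average_pdfs(pdfs):
    """Average several pdfs of equal shape, e.g. over images for one subject."""
    n = float(len(pdfs))
    height, width = len(pdfs[0]), len(pdfs[0][0])
    return [
        [sum(pdf[r][c] for pdf in pdfs) / n for c in range(width)]
        for r in range(height)
    ]


# Toy data (assumed): one subject, one condition, two images on a 3 x 2 grid.
pdf_image_1 = fixation_pdf([(0, 0), (1, 0), (1, 1), (1, 1)], width=3, height=2)
pdf_image_2 = fixation_pdf([(2, 1), (2, 1)], width=3, height=2)
subject_pdf = average_pdfs([pdf_image_1, pdf_image_2])
```

Averaging the same per-image pdfs over subjects instead of over images would give an image pdf in the same way.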

*σ* with a value of 0.6° unless otherwise stated. This spatial scale is twice as large as the maximal calibration error and maintains sufficient spatial structure for data analysis.

*center of gravity* is measured according to Equation 3 in order to quantify the global shift of subject and image pdfs.

*μ*_{c} is the center of gravity along the *X*-axis for condition *c*, *p*_{c}(*x*) is the marginal distribution of a pdf along the *X*-axis. In condition V, fixation probabilities were usually distributed symmetrically on both sides of the visual field with centralized center of gravity values. This simple statistic successfully quantifies any bias toward the sound location,

*spread* of a given pdf was measured from a subject pdf under condition V using Equation 4 in order to quantify how explorative that subject's scanning behavior was.

*σ* is the spread along the *X*-axis, *p*_{V}(*x*) is the marginal distribution along the *X*-axis of the subject pdf, and *μ*_{V} is the center of gravity computed as in Equation 3. The marginal distributions arising from condition V were well-behaved (Figure 1B inset), thus allowing us to examine the explorative behavior of subjects. Seven of 42 subjects did not engage in explorative viewing during the analysis of the images, resulting in small spread values. These subjects were excluded from further analysis.
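
The two summary statistics can be illustrated with a short sketch. Equations 3 and 4 are not reproduced in this section, so the code assumes the usual definitions: the center of gravity is the first moment of the marginal distribution along the *X*-axis, and the spread is the standard deviation around it. The function names and the toy pdf are hypothetical.

```python
import math


def marginal_x(pdf):
    """Sum a 2D pdf (list of rows) over y to obtain the marginal p(x)."""
    return [sum(row[c] for row in pdf) for c in range(len(pdf[0]))]


def center_of_gravity(p_x):
    """First moment of the marginal distribution (assumed form of Equation 3)."""
    return sum(x * p for x, p in enumerate(p_x))


def spread(p_x):
    """Standard deviation around the center of gravity (assumed form of Equation 4)."""
    mu = center_of_gravity(p_x)
    return math.sqrt(sum((x - mu) ** 2 * p for x, p in enumerate(p_x)))


# Toy subject pdf for condition V on a 3 x 2 grid (assumed values).
pdf_v = [[0.1, 0.2, 0.1],
         [0.1, 0.4, 0.1]]
p_x = marginal_x(pdf_v)
mu_v = center_of_gravity(p_x)
sigma_v = spread(p_x)
```

A subject who fixates almost exclusively on one column would produce a spread close to zero, which corresponds to the non-explorative behavior used as the exclusion criterion.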

*p*_{c}(*x*, *t*), contain information about how the fixation density along the horizontal axis varies over time. We assessed the statistical differences of pdfs from different conditions in a time-localized manner; that is, we obtained a *p* value as a function of time for three pairs of conditions (AVL and V; AVR and V; AL and AR). A two-sided Kolmogorov–Smirnov goodness-of-fit hypothesis test in corresponding temporal portions of the pdfs was used with significance level *α* set to .001. This was done after binning the probability distribution *p*_{c}(*x*, *t*) over the time axis with a bin size of 240 ms, yielding 25 time intervals for comparison per pair of conditions. A temporal interval over which the null hypothesis was rejected in at least two of the condition pairs was considered a temporal interval of interest.
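
The time-localized comparison can be sketched as below. The sketch assumes that the test is applied to the horizontal fixation positions falling into each 240-ms bin, and it computes only the two-sample Kolmogorov–Smirnov statistic; converting the statistic into a *p* value is left to a statistics library. The function names and the toy fixations are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum distance between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both empirical CDFs past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_by_time_bin(fix_a, fix_b, bin_ms=240, n_bins=25):
    """KS statistic per time bin for fixations given as (x, t_ms) pairs."""
    stats = []
    for k in range(n_bins):
        lo, hi = k * bin_ms, (k + 1) * bin_ms
        xa = [x for x, t in fix_a if lo <= t < hi]
        xb = [x for x, t in fix_b if lo <= t < hi]
        # None marks bins in which one of the conditions has no fixations.
        stats.append(ks_statistic(xa, xb) if xa and xb else None)
    return stats


# Toy fixations (assumed): horizontal position in pixels, onset time in ms.
fix_avl = [(100, 10), (200, 50), (300, 250)]
fix_v = [(400, 20), (500, 100), (300, 300)]
stats = ks_by_time_bin(fix_avl, fix_v, n_bins=2)
```

With 240-ms bins and 25 intervals, the default arguments cover 6 s of viewing time.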

*p*_{i,V}(*x*, *y*) of image *i* and condition V and *p*_{i,AV}(*x*, *y*) of the same image and AV conditions (i.e., either AVL or AVR), was evaluated by computing *r*_{V,AV}^{2}. Before the coefficients were calculated, the AV image pdfs were first normalized to *p*_{i,AV}^{N} according to Equation 5,

*r*_{V,AV}^{2} values was then compared to a control distribution of *r*^{2} values, which measure the baseline correlation between image pdfs coming from the same condition pairs (i.e., V and AVL, or V and AVR) but different images.
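
As an illustration, the *r*^{2} statistic between two image pdfs can be computed as the squared Pearson correlation over corresponding grid cells. The sketch below assumes exactly that and leaves out the normalization of Equation 5, which is not reproduced in this section; the function name and the toy pdfs are hypothetical.

```python
def r_squared(pdf_a, pdf_b):
    """Squared Pearson correlation between two pdfs of equal shape."""
    a = [value for row in pdf_a for value in row]
    b = [value for row in pdf_b for value in row]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("r squared is undefined for a constant pdf")
    return cov * cov / (var_a * var_b)


# Toy image pdfs (assumed values) on a 2 x 2 grid.
pdf_v = [[0.1, 0.4], [0.4, 0.1]]
pdf_av_same = [[0.1, 0.4], [0.4, 0.1]]   # identical spatial structure
pdf_av_other = [[0.1, 0.1], [0.4, 0.4]]  # unrelated spatial structure
r2_matched = r_squared(pdf_v, pdf_av_same)
r2_control = r_squared(pdf_v, pdf_av_other)
```

Applying the same function to pdfs taken from different images yields the kind of control values against which the matched pairs are compared.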

*D*_{KL}(*p*_{i,c1}, *p*_{i,c2}) denotes the Kullback–Leibler divergence measure between two pdfs, *p*_{i,c1}(*x*, *y*) and *p*_{i,c2}(*x*, *y*), in bits,

*c* = 10^{−9}) was added to all entries in the pdf. The precise choice of this constant did not make a difference to the results of our analysis.
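
A minimal sketch of this computation is given below. It uses the standard definition of the Kullback–Leibler divergence with a base-2 logarithm and adds the small constant to every entry; renormalizing both pdfs after adding the constant is an assumption of this sketch, as are the function name and the toy pdfs.

```python
import math


def kl_divergence_bits(pdf_p, pdf_q, eps=1e-9):
    """D_KL(p || q) in bits, with eps added to all entries to avoid log(0)."""
    p = [value + eps for row in pdf_p for value in row]
    q = [value + eps for row in pdf_q for value in row]
    sum_p, sum_q = sum(p), sum(q)  # renormalize after adding eps (assumption)
    return sum(
        (a / sum_p) * math.log2((a / sum_p) / (b / sum_q))
        for a, b in zip(p, q)
    )


# Toy image pdfs (assumed values) for conditions V and AV.
pdf_v = [[0.5, 0.5]]
pdf_av = [[0.25, 0.75]]
kl_v_av = kl_divergence_bits(pdf_v, pdf_av)
```

Without the constant, any location that is fixated in one condition but never in the other would make the divergence infinite.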

*p*_{i,AV}, *p*_{i,V}, and *p*_{i,A} are the image pdfs of image *i* at audiovisual, visual, and auditory conditions, respectively. The interaction term *p*_{i,VA} is supposed to approximate the image pdf that would arise from a multiplicative cross-modal interaction. It is created by the element-wise multiplication of both unimodal image pdfs and renormalized to a unit integral.
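
The construction of the interaction term can be sketched in a few lines: the two unimodal image pdfs are multiplied element-wise and the product is rescaled to sum to one. The function name and the toy pdfs are hypothetical.

```python
def interaction_pdf(pdf_v, pdf_a):
    """Element-wise product of two pdfs, renormalized to a unit sum."""
    product = [
        [v * a for v, a in zip(row_v, row_a)]
        for row_v, row_a in zip(pdf_v, pdf_a)
    ]
    total = sum(map(sum, product))
    if total == 0.0:
        raise ValueError("the pdfs do not overlap, so the product cannot be normalized")
    return [[value / total for value in row] for row in product]


# Toy unimodal image pdfs (assumed values): flat visual pdf, lateralized auditory pdf.
pdf_v = [[0.5, 0.5]]
pdf_a = [[0.2, 0.8]]
pdf_va = interaction_pdf(pdf_v, pdf_a)
```

The product is large only where both unimodal pdfs are large, which is why it serves as a regressor for a multiplicative interaction.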

*x*–*y* location was extracted from the 32 image pdfs of visual, auditory, and bimodal conditions, yielding a triplet of values representing the saliency. A given triplet defines the saliency of a given image location in the multimodal condition as a function of the saliency of the same location in both unimodal conditions, represented by the point (*p*_{V}(*x*, *y*), *p*_{A}(*x*, *y*), *p*_{AV}(*x*, *y*)) in the integration plot. These points were irregularly distributed and filled the three-dimensional space unevenly. We discarded the 15% of the values that lay in sparsely distributed regions, and we concentrated instead on the region where most of the data were located. The data points inside this region of interest were then binned, yielding an expected value and variance for each bin. Weighted least squares analysis was carried out to approximate the distribution by estimating the coefficients of the following equation:

*g*(*p*_{c})), which normalizes for its individual range, thus allowing a direct comparison of the regression coefficients.
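
A weighted least squares fit of this kind can be sketched as follows. The regression equation is not reproduced in this section, so the sketch assumes a model with an intercept, a visual term, an auditory term, and their product as the interaction term, and it assumes that each bin is weighted by the inverse of its variance. The solver, the function names, and the toy data are hypothetical; the geometric-mean normalization is omitted.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_integration(p_v, p_a, p_av, variances):
    """Fit p_AV = b0 + b1*p_V + b2*p_A + b3*p_V*p_A with weights 1/variance."""
    rows = [[1.0, v, a, v * a] for v, a in zip(p_v, p_a)]
    weights = [1.0 / s for s in variances]
    k = len(rows[0])
    # Normal equations: (X^T W X) beta = X^T W y.
    xtwx = [
        [sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(k)]
        for i in range(k)
    ]
    xtwy = [
        sum(w * r[i] * y for r, y, w in zip(rows, p_av, weights))
        for i in range(k)
    ]
    return solve_linear(xtwx, xtwy)


# Toy bin means (assumed): multimodal saliency built as an exact linear mix.
p_v = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
p_a = [0.3, 0.1, 0.4, 0.2, 0.6, 0.5]
p_av = [0.01 + 0.7 * v + 0.2 * a for v, a in zip(p_v, p_a)]
variances = [1.0, 2.0, 1.0, 0.5, 1.0, 2.0]
b0, b_v, b_a, b_va = fit_integration(p_v, p_a, p_av, variances)
```

Because the toy data contain no multiplicative component, the fitted interaction coefficient is zero up to rounding error, which is the pattern expected under purely linear integration.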

*actual* contrast distribution. This distribution was compared to a *control* distribution to evaluate a potential bias at the fixation points. An unbiased pool of fixation coordinates served as the control distribution; for a given image, this was constructed by taking all fixations from all images other than the image under consideration. This control distribution takes the center bias of the subjects' fixations into account, as well as any potential systematic effect in our stimulus database (Baddeley & Tatler, 2006; Tatler, Baddeley, & Gilchrist, 2005). The *contrast effect* at fixation points was computed by taking the ratio of the average contrast values at control and actual fixations. In order to evaluate the luminance contrast effect over time, the actual and control fixation points were separated according to their occurrences in time. The analysis was carried out using different temporal bin sizes ranging from 100 to 1000 ms.

*μ*_{V}) is located at 505 and 541 pixels, respectively (white crosses), for these two subjects, in the close vicinity of the center of the screen (located at 512 pixels). In the multimodal conditions (AVL and AVR), both subjects show a change in their fixation behavior. The horizontal distance between *μ*_{AVL} and *μ*_{AVR} is 221 and 90 pixels for the two subjects, respectively. Thus, in these two subjects, combined visual and auditory stimulation introduces a robust bias of fixation toward the side of sound presentation.

*μ*_{AVL} − *μ*_{V} and *μ*_{AVR} − *μ*_{V}) for all subjects over all images is skewed (Figure 3A). In most subjects, we observe a moderate effect of lateralized sound presentation. A small group of subjects showed only a small influence; one subject had an extreme effect. The medians of the two distributions are both significantly different from zero (sign test, *p* < 10^{−5}). Crosses indicate the positions of the two example subjects described above. They represent the 70th and 90th percentiles of the distributions. Hence, the complete statistics support the observations reported for the two subjects above.

*t* test, *p* < 10^{−5}). In both panels A and B, the distributions flank both sides of zero, with mean values of −37 and 36 pixels for *μ*_{AVL} − *μ*_{V} and *μ*_{AVR} − *μ*_{V}, respectively. Thus, auditory stimulation introduces a robust bias of fixation toward the side of sound presentation for all natural visual stimuli investigated.

*α* = .001) are indicated by vertical black bars. Comparing conditions AL and AR (Figure 4, right panel), we observe an increase in fixation probability on the half of the horizontal axis corresponding to the side of the sound location. This difference decays only slowly over time.

*r*^{2} statistic between saliency maps (such as those in Figure 5) belonging to unimodal and cross-modal conditions. For the examples shown in Figure 5, we obtain a KL divergence of 0.88, 1.32, 0.94, 1.02, and 1.09 bits between V and AVL conditions and 1.38, 1.79, 0.87, 1.09, and 0.79 bits between V and AVR.

*r*^{2} statistics range from .4 to .97, indicating that a substantial part of the total variance of the multimodal conditions is explained by the distribution of fixation points in the unimodal visual condition.

*SEM*). This is significantly different from the mean of the control distribution (3.45 ± 0.02 bits), which was created using 3200 randomly selected nonmatched V–AV pairs. The control distribution provides the upper limit for KL divergence values given our data set. Hence, given the distribution of fixation points on an image in the visual condition, the amount of information necessary to describe the distribution of fixation points on this image in the multimodal conditions is about one third of the information necessary to describe the difference in fixation points on different images in these conditions.

*r*^{2} values calculated between image pdfs from multimodal and unimodal conditions originating from the same image. The distribution is centered at .71 ± .13 and the difference between this measure and a control *r*^{2} measure calculated from shuffled image pairs is highly significant (*t* test, *p* < 10^{−5}). This implies that for most images, the unimodal fixation pdfs account for more than half of the variance in the observed distribution of fixation points in multimodal conditions. Hence, the bias of gaze movements toward the side of the auditory stimulus largely conserves the characteristics of the visual saliency distribution. Therefore, the behavior of the subjects under the simultaneous presence of auditory and visual stimuli is an integration of both modalities.

*SD*; Figure 7, circles). The contribution of unimodal auditory saliency is smaller (0.16 ± 0.12). The coefficient of the cross-product interaction term is, however, slightly negative with mean −0.05 ± 0.10. We repeated the same analysis for a subset of subjects (*n* = 14, 40%) for whom the lateralized auditory stimulus had the strongest effect on fixations in terms of gravity center shift in the unimodal auditory conditions. In these subjects, the contribution of auditory coefficients was increased (.32 ± .17) at the expense of the visual ones (.53 ± .20), without any apparent effect on the interaction term (−.06 ± .11, *t* test, *p* = .59; Figure 7, crosses). In both cases, the intercept was very close to zero but still significantly different from it. These results suggest that the biggest contributions to the multimodal pdfs originate from the linear combinations of the unimodal pdfs.

*r*^{2} values having a median value of .72 over all images, as expected from the previous section. Additionally including unimodal auditory pdfs increased the median *r*^{2} only slightly, by 3%. Repeating this analysis using only the subset of subjects showing the strongest auditory lateralization effect, we obtained a median value of 0.36 with the sole visual regressor. The subsequent introduction of the unimodal auditory pdfs as a second regressor increased the goodness of fit by 21% over all images. Further including the cross-interaction term as the third regressor increased the goodness of fit slightly, by 5%. Therefore, we can argue that a mechanism linearly combining the unimodal saliencies can account well for the observed behavior in the multimodal conditions.

-free approach, we compute integration plots using saliencies obtained from different conditions. It should be noted that no assumptions are made regarding the calculation of the saliency; that is, these saliency values are empirically determined by the gaze locations of many subjects. Integration plots are constructed by plotting the saliency of a given spatial location in the multimodal pdfs as a function of unimodal saliencies of the same spatial location. The specific distribution within this three-dimensional space describes the integration process. In Figures 8A, 8B, and 8C, the height of each surface depicts the corresponding salience of the same location during the multimodal condition as a function of the saliency of the same location in unimodal conditions. The three hypotheses about the integration of auditory and visual information make different predictions (see Introduction): Early interaction leads to a facilitatory effect and an expansive nonlinearity (Figure 8A). The landscape predicted in the case of linear integration is planar and is shown in Figure 8B. Late combination gives rise to a compressive nonlinearity (Figure 8C).

*t* test, *p* < 10^{−6}) except for the intercept coefficient (*t* test, *p* = .23).

*r*^{2} is equal to .89, suggesting a good fit. We repeated the same analysis with the interaction term included after normalizing each regressor with its geometric mean in order to have the same exponent range, thus permitting an evaluation of the contribution of different regressors to the multimodal saliency. This yielded .57 ± .08, .29 ± .08, and .029 ± .05 for visual, auditory, and interaction terms, respectively. The linear contributions of unimodal saliencies were highly significant, whereas the intercept and the interaction terms were not statistically different from zero.

*r*^{2} of the fitted data at this resolution was .87, ensuring that the fit was still reasonably good. The values of coefficients were practically the same, and the only noticeable change during the incremental increase of the resolution was that the interaction term reached the significance level (*p* < .05) at the resolution of 20 × 20, thus demonstrating a slight facilitatory effect. These results support the conclusion that linear integration is the dominating factor in cross-modal integration during overt attention, with an additional small facilitatory component.

*r*^{2} values varied between .81 and .9, decreasing with higher binning resolutions. As above, increasing the number of bins revealed a slight but significant facilitatory effect. Within this subset of subjects, the contribution of auditory saliency (.36 ± .08) was again shown to increase at the expense of the visual contribution (.50 ± .08). Removing the interaction term from the regression analysis caused a maximum drop of only 2.5% in the goodness of fit for all tested bin resolutions within this subset of subjects.

*Y*-axis are the data points originating from the region of the screen from which the sound emanates, during the temporal interval of interest. We performed separate regression analyses for these halves of the resulting interaction map. In the lower part (*p*(A) < 0), the best predictor was the visual saliencies, as can be seen from the contour lines. In the upper part (*p*(A) > 0), a linear additive model incorporating auditory and visual saliencies approximates the surface well. The results derived in an analysis of the effects of different bin sizes were comparable to the above results; that is, a model linearly combining unimodal saliencies along with a slight facilitatory component was sufficient to explain most of the observed data.

*before* the nonlinearities involved in the process of fixation point selection are applied.