Research Article | August 2010
The potency of people in pictures: Evidence from sequences of eye fixations
Katherine Humphrey, Geoffrey Underwood
Journal of Vision August 2010, Vol. 10(10), 19. https://doi.org/10.1167/10.10.19
Abstract

Does the presence of people in a natural scene affect the way that we inspect that picture? Previous research suggests that we have a natural tendency to look at the social information before other items in a scene. There is also evidence that accuracy of visual memory and the way we move our eyes are related. This experiment investigated whether eye movements differed when participants correctly and incorrectly identified stimuli at recognition, and how this is affected by the presence of people. Eye movements were recorded from 15 participants while they inspected photographs at encoding and during a recognition memory test. Half of the pictures contained people and half did not. The presence of people increased recognition accuracy and affected average fixation duration and average saccadic amplitude. Accuracy was not affected by the size of the Region of Interest (RoI), the number of people in the picture, or the distance of the person from the center. Analyses of the order and pattern of fixations showed a high similarity between encoding and recognition in all conditions, but the lack of relationship between string similarity and recognition accuracy challenges the idea that the reproduction of eye movements alone is enough to create a memory advantage.

Introduction
Does the presence of people in a natural scene affect the way we move our eyes? Previous research suggests that we have a natural tendency to look at the social information before other items in a scene. Yarbus (1967) showed participants a picture of the Repin painting “An Unexpected Visitor” and found that there was a tendency to look at the heads and faces of the people. This focus on heads and faces could reflect an attempt to work out where the people are attending (e.g., Baron-Cohen, 1994) and has been shown to occur in children as young as 3 months (Hood, Willen, & Driver, 1998), in 3- to 5-year-old children (Ristic, Friesen, & Kingstone, 2002), and in adults (Friesen & Kingstone, 1998; Langton & Bruce, 1999). These studies suggest that we do indeed look at social information in a scene. However, Birmingham, Bischof, and Kingstone (2008) argue that the effect of social information/cuing in these (and other similar) studies might be so strong because the stimuli show only a face, therefore restricting what the participant looks at. In response to this, Birmingham et al. used complex real-world scenes containing people and found that participants still fixated on the eyes more frequently than on other objects/regions. 
This repeated finding that viewers focus on people in a scene (and more specifically faces) has led to the suggestion that faces have a biological significance that attracts attention (e.g., Ro, Russell, & Lavie, 2001). Ro et al. found that, using a flicker paradigm, changes to faces were noticed both more accurately and more quickly than changes to other objects. Participants have also been shown to detect a change made to a scene sooner when an individual appearing in the scene was gazing at the changing object than when the individual was absent, gazing straight ahead, or gazing at a non-changing object (Langton, O'Donnell, Riby, & Ballantyne, 2006). Fletcher-Watson, Findlay, Leekam, and Benson (2008) showed participants two scenes next to each other, one of which contained a person. Participants exhibited a preferential viewing to the person-present rather than the person-absent scenes and instructions to identify the person's gender increased viewing of the person-present scenes from the first fixation. 
Despite the abundance of evidence for people being fixated more often than other areas, few studies have looked at the effect that the presence of people has on measurements such as saccadic amplitude, recognition accuracy, or scanpaths (the sequence of fixations and saccades). Previous research has found similarities between scanpaths at encoding (initial inspection of the scene) and recognition (e.g., Humphrey & Underwood, 2009); are these similarities affected by the presence of people? It would also be interesting to find out whether scanpaths at recognition test differ depending on whether participants correctly or incorrectly identify the stimuli. Noton and Stark (1971) first proposed a “Scanpath Theory,” whereby reproducing the same eye movements at second viewing of a stimulus should enhance recognition. It was originally suggested that oculomotor movements and neural mechanisms in the brain were directly related to an internal cognitive-spatial model; however, these assumptions are largely unsupported and have attracted criticism (e.g., Henderson, 2003). Despite this, there is some evidence that scanpaths are similar during encoding (initial inspection) of a picture and when viewing that picture for a second time (Foulsham & Underwood, 2008; Humphrey & Underwood, 2009; Stark & Ellis, 1981; Underwood, Foulsham, & Humphrey, 2009; Underwood, Humphrey, & Foulsham, 2008; Walker-Smith, Gale, & Findlay, 1977). For example, Foulsham and Underwood (2008) showed participants a set of pictures at encoding and then later during a recognition test. Results revealed a high similarity in the first five or six fixations over repeated viewings. Similarly, Harding and Bloj (2010) found above-chance similarity of scanpaths at encoding and recognition, despite manipulation of low-level image properties. 
Scanpaths are an important tool in eye movement research, as they tell us more than simple fixation location/duration; they allow insight into picture inspection over a period of several seconds. Scanpath analyses take into account not only where in a scene a viewer fixates but in what order and pattern. These patterns of inspection have been found to be highly similar not only over multiple viewings of a picture but also after extended periods of time (Underwood et al., 2008). 
To test Scanpath Theory's predicted relationship between scanpath similarity and recognition accuracy, eye movements during encoding in the current study will be compared to eye movements during a recognition memory test. The similarity scores will then be related to accuracy in identifying whether a picture has been seen before or not. If the presence of people in a scene creates very similar eye movements at encoding and recognition due to attention being drawn toward social cues, it is predicted that accuracy should increase for pictures containing people. 
Previous research has suggested that accuracy of recognition and recall is affected by eye movements within a scene. For example, Underwood, Chapman, Berger, and Crundall (2003) recorded eye movements while participants watched video recordings taken from a moving vehicle. Hazardous events (e.g., a pedestrian stepping out into the road) were fixated more often than non-hazardous objects/events. When memory was tested immediately after a hazardous event, there was evidence of attentional focusing and reduced availability of details about incidental objects. This demonstrates that where people look in a scene can affect how accurate they are at recognizing or recalling details from the scene at a later time. One aim of the current experiment is to investigate if eye movements (including fixation duration, saccadic amplitude, and scanpaths) differ when participants correctly and incorrectly identify stimuli at recognition, and whether this is affected by the presence of people in the natural scene. 
Furthermore, experts have been found to exhibit enhanced memory for pictures from their own domain and fixate more on domain-specific areas of interest. For example, experienced radiologists are reliably faster and more accurate at identifying abnormalities in X-rays than novices are (De Valk & Eijkman, 1984; Nodine, Mello-Thoms, Kundel, & Weinstein, 2002). Skilled musicians exhibit faster recognition of notes (Bean, 1938; Salis, 1980; Sloboda, 1978) and enhanced encoding of musical information (Clifton, 1986; Halpern & Bower, 1982; Sloboda, 1976; Thompson, 1987), and sports enthusiasts demonstrate increased recall of domain-specific knowledge (Chase & Erikson, 1982; Voss, Vesonder, & Spilich, 1980). This experience-dependent variation in eye movements and memory has also been documented in the domains of driving, industry, and leisure. 
In addition to domain-specific memory advantages, experts have been found to produce more similar scanpaths at encoding and recognition when viewing pictures from their own domain. Furthermore, low-level visual saliency has been found to influence eye movements during scene inspection, but this is reliably reduced when viewing pictures relevant to one's specialist knowledge. Humphrey and Underwood (2009) report that actual scanpaths are least similar to scanpaths predicted by saliency when experts view pictures from their own domain. 
It could be argued that we are all experts at recognizing people and social cues, and therefore, the presence of people in pictures should increase recognition accuracy and scanpath similarity over multiple viewings. One aim of the current experiment is to test this theory using complex real-world scenes. 
Methods
Participants
Fifteen participants took part in the experiment, all of whom were students (undergraduates and postgraduates) at Nottingham University. The age range was 18–39 years and the mean age was 21.5. The sample comprised 10 females and 5 males. All participants had normal or corrected-to-normal vision. Inclusion in the study was contingent on reliable eye-tracker calibration, and data from one participant had to be excluded, leaving 14 participants in the analyses. 
Materials and apparatus
Eye position was recorded using an SMI iVIEW X Hi-Speed eye tracker, which uses an ergonomic chin rest and provides a gaze position accuracy of 0.2 degrees. The system parses samples into fixations and saccades based on velocity across samples, with a spatial resolution of 0.01°, a processing latency of less than 0.5 ms, and a sampling rate of 240 Hz. An eye movement was classified as a saccade when its velocity reached 30 deg/s or when its acceleration reached 8000 deg/s². 
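As an illustration of this classification rule, a velocity-based parser can be sketched in a few lines. This is a minimal sketch, assuming gaze samples already converted to degrees of visual angle; the thresholds are those quoted above for the iVIEW X system, but the smoothing and windowing in the vendor's software are not published, so the function below is an approximation rather than the actual parser.

```python
import numpy as np

def classify_saccades(x_deg, y_deg, rate_hz=240,
                      vel_thresh=30.0, acc_thresh=8000.0):
    """Label each gaze sample True (saccade) or False (fixation).

    Applies the velocity (30 deg/s) and acceleration (8000 deg/s^2)
    criteria quoted above to finite-difference estimates; the
    vendor's exact filtering is not public.
    """
    dt = 1.0 / rate_hz
    vx = np.gradient(np.asarray(x_deg, dtype=float), dt)
    vy = np.gradient(np.asarray(y_deg, dtype=float), dt)
    velocity = np.hypot(vx, vy)                        # deg/s
    acceleration = np.abs(np.gradient(velocity, dt))   # deg/s^2
    return (velocity >= vel_thresh) | (acceleration >= acc_thresh)
```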
A set of 200 high-resolution digital photographs was prepared as stimuli, sourced from a commercially available CD-ROM collection and from a 5-MP digital camera. The photographs were of agricultural scenes; of this set of 200, 100 contained people and 100 did not (see Figures 1 and 2). The regions of interest were defined as a rectangular box around the person or persons inside the picture, meaning that every “people picture” had its own unique RoIs. Half the pictures contained one person and half contained more than one person. When the scenes contained more than one person, the people were generally in the same region of the picture and the RoI was defined around them. However, when the scene contained multiple people that were not in the same region, RoIs were defined around each individual and an average RoI size and distance from the center was calculated. In 54% of the pictures, the RoIs were less than 10 degrees of visual arc from the center, with the remaining 46% being 10 degrees or more from the center of the picture. In half the pictures, the “people” RoIs were less than 20 pixels, and in half, they were 20 pixels or more. 
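The RoI bookkeeping described above, one rectangle per person with sizes and center distances averaged when several people occupy different regions, could be summarized as in the sketch below. The box format, the use of box area as the size measure, and the pixels-per-degree conversion are illustrative assumptions rather than the authors' documented procedure.

```python
import math

PX_PER_DEG = 1600 / 25.03  # approximate pixels per degree for this display

def summarize_rois(boxes, img_w=1600, img_h=1200):
    """Average RoI size and mean distance of RoI centers from the
    picture center, given person bounding boxes (x0, y0, x1, y1) in
    pixels. Mirrors the averaging described for multi-person scenes."""
    cx, cy = img_w / 2, img_h / 2
    sizes, dists = [], []
    for (x0, y0, x1, y1) in boxes:
        sizes.append((x1 - x0) * (y1 - y0))              # box area, px^2
        bx, by = (x0 + x1) / 2, (y0 + y1) / 2            # box center
        dists.append(math.hypot(bx - cx, by - cy) / PX_PER_DEG)  # deg
    return sum(sizes) / len(sizes), sum(dists) / len(dists)
```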
Figure 1
An example of a “no people” stimulus.
Figure 2
An example of a “people” stimulus.
Half of each stimulus category (people/no people) were designated “old” and shown in both encoding and test phases, while the other half were labeled “new” and were shown only as fillers at test. Pictures were presented on a color computer monitor at a resolution of 1600 by 1200 pixels. The monitor measured 43.5 cm by 32.5 cm, and a fixed viewing distance of 98 cm gave an image that subtended 25.03 by 18.83 degrees of visual angle. 
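The quoted image size follows from the viewing geometry: an extent of size s viewed head-on at distance d subtends 2·arctan(s/2d). A quick check of the numbers above:

```python
import math

def subtended_deg(size_cm, distance_cm=98.0):
    """Visual angle (degrees) subtended by an extent viewed head-on."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))

print(subtended_deg(43.5))  # ~25.03 deg (horizontal, 43.5 cm at 98 cm)
print(subtended_deg(32.5))  # ~18.83 deg (vertical, 32.5 cm at 98 cm)
```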
Procedure
Following a 9-point calibration procedure, participants were shown written instructions asking them to inspect the following pictures in preparation for a memory test. 
In a practice phase designed to familiarize participants with the equipment, the displays, and the task, they were shown a set of five photographs that were similar to the ones in the experimental set. Participants were not told to look for anything in particular in any of the pictures but were asked to look at them in preparation for a memory test. Following the practice phase, the first stage of the experiment began. One hundred stimuli (50 with people, 50 without) were presented in a randomized order. Each picture was preceded by a fixation cross, which ensured that fixation at picture onset was in the center of the screen. Each picture was presented for 2000 ms, during which time participants moved their eyes freely around the screen. A presentation time of 2000 ms was long enough to get an average of 7 fixations but short enough to make the task quite challenging. The task was designed to be difficult in order to decrease the accuracy rate, so that eye movements from correct responses at recognition could be compared to incorrect responses. This was achieved by shortening the presentation time and also using a large number of pictures (200 pictures at recognition). 
After all 100 stimuli had been presented, participants were informed that they were going to see a second set of pictures and had to decide whether each picture was new (never seen before) or old (from the previous set of pictures). Participants were instructed to press “N” on the keyboard if the picture was new, and “O” on the keyboard if the picture was old. During this phase, 200 stimuli were presented in a random order; 100 of these were old and 100 new (though the participants were not informed of this fact). In order to facilitate an ideal comparison between encoding and test phases, each picture was again shown for 2000 ms and participants could only make a response after this time. Accuracy was emphasized over speed. This was to encourage scanning of the whole picture so that scanpaths from the first and second phases of the experiment could be compared. At the start of the second phase, participants were given a practice of the task, using 10 photographs that were similar to the ones in the experimental set, 5 of which were the practice photographs from the first part of the experiment. Feedback was given in the practice phase as to whether or not the participant gave the correct response of “old” or “new.” No feedback was given in the experimental phase. 
Results
There were two main types of data: recognition memory accuracy, and eye-tracking measures comprising average fixation duration, average saccadic amplitude, and string analyses (encoding compared with second viewing). 
Trials were excluded where the fixation at picture onset was not within the central region (the central square around the fixation cross when the picture was split into a 5 × 5 grid at analysis), when participants looked away from the screen (e.g., to the keyboard), or when calibration was temporarily interrupted (e.g., if the participant sneezed, therefore removing their head from the eye tracker). Overall, 7% of trials were excluded. 
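For concreteness, the central-region criterion can be written as a cell test on the 5 × 5 analysis grid. This sketch assumes the grid is laid over the full 1600 × 1200 picture, which the text implies but does not state explicitly.

```python
def in_central_cell(x_px, y_px, img_w=1600, img_h=1200, grid=5):
    """True if a fixation falls in the central cell of a grid x grid
    partition of the picture (the trial-inclusion check above)."""
    cell_w, cell_h = img_w / grid, img_h / grid
    col, row = int(x_px // cell_w), int(y_px // cell_h)
    return col == grid // 2 and row == grid // 2

print(in_central_cell(800, 600))  # True: the picture center
print(in_central_cell(100, 600))  # False: leftmost column of the grid
```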
Recognition memory
Accuracy
Participants were more accurate when the stimuli contained people (76%) compared to when no people were present in the stimuli (71%). Accuracy was measured by the percentage of pictures correctly identified at recognition (see Figure 3). 
Figure 3
A bar chart showing the recognition accuracy rates for pictures containing people and those containing no people. Participants are more accurate when pictures contain people. The error bars represent the standard error of the mean.
A one-way ANOVA was conducted and found the difference to be statistically reliable, F(1, 26) = 5.356, MS = 0.014, p < 0.05. An overall d prime (d′) measure was calculated as 1.281. The mean d′ for “people pictures” was 1.500, and the mean d′ for “no people pictures” was 1.248. A one-way ANOVA found no reliable difference in d′ between the stimulus types. So although the relatively low d′ value represents some response bias (e.g., to respond “new” at recognition test), this was not reliably different for people and no people pictures. 
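For reference, d′ is computed from the hit rate H and false alarm rate FA as d′ = z(H) - z(FA). The underlying rates are not reported in the paper, so the figures in this sketch are illustrative only, chosen to land near the reported overall d′ of 1.281.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Signal-detection sensitivity: d' = z(hits) - z(false alarms)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Illustrative rates only (not reported in the paper): a hit rate of
# .72 with a false alarm rate of .24 gives d' of about 1.29, near the
# reported overall value of 1.281.
print(round(d_prime(0.72, 0.24), 2))
```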
Three further one-way ANOVAs were carried out to take into account other variables of the stimuli that could affect recognition accuracy, namely the size of the Region of Interest (RoI), the number of people in the picture, and the distance of the RoI from the center of the screen (see Figures 4A–4C). 
Figure 4
(A) A bar chart illustrating the accuracy rates for pictures with a Region of Interest of less than 20 pixels and for pictures with a Region of Interest of 20 pixels or more. The error bars represent the standard error of the mean. (B) A bar chart illustrating the accuracy rates for pictures containing one person and for pictures containing more than one person. The error bars represent the standard error of the mean. (C) A bar chart illustrating the recognition accuracy rates for pictures where the RoI is less than 10 degrees of visual arc from the center of the screen (initial fixation) and where the RoI is 10 degrees of visual arc or more from the center of the screen. The error bars represent the standard error of the mean.
There was no reliable difference in recognition accuracy due to the number of people, the size of the RoI, or the distance of the RoI from the center of the screen. 
Eye tracking measures
Average fixation duration
Fixations under 70 ms were counted as corrective fixations and were not included in the analyses. The mean fixation durations are shown in Table 1. 
Table 1
Average fixation durations for pictures shown at recognition memory test.
Stimulus Fixation duration (ms)
Incorrect, no people, old 340.7
Incorrect, no people, new 331.2
Incorrect, people, old 357.4
Incorrect, people, new 338.5
Correct, no people, old 349.5
Correct, no people, new 335.9
Correct, people, old 357.9
Correct, people, new 335.1
A 2 × 2 × 2 repeated measures ANOVA found no reliable difference in fixation duration between correctly and incorrectly identified stimuli or between people and no people stimuli. There was a reliable difference in fixation duration between old stimuli and new stimuli, F(1, 13) = 14.223, MSE = 518.773, p < 0.05. There were no reliable interactions. 
Regions of interest analysis
Regions of Interest (RoI) analyses were conducted using ILAB, a MATLAB toolbox (Gitelman, 2002). On average, the RoI occupied 19.9% of the “people picture,” and 58% of fixations fell inside the RoI (i.e., were focused on the person or persons in the picture). Additionally, an average of 45% of all the time spent on each “people picture” was focused on the RoI. In order to calculate a chance baseline, each “people picture” was randomly paired with a “no people picture.” The RoI from the “people picture” was applied to the “no people picture,” and the number of fixations and the time spent in that area were calculated (see the Discussion section for a critical evaluation of this method). Paired-samples t-tests showed that reliably more fixations fell in the RoIs of the people pictures than in the same areas of the randomly assigned “no people pictures,” t(13) = 9.709, p < 0.05, and that reliably more time was spent in those RoIs, t(13) = 12.755, p < 0.05. It can be concluded that the people in the pictures attracted more fixations, and more fixation time, than would be expected by chance. 
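A sketch of this analysis, assuming fixations are available as (x, y, duration) triples and RoIs as pixel rectangles; the chance baseline simply applies the same rectangle to the fixations recorded on the randomly paired “no people” picture. Names and data layout are illustrative.

```python
def roi_stats(fixations, roi):
    """Fraction of fixations, and of total viewing time, inside a
    rectangular RoI. fixations: list of (x, y, duration_ms) triples;
    roi: (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = roi
    inside = [(x, y, d) for (x, y, d) in fixations
              if x0 <= x <= x1 and y0 <= y <= y1]
    frac_fixations = len(inside) / len(fixations)
    frac_time = (sum(d for (_, _, d) in inside)
                 / sum(d for (_, _, d) in fixations))
    return frac_fixations, frac_time

# Observed value vs. chance baseline (same RoI, fixations from the
# randomly assigned "no people" picture):
# observed = roi_stats(fixations_people_picture, roi)
# baseline = roi_stats(fixations_paired_no_people_picture, roi)
```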
Average saccadic amplitude
Mean saccadic amplitudes are shown in Table 2. 
Table 2
Average saccadic amplitudes for pictures shown at recognition memory test.
Stimulus Saccadic amplitude (deg arc)
Incorrect, no people, old 3.7
Incorrect, no people, new 4.3
Incorrect, people, old 4.6
Incorrect, people, new 4.4
Correct, no people, old 4.3
Correct, no people, new 4.2
Correct, people, old 4.1
Correct, people, new 4.7
A 2 × 2 × 2 repeated measures ANOVA found no reliable difference in saccadic amplitude between correctly and incorrectly identified stimuli, people and no people stimuli, or between old and new stimuli. There were no reliable two-way interactions, although the interaction between people/no people and old/new stimuli was nearing significance. There was a reliable 3-way interaction, F(1, 13) = 10.454, MSE = 0.317, p < 0.05 (see Figure 5). 
Figure 5
A line graph to illustrate the 3-way interaction between the factors people/no people, correct/incorrect, and old/new pictures.
Post-hoc T-tests showed a reliable difference between: “incorrect old no people” and “incorrect new no people,” t(13) = 2.747, SEM = 0.32824, p < 0.05; “incorrect old no people” and “correct old no people,” t(13) = 2.708, SEM = 0.21263, p < 0.05; and between “incorrect old people” and “correct old people,” t(13) = 2.424, SEM = 0.21592, p < 0.05. 
Scanpaths: String editing
String editing was used to analyze the similarity between scanpaths produced at encoding and second viewing. This technique is described in detail elsewhere (Brandt & Stark, 1997; Choi, Mosley, & Stark, 1995; Foulsham & Underwood, 2008; Hacisalihzade, Allen, & Stark, 1992) and involves turning a sequence of fixations into a string of characters by segregating the stimulus into labeled regions. The similarity between two strings is computed by calculating the minimum number of editing steps required to turn one into the other. 
Here, strings were cropped to seven letters and were computed for each subject viewing each stimulus in the experiment. Seven letters were used because this was the mean number of fixations made on each stimulus. This gave a more standardized and manageable data set and was long enough to display any emerging similarity. In those trials where fewer than seven fixations remained after condensing gazes, any comparison strings were trimmed to the same length. Once the strings had been produced for all trials, they were compared using the editing algorithm and an average string similarity was produced across trials. String similarity scores can range from zero to one, with zero being no similarity at all (no shared letters between the strings) and one being the perfect replication of eye movements (identical strings). 
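A minimal sketch of this pipeline: fixations are mapped to letters via a grid over the picture (a 5 × 5 grid of 25 regions is assumed here, matching the grid used elsewhere in the analysis; the text does not state the region scheme used for string editing), strings are cropped to seven characters, and similarity is taken as one minus the Levenshtein edit distance normalized by string length.

```python
def fixations_to_string(fixations, img_w=1600, img_h=1200, grid=5,
                        max_len=7):
    """Map (x, y) fixations to a string, one letter per grid cell."""
    letters = []
    for (x, y) in fixations[:max_len]:
        col = min(int(x * grid / img_w), grid - 1)
        row = min(int(y * grid / img_h), grid - 1)
        letters.append(chr(ord("A") + row * grid + col))
    return "".join(letters)

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(s1, s2):
    """1 = identical scanpath strings, 0 = no shared structure."""
    n = max(len(s1), len(s2))
    return 1 - levenshtein(s1, s2) / n if n else 1.0
```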
Chance was calculated by comparing the eye movements a participant made on each picture to the eye movements that the same participant produced on another randomly selected picture. This analysis was split into two categories: pictures that contained people and pictures that did not. This was repeated for all 14 participants and average similarity scores were calculated: 0.210 for pictures that contained people and 0.232 for pictures that did not contain people. Scanpaths were not compared between participants, as previous research has found significantly less similarity between participants than between multiple scanpaths from the same participant; a between-participants comparison might therefore have created an artificially low chance baseline. 
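The chance baseline then amounts to scoring each picture's string against a string from a different, randomly chosen picture viewed by the same participant. A sketch, reusing similarity() from above; the sampling scheme is illustrative:

```python
import random

def chance_similarity(strings_by_picture, n_draws=1000, seed=1):
    """Mean similarity between scanpath strings from randomly paired,
    different pictures (one participant's data)."""
    rng = random.Random(seed)
    pictures = list(strings_by_picture)
    scores = []
    for _ in range(n_draws):
        a, b = rng.sample(pictures, 2)  # two distinct pictures
        scores.append(similarity(strings_by_picture[a],
                                 strings_by_picture[b]))
    return sum(scores) / len(scores)
```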
Encoding vs. recognition
The scanpaths generated from encoding (initial inspection) of a picture were compared to those on second viewing during the recognition test (see Figure 6). 
Figure 6
A bar chart illustrating the mean similarity of scanpaths between encoding (initial inspection) and recognition. A score of 1 would indicate identical scanpaths. The error bars represent the standard error of the mean.
When participants were incorrect, scanpaths were more similar at encoding and recognition if the pictures contained people (mean similarity 0.3) compared to when no people were present (mean similarity 0.25). 
A 2 × 2 repeated measures ANOVA found no reliable difference in string similarity scores between correct and incorrect stimuli, i.e., that eye movements were not reliably more similar (at encoding and recognition) when pictures were correctly identified compared to incorrectly identified. This suggests that the replication of scanpaths alone is not enough to produce a memory advantage. 
There was no reliable interaction between correct/incorrect and people/no people. However, there was a reliable difference in string similarity scores between people and no people pictures, F(1, 13) = 7.184, MSE = 0.002, p < 0.05. It can be concluded that scanpaths are reliably more similar at encoding and recognition when pictures contain people (mean similarity 0.3) compared to when no people were present (mean similarity 0.27). 
For three of the conditions, scanpaths at encoding and recognition were reliably more similar than would be expected by chance: for Correct People vs. chance, t(13) = 4.462, p < 0.001; for Incorrect People vs. chance, t(13) = 5.071, p < 0.001; for Correct No People vs. chance, t(13) = 2.517, p < 0.05. The Incorrect No People condition was not statistically reliably greater than chance. 
Discussion
Does the presence of people in a picture affect recognition memory? Our findings show that it does. Participants were reliably more accurate when the stimuli contained people compared to when no people were present in the stimuli. Consistent with previous research, semantically informative scene items could have aided memory in the later recognition test (e.g., Henderson, 2003) and the people in the scene could have acted as socially informative items (Birmingham et al., 2008). When encoding the pictures, participants may use these social cues to form conclusions such as “picking bananas” or “driving the tractor,” to aid memory at recognition. These conclusions would be harder to form in “no people” pictures, consisting of fields, hay bales, farm houses, cattle, tractors, etc. 
This said, it should be noted that the task, with 200 pictures at recognition, was far from easy (mean accuracy was 70% for people pictures and 61% for no people pictures). Both the “people” and “no people” pictures had others in the set that were very similar—in some cases, it was the same scene taken from a different camera angle (participants were warned of these possibilities and were told to treat them as “new” unless they were identical to previously seen pictures). The “no people” pictures had other inanimate objects of a similar size to persons in the “people” pictures, and in both cases, objects/people were sometimes in the foreground and sometimes the background. The task was deliberately designed to be difficult in order to compare the differences in eye data on the old pictures that participants correctly identified and those that they incorrectly identified. The task difficulty did not overshadow the fact that participants were reliably more accurate when the stimuli contained people compared to when no people were present in the stimuli. 
The large and varied nature of the stimulus set helped make the task difficult and the stimuli unpredictable. Further analysis of the data revealed that accuracy was not affected by the number of people in the pictures, the size of the Region of Interest, or the distance of the RoI from the center. This suggests that differences in fixation duration, saccadic amplitude, and scanpaths are not due to variations within the stimulus set. More importantly, it suggests that no matter how many people there are in a picture, how large or small those people are, or how far away they are from initial fixation, the presence of people in a natural scene affects eye movements and recognition memory. 
Average Fixation Duration analyses showed that when participants viewed old stimuli, they made longer fixations. One possible explanation is that the participants fixated on an area/object that looked familiar and looked at it for longer to be sure that they had seen it before. On new pictures, participants may have looked around the stimuli trying to find familiar areas, but since there were no familiar areas, the duration of each fixation was shorter. This coincides with the accuracy data that suggest “old” pictures are harder to identify than “new” ones. 
The Region of Interest analyses revealed that more fixations fell inside the RoIs of the people pictures than would be expected by chance and that more time was spent fixating within the RoIs of the people pictures than would be expected by chance. In other words, participants did look at the people in the pictures, providing evidence that the presence of people in natural scenes does affect the way we move our eyes. We understand that there is no agreed way to calculate a chance baseline and many different methods have been considered (e.g., Fletcher-Watson et al., 2008, paired a person-present scene with a person-absent scene in the same display). The problem lies in the visual complexity of people. Even if this study was repeated using the same scenes with and without people, there is no objective way of measuring the complexity of the object(s) that would replace the person in the paired scene. We therefore maintain that the method of randomly pairing people and no people pictures and comparing RoIs was a fair and reliable way of measuring chance. However, it should also be acknowledged that the saliency of low-level image features such as spatial frequency, contrast, and color cannot be entirely ruled out of having influence in explaining why fixation appears to be drawn to people, since it is hard to provide a baseline where these features stay constant but the social aspects are removed. 
Saccadic amplitude analyses revealed a 3-way interaction. First, for old pictures that were incorrectly identified, when no people were present in the pictures saccadic amplitude was reliably smaller, and when people were present, saccadic amplitude was reliably greater. Second, when pictures were old and contained no people an incorrect response was related to reliably smaller saccadic amplitude and a correct response was related to reliably greater saccadic amplitude. Third, when pictures were old and contained people, an incorrect response was related to reliably greater saccadic amplitude. It seems that when the picture contains people, if participants ignore this social information but instead search more widely (i.e., increased saccadic amplitude), they are more likely to incorrectly identify the picture. One possible explanation is that when no social information is present, participants have to search more widely to find familiar objects/areas. If they fail to do this (i.e., decreased saccadic amplitude), then they are more likely to incorrectly identify the stimuli. The accuracy data and fixation duration analyses suggest that it is harder to correctly identify an old picture than a new one, which could explain why these interactions are found only on old pictures. 
Comparisons of scanpaths at encoding and recognition showed that similarity between the two was high in all conditions. According to Scanpath Theory, the scanpaths at encoding were similar to those at recognition because they were stored and recalled top down, to determine the scanning sequence. Birmingham et al. (2008) suggest another reason for the similarity in scanpaths at encoding and recognition for pictures containing people. They found that when participants were asked to encode scenes for a memory test, they fixated on the eye area within “people scenes” more frequently than when they were simply asked to freely view the pictures. This was true at both encoding and recognition and suggests that the eyes are scanned strategically by observers who are aware that they will have to remember the scenes. Participants who were not told of the memory test fixated the eyes more strongly in the (surprise) test session than in the free-viewing study session. Thus, Birmingham et al. conclude, the eyes appear to be informative both for deliberately encoding scenes and for spontaneously trying to recognize them. 
An aim of this study was to investigate whether scanpaths differed at encoding and recognition depending on accuracy. Previous research has found reduced accuracy for stimuli that had reduced scanpath similarity between encoding and recognition (Harding & Bloj, 2010). However, although results from the current study suggest some evidence for the replication of eye movement sequences over multiple viewings, the lack of relationship between string similarity and accuracy challenges the idea that the reproduction of eye movements alone is enough to create a memory advantage. People pictures could have been easier to recognize because participants formed conclusions about the semantic content/gist, assigning mental labels to these pictures, e.g., “driving the tractor.” Therefore, even when eye movements were not perfectly reproduced at recognition, people pictures were still easier to identify. 
An interesting finding of this study was that when participants incorrectly identified “old” pictures as “new,” scanpaths at encoding and recognition were reliably more similar when the pictures contained people. This suggests that even though participants were making very similar eye movements at encoding and recognition on the “people” pictures, they were still incorrectly identifying them. This could be because participants did not use the social information in the people pictures at either encoding or recognition, making it harder to correctly identify the picture. This coincides with the greater saccadic amplitudes on incorrectly identified people pictures, suggesting that they were not looking at the people, but rather at the wider scene. It is also worth considering an alternative explanation; that there were low-level saliency differences between people and no people pictures. For example, if some of the people pictures contained visually salient objects (e.g., in the background), then attention could be distracted from the higher level social information. On the other hand, it might then be expected that those salient objects would act as memory aids at recognition. 
For the incorrectly identified pictures that contained no people, lower scanpath similarities imply that participants did not always look in the same places at encoding and recognition, which could have impeded successful recognition. 
Conclusions
In conclusion, the presence of people in natural scenes increases recognition accuracy regardless of the size of the RoI, its distance from initial fixation, or the number of people in the picture. When people were present in the scene, increased saccadic amplitude was related to a reliable decrease in accuracy, possibly due to participants ignoring important social cues. Scanpath analyses showed a high similarity between encoding and recognition, supporting previous findings on the replicability of eye movements over repeated presentations; however, the lack of relationship between string similarity and accuracy challenges the idea that the reproduction of eye movements alone is enough to create a memory advantage. 
Acknowledgments
We are grateful to the UK Engineering and Physical Sciences Research Council (EPSRC) for support (award EP/E006329/1). We would also like to thank Tom Foulsham for use of his string-editing computer program. 
Commercial relationships: none. 
Corresponding author: Katherine Humphrey. 
Email: katherine.humphrey@ntu.ac.uk. 
Address: Division of Psychology, Chaucer Building, Nottingham Trent University, Burton Street, NG1 4BU, Nottingham, UK. 
References
Baron-Cohen, S. (1994). How to build a baby that can read minds: Cognitive mechanisms in mind reading. Cahiers de Psychologie Cognitive, 13, 513–552.
Bean, K. L. (1938). An experimental approach to the reading of music. Psychological Monographs, 50, 1–80.
Birmingham, E., Bischof, W. F., & Kingstone, A. (2008). Gaze selection in complex social scenes. Visual Cognition, 15, 341–355.
Brandt, S. A., & Stark, L. W. (1997). Spontaneous eye movements during visual imagery reflect the content of the visual scene. Journal of Cognitive Neuroscience, 9, 27–38.
Chase, W. G., & Erikson, K. A. (1982). Skill and working memory. In G. H. Bower (Ed.), The psychology of learning and motivation (Vol. 16, pp. 1–58). New York: Academic Press.
Choi, Y. S., Mosley, A. D., & Stark, L. W. (1995). String editing analysis of human visual search. Optometry and Vision Science, 72, 439–451.
Clifton, J. V. (1986). Cognitive components in music reading and sight reading performance. Unpublished doctoral dissertation, University of Waterloo.
De Valk, J. P. J., & Eijkman, E. G. J. (1984). Analysis of eye fixations during the diagnostic interpretation of chest radiographs. Medical & Biological Engineering & Computing, 22, 353–360.
Fletcher-Watson, S., Findlay, J. M., Leekam, S. R., & Benson, V. (2008). Rapid detection of person information in a naturalistic scene. Perception, 37, 571–583.
Foulsham, T., & Underwood, G. (2008). What can saliency models predict about eye movements? Spatial and sequential aspects of fixations during encoding and recognition. Journal of Vision, 8(2):6, 1–17, http://www.journalofvision.org/content/8/2/6, doi:10.1167/8.2.6.
Friesen, C. K., & Kingstone, A. (1998). The eyes have it! Reflexive orienting is triggered by non-predictive gaze. Psychonomic Bulletin & Review, 5, 490–495.
Gitelman, D. R. (2002). ILAB: A program for post experimental eye movement analysis. Behavior Research Methods, Instruments, & Computers, 34, 605–612.
Hacisalihzade, S. S., Allen, J. S., & Stark, L. (1992). Visual perception and sequences of eye movement fixations: A stochastic modelling approach. IEEE Transactions on Systems, Man, and Cybernetics, 22, 474–481.
Halpern, A. R., & Bower, G. H. (1982). Musical expertise and melodic structure in memory for musical notation. American Journal of Psychology, 95, 31–50.
Harding, G., & Bloj, M. (2010). Real and predicted influence of image manipulations on eye movements during scene recognition. Journal of Vision, 10(2):8, 1–17, http://www.journalofvision.org/content/10/2/8, doi:10.1167/10.2.8.
Henderson, J. M. (2003). Human gaze control during real-world scene perception. Trends in Cognitive Sciences, 7, 498–504.
Hood, B. M., Willen, J. D., & Driver, J. (1998). Adult's eyes trigger shifts of visual attention in human infants. Psychological Science, 9, 131–134.
Humphrey, K., & Underwood, G. (2009). Domain knowledge moderates the influence of visual saliency in scene recognition. British Journal of Psychology, 100, 377–398.
Langton, S. R. H., & Bruce, V. (1999). Reflexive visual orienting in response to the social attention of others. Visual Cognition, 6, 541–567.
Langton, S. R. H., O'Donnell, C., Riby, D. M., & Ballantyne, C. J. (2006). Gaze cues influence the allocation of attention in natural scene viewing. Quarterly Journal of Experimental Psychology, 59, 2056–2064.
Nodine, C. F., Mello-Thoms, C., Kundel, H. L., & Weinstein, S. P. (2002). Time course of perception and decision making during mammographic interpretation. American Journal of Roentgenology, 179, 917–923.
Noton, D., & Stark, L. (1971). Scanpaths in saccadic eye movements while viewing and recognizing patterns. Vision Research, 11, 929.
Ristic, J., Friesen, C. K., & Kingstone, A. (2002). Are eyes special? It depends on how you look at it. Psychonomic Bulletin & Review, 9, 507–513.
Ro, T., Russell, C., & Lavie, N. (2001). Changing faces: A detection advantage in the flicker paradigm. Psychological Science, 12, 94–99.
Salis, D. L. (1980). Laterality effects with visual perception of musical chords and dot patterns. Perception & Psychophysics, 28, 284–292.
Sloboda, J. A. (1976). Visual perception of musical notation: Registering pitch symbols in memory. Quarterly Journal of Experimental Psychology, 28, 1–16.
Sloboda, J. A. (1978). Perception of contour in music reading. Psychology of Music, 6, 3–20.
Stark, L., & Ellis, S. R. (1981). Scanpaths revisited: Cognitive models direct active looking. In D. F. Fisher, R. A. Monty, & J. W. Senders (Eds.), Eye movements: Cognition and visual perception (pp. 193–227). Hillsdale, NJ: Lawrence Erlbaum.
Thompson, W. B. (1987). Music sight-reading skill in flute players. Journal of General Psychology, 114, 345–352.
Underwood, G., Chapman, P., Berger, Z., & Crundall, D. (2003). Driving experience, attentional focusing, and the recall of recently inspected events. Transportation Research Part F, 6, 289–304.
Underwood, G., Foulsham, T., & Humphrey, K. (2009). Saliency and scan patterns in the inspection of real-world scenes: Eye movements during encoding and recognition. Visual Cognition, 17, 812–834.
Underwood, G., Humphrey, K., & Foulsham, T. (2008). Knowledge-based patterns of remembering: Eye movement scanpaths reflect domain experience. Lecture Notes in Computer Science, 5298, 125–144.
Voss, J. F., Vesonder, G. T., & Spilich, G. J. (1980). Text generation and recall by high-knowledge and low-knowledge individuals. Journal of Verbal Learning and Verbal Behavior, 19, 651–667.
Walker-Smith, G. J., Gale, A. G., & Findlay, J. M. (1977). Eye movement strategies involved in face perception. Perception, 6, 313–326.
Yarbus, A. L. (1967). Eye movements and vision. New York: Plenum Press.