**The Vanderbilt Expertise Test for cars (VETcar) is a test of visual learning for contemporary car models. We used item response theory to assess the VETcar and in particular used differential item functioning (DIF) analysis to ask if the test functions the same way in laboratory versus online settings and for different groups based on age and gender. An exploratory factor analysis found evidence of multidimensionality in the VETcar, although a single dimension was deemed sufficient to capture the recognition ability measured by the test. We selected a unidimensional three-parameter logistic item response model to examine item characteristics and subject abilities. The VETcar had satisfactory internal consistency. A substantial number of items showed DIF at a medium effect size for test setting and for age group, whereas gender DIF was negligible. Because online subjects were on average older than those tested in the lab, we focused on the age groups to conduct a multigroup item response theory analysis. This revealed that most items on the test favored the younger group. DIF could be more the rule than the exception when measuring performance with familiar object categories, therefore posing a challenge for the measurement of either domain-general visual abilities or category-specific knowledge.**

*SD* = 3.0), and we noted that age was not reported in several studies (see supplementary materials). Researchers may underestimate the importance of the age difference between typical student and online samples if they assume that a test functions in the same way whenever it is reliable in both types of samples. As a prerequisite for the use of IRT in DIF detection, we also assessed the dimensionality of the VETcar using exploratory factor analysis (EFA).

**Figure 1**


*n* = 38 items; profile, *n* = 10 items).

**Table 1**

*t*(1423) = 21.402, *p* < 0.001. When split by the median for both samples together (27), 402 subjects (87.2%) from the lab sample were younger than 27 years and 59 subjects (12.8%) were aged 27 years or older. In contrast, 295 subjects (30.6%) from the online sample were in the younger group and 669 subjects (69.4%) were in the older group. There were no missing item responses among the 1,425 subjects.
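The median split reported above can be re-tabulated as a quick arithmetic check. The sketch below uses only the counts given in the text; the variable and function names are illustrative and not part of the original analysis.

```python
# Counts reported in the text for the median split at age 27.
counts = {
    "lab": {"younger": 402, "older": 59},
    "online": {"younger": 295, "older": 669},
}


def row_percentages(row):
    """Return each cell as a percentage of its row total, rounded to 1 decimal."""
    total = sum(row.values())
    return {group: round(100 * n / total, 1) for group, n in row.items()}


# Percentage of each sample falling in each age group.
percentages = {sample: row_percentages(row) for sample, row in counts.items()}

# Size of each age group pooled over the lab and online samples.
group_totals = {
    group: sum(row[group] for row in counts.values())
    for group in ("younger", "older")
}
n_total = sum(group_totals.values())

print(percentages)
print(group_totals, n_total)
```

The pooled group sizes obtained this way (697 younger, 728 older) are the group sizes quoted later for the subject-fit results, and the total of 1,425 agrees with the 1,423 degrees of freedom of the reported *t* test.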

**Table 2**

**Figure 2**


**Table 3**

^{1} All items except Item 43 loaded significantly on the general factor. Specific factors were clustered mainly by car brand. In summary, the dimensionality analysis provides evidence of multidimensionality in the VETcar.

*χ*^{2}(49) = 575.94, *p* < 0.001. Information criteria supported this result: the Akaike information criterion (AIC = 79433.51) and Bayesian information criterion (BIC = 80191.23) for the 3PL unidimensional model were lower than those of the 2PL unidimensional model (AIC = 79911.45; BIC = 80411.34).
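The reported likelihood ratio statistic and information criteria can be cross-checked against each other. The sketch below assumes the usual definitions AIC = -2 log L + 2k and BIC = -2 log L + k ln N, that the 49 degrees of freedom equal the number of extra parameters in the 3PL model, and that N is the 1,425 subjects. These are assumptions about how the values were computed, not something the text states.

```python
import math

# Values reported in the text.
lrt_statistic = 575.94   # chi-square for 3PL versus 2PL
extra_params = 49        # degrees of freedom of that test (assumed = extra parameters)
n_subjects = 1425        # assumed sample size entering the BIC penalty

aic = {"2PL": 79911.45, "3PL": 79433.51}
bic = {"2PL": 80411.34, "3PL": 80191.23}

# For nested models, the drop in each criterion equals the LRT statistic
# minus the extra penalty paid for the additional parameters.
expected_aic_gain = lrt_statistic - 2 * extra_params
expected_bic_gain = lrt_statistic - extra_params * math.log(n_subjects)

reported_aic_gain = aic["2PL"] - aic["3PL"]
reported_bic_gain = bic["2PL"] - bic["3PL"]

print(round(expected_aic_gain, 2), round(reported_aic_gain, 2))
print(round(expected_bic_gain, 2), round(reported_bic_gain, 2))
```

Under these assumptions the two routes agree to rounding, so the likelihood ratio test and both information criteria point to the 3PL model for the same reason: the gain in fit outweighs the penalty for the extra parameters.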

^{2}

**Figure 3**


**Table 4**

*z* statistic (Raju, 1990), and the LRT method (Thissen, Steinberg, & Wainer, 1988). Lord's test and Raju's test were run using the difR package in R (Magis, Beland, & Raiche, 2013). The LRT was computed using IRTLRDIF Version 2.0 (Thissen, 2001). A 5% significance level was used for all three methods: a critical value of 7.815 from the chi-square distribution with *df* = 3 for Lord's chi-square statistic, a critical value of 1.96 for Raju's *z* statistic, and a critical value of 3.85 from the chi-square distribution with *df* = 1 for the LRT. We treated an item as a DIF item if it was significant under any of the three methods.
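As a hedged illustration of where these 5% critical values come from, the sketch below evaluates the relevant distributions using only the Python standard library. The closed-form chi-square CDFs for 1 and 3 degrees of freedom are standard results; the function names are illustrative. The exact 95th percentile of the chi-square distribution with *df* = 1 is about 3.84, so the reported cutoff of 3.85 is marginally more conservative.

```python
import math
from statistics import NormalDist


def chi2_cdf_df1(x):
    """CDF of the chi-square distribution with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2))


def chi2_cdf_df3(x):
    """CDF of the chi-square distribution with 3 degrees of freedom."""
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Two-sided 5% critical value for a z statistic (Raju's test).
z_crit = NormalDist().inv_cdf(0.975)

# A chi-square variable with df = 1 is a squared standard normal,
# so its 5% critical value is the square of the two-sided z cutoff.
chi2_crit_df1 = z_crit ** 2

# Probability mass below the cutoffs quoted in the text.
coverage_lord = chi2_cdf_df3(7.815)  # Lord's chi-square, df = 3
coverage_lrt = chi2_cdf_df1(3.85)    # LRT, df = 1

print(round(z_crit, 2), round(chi2_crit_df1, 3))
print(round(coverage_lord, 4), round(coverage_lrt, 4))
```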

*d*. Therefore, this metric can be interpreted using the guidelines of Cohen (1988). STDS, UTDS, and ETSSD were calculated using the Visual DIF software (Meade, 2010). Below, the results of the three IRT DIF detection methods and the DIF effect sizes are presented for each subgroup.

**Table 5**

*anchor items* in the DIF literature). In the current analysis, we constrained the item parameters of the anchor items to be equal across the younger and older groups. Based on the DIF results in Step 2, Items 38 and 48, for which all three IRT detection methods yielded no DIF between the age groups, were used as the anchor items. Item parameters for all DIF items (from the Step 2 analyses) remained unconstrained. Consequently, the item parameter estimates of the anchor items were identical across the two groups, whereas those of the 40 DIF items differed. In the presence of the anchor items, the multigroup item response model can be identified by setting a standard normal distribution for the latent variable in the reference group. The mean and variance of the latent variable for the focal group can then be estimated.
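To make the role of anchor items concrete, the following minimal sketch evaluates a 3PL item response function for an anchor item (parameters shared across groups) and for a DIF item (parameters free to differ). All parameter values are hypothetical and chosen only for illustration, and the logistic form without a scaling constant is an assumption about the parameterization.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct response.

    theta: latent ability; a: discrimination; b: difficulty; c: guessing.
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))


# Hypothetical (a, b, c) values per age group, not estimates from the study.
anchor_item = {"younger": (1.2, 0.0, 0.33), "older": (1.2, 0.0, 0.33)}
dif_item = {"younger": (1.2, -0.3, 0.33), "older": (1.2, 0.4, 0.33)}

theta = 0.0  # two subjects with the same ability, one from each group

# Anchor item: identical parameters, so equal ability implies equal probability.
anchor_gap = (
    p_correct_3pl(theta, *anchor_item["younger"])
    - p_correct_3pl(theta, *anchor_item["older"])
)

# DIF item: here it is easier for the younger group at the same ability.
dif_gap = (
    p_correct_3pl(theta, *dif_item["younger"])
    - p_correct_3pl(theta, *dif_item["older"])
)

print(anchor_gap, round(dif_gap, 3))
```

Because the anchor items behave identically in both groups, they place the two groups' ability scales on a common metric. Any remaining gap on a non-anchor item at equal ability is then attributed to DIF rather than to a difference in ability.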

**Table 6**

**Figure 4**


*t* test (with Welch's formula for unequal variance; alpha = 0.05) was implemented to test whether the mean of item parameter estimates without a specific item attribute (*no*; blank cells in Table 1) was higher than the mean with a specific item attribute (*yes*; “Y” in Table 1; H0: difference = mean *no* minus mean *yes*; H1: difference > 0). Table 7 presents the results of the two-sample *t* test for each item parameter by age groups. The following was observed. First, the mean of item discrimination and difficulty estimates for transfer items was significantly higher than that of identical items in the younger group. Second, for the older group, the means of all item parameter estimates were higher for transfer items than for identical items. Third, the mean of item guessing parameter estimates was higher for the “different” items than for the “same” items in both groups. Fourth, the mean of item difficulty estimates for “nonfront” items was significantly higher than that for “front” items in both groups. The front view usually includes distinctive features such as an emblem and the shape of headlights.
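The Welch test described above can be sketched as follows. The function computes the *t* statistic and the Welch-Satterthwaite degrees of freedom for the difference mean *no* minus mean *yes*. The two samples are made-up item difficulty estimates used only to exercise the code, and the one-sided *p* value would still have to be read from a *t* distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, variance


def welch_t(sample_no, sample_yes):
    """Welch's t statistic and degrees of freedom for mean(no) - mean(yes)."""
    n_no, n_yes = len(sample_no), len(sample_yes)
    # Squared standard error contributed by each group (unequal variances allowed).
    se2_no = variance(sample_no) / n_no
    se2_yes = variance(sample_yes) / n_yes
    t_stat = (mean(sample_no) - mean(sample_yes)) / math.sqrt(se2_no + se2_yes)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_no + se2_yes) ** 2 / (
        se2_no ** 2 / (n_no - 1) + se2_yes ** 2 / (n_yes - 1)
    )
    return t_stat, df


# Hypothetical difficulty estimates for items without and with an attribute.
difficulty_no = [0.9, 1.4, 1.1, 1.6, 1.0]
difficulty_yes = [0.2, 0.5, 0.1, 0.4]

t_stat, df = welch_t(difficulty_no, difficulty_yes)
print(t_stat, df)
```

A positive `t_stat` supports H1 (mean *no* greater than mean *yes*) in the one-sided formulation used in the text.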

**Table 7**

**Figure 5**


*p* values were smaller than 0.025 or larger than 0.975 extreme values. WinBUGS 1.4.3 was used to calculate the posterior predictive *p* value. The multigroup item response model yielded limited evidence of subject misfit: One subject out of 697 in the younger group had a posterior predictive *p* value larger than 0.975, and six subjects out of 728 in the older group had predictive *p* values that were smaller than 0.025 or larger than 0.975. There was one item with misfit (Item 28) for the younger group and two items with misfit (Items 20 and 45) for the older group. These subject and item fit results indicate that the (unidimensional 3PL) multigroup item response model fits the data well.
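The flagging rule for posterior predictive *p* values can be written down in a few lines. This is a generic sketch of the rule, not the WinBUGS code used in the study: the discrepancy statistic is left abstract, and the function names are illustrative. Only the counts of flagged subjects come from the text.

```python
def posterior_predictive_p(observed_stats, replicated_stats):
    """Share of posterior draws whose replicated discrepancy is at least the observed one."""
    pairs = list(zip(observed_stats, replicated_stats))
    return sum(rep >= obs for obs, rep in pairs) / len(pairs)


def is_misfit(ppp, lower=0.025, upper=0.975):
    """Flag a subject or item whose posterior predictive p value is extreme."""
    return ppp < lower or ppp > upper


# Flagged subjects and group sizes reported in the text.
flagged_subjects = {"younger": (1, 697), "older": (6, 728)}
misfit_share = {
    group: flagged / total for group, (flagged, total) in flagged_subjects.items()
}

print({group: round(100 * share, 2) for group, share in misfit_share.items()})
```

In both age groups fewer than 1% of subjects are flagged, which is the basis for the conclusion that the model shows limited evidence of subject misfit.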

*between* age groups, although they can be used to compare subjects *within* an age group (including men and women).

**Figure 6**


*Differential item functioning* (pp. 3–23). Hillsdale, NJ: Erlbaum.

*The basics of item response theory*. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation.

*Psychological Bulletin*, 107 (2), 238–246.

*Handbook of modern item response theory* (pp. 433–448). New York, NY: Springer-Verlag.

*Attention, Perception, & Psychophysics*, 72 (7), 1865–1874.

*Psychological Assessment*, 27, 552–566.

*Statistical power analysis for the behavioral sciences* (2nd ed.). Hillsdale, NJ: Erlbaum.

*PloS One*, 8 (3), e57410.

*Behavior Research Methods*, 44 (2), 587–605, doi:10.3758/s13428-011-0160-2.

*Neuropsychologia*, 44 (4), 576–585.

*Journal of Vision*, 14 (8): 7, 1–12, doi:10.1167/14.8.7.

*Cognition*, 118 (2), 201–210.

*Psychonomic Bulletin and Review*, 19 (5), 847–857.

*Applied Psychological Measurement*, 27, 217–233.

*Journal of Educational Measurement*, 21, 347–360.

*Structural Equation Modeling*, 6 (1), 1–55.

*Psychometrika*, 76 (4), 537–549.

*Psychometrika*, 77, 442–454.

*Journal of Educational Measurement*, 44 (2), 93–116.

*Applications of item response theory to practical testing problems*. Hillsdale, NJ: Erlbaum.

*difR: Collection of methods to detect dichotomous differential item functioning (DIF) in psychometrics* (R Package version 4.5) [Computer software]. Retrieved from https://cran.r-project.org/web/packages/difR

*Neuropsychologia*, 63, 135–144.

*Vision Research*, 69, 10–22.

*Cerebral Cortex*, 25, 2610–2622.

*Journal of Applied Psychology*, 95 (4), 728–743.

*Mplus [Computer program]*. Los Angeles, CA: Muthén & Muthén.

*Current Directions in Psychological Science*, 23 (3), 184–188.

*Behavior Research Methods*, 46 (4), 1023–1031.

*Applied Psychological Measurement*, 14, 197–207.

*Journal of Personality Assessment*, 92, 544–559.

*Psychological Methods*, 17 (4), 551–566.

*Journal of Educational Measurement*, 42 (4), 375–394.

*BUGS: Bayesian inference using Gibbs sampling, version .05*. Cambridge, UK: MRC Biostatistics Unit.

*WinBUGS user manual*. Cambridge, UK: MRC Biostatistics Unit, Institute of Public Health.

*Perspectives on Psychological Science*, 9 (3), 305–318.

*Statistically based tests for the number of common factors*. Paper presented at the Annual Spring Meeting of the Psychometric Society, Iowa City, IA.

*IRTLRDIF (Version 2.0b) [Computer software]*. Chapel Hill, NC: L. L. Thurstone Psychometric Laboratory.

*Test validity* (pp. 147–149). Hillsdale, NJ: Erlbaum.

*Psychometrika*, 38, 1–10.

*Journal of Personality and Social Psychology*, 99 (3), 530–548.

*Cognitive Neuropsychology*, 29 (5–6), 360–392.

*Proceedings of the National Academy of Sciences, USA*, 107 (11), 5238–5241.

*Evaluating cutoff criteria of model fit indices for latent variable models with binary and continuous outcomes* (Unpublished doctoral dissertation). University of California, Los Angeles.

^{2} The following priors on item parameters were imposed in WinBUGS: a lognormal distribution with a mean of 0 and a variance of 0.25 for item discriminations (the default prior in the BILOG-MG IRT software; Zimowski et al., 1996), a standard normal distribution for item difficulties, and a beta distribution with a mean of 0.33 and a variance of 0.06 for item guessing parameters.