September 2015
Volume 15, Issue 13
Methods  |   September 2015
Differential item functioning analysis of the Vanderbilt Expertise Test for cars
Journal of Vision September 2015, Vol.15, 23. doi:10.1167/15.13.23
Citation: Woo-Yeol Lee, Sun-Joo Cho, Rankin W. McGugin, Ana Beth Van Gulick, Isabel Gauthier; Differential item functioning analysis of the Vanderbilt Expertise Test for cars. Journal of Vision 2015;15(13):23. doi: 10.1167/15.13.23.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

The Vanderbilt Expertise Test for cars (VETcar) is a test of visual learning for contemporary car models. We used item response theory to assess the VETcar and in particular used differential item functioning (DIF) analysis to ask if the test functions the same way in laboratory versus online settings and for different groups based on age and gender. An exploratory factor analysis found evidence of multidimensionality in the VETcar, although a single dimension was deemed sufficient to capture the recognition ability measured by the test. We selected a unidimensional three-parameter logistic item response model to examine item characteristics and subject abilities. The VETcar had satisfactory internal consistency. A substantial number of items showed DIF at a medium effect size for test setting and for age group, whereas gender DIF was negligible. Because online subjects were on average older than those tested in the lab, we focused on the age groups to conduct a multigroup item response theory analysis. This revealed that most items on the test favored the younger group. DIF could be more the rule than the exception when measuring performance with familiar object categories, therefore posing a challenge for the measurement of either domain-general visual abilities or category-specific knowledge.

Introduction
The Vanderbilt Expertise Test (VET) for cars (VETcar) is part of a test battery designed to measure performance in learning to visually recognize objects from different object categories (Gauthier et al., 2014; McGugin et al., 2012). In the original battery (VET 1.0), several categories were included because performance with each category is likely to be influenced by domain-specific experience, whereas the common variance for several categories (or a latent dimension) may better tap into a domain-general visual ability. However, the VETcar has also been used on its own in a number of studies as a domain-specific measure of performance for the recognition of cars (McGugin, Newton, Gore, & Gauthier, 2014; McGugin, Van Gulick, Tamber-Rosenau, Ross, & Gauthier, 2014). By itself, or aggregated with other car tasks, VETcar scores have been used to test hypotheses about perceptual expertise, including the prediction that behavioral and neural hallmarks of face recognition can be obtained with cars, as a function of behavioral performance with cars (e.g., McGugin, Newton, et al., 2014; McGugin, Richler, Herzmann, Speegle, & Gauthier, 2012; McGugin, Van Gulick et al., 2014). 
In the present work we evaluate the VETcar using item response theory (IRT). Even with increasing interest in individual differences in face and object recognition (McGugin et al., 2012; Wilhelm et al., 2010; Wilmer et al., 2010), studies aiming at evaluating and optimizing relevant measures are still quite rare (but see Cho et al., 2015; Wilmer et al., 2012). IRT offers a powerful framework for understanding and optimizing tests. Whereas classical psychometric approaches limit themselves to summary statistics such as the total score on the test, IRT uses information about performance on individual test items, which are then characterized in terms of separate item parameters for (pseudo-) guessing, discriminability, and difficulty. Within a two-parameter IRT measurement model with item discrimination and difficulty, for example, two subjects with the same total score on a test may not have the same ability level if one subject missed items that were much more discriminating than those missed by the other subject. 
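The point about equal total scores can be illustrated with a small sketch. It assumes a two-parameter logistic model without a scaling constant, four invented items (the `items` list is hypothetical and not taken from the VETcar), and a simple grid search in place of a full estimation routine.

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL probability of a correct response (logistic metric, no 1.7 scaling)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ml_ability(responses, items, grid=None):
    """Grid-search maximum-likelihood ability estimate for one response pattern."""
    if grid is None:
        grid = [g / 100.0 for g in range(-400, 401)]

    def loglik(theta):
        total = 0.0
        for x, (a, b) in zip(responses, items):
            p = p_correct_2pl(theta, a, b)
            total += math.log(p) if x == 1 else math.log(1.0 - p)
        return total

    return max(grid, key=loglik)

# Hypothetical items as (discrimination a, difficulty b); not VETcar estimates.
items = [(0.5, 0.0), (0.5, 0.0), (2.5, 0.0), (2.5, 0.0)]

subject_1 = [1, 1, 0, 0]  # missed the two highly discriminating items
subject_2 = [0, 0, 1, 1]  # missed the two weakly discriminating items

theta_1 = ml_ability(subject_1, items)
theta_2 = ml_ability(subject_2, items)
print(sum(subject_1), sum(subject_2), theta_1, theta_2)
```

Both subjects answer two of four items correctly, yet the subject who missed the more discriminating items receives the lower ability estimate.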
IRT provides an analytic procedure to ask whether a test functions the same way in different settings or for different populations. Consider age effects: If young and old groups produce similar distributions of total scores on a test, standard practice based on total scores would be to assume the test measures the same underlying construct. But within an IRT framework, even when total scores are comparable, differential item functioning (DIF) analysis can assess whether individual items function the same way in the two groups. Imagine a test item that only young subjects with the highest ability level can solve but that the majority of old subjects get correct, even at more moderate ability levels. If a test includes a substantial number of such items, it may not provide a fair method for comparing the ability of the young and old groups. It would be, essentially, as if they had taken different tests. 
In a recent study, Cho et al. (2015) conducted a DIF analysis of the Cambridge Face Memory Test (CFMT), one of the most commonly used tests for measuring ability in face recognition (Duchaine & Nakayama, 2006), and found little evidence of differential functioning of the test in men and women, in younger and older subjects, and in online testing compared with testing in the laboratory. It may be tempting to generalize these results to other high-level visual recognition tests simply because the format of such tests is similar to that of the CFMT. On each subtest of the VET battery (McGugin et al., 2012), just as in the CFMT, subjects first study images of six targets (e.g., six faces on the CFMT, six cars on the VETcar) and later have to recognize the same image, or a different image of the same target, in a series of three-alternative forced-choice trials. It may seem reasonable to expect that if the CFMT functions the same way in the lab and online, then a test such as the VETcar should also measure the same ability in both settings. This is an important assumption for cognitive psychologists at a time when online testing is rapidly growing in popularity (Paolacci & Chandler, 2014; Peer, Vosgerau, & Acquisti, 2014). While many studies have shown that effects of classical experiments can be replicated online (Crump, McDonnell, & Gureckis, 2013; Germine et al., 2012), Cho et al. (2015) is the only case we know of that uses IRT to ask whether the same measurement model can be applied in both settings. 
One reason why some tests may not function the same in an online setting as in a lab setting could be differences in the samples that are typically recruited in the two settings. For instance, most cognitive studies performed in the laboratory use samples of undergraduate students, whereas the volunteers on one of the most popular online platforms, Amazon Mechanical Turk, are more variable in age and, on average, much older. Sample differences can matter more for some tests than others. For instance, while face recognition ability may change with age (Germine, Duchaine, & Nakayama, 2011), finding no evidence of DIF driven by age suggests that it might not change in qualitative ways. A poor performer in an age group that performs better on average may produce test responses that are similar to a good performer in an age group that performs worse on average. Such equivalences may be violated for other categories. In the VETcar, each target and foil is by necessity a model that experienced popularity during a certain period. Individuals in different age groups may have different levels of familiarity with cars of different periods, and this could change the similarity relationships between different cars in the test. If so, a test such as the VETcar may present with substantial evidence of DIF across age groups. 
In this study, we evaluated whether the VETcar functions in the same way when used with different testing platforms (i.e., lab vs. online) that differed in age (we also considered gender groups) using IRT DIF detection methods. Importantly, the testing platform and age differences are to some extent confounded, and we wanted to compare the sorts of samples that are often used (rather than to match the age of online and lab samples to compare the methods of data collection). In a survey of 22 recent studies using online data collection, we found the mean age of subjects to be 33.6 years (SD = 3.0), and we noted that age was not reported in several studies (see supplementary materials). The importance of the difference in age between the typical student and online sample may be underestimated by researchers, who may assume that a test functions in the same way if it is reliable in both types of samples. As a prerequisite for the use of IRT in DIF detection, we also assessed the dimensionality of VETcar using exploratory factor analyses (EFA). 
Method
Description of the data
Measures
The VETcar was designed as a test of visual car recognition ability. Details about the test procedures are provided in McGugin et al. (2012). Initially, six target cars are presented for study with unlimited viewing time. All the cars used in this task are sedan or coupe styles, shown in a three-quarter view and in grayscale (see Figure 1). Forty-eight three-alternative forced-choice test items follow, each consisting of one target and two foils. The target in the first 12 items is identical to one of the six examples in the study panel. After these trials, in which subjects receive corrective feedback, the study panel is presented again for study, and the instructions specify that in the next items the correct answers will be different images of the same six cars (transfer items). The targets in the transfer items are the same models as in the learning phase but differ in direction, color, and model year. For several items a background environment may also be visible, which would also differ. (Care was taken to ensure that backgrounds were not diagnostic.) 
Figure 1
 
The six car target images.
Table 1 lists the stimuli and item attributes. The list of stimuli contains the brand, model, model year, and three other attributes of the cars or images. First, items were either identical to the studied images (12 items) or transfer items (36 items). A second attribute was the similarity between the target and foils. We marked items with this attribute (20 items) if at least one of the foils was the same brand as one of the other targets in the learning phase (e.g., Honda Civic vs. Honda Accord). On these items, therefore, the subjects needed to generalize from one example of the target to new examples and to discriminate the target from highly visually similar cars. A third attribute was the orientation of the car in the target image (three quarter, n = 38 items; profile, n = 10 items). 
Table 1
 
Items and item attributes. Notes: “Y” indicates yes; blank cell indicates no.
In addition to the test items, three catch items were included in the test phase (in positions 17, 33, and 47 on the test). The foils in the catch items were antique cars or jeeps that are highly distinctive from all targets. Catch items were used to confirm that the subjects paid attention to the task. The data from subjects who missed at least one catch item were removed from our analysis. Correct responses were coded as 1, and incorrect responses were coded as 0. Because a three-alternative forced-choice task was used, the probability of answering an item correctly by guessing was one in three. 
Subjects
The data were collected from two sources: a laboratory setting and an online setting. The laboratory sample consisted of 461 subjects, and the online sample consisted of 964 subjects. The subjects in the online sample were recruited via Amazon Mechanical Turk. Demographic information and descriptive statistics of the two samples are presented in Table 2. Figure 2 shows the age distribution of each sample group. As expected, subjects in the lab sample were younger than those in the online sample (mean age = 22.19 vs. 33.68 years), t(1423) = 21.402, p < 0.001. When split by the median for both samples together (27), 402 subjects (87.2%) from the lab sample were younger than 27 years and 59 subjects (12.8%) were aged 27 years or older. In contrast, 295 subjects (30.6%) from the online sample were in the younger group and 669 subjects (69.4%) were in the older group. There were no missing item responses for any of the 1,425 subjects. 
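As a quick arithmetic check of the median split, the counts reported above can be recombined into the sample sizes, the percentages, and the sizes of the two age groups. The variable names are illustrative only.

```python
# Counts reported in the text for the median split at 27 years.
lab = {"younger": 402, "older": 59}
online = {"younger": 295, "older": 669}

def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

lab_n = sum(lab.values())        # size of the lab sample
online_n = sum(online.values())  # size of the online sample

younger_n = lab["younger"] + online["younger"]  # younger group across samples
older_n = lab["older"] + online["older"]        # older group across samples

print(lab_n, online_n, younger_n, older_n)
print(percent(lab["younger"], lab_n), percent(online["younger"], online_n))
```

The younger group is drawn mostly from the lab sample and the older group mostly from the online sample, which is the overlap between test setting and age that the later DIF analyses have to contend with.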
Table 2
 
Descriptive statistics of samples for two studies (N = 1425).
Figure 2
 
Age distribution of each sample group.
Analysis outline
Before proceeding with DIF analyses, it is important to verify certain assumptions (i.e., dimensionality and local independence) for an item response model. Thus, we first explored dimensionality (e.g., whether the data were best described by multiple general dimensions or by one general dimension plus a few specific dimensions) in Step 0. Based on the results of Step 0, an item response model was selected in Step 1. In Step 1, we investigated whether it is necessary to consider multidimensionality (when multidimensionality is found in Step 0; Step 1a) and to add item guessing parameters to the item response model to describe the data adequately (Step 1b). Once an item response model was established in Step 1, IRT DIF analyses were implemented in Step 2 analyses using the selected item response model for the sample (lab vs. online), gender (male vs. female) group, and age group. If there was no concern about DIF, IRT analyses from Step 1 served to provide item and subject score information of VETcar. However, if DIF could not be ignored for any groups we considered, multigroup IRT (Bock & Zimowski, 1997) analyses were implemented to obtain separate item and subject information for each group in Step 3. 
Results
Step 0: Dimensionality of VETcar
For IRT analyses, the dimensionality of the VETcar was investigated using eigenvalues of the sample's tetrachoric correlation matrix and an EFA. The eigenvalues for the first 12 factors were 8.601, 3.536, 2.336, 2.124, 1.913, 1.732, 1.655, 1.419, 1.354, 1.205, 1.147, and 1.118. The ratio of the first to second eigenvalue was 2.432, providing evidence for one dominant factor (e.g., Reise, Moore, & Haviland, 2010). Thus, a bifactor EFA (Jennrich & Bentler, 2011, 2012) was chosen for EFA with more than two factors instead of a regular EFA. The bifactor model consists of multiple dimensions including one general factor and specific factors. The general factor explains the common variance of all items, and the specific factors (orthogonal to the general factor) explain the variance that the general factor did not account for within the subset of the items. Mplus Version 7.11 (Muthén & Muthén, 1998) was used to fit the (bifactor) EFA using tetrachoric correlation (specifically, weighted least square with adjusted means and variance with BI-GEOMIN rotation for the bifactor EFA). 
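The eigenvalue ratio used as evidence for one dominant factor can be reproduced directly from the twelve eigenvalues listed above. This is only an arithmetic sketch; it does not recompute the tetrachoric correlation matrix.

```python
# First 12 eigenvalues of the tetrachoric correlation matrix, as reported in the text.
eigenvalues = [8.601, 3.536, 2.336, 2.124, 1.913, 1.732,
               1.655, 1.419, 1.354, 1.205, 1.147, 1.118]

# Ratio of the first to the second eigenvalue.
ratio = eigenvalues[0] / eigenvalues[1]

# Number of listed eigenvalues that exceed 1.
n_above_one = sum(1 for value in eigenvalues if value > 1.0)

print(round(ratio, 3), n_above_one)
```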
Four fit indices were compared across (bifactor) EFA models with different numbers of factors. Specifically, a model fits well if the root-mean-square error of approximation index (Steiger & Lind, 1980) is less than 0.06, the root-mean-square residual is less than 0.08, and the comparative fit index (Bentler, 1990) and Tucker-Lewis index (Tucker & Lewis, 1973) are larger than 0.95 (Hu & Bentler, 1999; Yu, 2002). According to fit indices presented in Table 3 and theoretical interpretation (one expected car recognition ability and the fact that each trial is associated with one of six targets), a seven-factor solution (one general factor and six specific factors) was selected. 
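The four cutoffs can be written as a simple screening function. The sketch below is illustrative: the function name and the example values are hypothetical and are not taken from Table 3.

```python
def meets_cutoffs(rmsea, rmsr, cfi, tli):
    """Check one EFA solution against the four cutoffs stated in the text."""
    checks = {
        "RMSEA < 0.06": rmsea < 0.06,
        "RMSR < 0.08": rmsr < 0.08,
        "CFI > 0.95": cfi > 0.95,
        "TLI > 0.95": tli > 0.95,
    }
    return all(checks.values()), checks

# Hypothetical fit indices for two candidate solutions (not values from Table 3).
ok_good, detail_good = meets_cutoffs(rmsea=0.03, rmsr=0.05, cfi=0.97, tli=0.96)
ok_poor, detail_poor = meets_cutoffs(rmsea=0.08, rmsr=0.09, cfi=0.90, tli=0.88)
print(ok_good, ok_poor)
```

In the analysis reported here, the fit indices were weighed together with theoretical interpretation, so a screen of this kind would be one input to model selection rather than the decision rule itself.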
Table 3
 
Results of fit indices. Notes: Values in parentheses are lower and upper limits of 95% confidence interval for root-mean-square error of approximation. Selected EFA solution and results of fit indices for the solution were in bold.
We investigated how each item loaded on the extracted seven factors, based on the BI-GEOMIN rotated standardized loading.1 All items except Item 43 loaded significantly on the general factor. Specific factors were clustered mainly by car brand. In summary, the dimensionality analysis suggests evidence of multidimensionality in the VETcar. 
Step 1: IRT analyses
Step 1a: Comparisons between unidimensional and bifactor (multidimensional) item response models
Because item response variance was accounted for by one general factor and several specific factors from Step 0, a bifactor item response model was selected. The bifactor item response model consists of one general dimension and specific dimensions. Item discriminations for the general dimension can be considered as discriminations for the purified dimension controlling for specific dimensions. Item discriminations from a general dimension extracted from an exploratory two-parameter bifactor model (or multidimensional item response model) and those from a unidimensional two-parameter item response model were highly correlated (0.939). This suggests that the strength of the relation between an item and a (unidimensional) construct is similar between the two models. Further, the correlation coefficient between the IRT scale scores was 0.970 for the two models, which indicates that relative ordering of subjects on the latent continuum did not substantially differ between models. Therefore, because the general dimension was not distorted by multidimensionality, we conclude that a unidimensional model is sufficient to capture car recognition ability on the VETcar. 
Step 1b: Comparison between unidimensional two-parameter and three-parameter item response models
Unidimensional item response models were considered because one dominant dimension was sufficient to explain item variances in Step 1a. In the present study, a two-parameter logistic (2PL) item response model was compared with a three-parameter logistic (3PL) item response model. The 2PL model has item characteristics with an item discrimination parameter and an item location parameter. An item discrimination parameter represents an item's ability to distinguish high-ability subjects from low-ability subjects. An item location parameter represents item difficulty. The 3PL model has an item guessing parameter in addition to the item discrimination and location parameters. An item guessing parameter represents the probability that a subject responds to an item correctly without any knowledge. Because the task used in VETcar is a three-alternative forced-choice task, we have reason to consider the item guessing parameter in the model. The unidimensional 3PL model fit better than the unidimensional 2PL model according to the likelihood ratio test (LRT), χ2(49) = 575.94, p < 0.001. Information criteria supported this result, as the Akaike's information criterion (AIC) (79433.51) and Bayesian information criterion (BIC) (80191.23) for the 3PL unidimensional model were less than those of the 2PL unidimensional model (AIC = 79911.45; BIC = 80411.34). 
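The sketch below writes out the three-parameter response function described above and checks that the reported information criteria are arithmetically consistent with the reported likelihood ratio statistic. It assumes the logistic metric without a 1.7 scaling constant and the usual definitions AIC = deviance + 2k and BIC = deviance + k ln(N), with N = 1,425 subjects; these definitions are assumptions about how the reported values were computed.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct response; c = 0 gives the 2PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Values reported in the text.
n_subjects = 1425
lrt, df = 575.94, 49
aic_2pl, bic_2pl = 79911.45, 80411.34
aic_3pl, bic_3pl = 79433.51, 80191.23

# Observed differences in the information criteria (2PL minus 3PL).
delta_aic = aic_2pl - aic_3pl
delta_bic = bic_2pl - bic_3pl

# Differences implied by the LRT statistic and its degrees of freedom.
expected_delta_aic = lrt - 2 * df
expected_delta_bic = lrt - df * math.log(n_subjects)

print(round(delta_aic, 2), round(expected_delta_aic, 2))
print(round(delta_bic, 2), round(expected_delta_bic, 2))
```

Under these assumptions the two sets of differences agree to rounding, so the LRT, AIC, and BIC all describe the same improvement in fit.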
Based on results of the unidimensional 3PL model, we examined item characteristic curves and actual subject locations. We found that Item 43 had a negative item discrimination parameter (also found in the bifactor EFA results in Step 0 analysis), revealing that subjects with lower ability had a higher probability of getting the item correct than subjects with higher ability. To illustrate the problematic item, Figure 3 shows the item characteristic curve of Item 43 and the observed subject's responses. Thus, Item 43 was excluded from subsequent analyses. Because of high item discrimination estimates, it was difficult to calculate standard errors of item parameter estimates using a marginal maximum likelihood estimation in the irtoys package (Partchev, 2014). Therefore, we re-estimated item parameters of the unidimensional 3PL model using empirical Bayes analysis in WinBUGS 1.4.3 (Spiegelhalter, Thomas, Best, & Gilks, 2006).2 
Figure 3
 
Item characteristic curve (line) and observed subject's responses (dots) of Item 43.
Item characteristics
The item parameter estimates of the unidimensional 3PL model are presented in Table 4. There was a large variation in item discrimination estimates (range from 0.539 to 5.037) and two items (Items 23 and 28) were below 0.64, indicating low item discriminations according to a guideline by Baker (2001). Item difficulty parameter estimates covered a large range of the ability levels (ranging from −1.727 to 2.278). There was considerable variability in item guessing parameter estimates (ranging from 0.089 to 0.510). Ten items had higher item guessing parameter estimates than expected by chance (0.333). 
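The two screening rules used in this paragraph (discrimination below 0.64, guessing above the chance level of one in three) can be sketched as a small flagging function. The item labels and the pairing of discrimination and guessing values below are invented for illustration; only the thresholds come from the text.

```python
CHANCE = 1.0 / 3.0          # chance level for a three-alternative forced choice
LOW_DISCRIMINATION = 0.64   # threshold used in the text, following Baker (2001)

def flag_item(a, c):
    """Return the screening flags for one item's discrimination a and guessing c."""
    flags = []
    if a < 0:
        flags.append("negative discrimination")
    elif a < LOW_DISCRIMINATION:
        flags.append("low discrimination")
    if c > CHANCE:
        flags.append("guessing above chance")
    return flags

# Hypothetical (a, c) pairs; the pairings are not taken from Table 4.
examples = {"item_A": (0.539, 0.200), "item_B": (5.037, 0.510), "item_C": (1.200, 0.089)}
for name, (a, c) in examples.items():
    print(name, flag_item(a, c))
```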
Table 4
 
Item parameter estimates (standard errors) of a three-parameter unidimensional item response model with 47 items. Results were based on 47 items after Item 43 was excluded because of negative item discrimination estimates in initial analyses.
IRT scale scores and reliability
The internal consistency reliability based on IRT results (Green, Bock, Humphreys, Linn, & Reckase, 1984) was 0.815, which is considered satisfactory (>0.8 as a rule of thumb). The test information curve shows maximum information between 1 and 2 on the latent ability continuum; this means that IRT scale scores are the most accurate within this range. 
Step 2: IRT DIF analysis
Based on the best-fitting model in Step 1 (i.e., unidimensional 3PL), IRT DIF analysis was conducted to examine whether item response functions performed similarly or differently between samples (lab vs. online), gender (male vs. female) groups, and age groups. The data were split into two age groups at the median. The number of subjects in the younger group (<27 years) was 697, and the number of subjects in the older group (≥27 years) was 728. 
DIF detection methods are designed to match groups on the ability level measured by the items. If the matching criterion is biased to some degree, then the results of DIF detection methods can be flawed. Thus, it is important to have a reliable and valid matching criterion. Because VETcar is nearly unidimensional, the total scores on the test could be used. However, the total scores do not account for variability of item discriminations (over items) as IRT scale scores do. Thus, we chose to use the IRT scale scores instead of the total score as a matching criterion in the current study by using IRT DIF detection methods. 
Three kinds of IRT DIF detection methods were used for all subgroups: Lord's chi-square test (Lord, 1980), Raju's z statistics (Raju, 1990), and the LRT method (Thissen, Steinberg, & Wainer, 1988). Lord's test and Raju's test were done using the difR package in R (Magis, Beland, & Raiche, 2013). The LRT was computed using IRTLRDIF Version 2.0 (Thissen, 2001). A 5% significance level was used for all three methods: a critical value of 7.815 from the chi-square distribution with df = 3 for Lord's chi-square statistic, a critical value of 1.96 for Raju's z statistic, and a critical value of 3.84 from the chi-square distribution with df = 1 for the LRT. We considered an item to be a DIF item if it was significant for any of the three methods. 
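The 5% critical values can be recomputed with the Python standard library alone. The sketch uses closed-form chi-square distribution functions for df = 1 and df = 3 and a bisection search; the function names are illustrative. To three decimals the df = 1 quantile is 3.841.

```python
import math
from statistics import NormalDist

def chi2_cdf_df1(x):
    """Chi-square CDF with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2.0))

def chi2_cdf_df3(x):
    """Chi-square CDF with 3 degrees of freedom (closed form)."""
    return (math.erf(math.sqrt(x / 2.0))
            - math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

def quantile(cdf, prob, lo=0.0, hi=100.0, tol=1e-10):
    """Invert a monotone CDF by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cdf(mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

z_crit = NormalDist().inv_cdf(0.975)        # two-sided 5% test for Raju's z
chi2_df1 = quantile(chi2_cdf_df1, 0.95)     # LRT with df = 1
chi2_df3 = quantile(chi2_cdf_df3, 0.95)     # Lord's chi-square with df = 3

print(round(z_crit, 3), round(chi2_df1, 3), round(chi2_df3, 3))
```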
DIF testing based on the chi-square statistic is highly sensitive to sample size (e.g., Kim, Cohen, Alagoz, & Kim, 2007). When sample size is large, statistical significance can emerge even when DIF is actually quite small. DIF effect sizes can be investigated to alleviate this concern. We report two DIF effect size measures: the signed and unsigned test difference between the groups (STDS and UTDS, respectively; Meade, 2010). The interpretation of STDS and UTDS is the difference in expected total IRT scale scores across subjects in the focal group that is due to DIF. In DIF analyses, the focal group refers to the particular group of interest, whereas the reference group refers to the group with whom the focal group is to be compared (Angoff, 1993). In the current study, one group is arbitrarily chosen as a focal group and the other is chosen as a reference group. When there is DIF for both item discrimination and item difficulty (or nonuniform DIF), different subjects in the focal group present with larger or smaller effect sizes than subjects in the reference group with the same ability scores. As a result, DIF effect sizes as a whole could equal zero because of the averaging of positive and negative effect size differences (called cancellation). STDS allows cancellation of DIF across both items and subjects, whereas UTDS does not allow such cancellation across items or subjects. In addition to STDS and UTDS, we report the expected test score standardized difference (ETSSD; Meade, 2010). This statistic corresponds to Cohen's d and can therefore be interpreted using the guideline by Cohen (1988). The calculation of STDS, UTDS, and ETSSD was carried out using Visual DIF software (Meade, 2010). Below, results of three IRT DIF detection methods and DIF effect sizes are presented for each subgroup. 
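The cancellation behavior of the signed and unsigned measures can be sketched as follows. This is one reading of the definitions in Meade (2010), not the Visual DIF implementation: expected item scores are computed for each focal-group subject under the focal and the reference item parameters, signed or absolute differences are averaged over subjects and summed over items, and the ETSSD divides the mean expected test score difference by a pooled standard deviation. The ability values and item parameters are hypothetical.

```python
import math
from statistics import mean, variance

def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct response (the expected item score)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def dif_effect_sizes(focal_thetas, focal_items, reference_items):
    """Return (STDS, UTDS, ETSSD) under the assumed definitions."""
    n = len(focal_thetas)
    stds, utds = 0.0, 0.0
    ets_focal = [0.0] * n   # expected test scores under focal parameters
    ets_ref = [0.0] * n     # expected test scores under reference parameters
    for focal, ref in zip(focal_items, reference_items):
        diffs = []
        for j, theta in enumerate(focal_thetas):
            es_f = p_correct_3pl(theta, *focal)
            es_r = p_correct_3pl(theta, *ref)
            ets_focal[j] += es_f
            ets_ref[j] += es_r
            diffs.append(es_f - es_r)
        stds += sum(diffs) / n                   # signed: cancellation allowed
        utds += sum(abs(d) for d in diffs) / n   # unsigned: no cancellation
    pooled_sd = math.sqrt((variance(ets_focal) + variance(ets_ref)) / 2.0)
    etssd = (mean(ets_focal) - mean(ets_ref)) / pooled_sd
    return stds, utds, etssd

thetas = [-1.5, -0.5, 0.0, 0.5, 1.5]  # hypothetical focal-group abilities

# Uniform DIF: the item is harder for the focal group at every ability level.
uniform = dif_effect_sizes(thetas, [(1.2, 0.3, 0.2)], [(1.2, 0.0, 0.2)])

# Nonuniform DIF: the curves cross, so signed differences cancel over subjects.
crossing = dif_effect_sizes(thetas, [(2.0, 0.0, 0.2)], [(0.5, 0.0, 0.2)])

print(uniform)
print(crossing)
```

In the crossing example the signed measure is essentially zero while the unsigned measure is not, which is the cancellation described above.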
Lab versus online DIF
As the first step in DIF analysis, the unidimensional 3PL model (selected in Step 1) was fitted to each sample group separately. Five items in the online group (Items 5, 12, 18, 19, and 39) had negative item discrimination parameters. The brand of these five items was Audi. All five items also had negative item difficulty parameter estimates, and four of these estimates were smaller than −2.7. Therefore, these items seem to be too easy for all subjects in the online group. DIF analyses were conducted based on the remaining 42 items using LRT, Lord's chi-square test, and Raju's test. All items except Item 40 were detected as DIF items by at least one of the three DIF detection methods (see Table 5). When the reference group was the online sample and the focal group was the lab sample, the STDS was 1.258 based on 42 items, indicating that a subject from the lab sample would be expected to have a total score that was 1.258 higher than that of a subject from the online sample when they were assumed to have the same latent ability. The UTDS was 3.446, which can be interpreted like STDS, but when cancellation of DIF across items is not allowed. The ETSSD at the test level was 0.450, indicating a medium effect size. 
Table 5
 
DIF test results. Notes: Blank cells indicate that DIF results are not statistically significant at the 5% level. NA = items were excluded for DIF analysis. Significance in bold.
Age DIF
When the unidimensional 3PL model was fit to the data from each age group, four items yielded negative item discrimination estimates (Items 5, 12, 18, and 39) in both groups, and one item had a discrimination estimate close to 0 (0.030 for Item 19) in the older group. The same items were problematic in the online group. Forty-two items (excluding these five items) were used for IRT DIF detection. The younger group was used as the reference group, and the older group was used as the focal group. All items except Items 38 and 48 were detected as DIF items by at least one of the three DIF detection methods (see Table 5). The STDS was −2.578 based on 42 items, which indicates that a subject from the older group would be expected to have a total score that was 2.578 lower than that of a subject from the younger group when they were assumed to have the same latent ability. When cancellation is not allowed, as in the UTDS, the DIF effect size was 3.021. The ETSSD at the test level was −0.476, which indicates a medium effect size. 
Gender DIF
The unidimensional 3PL model, fitted separately to each gender group, fit the data well using all 47 items. Gender DIF analysis was carried out to examine whether item characteristics were identical between males and females. Twenty items were detected as DIF items by at least one of the three DIF detection methods (see Table 5). When the reference group was the male group and the focal group was the female group, the STDS was −0.466 and the UTDS was 0.885. The ETSSD at the test level was −0.072, which indicates a negligible effect size. 
Step 3: Multigroup analysis
In Step 2, we found a substantial number of items showing DIF at a medium effect size for the comparisons between the lab and online samples and between the younger and older subjects. This result suggests that test scores from the two different test settings and from the two age groups are not directly comparable on the same scale. Of course, there was considerable overlap between these two subgroups: Online subjects were on average older than those tested in the lab (see Table 2; Figure 2). From this point on, we chose to focus on the age groups rather than the test settings because it is plausible that the DIF between test settings is due to the age difference. First, it is reasonable to expect that subjects of different age groups may have differential experience with the car models on the test. Second, there is evidence from a test with faces, in which the stimuli are not as clearly dated, that neither age nor test setting produces significant DIF (Cho et al., 2015). Therefore, we conducted a multigroup analysis according to age groups based on 42 items (Items 5, 12, 18, 19, and 39 were deleted because of 0 or negative item discrimination estimates) using empirical Bayesian analysis via WinBUGS 1.4.3. 
In a multigroup item response model, the item parameters of the groups are estimated simultaneously, and the item parameter estimates of the groups are placed on a common (latent) metric using non-DIF items (called anchor items in the DIF literature). In the current analysis, we constrained the item parameters of the anchor items to be equal between the younger and older groups. Based on DIF results in Step 2, Items 38 and 48, for which all three IRT DIF detection methods yielded no DIF between the age groups, were used as the anchor items. Item parameters for all DIF items (from Step 2 analyses) remained unconstrained. Consequently, the item parameter estimates of the anchor items were identical and those of 40 DIF items were different between the two groups. In the presence of the anchor items, the multigroup item response model can be identified by setting a standard normal distribution of the latent variable for the reference group. The mean and variance of the latent variable for the focal group can then be estimated. 
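The constraint structure can be sketched as a parameter table in which anchor items share one set of labels across groups and DIF items receive group-specific labels. The sketch assumes three parameters per item (discrimination, difficulty, guessing) in each group, and the resulting parameter count is an implication of that assumption rather than a figure reported in the analysis. The label format is hypothetical.

```python
# Items removed before the multigroup analysis and the two anchor items.
EXCLUDED = {5, 12, 18, 19, 39, 43}
ANCHORS = {38, 48}
GROUPS = ("younger", "older")  # younger = reference, older = focal

items = [i for i in range(1, 49) if i not in EXCLUDED]

def parameter_labels(item, group):
    """Labels for one item's a, b, c; anchors share labels across groups."""
    suffix = "shared" if item in ANCHORS else group
    return [f"{name}_{item}_{suffix}" for name in ("a", "b", "c")]

free_parameters = set()
for group in GROUPS:
    for item in items:
        free_parameters.update(parameter_labels(item, group))

# Reference group latent distribution is fixed to N(0, 1);
# the focal group's mean and variance are estimated.
free_parameters.update({"mean_older", "variance_older"})

n_dif_items = len([i for i in items if i not in ANCHORS])
print(len(items), n_dif_items, len(free_parameters))
```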
Item characteristics
Table 6 presents the item parameter estimates and their standard errors of the multigroup item response model. The expected score standardized difference (ESSD) for each item is also included. A positive ESSD means that the item favored the old group, whereas a negative ESSD indicates that the item favored the young group. All items except Items 23 and 28 favored the young group. According to the ESSD, Items 23 and 28 led to better performance for the old group. These items shared the same target and item attributes: The target was an Acura RL 2009, and the foils were other luxury brands. 
Table 6
 
Multiple group analysis by age groups with 42 items. Items 38 and 48 were used as the anchor items. Estimates are posterior medians, and standard errors are posterior standard deviations.
For the young group, item discrimination parameter estimates (ranging from 0.445 to 4.15) were acceptable for most items. According to the guidelines for interpreting item discrimination estimates by Baker (2001), three items (Items 11, 23, and 28) had low item discrimination parameter estimates (<0.64). Item difficulty parameter estimates ranged from −2.366 to 2.334, which is ideal for measuring a wide range of ability. Two items (Items 21 and 48) had item guessing parameter estimates higher than expected by chance (0.33). 
For the old group, item discrimination parameter estimates (ranging from 0.062 to 4.554) were acceptable for all items except Item 11, which had a low estimate (0.062; Baker, 2001). Item difficulty parameter estimates ranged from −1.379 to 2.452, which indicates that, compared with the young group, a majority of items (69%) covered middle and high ability levels. Three items (Items 35, 47, and 48) had item guessing parameter estimates higher than expected by chance. 
We found five patterns in item characteristic curves by age groups (see Figure 4 for an example of each pattern). The first pattern is that the younger group had a uniformly higher probability of producing a correct response than the older group across all ability levels; this was found for 22 items (Items 1, 2, 4, 6, 7, 8, 9, 10, 11, 14, 16, 17, 24, 25, 27, 34, 36, 37, 40, 42, 44, and 45). All identical items but one (Item 3) showed this pattern. The second pattern is that the probability of producing a correct response was higher for the older group than for the younger group at relatively low ability levels, whereas the probability was higher for the younger group than for the older group at relatively high ability levels. This pattern was found for five items (Items 3, 20, 41, 46, and 47). The third pattern is the reverse of the second one and was found for six items (Items 23, 28, 31, 32, 33, and 35). The fourth pattern is that the probability of producing a correct response was higher for the younger group than for the older group in the middle range of ability. This pattern was found for five items (Items 13, 15, 22, 26, and 29). Fifth, the reverse of the fourth pattern was found for Item 21. 
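Patterns of this kind can be read off by comparing the two group-specific item characteristic curves along an ability grid. The following Python sketch shows one way to do so; the grid, the function names, and the parameter values are hypothetical and are not taken from Table 6.

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def sign_pattern(young, old, lo=-3.0, hi=3.0, steps=61):
    """Signs of P_young - P_old along an ability grid, collapsed into runs.

    "+" means the younger group has the higher probability, "-" the older group.
    """
    runs = []
    for k in range(steps):
        theta = lo + (hi - lo) * k / (steps - 1)
        d = p3pl(theta, *young) - p3pl(theta, *old)
        s = "+" if d > 0 else "-" if d < 0 else "0"
        if not runs or runs[-1] != s:
            runs.append(s)
    return "".join(runs)

# Same slope and guessing, easier for the young group: uniform advantage.
uniform = sign_pattern(young=(1.2, -0.5, 0.2), old=(1.2, 0.5, 0.2))
# Steeper curve for the young group: the curves cross once.
crossing = sign_pattern(young=(2.0, 0.0, 0.2), old=(0.7, 0.1, 0.2))
```

Here `uniform` corresponds to the first pattern (a single run in favor of the younger group), and `crossing` corresponds to the second pattern (older group higher at low ability, younger group higher at high ability). A pattern confined to the middle of the ability range would appear as three runs.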
Figure 4
 
Item characteristic curves by age groups.
Variability in item parameter estimates was investigated by item attributes (reported in Table 1) within the younger and older samples. A one-sided two-sample t test (with Welch's formula for unequal variances; alpha = 0.05) was used to test whether the mean of item parameter estimates for items without a specific item attribute (no; blank cells in Table 1) was higher than the mean for items with that attribute (yes; “Y” in Table 1). The difference was defined as the mean for “no” minus the mean for “yes” (H0: difference = 0; H1: difference > 0). Table 7 presents the results of the two-sample t test for each item parameter by age groups. The following was observed. First, the means of the item discrimination and difficulty estimates for transfer items were significantly higher than those of identical items in the younger group. Second, for the older group, the means of all item parameter estimates were higher for transfer items than for identical items. Third, the mean of item guessing parameter estimates was higher for the “different” items than for the “same” items in both groups. Fourth, the mean of item difficulty estimates for “nonfront” items was significantly higher than that for “front” items in both groups. The front view usually includes distinctive features such as an emblem and the shape of headlights. 
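For readers unfamiliar with Welch's correction, the sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom for the “no” minus “yes” comparison using only the Python standard library. The function name and the input values are hypothetical; they are not estimates from Table 7.

```python
import math
from statistics import mean, variance

def welch_t(no_values, yes_values):
    """Welch's t statistic and degrees of freedom for mean(no) - mean(yes)."""
    n1, n2 = len(no_values), len(yes_values)
    v1 = variance(no_values) / n1    # squared standard error, "no" items
    v2 = variance(yes_values) / n2   # squared standard error, "yes" items
    t = (mean(no_values) - mean(yes_values)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical difficulty estimates for items without and with an attribute.
no_items = [1.9, 2.4, 1.6, 2.8, 2.1]
yes_items = [1.0, 1.3, 0.8, 1.5]
t_stat, df = welch_t(no_items, yes_items)
```

The one-sided p value is the upper-tail probability of a t distribution with `df` degrees of freedom at `t_stat`. The standard library has no t distribution, so in practice a statistics package would be used, for example `scipy.stats.ttest_ind` with `equal_var=False` and `alternative="greater"`.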
Table 7
 
Mean and standard deviation of item parameter estimates by item attributes and p values from t tests. Notes: Values in parentheses indicate the standard deviation of item parameter estimates. Significance in bold.
IRT scale score and reliability
The mean and variance estimates of the ability distribution for the older group were −0.635, 95% credibility interval [−0.962, −0.431], and 0.673, 95% credibility interval [0.465, 0.921], respectively. Relative to the younger group, whose ability distribution was fixed to a standard normal distribution for identification, this result suggests that the mean ability level for the older group was lower and that there was less variability in ability scores in the older group. The internal consistency reliability based on IRT scale scores (ranging from 0 to 1) was 0.822 for the younger group and 0.795 for the older group. The test information curves for the younger and older groups reveal that IRT scale scores were most accurate at ability levels between 1 and 2 (see Figure 5). According to the test characteristic curves by age group (shown in Figure 4), the IRT true scores of the younger group were uniformly higher than those of the older group across all ability levels at the test level. 
Figure 5
 
Test information curve for the young group (top) and the old group (bottom).
Subject fit and item fit
Based on the results of the multigroup item response model, item fit and subject fit statistics were calculated to test how well the item response model for each age group represents each test item and each subject, respectively. Standardized residuals (Spiegelhalter, Thomas, Best, & Lunn, 2003) were used as a discrepancy measure. Item fit was calculated as the mean of the standardized residuals over subjects (Sinharay, 2005), and subject fit was calculated as the mean of the standardized residuals over items (Glas & Meijer, 2003). An item or subject was flagged as misfitting at the 5% level when its posterior predictive p value was extreme, that is, smaller than 0.025 or larger than 0.975. WinBUGS 1.4.3 was used to calculate the posterior predictive p values. The multigroup item response model yielded limited evidence of subject misfit: One subject out of 697 in the younger group had a posterior predictive p value larger than 0.975, and six subjects out of 728 in the older group had posterior predictive p values that were smaller than 0.025 or larger than 0.975. There was one item with misfit (Item 28) for the younger group and two items with misfit (Items 20 and 45) for the older group. These subject and item fit results indicate that the (unidimensional 3PL) multigroup item response model fits the data well. 
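The flagging rule can be sketched as follows, assuming that for each posterior draw a discrepancy value is available for both the observed data and a data set replicated from the model. The function names and the input values are hypothetical, and this is not the WinBUGS code used in the analysis.

```python
def posterior_predictive_p(observed_draws, replicated_draws):
    """Share of posterior draws in which the replicated discrepancy
    is at least as large as the observed discrepancy."""
    hits = sum(rep >= obs for obs, rep in zip(observed_draws, replicated_draws))
    return hits / len(observed_draws)

def is_misfit(ppp, lower=0.025, upper=0.975):
    """Two-sided 5% rule: flag extreme posterior predictive p values."""
    return ppp < lower or ppp > upper

# Hypothetical discrepancy values for one item across four posterior draws.
ppp = posterior_predictive_p([1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 0.0, 2.0])
flagged = is_misfit(ppp)
```

A value near 0.5 indicates that the observed discrepancy is typical of what the model produces, whereas values near 0 or 1 indicate that the observed data are unusual under the model.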
Summary and discussion
Summary
In this article, we investigated whether the VETcar items functioned in the same way for groups of subjects defined by three grouping variables (i.e., lab vs. online, age, and gender) using IRT DIF detection methods. We first examined the dimensionality of the VETcar and found that one dominant dimension was sufficient to explain item variances of the test. Further, the unidimensional 3PL model was found to be the best-fitting model. Subsequently, DIF analyses were carried out using a unidimensional 3PL model (after confirming that the model fit well within each group). DIF analysis results suggested that DIF was not of concern with regard to gender. However, a large number of DIF items and medium DIF effect sizes at the test level were found for both the sample and the age group comparisons. These results suggest that the VETcar does not measure the same dimension in the different sample or age groups. Because we cannot dissociate age and sample effects and because it is plausible that age differences cause the sample DIF, we reported the item characteristics and IRT scale scores by age group using a multigroup item response model. Results from the multigroup analysis suggested that all 42 items except two favored the younger group (according to ESSD), and IRT scale scores for the younger group were uniformly higher than those for the older group across all levels of visual recognition ability at the test level. 
Discussion
Online testing is an important new trend in psychology (Paolacci & Chandler, 2014; Peer et al., 2014), in part providing an answer to ongoing concerns of limited power in psychological studies (Schimmack, 2012; Stanley & Spence, 2014). This is perhaps most relevant for the study of individual differences, for which large samples are especially important and for which online samples can be more representative of a wider population than samples of undergraduate students. However, with increased variability in the subjects we study comes a greater likelihood that some measurements may be biased in favor of some groups of people, making it more difficult to compare individuals on a common scale. The use of full-scale aggregate scores ignores this problem, as such scales cannot address factors other than the trait that is measured on the test. Here we used IRT to assess the VETcar, and in particular we used DIF analysis to ask whether the test functions the same way in lab versus online settings and for different groups based on age and gender. We found evidence of considerable DIF as a function of testing method (lab vs. online) and of age group. Because the online sample was older on average than the lab sample, we assumed that the DIF is driven mainly by age differences, as it is plausible that people's familiarity with different models of cars could be highly dependent on age and influence what an item measures. Prior work showed that individuals with expertise for modern cars processed these cars more holistically (like faces) but that this did not apply to less familiar antique car models (Bukach, Phillips, & Gauthier, 2010). Our results suggest that more subtle differences in familiarity with car models could influence the measurement of recognition ability for this category. 
Interpretations of DIF
When a test is developed for normative comparisons with other subjects, finding DIF indicates that items are unfair to one of the groups. In the present case, when the effects of DIF are accumulated across items, the meaning and implications of the test scores from the VETcar can be distorted. That is, the same total scores or IRT scale scores from the two different groups may represent different car recognition ability across groups in the presence of DIF. Importantly, if we are correct about the mechanisms driving these effects, many tests of individual differences in object recognition in familiar domains may lead to similar situations with DIF, not only across different age groups but possibly also for any groups that differ in their experience with some of the items. It would seem reasonable to expect that other tests with cars (e.g., the Cambridge Car Memory Test; Dennett et al., 2012) may show DIF similar to that observed in the VETcar, as might other subtests of the VET battery. A test of face recognition, the Cambridge Face Memory Test (CFMT), did not exhibit a significant degree of DIF in a recent study (see Cho et al., 2015). This may be because it uses a single category of faces (adult Caucasian male faces) with which all subjects had a considerable amount of experience. 
Because DIF arises from the interaction of item and group properties, it may be difficult to revise the VETcar or similar tests to completely prevent such measurement problems. For instance, we could select a collection of items that do not present DIF for age groups, but the same items could show DIF for other groups (e.g., subjects from different parts of the country, or those who live in rural vs. urban settings). Unfortunately, studies often do not collect samples sufficiently large for DIF analyses, so authors may need to be especially careful with samples that may include subjects varying greatly in their experience. 
While we may be interested in skills that generalize from domain-general traits or in experience with instances of a category that would generalize to new instances (e.g., an experienced birder should learn a new bird species faster than a novice), a considerable amount of the variance in performance could be determined by experience at a subcategory level, which would be much harder to equate across individuals. The extent and generality of this problem will become clearer as more tests of object recognition ability are examined for DIF. A possible solution may come from measuring domain-general ability in a format similar to the VET but with entirely novel objects. As it stands, our results reveal that in the case of the VETcar, the total scores or IRT scale scores cannot be used to compare subjects between age groups, although they can be used to compare subjects within an age group (including men and women). 
IRT scale score versus total score for within–age group comparisons with the VETcar
IRT scale scores can differentiate individuals who have the same total scores because they are based on item response pattern scoring. Because the unidimensional 3PL model includes an item discrimination parameter, individuals who respond correctly to items with higher item discriminations receive higher IRT scale scores than those who respond correctly to the same number of items with lower item discriminations. The difference between total scores (unweighted scores) and IRT scale scores (weighted scores) becomes more noticeable when item discriminations vary more across items; when all items are equally discriminating, they contribute to the IRT scale score with equal weights. There was a considerable amount of variation in item discrimination on the VETcar: The standard deviation of item discrimination estimates was 0.983 for the young group and 1.087 for the old group (see Table 6). As shown in Figure 6, there was also considerable variability in IRT scale scores among individuals with the same total unweighted score. This pattern was more evident at the lower end of the IRT ability scale in both age groups. These results imply that within an age group, IRT scale scores are more diagnostic than the total scores. However, the correlations between IRT scale scores and total scores were 0.940 for the young group and 0.943 for the old group, which indicates that the relative ordering of subjects would be largely preserved if total scores were used. When one is concerned only with relative ordering, total unweighted scores may be sufficient (for within–age group comparisons). 
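The effect of pattern scoring can be illustrated with a small Python sketch in which two response patterns have the same total score but different scale scores. The scoring rule used here (the posterior mean of ability under a standard normal prior, evaluated on a grid), the function names, and the item parameters are assumptions made for illustration; they are not the scoring procedure or the estimates from the present study.

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def eap_score(responses, items, lo=-4.0, hi=4.0, steps=81):
    """Posterior mean of theta under a standard normal prior, on a fixed grid."""
    num = den = 0.0
    for k in range(steps):
        theta = lo + (hi - lo) * k / (steps - 1)
        weight = math.exp(-0.5 * theta ** 2)       # unnormalized N(0, 1) prior
        for u, (a, b, c) in zip(responses, items):
            p = p3pl(theta, a, b, c)
            weight *= p if u == 1 else 1.0 - p     # likelihood of each response
        num += theta * weight
        den += weight
    return num / den

# Two hypothetical items that differ only in discrimination.
items = [(2.5, 0.0, 0.2), (0.5, 0.0, 0.2)]
high_a_correct = eap_score([1, 0], items)   # total score 1
low_a_correct = eap_score([0, 1], items)    # total score 1
```

Both patterns yield a total score of 1, but answering the highly discriminating item correctly is stronger evidence of high ability, so `high_a_correct` exceeds `low_a_correct`.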
Figure 6
 
IRT scale score versus total scores for the young group (top) and the old group (bottom).
Methodological limitations
This study has a number of limitations. First, we used marginal maximum likelihood estimation for detecting DIF items because IRT DIF detection methods have been developed mainly with marginal maximum likelihood estimation. However, we used empirical Bayesian analysis to obtain item parameter estimates, subject scores, and standard errors (technically, posterior standard deviations) because standard errors could not be obtained with marginal maximum likelihood estimation for items with high item discrimination estimates. Because item parameter estimates were similar between the two estimation methods, we expect that DIF results would not differ with Bayesian analysis. 
Second, the DIF detection methods we used assumed that a portion of items can be used as anchor items (i.e., non–DIF items). The iterative purification approach was used to search for such items in the current study, as recommended by Lord (1980). That is, item purification iteratively removes the items flagged as DIF items to obtain the anchor items that make the scale comparable between the two groups. In the presence of a large proportion of DIF items, as in the VETcar, the power of the iterative purification approach is limited. To alleviate this problem, items were used as anchor items in the multigroup analysis only when all three DIF detection methods identified them as non–DIF items. 
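The logic of iterative purification can be sketched generically as follows. The code is a simplified illustration rather than the procedure implemented in the software used here: `flags_dif` is a hypothetical callback standing in for any DIF test, and the toy detector and its values are invented for the example.

```python
def purify(items, flags_dif, max_iter=20):
    """Iteratively drop flagged items from the anchor set until it stops changing.

    flags_dif(item, anchors) is a hypothetical callback that returns True when
    `item` shows DIF after the groups are matched on the current anchor set.
    """
    anchors = set(items)
    for _ in range(max_iter):
        flagged = {i for i in items if flags_dif(i, anchors - {i})}
        new_anchors = set(items) - flagged
        if new_anchors == anchors:
            break
        anchors = new_anchors
    return anchors

# Toy detector: each item has a known between-group shift, and an item is
# flagged when its shift departs from the anchors' average by more than 0.5.
shifts = {1: 0.0, 2: 0.05, 3: 0.0, 4: 0.9, 5: -0.05}

def toy_detector(item, anchors):
    baseline = sum(shifts[a] for a in anchors) / len(anchors)
    return abs(shifts[item] - baseline) > 0.5

anchor_set = purify(list(shifts), toy_detector)
```

In this toy case only Item 4 is removed. When most items carry DIF, the baseline itself is contaminated in the first rounds, which is one way to see why the approach loses power in a test such as the VETcar.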
Third, this study focused on detecting DIF items, not on explaining DIF. We can only speculate that a plausible explanation of DIF based on age groups is differential exposure to cars between the two age groups. When a more reliable and valid measure of car experience is available, it may be possible to test this account of the age DIF results. In such a case, both the (latent) ability levels and the car experience level can be used as the matching criteria in the detection of DIF. If car experience can account for age DIF, we would expect a small number of DIF items and small DIF effect sizes once car experience is included as a matching criterion. 
Last, age is a continuous variable, which we arbitrarily dichotomized into two age groups using a median split to increase the statistical power of DIF analysis using a (discrete) multigroup IRT model. Methods for detecting DIF as a function of a continuous variable such as age, and the corresponding DIF effect size measures, have not yet been fully developed in the IRT DIF literature. 
Acknowledgments
This work was supported by NSF (SBE-0542013) and by the Vanderbilt Vision Research Center (P30-EY008126). 
Commercial relationships: none. 
Corresponding author: Isabel Gauthier. 
Email: isabel.gauthier@vanderbilt.edu. 
Address: Department of Psychology, Vanderbilt University, Nashville, TN, USA. 
References
Angoff W. H. (1993). Perspectives on differential item functioning methodology. In Holland P. W., Wainer H. (Eds.), Differential item functioning (pp. 3–23). Hillsdale, NJ: Erlbaum.
Baker F. (2001). The basics of item response theory. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation.
Bentler P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107 (2), 238–246.
Bock R. D., Zimowski M. F. (1997). The multiple group IRT. In van der Linden W. J., Hambleton R. K. (Eds.), Handbook of modern item response theory (pp. 433–448). New York, NY: Springer-Verlag.
Bukach C. M., Phillips S. W., Gauthier I. (2010). Limits of generalization between categories and implications for theories of category specificity. Attention, Perception, & Psychophysics, 72 (7), 1865–1874.
Cho S.-J., Wilmer J., Herzmann G., McGugin R., Fiset D., Van Gulick A. B., Gauthier I. (2015). Item response theory analyses of the Cambridge Face Memory Test (CFMT). Psychological Assessment, 27, 552–566.
Cohen J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Crump M. J., McDonnell J. V., Gureckis T. M. (2013). Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research. PloS One, 8 (3), e57410.
Dennett H. W., McKone E., Tavashmi R., Hall A., Pidcock M., Edwards M., Duchaine B. (2012). The Cambridge Car Memory Test: A task matched in format to the Cambridge Face Memory Test, with norms, reliability, sex differences, dissociations from face memory, and expertise effects. Behavior Research Methods, 44 (2), 587–605, doi:10.3758/s13428-011-0160-2.
Duchaine B., Nakayama K. (2006). The Cambridge Face Memory Test: Results for neurologically intact individuals and an investigation of its validity using inverted face stimuli and prosopagnosic participants. Neuropsychologia, 44 (4), 576–585.
Gauthier I., McGugin R. W., Richler J. J., Herzmann G., Speegle M., Van Gulick A. E. (2014). Experience moderates overlap between object and face recognition, suggesting a common ability. Journal of Vision, 14 (8): 7, 1–12, doi:10.1167/14.8.7.
Germine L. T., Duchaine B., Nakayama K. (2011). Where cognitive development and aging meet: Face learning ability peaks after age 30. Cognition, 118 (2), 201–210.
Germine L., Nakayama K., Duchaine B. C., Chabris C. F., Chatterjee G., Wilmer J. B. (2012). Is the Web as good as the lab? Comparable performance from Web and lab in cognitive/perceptual experiments. Psychonomic Bulletin and Review, 19 (5), 847–857.
Glas C. A. W., Meijer R. R. (2003). A Bayesian approach to person fit analysis in item response theory models. Applied Psychological Measurement, 27, 217–233.
Green B. F., Bock R. D., Humphreys L. G., Linn R. L., Reckase M. D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21, 347–360.
Hu L. T., Bentler P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6 (1), 1–55.
Jennrich R. I., Bentler P. M. (2011). Exploratory bi-factor analysis. Psychometrika, 76 (4), 537–549.
Jennrich R. I., Bentler P. M. (2012). Exploratory bi-factor analysis: The oblique case. Psychometrika, 77, 442–454.
Kim S. H., Cohen A. S., Alagoz C., Kim S. (2007). DIF detection and effect size measures for polytomously scored items. Journal of Educational Measurement, 44 (2), 93–116.
Lord F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Magis D., Beland S., Raiche G. (2013). difR: Collection of methods to detect dichotomous differential item functioning (DIF) in psychometrics (R Package version 4.5) [Computer software]. Retrieved from https://cran.r-project.org/web/packages/difR
McGugin R. W., Newton A. T., Gore J. C., Gauthier I. (2014). Robust expertise effects in right FFA. Neuropsychologia, 63, 135–144.
McGugin R. W., Richler J. J., Herzmann G., Speegle M., Gauthier I. (2012). The Vanderbilt Expertise Test reveals domain-general and domain-specific sex effects in object recognition. Vision Research, 69, 10–22.
McGugin R. W., Van Gulick A. E., Tamber-Rosenau B. J., Ross D. A., Gauthier I. (2014). Expertise effects in face-selective areas are robust to clutter and diverted attention, but not to competition. Cerebral Cortex, 25, 2610–2622.
Meade A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95 (4), 728–743.
Muthén L. K., Muthén B. O. (1998). Mplus [Computer program]. Los Angeles, CA: Muthén & Muthén.
Paolacci G., Chandler J. (2014). Inside the Turk: Understanding Mechanical Turk as a participant pool. Current Directions in Psychological Science, 23 (3), 184–188.
Partchev I. (2014). Irtoys: Simple interface to the estimation and plotting of IRT models (R Package Version 0.1.7). Retrieved from http://CRAN.R-project.org/package=irtoys
Peer E., Vosgerau J., Acquisti A. (2014). Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behavior Research Methods, 46 (4), 1023–1031.
Raju N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207.
Reise S., Moore T., Haviland M. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92, 544–559.
Schimmack U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17 (4), 551–566.
Sinharay S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42 (4), 375–394.
Spiegelhalter D. J., Thomas A., Best N. G., Gilks W. R. (1996). BUGS: Bayesian inference using Gibbs sampling, version .05. Cambridge, UK: MRC Biostatistics Unit.
Spiegelhalter D. J., Thomas A., Best N. G., Lunn D. (2003). WinBUGS user manual. Cambridge, UK: MRC Biostatistics Unit, Institute of Public Health.
Stanley D. J., Spence J. R. (2014). Expectations for replications: Are yours realistic? Perspectives on Psychological Science, 9 (3), 305–318.
Steiger J. H., Lind J. C. (1980, May). Statistically based tests for the number of common factors. Paper presented at the Annual Spring Meeting of the Psychometric Society, Iowa City, IA.
Thissen D. (2001). IRTLRDIF (Version 2.0b) [Computer software]. Chapel Hill, NC: L. L. Thurstone Psychometric Laboratory.
Thissen D., Steinberg L., Wainer H. (1988). Use of item response theory in the study of group difference in trace lines. In Wainer H., Braun H. (Eds.), Test validity (pp. 147–149). Hillsdale, NJ: Erlbaum.
Tucker L., Lewis C. (1973). The reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1–10.
Wilhelm O., Herzmann G., Kunina O., Danthiir V., Schacht A., Sommer W. (2010). Individual differences in perceiving and recognizing faces—One element of social cognition. Journal of Personality and Social Psychology, 99 (3), 530–548.
Wilmer J. B., Germine L., Chabris C. F., Chatterjee G., Gerbasi M., Nakayama K. (2012). Capturing specific abilities as a window into human individuality: The example of face recognition. Cognitive Neuropsychology, 29 (5–6), 360–392.
Wilmer J. B., Germine L., Chabris C. F., Chatterjee G., Williams M., Loken E., Duchaine B. (2010). Human face recognition ability is specific and highly heritable. Proceedings of the National Academy of Sciences, USA, 107 (11), 5238–5241.
Yu C. Y. (2002). Evaluating cutoff criteria of model fit indices for latent variable models with binary and continuous outcomes (Unpublished doctoral dissertation). University of California, Los Angeles.
Zimowski M. F., Muraki E., Mislevy R. J., Bock R. D. (1996). BILOG-MG: Multiple-group IRT analysis and test maintenance for binary items [computer program]. Chicago, IL: Scientific Software International.
Footnotes
1  BI-GEOMIN rotated standardized loading is available from the first author upon request.
2  The following priors on item parameters were imposed in WinBUGS: a lognormal distribution with a mean of 0 and a variance of 0.25 for item discriminations (as default prior in BILOG-MG IRT software; Zimowski et al., 1996), a standard normal distribution for item difficulty, and a beta distribution with a mean of 0.33 and a variance of 0.06 for item guessing.
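A beta prior stated in terms of its mean and variance can be converted to the usual shape parameters by the method of moments. The Python sketch below applies that conversion to the stated mean of 0.33 and variance of 0.06; the function name is hypothetical, and the resulting shape values are derived here for illustration rather than reported by the authors.

```python
def beta_shape_from_moments(mean, variance):
    """Shape parameters (alpha, beta) of a beta distribution with the given
    mean and variance, obtained by the method of moments."""
    common = mean * (1.0 - mean) / variance - 1.0
    if common <= 0:
        raise ValueError("variance is too large for a beta distribution with this mean")
    return mean * common, (1.0 - mean) * common

alpha, beta = beta_shape_from_moments(0.33, 0.06)

# Check: recover the moments from the shape parameters.
check_mean = alpha / (alpha + beta)
check_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
```

Both shape parameters are small (roughly 0.9 and 1.8), which corresponds to a fairly diffuse prior on the guessing parameter.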
Figure 1
 
The six car target images.
Figure 2
 
Age distribution of each sample group.
Figure 3
 
Item characteristic curve (line) and observed subject's responses (dots) of Item 43.
Table 1
 
Items and item attributes. Notes: “Y” indicates yes; blank cell indicates no.
Table 2
 
Descriptive statistics of samples for two studies (N = 1425).
Table 3
 
Results of fit indices. Notes: Values in parentheses are lower and upper limits of 95% confidence interval for root-mean-square error of approximation. Selected EFA solution and results of fit indices for the solution were in bold.
Table 4
 
Item parameter estimates (standard errors) of a three-parameter unidimensional item response model with 47 items. Results were based on 47 items after Item 43 was excluded because of negative item discrimination estimates in initial analyses.
Table 5
 
DIF test results. Notes: Blank cells indicate that DIF results are not statistically significant at the 5% level. NA = items were excluded for DIF analysis. Significance in bold.